Document (#34103)

Author
Fersini, E.
Messina, E.
Archetti, F.
Title
Enhancing web page classification through image-block importance analysis
Source
Information processing and management. 44(2008) no.4, p.1431-1447
Year
2008
Abstract
We present a term weighting approach for improving web page classification, based on the assumption that the images of a web page are the elements that mainly attract the user's attention. This assumption implies that the text contained in the visual block in which an image is located, called an image-block, should contain significant information about the page contents. In this paper we propose a new metric, called the Inverse Term Importance Metric, aimed at assigning higher weights to important terms contained in important image-blocks, which are identified by performing a visual layout analysis. We propose different methods for estimating the importance of visual image-blocks, so that each term weight can be smoothed according to the importance of the blocks in which the term is located. The traditional TFxIDF model is modified accordingly and used in the classification task. The effectiveness of this new metric and of the proposed block evaluation methods has been validated using different classification algorithms.
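
The abstract does not give the formula of the Inverse Term Importance Metric, so the following is only a minimal sketch of the general idea of scaling term frequencies by block importance before applying IDF. The function name `block_weighted_tfidf`, the importance scores in [0, 1], and the plain logarithmic IDF are all assumptions made for illustration; none of them is taken from the paper.

```python
import math
from collections import Counter


def block_weighted_tfidf(blocks, doc_freq, n_docs):
    """Hypothetical block-aware TFxIDF (NOT the paper's metric).

    blocks:   list of (tokens, importance) pairs for one page, where
              importance is an assumed score in [0, 1] for that block.
    doc_freq: dict mapping term -> number of documents containing it.
    n_docs:   total number of documents in the collection.
    """
    weighted_tf = Counter()
    for tokens, importance in blocks:
        # Each occurrence counts in proportion to its block's importance.
        for term, count in Counter(tokens).items():
            weighted_tf[term] += count * importance
    # Plain log IDF, chosen here only for simplicity.
    return {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in weighted_tf.items()
    }


# Toy page: one important image-block and one unimportant navigation block.
page_blocks = [
    (["camera", "lens", "camera"], 1.0),  # assumed high importance
    (["login", "camera"], 0.2),           # assumed low importance
]
toy_doc_freq = {"camera": 10, "lens": 5, "login": 50}
weights = block_weighted_tfidf(page_blocks, toy_doc_freq, n_docs=100)
```

In this toy setup, "login" occurs only in the low-importance block and therefore receives a much smaller weight than "lens", which occurs once in the high-importance block.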

Similar documents (content)

  1. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.44
    0.44125175 = sum of:
      0.44125175 = product of:
        1.3789117 = sum of:
          0.08846319 = weight(abstract_txt:weight in 4119) [ClassicSimilarity], result of:
            0.08846319 = score(doc=4119,freq=2.0), product of:
              0.1083047 = queryWeight, product of:
                1.0106579 = boost
                7.3928223 = idf(docFreq=73, maxDocs=44218)
                0.014495489 = queryNorm
              0.81679916 = fieldWeight in 4119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.3928223 = idf(docFreq=73, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.07039151 = weight(abstract_txt:inverse in 4119) [ClassicSimilarity], result of:
            0.07039151 = score(doc=4119,freq=1.0), product of:
              0.117173426 = queryWeight, product of:
                1.0512236 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.014495489 = queryNorm
              0.6007464 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.022078574 = weight(abstract_txt:methods in 4119) [ClassicSimilarity], result of:
            0.022078574 = score(doc=4119,freq=1.0), product of:
              0.06815111 = queryWeight, product of:
                1.1337883 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.014495489 = queryNorm
              0.32396498 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.042260204 = weight(abstract_txt:propose in 4119) [ClassicSimilarity], result of:
            0.042260204 = score(doc=4119,freq=1.0), product of:
              0.10506224 = queryWeight, product of:
                1.4077283 = boost
                5.1486683 = idf(docFreq=697, maxDocs=44218)
                0.014495489 = queryNorm
              0.4022397 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1486683 = idf(docFreq=697, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.09692647 = weight(abstract_txt:term in 4119) [ClassicSimilarity], result of:
            0.09692647 = score(doc=4119,freq=2.0), product of:
              0.18272047 = queryWeight, product of:
                2.6254523 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.014495489 = queryNorm
              0.53046316 = fieldWeight in 4119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.30078894 = weight(abstract_txt:blocks in 4119) [ClassicSimilarity], result of:
            0.30078894 = score(doc=4119,freq=2.0), product of:
              0.35319993 = queryWeight, product of:
                3.1611962 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.014495489 = queryNorm
              0.851611 = fieldWeight in 4119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.13352363 = weight(abstract_txt:page in 4119) [ClassicSimilarity], result of:
            0.13352363 = score(doc=4119,freq=1.0), product of:
              0.28501934 = queryWeight, product of:
                3.2790473 = boost
                5.9964437 = idf(docFreq=298, maxDocs=44218)
                0.014495489 = queryNorm
              0.46847218 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9964437 = idf(docFreq=298, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.62447923 = weight(abstract_txt:block in 4119) [ClassicSimilarity], result of:
            0.62447923 = score(doc=4119,freq=4.0), product of:
              0.5021432 = queryWeight, product of:
                4.352355 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.014495489 = queryNorm
              1.2436278 = fieldWeight in 4119, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
        0.32 = coord(8/25)
    
  2. Tsai, R.T.-H.; Chiu, B.; Wu, C.-E.: Visual webpage block importance prediction using conditional random fields (2011) 0.21
    0.20624606 = sum of:
      0.20624606 = product of:
        1.0312303 = sum of:
          0.018439109 = weight(abstract_txt:which in 4924) [ClassicSimilarity], result of:
            0.018439109 = score(doc=4924,freq=4.0), product of:
              0.050575007 = queryWeight, product of:
                1.1962147 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.014495489 = queryNorm
              0.36458933 = fieldWeight in 4924, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
          0.09284915 = weight(abstract_txt:importance in 4924) [ClassicSimilarity], result of:
            0.09284915 = score(doc=4924,freq=2.0), product of:
              0.20603968 = queryWeight, product of:
                2.7879562 = boost
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.014495489 = queryNorm
              0.45063722 = fieldWeight in 4924, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
          0.3804713 = weight(abstract_txt:blocks in 4924) [ClassicSimilarity], result of:
            0.3804713 = score(doc=4924,freq=5.0), product of:
              0.35319993 = queryWeight, product of:
                3.1611962 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.014495489 = queryNorm
              1.0772122 = fieldWeight in 4924, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
          0.1068189 = weight(abstract_txt:page in 4924) [ClassicSimilarity], result of:
            0.1068189 = score(doc=4924,freq=1.0), product of:
              0.28501934 = queryWeight, product of:
                3.2790473 = boost
                5.9964437 = idf(docFreq=298, maxDocs=44218)
                0.014495489 = queryNorm
              0.37477773 = fieldWeight in 4924, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9964437 = idf(docFreq=298, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
          0.43265188 = weight(abstract_txt:block in 4924) [ClassicSimilarity], result of:
            0.43265188 = score(doc=4924,freq=3.0), product of:
              0.5021432 = queryWeight, product of:
                4.352355 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.014495489 = queryNorm
              0.86161053 = fieldWeight in 4924, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
        0.2 = coord(5/25)
    
  3. Wan, X.; Yang, J.; Xiao, J.: Towards a unified approach to document similarity search using manifold-ranking of blocks (2008) 0.14
    0.14379367 = sum of:
      0.14379367 = product of:
        0.8987105 = sum of:
          0.05177917 = weight(abstract_txt:validated in 2081) [ClassicSimilarity], result of:
            0.05177917 = score(doc=2081,freq=1.0), product of:
              0.1107964 = queryWeight, product of:
                1.0222176 = boost
                7.4773793 = idf(docFreq=67, maxDocs=44218)
                0.014495489 = queryNorm
              0.4673362 = fieldWeight in 2081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4773793 = idf(docFreq=67, maxDocs=44218)
                0.0625 = fieldNorm(doc=2081)
          0.033808164 = weight(abstract_txt:propose in 2081) [ClassicSimilarity], result of:
            0.033808164 = score(doc=2081,freq=1.0), product of:
              0.10506224 = queryWeight, product of:
                1.4077283 = boost
                5.1486683 = idf(docFreq=697, maxDocs=44218)
                0.014495489 = queryNorm
              0.32179177 = fieldWeight in 2081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1486683 = idf(docFreq=697, maxDocs=44218)
                0.0625 = fieldNorm(doc=2081)
          0.3804713 = weight(abstract_txt:blocks in 2081) [ClassicSimilarity], result of:
            0.3804713 = score(doc=2081,freq=5.0), product of:
              0.35319993 = queryWeight, product of:
                3.1611962 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.014495489 = queryNorm
              1.0772122 = fieldWeight in 2081, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.0625 = fieldNorm(doc=2081)
          0.43265188 = weight(abstract_txt:block in 2081) [ClassicSimilarity], result of:
            0.43265188 = score(doc=2081,freq=3.0), product of:
              0.5021432 = queryWeight, product of:
                4.352355 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.014495489 = queryNorm
              0.86161053 = fieldWeight in 2081, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.0625 = fieldNorm(doc=2081)
        0.16 = coord(4/25)
    
  4. Baeza-Yates, R.; Navarro, G.: Block addressing indices for approximate text retrieval (2000) 0.13
    0.13298263 = sum of:
      0.13298263 = product of:
        0.5540943 = sum of:
          0.01854633 = weight(abstract_txt:important in 4295) [ClassicSimilarity], result of:
            0.01854633 = score(doc=4295,freq=1.0), product of:
              0.07040512 = queryWeight, product of:
                1.1523851 = boost
                4.2147684 = idf(docFreq=1775, maxDocs=44218)
                0.014495489 = queryNorm
              0.26342303 = fieldWeight in 4295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2147684 = idf(docFreq=1775, maxDocs=44218)
                0.0625 = fieldNorm(doc=4295)
          0.013038418 = weight(abstract_txt:which in 4295) [ClassicSimilarity], result of:
            0.013038418 = score(doc=4295,freq=2.0), product of:
              0.050575007 = queryWeight, product of:
                1.1962147 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.014495489 = queryNorm
              0.2578036 = fieldWeight in 4295, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.0625 = fieldNorm(doc=4295)
          0.036911696 = weight(abstract_txt:called in 4295) [ClassicSimilarity], result of:
            0.036911696 = score(doc=4295,freq=1.0), product of:
              0.11139737 = queryWeight, product of:
                1.4495493 = boost
                5.3016257 = idf(docFreq=598, maxDocs=44218)
                0.014495489 = queryNorm
              0.3313516 = fieldWeight in 4295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3016257 = idf(docFreq=598, maxDocs=44218)
                0.0625 = fieldNorm(doc=4295)
          0.06565426 = weight(abstract_txt:importance in 4295) [ClassicSimilarity], result of:
            0.06565426 = score(doc=4295,freq=1.0), product of:
              0.20603968 = queryWeight, product of:
                2.7879562 = boost
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.014495489 = queryNorm
              0.31864864 = fieldWeight in 4295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.0625 = fieldNorm(doc=4295)
          0.17015193 = weight(abstract_txt:blocks in 4295) [ClassicSimilarity], result of:
            0.17015193 = score(doc=4295,freq=1.0), product of:
              0.35319993 = queryWeight, product of:
                3.1611962 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.014495489 = queryNorm
              0.48174396 = fieldWeight in 4295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.0625 = fieldNorm(doc=4295)
          0.24979168 = weight(abstract_txt:block in 4295) [ClassicSimilarity], result of:
            0.24979168 = score(doc=4295,freq=1.0), product of:
              0.5021432 = queryWeight, product of:
                4.352355 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.014495489 = queryNorm
              0.4974511 = fieldWeight in 4295, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.0625 = fieldNorm(doc=4295)
        0.24 = coord(6/25)
    
  5. Riehm, S.M.: A first look at FirstSearch (1992) 0.12
    0.11981561 = sum of:
      0.11981561 = product of:
        0.7488476 = sum of:
          0.022078574 = weight(abstract_txt:methods in 2345) [ClassicSimilarity], result of:
            0.022078574 = score(doc=2345,freq=1.0), product of:
              0.06815111 = queryWeight, product of:
                1.1337883 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.014495489 = queryNorm
              0.32396498 = fieldWeight in 2345, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.078125 = fieldNorm(doc=2345)
          0.046139624 = weight(abstract_txt:called in 2345) [ClassicSimilarity], result of:
            0.046139624 = score(doc=2345,freq=1.0), product of:
              0.11139737 = queryWeight, product of:
                1.4495493 = boost
                5.3016257 = idf(docFreq=598, maxDocs=44218)
                0.014495489 = queryNorm
              0.41418952 = fieldWeight in 2345, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3016257 = idf(docFreq=598, maxDocs=44218)
                0.078125 = fieldNorm(doc=2345)
          0.36838976 = weight(abstract_txt:blocks in 2345) [ClassicSimilarity], result of:
            0.36838976 = score(doc=2345,freq=3.0), product of:
              0.35319993 = queryWeight, product of:
                3.1611962 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.014495489 = queryNorm
              1.0430063 = fieldWeight in 2345, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.078125 = fieldNorm(doc=2345)
          0.31223962 = weight(abstract_txt:block in 2345) [ClassicSimilarity], result of:
            0.31223962 = score(doc=2345,freq=1.0), product of:
              0.5021432 = queryWeight, product of:
                4.352355 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.014495489 = queryNorm
              0.6218139 = fieldWeight in 2345, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.078125 = fieldNorm(doc=2345)
        0.16 = coord(4/25)
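
The score breakdowns above follow Lucene's ClassicSimilarity, where tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1)), each term score is queryWeight x fieldWeight, and the sum over matching terms is multiplied by coord (matching query terms / total query terms). As a rough check, the sketch below recomputes the score of the first hit (doc 4119) from the numbers shown in its tree. The helper names `idf` and `term_score` are illustrative, not Lucene API calls, and small deviations from the listed values are to be expected because Lucene computes in 32-bit floats.

```python
import math

# Constants copied from the explain tree for doc 4119.
MAX_DOCS = 44218
QUERY_NORM = 0.014495489
FIELD_NORM = 0.078125      # fieldNorm(doc=4119)
QUERY_TERMS = 25           # denominator of coord(8/25)


def idf(doc_freq, max_docs=MAX_DOCS):
    """ClassicSimilarity inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost,
               field_norm=FIELD_NORM, query_norm=QUERY_NORM):
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = boost * idf(doc_freq) * query_norm
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


# (term, termFreq, docFreq, boost) as listed for doc 4119.
MATCHES_4119 = [
    ("weight",  2.0,   73, 1.0106579),
    ("inverse", 1.0,   54, 1.0512236),
    ("methods", 1.0, 1900, 1.1337883),
    ("propose", 1.0,  697, 1.4077283),
    ("term",    2.0,  987, 2.6254523),
    ("blocks",  2.0,   53, 3.1611962),
    ("page",    1.0,  298, 3.2790473),
    ("block",   4.0,   41, 4.352355),
]

term_sum = sum(term_score(f, df, b) for _, f, df, b in MATCHES_4119)
coord = len(MATCHES_4119) / QUERY_TERMS
total = term_sum * coord

print(f"sum={term_sum:.5f} coord={coord:.2f} score={total:.5f}")
```

The recomputed total should agree with the 0.44125175 shown for the first hit to within rounding. The coord factor also explains the ordering of hits 3 and 4: hit 4 matches more query terms (6/25 against 4/25), but its per-term scores sum to a lower value, so its final product is smaller.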