Document (#33017)

Author
Trotman, A.
Title
Choosing document structure weights
Source
Information processing and management. 41(2005) no.2, pp.243-264
Year
2005
Abstract
Existing ranking schemes assume all term occurrences in a given document are of equal influence. Intuitively, terms occurring in some places should have a greater influence than those elsewhere. An occurrence in an abstract may be more important than an occurrence in the body text. Although this observation is not new, there remains the issue of finding good weights for each structure. Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting. Weights are then selected for the TREC WSJ collection using a genetic algorithm. The learned weights are then tested on an evaluation set of queries. Structure weighted vector space inner product and structure weighted probabilistic retrieval show an improvement of about 5% in mean average precision over their unstructured counterparts. Structure weighted BM25 shows almost no improvement. Analysis suggests BM25 cannot be improved using structure weighting.
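
The idea of structure weighting described in the abstract can be illustrated with a minimal sketch. The record does not give the paper's formulas, so everything below is an assumption for illustration only: the structure names ("abstract", "body"), the weight values, the toy idf table, and the helper names weighted_tf and inner_product_score are all hypothetical. Each term occurrence contributes the weight of the structure it appears in, rather than a flat count of 1, and the result feeds a plain tf-idf inner product.

```python
from collections import Counter

def weighted_tf(doc, weights):
    """Term frequencies where each occurrence counts as its structure's weight.

    doc maps a structure name to its token list; weights maps a structure
    name to a multiplier (1.0 when a structure has no explicit weight).
    """
    tf = Counter()
    for structure, tokens in doc.items():
        w = weights.get(structure, 1.0)
        for tok in tokens:
            tf[tok] += w
    return tf

def inner_product_score(query_terms, doc, weights, idf):
    """Inner product of structure-weighted tf with idf over the query terms."""
    tf = weighted_tf(doc, weights)
    return sum(tf[t] * idf.get(t, 0.0) for t in query_terms)

# Toy data (hypothetical): one document with two structures.
doc = {
    "abstract": ["ranking", "weights"],
    "body": ["ranking", "ranking", "text"],
}
idf = {"ranking": 2.0, "weights": 3.0}

# Unstructured baseline: every structure has weight 1.
flat = inner_product_score(["ranking"], doc, {"abstract": 1.0, "body": 1.0}, idf)
# Occurrences in the abstract count double.
boosted = inner_product_score(["ranking"], doc, {"abstract": 2.0, "body": 1.0}, idf)
```

With all weights set to 1.0 the score reduces to the unstructured ranking, so the weights are the only free parameters that a search procedure, such as the genetic algorithm mentioned in the abstract, would need to tune.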

Similar documents (content)

  1. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.31
    0.3067483 = sum of:
      0.3067483 = product of:
        1.0955297 = sum of:
          0.067241974 = weight(abstract_txt:occurrences in 4119) [ClassicSimilarity], result of:
            0.067241974 = score(doc=4119,freq=1.0), product of:
              0.11599348 = queryWeight, product of:
                1.037272 = boost
                7.4202213 = idf(docFreq=71, maxDocs=44218)
                0.015070375 = queryNorm
              0.57970476 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4202213 = idf(docFreq=71, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.02603656 = weight(abstract_txt:document in 4119) [ClassicSimilarity], result of:
            0.02603656 = score(doc=4119,freq=1.0), product of:
              0.07763764 = queryWeight, product of:
                1.2001265 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.015070375 = queryNorm
              0.33536002 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.03239368 = weight(abstract_txt:then in 4119) [ClassicSimilarity], result of:
            0.03239368 = score(doc=4119,freq=1.0), product of:
              0.08980974 = queryWeight, product of:
                1.290781 = boost
                4.616861 = idf(docFreq=1187, maxDocs=44218)
                0.015070375 = queryNorm
              0.36069226 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.616861 = idf(docFreq=1187, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.14150879 = weight(abstract_txt:ranking in 4119) [ClassicSimilarity], result of:
            0.14150879 = score(doc=4119,freq=6.0), product of:
              0.13207535 = queryWeight, product of:
                1.565315 = boost
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.015070375 = queryNorm
              1.0714246 = fieldWeight in 4119, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.11186269 = weight(abstract_txt:weighting in 4119) [ClassicSimilarity], result of:
            0.11186269 = score(doc=4119,freq=1.0), product of:
              0.2051824 = queryWeight, product of:
                1.9510164 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.015070375 = queryNorm
              0.5451866 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.6211297 = weight(abstract_txt:bm25 in 4119) [ClassicSimilarity], result of:
            0.6211297 = score(doc=4119,freq=3.0), product of:
              0.51065564 = queryWeight, product of:
                3.769646 = boost
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.015070375 = queryNorm
              1.2163377 = fieldWeight in 4119, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.0953563 = weight(abstract_txt:structure in 4119) [ClassicSimilarity], result of:
            0.0953563 = score(doc=4119,freq=1.0), product of:
              0.2800736 = queryWeight, product of:
                4.26443 = boost
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.015070375 = queryNorm
              0.3404687 = fieldWeight in 4119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
        0.28 = coord(7/25)
    
  2. Dang, E.K.F.; Luk, R.W.P.; Allan, J.; Ho, K.S.; Chung, K.F.L.; Lee, D.L.: A new context-dependent term weight computed by boost and discount using relevance information (2010) 0.30
    0.30387267 = sum of:
      0.30387267 = product of:
        1.2661362 = sum of:
          0.036077317 = weight(abstract_txt:document in 4120) [ClassicSimilarity], result of:
            0.036077317 = score(doc=4120,freq=3.0), product of:
              0.07763764 = queryWeight, product of:
                1.2001265 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.015070375 = queryNorm
              0.46468848 = fieldWeight in 4120, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=4120)
          0.086698644 = weight(abstract_txt:improvement in 4120) [ClassicSimilarity], result of:
            0.086698644 = score(doc=4120,freq=2.0), product of:
              0.15944888 = queryWeight, product of:
                1.7198937 = boost
                6.1517096 = idf(docFreq=255, maxDocs=44218)
                0.015070375 = queryNorm
              0.54373944 = fieldWeight in 4120, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.1517096 = idf(docFreq=255, maxDocs=44218)
                0.0625 = fieldNorm(doc=4120)
          0.08222653 = weight(abstract_txt:occurrence in 4120) [ClassicSimilarity], result of:
            0.08222653 = score(doc=4120,freq=1.0), product of:
              0.19392386 = queryWeight, product of:
                1.8967342 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.015070375 = queryNorm
              0.4240145 = fieldWeight in 4120, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.0625 = fieldNorm(doc=4120)
          0.08949015 = weight(abstract_txt:weighting in 4120) [ClassicSimilarity], result of:
            0.08949015 = score(doc=4120,freq=1.0), product of:
              0.2051824 = queryWeight, product of:
                1.9510164 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.015070375 = queryNorm
              0.43614927 = fieldWeight in 4120, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.0625 = fieldNorm(doc=4120)
          0.40572023 = weight(abstract_txt:bm25 in 4120) [ClassicSimilarity], result of:
            0.40572023 = score(doc=4120,freq=2.0), product of:
              0.51065564 = queryWeight, product of:
                3.769646 = boost
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.015070375 = queryNorm
              0.79450846 = fieldWeight in 4120, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.0625 = fieldNorm(doc=4120)
          0.5659233 = weight(abstract_txt:weights in 4120) [ClassicSimilarity], result of:
            0.5659233 = score(doc=4120,freq=8.0), product of:
              0.4420197 = queryWeight, product of:
                4.0497355 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.015070375 = queryNorm
              1.2803123 = fieldWeight in 4120, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0625 = fieldNorm(doc=4120)
        0.24 = coord(6/25)
    
  3. Wolfram, D.; Zhang, J.: The influence of indexing practices and weighting algorithms on document spaces (2008) 0.27
    0.2670686 = sum of:
      0.2670686 = product of:
        0.83458936 = sum of:
          0.08926037 = weight(abstract_txt:occurring in 1963) [ClassicSimilarity], result of:
            0.08926037 = score(doc=1963,freq=3.0), product of:
              0.11272286 = queryWeight, product of:
                1.0225437 = boost
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.015070375 = queryNorm
              0.7918569 = fieldWeight in 1963, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.015561853 = weight(abstract_txt:than in 1963) [ClassicSimilarity], result of:
            0.015561853 = score(doc=1963,freq=1.0), product of:
              0.06392403 = queryWeight, product of:
                1.0889876 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.015070375 = queryNorm
              0.24344292 = fieldWeight in 1963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.051021032 = weight(abstract_txt:document in 1963) [ClassicSimilarity], result of:
            0.051021032 = score(doc=1963,freq=6.0), product of:
              0.07763764 = queryWeight, product of:
                1.2001265 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.015070375 = queryNorm
              0.65716875 = fieldWeight in 1963, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.085186124 = weight(abstract_txt:influence in 1963) [ClassicSimilarity], result of:
            0.085186124 = score(doc=1963,freq=5.0), product of:
              0.11611255 = queryWeight, product of:
                1.4676769 = boost
                5.2495813 = idf(docFreq=630, maxDocs=44218)
                0.015070375 = queryNorm
              0.7336513 = fieldWeight in 1963, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.2495813 = idf(docFreq=630, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.08258391 = weight(abstract_txt:space in 1963) [ClassicSimilarity], result of:
            0.08258391 = score(doc=1963,freq=4.0), product of:
              0.1225181 = queryWeight, product of:
                1.5076169 = boost
                5.3924384 = idf(docFreq=546, maxDocs=44218)
                0.015070375 = queryNorm
              0.6740548 = fieldWeight in 1963, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.3924384 = idf(docFreq=546, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.07301298 = weight(abstract_txt:vector in 1963) [ClassicSimilarity], result of:
            0.07301298 = score(doc=1963,freq=1.0), product of:
              0.17915268 = queryWeight, product of:
                1.8230666 = boost
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.015070375 = queryNorm
              0.4075461 = fieldWeight in 1963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.15500149 = weight(abstract_txt:weighting in 1963) [ClassicSimilarity], result of:
            0.15500149 = score(doc=1963,freq=3.0), product of:
              0.2051824 = queryWeight, product of:
                1.9510164 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.015070375 = queryNorm
              0.75543267 = fieldWeight in 1963, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.28296164 = weight(abstract_txt:weights in 1963) [ClassicSimilarity], result of:
            0.28296164 = score(doc=1963,freq=2.0), product of:
              0.4420197 = queryWeight, product of:
                4.0497355 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.015070375 = queryNorm
              0.64015615 = fieldWeight in 1963, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
        0.32 = coord(8/25)
    
  4. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.21
    0.21071154 = sum of:
      0.21071154 = product of:
        0.87796474 = sum of:
          0.03156765 = weight(abstract_txt:document in 1283) [ClassicSimilarity], result of:
            0.03156765 = score(doc=1283,freq=3.0), product of:
              0.07763764 = queryWeight, product of:
                1.2001265 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.015070375 = queryNorm
              0.4066024 = fieldWeight in 1283, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.05364205 = weight(abstract_txt:improvement in 1283) [ClassicSimilarity], result of:
            0.05364205 = score(doc=1283,freq=1.0), product of:
              0.15944888 = queryWeight, product of:
                1.7198937 = boost
                6.1517096 = idf(docFreq=255, maxDocs=44218)
                0.015070375 = queryNorm
              0.3364216 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1517096 = idf(docFreq=255, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.071948215 = weight(abstract_txt:occurrence in 1283) [ClassicSimilarity], result of:
            0.071948215 = score(doc=1283,freq=1.0), product of:
              0.19392386 = queryWeight, product of:
                1.8967342 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.015070375 = queryNorm
              0.3710127 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.07830388 = weight(abstract_txt:weighting in 1283) [ClassicSimilarity], result of:
            0.07830388 = score(doc=1283,freq=1.0), product of:
              0.2051824 = queryWeight, product of:
                1.9510164 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.015070375 = queryNorm
              0.3816306 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.25102657 = weight(abstract_txt:bm25 in 1283) [ClassicSimilarity], result of:
            0.25102657 = score(doc=1283,freq=1.0), product of:
              0.51065564 = queryWeight, product of:
                3.769646 = boost
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.015070375 = queryNorm
              0.49157703 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.3914764 = weight(abstract_txt:weights in 1283) [ClassicSimilarity], result of:
            0.3914764 = score(doc=1283,freq=5.0), product of:
              0.4420197 = queryWeight, product of:
                4.0497355 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.015070375 = queryNorm
              0.88565373 = fieldWeight in 1283, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
        0.24 = coord(6/25)
    
  5. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.17
    0.17476775 = sum of:
      0.17476775 = product of:
        0.48546594 = sum of:
          0.03788261 = weight(abstract_txt:equal in 5188) [ClassicSimilarity], result of:
            0.03788261 = score(doc=5188,freq=1.0), product of:
              0.11122414 = queryWeight, product of:
                1.0157232 = boost
                7.2660704 = idf(docFreq=83, maxDocs=44218)
                0.015070375 = queryNorm
              0.34059703 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2660704 = idf(docFreq=83, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.054660592 = weight(abstract_txt:occurring in 5188) [ClassicSimilarity], result of:
            0.054660592 = score(doc=5188,freq=2.0), product of:
              0.11272286 = queryWeight, product of:
                1.0225437 = boost
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.015070375 = queryNorm
              0.48491132 = fieldWeight in 5188, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.057056703 = weight(abstract_txt:occurrences in 5188) [ClassicSimilarity], result of:
            0.057056703 = score(doc=5188,freq=2.0), product of:
              0.11599348 = queryWeight, product of:
                1.037272 = boost
                7.4202213 = idf(docFreq=71, maxDocs=44218)
                0.015070375 = queryNorm
              0.4918958 = fieldWeight in 5188, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.4202213 = idf(docFreq=71, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.011671389 = weight(abstract_txt:than in 5188) [ClassicSimilarity], result of:
            0.011671389 = score(doc=5188,freq=1.0), product of:
              0.06392403 = queryWeight, product of:
                1.0889876 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.015070375 = queryNorm
              0.1825822 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.038265776 = weight(abstract_txt:document in 5188) [ClassicSimilarity], result of:
            0.038265776 = score(doc=5188,freq=6.0), product of:
              0.07763764 = queryWeight, product of:
                1.2001265 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.015070375 = queryNorm
              0.49287656 = fieldWeight in 5188, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.019436205 = weight(abstract_txt:then in 5188) [ClassicSimilarity], result of:
            0.019436205 = score(doc=5188,freq=1.0), product of:
              0.08980974 = queryWeight, product of:
                1.290781 = boost
                4.616861 = idf(docFreq=1187, maxDocs=44218)
                0.015070375 = queryNorm
              0.21641535 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.616861 = idf(docFreq=1187, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.054759737 = weight(abstract_txt:vector in 5188) [ClassicSimilarity], result of:
            0.054759737 = score(doc=5188,freq=1.0), product of:
              0.17915268 = queryWeight, product of:
                1.8230666 = boost
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.015070375 = queryNorm
              0.3056596 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.061669894 = weight(abstract_txt:occurrence in 5188) [ClassicSimilarity], result of:
            0.061669894 = score(doc=5188,freq=1.0), product of:
              0.19392386 = queryWeight, product of:
                1.8967342 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.015070375 = queryNorm
              0.31801087 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.15006305 = weight(abstract_txt:weights in 5188) [ClassicSimilarity], result of:
            0.15006305 = score(doc=5188,freq=1.0), product of:
              0.4420197 = queryWeight, product of:
                4.0497355 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.015070375 = queryNorm
              0.33949405 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
        0.36 = coord(9/25)
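
The arithmetic in the score trees above can be reproduced with a short sketch. It follows the structure shown in the listings: each term score is queryWeight (boost * idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), with tf equal to the square root of the term frequency, and the sum over matching terms is multiplied by coord. The idf expression 1 + ln(maxDocs / (docFreq + 1)) is an assumption, chosen because it reproduces the idf values displayed; the function names are illustrative and are not Lucene API calls. The input values are copied from entry 1 (doc 4119).

```python
import math

MAX_DOCS = 44218          # maxDocs as shown in the listings
QUERY_NORM = 0.015070375  # queryNorm as shown in the listings

def idf(doc_freq, max_docs=MAX_DOCS):
    """Assumed idf form; it matches the displayed idf values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, boost, doc_freq, field_norm, query_norm=QUERY_NORM):
    """Score of one term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

def document_score(terms, field_norm, matched, total_clauses):
    """Sum of term scores multiplied by coord(matched/total_clauses)."""
    raw = sum(term_score(f, b, df, field_norm) for f, b, df in terms)
    return raw * matched / total_clauses

# (freq, boost, docFreq) for the seven matching terms of doc 4119.
DOC_4119 = [
    (1.0, 1.037272, 71),     # occurrences
    (1.0, 1.2001265, 1642),  # document
    (1.0, 1.290781, 1187),   # then
    (6.0, 1.565315, 444),    # ranking
    (1.0, 1.9510164, 111),   # weighting
    (3.0, 3.769646, 14),     # bm25
    (1.0, 4.26443, 1538),    # structure
]

score = document_score(DOC_4119, field_norm=0.078125, matched=7, total_clauses=25)
print(round(score, 4))
```

Because the listings use single-precision floats, a recomputation in double precision should agree with the displayed 0.3067483 only to within a small rounding difference.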