Document (#29422)

Author
Robertson, S.
Title
Understanding inverse document frequency : on theoretical arguments for IDF
Source
Journal of documentation. 60(2004) no.5, pp.503-520
Year
2004
Abstract
The term-weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
Footnote
See also: http://www.emeraldinsight.com/10.1108/00220410410560582.
Theme
Retrieval algorithms
Object
IDF
TF*IDF

Similar documents (author)

  1. Robertson, M.A.: Windows 3.0 for the online searcher (1991) 4.57
    4.5717874 = sum of:
      4.5717874 = weight(author_txt:robertson in 592) [ClassicSimilarity], result of:
        4.5717874 = score(doc=592,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.13670799 = queryNorm
          4.571788 = fieldWeight in 592, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.625 = fieldNorm(doc=592)
    
  2. Robertson, S.E.: Some recent theories and models in information retrieval (1980) 4.57
    4.5717874 = sum of:
      4.5717874 = weight(author_txt:robertson in 1326) [ClassicSimilarity], result of:
        4.5717874 = score(doc=1326,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.13670799 = queryNorm
          4.571788 = fieldWeight in 1326, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.625 = fieldNorm(doc=1326)
    
  3. Robertson, S.E.: Theories and models in information retrieval (1977) 4.57
    4.5717874 = sum of:
      4.5717874 = weight(author_txt:robertson in 1844) [ClassicSimilarity], result of:
        4.5717874 = score(doc=1844,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.13670799 = queryNorm
          4.571788 = fieldWeight in 1844, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.625 = fieldNorm(doc=1844)
    
  4. Robertson, S.E.: On term selection for query expansion (1990) 4.57
    4.5717874 = sum of:
      4.5717874 = weight(author_txt:robertson in 2650) [ClassicSimilarity], result of:
        4.5717874 = score(doc=2650,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.13670799 = queryNorm
          4.571788 = fieldWeight in 2650, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.625 = fieldNorm(doc=2650)
    
  5. Robertson, S.E.: On relevance weight estimation and query expansion (1986) 4.57
    4.5717874 = sum of:
      4.5717874 = weight(author_txt:robertson in 3875) [ClassicSimilarity], result of:
        4.5717874 = score(doc=3875,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.13670799 = queryNorm
          4.571788 = fieldWeight in 3875, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            7.314861 = idf(docFreq=79, maxDocs=44218)
            0.625 = fieldNorm(doc=3875)
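
The five score trees above all follow the same arithmetic: score = queryWeight * fieldWeight, with queryWeight = idf * queryNorm and fieldWeight = tf * idf * fieldNorm. The following is a minimal Python sketch of that arithmetic, not the search engine's own code. It assumes idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq), which are consistent with the figures shown here; the function names are illustrative, and queryNorm and fieldNorm are simply copied from the listing rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Assumed idf form: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """Assumed tf form: square root of the raw term frequency."""
    return math.sqrt(freq)


def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float,
               boost: float = 1.0) -> float:
    """Score of one term in one document: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm               # query side
    field_weight = classic_tf(freq) * idf * field_norm    # document side
    return query_weight * field_weight


# Values copied from the author_txt:robertson trees above.
idf_robertson = classic_idf(doc_freq=79, max_docs=44218)
score_robertson = term_score(freq=1.0, doc_freq=79, max_docs=44218,
                             query_norm=0.13670799, field_norm=0.625)
print(idf_robertson, score_robertson)
```

Up to floating-point rounding (the listing appears to use single precision), the two printed values should land close to the 7.314861 and 4.5717874 shown in the trees. Because idf enters both queryWeight and fieldWeight, each term's contribution grows with the square of its idf.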
    

Similar documents (content)

  1. Wong, S.K.M.; Yao, Y.Y.: An information-theoretic measure of term specificity (1992) 0.21
    0.2061863 = sum of:
      0.2061863 = product of:
        0.8591096 = sum of:
          0.113786064 = weight(abstract_txt:frequency in 4807) [ClassicSimilarity], result of:
            0.113786064 = score(doc=4807,freq=2.0), product of:
              0.1443008 = queryWeight, product of:
                1.0710992 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.022651922 = queryNorm
              0.7885338 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.1838029 = weight(abstract_txt:weighting in 4807) [ClassicSimilarity], result of:
            0.1838029 = score(doc=4807,freq=2.0), product of:
              0.19866025 = queryWeight, product of:
                1.2567556 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.022651922 = queryNorm
              0.92521226 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.023023099 = weight(abstract_txt:information in 4807) [ClassicSimilarity], result of:
            0.023023099 = score(doc=4807,freq=2.0), product of:
              0.071728595 = queryWeight, product of:
                1.307983 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.022651922 = queryNorm
              0.32097518 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.17389035 = weight(abstract_txt:inverse in 4807) [ClassicSimilarity], result of:
            0.17389035 = score(doc=4807,freq=1.0), product of:
              0.24121429 = queryWeight, product of:
                1.3848312 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.022651922 = queryNorm
              0.7208957 = fieldWeight in 4807, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.30752444 = weight(abstract_txt:justifications in 4807) [ClassicSimilarity], result of:
            0.30752444 = score(doc=4807,freq=1.0), product of:
              0.35275444 = queryWeight, product of:
                1.6746789 = boost
                9.298992 = idf(docFreq=10, maxDocs=44218)
                0.022651922 = queryNorm
              0.8717805 = fieldWeight in 4807, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.298992 = idf(docFreq=10, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.057082698 = weight(abstract_txt:some in 4807) [ClassicSimilarity], result of:
            0.057082698 = score(doc=4807,freq=1.0), product of:
              0.16555011 = queryWeight, product of:
                1.9871043 = boost
                3.6779325 = idf(docFreq=3037, maxDocs=44218)
                0.022651922 = queryNorm
              0.34480616 = fieldWeight in 4807, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6779325 = idf(docFreq=3037, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
        0.24 = coord(6/25)
    
  2. Robertson, S.E.; Sparck Jones, K.: Relevance weighting of search terms (1976) 0.18
    0.17805439 = sum of:
      0.17805439 = product of:
        0.7418933 = sum of:
          0.07784022 = weight(abstract_txt:shown in 71) [ClassicSimilarity], result of:
            0.07784022 = score(doc=71,freq=1.0), product of:
              0.12736718 = queryWeight, product of:
                1.0062921 = boost
                5.58764 = idf(docFreq=449, maxDocs=44218)
                0.022651922 = queryNorm
              0.6111481 = fieldWeight in 71, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.58764 = idf(docFreq=449, maxDocs=44218)
                0.109375 = fieldNorm(doc=71)
          0.13887155 = weight(abstract_txt:probabilistic in 71) [ClassicSimilarity], result of:
            0.13887155 = score(doc=71,freq=1.0), product of:
              0.18735433 = queryWeight, product of:
                1.2204702 = boost
                6.7769065 = idf(docFreq=136, maxDocs=44218)
                0.022651922 = queryNorm
              0.74122417 = fieldWeight in 71, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7769065 = idf(docFreq=136, maxDocs=44218)
                0.109375 = fieldNorm(doc=71)
          0.26263028 = weight(abstract_txt:weighting in 71) [ClassicSimilarity], result of:
            0.26263028 = score(doc=71,freq=3.0), product of:
              0.19866025 = queryWeight, product of:
                1.2567556 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.022651922 = queryNorm
              1.3220072 = fieldWeight in 71, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.109375 = fieldNorm(doc=71)
          0.026860282 = weight(abstract_txt:information in 71) [ClassicSimilarity], result of:
            0.026860282 = score(doc=71,freq=2.0), product of:
              0.071728595 = queryWeight, product of:
                1.307983 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.022651922 = queryNorm
              0.37447104 = fieldWeight in 71, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.109375 = fieldNorm(doc=71)
          0.08328749 = weight(abstract_txt:theory in 71) [ClassicSimilarity], result of:
            0.08328749 = score(doc=71,freq=1.0), product of:
              0.1678745 = queryWeight, product of:
                1.6338142 = boost
                4.5360413 = idf(docFreq=1287, maxDocs=44218)
                0.022651922 = queryNorm
              0.4961295 = fieldWeight in 71, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5360413 = idf(docFreq=1287, maxDocs=44218)
                0.109375 = fieldNorm(doc=71)
          0.1524035 = weight(abstract_txt:theoretical in 71) [ClassicSimilarity], result of:
            0.1524035 = score(doc=71,freq=1.0), product of:
              0.28749165 = queryWeight, product of:
                2.6185963 = boost
                4.846761 = idf(docFreq=943, maxDocs=44218)
                0.022651922 = queryNorm
              0.53011453 = fieldWeight in 71, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.846761 = idf(docFreq=943, maxDocs=44218)
                0.109375 = fieldNorm(doc=71)
        0.24 = coord(6/25)
    
  3. Cornelius, I.: Theorizing information for information science (2002) 0.13
    0.13319975 = sum of:
      0.13319975 = product of:
        0.47571337 = sum of:
          0.044152737 = weight(abstract_txt:attempts in 4244) [ClassicSimilarity], result of:
            0.044152737 = score(doc=4244,freq=2.0), product of:
              0.13761169 = queryWeight, product of:
                1.045979 = boost
                5.808009 = idf(docFreq=360, maxDocs=44218)
                0.022651922 = queryNorm
              0.3208502 = fieldWeight in 4244, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.808009 = idf(docFreq=360, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4244)
          0.07154046 = weight(abstract_txt:problematic in 4244) [ClassicSimilarity], result of:
            0.07154046 = score(doc=4244,freq=2.0), product of:
              0.18983868 = queryWeight, product of:
                1.2285354 = boost
                6.82169 = idf(docFreq=130, maxDocs=44218)
                0.022651922 = queryNorm
              0.3768487 = fieldWeight in 4244, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.82169 = idf(docFreq=130, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4244)
          0.015088457 = weight(abstract_txt:been in 4244) [ClassicSimilarity], result of:
            0.015088457 = score(doc=4244,freq=1.0), product of:
              0.10677431 = queryWeight, product of:
                1.3029978 = boost
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.022651922 = queryNorm
              0.14131168 = fieldWeight in 4244, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4244)
          0.039552778 = weight(abstract_txt:information in 4244) [ClassicSimilarity], result of:
            0.039552778 = score(doc=4244,freq=34.0), product of:
              0.071728595 = queryWeight, product of:
                1.307983 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.022651922 = queryNorm
              0.5514227 = fieldWeight in 4244, product of:
                5.8309517 = tf(freq=34.0), with freq of:
                  34.0 = termFreq=34.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4244)
          0.15693401 = weight(abstract_txt:shannon's in 4244) [ClassicSimilarity], result of:
            0.15693401 = score(doc=4244,freq=2.0), product of:
              0.3205002 = queryWeight, product of:
                1.5962814 = boost
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.022651922 = queryNorm
              0.4896534 = fieldWeight in 4244, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4244)
          0.107249044 = weight(abstract_txt:theory in 4244) [ClassicSimilarity], result of:
            0.107249044 = score(doc=4244,freq=13.0), product of:
              0.1678745 = queryWeight, product of:
                1.6338142 = boost
                4.5360413 = idf(docFreq=1287, maxDocs=44218)
                0.022651922 = queryNorm
              0.6388644 = fieldWeight in 4244, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                4.5360413 = idf(docFreq=1287, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4244)
          0.041195888 = weight(abstract_txt:some in 4244) [ClassicSimilarity], result of:
            0.041195888 = score(doc=4244,freq=3.0), product of:
              0.16555011 = queryWeight, product of:
                1.9871043 = boost
                3.6779325 = idf(docFreq=3037, maxDocs=44218)
                0.022651922 = queryNorm
              0.2488424 = fieldWeight in 4244, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.6779325 = idf(docFreq=3037, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4244)
        0.28 = coord(7/25)
    
  4. Liu, X.; Croft, W.B.: Statistical language modeling for information retrieval (2004) 0.12
    0.12372038 = sum of:
      0.12372038 = product of:
        0.4418585 = sum of:
          0.04693435 = weight(abstract_txt:frequency in 4277) [ClassicSimilarity], result of:
            0.04693435 = score(doc=4277,freq=1.0), product of:
              0.1443008 = queryWeight, product of:
                1.0710992 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.022651922 = queryNorm
              0.32525358 = fieldWeight in 4277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.069435775 = weight(abstract_txt:probabilistic in 4277) [ClassicSimilarity], result of:
            0.069435775 = score(doc=4277,freq=1.0), product of:
              0.18735433 = queryWeight, product of:
                1.2204702 = boost
                6.7769065 = idf(docFreq=136, maxDocs=44218)
                0.022651922 = queryNorm
              0.37061208 = fieldWeight in 4277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7769065 = idf(docFreq=136, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.02987362 = weight(abstract_txt:been in 4277) [ClassicSimilarity], result of:
            0.02987362 = score(doc=4277,freq=2.0), product of:
              0.10677431 = queryWeight, product of:
                1.3029978 = boost
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.022651922 = queryNorm
              0.27978286 = fieldWeight in 4277, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.021234918 = weight(abstract_txt:information in 4277) [ClassicSimilarity], result of:
            0.021234918 = score(doc=4277,freq=5.0), product of:
              0.071728595 = queryWeight, product of:
                1.307983 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.022651922 = queryNorm
              0.29604536 = fieldWeight in 4277, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.15535676 = weight(abstract_txt:shannon's in 4277) [ClassicSimilarity], result of:
            0.15535676 = score(doc=4277,freq=1.0), product of:
              0.3205002 = queryWeight, product of:
                1.5962814 = boost
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.022651922 = queryNorm
              0.48473218 = fieldWeight in 4277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.041643746 = weight(abstract_txt:theory in 4277) [ClassicSimilarity], result of:
            0.041643746 = score(doc=4277,freq=1.0), product of:
              0.1678745 = queryWeight, product of:
                1.6338142 = boost
                4.5360413 = idf(docFreq=1287, maxDocs=44218)
                0.022651922 = queryNorm
              0.24806476 = fieldWeight in 4277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5360413 = idf(docFreq=1287, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.07737931 = weight(abstract_txt:function in 4277) [ClassicSimilarity], result of:
            0.07737931 = score(doc=4277,freq=1.0), product of:
              0.25372782 = queryWeight, product of:
                2.008604 = boost
                5.5765896 = idf(docFreq=454, maxDocs=44218)
                0.022651922 = queryNorm
              0.30496973 = fieldWeight in 4277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5765896 = idf(docFreq=454, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
        0.28 = coord(7/25)
    
  5. Huang, X.; Robertson, S.E.: Application of probabilistic methods to Chinese text retrieval (1997) 0.12
    0.1182697 = sum of:
      0.1182697 = product of:
        0.59134847 = sum of:
          0.065476425 = weight(abstract_txt:good in 4706) [ClassicSimilarity], result of:
            0.065476425 = score(doc=4706,freq=1.0), product of:
              0.12577936 = queryWeight, product of:
                5.5527015 = idf(docFreq=465, maxDocs=44218)
                0.022651922 = queryNorm
              0.52056575 = fieldWeight in 4706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5527015 = idf(docFreq=465, maxDocs=44218)
                0.09375 = fieldNorm(doc=4706)
          0.2061708 = weight(abstract_txt:probabilistic in 4706) [ClassicSimilarity], result of:
            0.2061708 = score(doc=4706,freq=3.0), product of:
              0.18735433 = queryWeight, product of:
                1.2204702 = boost
                6.7769065 = idf(docFreq=136, maxDocs=44218)
                0.022651922 = queryNorm
              1.1004325 = fieldWeight in 4706, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.7769065 = idf(docFreq=136, maxDocs=44218)
                0.09375 = fieldNorm(doc=4706)
          0.12996829 = weight(abstract_txt:weighting in 4706) [ClassicSimilarity], result of:
            0.12996829 = score(doc=4706,freq=1.0), product of:
              0.19866025 = queryWeight, product of:
                1.2567556 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.022651922 = queryNorm
              0.6542239 = fieldWeight in 4706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.09375 = fieldNorm(doc=4706)
          0.057082698 = weight(abstract_txt:some in 4706) [ClassicSimilarity], result of:
            0.057082698 = score(doc=4706,freq=1.0), product of:
              0.16555011 = queryWeight, product of:
                1.9871043 = boost
                3.6779325 = idf(docFreq=3037, maxDocs=44218)
                0.022651922 = queryNorm
              0.34480616 = fieldWeight in 4706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6779325 = idf(docFreq=3037, maxDocs=44218)
                0.09375 = fieldNorm(doc=4706)
          0.13265024 = weight(abstract_txt:function in 4706) [ClassicSimilarity], result of:
            0.13265024 = score(doc=4706,freq=1.0), product of:
              0.25372782 = queryWeight, product of:
                2.008604 = boost
                5.5765896 = idf(docFreq=454, maxDocs=44218)
                0.022651922 = queryNorm
              0.5228053 = fieldWeight in 4706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5765896 = idf(docFreq=454, maxDocs=44218)
                0.09375 = fieldNorm(doc=4706)
        0.2 = coord(5/25)
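
The content-based trees add two factors to the per-term arithmetic: a per-term boost inside queryWeight, and a coord factor (matched query terms divided by total query terms) applied to the sum. The sketch below recomputes the last entry (doc 4706) under the same assumed forms, idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq). The names are illustrative, queryNorm and fieldNorm are copied from the listing, and the boost of 1.0 for "good" is an assumption, since that tree shows no boost line.

```python
import math

MAX_DOCS = 44218          # maxDocs as shown in every idf line
QUERY_NORM = 0.022651922  # queryNorm copied from the content-based trees


def classic_idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Assumed idf form: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def document_score(terms, field_norm: float, query_terms_total: int) -> float:
    """Sum the per-term scores, then scale by coord = matched / total."""
    total = 0.0
    for boost, doc_freq, freq in terms:
        idf = classic_idf(doc_freq)
        query_weight = boost * idf * QUERY_NORM
        field_weight = math.sqrt(freq) * idf * field_norm
        total += query_weight * field_weight
    coord = len(terms) / query_terms_total
    return total * coord


# (boost, docFreq, termFreq) for the five matching terms of doc 4706.
terms_4706 = [
    (1.0,       465,  1.0),  # good (boost assumed 1.0: no boost line shown)
    (1.2204702, 136,  3.0),  # probabilistic
    (1.2567556, 111,  1.0),  # weighting
    (1.9871043, 3037, 1.0),  # some
    (2.008604,  454,  1.0),  # function
]
score_4706 = document_score(terms_4706, field_norm=0.09375,
                            query_terms_total=25)
print(score_4706)
```

The result should agree with the listed 0.1182697 to within rounding. The coord factor is why a document matching only 5 of the 25 query terms keeps just a fifth of its summed term score.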