Document (#29157)

Author
Aizawa, A.
Title
¬An information-theoretic perspective of tf-idf measures
Source
Information processing and management. 39(2003) no.1, pp.45-65
Year
2003
Abstract
This paper presents a mathematical definition of the "probability-weighted amount of information" (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency-inverse document frequency (tf-idf) measures that are commonly used in today's information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation.
Theme
Retrieval algorithms
Object
TF/iDF

Similar documents (content)

  1. Bruza, P.D.; Huibers, T.W.C.: ¬A study of aboutness in information retrieval (1996) 0.20
    0.20478061 = sum of:
      0.20478061 = product of:
        0.85325253 = sum of:
          0.080185756 = weight(abstract_txt:expressed in 775) [ClassicSimilarity], result of:
            0.080185756 = score(doc=775,freq=1.0), product of:
              0.13704687 = queryWeight, product of:
                1.0622987 = boost
                6.2410383 = idf(docFreq=223, maxDocs=42306)
                0.020671198 = queryNorm
              0.5850973 = fieldWeight in 775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2410383 = idf(docFreq=223, maxDocs=42306)
                0.09375 = fieldNorm(doc=775)
          0.06124228 = weight(abstract_txt:retrieval in 775) [ClassicSimilarity], result of:
            0.06124228 = score(doc=775,freq=5.0), product of:
              0.08437074 = queryWeight, product of:
                1.1787535 = boost
                3.4626071 = idf(docFreq=3604, maxDocs=42306)
                0.020671198 = queryNorm
              0.7258711 = fieldWeight in 775, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.4626071 = idf(docFreq=3604, maxDocs=42306)
                0.09375 = fieldNorm(doc=775)
          0.17330557 = weight(abstract_txt:probabilities in 775) [ClassicSimilarity], result of:
            0.17330557 = score(doc=775,freq=1.0), product of:
              0.22909343 = queryWeight, product of:
                1.3734674 = boost
                8.069165 = idf(docFreq=35, maxDocs=42306)
                0.020671198 = queryNorm
              0.75648427 = fieldWeight in 775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.069165 = idf(docFreq=35, maxDocs=42306)
                0.09375 = fieldNorm(doc=775)
          0.15874027 = weight(abstract_txt:definition in 775) [ClassicSimilarity], result of:
            0.15874027 = score(doc=775,freq=2.0), product of:
              0.2160706 = queryWeight, product of:
                1.8863615 = boost
                5.541217 = idf(docFreq=450, maxDocs=42306)
                0.020671198 = queryNorm
              0.7346685 = fieldWeight in 775, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.541217 = idf(docFreq=450, maxDocs=42306)
                0.09375 = fieldNorm(doc=775)
          0.058389124 = weight(abstract_txt:information in 775) [ClassicSimilarity], result of:
            0.058389124 = score(doc=775,freq=6.0), product of:
              0.10438349 = queryWeight, product of:
                2.0730653 = boost
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.020671198 = queryNorm
              0.55937123 = fieldWeight in 775, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.09375 = fieldNorm(doc=775)
          0.32138947 = weight(abstract_txt:theoretic in 775) [ClassicSimilarity], result of:
            0.32138947 = score(doc=775,freq=1.0), product of:
              0.43568107 = queryWeight, product of:
                2.6786218 = boost
                7.8684945 = idf(docFreq=43, maxDocs=42306)
                0.020671198 = queryNorm
              0.7376714 = fieldWeight in 775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.8684945 = idf(docFreq=43, maxDocs=42306)
                0.09375 = fieldNorm(doc=775)
        0.24 = coord(6/25)
    
  2. Wong, S.K.M.; Yao, Y.Y.: ¬An information-theoretic measure of term specificity (1992) 0.19
    0.18735898 = sum of:
      0.18735898 = product of:
        0.9367949 = sum of:
          0.103747755 = weight(abstract_txt:occurrence in 4807) [ClassicSimilarity], result of:
            0.103747755 = score(doc=4807,freq=1.0), product of:
              0.16272593 = queryWeight, product of:
                1.1575518 = boost
                6.800654 = idf(docFreq=127, maxDocs=42306)
                0.020671198 = queryNorm
              0.6375613 = fieldWeight in 4807, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.800654 = idf(docFreq=127, maxDocs=42306)
                0.09375 = fieldNorm(doc=4807)
          0.1474073 = weight(abstract_txt:inverse in 4807) [ClassicSimilarity], result of:
            0.1474073 = score(doc=4807,freq=1.0), product of:
              0.20566021 = queryWeight, product of:
                1.3013293 = boost
                7.645351 = idf(docFreq=54, maxDocs=42306)
                0.020671198 = queryNorm
              0.71675164 = fieldWeight in 4807, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.645351 = idf(docFreq=54, maxDocs=42306)
                0.09375 = fieldNorm(doc=4807)
          0.19741553 = weight(abstract_txt:frequency in 4807) [ClassicSimilarity], result of:
            0.19741553 = score(doc=4807,freq=2.0), product of:
              0.24987635 = queryWeight, product of:
                2.0285683 = boost
                5.958952 = idf(docFreq=296, maxDocs=42306)
                0.020671198 = queryNorm
              0.7900529 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.958952 = idf(docFreq=296, maxDocs=42306)
                0.09375 = fieldNorm(doc=4807)
          0.03371097 = weight(abstract_txt:information in 4807) [ClassicSimilarity], result of:
            0.03371097 = score(doc=4807,freq=2.0), product of:
              0.10438349 = queryWeight, product of:
                2.0730653 = boost
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.020671198 = queryNorm
              0.3229531 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.09375 = fieldNorm(doc=4807)
          0.4545133 = weight(abstract_txt:theoretic in 4807) [ClassicSimilarity], result of:
            0.4545133 = score(doc=4807,freq=2.0), product of:
              0.43568107 = queryWeight, product of:
                2.6786218 = boost
                7.8684945 = idf(docFreq=43, maxDocs=42306)
                0.020671198 = queryNorm
              1.0432248 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.8684945 = idf(docFreq=43, maxDocs=42306)
                0.09375 = fieldNorm(doc=4807)
        0.2 = coord(5/25)
    
  3. Rölleke, T.; Tsikrika, T.; Kazai, G.: ¬A general matrix framework for modelling Information Retrieval (2006) 0.18
    0.18486285 = sum of:
      0.18486285 = product of:
        0.57769644 = sum of:
          0.053457174 = weight(abstract_txt:expressed in 2958) [ClassicSimilarity], result of:
            0.053457174 = score(doc=2958,freq=1.0), product of:
              0.13704687 = queryWeight, product of:
                1.0622987 = boost
                6.2410383 = idf(docFreq=223, maxDocs=42306)
                0.020671198 = queryNorm
              0.3900649 = fieldWeight in 2958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2410383 = idf(docFreq=223, maxDocs=42306)
                0.0625 = fieldNorm(doc=2958)
          0.03651784 = weight(abstract_txt:retrieval in 2958) [ClassicSimilarity], result of:
            0.03651784 = score(doc=2958,freq=4.0), product of:
              0.08437074 = queryWeight, product of:
                1.1787535 = boost
                3.4626071 = idf(docFreq=3604, maxDocs=42306)
                0.020671198 = queryNorm
              0.4328259 = fieldWeight in 2958, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4626071 = idf(docFreq=3604, maxDocs=42306)
                0.0625 = fieldNorm(doc=2958)
          0.09827153 = weight(abstract_txt:inverse in 2958) [ClassicSimilarity], result of:
            0.09827153 = score(doc=2958,freq=1.0), product of:
              0.20566021 = queryWeight, product of:
                1.3013293 = boost
                7.645351 = idf(docFreq=54, maxDocs=42306)
                0.020671198 = queryNorm
              0.47783443 = fieldWeight in 2958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.645351 = idf(docFreq=54, maxDocs=42306)
                0.0625 = fieldNorm(doc=2958)
          0.029429564 = weight(abstract_txt:terms in 2958) [ClassicSimilarity], result of:
            0.029429564 = score(doc=2958,freq=1.0), product of:
              0.115983896 = queryWeight, product of:
                1.3820568 = boost
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.020671198 = queryNorm
              0.25373837 = fieldWeight in 2958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.0625 = fieldNorm(doc=2958)
          0.07088612 = weight(abstract_txt:measures in 2958) [ClassicSimilarity], result of:
            0.07088612 = score(doc=2958,freq=1.0), product of:
              0.20840874 = queryWeight, product of:
                1.8526144 = boost
                5.4420843 = idf(docFreq=497, maxDocs=42306)
                0.020671198 = queryNorm
              0.34013027 = fieldWeight in 2958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4420843 = idf(docFreq=497, maxDocs=42306)
                0.0625 = fieldNorm(doc=2958)
          0.1611891 = weight(abstract_txt:frequency in 2958) [ClassicSimilarity], result of:
            0.1611891 = score(doc=2958,freq=3.0), product of:
              0.24987635 = queryWeight, product of:
                2.0285683 = boost
                5.958952 = idf(docFreq=296, maxDocs=42306)
                0.020671198 = queryNorm
              0.64507544 = fieldWeight in 2958, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.958952 = idf(docFreq=296, maxDocs=42306)
                0.0625 = fieldNorm(doc=2958)
          0.015891505 = weight(abstract_txt:information in 2958) [ClassicSimilarity], result of:
            0.015891505 = score(doc=2958,freq=1.0), product of:
              0.10438349 = queryWeight, product of:
                2.0730653 = boost
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.020671198 = queryNorm
              0.15224156 = fieldWeight in 2958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.0625 = fieldNorm(doc=2958)
          0.11205363 = weight(abstract_txt:mathematical in 2958) [ClassicSimilarity], result of:
            0.11205363 = score(doc=2958,freq=1.0), product of:
              0.28280848 = queryWeight, product of:
                2.1581085 = boost
                6.339478 = idf(docFreq=202, maxDocs=42306)
                0.020671198 = queryNorm
              0.39621738 = fieldWeight in 2958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.339478 = idf(docFreq=202, maxDocs=42306)
                0.0625 = fieldNorm(doc=2958)
        0.32 = coord(8/25)
    
  4. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.17
    0.16939232 = sum of:
      0.16939232 = product of:
        0.529351 = sum of:
          0.06051952 = weight(abstract_txt:occurrence in 3284) [ClassicSimilarity], result of:
            0.06051952 = score(doc=3284,freq=1.0), product of:
              0.16272593 = queryWeight, product of:
                1.1575518 = boost
                6.800654 = idf(docFreq=127, maxDocs=42306)
                0.020671198 = queryNorm
              0.37191075 = fieldWeight in 3284, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.800654 = idf(docFreq=127, maxDocs=42306)
                0.0546875 = fieldNorm(doc=3284)
          0.089358255 = weight(abstract_txt:probability in 3284) [ClassicSimilarity], result of:
            0.089358255 = score(doc=3284,freq=2.0), product of:
              0.16747098 = queryWeight, product of:
                1.1743075 = boost
                6.899094 = idf(docFreq=115, maxDocs=42306)
                0.020671198 = queryNorm
              0.5335746 = fieldWeight in 3284, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.899094 = idf(docFreq=115, maxDocs=42306)
                0.0546875 = fieldNorm(doc=3284)
          0.035724666 = weight(abstract_txt:retrieval in 3284) [ClassicSimilarity], result of:
            0.035724666 = score(doc=3284,freq=5.0), product of:
              0.08437074 = queryWeight, product of:
                1.1787535 = boost
                3.4626071 = idf(docFreq=3604, maxDocs=42306)
                0.020671198 = queryNorm
              0.4234248 = fieldWeight in 3284, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.4626071 = idf(docFreq=3604, maxDocs=42306)
                0.0546875 = fieldNorm(doc=3284)
          0.08598759 = weight(abstract_txt:inverse in 3284) [ClassicSimilarity], result of:
            0.08598759 = score(doc=3284,freq=1.0), product of:
              0.20566021 = queryWeight, product of:
                1.3013293 = boost
                7.645351 = idf(docFreq=54, maxDocs=42306)
                0.020671198 = queryNorm
              0.41810513 = fieldWeight in 3284, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.645351 = idf(docFreq=54, maxDocs=42306)
                0.0546875 = fieldNorm(doc=3284)
          0.09718636 = weight(abstract_txt:calculation in 3284) [ClassicSimilarity], result of:
            0.09718636 = score(doc=3284,freq=1.0), product of:
              0.22314987 = queryWeight, product of:
                1.3555338 = boost
                7.9638047 = idf(docFreq=39, maxDocs=42306)
                0.020671198 = queryNorm
              0.43552056 = fieldWeight in 3284, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9638047 = idf(docFreq=39, maxDocs=42306)
                0.0546875 = fieldNorm(doc=3284)
          0.02575087 = weight(abstract_txt:terms in 3284) [ClassicSimilarity], result of:
            0.02575087 = score(doc=3284,freq=1.0), product of:
              0.115983896 = queryWeight, product of:
                1.3820568 = boost
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.020671198 = queryNorm
              0.22202107 = fieldWeight in 3284, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.0546875 = fieldNorm(doc=3284)
          0.11515906 = weight(abstract_txt:frequency in 3284) [ClassicSimilarity], result of:
            0.11515906 = score(doc=3284,freq=2.0), product of:
              0.24987635 = queryWeight, product of:
                2.0285683 = boost
                5.958952 = idf(docFreq=296, maxDocs=42306)
                0.020671198 = queryNorm
              0.4608642 = fieldWeight in 3284, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.958952 = idf(docFreq=296, maxDocs=42306)
                0.0546875 = fieldNorm(doc=3284)
          0.019664733 = weight(abstract_txt:information in 3284) [ClassicSimilarity], result of:
            0.019664733 = score(doc=3284,freq=2.0), product of:
              0.10438349 = queryWeight, product of:
                2.0730653 = boost
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.020671198 = queryNorm
              0.1883893 = fieldWeight in 3284, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.0546875 = fieldNorm(doc=3284)
        0.32 = coord(8/25)
    
  5. Wong, S.K.M.: On modelling information retrieval with probabilistic inference (1995) 0.14
    0.14479361 = sum of:
      0.14479361 = product of:
        0.6033067 = sum of:
          0.08232677 = weight(abstract_txt:conventional in 2007) [ClassicSimilarity], result of:
            0.08232677 = score(doc=2007,freq=1.0), product of:
              0.13947563 = queryWeight, product of:
                1.0716704 = boost
                6.2960978 = idf(docFreq=211, maxDocs=42306)
                0.020671198 = queryNorm
              0.5902592 = fieldWeight in 2007, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2960978 = idf(docFreq=211, maxDocs=42306)
                0.09375 = fieldNorm(doc=2007)
          0.10831857 = weight(abstract_txt:probability in 2007) [ClassicSimilarity], result of:
            0.10831857 = score(doc=2007,freq=1.0), product of:
              0.16747098 = queryWeight, product of:
                1.1743075 = boost
                6.899094 = idf(docFreq=115, maxDocs=42306)
                0.020671198 = queryNorm
              0.6467901 = fieldWeight in 2007, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.899094 = idf(docFreq=115, maxDocs=42306)
                0.09375 = fieldNorm(doc=2007)
          0.047438063 = weight(abstract_txt:retrieval in 2007) [ClassicSimilarity], result of:
            0.047438063 = score(doc=2007,freq=3.0), product of:
              0.08437074 = queryWeight, product of:
                1.1787535 = boost
                3.4626071 = idf(docFreq=3604, maxDocs=42306)
                0.020671198 = queryNorm
              0.5622573 = fieldWeight in 2007, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4626071 = idf(docFreq=3604, maxDocs=42306)
                0.09375 = fieldNorm(doc=2007)
          0.17330557 = weight(abstract_txt:probabilities in 2007) [ClassicSimilarity], result of:
            0.17330557 = score(doc=2007,freq=1.0), product of:
              0.22909343 = queryWeight, product of:
                1.3734674 = boost
                8.069165 = idf(docFreq=35, maxDocs=42306)
                0.020671198 = queryNorm
              0.75648427 = fieldWeight in 2007, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.069165 = idf(docFreq=35, maxDocs=42306)
                0.09375 = fieldNorm(doc=2007)
          0.023837257 = weight(abstract_txt:information in 2007) [ClassicSimilarity], result of:
            0.023837257 = score(doc=2007,freq=1.0), product of:
              0.10438349 = queryWeight, product of:
                2.0730653 = boost
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.020671198 = queryNorm
              0.22836234 = fieldWeight in 2007, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.435865 = idf(docFreq=10064, maxDocs=42306)
                0.09375 = fieldNorm(doc=2007)
          0.16808046 = weight(abstract_txt:mathematical in 2007) [ClassicSimilarity], result of:
            0.16808046 = score(doc=2007,freq=1.0), product of:
              0.28280848 = queryWeight, product of:
                2.1581085 = boost
                6.339478 = idf(docFreq=202, maxDocs=42306)
                0.020671198 = queryNorm
              0.5943261 = fieldWeight in 2007, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.339478 = idf(docFreq=202, maxDocs=42306)
                0.09375 = fieldNorm(doc=2007)
        0.24 = coord(6/25)
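
The score breakdowns above can be reproduced by hand. The following is a minimal sketch, assuming the listed values follow the classic Lucene TF-IDF formulas: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), and a coord factor equal to matched terms divided by query terms. The function names are hypothetical and are not part of any library. The boost, queryNorm and fieldNorm values are copied from the listing for document 775 and are treated as given inputs.

```python
import math

MAX_DOCS = 42306          # maxDocs, as shown in the listing
QUERY_NORM = 0.020671198  # queryNorm, as shown in the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed classic formula: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    # Assumed classic formula: square root of the raw term frequency
    return math.sqrt(freq)


def term_score(freq, doc_freq, boost, field_norm):
    # score = queryWeight * fieldWeight, mirroring the nesting in the listing
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = tf(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


def document_score(term_scores, matched, query_terms):
    # Sum of the per-term scores, scaled by coord(matched/query_terms)
    return sum(term_scores) * (matched / query_terms)


# (freq, docFreq, boost) for the six matching terms of document 775
terms_775 = [
    (1.0, 223, 1.0622987),    # expressed
    (5.0, 3604, 1.1787535),   # retrieval
    (1.0, 35, 1.3734674),     # probabilities
    (2.0, 450, 1.8863615),    # definition
    (6.0, 10064, 2.0730653),  # information
    (1.0, 43, 2.6786218),     # theoretic
]

scores_775 = [term_score(f, df, b, field_norm=0.09375) for f, df, b in terms_775]
total_775 = document_score(scores_775, matched=6, query_terms=25)

print(round(idf(223), 4))    # idf for "expressed"
print(round(total_775, 4))   # overall score for document 775
```

Under these assumptions the computed total agrees with the listed 0.20478061 up to floating-point rounding. The breakdown also shows why rare terms such as "theoretic" (docFreq=43) dominate the ranking while "information" (docFreq=10064) contributes little, even at a higher term frequency.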