Document (#33028)

Author
Carterette, B.
Can, F.
Title
Comparing inverted files and signature files for searching a large lexicon
Source
Information processing and management. 41(2005) no.3, p.613-634
Year
2005
Abstract
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for answering partially specified queries against a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
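
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the n-gram length, the signature width, the CRC32 hash, the `*` wildcard syntax and all function names are assumptions made for illustration, and compression of the slices is left out. Each n-gram sets exactly one bit of a term signature, the signatures are stored as bit slices (one bitmap over all terms per bit position), and a partially specified query is answered by ANDing the slices of its n-grams and then checking the surviving candidates for false matches.

```python
import zlib
from fnmatch import fnmatchcase

N = 2        # n-gram length (assumed for this sketch)
WIDTH = 64   # signature width in bits (assumed for this sketch)


def ngrams(text, n=N):
    """Return the set of character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def bit_for(gram, width=WIDTH):
    """Map an n-gram to the single signature bit it sets (CRC32 is an assumed hash)."""
    return zlib.crc32(gram.encode("utf-8")) % width


def build_slices(lexicon, width=WIDTH):
    """Build the bit-sliced signature file: slices[b] has bit i set
    when term i contains an n-gram that hashes to bit position b."""
    slices = [0] * width
    for term_id, term in enumerate(lexicon):
        for gram in ngrams(term):
            slices[bit_for(gram, width)] |= 1 << term_id
    return slices


def search(pattern, lexicon, slices):
    """Answer a query in which '*' is the only wildcard (an assumption)."""
    candidates = (1 << len(lexicon)) - 1          # start with every term
    for fragment in pattern.split("*"):
        for gram in ngrams(fragment):
            candidates &= slices[bit_for(gram, len(slices))]
    # Signatures can produce false matches, so verify each candidate.
    return [term for i, term in enumerate(lexicon)
            if (candidates >> i) & 1 and fnmatchcase(term, pattern)]


lexicon = ["signature", "signal", "inverted", "lexicon", "nature"]
slices = build_slices(lexicon)
print(search("sig*", lexicon, slices))
print(search("s*ture", lexicon, slices))
```

The verification step is what keeps the result exact: the slices can only over-report candidates, never miss a term, because every n-gram of a matching term has set its bit when the index was built.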

Similar documents (content)

  1. Kelledy, F.; Smeaton, A.F.: Signature files and beyond (1996) 0.45
    0.45294043 = sum of:
      0.45294043 = product of:
        1.6176444 = sum of:
          0.012871655 = weight(abstract_txt:than in 40) [ClassicSimilarity], result of:
            0.012871655 = score(doc=40,freq=1.0), product of:
              0.042256318 = queryWeight, product of:
                1.2282072 = boost
                3.8989954 = idf(docFreq=2398, maxDocs=43556)
                0.0088240355 = queryNorm
              0.304609 = fieldWeight in 40, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8989954 = idf(docFreq=2398, maxDocs=43556)
                0.078125 = fieldNorm(doc=40)
          0.025494393 = weight(abstract_txt:searching in 40) [ClassicSimilarity], result of:
            0.025494393 = score(doc=40,freq=1.0), product of:
              0.07628906 = queryWeight, product of:
                2.021169 = boost
                4.2775235 = idf(docFreq=1642, maxDocs=43556)
                0.0088240355 = queryNorm
              0.33418152 = fieldWeight in 40, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2775235 = idf(docFreq=1642, maxDocs=43556)
                0.078125 = fieldNorm(doc=40)
          0.028954558 = weight(abstract_txt:large in 40) [ClassicSimilarity], result of:
            0.028954558 = score(doc=40,freq=1.0), product of:
              0.08304442 = queryWeight, product of:
                2.1087577 = boost
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.0088240355 = queryNorm
              0.3486635 = fieldWeight in 40, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.078125 = fieldNorm(doc=40)
          0.10616105 = weight(abstract_txt:file in 40) [ClassicSimilarity], result of:
            0.10616105 = score(doc=40,freq=2.0), product of:
              0.17249593 = queryWeight, product of:
                3.509379 = boost
                5.5703354 = idf(docFreq=450, maxDocs=43556)
                0.0088240355 = queryNorm
              0.6154409 = fieldWeight in 40, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5703354 = idf(docFreq=450, maxDocs=43556)
                0.078125 = fieldNorm(doc=40)
          0.20274234 = weight(abstract_txt:files in 40) [ClassicSimilarity], result of:
            0.20274234 = score(doc=40,freq=4.0), product of:
              0.22701699 = queryWeight, product of:
                4.501166 = boost
                5.715656 = idf(docFreq=389, maxDocs=43556)
                0.0088240355 = queryNorm
              0.89307123 = fieldWeight in 40, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.715656 = idf(docFreq=389, maxDocs=43556)
                0.078125 = fieldNorm(doc=40)
          0.19631402 = weight(abstract_txt:inverted in 40) [ClassicSimilarity], result of:
            0.19631402 = score(doc=40,freq=1.0), product of:
              0.3274258 = queryWeight, product of:
                4.835009 = boost
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.0088240355 = queryNorm
              0.59956795 = fieldWeight in 40, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.078125 = fieldNorm(doc=40)
          1.0451064 = weight(abstract_txt:signature in 40) [ClassicSimilarity], result of:
            1.0451064 = score(doc=40,freq=5.0), product of:
              0.70351774 = queryWeight, product of:
                9.375564 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.0088240355 = queryNorm
              1.4855438 = fieldWeight in 40, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.078125 = fieldNorm(doc=40)
        0.28 = coord(7/25)
    
  2. Lam, W.; Wong, K.-F.; Wong, C.-Y.: Chinese document indexing based on new partitioned signature file : model and evaluation (2001) 0.37
    0.37411085 = sum of:
      0.37411085 = product of:
        1.3361101 = sum of:
          0.03248191 = weight(abstract_txt:faster in 1301) [ClassicSimilarity], result of:
            0.03248191 = score(doc=1301,freq=1.0), product of:
              0.072137274 = queryWeight, product of:
                1.1347252 = boost
                7.204466 = idf(docFreq=87, maxDocs=43556)
                0.0088240355 = queryNorm
              0.45027912 = fieldWeight in 1301, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.204466 = idf(docFreq=87, maxDocs=43556)
                0.0625 = fieldNorm(doc=1301)
          0.0102973245 = weight(abstract_txt:than in 1301) [ClassicSimilarity], result of:
            0.0102973245 = score(doc=1301,freq=1.0), product of:
              0.042256318 = queryWeight, product of:
                1.2282072 = boost
                3.8989954 = idf(docFreq=2398, maxDocs=43556)
                0.0088240355 = queryNorm
              0.24368721 = fieldWeight in 1301, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8989954 = idf(docFreq=2398, maxDocs=43556)
                0.0625 = fieldNorm(doc=1301)
          0.031925242 = weight(abstract_txt:method in 1301) [ClassicSimilarity], result of:
            0.031925242 = score(doc=1301,freq=4.0), product of:
              0.05659936 = queryWeight, product of:
                1.42145 = boost
                4.5124526 = idf(docFreq=1298, maxDocs=43556)
                0.0088240355 = queryNorm
              0.5640566 = fieldWeight in 1301, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.5124526 = idf(docFreq=1298, maxDocs=43556)
                0.0625 = fieldNorm(doc=1301)
          0.023163646 = weight(abstract_txt:large in 1301) [ClassicSimilarity], result of:
            0.023163646 = score(doc=1301,freq=1.0), product of:
              0.08304442 = queryWeight, product of:
                2.1087577 = boost
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.0088240355 = queryNorm
              0.2789308 = fieldWeight in 1301, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.0625 = fieldNorm(doc=1301)
          0.13428429 = weight(abstract_txt:file in 1301) [ClassicSimilarity], result of:
            0.13428429 = score(doc=1301,freq=5.0), product of:
              0.17249593 = queryWeight, product of:
                3.509379 = boost
                5.5703354 = idf(docFreq=450, maxDocs=43556)
                0.0088240355 = queryNorm
              0.778478 = fieldWeight in 1301, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.5703354 = idf(docFreq=450, maxDocs=43556)
                0.0625 = fieldNorm(doc=1301)
          0.11468838 = weight(abstract_txt:files in 1301) [ClassicSimilarity], result of:
            0.11468838 = score(doc=1301,freq=2.0), product of:
              0.22701699 = queryWeight, product of:
                4.501166 = boost
                5.715656 = idf(docFreq=389, maxDocs=43556)
                0.0088240355 = queryNorm
              0.50519735 = fieldWeight in 1301, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.715656 = idf(docFreq=389, maxDocs=43556)
                0.0625 = fieldNorm(doc=1301)
          0.98926926 = weight(abstract_txt:signature in 1301) [ClassicSimilarity], result of:
            0.98926926 = score(doc=1301,freq=7.0), product of:
              0.70351774 = queryWeight, product of:
                9.375564 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.0088240355 = queryNorm
              1.4061753 = fieldWeight in 1301, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.0625 = fieldNorm(doc=1301)
        0.28 = coord(7/25)
    
  3. Robertson, A.M.; Willett, P.: Applications of n-grams in textual information systems (1998) 0.28
    0.27675873 = sum of:
      0.27675873 = product of:
        1.3837936 = sum of:
          0.07690595 = weight(abstract_txt:gram in 713) [ClassicSimilarity], result of:
            0.07690595 = score(doc=713,freq=1.0), product of:
              0.08824294 = queryWeight, product of:
                1.2550205 = boost
                7.9682307 = idf(docFreq=40, maxDocs=43556)
                0.0088240355 = queryNorm
              0.8715252 = fieldWeight in 713, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9682307 = idf(docFreq=40, maxDocs=43556)
                0.109375 = fieldNorm(doc=713)
          0.23578832 = weight(abstract_txt:grams in 713) [ClassicSimilarity], result of:
            0.23578832 = score(doc=713,freq=2.0), product of:
              0.18623224 = queryWeight, product of:
                2.5784178 = boost
                8.185295 = idf(docFreq=32, maxDocs=43556)
                0.0088240355 = queryNorm
              1.2660983 = fieldWeight in 713, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.185295 = idf(docFreq=32, maxDocs=43556)
                0.109375 = fieldNorm(doc=713)
          0.14191963 = weight(abstract_txt:files in 713) [ClassicSimilarity], result of:
            0.14191963 = score(doc=713,freq=1.0), product of:
              0.22701699 = queryWeight, product of:
                4.501166 = boost
                5.715656 = idf(docFreq=389, maxDocs=43556)
                0.0088240355 = queryNorm
              0.62514985 = fieldWeight in 713, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.715656 = idf(docFreq=389, maxDocs=43556)
                0.109375 = fieldNorm(doc=713)
          0.2748396 = weight(abstract_txt:inverted in 713) [ClassicSimilarity], result of:
            0.2748396 = score(doc=713,freq=1.0), product of:
              0.3274258 = queryWeight, product of:
                4.835009 = boost
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.0088240355 = queryNorm
              0.8393951 = fieldWeight in 713, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.109375 = fieldNorm(doc=713)
          0.6543401 = weight(abstract_txt:signature in 713) [ClassicSimilarity], result of:
            0.6543401 = score(doc=713,freq=1.0), product of:
              0.70351774 = queryWeight, product of:
                9.375564 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.0088240355 = queryNorm
              0.9300975 = fieldWeight in 713, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.109375 = fieldNorm(doc=713)
        0.2 = coord(5/25)
    
  4. Lee, D.L.: Massive parallelism on the hybrid text-retrieval machine (1995) 0.19
    0.19343777 = sum of:
      0.19343777 = product of:
        1.208986 = sum of:
          0.03474547 = weight(abstract_txt:large in 4141) [ClassicSimilarity], result of:
            0.03474547 = score(doc=4141,freq=1.0), product of:
              0.08304442 = queryWeight, product of:
                2.1087577 = boost
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.0088240355 = queryNorm
              0.41839623 = fieldWeight in 4141, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.09375 = fieldNorm(doc=4141)
          0.11271676 = weight(abstract_txt:memory in 4141) [ClassicSimilarity], result of:
            0.11271676 = score(doc=4141,freq=1.0), product of:
              0.1819857 = queryWeight, product of:
                3.1216924 = boost
                6.606629 = idf(docFreq=159, maxDocs=43556)
                0.0088240355 = queryNorm
              0.6193715 = fieldWeight in 4141, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.606629 = idf(docFreq=159, maxDocs=43556)
                0.09375 = fieldNorm(doc=4141)
          0.09008064 = weight(abstract_txt:file in 4141) [ClassicSimilarity], result of:
            0.09008064 = score(doc=4141,freq=1.0), product of:
              0.17249593 = queryWeight, product of:
                3.509379 = boost
                5.5703354 = idf(docFreq=450, maxDocs=43556)
                0.0088240355 = queryNorm
              0.52221894 = fieldWeight in 4141, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5703354 = idf(docFreq=450, maxDocs=43556)
                0.09375 = fieldNorm(doc=4141)
          0.9714431 = weight(abstract_txt:signature in 4141) [ClassicSimilarity], result of:
            0.9714431 = score(doc=4141,freq=3.0), product of:
              0.70351774 = queryWeight, product of:
                9.375564 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.0088240355 = queryNorm
              1.3808367 = fieldWeight in 4141, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.09375 = fieldNorm(doc=4141)
        0.16 = coord(4/25)
    
  5. Lee, D.L.; Ren, L.: Document ranking on weight-partitioned signature files (1996) 0.19
    0.18671213 = sum of:
      0.18671213 = product of:
        1.5559344 = sum of:
          0.18016128 = weight(abstract_txt:file in 3415) [ClassicSimilarity], result of:
            0.18016128 = score(doc=3415,freq=4.0), product of:
              0.17249593 = queryWeight, product of:
                3.509379 = boost
                5.5703354 = idf(docFreq=450, maxDocs=43556)
                0.0088240355 = queryNorm
              1.0444379 = fieldWeight in 3415, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.5703354 = idf(docFreq=450, maxDocs=43556)
                0.09375 = fieldNorm(doc=3415)
          0.1216454 = weight(abstract_txt:files in 3415) [ClassicSimilarity], result of:
            0.1216454 = score(doc=3415,freq=1.0), product of:
              0.22701699 = queryWeight, product of:
                4.501166 = boost
                5.715656 = idf(docFreq=389, maxDocs=43556)
                0.0088240355 = queryNorm
              0.5358427 = fieldWeight in 3415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.715656 = idf(docFreq=389, maxDocs=43556)
                0.09375 = fieldNorm(doc=3415)
          1.2541277 = weight(abstract_txt:signature in 3415) [ClassicSimilarity], result of:
            1.2541277 = score(doc=3415,freq=5.0), product of:
              0.70351774 = queryWeight, product of:
                9.375564 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.0088240355 = queryNorm
              1.7826526 = fieldWeight in 3415, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.09375 = fieldNorm(doc=3415)
        0.12 = coord(3/25)
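
The score trees above can be recomputed from their leaf values. The following is a minimal sketch of the arithmetic they display, assuming the TF-IDF formulas of Lucene's ClassicSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are hypothetical, and the constants are copied from the explanation for document 3415. Lucene computes in 32-bit floats, so the last digits may differ slightly.

```python
import math

MAX_DOCS = 43556
QUERY_NORM = 0.0088240355


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency as shown in the explain output."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq, field_norm, query_norm=QUERY_NORM):
    """score = queryWeight * fieldWeight for one query term."""
    query_weight = boost * idf(doc_freq) * query_norm
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


def doc_score(terms, matched, total):
    """Sum the term scores and scale by coord(matched/total)."""
    return sum(term_score(*t) for t in terms) * (matched / total)


# (boost, docFreq, termFreq, fieldNorm) for the three terms matched in doc 3415
terms_3415 = [
    (3.509379, 450, 4.0, 0.09375),   # abstract_txt:file
    (4.501166, 389, 1.0, 0.09375),   # abstract_txt:files
    (9.375564, 23, 5.0, 0.09375),    # abstract_txt:signature
]
score_3415 = doc_score(terms_3415, matched=3, total=25)
print(round(score_3415, 4))
```

The coord factor explains why document 3415 ranks last despite the largest raw sum (1.5559344): it matches only 3 of the 25 query terms, so the sum is scaled by 0.12.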