Document (#33031)

Author
Carterette, B.
Can, F.
Title
Comparing inverted files and signature files for searching a large lexicon
Source
Information processing and management. 41(2005) no.3, pp.613-634
Year
2005
Abstract
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
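
To make the abstract's idea concrete, the following is a minimal sketch of a bit-sliced signature file over character n-grams in which each n-gram sets exactly one bit of the term signature. It is not the authors' implementation: the class name `BitSlicedSignatureFile`, the toy hash in `bit_position`, the default width of 64, and the use of plain substring queries as a stand-in for partially-specified queries are all assumptions made for illustration, and no compression is shown.

```python
from typing import Iterable, List


def ngrams(term: str, n: int = 2) -> List[str]:
    """Return the overlapping character n-grams of a term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def bit_position(gram: str, width: int) -> int:
    """Map an n-gram to ONE bit position (hypothetical toy hash, deterministic)."""
    h = 0
    for ch in gram:
        h = (h * 31 + ord(ch)) % width
    return h


class BitSlicedSignatureFile:
    """Illustrative index: one bitmap ("slice") per signature bit position."""

    def __init__(self, lexicon: Iterable[str], width: int = 64, n: int = 2):
        self.lexicon = list(lexicon)
        self.width = width
        self.n = n
        # slices[b] is an integer bitmap; bit t is set when term t
        # has bit b set in its signature.
        self.slices = [0] * width
        for t, term in enumerate(self.lexicon):
            for gram in ngrams(term, n):
                self.slices[bit_position(gram, width)] |= 1 << t

    def candidates(self, fragment: str) -> List[str]:
        """Terms whose signatures cover every n-gram of the fragment.

        Hash collisions can produce false matches, never false misses.
        """
        result = (1 << len(self.lexicon)) - 1  # start with all terms
        for gram in ngrams(fragment, self.n):
            result &= self.slices[bit_position(gram, self.width)]
        return [term for t, term in enumerate(self.lexicon) if (result >> t) & 1]

    def search(self, fragment: str) -> List[str]:
        """Candidates verified against the lexicon to drop false matches."""
        return [term for term in self.candidates(fragment) if fragment in term]


sf = BitSlicedSignatureFile(["signature", "signal", "inverted", "lexicon", "design"])
print(sf.search("sign"))  # ['signature', 'signal', 'design']
```

A query only reads the slices for the bits its own n-grams set, which is why the width of the signature (relative to the number of unique n-grams) governs the trade-off between index size and the number of false matches that must be verified.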

Similar documents (content)

  1. Kelledy, F.; Smeaton, A.F.: Signature files and beyond (1996) 0.45
    0.45105904 = sum of:
      0.45105904 = product of:
        1.6109252 = sum of:
          0.013111309 = weight(abstract_txt:than in 43) [ClassicSimilarity], result of:
            0.013111309 = score(doc=43,freq=1.0), product of:
              0.04283292 = queryWeight, product of:
                1.2370623 = boost
                3.9181254 = idf(docFreq=2285, maxDocs=42306)
                0.008837059 = queryNorm
              0.30610356 = fieldWeight in 43, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9181254 = idf(docFreq=2285, maxDocs=42306)
                0.078125 = fieldNorm(doc=43)
          0.02539916 = weight(abstract_txt:searching in 43) [ClassicSimilarity], result of:
            0.02539916 = score(doc=43,freq=1.0), product of:
              0.0761945 = queryWeight, product of:
                2.0207388 = boost
                4.2668333 = idf(docFreq=1612, maxDocs=42306)
                0.008837059 = queryNorm
              0.33334637 = fieldWeight in 43, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2668333 = idf(docFreq=1612, maxDocs=42306)
                0.078125 = fieldNorm(doc=43)
          0.029239152 = weight(abstract_txt:large in 43) [ClassicSimilarity], result of:
            0.029239152 = score(doc=43,freq=1.0), product of:
              0.08369263 = queryWeight, product of:
                2.1178343 = boost
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.008837059 = queryNorm
              0.3493635 = fieldWeight in 43, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.078125 = fieldNorm(doc=43)
          0.106307626 = weight(abstract_txt:file in 43) [ClassicSimilarity], result of:
            0.106307626 = score(doc=43,freq=2.0), product of:
              0.17287144 = queryWeight, product of:
                3.5146282 = boost
                5.5659094 = idf(docFreq=439, maxDocs=42306)
                0.008837059 = queryNorm
              0.6149519 = fieldWeight in 43, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5659094 = idf(docFreq=439, maxDocs=42306)
                0.078125 = fieldNorm(doc=43)
          0.20233352 = weight(abstract_txt:files in 43) [ClassicSimilarity], result of:
            0.20233352 = score(doc=43,freq=4.0), product of:
              0.22699636 = queryWeight, product of:
                4.502795 = boost
                5.704649 = idf(docFreq=382, maxDocs=42306)
                0.008837059 = queryNorm
              0.8913514 = fieldWeight in 43, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.704649 = idf(docFreq=382, maxDocs=42306)
                0.078125 = fieldNorm(doc=43)
          0.196226 = weight(abstract_txt:inverted in 43) [ClassicSimilarity], result of:
            0.196226 = score(doc=43,freq=1.0), product of:
              0.3277389 = queryWeight, product of:
                4.839291 = boost
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.008837059 = queryNorm
              0.5987266 = fieldWeight in 43, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.078125 = fieldNorm(doc=43)
          1.0383085 = weight(abstract_txt:signature in 43) [ClassicSimilarity], result of:
            1.0383085 = score(doc=43,freq=5.0), product of:
              0.7013432 = queryWeight, product of:
                9.364877 = boost
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.008837059 = queryNorm
              1.4804571 = fieldWeight in 43, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.078125 = fieldNorm(doc=43)
        0.28 = coord(7/25)
    
  2. Lam, W.; Wong, K.-F.; Wong, C.-Y.: Chinese document indexing based on new partitioned signature file : model and evaluation (2001) 0.37
    0.37263602 = sum of:
      0.37263602 = product of:
        1.330843 = sum of:
          0.032680057 = weight(abstract_txt:faster in 1304) [ClassicSimilarity], result of:
            0.032680057 = score(doc=1304,freq=1.0), product of:
              0.07252129 = queryWeight, product of:
                1.138205 = boost
                7.210033 = idf(docFreq=84, maxDocs=42306)
                0.008837059 = queryNorm
              0.45062706 = fieldWeight in 1304, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.210033 = idf(docFreq=84, maxDocs=42306)
                0.0625 = fieldNorm(doc=1304)
          0.010489047 = weight(abstract_txt:than in 1304) [ClassicSimilarity], result of:
            0.010489047 = score(doc=1304,freq=1.0), product of:
              0.04283292 = queryWeight, product of:
                1.2370623 = boost
                3.9181254 = idf(docFreq=2285, maxDocs=42306)
                0.008837059 = queryNorm
              0.24488284 = fieldWeight in 1304, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9181254 = idf(docFreq=2285, maxDocs=42306)
                0.0625 = fieldNorm(doc=1304)
          0.03252127 = weight(abstract_txt:method in 1304) [ClassicSimilarity], result of:
            0.03252127 = score(doc=1304,freq=4.0), product of:
              0.057373587 = queryWeight, product of:
                1.4317222 = boost
                4.534668 = idf(docFreq=1233, maxDocs=42306)
                0.008837059 = queryNorm
              0.5668335 = fieldWeight in 1304, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.534668 = idf(docFreq=1233, maxDocs=42306)
                0.0625 = fieldNorm(doc=1304)
          0.023391321 = weight(abstract_txt:large in 1304) [ClassicSimilarity], result of:
            0.023391321 = score(doc=1304,freq=1.0), product of:
              0.08369263 = queryWeight, product of:
                2.1178343 = boost
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.008837059 = queryNorm
              0.2794908 = fieldWeight in 1304, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.0625 = fieldNorm(doc=1304)
          0.13446969 = weight(abstract_txt:file in 1304) [ClassicSimilarity], result of:
            0.13446969 = score(doc=1304,freq=5.0), product of:
              0.17287144 = queryWeight, product of:
                3.5146282 = boost
                5.5659094 = idf(docFreq=439, maxDocs=42306)
                0.008837059 = queryNorm
              0.7778595 = fieldWeight in 1304, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.5659094 = idf(docFreq=439, maxDocs=42306)
                0.0625 = fieldNorm(doc=1304)
          0.11445712 = weight(abstract_txt:files in 1304) [ClassicSimilarity], result of:
            0.11445712 = score(doc=1304,freq=2.0), product of:
              0.22699636 = queryWeight, product of:
                4.502795 = boost
                5.704649 = idf(docFreq=382, maxDocs=42306)
                0.008837059 = queryNorm
              0.5042245 = fieldWeight in 1304, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.704649 = idf(docFreq=382, maxDocs=42306)
                0.0625 = fieldNorm(doc=1304)
          0.98283446 = weight(abstract_txt:signature in 1304) [ClassicSimilarity], result of:
            0.98283446 = score(doc=1304,freq=7.0), product of:
              0.7013432 = queryWeight, product of:
                9.364877 = boost
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.008837059 = queryNorm
              1.4013603 = fieldWeight in 1304, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.0625 = fieldNorm(doc=1304)
        0.28 = coord(7/25)
    
  3. Robertson, A.M.; Willett, P.: Applications of n-grams in textual information systems (1998) 0.28
    0.27567983 = sum of:
      0.27567983 = product of:
        1.3783991 = sum of:
          0.0778048 = weight(abstract_txt:gram in 716) [ClassicSimilarity], result of:
            0.0778048 = score(doc=716,freq=1.0), product of:
              0.08904083 = queryWeight, product of:
                1.2611953 = boost
                7.9891224 = idf(docFreq=38, maxDocs=42306)
                0.008837059 = queryNorm
              0.8738103 = fieldWeight in 716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9891224 = idf(docFreq=38, maxDocs=42306)
                0.109375 = fieldNorm(doc=716)
          0.23416066 = weight(abstract_txt:grams in 716) [ClassicSimilarity], result of:
            0.23416066 = score(doc=716,freq=2.0), product of:
              0.18560696 = queryWeight, product of:
                2.5751343 = boost
                8.156177 = idf(docFreq=32, maxDocs=42306)
                0.008837059 = queryNorm
              1.2615942 = fieldWeight in 716, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.156177 = idf(docFreq=32, maxDocs=42306)
                0.109375 = fieldNorm(doc=716)
          0.14163347 = weight(abstract_txt:files in 716) [ClassicSimilarity], result of:
            0.14163347 = score(doc=716,freq=1.0), product of:
              0.22699636 = queryWeight, product of:
                4.502795 = boost
                5.704649 = idf(docFreq=382, maxDocs=42306)
                0.008837059 = queryNorm
              0.62394595 = fieldWeight in 716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.704649 = idf(docFreq=382, maxDocs=42306)
                0.109375 = fieldNorm(doc=716)
          0.27471638 = weight(abstract_txt:inverted in 716) [ClassicSimilarity], result of:
            0.27471638 = score(doc=716,freq=1.0), product of:
              0.3277389 = queryWeight, product of:
                4.839291 = boost
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.008837059 = queryNorm
              0.8382172 = fieldWeight in 716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.109375 = fieldNorm(doc=716)
          0.6500839 = weight(abstract_txt:signature in 716) [ClassicSimilarity], result of:
            0.6500839 = score(doc=716,freq=1.0), product of:
              0.7013432 = queryWeight, product of:
                9.364877 = boost
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.008837059 = queryNorm
              0.92691267 = fieldWeight in 716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.109375 = fieldNorm(doc=716)
        0.2 = coord(5/25)
    
  4. Lee, D.L.: Massive parallelism on the hybrid text-retrieval machine (1995) 0.19
    0.192862 = sum of:
      0.192862 = product of:
        1.2053876 = sum of:
          0.03508698 = weight(abstract_txt:large in 4144) [ClassicSimilarity], result of:
            0.03508698 = score(doc=4144,freq=1.0), product of:
              0.08369263 = queryWeight, product of:
                2.1178343 = boost
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.008837059 = queryNorm
              0.41923618 = fieldWeight in 4144, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.09375 = fieldNorm(doc=4144)
          0.114971384 = weight(abstract_txt:memory in 4144) [ClassicSimilarity], result of:
            0.114971384 = score(doc=4144,freq=1.0), product of:
              0.18463601 = queryWeight, product of:
                3.1456225 = boost
                6.642049 = idf(docFreq=149, maxDocs=42306)
                0.008837059 = queryNorm
              0.6226921 = fieldWeight in 4144, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.642049 = idf(docFreq=149, maxDocs=42306)
                0.09375 = fieldNorm(doc=4144)
          0.09020501 = weight(abstract_txt:file in 4144) [ClassicSimilarity], result of:
            0.09020501 = score(doc=4144,freq=1.0), product of:
              0.17287144 = queryWeight, product of:
                3.5146282 = boost
                5.5659094 = idf(docFreq=439, maxDocs=42306)
                0.008837059 = queryNorm
              0.521804 = fieldWeight in 4144, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5659094 = idf(docFreq=439, maxDocs=42306)
                0.09375 = fieldNorm(doc=4144)
          0.96512425 = weight(abstract_txt:signature in 4144) [ClassicSimilarity], result of:
            0.96512425 = score(doc=4144,freq=3.0), product of:
              0.7013432 = queryWeight, product of:
                9.364877 = boost
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.008837059 = queryNorm
              1.3761084 = fieldWeight in 4144, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.09375 = fieldNorm(doc=4144)
        0.16 = coord(4/25)
    
  5. Lee, D.L.; Ren, L.: Document ranking on weight-partitioned signature files (1996) 0.19
    0.18573363 = sum of:
      0.18573363 = product of:
        1.5477803 = sum of:
          0.18041001 = weight(abstract_txt:file in 3418) [ClassicSimilarity], result of:
            0.18041001 = score(doc=3418,freq=4.0), product of:
              0.17287144 = queryWeight, product of:
                3.5146282 = boost
                5.5659094 = idf(docFreq=439, maxDocs=42306)
                0.008837059 = queryNorm
              1.043608 = fieldWeight in 3418, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.5659094 = idf(docFreq=439, maxDocs=42306)
                0.09375 = fieldNorm(doc=3418)
          0.12140012 = weight(abstract_txt:files in 3418) [ClassicSimilarity], result of:
            0.12140012 = score(doc=3418,freq=1.0), product of:
              0.22699636 = queryWeight, product of:
                4.502795 = boost
                5.704649 = idf(docFreq=382, maxDocs=42306)
                0.008837059 = queryNorm
              0.53481084 = fieldWeight in 3418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.704649 = idf(docFreq=382, maxDocs=42306)
                0.09375 = fieldNorm(doc=3418)
          1.2459701 = weight(abstract_txt:signature in 3418) [ClassicSimilarity], result of:
            1.2459701 = score(doc=3418,freq=5.0), product of:
              0.7013432 = queryWeight, product of:
                9.364877 = boost
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.008837059 = queryNorm
              1.7765484 = fieldWeight in 3418, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.47463 = idf(docFreq=23, maxDocs=42306)
                0.09375 = fieldNorm(doc=3418)
        0.12 = coord(3/25)
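
The score breakdowns above follow the arithmetic of Lucene's ClassicSimilarity (TF-IDF). As a hedged sketch, the helper functions below reproduce that arithmetic from the values printed in the listing; the function names are hypothetical, and `field_norm` is taken directly from the listing rather than recomputed, since Lucene stores it in a lossy encoded form.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """ClassicSimilarity inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, boost: float, idf_value: float,
               query_norm: float, field_norm: float) -> float:
    """Score of one query term in one document: queryWeight * fieldWeight."""
    query_weight = boost * idf_value * query_norm
    field_weight = math.sqrt(freq) * idf_value * field_norm  # tf = sqrt(freq)
    return query_weight * field_weight


def document_score(term_scores, matched: int, total: int) -> float:
    """Sum of the term scores, scaled by the coordination factor matched/total."""
    return sum(term_scores) * (matched / total)


# Term "signature" in doc 43, using the values from the first listing.
signature_43 = term_score(freq=5.0, boost=9.364877, idf_value=8.47463,
                          query_norm=0.008837059, field_norm=0.078125)
print(round(signature_43, 4))  # 1.0383

# Final score of doc 43: the listed sum of term scores times coord(7/25).
print(round(document_score([1.6109252], matched=7, total=25), 4))  # 0.4511
```

Read this way, the ranking is dominated by the rare term "signature" (docFreq=23), whose high idf and boost outweigh all other matched terms combined, while the coord factor penalizes documents that match only a few of the 25 query terms.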