Document (#33030)

Author
Carterette, B.
Can, F.
Title
Comparing inverted files and signature files for searching a large lexicon
Source
Information processing and management. 41(2005) no.3, pp.613-634
Year
2005
Abstract
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for answering partially specified queries against a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
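
The indexing scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The class name, the `$` boundary marker, the CRC32 hash, the bigram size, the `*` wildcard syntax, and the use of Python integers as uncompressed bit slices are all assumptions made for illustration. Each n-gram sets exactly one bit in a term's signature, the slices for the query's n-grams are ANDed together, and the surviving candidates are checked against the pattern to discard false matches caused by hash collisions.

```python
import zlib
from fnmatch import fnmatchcase


def ngrams(text, n=2):
    """Return the set of character n-grams in text (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def bit_for(gram, width):
    """Map an n-gram to a single signature bit (assumed hash: CRC32 modulo width)."""
    return zlib.crc32(gram.encode("utf-8")) % width


class BitSlicedSignatureFile:
    """Hypothetical in-memory index: one bitmap (slice) per signature bit."""

    def __init__(self, lexicon, width=64, n=2):
        self.terms = list(lexicon)
        self.width = width
        self.n = n
        # slices[b] has bit i set when term i's signature has bit b set.
        self.slices = [0] * width
        for term_id, term in enumerate(self.terms):
            # "$" marks the term boundaries (an assumption of this sketch).
            for gram in ngrams(f"${term}$", n):
                self.slices[bit_for(gram, width)] |= 1 << term_id

    def search(self, pattern):
        """Answer a partially specified query in which '*' matches any substring."""
        candidates = (1 << len(self.terms)) - 1  # start with every term
        for fragment in f"${pattern}$".split("*"):
            for gram in ngrams(fragment, self.n):
                candidates &= self.slices[bit_for(gram, self.width)]
        # Signatures can yield false matches, so verify each candidate.
        return [
            term
            for term_id, term in enumerate(self.terms)
            if (candidates >> term_id) & 1 and fnmatchcase(term, pattern)
        ]


index = BitSlicedSignatureFile(["label", "labor", "labour", "harbor"], width=16)
print(index.search("lab*r"))
```

A narrower signature width saves memory but lets more n-grams share a bit, so more candidates reach the verification step; this is the size and speed trade-off the abstract refers to.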

Similar documents (content)

  1. Kelledy, F.; Smeaton, A.F.: Signature files and beyond (1996) 0.45
    0.4540266 = sum of:
      0.4540266 = product of:
        1.6215236 = sum of:
          0.0128140915 = weight(abstract_txt:than in 6973) [ClassicSimilarity], result of:
            0.0128140915 = score(doc=6973,freq=1.0), product of:
              0.042109556 = queryWeight, product of:
                1.2259201 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.008818636 = queryNorm
              0.30430365 = fieldWeight in 6973, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.078125 = fieldNorm(doc=6973)
          0.025585694 = weight(abstract_txt:searching in 6973) [ClassicSimilarity], result of:
            0.025585694 = score(doc=6973,freq=1.0), product of:
              0.07643355 = queryWeight, product of:
                2.0228302 = boost
                4.284727 = idf(docFreq=1655, maxDocs=44218)
                0.008818636 = queryNorm
              0.3347443 = fieldWeight in 6973, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.284727 = idf(docFreq=1655, maxDocs=44218)
                0.078125 = fieldNorm(doc=6973)
          0.028741172 = weight(abstract_txt:large in 6973) [ClassicSimilarity], result of:
            0.028741172 = score(doc=6973,freq=1.0), product of:
              0.08259533 = queryWeight, product of:
                2.1027865 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.008818636 = queryNorm
              0.34797573 = fieldWeight in 6973, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.078125 = fieldNorm(doc=6973)
          0.10648822 = weight(abstract_txt:file in 6973) [ClassicSimilarity], result of:
            0.10648822 = score(doc=6973,freq=2.0), product of:
              0.17276528 = queryWeight, product of:
                3.5116808 = boost
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.008818636 = queryNorm
              0.6163751 = fieldWeight in 6973, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.078125 = fieldNorm(doc=6973)
          0.20296289 = weight(abstract_txt:files in 6973) [ClassicSimilarity], result of:
            0.20296289 = score(doc=6973,freq=4.0), product of:
              0.22707006 = queryWeight, product of:
                4.501132 = boost
                5.720536 = idf(docFreq=393, maxDocs=44218)
                0.008818636 = queryNorm
              0.89383376 = fieldWeight in 6973, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.720536 = idf(docFreq=393, maxDocs=44218)
                0.078125 = fieldNorm(doc=6973)
          0.19580029 = weight(abstract_txt:inverted in 6973) [ClassicSimilarity], result of:
            0.19580029 = score(doc=6973,freq=1.0), product of:
              0.3266939 = queryWeight, product of:
                4.829001 = boost
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.008818636 = queryNorm
              0.5993387 = fieldWeight in 6973, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.078125 = fieldNorm(doc=6973)
          1.0491313 = weight(abstract_txt:signature in 6973) [ClassicSimilarity], result of:
            1.0491313 = score(doc=6973,freq=5.0), product of:
              0.70497656 = queryWeight, product of:
                9.384111 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.008818636 = queryNorm
              1.488179 = fieldWeight in 6973, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.078125 = fieldNorm(doc=6973)
        0.28 = coord(7/25)
    
  2. Lam, W.; Wong, K.-F.; Wong, C.-Y.: Chinese document indexing based on new partitioned signature file : model and evaluation (2001) 0.38
    0.37523022 = sum of:
      0.37523022 = product of:
        1.3401079 = sum of:
          0.032638256 = weight(abstract_txt:faster in 303) [ClassicSimilarity], result of:
            0.032638256 = score(doc=303,freq=1.0), product of:
              0.072333045 = queryWeight, product of:
                1.1361226 = boost
                7.2195506 = idf(docFreq=87, maxDocs=44218)
                0.008818636 = queryNorm
              0.4512219 = fieldWeight in 303, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2195506 = idf(docFreq=87, maxDocs=44218)
                0.0625 = fieldNorm(doc=303)
          0.010251273 = weight(abstract_txt:than in 303) [ClassicSimilarity], result of:
            0.010251273 = score(doc=303,freq=1.0), product of:
              0.042109556 = queryWeight, product of:
                1.2259201 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.008818636 = queryNorm
              0.24344292 = fieldWeight in 303, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.0625 = fieldNorm(doc=303)
          0.03163508 = weight(abstract_txt:method in 303) [ClassicSimilarity], result of:
            0.03163508 = score(doc=303,freq=4.0), product of:
              0.056228273 = queryWeight, product of:
                1.4166063 = boost
                4.50095 = idf(docFreq=1333, maxDocs=44218)
                0.008818636 = queryNorm
              0.56261873 = fieldWeight in 303, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.50095 = idf(docFreq=1333, maxDocs=44218)
                0.0625 = fieldNorm(doc=303)
          0.022992937 = weight(abstract_txt:large in 303) [ClassicSimilarity], result of:
            0.022992937 = score(doc=303,freq=1.0), product of:
              0.08259533 = queryWeight, product of:
                2.1027865 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.008818636 = queryNorm
              0.27838057 = fieldWeight in 303, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.0625 = fieldNorm(doc=303)
          0.13469812 = weight(abstract_txt:file in 303) [ClassicSimilarity], result of:
            0.13469812 = score(doc=303,freq=5.0), product of:
              0.17276528 = queryWeight, product of:
                3.5116808 = boost
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.008818636 = queryNorm
              0.7796596 = fieldWeight in 303, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.0625 = fieldNorm(doc=303)
          0.11481316 = weight(abstract_txt:files in 303) [ClassicSimilarity], result of:
            0.11481316 = score(doc=303,freq=2.0), product of:
              0.22707006 = queryWeight, product of:
                4.501132 = boost
                5.720536 = idf(docFreq=393, maxDocs=44218)
                0.008818636 = queryNorm
              0.50562876 = fieldWeight in 303, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.720536 = idf(docFreq=393, maxDocs=44218)
                0.0625 = fieldNorm(doc=303)
          0.99307907 = weight(abstract_txt:signature in 303) [ClassicSimilarity], result of:
            0.99307907 = score(doc=303,freq=7.0), product of:
              0.70497656 = queryWeight, product of:
                9.384111 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.008818636 = queryNorm
              1.4086696 = fieldWeight in 303, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.0625 = fieldNorm(doc=303)
        0.28 = coord(7/25)
    
  3. Robertson, A.M.; Willett, P.: Applications of n-grams in textual information systems (1998) 0.28
    0.27713108 = sum of:
      0.27713108 = product of:
        1.3856554 = sum of:
          0.07585567 = weight(abstract_txt:gram in 4715) [ClassicSimilarity], result of:
            0.07585567 = score(doc=4715,freq=1.0), product of:
              0.087394774 = queryWeight, product of:
                1.2488191 = boost
                7.935687 = idf(docFreq=42, maxDocs=44218)
                0.008818636 = queryNorm
              0.86796576 = fieldWeight in 4715, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.935687 = idf(docFreq=42, maxDocs=44218)
                0.109375 = fieldNorm(doc=4715)
          0.23674525 = weight(abstract_txt:grams in 4715) [ClassicSimilarity], result of:
            0.23674525 = score(doc=4715,freq=2.0), product of:
              0.1866441 = queryWeight, product of:
                2.5809462 = boost
                8.200379 = idf(docFreq=32, maxDocs=44218)
                0.008818636 = queryNorm
              1.2684314 = fieldWeight in 4715, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.200379 = idf(docFreq=32, maxDocs=44218)
                0.109375 = fieldNorm(doc=4715)
          0.14207403 = weight(abstract_txt:files in 4715) [ClassicSimilarity], result of:
            0.14207403 = score(doc=4715,freq=1.0), product of:
              0.22707006 = queryWeight, product of:
                4.501132 = boost
                5.720536 = idf(docFreq=393, maxDocs=44218)
                0.008818636 = queryNorm
              0.62568367 = fieldWeight in 4715, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.720536 = idf(docFreq=393, maxDocs=44218)
                0.109375 = fieldNorm(doc=4715)
          0.27412042 = weight(abstract_txt:inverted in 4715) [ClassicSimilarity], result of:
            0.27412042 = score(doc=4715,freq=1.0), product of:
              0.3266939 = queryWeight, product of:
                4.829001 = boost
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.008818636 = queryNorm
              0.8390742 = fieldWeight in 4715, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.109375 = fieldNorm(doc=4715)
          0.65686005 = weight(abstract_txt:signature in 4715) [ClassicSimilarity], result of:
            0.65686005 = score(doc=4715,freq=1.0), product of:
              0.70497656 = queryWeight, product of:
                9.384111 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.008818636 = queryNorm
              0.9317474 = fieldWeight in 4715, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.109375 = fieldNorm(doc=4715)
        0.2 = coord(5/25)
    
  4. Lee, D.L.: Massive parallelism on the hybrid text-retrieval machine (1995) 0.19
    0.1939847 = sum of:
      0.1939847 = product of:
        1.2124044 = sum of:
          0.034489404 = weight(abstract_txt:large in 4075) [ClassicSimilarity], result of:
            0.034489404 = score(doc=4075,freq=1.0), product of:
              0.08259533 = queryWeight, product of:
                2.1027865 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.008818636 = queryNorm
              0.41757086 = fieldWeight in 4075, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.09375 = fieldNorm(doc=4075)
          0.11237244 = weight(abstract_txt:memory in 4075) [ClassicSimilarity], result of:
            0.11237244 = score(doc=4075,freq=1.0), product of:
              0.18152575 = queryWeight, product of:
                3.117357 = boost
                6.603137 = idf(docFreq=162, maxDocs=44218)
                0.008818636 = queryNorm
              0.61904407 = fieldWeight in 4075, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.603137 = idf(docFreq=162, maxDocs=44218)
                0.09375 = fieldNorm(doc=4075)
          0.09035824 = weight(abstract_txt:file in 4075) [ClassicSimilarity], result of:
            0.09035824 = score(doc=4075,freq=1.0), product of:
              0.17276528 = queryWeight, product of:
                3.5116808 = boost
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.008818636 = queryNorm
              0.52301157 = fieldWeight in 4075, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.09375 = fieldNorm(doc=4075)
          0.97518426 = weight(abstract_txt:signature in 4075) [ClassicSimilarity], result of:
            0.97518426 = score(doc=4075,freq=3.0), product of:
              0.70497656 = queryWeight, product of:
                9.384111 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.008818636 = queryNorm
              1.3832861 = fieldWeight in 4075, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.09375 = fieldNorm(doc=4075)
        0.16 = coord(4/25)
    
  5. Lee, D.L.; Ren, L.: Document ranking on weight-partitioned signature files (1996) 0.19
    0.18737419 = sum of:
      0.18737419 = product of:
        1.5614517 = sum of:
          0.18071648 = weight(abstract_txt:file in 2417) [ClassicSimilarity], result of:
            0.18071648 = score(doc=2417,freq=4.0), product of:
              0.17276528 = queryWeight, product of:
                3.5116808 = boost
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.008818636 = queryNorm
              1.0460231 = fieldWeight in 2417, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.09375 = fieldNorm(doc=2417)
          0.12177774 = weight(abstract_txt:files in 2417) [ClassicSimilarity], result of:
            0.12177774 = score(doc=2417,freq=1.0), product of:
              0.22707006 = queryWeight, product of:
                4.501132 = boost
                5.720536 = idf(docFreq=393, maxDocs=44218)
                0.008818636 = queryNorm
              0.5363003 = fieldWeight in 2417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.720536 = idf(docFreq=393, maxDocs=44218)
                0.09375 = fieldNorm(doc=2417)
          1.2589575 = weight(abstract_txt:signature in 2417) [ClassicSimilarity], result of:
            1.2589575 = score(doc=2417,freq=5.0), product of:
              0.70497656 = queryWeight, product of:
                9.384111 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.008818636 = queryNorm
              1.7858148 = fieldWeight in 2417, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.09375 = fieldNorm(doc=2417)
        0.12 = coord(3/25)
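
The score trees above follow the usual ClassicSimilarity (TF-IDF) decomposition: each matching term contributes queryWeight × fieldWeight, and the sum is scaled by the coord factor. The sketch below recomputes the score of result 5 from the values printed in its tree. The function names are hypothetical, and the idf formula 1 + ln(maxDocs / (docFreq + 1)) is inferred from the listed numbers; it is consistent with them but is stated here as an assumption rather than taken from the record itself.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.008818636  # copied from the trees above


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency in the form the listed values agree with."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one term: queryWeight * fieldWeight."""
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm  # tf = sqrt(freq)
    return query_weight * field_weight


def document_score(terms, field_norm, matched, total):
    """Sum of term scores scaled by coord = matched / total query terms."""
    coord = matched / total
    return coord * sum(
        term_score(freq, doc_freq, boost, field_norm)
        for freq, doc_freq, boost in terms
    )


# (freq, docFreq, boost) for "file", "files", "signature" in doc 2417.
result_5_terms = [
    (4.0, 453, 3.5116808),
    (1.0, 393, 4.501132),
    (5.0, 23, 9.384111),
]
score = document_score(result_5_terms, field_norm=0.09375, matched=3, total=25)
print(round(score, 4))
```

The result agrees with the listed 0.18737419 up to small rounding differences, since the original values appear to be single-precision floats.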