Document (#26677)

Author
Heinz, S.
Zobel, J.
Title
Efficient single-pass index construction for text databases
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.8, pp. 713-729
Year
2003
Abstract
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
Theme
Retrieval algorithms
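
The abstract describes a single-pass inversion that works within limited memory and does not keep the complete vocabulary in main memory. The following is a minimal sketch of that general idea, not the authors' implementation: postings are collected in a per-run dictionary, each run is sorted and written out when a budget is reached, the dictionary is then discarded, and the sorted runs are merged at the end. The function name single_pass_index, the max_postings budget, whitespace tokenization, and the use of in-memory lists in place of disk files are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def single_pass_index(docs, max_postings=4):
    """Build an inverted index from (doc_id, text) pairs in one pass.

    max_postings is a hypothetical stand-in for a memory budget.
    """
    runs = []                     # stand-in for sorted runs written to disk
    postings = defaultdict(list)  # per-run dictionary: term -> [doc_id, ...]
    count = 0

    def flush():
        nonlocal postings, count
        if postings:
            # Terms are sorted only when a run is written out.
            runs.append(sorted(postings.items()))
            # Start over with an empty dictionary, so the full vocabulary
            # of the collection is never held in memory at once.
            postings = defaultdict(list)
            count = 0

    for doc_id, text in docs:
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
            count += 1
        if count >= max_postings:
            flush()
    flush()

    # Merge the sorted runs; entries for the same term become adjacent.
    merged = heapq.merge(*runs, key=itemgetter(0))
    index = {}
    for term, group in groupby(merged, key=itemgetter(0)):
        index[term] = [doc_id for _, ids in group for doc_id in ids]
    return index


docs = [
    (1, "inverted index construction"),
    (2, "single pass index"),
    (3, "inverted index merge"),
]
index = single_pass_index(docs, max_postings=4)
# index["index"] == [1, 2, 3]
```

With this budget the first two documents form one run and the third forms a second run, so the final merge has to combine the postings of "index" and "inverted" from both runs.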

Similar documents (author)

  1. Heinz, S.: Realisierung und Evaluierung eines virtuellen Bibliotheksregals für die Informationswissenschaft an der Universitätsbibliothek Hildesheim (2003) 2.07
    2.072534 = sum of:
      2.072534 = product of:
        4.145068 = sum of:
          4.145068 = weight(author_txt:heinz in 980) [ClassicSimilarity], result of:
            4.145068 = score(doc=980,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.0753908 = queryNorm
              5.8620114 = fieldWeight in 980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.625 = fieldNorm(doc=980)
        0.5 = coord(1/2)
    
  2. Heinz, M.: Bemerkungen zur Entwicklung der Internationalität der Forschung : Bibliometrische Untersuchungen am SCI (2006) 2.07
    2.072534 = sum of:
      2.072534 = product of:
        4.145068 = sum of:
          4.145068 = weight(author_txt:heinz in 1108) [ClassicSimilarity], result of:
            4.145068 = score(doc=1108,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.0753908 = queryNorm
              5.8620114 = fieldWeight in 1108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.625 = fieldNorm(doc=1108)
        0.5 = coord(1/2)
    
  3. Heinz, A.: ¬Die Lösung des Leib-Seele-Problems bei John R. Searle (2002) 2.07
    2.072534 = sum of:
      2.072534 = product of:
        4.145068 = sum of:
          4.145068 = weight(author_txt:heinz in 585) [ClassicSimilarity], result of:
            4.145068 = score(doc=585,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.0753908 = queryNorm
              5.8620114 = fieldWeight in 585, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.625 = fieldNorm(doc=585)
        0.5 = coord(1/2)
    
  4. Großmann, R.; Heinz, M.: RAK-WB als Hypertext (1994) 1.66
    1.6580272 = sum of:
      1.6580272 = product of:
        3.3160543 = sum of:
          3.3160543 = weight(author_txt:heinz in 772) [ClassicSimilarity], result of:
            3.3160543 = score(doc=772,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.0753908 = queryNorm
              4.689609 = fieldWeight in 772, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.5 = fieldNorm(doc=772)
        0.5 = coord(1/2)
    
  5. Heinz, M.; Voigt, H.: Inhaltliche und formale Unzulänglichkeiten bei CD-ROMs : eine Gewichtung (1993) 1.66
    1.6580272 = sum of:
      1.6580272 = product of:
        3.3160543 = sum of:
          3.3160543 = weight(author_txt:heinz in 419) [ClassicSimilarity], result of:
            3.3160543 = score(doc=419,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.0753908 = queryNorm
              4.689609 = fieldWeight in 419, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.379218 = idf(docFreq=9, maxDocs=43556)
                0.5 = fieldNorm(doc=419)
        0.5 = coord(1/2)
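
The score trees above can be reproduced by hand. The sketch below assumes the classic TF-IDF form that matches the listed numbers: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final multiplication by coord. The function names are illustrative; queryNorm and fieldNorm are copied from the listing rather than derived.

```python
import math


def idf(doc_freq, max_docs):
    # Form chosen because it reproduces the idf values shown in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm                    # queryWeight
    field_weight = math.sqrt(freq) * term_idf * field_norm  # fieldWeight
    return query_weight * field_weight


QUERY_NORM = 0.0753908  # copied from the listing
COORD = 0.5             # coord(1/2): one of two query clauses matched

# Entry 1 (doc=980, fieldNorm=0.625) and entry 4 (doc=772, fieldNorm=0.5).
score_980 = COORD * term_score(1.0, 9, 43556, QUERY_NORM, 0.625)
score_772 = COORD * term_score(1.0, 9, 43556, QUERY_NORM, 0.5)
print(round(score_980, 4), round(score_772, 4))
```

Entries 1 to 3 and entries 4 to 5 differ only in fieldNorm (0.625 versus 0.5), which is why the two groups score about 2.07 and 1.66 respectively.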
    

Similar documents (content)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.20
    0.19855775 = sum of:
      0.19855775 = product of:
        0.7091348 = sum of:
          0.06254308 = weight(abstract_txt:memory in 2714) [ClassicSimilarity], result of:
            0.06254308 = score(doc=2714,freq=1.0), product of:
              0.12117396 = queryWeight, product of:
                1.0799941 = boost
                6.606629 = idf(docFreq=159, maxDocs=43556)
                0.016982751 = queryNorm
              0.5161429 = fieldWeight in 2714, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.606629 = idf(docFreq=159, maxDocs=43556)
                0.078125 = fieldNorm(doc=2714)
          0.016364193 = weight(abstract_txt:data in 2714) [ClassicSimilarity], result of:
            0.016364193 = score(doc=2714,freq=1.0), product of:
              0.062454376 = queryWeight, product of:
                1.0965114 = boost
                3.3538349 = idf(docFreq=4137, maxDocs=43556)
                0.016982751 = queryNorm
              0.26201835 = fieldWeight in 2714, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3538349 = idf(docFreq=4137, maxDocs=43556)
                0.078125 = fieldNorm(doc=2714)
          0.028803293 = weight(abstract_txt:text in 2714) [ClassicSimilarity], result of:
            0.028803293 = score(doc=2714,freq=1.0), product of:
              0.09104607 = queryWeight, product of:
                1.3239218 = boost
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.016982751 = queryNorm
              0.31635952 = fieldWeight in 2714, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.078125 = fieldNorm(doc=2714)
          0.1859158 = weight(abstract_txt:temporary in 2714) [ClassicSimilarity], result of:
            0.1859158 = score(doc=2714,freq=2.0), product of:
              0.19883402 = queryWeight, product of:
                1.3834455 = boost
                8.462927 = idf(docFreq=24, maxDocs=43556)
                0.016982751 = queryNorm
              0.9350301 = fieldWeight in 2714, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.462927 = idf(docFreq=24, maxDocs=43556)
                0.078125 = fieldNorm(doc=2714)
          0.0667851 = weight(abstract_txt:large in 2714) [ClassicSimilarity], result of:
            0.0667851 = score(doc=2714,freq=3.0), product of:
              0.11058913 = queryWeight, product of:
                1.4591097 = boost
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.016982751 = queryNorm
              0.6039029 = fieldWeight in 2714, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.078125 = fieldNorm(doc=2714)
          0.07143626 = weight(abstract_txt:construction in 2714) [ClassicSimilarity], result of:
            0.07143626 = score(doc=2714,freq=1.0), product of:
              0.16681905 = queryWeight, product of:
                1.7920682 = boost
                5.4812937 = idf(docFreq=492, maxDocs=43556)
                0.016982751 = queryNorm
              0.42822605 = fieldWeight in 2714, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4812937 = idf(docFreq=492, maxDocs=43556)
                0.078125 = fieldNorm(doc=2714)
          0.27728707 = weight(abstract_txt:inverted in 2714) [ClassicSimilarity], result of:
            0.27728707 = score(doc=2714,freq=2.0), product of:
              0.32702145 = queryWeight, product of:
                2.5091107 = boost
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.016982751 = queryNorm
              0.8479171 = fieldWeight in 2714, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.078125 = fieldNorm(doc=2714)
        0.28 = coord(7/25)
    
  2. Mukhopadhyay, S.; Peng, S.; Raje, R.; Mostafa, J.; Palakal, M.: Distributed multi-agent information filtering : a comparative study (2005) 0.16
    0.15744905 = sum of:
      0.15744905 = product of:
        0.562318 = sum of:
          0.062017 = weight(abstract_txt:speed in 4557) [ClassicSimilarity], result of:
            0.062017 = score(doc=4557,freq=1.0), product of:
              0.12049352 = queryWeight, product of:
                1.0769575 = boost
                6.5880527 = idf(docFreq=162, maxDocs=43556)
                0.016982751 = queryNorm
              0.5146916 = fieldWeight in 4557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5880527 = idf(docFreq=162, maxDocs=43556)
                0.078125 = fieldNorm(doc=4557)
          0.016364193 = weight(abstract_txt:data in 4557) [ClassicSimilarity], result of:
            0.016364193 = score(doc=4557,freq=1.0), product of:
              0.062454376 = queryWeight, product of:
                1.0965114 = boost
                3.3538349 = idf(docFreq=4137, maxDocs=43556)
                0.016982751 = queryNorm
              0.26201835 = fieldWeight in 4557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3538349 = idf(docFreq=4137, maxDocs=43556)
                0.078125 = fieldNorm(doc=4557)
          0.099462174 = weight(abstract_txt:drawbacks in 4557) [ClassicSimilarity], result of:
            0.099462174 = score(doc=4557,freq=1.0), product of:
              0.16509293 = queryWeight, product of:
                1.2606106 = boost
                7.7115107 = idf(docFreq=52, maxDocs=43556)
                0.016982751 = queryNorm
              0.60246176 = fieldWeight in 4557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7115107 = idf(docFreq=52, maxDocs=43556)
                0.078125 = fieldNorm(doc=4557)
          0.07711679 = weight(abstract_txt:large in 4557) [ClassicSimilarity], result of:
            0.07711679 = score(doc=4557,freq=4.0), product of:
              0.11058913 = queryWeight, product of:
                1.4591097 = boost
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.016982751 = queryNorm
              0.697327 = fieldWeight in 4557, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.078125 = fieldNorm(doc=4557)
          0.0842056 = weight(abstract_txt:efficient in 4557) [ClassicSimilarity], result of:
            0.0842056 = score(doc=4557,freq=1.0), product of:
              0.1861489 = queryWeight, product of:
                1.8930494 = boost
                5.7901587 = idf(docFreq=361, maxDocs=43556)
                0.016982751 = queryNorm
              0.45235616 = fieldWeight in 4557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7901587 = idf(docFreq=361, maxDocs=43556)
                0.078125 = fieldNorm(doc=4557)
          0.09117458 = weight(abstract_txt:approaches in 4557) [ClassicSimilarity], result of:
            0.09117458 = score(doc=4557,freq=2.0), product of:
              0.17833479 = queryWeight, product of:
                2.269318 = boost
                4.627353 = idf(docFreq=1157, maxDocs=43556)
                0.016982751 = queryNorm
              0.51125515 = fieldWeight in 4557, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.627353 = idf(docFreq=1157, maxDocs=43556)
                0.078125 = fieldNorm(doc=4557)
          0.13197772 = weight(abstract_txt:single in 4557) [ClassicSimilarity], result of:
            0.13197772 = score(doc=4557,freq=2.0), product of:
              0.22820264 = queryWeight, product of:
                2.5670698 = boost
                5.234497 = idf(docFreq=630, maxDocs=43556)
                0.016982751 = queryNorm
              0.57833564 = fieldWeight in 4557, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.234497 = idf(docFreq=630, maxDocs=43556)
                0.078125 = fieldNorm(doc=4557)
        0.28 = coord(7/25)
    
  3. MacFarlane, A.; McCann, J.A.; Robertson, S.E.: Parallel methods for the generation of partitioned inverted files (2005) 0.12
    0.12140151 = sum of:
      0.12140151 = product of:
        0.50583965 = sum of:
          0.07016423 = weight(abstract_txt:speed in 1774) [ClassicSimilarity], result of:
            0.07016423 = score(doc=1774,freq=2.0), product of:
              0.12049352 = queryWeight, product of:
                1.0769575 = boost
                6.5880527 = idf(docFreq=162, maxDocs=43556)
                0.016982751 = queryNorm
              0.5823071 = fieldWeight in 1774, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5880527 = idf(docFreq=162, maxDocs=43556)
                0.0625 = fieldNorm(doc=1774)
          0.013091354 = weight(abstract_txt:data in 1774) [ClassicSimilarity], result of:
            0.013091354 = score(doc=1774,freq=1.0), product of:
              0.062454376 = queryWeight, product of:
                1.0965114 = boost
                3.3538349 = idf(docFreq=4137, maxDocs=43556)
                0.016982751 = queryNorm
              0.20961468 = fieldWeight in 1774, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3538349 = idf(docFreq=4137, maxDocs=43556)
                0.0625 = fieldNorm(doc=1774)
          0.03991101 = weight(abstract_txt:text in 1774) [ClassicSimilarity], result of:
            0.03991101 = score(doc=1774,freq=3.0), product of:
              0.09104607 = queryWeight, product of:
                1.3239218 = boost
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.016982751 = queryNorm
              0.4383606 = fieldWeight in 1774, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.0625 = fieldNorm(doc=1774)
          0.043623846 = weight(abstract_txt:large in 1774) [ClassicSimilarity], result of:
            0.043623846 = score(doc=1774,freq=2.0), product of:
              0.11058913 = queryWeight, product of:
                1.4591097 = boost
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.016982751 = queryNorm
              0.39446774 = fieldWeight in 1774, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.0625 = fieldNorm(doc=1774)
          0.06736448 = weight(abstract_txt:efficient in 1774) [ClassicSimilarity], result of:
            0.06736448 = score(doc=1774,freq=1.0), product of:
              0.1861489 = queryWeight, product of:
                1.8930494 = boost
                5.7901587 = idf(docFreq=361, maxDocs=43556)
                0.016982751 = queryNorm
              0.36188492 = fieldWeight in 1774, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7901587 = idf(docFreq=361, maxDocs=43556)
                0.0625 = fieldNorm(doc=1774)
          0.27168474 = weight(abstract_txt:inverted in 1774) [ClassicSimilarity], result of:
            0.27168474 = score(doc=1774,freq=3.0), product of:
              0.32702145 = queryWeight, product of:
                2.5091107 = boost
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.016982751 = queryNorm
              0.8307857 = fieldWeight in 1774, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.0625 = fieldNorm(doc=1774)
        0.24 = coord(6/25)
    
  4. Uratani, N.; Takeda, M.: ¬A fast string-searching algorithm for multiple patterns (1993) 0.11
    0.11371406 = sum of:
      0.11371406 = product of:
        0.7107129 = sum of:
          0.048880804 = weight(abstract_txt:text in 6272) [ClassicSimilarity], result of:
            0.048880804 = score(doc=6272,freq=2.0), product of:
              0.09104607 = queryWeight, product of:
                1.3239218 = boost
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.016982751 = queryNorm
              0.5368799 = fieldWeight in 6272, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.09375 = fieldNorm(doc=6272)
          0.10104672 = weight(abstract_txt:efficient in 6272) [ClassicSimilarity], result of:
            0.10104672 = score(doc=6272,freq=1.0), product of:
              0.1861489 = queryWeight, product of:
                1.8930494 = boost
                5.7901587 = idf(docFreq=361, maxDocs=43556)
                0.016982751 = queryNorm
              0.54282737 = fieldWeight in 6272, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7901587 = idf(docFreq=361, maxDocs=43556)
                0.09375 = fieldNorm(doc=6272)
          0.111986816 = weight(abstract_txt:single in 6272) [ClassicSimilarity], result of:
            0.111986816 = score(doc=6272,freq=1.0), product of:
              0.22820264 = queryWeight, product of:
                2.5670698 = boost
                5.234497 = idf(docFreq=630, maxDocs=43556)
                0.016982751 = queryNorm
              0.4907341 = fieldWeight in 6272, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.234497 = idf(docFreq=630, maxDocs=43556)
                0.09375 = fieldNorm(doc=6272)
          0.4487986 = weight(abstract_txt:pass in 6272) [ClassicSimilarity], result of:
            0.4487986 = score(doc=6272,freq=1.0), product of:
              0.575763 = queryWeight, product of:
                4.077549 = boost
                8.314507 = idf(docFreq=28, maxDocs=43556)
                0.016982751 = queryNorm
              0.779485 = fieldWeight in 6272, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.314507 = idf(docFreq=28, maxDocs=43556)
                0.09375 = fieldNorm(doc=6272)
        0.16 = coord(4/25)
    
  5. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 0.10
    0.10200337 = sum of:
      0.10200337 = product of:
        0.51001686 = sum of:
          0.039719623 = weight(abstract_txt:indexed in 2007) [ClassicSimilarity], result of:
            0.039719623 = score(doc=2007,freq=1.0), product of:
              0.10388828 = queryWeight, product of:
                6.1172824 = idf(docFreq=260, maxDocs=43556)
                0.016982751 = queryNorm
              0.38233015 = fieldWeight in 2007, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1172824 = idf(docFreq=260, maxDocs=43556)
                0.0625 = fieldNorm(doc=2007)
          0.032587204 = weight(abstract_txt:text in 2007) [ClassicSimilarity], result of:
            0.032587204 = score(doc=2007,freq=2.0), product of:
              0.09104607 = queryWeight, product of:
                1.3239218 = boost
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.016982751 = queryNorm
              0.35791993 = fieldWeight in 2007, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.0625 = fieldNorm(doc=2007)
          0.030846717 = weight(abstract_txt:large in 2007) [ClassicSimilarity], result of:
            0.030846717 = score(doc=2007,freq=1.0), product of:
              0.11058913 = queryWeight, product of:
                1.4591097 = boost
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.016982751 = queryNorm
              0.2789308 = fieldWeight in 2007, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.462893 = idf(docFreq=1364, maxDocs=43556)
                0.0625 = fieldNorm(doc=2007)
          0.09314883 = weight(abstract_txt:cost in 2007) [ClassicSimilarity], result of:
            0.09314883 = score(doc=2007,freq=2.0), product of:
              0.18337837 = queryWeight, product of:
                1.8789091 = boost
                5.7469087 = idf(docFreq=377, maxDocs=43556)
                0.016982751 = queryNorm
              0.5079597 = fieldWeight in 2007, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7469087 = idf(docFreq=377, maxDocs=43556)
                0.0625 = fieldNorm(doc=2007)
          0.3137145 = weight(abstract_txt:inverted in 2007) [ClassicSimilarity], result of:
            0.3137145 = score(doc=2007,freq=4.0), product of:
              0.32702145 = queryWeight, product of:
                2.5091107 = boost
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.016982751 = queryNorm
              0.9593087 = fieldWeight in 2007, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.6744695 = idf(docFreq=54, maxDocs=43556)
                0.0625 = fieldNorm(doc=2007)
        0.2 = coord(5/25)
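
For the content-based matches, each term also carries a boost, the per-term weights are summed, and the sum is scaled by coord (matched terms over the 25 query terms). The sketch below recomputes entry 5 (doc=2007) under the same assumed formulas as the listing: tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The tuple layout and names are illustrative; boosts, frequencies, document frequencies, queryNorm and fieldNorm are copied from the listing.

```python
import math

MAX_DOCS = 43556
QUERY_NORM = 0.016982751  # copied from the listing
FIELD_NORM = 0.0625       # fieldNorm(doc=2007)

# (term, boost, freq, docFreq); "indexed" shows no boost line, so 1.0 is assumed.
matched_terms = [
    ("indexed", 1.0, 1.0, 260),
    ("text", 1.3239218, 2.0, 2063),
    ("large", 1.4591097, 1.0, 1364),
    ("cost", 1.8789091, 2.0, 377),
    ("inverted", 2.5091107, 4.0, 54),
]


def term_weight(boost, freq, doc_freq):
    term_idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM
    return query_weight * field_weight


weights = {t: term_weight(b, f, df) for t, b, f, df in matched_terms}
coord = len(matched_terms) / 25  # coord(5/25)
total = coord * sum(weights.values())
print(round(total, 4))
```

The rare, heavily boosted term "inverted" contributes more than half of the summed weight, while the low coord factor explains why the final scores in this section stay far below those of the author matches.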