Document (#26680)

Author
Heinz, S.
Zobel, J.
Title
Efficient single-pass index construction for text databases
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.8, p.713-729
Year
2003
Abstract
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
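
The single-pass idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_postings` budget (a stand-in for a memory limit), and the use of in-memory lists in place of temporary files on disk are all assumptions made for illustration. Each run carries its own small vocabulary, so no global vocabulary has to be held in memory while documents are processed.

```python
import heapq
from collections import defaultdict
from itertools import groupby


def build_runs(docs, max_postings):
    """Invert (doc_id, text) pairs into sorted runs in a single pass.

    max_postings is a hypothetical memory budget; a real system would
    write each run to a temporary file instead of keeping it in a list.
    """
    runs, index, count = [], defaultdict(list), 0
    for doc_id, text in docs:
        for term in text.lower().split():
            postings = index[term]
            # Record each document at most once per term.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
                count += 1
        if count >= max_postings:
            # Budget exhausted: emit a run sorted by term and start afresh.
            runs.append(sorted(index.items()))
            index, count = defaultdict(list), 0
    if index:
        runs.append(sorted(index.items()))
    return runs


def merge_runs(runs):
    """Merge term-sorted runs into one inverted index."""
    merged = {}
    stream = heapq.merge(*runs, key=lambda entry: entry[0])
    for term, group in groupby(stream, key=lambda entry: entry[0]):
        merged[term] = [doc for _, postings in group for doc in postings]
    return merged


docs = [(1, "a b"), (2, "b c"), (3, "a c")]
index = merge_runs(build_runs(docs, max_postings=3))
```

Because runs are flushed only at document boundaries and documents arrive in increasing `doc_id` order, concatenating postings during the merge keeps each list sorted.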
Theme
Retrieval algorithms

Similar documents (author)

  1. Heinz, S.: Realisierung und Evaluierung eines virtuellen Bibliotheksregals für die Informationswissenschaft an der Universitätsbibliothek Hildesheim (2003) 2.07
    2.0660996 = sum of:
      2.0660996 = product of:
        4.1321993 = sum of:
          4.1321993 = weight(author_txt:heinz in 983) [ClassicSimilarity], result of:
            4.1321993 = score(doc=983,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.07562559 = queryNorm
              5.843812 = fieldWeight in 983, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.625 = fieldNorm(doc=983)
        0.5 = coord(1/2)
    
  2. Heinz, M.: Bemerkungen zur Entwicklung der Internationalität der Forschung : Bibliometrische Untersuchungen am SCI (2006) 2.07
    2.0660996 = sum of:
      2.0660996 = product of:
        4.1321993 = sum of:
          4.1321993 = weight(author_txt:heinz in 1111) [ClassicSimilarity], result of:
            4.1321993 = score(doc=1111,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.07562559 = queryNorm
              5.843812 = fieldWeight in 1111, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.625 = fieldNorm(doc=1111)
        0.5 = coord(1/2)
    
  3. Heinz, A.: ¬Die Lösung des Leib-Seele-Problems bei John R. Searle (2002) 2.07
    2.0660996 = sum of:
      2.0660996 = product of:
        4.1321993 = sum of:
          4.1321993 = weight(author_txt:heinz in 1218) [ClassicSimilarity], result of:
            4.1321993 = score(doc=1218,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.07562559 = queryNorm
              5.843812 = fieldWeight in 1218, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.625 = fieldNorm(doc=1218)
        0.5 = coord(1/2)
    
  4. Großmann, R.; Heinz, M.: RAK-WB als Hypertext (1994) 1.65
    1.6528798 = sum of:
      1.6528798 = product of:
        3.3057597 = sum of:
          3.3057597 = weight(author_txt:heinz in 775) [ClassicSimilarity], result of:
            3.3057597 = score(doc=775,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.07562559 = queryNorm
              4.67505 = fieldWeight in 775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.5 = fieldNorm(doc=775)
        0.5 = coord(1/2)
    
  5. Heinz, M.; Voigt, H.: Inhaltliche und formale Unzulänglichkeiten bei CD-ROMs : eine Gewichtung (1993) 1.65
    1.6528798 = sum of:
      1.6528798 = product of:
        3.3057597 = sum of:
          3.3057597 = weight(author_txt:heinz in 422) [ClassicSimilarity], result of:
            3.3057597 = score(doc=422,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.07562559 = queryNorm
              4.67505 = fieldWeight in 422, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.3501 = idf(docFreq=9, maxDocs=42306)
                0.5 = fieldNorm(doc=422)
        0.5 = coord(1/2)
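
The author-match scores above all follow the same arithmetic, which can be checked with a short sketch. The function name is hypothetical; the numbers are copied from the explanation trees, and the structure (queryWeight times fieldWeight times coord) is read directly off the listing rather than taken from any library call.

```python
def author_match_score(field_norm, idf=9.3501, query_norm=0.07562559,
                       tf=1.0, coord=0.5):
    """Recompute one author-match score from the values in the listing."""
    query_weight = idf * query_norm        # 0.7071068 in the listing
    field_weight = tf * idf * field_norm   # 5.843812 or 4.67505
    return query_weight * field_weight * coord


# The two fieldNorm values that occur in the author list.
for field_norm in (0.625, 0.5):
    print(field_norm, round(author_match_score(field_norm), 2))
```

The only input that differs between the entries scored 2.07 and those scored 1.65 is fieldNorm (0.625 versus 0.5), which in ClassicSimilarity reflects field length: the entries with two authors have a longer author field and therefore a lower norm.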
    

Similar documents (content)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.20
    0.19876039 = sum of:
      0.19876039 = product of:
        0.70985854 = sum of:
          0.06311651 = weight(abstract_txt:memory in 2717) [ClassicSimilarity], result of:
            0.06311651 = score(doc=2717,freq=1.0), product of:
              0.12163287 = queryWeight, product of:
                1.0896007 = boost
                6.642049 = idf(docFreq=149, maxDocs=42306)
                0.016806664 = queryNorm
              0.51891005 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.642049 = idf(docFreq=149, maxDocs=42306)
                0.078125 = fieldNorm(doc=2717)
          0.016674243 = weight(abstract_txt:data in 2717) [ClassicSimilarity], result of:
            0.016674243 = score(doc=2717,freq=1.0), product of:
              0.063095205 = queryWeight, product of:
                1.1098264 = boost
                3.382671 = idf(docFreq=3904, maxDocs=42306)
                0.016806664 = queryNorm
              0.26427117 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.382671 = idf(docFreq=3904, maxDocs=42306)
                0.078125 = fieldNorm(doc=2717)
          0.028655296 = weight(abstract_txt:text in 2717) [ClassicSimilarity], result of:
            0.028655296 = score(doc=2717,freq=1.0), product of:
              0.09052507 = queryWeight, product of:
                1.3293561 = boost
                4.0517817 = idf(docFreq=1999, maxDocs=42306)
                0.016806664 = queryNorm
              0.31654543 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0517817 = idf(docFreq=1999, maxDocs=42306)
                0.078125 = fieldNorm(doc=2717)
          0.18820918 = weight(abstract_txt:temporary in 2717) [ClassicSimilarity], result of:
            0.18820918 = score(doc=2717,freq=2.0), product of:
              0.20000435 = queryWeight, product of:
                1.3972098 = boost
                8.51719 = idf(docFreq=22, maxDocs=42306)
                0.016806664 = queryNorm
              0.9410254 = fieldWeight in 2717, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.51719 = idf(docFreq=22, maxDocs=42306)
                0.078125 = fieldNorm(doc=2717)
          0.06672521 = weight(abstract_txt:large in 2717) [ClassicSimilarity], result of:
            0.06672521 = score(doc=2717,freq=3.0), product of:
              0.11026858 = queryWeight, product of:
                1.4671779 = boost
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.016806664 = queryNorm
              0.60511535 = fieldWeight in 2717, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.078125 = fieldNorm(doc=2717)
          0.07225921 = weight(abstract_txt:construction in 2717) [ClassicSimilarity], result of:
            0.07225921 = score(doc=2717,freq=1.0), product of:
              0.16771081 = queryWeight, product of:
                1.8094118 = boost
                5.514957 = idf(docFreq=462, maxDocs=42306)
                0.016806664 = queryNorm
              0.43085602 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.514957 = idf(docFreq=462, maxDocs=42306)
                0.078125 = fieldNorm(doc=2717)
          0.27421886 = weight(abstract_txt:inverted in 2717) [ClassicSimilarity], result of:
            0.27421886 = score(doc=2717,freq=2.0), product of:
              0.3238574 = queryWeight, product of:
                2.5143967 = boost
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.016806664 = queryNorm
              0.8467272 = fieldWeight in 2717, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.078125 = fieldNorm(doc=2717)
        0.28 = coord(7/25)
    
  2. Mukhopadhyay, S.; Peng, S.; Raje, R.; Mostafa, J.; Palakal, M.: Distributed multi-agent information filtering : a comparative study (2005) 0.16
    0.15732564 = sum of:
      0.15732564 = product of:
        0.5618773 = sum of:
          0.061120484 = weight(abstract_txt:speed in 4560) [ClassicSimilarity], result of:
            0.061120484 = score(doc=4560,freq=1.0), product of:
              0.119054765 = queryWeight, product of:
                1.0779914 = boost
                6.57128 = idf(docFreq=160, maxDocs=42306)
                0.016806664 = queryNorm
              0.51338124 = fieldWeight in 4560, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.57128 = idf(docFreq=160, maxDocs=42306)
                0.078125 = fieldNorm(doc=4560)
          0.016674243 = weight(abstract_txt:data in 4560) [ClassicSimilarity], result of:
            0.016674243 = score(doc=4560,freq=1.0), product of:
              0.063095205 = queryWeight, product of:
                1.1098264 = boost
                3.382671 = idf(docFreq=3904, maxDocs=42306)
                0.016806664 = queryNorm
              0.26427117 = fieldWeight in 4560, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.382671 = idf(docFreq=3904, maxDocs=42306)
                0.078125 = fieldNorm(doc=4560)
          0.0983904 = weight(abstract_txt:drawbacks in 4560) [ClassicSimilarity], result of:
            0.0983904 = score(doc=4560,freq=1.0), product of:
              0.16352747 = queryWeight, product of:
                1.2633895 = boost
                7.7014403 = idf(docFreq=51, maxDocs=42306)
                0.016806664 = queryNorm
              0.60167503 = fieldWeight in 4560, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7014403 = idf(docFreq=51, maxDocs=42306)
                0.078125 = fieldNorm(doc=4560)
          0.07704763 = weight(abstract_txt:large in 4560) [ClassicSimilarity], result of:
            0.07704763 = score(doc=4560,freq=4.0), product of:
              0.11026858 = queryWeight, product of:
                1.4671779 = boost
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.016806664 = queryNorm
              0.698727 = fieldWeight in 4560, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.078125 = fieldNorm(doc=4560)
          0.08370077 = weight(abstract_txt:efficient in 4560) [ClassicSimilarity], result of:
            0.08370077 = score(doc=4560,freq=1.0), product of:
              0.18497735 = queryWeight, product of:
                1.9002738 = boost
                5.791898 = idf(docFreq=350, maxDocs=42306)
                0.016806664 = queryNorm
              0.452492 = fieldWeight in 4560, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.791898 = idf(docFreq=350, maxDocs=42306)
                0.078125 = fieldNorm(doc=4560)
          0.0933964 = weight(abstract_txt:approaches in 4560) [ClassicSimilarity], result of:
            0.0933964 = score(doc=4560,freq=2.0), product of:
              0.1808032 = queryWeight, product of:
                2.3009415 = boost
                4.6754026 = idf(docFreq=1071, maxDocs=42306)
                0.016806664 = queryNorm
              0.5165639 = fieldWeight in 4560, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6754026 = idf(docFreq=1071, maxDocs=42306)
                0.078125 = fieldNorm(doc=4560)
          0.13154738 = weight(abstract_txt:single in 4560) [ClassicSimilarity], result of:
            0.13154738 = score(doc=4560,freq=2.0), product of:
              0.2271821 = queryWeight, product of:
                2.579227 = boost
                5.2408657 = idf(docFreq=608, maxDocs=42306)
                0.016806664 = queryNorm
              0.57903934 = fieldWeight in 4560, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2408657 = idf(docFreq=608, maxDocs=42306)
                0.078125 = fieldNorm(doc=4560)
        0.28 = coord(7/25)
    
  3. MacFarlane, A.; McCann, J.A.; Robertson, S.E.: Parallel methods for the generation of partitioned inverted files (2005) 0.12
    0.1203406 = sum of:
      0.1203406 = product of:
        0.5014192 = sum of:
          0.069149934 = weight(abstract_txt:speed in 1777) [ClassicSimilarity], result of:
            0.069149934 = score(doc=1777,freq=2.0), product of:
              0.119054765 = queryWeight, product of:
                1.0779914 = boost
                6.57128 = idf(docFreq=160, maxDocs=42306)
                0.016806664 = queryNorm
              0.58082455 = fieldWeight in 1777, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.57128 = idf(docFreq=160, maxDocs=42306)
                0.0625 = fieldNorm(doc=1777)
          0.013339396 = weight(abstract_txt:data in 1777) [ClassicSimilarity], result of:
            0.013339396 = score(doc=1777,freq=1.0), product of:
              0.063095205 = queryWeight, product of:
                1.1098264 = boost
                3.382671 = idf(docFreq=3904, maxDocs=42306)
                0.016806664 = queryNorm
              0.21141694 = fieldWeight in 1777, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.382671 = idf(docFreq=3904, maxDocs=42306)
                0.0625 = fieldNorm(doc=1777)
          0.039705943 = weight(abstract_txt:text in 1777) [ClassicSimilarity], result of:
            0.039705943 = score(doc=1777,freq=3.0), product of:
              0.09052507 = queryWeight, product of:
                1.3293561 = boost
                4.0517817 = idf(docFreq=1999, maxDocs=42306)
                0.016806664 = queryNorm
              0.4386182 = fieldWeight in 1777, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0517817 = idf(docFreq=1999, maxDocs=42306)
                0.0625 = fieldNorm(doc=1777)
          0.043584723 = weight(abstract_txt:large in 1777) [ClassicSimilarity], result of:
            0.043584723 = score(doc=1777,freq=2.0), product of:
              0.11026858 = queryWeight, product of:
                1.4671779 = boost
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.016806664 = queryNorm
              0.39525968 = fieldWeight in 1777, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.0625 = fieldNorm(doc=1777)
          0.06696062 = weight(abstract_txt:efficient in 1777) [ClassicSimilarity], result of:
            0.06696062 = score(doc=1777,freq=1.0), product of:
              0.18497735 = queryWeight, product of:
                1.9002738 = boost
                5.791898 = idf(docFreq=350, maxDocs=42306)
                0.016806664 = queryNorm
              0.3619936 = fieldWeight in 1777, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.791898 = idf(docFreq=350, maxDocs=42306)
                0.0625 = fieldNorm(doc=1777)
          0.26867855 = weight(abstract_txt:inverted in 1777) [ClassicSimilarity], result of:
            0.26867855 = score(doc=1777,freq=3.0), product of:
              0.3238574 = queryWeight, product of:
                2.5143967 = boost
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.016806664 = queryNorm
              0.8296199 = fieldWeight in 1777, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.0625 = fieldNorm(doc=1777)
        0.24 = coord(6/25)
    
  4. Uratani, N.; Takeda, M.: ¬A fast string-searching algorithm for multiple patterns (1993) 0.11
    0.1141188 = sum of:
      0.1141188 = product of:
        0.71324253 = sum of:
          0.048629653 = weight(abstract_txt:text in 6275) [ClassicSimilarity], result of:
            0.048629653 = score(doc=6275,freq=2.0), product of:
              0.09052507 = queryWeight, product of:
                1.3293561 = boost
                4.0517817 = idf(docFreq=1999, maxDocs=42306)
                0.016806664 = queryNorm
              0.53719544 = fieldWeight in 6275, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0517817 = idf(docFreq=1999, maxDocs=42306)
                0.09375 = fieldNorm(doc=6275)
          0.100440934 = weight(abstract_txt:efficient in 6275) [ClassicSimilarity], result of:
            0.100440934 = score(doc=6275,freq=1.0), product of:
              0.18497735 = queryWeight, product of:
                1.9002738 = boost
                5.791898 = idf(docFreq=350, maxDocs=42306)
                0.016806664 = queryNorm
              0.54299045 = fieldWeight in 6275, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.791898 = idf(docFreq=350, maxDocs=42306)
                0.09375 = fieldNorm(doc=6275)
          0.11162165 = weight(abstract_txt:single in 6275) [ClassicSimilarity], result of:
            0.11162165 = score(doc=6275,freq=1.0), product of:
              0.2271821 = queryWeight, product of:
                2.579227 = boost
                5.2408657 = idf(docFreq=608, maxDocs=42306)
                0.016806664 = queryNorm
              0.49133116 = fieldWeight in 6275, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2408657 = idf(docFreq=608, maxDocs=42306)
                0.09375 = fieldNorm(doc=6275)
          0.45255026 = weight(abstract_txt:pass in 6275) [ClassicSimilarity], result of:
            0.45255026 = score(doc=6275,freq=1.0), product of:
              0.5776344 = queryWeight, product of:
                4.112719 = boost
                8.356848 = idf(docFreq=26, maxDocs=42306)
                0.016806664 = queryNorm
              0.7834545 = fieldWeight in 6275, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.356848 = idf(docFreq=26, maxDocs=42306)
                0.09375 = fieldNorm(doc=6275)
        0.16 = coord(4/25)
    
  5. Chang, M.; Poon, C.K.: Efficient phrase querying with common phrase index (2008) 0.10
    0.09759777 = sum of:
      0.09759777 = product of:
        0.48798883 = sum of:
          0.028655296 = weight(abstract_txt:text in 4062) [ClassicSimilarity], result of:
            0.028655296 = score(doc=4062,freq=1.0), product of:
              0.09052507 = queryWeight, product of:
                1.3293561 = boost
                4.0517817 = idf(docFreq=1999, maxDocs=42306)
                0.016806664 = queryNorm
              0.31654543 = fieldWeight in 4062, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0517817 = idf(docFreq=1999, maxDocs=42306)
                0.078125 = fieldNorm(doc=4062)
          0.06672521 = weight(abstract_txt:large in 4062) [ClassicSimilarity], result of:
            0.06672521 = score(doc=4062,freq=3.0), product of:
              0.11026858 = queryWeight, product of:
                1.4671779 = boost
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.016806664 = queryNorm
              0.60511535 = fieldWeight in 4062, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.471853 = idf(docFreq=1313, maxDocs=42306)
                0.078125 = fieldNorm(doc=4062)
          0.11500554 = weight(abstract_txt:cost in 4062) [ClassicSimilarity], result of:
            0.11500554 = score(doc=4062,freq=2.0), product of:
              0.18145464 = queryWeight, product of:
                1.8820924 = boost
                5.736482 = idf(docFreq=370, maxDocs=42306)
                0.016806664 = queryNorm
              0.6337977 = fieldWeight in 4062, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.736482 = idf(docFreq=370, maxDocs=42306)
                0.078125 = fieldNorm(doc=4062)
          0.08370077 = weight(abstract_txt:efficient in 4062) [ClassicSimilarity], result of:
            0.08370077 = score(doc=4062,freq=1.0), product of:
              0.18497735 = queryWeight, product of:
                1.9002738 = boost
                5.791898 = idf(docFreq=350, maxDocs=42306)
                0.016806664 = queryNorm
              0.452492 = fieldWeight in 4062, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.791898 = idf(docFreq=350, maxDocs=42306)
                0.078125 = fieldNorm(doc=4062)
          0.19390203 = weight(abstract_txt:inverted in 4062) [ClassicSimilarity], result of:
            0.19390203 = score(doc=4062,freq=1.0), product of:
              0.3238574 = queryWeight, product of:
                2.5143967 = boost
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.016806664 = queryNorm
              0.5987266 = fieldWeight in 4062, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6637 = idf(docFreq=53, maxDocs=42306)
                0.078125 = fieldNorm(doc=4062)
        0.2 = coord(5/25)
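
The content-similarity scores combine several term weights. The sketch below recomputes the first entry (doc 2717) from its boosts, document frequencies, and term frequencies. The function and constant names are hypothetical. The idf formula, 1 + ln(maxDocs / (docFreq + 1)), and tf = sqrt(freq) are assumptions that agree with every idf and tf value shown in the listing.

```python
import math

MAX_DOCS = 42306
QUERY_NORM = 0.016806664


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, consistent with the listed idf values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_weight(boost, doc_freq, freq, field_norm):
    """queryWeight * fieldWeight for a single matching term."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def content_score(terms, field_norm, query_terms=25):
    """Sum the term weights, then scale by coord = matched / query terms."""
    total = sum(term_weight(b, df, f, field_norm) for b, df, f in terms)
    return total * len(terms) / query_terms


# (boost, docFreq, freq) for the seven terms matched in doc 2717.
TERMS_2717 = [
    (1.0896007, 149, 1),   # memory
    (1.1098264, 3904, 1),  # data
    (1.3293561, 1999, 1),  # text
    (1.3972098, 22, 2),    # temporary
    (1.4671779, 1313, 3),  # large
    (1.8094118, 462, 1),   # construction
    (2.5143967, 53, 2),    # inverted
]

score_2717 = content_score(TERMS_2717, field_norm=0.078125)
```

Up to floating-point rounding, `score_2717` agrees with the listed 0.19876039. The coord factor (7/25 here) penalizes documents that match only a few of the 25 query terms, which is why rare, highly boosted terms such as "inverted" and "temporary" dominate the ranking.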