Document (#26679)

Author
Heinz, S.
Zobel, J.
Title
Efficient single-pass index construction for text databases
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.8, pp.713-729
Year
2003
Abstract
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
Theme
Retrieval algorithms
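
The abstract describes single-pass inversion only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes whitespace tokenization, keeps flushed runs in a Python list instead of writing them to disk, and uses hypothetical names (`build_runs`, `merge_runs`, `max_postings`) that do not come from the article. Postings are accumulated in memory, a sorted run is flushed whenever a postings budget is reached, and the runs are merged at the end, so no complete vocabulary has to stay in memory.

```python
import heapq
from collections import defaultdict


def build_runs(docs, max_postings):
    """Invert documents in one pass; flush a sorted run when the budget is hit."""
    runs, index, count = [], defaultdict(list), 0
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():  # assumption: whitespace tokens
            postings = index[term]
            if postings and postings[-1][0] == doc_id:
                postings[-1][1] += 1           # same document: bump frequency
            else:
                postings.append([doc_id, 1])   # new [doc_id, frequency] posting
                count += 1
        if count >= max_postings:
            # Flush at a document boundary; a real system would write to disk.
            runs.append(sorted(index.items()))
            index, count = defaultdict(list), 0
    if index:
        runs.append(sorted(index.items()))
    return runs


def merge_runs(runs):
    """Merge term-sorted runs; postings stay in document order per term."""
    merged = {}
    for term, postings in heapq.merge(*runs, key=lambda item: item[0]):
        merged.setdefault(term, []).extend(postings)
    return merged


docs = ["a b a", "b c", "a c c"]
runs = build_runs(docs, max_postings=2)
index = merge_runs(runs)
```

Because `heapq.merge` is stable across its inputs and runs are flushed in document order, each merged postings list is already sorted by document number.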

Similar documents (author)

  1. Heinz, S.: Realisierung und Evaluierung eines virtuellen Bibliotheksregals für die Informationswissenschaft an der Universitätsbibliothek Hildesheim (2003) 2.08
    2.0758674 = sum of:
      2.0758674 = product of:
        4.151735 = sum of:
          4.151735 = weight(author_txt:heinz in 5982) [ClassicSimilarity], result of:
            4.151735 = score(doc=5982,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.07526975 = queryNorm
              5.871439 = fieldWeight in 5982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.625 = fieldNorm(doc=5982)
        0.5 = coord(1/2)
    
  2. Heinz, M.: Bemerkungen zur Entwicklung der Internationalität der Forschung : Bibliometrische Untersuchungen am SCI (2006) 2.08
    2.0758674 = sum of:
      2.0758674 = product of:
        4.151735 = sum of:
          4.151735 = weight(author_txt:heinz in 6110) [ClassicSimilarity], result of:
            4.151735 = score(doc=6110,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.07526975 = queryNorm
              5.871439 = fieldWeight in 6110, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.625 = fieldNorm(doc=6110)
        0.5 = coord(1/2)
    
  3. Heinz, A.: Die Lösung des Leib-Seele-Problems bei John R. Searle (2002) 2.08
    2.0758674 = sum of:
      2.0758674 = product of:
        4.151735 = sum of:
          4.151735 = weight(author_txt:heinz in 4299) [ClassicSimilarity], result of:
            4.151735 = score(doc=4299,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.07526975 = queryNorm
              5.871439 = fieldWeight in 4299, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.625 = fieldNorm(doc=4299)
        0.5 = coord(1/2)
    
  4. Großmann, R.; Heinz, M.: RAK-WB als Hypertext (1994) 1.66
    1.6606939 = sum of:
      1.6606939 = product of:
        3.3213878 = sum of:
          3.3213878 = weight(author_txt:heinz in 8775) [ClassicSimilarity], result of:
            3.3213878 = score(doc=8775,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.07526975 = queryNorm
              4.697151 = fieldWeight in 8775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.5 = fieldNorm(doc=8775)
        0.5 = coord(1/2)
    
  5. Heinz, M.; Voigt, H.: Inhaltliche und formale Unzulänglichkeiten bei CD-ROMs : eine Gewichtung (1993) 1.66
    1.6606939 = sum of:
      1.6606939 = product of:
        3.3213878 = sum of:
          3.3213878 = weight(author_txt:heinz in 353) [ClassicSimilarity], result of:
            3.3213878 = score(doc=353,freq=1.0), product of:
              0.7071068 = queryWeight, product of:
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.07526975 = queryNorm
              4.697151 = fieldWeight in 353, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.5 = fieldNorm(doc=353)
        0.5 = coord(1/2)
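
The score trees in the author list above can be reproduced by hand. The sketch below assumes the classic TF-IDF formulas that match the printed values (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq)); the function names are illustrative and are not the search engine's API. It recomputes the score of the first entry (doc 5982) from the numbers shown in its tree.

```python
import math


def idf(doc_freq, max_docs):
    # Assumed classic formula; yields about 9.3943 for docFreq=9, maxDocs=44218.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm                    # queryWeight
    field_weight = math.sqrt(freq) * term_idf * field_norm  # fieldWeight
    return query_weight * field_weight


# Values copied from the tree for doc 5982 (author_txt:heinz).
raw = term_score(freq=1.0, doc_freq=9, max_docs=44218,
                 query_norm=0.07526975, field_norm=0.625)
score = raw * 0.5  # coord(1/2): one of two query clauses matched
print(round(score, 4))
```

Entries 4 and 5 differ only in fieldNorm (0.5 instead of 0.625), which is why documents with two listed authors score lower for the same term.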
    

Similar documents (content)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.13
    0.13450012 = sum of:
      0.13450012 = product of:
        0.48035756 = sum of:
          0.042371783 = weight(abstract_txt:memory in 2648) [ClassicSimilarity], result of:
            0.042371783 = score(doc=2648,freq=1.0), product of:
              0.082136534 = queryWeight, product of:
                1.0229269 = boost
                6.603137 = idf(docFreq=162, maxDocs=44218)
                0.01216022 = queryNorm
              0.5158701 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.603137 = idf(docFreq=162, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.010931249 = weight(abstract_txt:data in 2648) [ClassicSimilarity], result of:
            0.010931249 = score(doc=2648,freq=1.0), product of:
              0.041938066 = queryWeight, product of:
                1.0337027 = boost
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.01216022 = queryNorm
              0.26065218 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.019464636 = weight(abstract_txt:text in 2648) [ClassicSimilarity], result of:
            0.019464636 = score(doc=2648,freq=1.0), product of:
              0.061611168 = queryWeight, product of:
                1.2529137 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.01216022 = queryNorm
              0.3159271 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.12683024 = weight(abstract_txt:temporary in 2648) [ClassicSimilarity], result of:
            0.12683024 = score(doc=2648,freq=2.0), product of:
              0.13540159 = queryWeight, product of:
                1.3133737 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.01216022 = queryNorm
              0.93669677 = fieldWeight in 2648, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.045049828 = weight(abstract_txt:large in 2648) [ClassicSimilarity], result of:
            0.045049828 = score(doc=2648,freq=3.0), product of:
              0.074745245 = queryWeight, product of:
                1.3800131 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.01216022 = queryNorm
              0.6027116 = fieldWeight in 2648, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.047770493 = weight(abstract_txt:construction in 2648) [ClassicSimilarity], result of:
            0.047770493 = score(doc=2648,freq=1.0), product of:
              0.11209899 = queryWeight, product of:
                1.6900218 = boost
                5.4546638 = idf(docFreq=513, maxDocs=44218)
                0.01216022 = queryNorm
              0.4261456 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4546638 = idf(docFreq=513, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.18793935 = weight(abstract_txt:inverted in 2648) [ClassicSimilarity], result of:
            0.18793935 = score(doc=2648,freq=2.0), product of:
              0.22173302 = queryWeight, product of:
                2.3768766 = boost
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.01216022 = queryNorm
              0.84759295 = fieldWeight in 2648, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
        0.28 = coord(7/25)
    
  2. Mukhopadhyay, S.; Peng, S.; Raje, R.; Mostafa, J.; Palakal, M.: Distributed multi-agent information filtering : a comparative study (2005) 0.11
    0.106442206 = sum of:
      0.106442206 = product of:
        0.38015074 = sum of:
          0.04213745 = weight(abstract_txt:speed in 3559) [ClassicSimilarity], result of:
            0.04213745 = score(doc=3559,freq=1.0), product of:
              0.08183343 = queryWeight, product of:
                1.0210378 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.01216022 = queryNorm
              0.5149173 = fieldWeight in 3559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.078125 = fieldNorm(doc=3559)
          0.010931249 = weight(abstract_txt:data in 3559) [ClassicSimilarity], result of:
            0.010931249 = score(doc=3559,freq=1.0), product of:
              0.041938066 = queryWeight, product of:
                1.0337027 = boost
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.01216022 = queryNorm
              0.26065218 = fieldWeight in 3559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.078125 = fieldNorm(doc=3559)
          0.067887574 = weight(abstract_txt:drawbacks in 3559) [ClassicSimilarity], result of:
            0.067887574 = score(doc=3559,freq=1.0), product of:
              0.11246363 = queryWeight, product of:
                1.1969678 = boost
                7.7265954 = idf(docFreq=52, maxDocs=44218)
                0.01216022 = queryNorm
              0.60364026 = fieldWeight in 3559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7265954 = idf(docFreq=52, maxDocs=44218)
                0.078125 = fieldNorm(doc=3559)
          0.052019063 = weight(abstract_txt:large in 3559) [ClassicSimilarity], result of:
            0.052019063 = score(doc=3559,freq=4.0), product of:
              0.074745245 = queryWeight, product of:
                1.3800131 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.01216022 = queryNorm
              0.69595146 = fieldWeight in 3559, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.078125 = fieldNorm(doc=3559)
          0.05701793 = weight(abstract_txt:efficient in 3559) [ClassicSimilarity], result of:
            0.05701793 = score(doc=3559,freq=1.0), product of:
              0.12613517 = queryWeight, product of:
                1.7927079 = boost
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.01216022 = queryNorm
              0.45203832 = fieldWeight in 3559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.078125 = fieldNorm(doc=3559)
          0.06111316 = weight(abstract_txt:approaches in 3559) [ClassicSimilarity], result of:
            0.06111316 = score(doc=3559,freq=2.0), product of:
              0.12002512 = queryWeight, product of:
                2.1417716 = boost
                4.6084785 = idf(docFreq=1197, maxDocs=44218)
                0.01216022 = queryNorm
              0.50916976 = fieldWeight in 3559, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6084785 = idf(docFreq=1197, maxDocs=44218)
                0.078125 = fieldNorm(doc=3559)
          0.08904435 = weight(abstract_txt:single in 3559) [ClassicSimilarity], result of:
            0.08904435 = score(doc=3559,freq=2.0), product of:
              0.15425996 = queryWeight, product of:
                2.428084 = boost
                5.2245407 = idf(docFreq=646, maxDocs=44218)
                0.01216022 = queryNorm
              0.57723564 = fieldWeight in 3559, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2245407 = idf(docFreq=646, maxDocs=44218)
                0.078125 = fieldNorm(doc=3559)
        0.28 = coord(7/25)
    
  3. MacFarlane, A.; McCann, J.A.; Robertson, S.E.: Parallel methods for the generation of partitioned inverted files (2005) 0.08
    0.08221728 = sum of:
      0.08221728 = product of:
        0.34257203 = sum of:
          0.047673084 = weight(abstract_txt:speed in 651) [ClassicSimilarity], result of:
            0.047673084 = score(doc=651,freq=2.0), product of:
              0.08183343 = queryWeight, product of:
                1.0210378 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.01216022 = queryNorm
              0.58256245 = fieldWeight in 651, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.0625 = fieldNorm(doc=651)
          0.008744999 = weight(abstract_txt:data in 651) [ClassicSimilarity], result of:
            0.008744999 = score(doc=651,freq=1.0), product of:
              0.041938066 = queryWeight, product of:
                1.0337027 = boost
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.01216022 = queryNorm
              0.20852174 = fieldWeight in 651, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.0625 = fieldNorm(doc=651)
          0.026970992 = weight(abstract_txt:text in 651) [ClassicSimilarity], result of:
            0.026970992 = score(doc=651,freq=3.0), product of:
              0.061611168 = queryWeight, product of:
                1.2529137 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.01216022 = queryNorm
              0.4377614 = fieldWeight in 651, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=651)
          0.029426424 = weight(abstract_txt:large in 651) [ClassicSimilarity], result of:
            0.029426424 = score(doc=651,freq=2.0), product of:
              0.074745245 = queryWeight, product of:
                1.3800131 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.01216022 = queryNorm
              0.39368957 = fieldWeight in 651, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.0625 = fieldNorm(doc=651)
          0.045614343 = weight(abstract_txt:efficient in 651) [ClassicSimilarity], result of:
            0.045614343 = score(doc=651,freq=1.0), product of:
              0.12613517 = queryWeight, product of:
                1.7927079 = boost
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.01216022 = queryNorm
              0.36163065 = fieldWeight in 651, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.0625 = fieldNorm(doc=651)
          0.18414219 = weight(abstract_txt:inverted in 651) [ClassicSimilarity], result of:
            0.18414219 = score(doc=651,freq=3.0), product of:
              0.22173302 = queryWeight, product of:
                2.3768766 = boost
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.01216022 = queryNorm
              0.83046806 = fieldWeight in 651, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.0625 = fieldNorm(doc=651)
        0.24 = coord(6/25)
    
  4. Uratani, N.; Takeda, M.: A fast string-searching algorithm for multiple patterns (1993) 0.08
    0.07671731 = sum of:
      0.07671731 = product of:
        0.4794832 = sum of:
          0.033032585 = weight(abstract_txt:text in 6275) [ClassicSimilarity], result of:
            0.033032585 = score(doc=6275,freq=2.0), product of:
              0.061611168 = queryWeight, product of:
                1.2529137 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.01216022 = queryNorm
              0.53614604 = fieldWeight in 6275, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=6275)
          0.06842151 = weight(abstract_txt:efficient in 6275) [ClassicSimilarity], result of:
            0.06842151 = score(doc=6275,freq=1.0), product of:
              0.12613517 = queryWeight, product of:
                1.7927079 = boost
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.01216022 = queryNorm
              0.54244596 = fieldWeight in 6275, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.09375 = fieldNorm(doc=6275)
          0.075556636 = weight(abstract_txt:single in 6275) [ClassicSimilarity], result of:
            0.075556636 = score(doc=6275,freq=1.0), product of:
              0.15425996 = queryWeight, product of:
                2.428084 = boost
                5.2245407 = idf(docFreq=646, maxDocs=44218)
                0.01216022 = queryNorm
              0.4898007 = fieldWeight in 6275, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2245407 = idf(docFreq=646, maxDocs=44218)
                0.09375 = fieldNorm(doc=6275)
          0.30247244 = weight(abstract_txt:pass in 6275) [ClassicSimilarity], result of:
            0.30247244 = score(doc=6275,freq=1.0), product of:
              0.38892156 = queryWeight, product of:
                3.855388 = boost
                8.29569 = idf(docFreq=29, maxDocs=44218)
                0.01216022 = queryNorm
              0.7777209 = fieldWeight in 6275, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.29569 = idf(docFreq=29, maxDocs=44218)
                0.09375 = fieldNorm(doc=6275)
        0.16 = coord(4/25)
    
  5. Chang, M.; Poon, C.K.: Efficient phrase querying with common phrase index (2008) 0.07
    0.06668133 = sum of:
      0.06668133 = product of:
        0.33340666 = sum of:
          0.019464636 = weight(abstract_txt:text in 2061) [ClassicSimilarity], result of:
            0.019464636 = score(doc=2061,freq=1.0), product of:
              0.061611168 = queryWeight, product of:
                1.2529137 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.01216022 = queryNorm
              0.3159271 = fieldWeight in 2061, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=2061)
          0.045049828 = weight(abstract_txt:large in 2061) [ClassicSimilarity], result of:
            0.045049828 = score(doc=2061,freq=3.0), product of:
              0.074745245 = queryWeight, product of:
                1.3800131 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.01216022 = queryNorm
              0.6027116 = fieldWeight in 2061, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.078125 = fieldNorm(doc=2061)
          0.0789811 = weight(abstract_txt:cost in 2061) [ClassicSimilarity], result of:
            0.0789811 = score(doc=2061,freq=2.0), product of:
              0.12440391 = queryWeight, product of:
                1.7803626 = boost
                5.746245 = idf(docFreq=383, maxDocs=44218)
                0.01216022 = queryNorm
              0.6348764 = fieldWeight in 2061, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.746245 = idf(docFreq=383, maxDocs=44218)
                0.078125 = fieldNorm(doc=2061)
          0.05701793 = weight(abstract_txt:efficient in 2061) [ClassicSimilarity], result of:
            0.05701793 = score(doc=2061,freq=1.0), product of:
              0.12613517 = queryWeight, product of:
                1.7927079 = boost
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.01216022 = queryNorm
              0.45203832 = fieldWeight in 2061, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.078125 = fieldNorm(doc=2061)
          0.13289317 = weight(abstract_txt:inverted in 2061) [ClassicSimilarity], result of:
            0.13289317 = score(doc=2061,freq=1.0), product of:
              0.22173302 = queryWeight, product of:
                2.3768766 = boost
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.01216022 = queryNorm
              0.5993387 = fieldWeight in 2061, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.078125 = fieldNorm(doc=2061)
        0.2 = coord(5/25)
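
The content-based scores above combine several boosted terms and then scale the sum by a coordination factor. As a hedged illustration, the sketch below recomputes the score of the first content entry (doc 2648) from the per-term values printed in its tree, assuming the classic TF-IDF formulas that those values are consistent with. The helper name `boosted_term_score` is hypothetical.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.01216022
FIELD_NORM = 0.078125  # fieldNorm printed for doc 2648


def boosted_term_score(freq, doc_freq, boost):
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    query_weight = boost * idf * QUERY_NORM               # queryWeight
    field_weight = math.sqrt(freq) * idf * FIELD_NORM     # fieldWeight
    return query_weight * field_weight


# (term, freq, docFreq, boost) copied from the tree for doc 2648.
terms = [
    ("memory", 1.0, 162, 1.0229269),
    ("data", 1.0, 4274, 1.0337027),
    ("text", 1.0, 2106, 1.2529137),
    ("temporary", 2.0, 24, 1.3133737),
    ("large", 3.0, 1397, 1.3800131),
    ("construction", 1.0, 513, 1.6900218),
    ("inverted", 2.0, 55, 2.3768766),
]

total = sum(boosted_term_score(f, df, b) for _, f, df, b in terms)
score = total * (7 / 25)  # coord(7/25): 7 of 25 query terms matched
print(round(score, 4))
```

The coordination factor explains why entry 4 ranks below entry 3 despite a larger raw sum: it matches only 4 of the 25 query terms.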