Document (#32083)

Author
Wan, R.
Moffat, A.
Title
Block merging for off-line compression
Source
Journal of the American Society for Information Science and Technology. 58(2007) no.1, pp.3-14
Year
2007
Abstract
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once - usually as a block size, but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat (2000), which processes large messages as disjoint blocks to limit memory consumption. We show that the blocks emitted by Re-Pair can be postprocessed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.

Similar documents (author)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 4.86
    4.85849 = sum of:
      4.85849 = weight(author_txt:moffat in 2717) [ClassicSimilarity], result of:
        4.85849 = fieldWeight in 2717, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.5 = fieldNorm(doc=2717)
    
  2. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.86
    4.85849 = sum of:
      4.85849 = weight(author_txt:moffat in 2010) [ClassicSimilarity], result of:
        4.85849 = fieldWeight in 2010, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.5 = fieldNorm(doc=2010)
    
  3. Moffat, A.; Isal, R.Y.K.: Word-based text compression using the Burrows-Wheeler transform (2005) 4.86
    4.85849 = sum of:
      4.85849 = weight(author_txt:moffat in 3045) [ClassicSimilarity], result of:
        4.85849 = fieldWeight in 3045, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.5 = fieldNorm(doc=3045)
    
  4. Witten, I.H.; Moffat, A.; Bell, T.C.: Managing gigabytes : compressing and indexing documents and images (1994) 3.64
    3.6438675 = sum of:
      3.6438675 = weight(author_txt:moffat in 4084) [ClassicSimilarity], result of:
        3.6438675 = fieldWeight in 4084, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.375 = fieldNorm(doc=4084)
    
  5. Bell, T.C.; Moffat, A.; Nevill-Manning, C.G.; Witten, I.H.; Zobel, J.: Data compression in full-text retrieval systems (1993) 2.43
    2.429245 = sum of:
      2.429245 = weight(author_txt:moffat in 5643) [ClassicSimilarity], result of:
        2.429245 = fieldWeight in 5643, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.25 = fieldNorm(doc=5643)
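
The author-match scores above can be reproduced with a short calculation. The following is a minimal sketch, not the search engine's actual code: it assumes the classic TF-IDF form that the printed figures are consistent with, namely tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are hypothetical, and the fieldNorm values are taken directly from the listing rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency; the formula is inferred from the figures shown above."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in each explain tree above."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# fieldNorm values copied from the five author matches listed above.
FIELD_NORMS = [0.5, 0.5, 0.5, 0.375, 0.25]

if __name__ == "__main__":
    for rank, norm in enumerate(FIELD_NORMS, start=1):
        score = field_weight(freq=1.0, doc_freq=6, max_docs=42740, field_norm=norm)
        print(rank, round(score, 2))
```

Because the term frequency and idf are identical for all five records, the ranking in this list is driven entirely by fieldNorm, which is smaller for records with longer author fields.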
    

Similar documents (content)

  1. Moffat, A.; Isal, R.Y.K.: Word-based text compression using the Burrows-Wheeler transform (2005) 0.20
    0.20158635 = sum of:
      0.20158635 = product of:
        1.0079317 = sum of:
          0.062045846 = weight(abstract_txt:mechanism in 3045) [ClassicSimilarity], result of:
            0.062045846 = score(doc=3045,freq=2.0), product of:
              0.088025466 = queryWeight, product of:
                6.379687 = idf(docFreq=196, maxDocs=42740)
                0.013797772 = queryNorm
              0.7048625 = fieldWeight in 3045, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.379687 = idf(docFreq=196, maxDocs=42740)
                0.078125 = fieldNorm(doc=3045)
          0.011601135 = weight(abstract_txt:that in 3045) [ClassicSimilarity], result of:
            0.011601135 = score(doc=3045,freq=1.0), product of:
              0.062010728 = queryWeight, product of:
                1.8767838 = boost
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.013797772 = queryNorm
              0.18708271 = fieldWeight in 3045, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.078125 = fieldNorm(doc=3045)
          0.24888657 = weight(abstract_txt:compression in 3045) [ClassicSimilarity], result of:
            0.24888657 = score(doc=3045,freq=3.0), product of:
              0.24459472 = queryWeight, product of:
                2.357406 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.013797772 = queryNorm
              1.0175468 = fieldWeight in 3045, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.078125 = fieldNorm(doc=3045)
          0.15615201 = weight(abstract_txt:blocks in 3045) [ClassicSimilarity], result of:
            0.15615201 = score(doc=3045,freq=1.0), product of:
              0.25853434 = queryWeight, product of:
                2.4236503 = boost
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.013797772 = queryNorm
              0.6039894 = fieldWeight in 3045, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.078125 = fieldNorm(doc=3045)
          0.52924615 = weight(abstract_txt:block in 3045) [ClassicSimilarity], result of:
            0.52924615 = score(doc=3045,freq=4.0), product of:
              0.42066407 = queryWeight, product of:
                3.7863798 = boost
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.013797772 = queryNorm
              1.2581207 = fieldWeight in 3045, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.078125 = fieldNorm(doc=3045)
        0.2 = coord(5/25)
    
  2. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.09
    0.09242066 = sum of:
      0.09242066 = product of:
        0.7701722 = sum of:
          0.020093754 = weight(abstract_txt:that in 1120) [ClassicSimilarity], result of:
            0.020093754 = score(doc=1120,freq=3.0), product of:
              0.062010728 = queryWeight, product of:
                1.8767838 = boost
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.013797772 = queryNorm
              0.32403675 = fieldWeight in 1120, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.078125 = fieldNorm(doc=1120)
          0.22083229 = weight(abstract_txt:blocks in 1120) [ClassicSimilarity], result of:
            0.22083229 = score(doc=1120,freq=2.0), product of:
              0.25853434 = queryWeight, product of:
                2.4236503 = boost
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.013797772 = queryNorm
              0.85417 = fieldWeight in 1120, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.078125 = fieldNorm(doc=1120)
          0.52924615 = weight(abstract_txt:block in 1120) [ClassicSimilarity], result of:
            0.52924615 = score(doc=1120,freq=4.0), product of:
              0.42066407 = queryWeight, product of:
                3.7863798 = boost
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.013797772 = queryNorm
              1.2581207 = fieldWeight in 1120, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.078125 = fieldNorm(doc=1120)
        0.12 = coord(3/25)
    
  3. Fersini, E.; Messina, E.; Archetti, F.: Enhancing web page classification through image-block importance analysis (2008) 0.09
    0.089425236 = sum of:
      0.089425236 = product of:
        0.7452103 = sum of:
          0.016406482 = weight(abstract_txt:that in 4103) [ClassicSimilarity], result of:
            0.016406482 = score(doc=4103,freq=2.0), product of:
              0.062010728 = queryWeight, product of:
                1.8767838 = boost
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.013797772 = queryNorm
              0.2645749 = fieldWeight in 4103, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.078125 = fieldNorm(doc=4103)
          0.2704632 = weight(abstract_txt:blocks in 4103) [ClassicSimilarity], result of:
            0.2704632 = score(doc=4103,freq=3.0), product of:
              0.25853434 = queryWeight, product of:
                2.4236503 = boost
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.013797772 = queryNorm
              1.0461403 = fieldWeight in 4103, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.078125 = fieldNorm(doc=4103)
          0.45834062 = weight(abstract_txt:block in 4103) [ClassicSimilarity], result of:
            0.45834062 = score(doc=4103,freq=3.0), product of:
              0.42066407 = queryWeight, product of:
                3.7863798 = boost
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.013797772 = queryNorm
              1.0895644 = fieldWeight in 4103, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.078125 = fieldNorm(doc=4103)
        0.12 = coord(3/25)
    
  4. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.09
    0.08713272 = sum of:
      0.08713272 = product of:
        0.5445795 = sum of:
          0.016406482 = weight(abstract_txt:that in 2717) [ClassicSimilarity], result of:
            0.016406482 = score(doc=2717,freq=2.0), product of:
              0.062010728 = queryWeight, product of:
                1.8767838 = boost
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.013797772 = queryNorm
              0.2645749 = fieldWeight in 2717, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
          0.14369473 = weight(abstract_txt:compression in 2717) [ClassicSimilarity], result of:
            0.14369473 = score(doc=2717,freq=1.0), product of:
              0.24459472 = queryWeight, product of:
                2.357406 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.013797772 = queryNorm
              0.5874809 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
          0.23745379 = weight(abstract_txt:compressed in 2717) [ClassicSimilarity], result of:
            0.23745379 = score(doc=2717,freq=1.0), product of:
              0.34187913 = queryWeight, product of:
                2.7870653 = boost
                8.890302 = idf(docFreq=15, maxDocs=42740)
                0.013797772 = queryNorm
              0.6945548 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.890302 = idf(docFreq=15, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
          0.1470245 = weight(abstract_txt:memory in 2717) [ClassicSimilarity], result of:
            0.1470245 = score(doc=2717,freq=1.0), product of:
              0.2842999 = queryWeight, product of:
                3.112754 = boost
                6.6194654 = idf(docFreq=154, maxDocs=42740)
                0.013797772 = queryNorm
              0.51714575 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6194654 = idf(docFreq=154, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
        0.16 = coord(4/25)
    
  5. Wan, X.; Yang, J.; Xiao, J.: Towards a unified approach to document similarity search using manifold-ranking of blocks (2008) 0.08
    0.078634396 = sum of:
      0.078634396 = product of:
        0.65528667 = sum of:
          0.009280908 = weight(abstract_txt:that in 4082) [ClassicSimilarity], result of:
            0.009280908 = score(doc=4082,freq=1.0), product of:
              0.062010728 = queryWeight, product of:
                1.8767838 = boost
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.013797772 = queryNorm
              0.14966616 = fieldWeight in 4082, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3946586 = idf(docFreq=10595, maxDocs=42740)
                0.0625 = fieldNorm(doc=4082)
          0.27933323 = weight(abstract_txt:blocks in 4082) [ClassicSimilarity], result of:
            0.27933323 = score(doc=4082,freq=5.0), product of:
              0.25853434 = queryWeight, product of:
                2.4236503 = boost
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.013797772 = queryNorm
              1.0804492 = fieldWeight in 4082, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.0625 = fieldNorm(doc=4082)
          0.36667252 = weight(abstract_txt:block in 4082) [ClassicSimilarity], result of:
            0.36667252 = score(doc=4082,freq=3.0), product of:
              0.42066407 = queryWeight, product of:
                3.7863798 = boost
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.013797772 = queryNorm
              0.8716516 = fieldWeight in 4082, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.0625 = fieldNorm(doc=4082)
        0.12 = coord(3/25)
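
The content-match scores combine more factors than the author-match scores. As a hedged sketch under the same assumptions (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))), the total for the first content match (doc 3045) can be recomputed as the sum of queryWeight * fieldWeight over the matched terms, multiplied by the coord factor. The names below are hypothetical, and the boost of 1.0 for "mechanism" is an assumption, since the listing shows no boost line for that term.

```python
import math

QUERY_NORM = 0.013797772   # queryNorm, copied from the listing
MAX_DOCS = 42740


def idf(doc_freq: int) -> float:
    # Formula inferred from the idf values printed in the listing.
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1))


def term_score(boost: float, doc_freq: int, freq: float, field_norm: float) -> float:
    """score = queryWeight * fieldWeight for one matched term."""
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm: float, total_query_terms: int) -> float:
    """Sum the per-term scores, then scale by coord = matched / total query terms."""
    coord = len(terms) / total_query_terms
    return coord * sum(term_score(b, df, f, field_norm) for b, df, f in terms)


# (boost, docFreq, termFreq) for the five terms matched in doc 3045.
DOC_3045_TERMS = [
    (1.0, 196, 2.0),          # mechanism (boost assumed to be 1.0)
    (1.8767838, 10595, 1.0),  # that
    (2.357406, 62, 3.0),      # compression
    (2.4236503, 50, 1.0),     # blocks
    (3.7863798, 36, 4.0),     # block
]

if __name__ == "__main__":
    total = document_score(DOC_3045_TERMS, field_norm=0.078125, total_query_terms=25)
    print(round(total, 2))
```

The coord factor explains why a record matching only three of the 25 query terms scores well below the first match even when its per-term weights for "block" and "blocks" are comparable.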