Document (#33046)

Author
Moffat, A.
Isal, R.Y.K.
Title
Word-based text compression using the Burrows-Wheeler transform
Source
Information processing and management. 41(2005) no.5, S.1175-1192
Year
2005
Abstract
Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. It involves three steps: permuting the input one block at a time through the use of the Burrows-Wheeler transform (bwt); applying a move-to-front (mtf) transform to each of the permuted blocks; and then entropy coding the output with a Huffman or arithmetic coder. Until now, block-sorting implementations have assumed that the input message is a sequence of characters. In this paper we extend the block-sorting mechanism to word-based models. We also consider other recency transformations, and are able to show improved compression results compared to mtf and uniform arithmetic coding. For large files of text, the combination of word-based modeling, bwt, and mtf-like transformations allows excellent compression effectiveness to be attained within reasonable resource costs.
Theme
Computerlinguistik

Similar documents (author)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 4.86
    4.85849 = sum of:
      4.85849 = weight(author_txt:moffat in 2717) [ClassicSimilarity], result of:
        4.85849 = fieldWeight in 2717, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.5 = fieldNorm(doc=2717)
    
  2. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.86
    4.85849 = sum of:
      4.85849 = weight(author_txt:moffat in 2010) [ClassicSimilarity], result of:
        4.85849 = fieldWeight in 2010, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.5 = fieldNorm(doc=2010)
    
  3. Wan, R.; Moffat, A.: Block merging for off-line compression (2007) 4.86
    4.85849 = sum of:
      4.85849 = weight(author_txt:moffat in 2082) [ClassicSimilarity], result of:
        4.85849 = fieldWeight in 2082, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.5 = fieldNorm(doc=2082)
    
  4. Witten, I.H.; Moffat, A.; Bell, T.C.: Managing gigabytes : compressing and indexing documents and images (1994) 3.64
    3.6438675 = sum of:
      3.6438675 = weight(author_txt:moffat in 4084) [ClassicSimilarity], result of:
        3.6438675 = fieldWeight in 4084, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.375 = fieldNorm(doc=4084)
    
  5. Bell, T.C.; Moffat, A.; Nevill-Manning, C.G.; Witten, I.H.; Zobel, J.: Data compression in full-text retrieval system (1993) 2.43
    2.429245 = sum of:
      2.429245 = weight(author_txt:moffat in 5643) [ClassicSimilarity], result of:
        2.429245 = fieldWeight in 5643, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.25 = fieldNorm(doc=5643)
    

Similar documents (content)

  1. Cheng, K.-S.; Young, G.H.; Wong, K.-F.: ¬A study on word-based and integral-bit Chinese text compression algorithms (1999) 0.33
    0.32902274 = sum of:
      0.32902274 = product of:
        1.370928 = sum of:
          0.026471788 = weight(abstract_txt:text in 4057) [ClassicSimilarity], result of:
            0.026471788 = score(doc=4057,freq=1.0), product of:
              0.059759073 = queryWeight, product of:
                1.1851676 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.01244981 = queryNorm
              0.44297522 = fieldWeight in 4057, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.109375 = fieldNorm(doc=4057)
          0.01974871 = weight(abstract_txt:based in 4057) [ClassicSimilarity], result of:
            0.01974871 = score(doc=4057,freq=1.0), product of:
              0.056269266 = queryWeight, product of:
                1.4085072 = boost
                3.2088501 = idf(docFreq=4693, maxDocs=42740)
                0.01244981 = queryNorm
              0.35096797 = fieldWeight in 4057, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2088501 = idf(docFreq=4693, maxDocs=42740)
                0.109375 = fieldNorm(doc=4057)
          0.17623036 = weight(abstract_txt:coding in 4057) [ClassicSimilarity], result of:
            0.17623036 = score(doc=4057,freq=2.0), product of:
              0.1678516 = queryWeight, product of:
                1.9862804 = boost
                6.787693 = idf(docFreq=130, maxDocs=42740)
                0.01244981 = queryNorm
              1.0499177 = fieldWeight in 4057, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.787693 = idf(docFreq=130, maxDocs=42740)
                0.109375 = fieldNorm(doc=4057)
          0.13715847 = weight(abstract_txt:word in 4057) [ClassicSimilarity], result of:
            0.13715847 = score(doc=4057,freq=2.0), product of:
              0.16257378 = queryWeight, product of:
                2.3941355 = boost
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.01244981 = queryNorm
              0.843669 = fieldWeight in 4057, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.109375 = fieldNorm(doc=4057)
          0.42436895 = weight(abstract_txt:arithmetic in 4057) [ClassicSimilarity], result of:
            0.42436895 = score(doc=4057,freq=2.0), product of:
              0.3015556 = queryWeight, product of:
                2.6623278 = boost
                9.097941 = idf(docFreq=12, maxDocs=42740)
                0.01244981 = queryNorm
              1.407266 = fieldWeight in 4057, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.097941 = idf(docFreq=12, maxDocs=42740)
                0.109375 = fieldNorm(doc=4057)
          0.5869499 = weight(abstract_txt:compression in 4057) [ClassicSimilarity], result of:
            0.5869499 = score(doc=4057,freq=3.0), product of:
              0.41202027 = queryWeight, product of:
                4.4010077 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.01244981 = queryNorm
              1.4245656 = fieldWeight in 4057, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.109375 = fieldNorm(doc=4057)
        0.24 = coord(6/25)
    
  2. Wan, R.; Moffat, A.: Block merging for off-line compression (2007) 0.17
    0.16962591 = sum of:
      0.16962591 = product of:
        0.8481295 = sum of:
          0.092998095 = weight(abstract_txt:blocks in 2082) [ClassicSimilarity], result of:
            0.092998095 = score(doc=2082,freq=2.0), product of:
              0.108875394 = queryWeight, product of:
                1.1311696 = boost
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.01244981 = queryNorm
              0.85417 = fieldWeight in 2082, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.078125 = fieldNorm(doc=2082)
          0.018908422 = weight(abstract_txt:text in 2082) [ClassicSimilarity], result of:
            0.018908422 = score(doc=2082,freq=1.0), product of:
              0.059759073 = queryWeight, product of:
                1.1851676 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.01244981 = queryNorm
              0.3164109 = fieldWeight in 2082, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.078125 = fieldNorm(doc=2082)
          0.073904224 = weight(abstract_txt:mechanism in 2082) [ClassicSimilarity], result of:
            0.073904224 = score(doc=2082,freq=1.0), product of:
              0.14827907 = queryWeight, product of:
                1.8668858 = boost
                6.379687 = idf(docFreq=196, maxDocs=42740)
                0.01244981 = queryNorm
              0.49841303 = fieldWeight in 2082, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.379687 = idf(docFreq=196, maxDocs=42740)
                0.078125 = fieldNorm(doc=2082)
          0.24205405 = weight(abstract_txt:compression in 2082) [ClassicSimilarity], result of:
            0.24205405 = score(doc=2082,freq=1.0), product of:
              0.41202027 = queryWeight, product of:
                4.4010077 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.01244981 = queryNorm
              0.5874809 = fieldWeight in 2082, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.078125 = fieldNorm(doc=2082)
          0.42026475 = weight(abstract_txt:block in 2082) [ClassicSimilarity], result of:
            0.42026475 = score(doc=2082,freq=2.0), product of:
              0.4724063 = queryWeight, product of:
                4.7124925 = boost
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.01244981 = queryNorm
              0.8896256 = fieldWeight in 2082, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.078125 = fieldNorm(doc=2082)
        0.2 = coord(5/25)
    
  3. Steinmetz, R.: Data compression in multimedia computing : principles and techniques (1994) 0.11
    0.10882995 = sum of:
      0.10882995 = product of:
        0.54414976 = sum of:
          0.081635036 = weight(abstract_txt:entropy in 182) [ClassicSimilarity], result of:
            0.081635036 = score(doc=182,freq=2.0), product of:
              0.115825646 = queryWeight, product of:
                1.1667161 = boost
                7.974011 = idf(docFreq=39, maxDocs=42740)
                0.01244981 = queryNorm
              0.70480967 = fieldWeight in 182, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.974011 = idf(docFreq=39, maxDocs=42740)
                0.0625 = fieldNorm(doc=182)
          0.015126737 = weight(abstract_txt:text in 182) [ClassicSimilarity], result of:
            0.015126737 = score(doc=182,freq=1.0), product of:
              0.059759073 = queryWeight, product of:
                1.1851676 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.01244981 = queryNorm
              0.2531287 = fieldWeight in 182, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=182)
          0.011284977 = weight(abstract_txt:based in 182) [ClassicSimilarity], result of:
            0.011284977 = score(doc=182,freq=1.0), product of:
              0.056269266 = queryWeight, product of:
                1.4085072 = boost
                3.2088501 = idf(docFreq=4693, maxDocs=42740)
                0.01244981 = queryNorm
              0.20055313 = fieldWeight in 182, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2088501 = idf(docFreq=4693, maxDocs=42740)
                0.0625 = fieldNorm(doc=182)
          0.10070306 = weight(abstract_txt:coding in 182) [ClassicSimilarity], result of:
            0.10070306 = score(doc=182,freq=2.0), product of:
              0.1678516 = queryWeight, product of:
                1.9862804 = boost
                6.787693 = idf(docFreq=130, maxDocs=42740)
                0.01244981 = queryNorm
              0.59995294 = fieldWeight in 182, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.787693 = idf(docFreq=130, maxDocs=42740)
                0.0625 = fieldNorm(doc=182)
          0.33539993 = weight(abstract_txt:compression in 182) [ClassicSimilarity], result of:
            0.33539993 = score(doc=182,freq=3.0), product of:
              0.41202027 = queryWeight, product of:
                4.4010077 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.01244981 = queryNorm
              0.81403744 = fieldWeight in 182, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.0625 = fieldNorm(doc=182)
        0.2 = coord(5/25)
    
  4. Fersini, E.; Messina, E.; Archetti, F.: Enhancing web page classification through image-block importance analysis (2008) 0.11
    0.10586091 = sum of:
      0.10586091 = product of:
        0.6616307 = sum of:
          0.11389894 = weight(abstract_txt:blocks in 4103) [ClassicSimilarity], result of:
            0.11389894 = score(doc=4103,freq=3.0), product of:
              0.108875394 = queryWeight, product of:
                1.1311696 = boost
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.01244981 = queryNorm
              1.0461403 = fieldWeight in 4103, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.078125 = fieldNorm(doc=4103)
          0.018908422 = weight(abstract_txt:text in 4103) [ClassicSimilarity], result of:
            0.018908422 = score(doc=4103,freq=1.0), product of:
              0.059759073 = queryWeight, product of:
                1.1851676 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.01244981 = queryNorm
              0.3164109 = fieldWeight in 4103, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.078125 = fieldNorm(doc=4103)
          0.0141062215 = weight(abstract_txt:based in 4103) [ClassicSimilarity], result of:
            0.0141062215 = score(doc=4103,freq=1.0), product of:
              0.056269266 = queryWeight, product of:
                1.4085072 = boost
                3.2088501 = idf(docFreq=4693, maxDocs=42740)
                0.01244981 = queryNorm
              0.2506914 = fieldWeight in 4103, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2088501 = idf(docFreq=4693, maxDocs=42740)
                0.078125 = fieldNorm(doc=4103)
          0.5147171 = weight(abstract_txt:block in 4103) [ClassicSimilarity], result of:
            0.5147171 = score(doc=4103,freq=3.0), product of:
              0.4724063 = queryWeight, product of:
                4.7124925 = boost
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.01244981 = queryNorm
              1.0895644 = fieldWeight in 4103, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.078125 = fieldNorm(doc=4103)
        0.16 = coord(4/25)
    
  5. Tsai, R.T.-H.; Chiu, B.; Wu, C.-E.: Visual webpage block importance prediction using conditional random fields (2011) 0.10
    0.09946364 = sum of:
      0.09946364 = product of:
        0.6216478 = sum of:
          0.07269357 = weight(abstract_txt:sequence in 1925) [ClassicSimilarity], result of:
            0.07269357 = score(doc=1925,freq=4.0), product of:
              0.085089184 = queryWeight, product of:
                6.8345766 = idf(docFreq=124, maxDocs=42740)
                0.01244981 = queryNorm
              0.8543221 = fieldWeight in 1925, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.8345766 = idf(docFreq=124, maxDocs=42740)
                0.0625 = fieldNorm(doc=1925)
          0.11763433 = weight(abstract_txt:blocks in 1925) [ClassicSimilarity], result of:
            0.11763433 = score(doc=1925,freq=5.0), product of:
              0.108875394 = queryWeight, product of:
                1.1311696 = boost
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.01244981 = queryNorm
              1.0804492 = fieldWeight in 1925, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.731065 = idf(docFreq=50, maxDocs=42740)
                0.0625 = fieldNorm(doc=1925)
          0.019546155 = weight(abstract_txt:based in 1925) [ClassicSimilarity], result of:
            0.019546155 = score(doc=1925,freq=3.0), product of:
              0.056269266 = queryWeight, product of:
                1.4085072 = boost
                3.2088501 = idf(docFreq=4693, maxDocs=42740)
                0.01244981 = queryNorm
              0.3473682 = fieldWeight in 1925, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.2088501 = idf(docFreq=4693, maxDocs=42740)
                0.0625 = fieldNorm(doc=1925)
          0.4117737 = weight(abstract_txt:block in 1925) [ClassicSimilarity], result of:
            0.4117737 = score(doc=1925,freq=3.0), product of:
              0.4724063 = queryWeight, product of:
                4.7124925 = boost
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.01244981 = queryNorm
              0.8716516 = fieldWeight in 1925, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.0625 = fieldNorm(doc=1925)
        0.16 = coord(4/25)