Document (#33045)

Author
Moffat, A.
Isal, R.Y.K.
Title
Word-based text compression using the Burrows-Wheeler transform
Source
Information processing and management. 41(2005) no.5, pp.1175-1192
Year
2005
Abstract
Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. It involves three steps: permuting the input one block at a time through the use of the Burrows-Wheeler transform (BWT); applying a move-to-front (MTF) transform to each of the permuted blocks; and then entropy coding the output with a Huffman or arithmetic coder. Until now, block-sorting implementations have assumed that the input message is a sequence of characters. In this paper we extend the block-sorting mechanism to word-based models. We also consider other recency transformations, and are able to show improved compression results compared to MTF and uniform arithmetic coding. For large files of text, the combination of word-based modeling, BWT, and MTF-like transformations allows excellent compression effectiveness to be attained within reasonable resource costs.
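The first two steps named in the abstract can be sketched on word tokens rather than characters. This is a minimal illustration only, not the authors' implementation: the tokenizer, the naive rotation sort, and all function names (tokenize, bwt, inverse_bwt, mtf_encode, mtf_decode) are assumptions made for the example, and the final entropy-coding step (Huffman or arithmetic) is left out.

```python
import re


def tokenize(text):
    # Assumed tokenizer: alternate runs of word and non-word characters,
    # so joining the tokens rebuilds the text exactly.
    return re.findall(r"\w+|\W+", text)


def bwt(symbols):
    # Naive Burrows-Wheeler transform over a list of tokens: sort all
    # rotations and keep the last column. Quadratic; fine for tiny inputs.
    n = len(symbols)
    starts = sorted(range(n), key=lambda i: symbols[i:] + symbols[:i])
    last = [symbols[(i - 1) % n] for i in starts]
    primary = starts.index(0)  # row holding the unrotated input
    return last, primary


def inverse_bwt(last, primary):
    # A stable sort of positions by symbol links each row of the sorted
    # rotation matrix to the row that is rotated one step further left.
    n = len(last)
    order = sorted(range(n), key=lambda i: last[i])
    out = []
    row = primary
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return out


def mtf_encode(symbols):
    # Move-to-front: emit each symbol's current rank in a recency list,
    # then move that symbol to the front of the list.
    alphabet = sorted(set(symbols))
    table = list(alphabet)
    ranks = []
    for s in symbols:
        r = table.index(s)
        ranks.append(r)
        table.insert(0, table.pop(r))
    return ranks, alphabet


def mtf_decode(ranks, alphabet):
    # Mirror of mtf_encode: look up each rank, then move to front.
    table = list(alphabet)
    out = []
    for r in ranks:
        s = table.pop(r)
        out.append(s)
        table.insert(0, s)
    return out


if __name__ == "__main__":
    tokens = tokenize("to be or not to be that is the question")
    last, primary = bwt(tokens)
    ranks, alphabet = mtf_encode(last)
    print(ranks)
    print("".join(inverse_bwt(mtf_decode(ranks, alphabet), primary)))
```

In this sketch the rank list is what an entropy coder would consume. The decoder would also need the alphabet of distinct tokens and the primary index.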
Theme
Computational linguistics

Similar documents (author)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 4.88
    4.8754888 = sum of:
      4.8754888 = weight(author_txt:moffat in 2648) [ClassicSimilarity], result of:
        4.8754888 = fieldWeight in 2648, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.5 = fieldNorm(doc=2648)
    
  2. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.88
    4.8754888 = sum of:
      4.8754888 = weight(author_txt:moffat in 9) [ClassicSimilarity], result of:
        4.8754888 = fieldWeight in 9, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.5 = fieldNorm(doc=9)
    
  3. Wan, R.; Moffat, A.: Block merging for off-line compression (2007) 4.88
    4.8754888 = sum of:
      4.8754888 = weight(author_txt:moffat in 81) [ClassicSimilarity], result of:
        4.8754888 = fieldWeight in 81, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.5 = fieldNorm(doc=81)
    
  4. Witten, I.H.; Moffat, A.; Bell, T.C.: Managing gigabytes : compressing and indexing documents and images (1994) 3.66
    3.6566167 = sum of:
      3.6566167 = weight(author_txt:moffat in 3083) [ClassicSimilarity], result of:
        3.6566167 = fieldWeight in 3083, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.375 = fieldNorm(doc=3083)
    
  5. Bell, T.C.; Moffat, A.; Nevill-Manning, C.G.; Witten, I.H.; Zobel, J.: Data compression in full-text retrieval system (1993) 2.44
    2.4377444 = sum of:
      2.4377444 = weight(author_txt:moffat in 5643) [ClassicSimilarity], result of:
        2.4377444 = fieldWeight in 5643, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.25 = fieldNorm(doc=5643)
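
The author scores above each break down into tf × idf × fieldNorm. The sketch below recomputes them as a check. The listing does not print the idf formula, so the one used here, 1 + ln(maxDocs / (docFreq + 1)), is an assumption that matches classic TF-IDF scoring and reproduces the listed value 9.7509775. The tf = sqrt(freq) form is inferred from the tf lines elsewhere in the listing. Function names are hypothetical.

```python
import math


def tf(freq):
    # Assumed term-frequency factor: square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Assumed inverse document frequency; not printed in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight as shown in the explain output: tf * idf * fieldNorm.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


if __name__ == "__main__":
    # fieldNorm values taken from the five author matches listed above.
    for norm in (0.5, 0.5, 0.5, 0.375, 0.25):
        print(round(field_weight(1.0, 6, 44218, norm), 4))
```

Since every entry matches the single term "moffat" once, the ranking among the five depends only on fieldNorm.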
    

Similar documents (content)

  1. Cheng, K.-S.; Young, G.H.; Wong, K.-F.: A study on word-based and integral-bit Chinese text compression algorithms (1999) 0.33
    0.33067232 = sum of:
      0.33067232 = product of:
        1.3778014 = sum of:
          0.02652199 = weight(abstract_txt:text in 3056) [ClassicSimilarity], result of:
            0.02652199 = score(doc=3056,freq=1.0), product of:
              0.059964087 = queryWeight, product of:
                1.1855909 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.012507185 = queryNorm
              0.4422979 = fieldWeight in 3056, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.019490922 = weight(abstract_txt:based in 3056) [ClassicSimilarity], result of:
            0.019490922 = score(doc=3056,freq=1.0), product of:
              0.055899233 = queryWeight, product of:
                1.4019669 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.012507185 = queryNorm
              0.3486796 = fieldWeight in 3056, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.17484526 = weight(abstract_txt:coding in 3056) [ClassicSimilarity], result of:
            0.17484526 = score(doc=3056,freq=2.0), product of:
              0.16733229 = queryWeight, product of:
                1.9805194 = boost
                6.7552447 = idf(docFreq=139, maxDocs=44218)
                0.012507185 = queryNorm
              1.0448985 = fieldWeight in 3056, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.7552447 = idf(docFreq=139, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.13662034 = weight(abstract_txt:word in 3056) [ClassicSimilarity], result of:
            0.13662034 = score(doc=3056,freq=2.0), product of:
              0.16249917 = queryWeight, product of:
                2.3903441 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.012507185 = queryNorm
              0.84074485 = fieldWeight in 3056, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.42150536 = weight(abstract_txt:arithmetic in 3056) [ClassicSimilarity], result of:
            0.42150536 = score(doc=3056,freq=2.0), product of:
              0.30084714 = queryWeight, product of:
                2.655597 = boost
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.012507185 = queryNorm
              1.4010615 = fieldWeight in 3056, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.5988175 = weight(abstract_txt:compression in 3056) [ClassicSimilarity], result of:
            0.5988175 = score(doc=3056,freq=3.0), product of:
              0.41845915 = queryWeight, product of:
                4.4292555 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.012507185 = queryNorm
              1.431006 = fieldWeight in 3056, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
        0.24 = coord(6/25)
    
  2. Wan, R.; Moffat, A.: Block merging for off-line compression (2007) 0.17
    0.1678838 = sum of:
      0.1678838 = product of:
        0.839419 = sum of:
          0.092764415 = weight(abstract_txt:blocks in 81) [ClassicSimilarity], result of:
            0.092764415 = score(doc=81,freq=2.0), product of:
              0.10892815 = queryWeight, product of:
                1.1299111 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.012507185 = queryNorm
              0.851611 = fieldWeight in 81, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.078125 = fieldNorm(doc=81)
          0.01894428 = weight(abstract_txt:text in 81) [ClassicSimilarity], result of:
            0.01894428 = score(doc=81,freq=1.0), product of:
              0.059964087 = queryWeight, product of:
                1.1855909 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.012507185 = queryNorm
              0.3159271 = fieldWeight in 81, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=81)
          0.07221346 = weight(abstract_txt:mechanism in 81) [ClassicSimilarity], result of:
            0.07221346 = score(doc=81,freq=1.0), product of:
              0.14632481 = queryWeight, product of:
                1.8520308 = boost
                6.31699 = idf(docFreq=216, maxDocs=44218)
                0.012507185 = queryNorm
              0.49351484 = fieldWeight in 81, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.31699 = idf(docFreq=216, maxDocs=44218)
                0.078125 = fieldNorm(doc=81)
          0.24694818 = weight(abstract_txt:compression in 81) [ClassicSimilarity], result of:
            0.24694818 = score(doc=81,freq=1.0), product of:
              0.41845915 = queryWeight, product of:
                4.4292555 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.012507185 = queryNorm
              0.5901369 = fieldWeight in 81, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.078125 = fieldNorm(doc=81)
          0.40854862 = weight(abstract_txt:block in 81) [ClassicSimilarity], result of:
            0.40854862 = score(doc=81,freq=2.0), product of:
              0.46458837 = queryWeight, product of:
                4.667006 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.012507185 = queryNorm
              0.8793776 = fieldWeight in 81, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.078125 = fieldNorm(doc=81)
        0.2 = coord(5/25)
    
  3. Steinmetz, R.: Data compression in multimedia computing : principles and techniques (1994) 0.11
    0.11001917 = sum of:
      0.11001917 = product of:
        0.55009586 = sum of:
          0.08170972 = weight(abstract_txt:entropy in 8182) [ClassicSimilarity], result of:
            0.08170972 = score(doc=8182,freq=2.0), product of:
              0.11614709 = queryWeight, product of:
                1.1667515 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.012507185 = queryNorm
              0.70350206 = fieldWeight in 8182, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.0625 = fieldNorm(doc=8182)
          0.015155423 = weight(abstract_txt:text in 8182) [ClassicSimilarity], result of:
            0.015155423 = score(doc=8182,freq=1.0), product of:
              0.059964087 = queryWeight, product of:
                1.1855909 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.012507185 = queryNorm
              0.25274166 = fieldWeight in 8182, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=8182)
          0.01113767 = weight(abstract_txt:based in 8182) [ClassicSimilarity], result of:
            0.01113767 = score(doc=8182,freq=1.0), product of:
              0.055899233 = queryWeight, product of:
                1.4019669 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.012507185 = queryNorm
              0.19924548 = fieldWeight in 8182, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=8182)
          0.09991158 = weight(abstract_txt:coding in 8182) [ClassicSimilarity], result of:
            0.09991158 = score(doc=8182,freq=2.0), product of:
              0.16733229 = queryWeight, product of:
                1.9805194 = boost
                6.7552447 = idf(docFreq=139, maxDocs=44218)
                0.012507185 = queryNorm
              0.5970849 = fieldWeight in 8182, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.7552447 = idf(docFreq=139, maxDocs=44218)
                0.0625 = fieldNorm(doc=8182)
          0.34218144 = weight(abstract_txt:compression in 8182) [ClassicSimilarity], result of:
            0.34218144 = score(doc=8182,freq=3.0), product of:
              0.41845915 = queryWeight, product of:
                4.4292555 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.012507185 = queryNorm
              0.8177177 = fieldWeight in 8182, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.0625 = fieldNorm(doc=8182)
        0.2 = coord(5/25)
    
  4. Fersini, E.; Messina, E.; Archetti, F.: Enhancing web page classification through image-block importance analysis (2008) 0.10
    0.1034955 = sum of:
      0.1034955 = product of:
        0.6468469 = sum of:
          0.11361275 = weight(abstract_txt:blocks in 2102) [ClassicSimilarity], result of:
            0.11361275 = score(doc=2102,freq=3.0), product of:
              0.10892815 = queryWeight, product of:
                1.1299111 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.012507185 = queryNorm
              1.0430063 = fieldWeight in 2102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.078125 = fieldNorm(doc=2102)
          0.01894428 = weight(abstract_txt:text in 2102) [ClassicSimilarity], result of:
            0.01894428 = score(doc=2102,freq=1.0), product of:
              0.059964087 = queryWeight, product of:
                1.1855909 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.012507185 = queryNorm
              0.3159271 = fieldWeight in 2102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=2102)
          0.013922087 = weight(abstract_txt:based in 2102) [ClassicSimilarity], result of:
            0.013922087 = score(doc=2102,freq=1.0), product of:
              0.055899233 = queryWeight, product of:
                1.4019669 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.012507185 = queryNorm
              0.24905685 = fieldWeight in 2102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.078125 = fieldNorm(doc=2102)
          0.50036776 = weight(abstract_txt:block in 2102) [ClassicSimilarity], result of:
            0.50036776 = score(doc=2102,freq=3.0), product of:
              0.46458837 = queryWeight, product of:
                4.667006 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.012507185 = queryNorm
              1.0770131 = fieldWeight in 2102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.078125 = fieldNorm(doc=2102)
        0.16 = coord(4/25)
    
  5. Tsai, R.T.-H.; Chiu, B.; Wu, C.-E.: Visual webpage block importance prediction using conditional random fields (2011) 0.10
    0.09754839 = sum of:
      0.09754839 = product of:
        0.60967743 = sum of:
          0.07275345 = weight(abstract_txt:sequence in 4924) [ClassicSimilarity], result of:
            0.07275345 = score(doc=4924,freq=4.0), product of:
              0.085320145 = queryWeight, product of:
                6.82169 = idf(docFreq=130, maxDocs=44218)
                0.012507185 = queryNorm
              0.85271126 = fieldWeight in 4924, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.82169 = idf(docFreq=130, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
          0.11733873 = weight(abstract_txt:blocks in 4924) [ClassicSimilarity], result of:
            0.11733873 = score(doc=4924,freq=5.0), product of:
              0.10892815 = queryWeight, product of:
                1.1299111 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.012507185 = queryNorm
              1.0772122 = fieldWeight in 4924, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
          0.01929101 = weight(abstract_txt:based in 4924) [ClassicSimilarity], result of:
            0.01929101 = score(doc=4924,freq=3.0), product of:
              0.055899233 = queryWeight, product of:
                1.4019669 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.012507185 = queryNorm
              0.3451033 = fieldWeight in 4924, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
          0.40029424 = weight(abstract_txt:block in 4924) [ClassicSimilarity], result of:
            0.40029424 = score(doc=4924,freq=3.0), product of:
              0.46458837 = queryWeight, product of:
                4.667006 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.012507185 = queryNorm
              0.86161053 = fieldWeight in 4924, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.0625 = fieldNorm(doc=4924)
        0.16 = coord(4/25)
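
The content scores combine per-term products with a coordination factor. As a rough check, the sketch below recomputes the total for the first content match (doc=3056) from the boost, docFreq and freq values printed in its tree. The idf formula is again an assumption, 1 + ln(maxDocs / (docFreq + 1)), chosen because it reproduces the listed idf values. The function and variable names are hypothetical.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.012507185


def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed idf formula; not printed in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq, field_norm):
    # score = queryWeight * fieldWeight, following the explain tree.
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, matched, total_query_terms):
    # Sum of per-term scores, scaled by coord = matched / total.
    raw = sum(term_score(b, df, f, field_norm) for b, df, f in terms)
    return raw * matched / total_query_terms


# (boost, docFreq, freq) for the six query terms matched in doc 3056.
DOC_3056_TERMS = [
    (1.1855909, 2106, 1.0),  # text
    (1.4019669, 4958, 1.0),  # based
    (1.9805194, 139, 2.0),   # coding
    (2.3903441, 523, 2.0),   # word
    (2.655597, 13, 2.0),     # arithmetic
    (4.4292555, 62, 3.0),    # compression
]

if __name__ == "__main__":
    print(document_score(DOC_3056_TERMS, 0.109375, 6, 25))
```

The coord factor (6/25 here) is why a document matching only a few of the 25 query terms scores low even when its individual term weights are high.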