Document (#29995)

Author
Adiego, J.
Navarro, G.
Fuente, P. de la
Title
Lempel-Ziv compression of highly structured documents
Source
Journal of the American Society for Information Science and Technology. 58(2007) no.4, pp.461-478
Year
2007
Abstract
The authors describe Lempel-Ziv to Compress Structure (LZCS), a novel Lempel-Ziv approach suitable for compressing structured documents. LZCS takes advantage of repeated substructures that may appear in the documents by replacing them with a backward reference to their previous occurrence. The result of the LZCS transformation is still a valid structured document, which is human-readable and can be transmitted over ASCII channels. Moreover, LZCS-transformed documents are easy to search, display, access at random, and navigate. In a second stage, the transformed documents can be further compressed using any semistatic technique, so that it is still possible to do all those operations efficiently, or with any adaptive technique to boost compression. LZCS is especially efficient in the compression of collections of highly structured data, such as extensible markup language (XML) forms, invoices, e-commerce, and Web-service exchange documents. The comparison with other structure-aware and standard compressors shows that LZCS is a competitive choice for this type of document, whereas the others are not well-suited to support navigation or random access. When joined to an adaptive compressor, LZCS obtains by far the best compression ratios.
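
The core idea in the abstract, replacing a repeated substructure with a backward reference while keeping the result a valid structured document, can be illustrated with a small sketch. This is not the authors' LZCS algorithm or its reference syntax. The `ref` tag, the `to` attribute, the visit-order numbering, and the function names `compress_structure` and `expand_structure` are all assumptions made for illustration, and the sketch replaces every repeated subtree regardless of size.

```python
import copy
import xml.etree.ElementTree as ET

REF_TAG = "ref"  # hypothetical tag name; assumed not to occur in the input


def canon(elem):
    """Hashable description of a subtree: tag, attributes, text, children."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        elem.text or "",
        tuple((canon(c), c.tail or "") for c in elem),
    )


def compress_structure(root):
    """Replace each repeated subtree with <ref to="N"/>, where N is the
    visit-order number of its first occurrence (modifies root in place)."""
    seen = {}
    counter = 0

    def visit(elem):
        nonlocal counter
        for i, child in enumerate(list(elem)):
            counter += 1
            key = canon(child)
            if key in seen:
                ref = ET.Element(REF_TAG, {"to": str(seen[key])})
                ref.tail = child.tail
                elem[i] = ref          # backward reference, subtree dropped
            else:
                seen[key] = counter
                visit(child)

    visit(root)
    return root


def expand_structure(root):
    """Inverse of compress_structure: resolve every backward reference."""
    by_id = {}
    counter = 0

    def visit(elem):
        nonlocal counter
        for i, child in enumerate(list(elem)):
            counter += 1
            if child.tag == REF_TAG:
                clone = copy.deepcopy(by_id[int(child.get("to"))])
                clone.tail = child.tail
                elem[i] = clone
            else:
                by_id[counter] = child
                visit(child)

    visit(root)
    return root


DOC = (
    "<invoices>"
    "<invoice><customer>ACME</customer><item>bolt</item></invoice>"
    "<invoice><customer>ACME</customer><item>bolt</item></invoice>"
    "</invoices>"
)

if __name__ == "__main__":
    encoded = compress_structure(ET.fromstring(DOC))
    print(ET.tostring(encoded, encoding="unicode"))
```

The transformed tree is still well-formed XML, so it can be parsed, searched, and navigated with ordinary tools, and a reference can be resolved on demand without decoding the rest of the document.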

Similar documents (author)

  1. Navarro, M.A.E. -> Esteban Navarro, M.A.: 5.39
    5.3942666 = sum of:
      5.3942666 = weight(author_txt:navarro in 2822) [ClassicSimilarity], result of:
        5.3942666 = fieldWeight in 2822, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.7184515 = idf(docFreq=18, maxDocs=42740)
          0.4375 = fieldNorm(doc=2822)
    
  2. Molina, C. Navarro- -> Navarro-Molina, C.: 4.62
    4.623657 = sum of:
      4.623657 = weight(author_txt:navarro in 946) [ClassicSimilarity], result of:
        4.623657 = fieldWeight in 946, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.7184515 = idf(docFreq=18, maxDocs=42740)
          0.375 = fieldNorm(doc=946)
    
  3. Navarro, M.A. Esteban -> Esteban Navarro, M.A.: 4.62
    4.623657 = sum of:
      4.623657 = weight(author_txt:navarro in 2552) [ClassicSimilarity], result of:
        4.623657 = fieldWeight in 2552, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.7184515 = idf(docFreq=18, maxDocs=42740)
          0.375 = fieldNorm(doc=2552)
    
  4. Esteban Navarro, M.A.: Aplicaciones de la terminologia para la docencia de la gestion de lenguajes documentales (1995) 4.36
    4.3592257 = sum of:
      4.3592257 = weight(author_txt:navarro in 5498) [ClassicSimilarity], result of:
        4.3592257 = fieldWeight in 5498, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.7184515 = idf(docFreq=18, maxDocs=42740)
          0.5 = fieldNorm(doc=5498)
    
  5. Esteban Navarro, M.A.: Fundamentos epistemologicos de la classificacion documental (1995) 4.36
    4.3592257 = sum of:
      4.3592257 = weight(author_txt:navarro in 5616) [ClassicSimilarity], result of:
        4.3592257 = fieldWeight in 5616, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.7184515 = idf(docFreq=18, maxDocs=42740)
          0.5 = fieldNorm(doc=5616)
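
The author scores above follow the pattern fieldWeight = tf x idf x fieldNorm shown in each tree. The sketch below recomputes them, assuming the classic Lucene TF-IDF definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are illustrative only, and small differences from the listed values come from single-precision arithmetic in the original output.

```python
import math


def tf(freq):
    # Classic TF-IDF term-frequency factor: square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Classic TF-IDF inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight as broken down in the explain trees above.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


if __name__ == "__main__":
    # (label, termFreq, fieldNorm) taken from results 1, 2 and 4 above.
    rows = [
        ("doc 2822", 2.0, 0.4375),
        ("doc 946", 2.0, 0.375),
        ("doc 5498", 1.0, 0.5),
    ]
    for label, freq, norm in rows:
        score = field_weight(freq, doc_freq=18, max_docs=42740, field_norm=norm)
        print(label, round(score, 4))
```

Because all five results match the same term with the same idf, the ranking is decided entirely by term frequency and field length normalization.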
    

Similar documents (content)

  1. Cannane, A.; Williams, H.E.: General-purpose compression for efficient retrieval (2001) 0.37
    0.3701502 = sum of:
      0.3701502 = product of:
        1.321965 = sum of:
          0.017611515 = weight(abstract_txt:access in 621) [ClassicSimilarity], result of:
            0.017611515 = score(doc=621,freq=1.0), product of:
              0.061734956 = queryWeight, product of:
                1.0013201 = boost
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.016884284 = queryNorm
              0.2852762 = fieldWeight in 621, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.078125 = fieldNorm(doc=621)
          0.0086583095 = weight(abstract_txt:with in 621) [ClassicSimilarity], result of:
            0.0086583095 = score(doc=621,freq=1.0), product of:
              0.04402025 = queryWeight, product of:
                1.0355695 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.016884284 = queryNorm
              0.19668923 = fieldWeight in 621, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.078125 = fieldNorm(doc=621)
          0.063256785 = weight(abstract_txt:technique in 621) [ClassicSimilarity], result of:
            0.063256785 = score(doc=621,freq=1.0), product of:
              0.14478984 = queryWeight, product of:
                1.5334741 = boost
                5.5921526 = idf(docFreq=432, maxDocs=42740)
                0.016884284 = queryNorm
              0.4368869 = fieldWeight in 621, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5921526 = idf(docFreq=432, maxDocs=42740)
                0.078125 = fieldNorm(doc=621)
          0.10113221 = weight(abstract_txt:random in 621) [ClassicSimilarity], result of:
            0.10113221 = score(doc=621,freq=1.0), product of:
              0.19796708 = queryWeight, product of:
                1.7930974 = boost
                6.5389266 = idf(docFreq=167, maxDocs=42740)
                0.016884284 = queryNorm
              0.51085365 = fieldWeight in 621, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5389266 = idf(docFreq=167, maxDocs=42740)
                0.078125 = fieldNorm(doc=621)
          0.12021198 = weight(abstract_txt:adaptive in 621) [ClassicSimilarity], result of:
            0.12021198 = score(doc=621,freq=1.0), product of:
              0.22214259 = queryWeight, product of:
                1.89943 = boost
                6.926692 = idf(docFreq=113, maxDocs=42740)
                0.016884284 = queryNorm
              0.5411478 = fieldWeight in 621, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.926692 = idf(docFreq=113, maxDocs=42740)
                0.078125 = fieldNorm(doc=621)
          0.08824107 = weight(abstract_txt:documents in 621) [ClassicSimilarity], result of:
            0.08824107 = score(doc=621,freq=1.0), product of:
              0.27445418 = queryWeight, product of:
                3.9498112 = boost
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.016884284 = queryNorm
              0.32151476 = fieldWeight in 621, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.078125 = fieldNorm(doc=621)
          0.9228531 = weight(abstract_txt:compression in 621) [ClassicSimilarity], result of:
            0.9228531 = score(doc=621,freq=9.0), product of:
              0.5236216 = queryWeight, product of:
                4.124119 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.016884284 = queryNorm
              1.7624427 = fieldWeight in 621, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.078125 = fieldNorm(doc=621)
        0.28 = coord(7/25)
    
  2. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.13
    0.12833722 = sum of:
      0.12833722 = product of:
        0.6416861 = sum of:
          0.017611515 = weight(abstract_txt:access in 2717) [ClassicSimilarity], result of:
            0.017611515 = score(doc=2717,freq=1.0), product of:
              0.061734956 = queryWeight, product of:
                1.0013201 = boost
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.016884284 = queryNorm
              0.2852762 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
          0.12708361 = weight(abstract_txt:compressed in 2717) [ClassicSimilarity], result of:
            0.12708361 = score(doc=2717,freq=1.0), product of:
              0.18297133 = queryWeight, product of:
                1.2189444 = boost
                8.890302 = idf(docFreq=15, maxDocs=42740)
                0.016884284 = queryNorm
              0.6945548 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.890302 = idf(docFreq=15, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
          0.10113221 = weight(abstract_txt:random in 2717) [ClassicSimilarity], result of:
            0.10113221 = score(doc=2717,freq=1.0), product of:
              0.19796708 = queryWeight, product of:
                1.7930974 = boost
                6.5389266 = idf(docFreq=167, maxDocs=42740)
                0.016884284 = queryNorm
              0.51085365 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5389266 = idf(docFreq=167, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
          0.08824107 = weight(abstract_txt:documents in 2717) [ClassicSimilarity], result of:
            0.08824107 = score(doc=2717,freq=1.0), product of:
              0.27445418 = queryWeight, product of:
                3.9498112 = boost
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.016884284 = queryNorm
              0.32151476 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
          0.3076177 = weight(abstract_txt:compression in 2717) [ClassicSimilarity], result of:
            0.3076177 = score(doc=2717,freq=1.0), product of:
              0.5236216 = queryWeight, product of:
                4.124119 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.016884284 = queryNorm
              0.5874809 = fieldWeight in 2717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.078125 = fieldNorm(doc=2717)
        0.2 = coord(5/25)
    
  3. Gillman, P.: Data handling and text compression (1992) 0.12
    0.12243871 = sum of:
      0.12243871 = product of:
        0.6121935 = sum of:
          0.0069266474 = weight(abstract_txt:with in 5306) [ClassicSimilarity], result of:
            0.0069266474 = score(doc=5306,freq=1.0), product of:
              0.04402025 = queryWeight, product of:
                1.0355695 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.016884284 = queryNorm
              0.15735139 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0625 = fieldNorm(doc=5306)
          0.07478504 = weight(abstract_txt:ascii in 5306) [ClassicSimilarity], result of:
            0.07478504 = score(doc=5306,freq=1.0), product of:
              0.14909847 = queryWeight, product of:
                1.1003453 = boost
                8.025305 = idf(docFreq=37, maxDocs=42740)
                0.016884284 = queryNorm
              0.50158155 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.025305 = idf(docFreq=37, maxDocs=42740)
                0.0625 = fieldNorm(doc=5306)
          0.111859284 = weight(abstract_txt:compressing in 5306) [ClassicSimilarity], result of:
            0.111859284 = score(doc=5306,freq=1.0), product of:
              0.19500454 = queryWeight, product of:
                1.2583885 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.016884284 = queryNorm
              0.573624 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.0625 = fieldNorm(doc=5306)
          0.07059285 = weight(abstract_txt:documents in 5306) [ClassicSimilarity], result of:
            0.07059285 = score(doc=5306,freq=1.0), product of:
              0.27445418 = queryWeight, product of:
                3.9498112 = boost
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.016884284 = queryNorm
              0.2572118 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.0625 = fieldNorm(doc=5306)
          0.3480297 = weight(abstract_txt:compression in 5306) [ClassicSimilarity], result of:
            0.3480297 = score(doc=5306,freq=2.0), product of:
              0.5236216 = queryWeight, product of:
                4.124119 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.016884284 = queryNorm
              0.6646588 = fieldWeight in 5306, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.0625 = fieldNorm(doc=5306)
        0.2 = coord(5/25)
    
  4. Nomoto, T.: Discriminative sentence compression with conditional random fields (2007) 0.11
    0.108228296 = sum of:
      0.108228296 = product of:
        0.6764269 = sum of:
          0.0122446995 = weight(abstract_txt:with in 2946) [ClassicSimilarity], result of:
            0.0122446995 = score(doc=2946,freq=2.0), product of:
              0.04402025 = queryWeight, product of:
                1.0355695 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.016884284 = queryNorm
              0.2781606 = fieldWeight in 2946, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.078125 = fieldNorm(doc=2946)
          0.030240478 = weight(abstract_txt:structure in 2946) [ClassicSimilarity], result of:
            0.030240478 = score(doc=2946,freq=1.0), product of:
              0.08852361 = queryWeight, product of:
                1.199049 = boost
                4.3725977 = idf(docFreq=1465, maxDocs=42740)
                0.016884284 = queryNorm
              0.34160918 = fieldWeight in 2946, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3725977 = idf(docFreq=1465, maxDocs=42740)
                0.078125 = fieldNorm(doc=2946)
          0.10113221 = weight(abstract_txt:random in 2946) [ClassicSimilarity], result of:
            0.10113221 = score(doc=2946,freq=1.0), product of:
              0.19796708 = queryWeight, product of:
                1.7930974 = boost
                6.5389266 = idf(docFreq=167, maxDocs=42740)
                0.016884284 = queryNorm
              0.51085365 = fieldWeight in 2946, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5389266 = idf(docFreq=167, maxDocs=42740)
                0.078125 = fieldNorm(doc=2946)
          0.5328095 = weight(abstract_txt:compression in 2946) [ClassicSimilarity], result of:
            0.5328095 = score(doc=2946,freq=3.0), product of:
              0.5236216 = queryWeight, product of:
                4.124119 = boost
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.016884284 = queryNorm
              1.0175468 = fieldWeight in 2946, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.519756 = idf(docFreq=62, maxDocs=42740)
                0.078125 = fieldNorm(doc=2946)
        0.16 = coord(4/25)
    
  5. Lalmas, M.: XML information retrieval (2009) 0.11
    0.106890045 = sum of:
      0.106890045 = product of:
        0.53445023 = sum of:
          0.08420136 = weight(abstract_txt:extensible in 881) [ClassicSimilarity], result of:
            0.08420136 = score(doc=881,freq=1.0), product of:
              0.12314456 = queryWeight, product of:
                7.2934427 = idf(docFreq=78, maxDocs=42740)
                0.016884284 = queryNorm
              0.6837603 = fieldWeight in 881, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2934427 = idf(docFreq=78, maxDocs=42740)
                0.09375 = fieldNorm(doc=881)
          0.021133818 = weight(abstract_txt:access in 881) [ClassicSimilarity], result of:
            0.021133818 = score(doc=881,freq=1.0), product of:
              0.061734956 = queryWeight, product of:
                1.0013201 = boost
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.016884284 = queryNorm
              0.34233147 = fieldWeight in 881, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.09375 = fieldNorm(doc=881)
          0.051319797 = weight(abstract_txt:structure in 881) [ClassicSimilarity], result of:
            0.051319797 = score(doc=881,freq=2.0), product of:
              0.08852361 = queryWeight, product of:
                1.199049 = boost
                4.3725977 = idf(docFreq=1465, maxDocs=42740)
                0.016884284 = queryNorm
              0.57973003 = fieldWeight in 881, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.3725977 = idf(docFreq=1465, maxDocs=42740)
                0.09375 = fieldNorm(doc=881)
          0.14101963 = weight(abstract_txt:structured in 881) [ClassicSimilarity], result of:
            0.14101963 = score(doc=881,freq=1.0), product of:
              0.2756823 = queryWeight, product of:
                2.9924493 = boost
                5.4563146 = idf(docFreq=495, maxDocs=42740)
                0.016884284 = queryNorm
              0.5115295 = fieldWeight in 881, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4563146 = idf(docFreq=495, maxDocs=42740)
                0.09375 = fieldNorm(doc=881)
          0.23677564 = weight(abstract_txt:documents in 881) [ClassicSimilarity], result of:
            0.23677564 = score(doc=881,freq=5.0), product of:
              0.27445418 = queryWeight, product of:
                3.9498112 = boost
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.016884284 = queryNorm
              0.86271465 = fieldWeight in 881, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.09375 = fieldNorm(doc=881)
        0.2 = coord(5/25)
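
The content scores combine per-term contributions as coord x sum(queryWeight x fieldWeight), with queryWeight = boost x idf x queryNorm and fieldWeight = tf x idf x fieldNorm. The sketch below reproduces the first result (doc 621) from the numbers in its tree, assuming the classic Lucene definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function and variable names are illustrative, not part of any library API.

```python
import math

MAX_DOCS = 42740
QUERY_NORM = 0.016884284


def idf(doc_freq, max_docs=MAX_DOCS):
    # Classic TF-IDF inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq, field_norm):
    # One "weight(abstract_txt:term ...)" node of the explain tree.
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, matched, query_terms):
    # coord rewards documents that match more of the query terms.
    coord = matched / query_terms
    return coord * sum(term_score(b, df, f, field_norm) for b, df, f in terms)


# (boost, docFreq, termFreq) for the seven terms matched in doc 621.
DOC_621_TERMS = [
    (1.0013201, 3014, 1.0),  # access
    (1.0355695, 9369, 1.0),  # with
    (1.5334741, 432, 1.0),   # technique
    (1.7930974, 167, 1.0),   # random
    (1.89943, 113, 1.0),     # adaptive
    (3.9498112, 1895, 1.0),  # documents
    (4.124119, 62, 9.0),     # compression
]

if __name__ == "__main__":
    score = document_score(DOC_621_TERMS, field_norm=0.078125,
                           matched=7, query_terms=25)
    print(round(score, 4))
```

The term "compression" dominates the result because it combines a high boost, a rare term (docFreq=62), and nine occurrences in the matched abstract.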