Document (#29994)

Author
Adiego, J.
Navarro, G.
Fuente, P. de la
Title
Lempel-Ziv compression of highly structured documents
Source
Journal of the American Society for Information Science and Technology. 58(2007) no.4, p.461-478
Year
2007
Abstract
The authors describe Lempel-Ziv to Compress Structure (LZCS), a novel Lempel-Ziv approach suitable for compressing structured documents. LZCS takes advantage of repeated substructures that may appear in the documents by replacing them with a backward reference to their previous occurrence. The result of the LZCS transformation is still a valid structured document, which is human-readable and can be transmitted over ASCII channels. Moreover, LZCS-transformed documents are easy to search, display, access at random, and navigate. In a second stage, the transformed documents can be further compressed, either with any semistatic technique, so that all those operations can still be performed efficiently, or with any adaptive technique to boost compression. LZCS is especially efficient in the compression of collections of highly structured data, such as extensible markup language (XML) forms, invoices, e-commerce, and Web-service exchange documents. The comparison with other structure-aware and standard compressors shows that LZCS is a competitive choice for this type of document, whereas the others are not well suited to support navigation or random access. When joined to an adaptive compressor, LZCS obtains by far the best compression ratios.
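
The idea of replacing a repeated substructure with a backward reference can be illustrated with a toy sketch. This is not the authors' LZCS format or algorithm. The function name `lzcs_like_transform`, the `ref` element, and its `to` attribute are hypothetical choices made for illustration. The sketch assumes an XML input and records, for each repeated subtree, the preorder position of its first occurrence in the transformed document.

```python
import xml.etree.ElementTree as ET

def lzcs_like_transform(xml_text):
    """Toy illustration only: replace repeated subtrees by backward references.

    The <ref to="N"/> element is a hypothetical notation, not the LZCS format.
    N is the preorder position (root = 0) of the first occurrence in the output.
    """
    root = ET.fromstring(xml_text)
    seen = {}        # serialized subtree -> position of its first occurrence
    position = 0     # preorder position in the transformed document

    def visit(node):
        nonlocal position
        for i, child in enumerate(list(node)):
            position += 1
            # Serialize the subtree without its trailing text to get a key.
            tail, child.tail = child.tail, None
            key = ET.tostring(child, encoding="unicode")
            child.tail = tail
            if key in seen:
                # Repeated subtree: emit a backward reference instead.
                ref = ET.Element("ref", {"to": str(seen[key])})
                ref.tail = tail
                node[i] = ref
            else:
                seen[key] = position
                visit(child)

    visit(root)
    return ET.tostring(root, encoding="unicode")

sample = (
    "<orders>"
    "<order><item>pen</item><qty>1</qty></order>"
    "<order><item>pen</item><qty>1</qty></order>"
    "</orders>"
)
transformed = lzcs_like_transform(sample)
print(transformed)
```

The output of this sketch is still parseable XML and is shorter than the input whenever a subtree repeats, which mirrors the property the abstract claims for LZCS at a very small scale.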

Similar documents (author)

  1. Navarro, M.A.E. -> Esteban Navarro, M.A.: 5.42
    5.415301 = sum of:
      5.415301 = weight(author_txt:navarro in 2822) [ClassicSimilarity], result of:
        5.415301 = fieldWeight in 2822, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.752448 = idf(docFreq=18, maxDocs=44218)
          0.4375 = fieldNorm(doc=2822)
    
  2. Molina, C. Navarro- -> Navarro-Molina, C.: 4.64
    4.6416864 = sum of:
      4.6416864 = weight(author_txt:navarro in 946) [ClassicSimilarity], result of:
        4.6416864 = fieldWeight in 946, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.752448 = idf(docFreq=18, maxDocs=44218)
          0.375 = fieldNorm(doc=946)
    
  3. Navarro, M.A. Esteban -> Esteban Navarro, M.A.: 4.64
    4.6416864 = sum of:
      4.6416864 = weight(author_txt:navarro in 2552) [ClassicSimilarity], result of:
        4.6416864 = fieldWeight in 2552, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.752448 = idf(docFreq=18, maxDocs=44218)
          0.375 = fieldNorm(doc=2552)
    
  4. Esteban Navarro, M.A.: Aplicaciones de la terminologia para la docencia de la gestion de lenguajes documentales (1995) 4.38
    4.376224 = sum of:
      4.376224 = weight(author_txt:navarro in 5429) [ClassicSimilarity], result of:
        4.376224 = fieldWeight in 5429, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.752448 = idf(docFreq=18, maxDocs=44218)
          0.5 = fieldNorm(doc=5429)
    
  5. Esteban Navarro, M.A.: Fundamentos epistemologicos de la classificacion documental (1995) 4.38
    4.376224 = sum of:
      4.376224 = weight(author_txt:navarro in 5547) [ClassicSimilarity], result of:
        4.376224 = fieldWeight in 5547, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.752448 = idf(docFreq=18, maxDocs=44218)
          0.5 = fieldNorm(doc=5547)
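
The author scores above follow the classic Lucene TF-IDF field weight: tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1)), and the field weight is tf * idf * fieldNorm. A minimal sketch that recomputes the listed values is given here. The function names are illustrative and are not part of the Lucene API. Small differences in the last digits are expected because Lucene works in single precision.

```python
import math

def classic_tf(freq):
    # ClassicSimilarity term-frequency factor: sqrt(freq)
    return math.sqrt(freq)

def classic_idf(doc_freq, max_docs):
    # ClassicSimilarity inverse document frequency
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as in the explain trees above
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Values taken from the first and fourth author hits listed above.
first_hit = field_weight(freq=2.0, doc_freq=18, max_docs=44218, field_norm=0.4375)
fourth_hit = field_weight(freq=1.0, doc_freq=18, max_docs=44218, field_norm=0.5)
print(first_hit, fourth_hit)
```

The two results should agree with the reported 5.415301 and 4.376224 to about four decimal places.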
    

Similar documents (content)

  1. Cannane, A.; Williams, H.E.: General-purpose compression for efficient retrieval (2001) 0.37
    0.37386075 = sum of:
      0.37386075 = product of:
        1.335217 = sum of:
          0.017603677 = weight(abstract_txt:access in 5705) [ClassicSimilarity], result of:
            0.017603677 = score(doc=5705,freq=1.0), product of:
              0.0617169 = queryWeight, product of:
                1.0016172 = boost
                3.6509786 = idf(docFreq=3120, maxDocs=44218)
                0.016876914 = queryNorm
              0.2852327 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6509786 = idf(docFreq=3120, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.008475158 = weight(abstract_txt:with in 5705) [ClassicSimilarity], result of:
            0.008475158 = score(doc=5705,freq=1.0), product of:
              0.04339744 = queryWeight, product of:
                1.0286732 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.016876914 = queryNorm
              0.19529167 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.06317996 = weight(abstract_txt:technique in 5705) [ClassicSimilarity], result of:
            0.06317996 = score(doc=5705,freq=1.0), product of:
              0.14467318 = queryWeight, product of:
                1.5335351 = boost
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.016876914 = queryNorm
              0.43670815 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.100817114 = weight(abstract_txt:random in 5705) [ClassicSimilarity], result of:
            0.100817114 = score(doc=5705,freq=1.0), product of:
              0.1975565 = queryWeight, product of:
                1.7920305 = boost
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.016876914 = queryNorm
              0.5103204 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.121080086 = weight(abstract_txt:adaptive in 5705) [ClassicSimilarity], result of:
            0.121080086 = score(doc=5705,freq=1.0), product of:
              0.2232117 = queryWeight, product of:
                1.9048387 = boost
                6.943297 = idf(docFreq=115, maxDocs=44218)
                0.016876914 = queryNorm
              0.54244506 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.943297 = idf(docFreq=115, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.08862314 = weight(abstract_txt:documents in 5705) [ClassicSimilarity], result of:
            0.08862314 = score(doc=5705,freq=1.0), product of:
              0.27524698 = queryWeight, product of:
                3.9572637 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016876914 = queryNorm
              0.32197678 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.9354379 = weight(abstract_txt:compression in 5705) [ClassicSimilarity], result of:
            0.9354379 = score(doc=5705,freq=9.0), product of:
              0.5283734 = queryWeight, product of:
                4.1446247 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.016876914 = queryNorm
              1.7704107 = fieldWeight in 5705, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
        0.28 = coord(7/25)
    
  2. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.13
    0.12948106 = sum of:
      0.12948106 = product of:
        0.64740527 = sum of:
          0.017603677 = weight(abstract_txt:access in 2648) [ClassicSimilarity], result of:
            0.017603677 = score(doc=2648,freq=1.0), product of:
              0.0617169 = queryWeight, product of:
                1.0016172 = boost
                3.6509786 = idf(docFreq=3120, maxDocs=44218)
                0.016876914 = queryNorm
              0.2852327 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6509786 = idf(docFreq=3120, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.12854873 = weight(abstract_txt:compressed in 2648) [ClassicSimilarity], result of:
            0.12854873 = score(doc=2648,freq=1.0), product of:
              0.1843757 = queryWeight, product of:
                1.2241554 = boost
                8.924298 = idf(docFreq=15, maxDocs=44218)
                0.016876914 = queryNorm
              0.6972108 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.924298 = idf(docFreq=15, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.100817114 = weight(abstract_txt:random in 2648) [ClassicSimilarity], result of:
            0.100817114 = score(doc=2648,freq=1.0), product of:
              0.1975565 = queryWeight, product of:
                1.7920305 = boost
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.016876914 = queryNorm
              0.5103204 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.08862314 = weight(abstract_txt:documents in 2648) [ClassicSimilarity], result of:
            0.08862314 = score(doc=2648,freq=1.0), product of:
              0.27524698 = queryWeight, product of:
                3.9572637 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016876914 = queryNorm
              0.32197678 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.31181264 = weight(abstract_txt:compression in 2648) [ClassicSimilarity], result of:
            0.31181264 = score(doc=2648,freq=1.0), product of:
              0.5283734 = queryWeight, product of:
                4.1446247 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.016876914 = queryNorm
              0.5901369 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
        0.2 = coord(5/25)
    
  3. Gillman, P.: Data handling and text compression (1992) 0.12
    0.12386062 = sum of:
      0.12386062 = product of:
        0.6193031 = sum of:
          0.006780127 = weight(abstract_txt:with in 5306) [ClassicSimilarity], result of:
            0.006780127 = score(doc=5306,freq=1.0), product of:
              0.04339744 = queryWeight, product of:
                1.0286732 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.016876914 = queryNorm
              0.15623334 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
          0.07574042 = weight(abstract_txt:ascii in 5306) [ClassicSimilarity], result of:
            0.07574042 = score(doc=5306,freq=1.0), product of:
              0.15036622 = queryWeight, product of:
                1.1055028 = boost
                8.059301 = idf(docFreq=37, maxDocs=44218)
                0.016876914 = queryNorm
              0.50370634 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.059301 = idf(docFreq=37, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
          0.11310834 = weight(abstract_txt:compressing in 5306) [ClassicSimilarity], result of:
            0.11310834 = score(doc=5306,freq=1.0), product of:
              0.19645432 = queryWeight, product of:
                1.2636172 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.016876914 = queryNorm
              0.5757488 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
          0.0708985 = weight(abstract_txt:documents in 5306) [ClassicSimilarity], result of:
            0.0708985 = score(doc=5306,freq=1.0), product of:
              0.27524698 = queryWeight, product of:
                3.9572637 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016876914 = queryNorm
              0.2575814 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
          0.35277575 = weight(abstract_txt:compression in 5306) [ClassicSimilarity], result of:
            0.35277575 = score(doc=5306,freq=2.0), product of:
              0.5283734 = queryWeight, product of:
                4.1446247 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.016876914 = queryNorm
              0.6676637 = fieldWeight in 5306, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
        0.2 = coord(5/25)
    
  4. Nomoto, T.: Discriminative sentence compression with conditional random fields (2007) 0.11
    0.10925075 = sum of:
      0.10925075 = product of:
        0.68281716 = sum of:
          0.011985685 = weight(abstract_txt:with in 945) [ClassicSimilarity], result of:
            0.011985685 = score(doc=945,freq=2.0), product of:
              0.04339744 = queryWeight, product of:
                1.0286732 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.016876914 = queryNorm
              0.27618414 = fieldWeight in 945, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.078125 = fieldNorm(doc=945)
          0.029938981 = weight(abstract_txt:structure in 945) [ClassicSimilarity], result of:
            0.029938981 = score(doc=945,freq=1.0), product of:
              0.087934606 = queryWeight, product of:
                1.1955827 = boost
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.016876914 = queryNorm
              0.3404687 = fieldWeight in 945, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.078125 = fieldNorm(doc=945)
          0.100817114 = weight(abstract_txt:random in 945) [ClassicSimilarity], result of:
            0.100817114 = score(doc=945,freq=1.0), product of:
              0.1975565 = queryWeight, product of:
                1.7920305 = boost
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.016876914 = queryNorm
              0.5103204 = fieldWeight in 945, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.078125 = fieldNorm(doc=945)
          0.54007536 = weight(abstract_txt:compression in 945) [ClassicSimilarity], result of:
            0.54007536 = score(doc=945,freq=3.0), product of:
              0.5283734 = queryWeight, product of:
                4.1446247 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.016876914 = queryNorm
              1.022147 = fieldWeight in 945, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.078125 = fieldNorm(doc=945)
        0.16 = coord(4/25)
    
  5. Lalmas, M.: XML information retrieval (2009) 0.11
    0.10699332 = sum of:
      0.10699332 = product of:
        0.5349666 = sum of:
          0.021124415 = weight(abstract_txt:access in 3880) [ClassicSimilarity], result of:
            0.021124415 = score(doc=3880,freq=1.0), product of:
              0.0617169 = queryWeight, product of:
                1.0016172 = boost
                3.6509786 = idf(docFreq=3120, maxDocs=44218)
                0.016876914 = queryNorm
              0.34227926 = fieldWeight in 3880, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6509786 = idf(docFreq=3120, maxDocs=44218)
                0.09375 = fieldNorm(doc=3880)
          0.085385375 = weight(abstract_txt:extensible in 3880) [ClassicSimilarity], result of:
            0.085385375 = score(doc=3880,freq=1.0), product of:
              0.1242968 = queryWeight, product of:
                1.0051125 = boost
                7.3274393 = idf(docFreq=78, maxDocs=44218)
                0.016876914 = queryNorm
              0.68694746 = fieldWeight in 3880, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.3274393 = idf(docFreq=78, maxDocs=44218)
                0.09375 = fieldNorm(doc=3880)
          0.050808135 = weight(abstract_txt:structure in 3880) [ClassicSimilarity], result of:
            0.050808135 = score(doc=3880,freq=2.0), product of:
              0.087934606 = queryWeight, product of:
                1.1955827 = boost
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.016876914 = queryNorm
              0.57779455 = fieldWeight in 3880, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.09375 = fieldNorm(doc=3880)
          0.13984786 = weight(abstract_txt:structured in 3880) [ClassicSimilarity], result of:
            0.13984786 = score(doc=3880,freq=1.0), product of:
              0.2741542 = queryWeight, product of:
                2.9854662 = boost
                5.4411373 = idf(docFreq=520, maxDocs=44218)
                0.016876914 = queryNorm
              0.5101066 = fieldWeight in 3880, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4411373 = idf(docFreq=520, maxDocs=44218)
                0.09375 = fieldNorm(doc=3880)
          0.23780082 = weight(abstract_txt:documents in 3880) [ClassicSimilarity], result of:
            0.23780082 = score(doc=3880,freq=5.0), product of:
              0.27524698 = queryWeight, product of:
                3.9572637 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016876914 = queryNorm
              0.86395437 = fieldWeight in 3880, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.09375 = fieldNorm(doc=3880)
        0.2 = coord(5/25)
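
The content scores above combine, for each matching abstract term, a query weight (boost * idf * queryNorm) with a field weight (tf * idf * fieldNorm), sum the products, and scale the sum by the coordination factor (matching terms / query terms). A minimal sketch that recomputes the score of the fifth hit (doc=3880) from the numbers in its explain tree is given here. The function names and the tuple layout are illustrative assumptions, not the Lucene API.

```python
import math

def classic_idf(doc_freq, max_docs):
    # ClassicSimilarity inverse document frequency
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(boost, freq, doc_freq, max_docs, query_norm, field_norm):
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm            # queryWeight
    field_weight = math.sqrt(freq) * idf * field_norm  # fieldWeight
    return query_weight * field_weight

def document_score(terms, matched, total_clauses, max_docs, query_norm, field_norm):
    # coord(matched/total) times the sum of the per-term scores
    coord = matched / total_clauses
    total = sum(
        term_score(boost, freq, doc_freq, max_docs, query_norm, field_norm)
        for boost, freq, doc_freq in terms
    )
    return coord * total

# (boost, termFreq, docFreq) for doc=3880, copied from the explain tree above.
terms_3880 = [
    (1.0016172, 1.0, 3120),  # access
    (1.0051125, 1.0, 78),    # extensible
    (1.1955827, 2.0, 1538),  # structure
    (2.9854662, 1.0, 520),   # structured
    (3.9572637, 5.0, 1949),  # documents
]
score_3880 = document_score(
    terms_3880, matched=5, total_clauses=25,
    max_docs=44218, query_norm=0.016876914, field_norm=0.09375,
)
print(score_3880)
```

The result should come close to the reported 0.10699332, with small differences in the last digits because Lucene computes in single precision.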