Document (#10930)

Author
Akman, K.I.
Title
¬A new text compression technique based on natural language structure
Source
Journal of information science. 21(1995) no.2, pp.87-94
Year
1995
Abstract
Describes a new data compression technique that exploits some of the common structural characteristics of natural languages. The proposed algorithm partitions words into their roots and suffixes, which are then replaced by shorter bit representations. The method uses 3 dictionaries in the form of binary search trees and 1 character array. The first 2 dictionaries hold roots, and the third holds suffixes. The character array is used both for searching compressible words and for coding incompressible words. The number of bits used to represent a substring depends on the number of entries in the dictionary in which the substring is found. The proposed algorithm is implemented for the Turkish language and tested using 3 text groups of different lengths. Results indicate a compression factor of up to 47 per cent.
Theme
Computerlinguistik
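
The root/suffix idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it uses plain Python lists instead of binary search trees, a single root dictionary instead of two, and omits the character array for incompressible words. The names `index_bits`, `split_word`, `encode_word`, and the toy Turkish entries are all assumptions made for illustration. The only property taken from the abstract is that the code length depends on the number of entries in the dictionary.

```python
from math import ceil, log2

# Toy dictionaries (hypothetical entries, not taken from the paper).
ROOTS = ["ev", "kitap", "okul"]        # 3 entries -> 2-bit codes
SUFFIXES = ["", "lar", "ler", "da"]    # 4 entries -> 2-bit codes


def index_bits(entry_count):
    """Bits needed to address one of entry_count dictionary entries."""
    return max(1, ceil(log2(entry_count)))


def split_word(word, roots, suffixes):
    """Return (root, suffix) for the longest known root, or None."""
    for cut in range(len(word), 0, -1):
        root, suffix = word[:cut], word[cut:]
        if root in roots and suffix in suffixes:
            return root, suffix
    return None


def encode_word(word, roots, suffixes):
    """Encode a word as root index + suffix index, written as a bit string.

    Returns None when the word cannot be split; a real coder would then
    fall back to a character-level code, which is not modelled here.
    """
    parts = split_word(word, roots, suffixes)
    if parts is None:
        return None
    root, suffix = parts
    root_code = format(roots.index(root), f"0{index_bits(len(roots))}b")
    suffix_code = format(suffixes.index(suffix), f"0{index_bits(len(suffixes))}b")
    return root_code + suffix_code


if __name__ == "__main__":
    for w in ["kitaplar", "evler", "ev", "merhaba"]:
        print(w, encode_word(w, ROOTS, SUFFIXES))
```

With these toy dictionaries an 8-character word such as "kitaplar" is represented by 4 bits, while a word with no known root is reported as incompressible.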

Similar documents (content)

  1. Ucoluk, G.; Toroslu, I.H.: ¬A genetic algorithm approach for verification of the syllable-based text compression technique (1997) 0.23
    0.23135117 = sum of:
      0.23135117 = product of:
        0.7229724 = sum of:
          0.063993715 = weight(abstract_txt:coding in 2601) [ClassicSimilarity], result of:
            0.063993715 = score(doc=2601,freq=1.0), product of:
              0.121256836 = queryWeight, product of:
                6.7552447 = idf(docFreq=139, maxDocs=44218)
                0.017950028 = queryNorm
              0.5277535 = fieldWeight in 2601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7552447 = idf(docFreq=139, maxDocs=44218)
                0.078125 = fieldNorm(doc=2601)
          0.047554888 = weight(abstract_txt:text in 2601) [ClassicSimilarity], result of:
            0.047554888 = score(doc=2601,freq=3.0), product of:
              0.086905584 = queryWeight, product of:
                1.1972524 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017950028 = queryNorm
              0.54720175 = fieldWeight in 2601, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=2601)
          0.030368501 = weight(abstract_txt:language in 2601) [ClassicSimilarity], result of:
            0.030368501 = score(doc=2601,freq=1.0), product of:
              0.09294804 = queryWeight, product of:
                1.2381749 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.017950028 = queryNorm
              0.32672557 = fieldWeight in 2601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.078125 = fieldNorm(doc=2601)
          0.015453155 = weight(abstract_txt:which in 2601) [ClassicSimilarity], result of:
            0.015453155 = score(doc=2601,freq=1.0), product of:
              0.06781616 = queryWeight, product of:
                1.2953112 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.017950028 = queryNorm
              0.22786833 = fieldWeight in 2601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.078125 = fieldNorm(doc=2601)
          0.147549 = weight(abstract_txt:turkish in 2601) [ClassicSimilarity], result of:
            0.147549 = score(doc=2601,freq=1.0), product of:
              0.21162754 = queryWeight, product of:
                1.3210918 = boost
                8.924298 = idf(docFreq=15, maxDocs=44218)
                0.017950028 = queryNorm
              0.6972108 = fieldWeight in 2601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.924298 = idf(docFreq=15, maxDocs=44218)
                0.078125 = fieldNorm(doc=2601)
          0.07251834 = weight(abstract_txt:technique in 2601) [ClassicSimilarity], result of:
            0.07251834 = score(doc=2601,freq=1.0), product of:
              0.16605677 = queryWeight, product of:
                1.6549702 = boost
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.017950028 = queryNorm
              0.43670815 = fieldWeight in 2601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.078125 = fieldNorm(doc=2601)
          0.07710945 = weight(abstract_txt:algorithm in 2601) [ClassicSimilarity], result of:
            0.07710945 = score(doc=2601,freq=1.0), product of:
              0.17299348 = queryWeight, product of:
                1.6891832 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.017950028 = queryNorm
              0.44573617 = fieldWeight in 2601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.078125 = fieldNorm(doc=2601)
          0.26842535 = weight(abstract_txt:compression in 2601) [ClassicSimilarity], result of:
            0.26842535 = score(doc=2601,freq=1.0), product of:
              0.4548527 = queryWeight, product of:
                3.3546166 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.017950028 = queryNorm
              0.5901369 = fieldWeight in 2601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.078125 = fieldNorm(doc=2601)
        0.32 = coord(8/25)
    
  2. Cheng, K.-S.; Young, G.H.; Wong, K.-F.: ¬A study on word-based and integral-bit Chinese text compression algorithms (1999) 0.20
    0.20493022 = sum of:
      0.20493022 = product of:
        1.024651 = sum of:
          0.12670109 = weight(abstract_txt:coding in 3056) [ClassicSimilarity], result of:
            0.12670109 = score(doc=3056,freq=2.0), product of:
              0.121256836 = queryWeight, product of:
                6.7552447 = idf(docFreq=139, maxDocs=44218)
                0.017950028 = queryNorm
              1.0448985 = fieldWeight in 3056, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.7552447 = idf(docFreq=139, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.038438156 = weight(abstract_txt:text in 3056) [ClassicSimilarity], result of:
            0.038438156 = score(doc=3056,freq=1.0), product of:
              0.086905584 = queryWeight, product of:
                1.1972524 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017950028 = queryNorm
              0.4422979 = fieldWeight in 3056, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.021634419 = weight(abstract_txt:which in 3056) [ClassicSimilarity], result of:
            0.021634419 = score(doc=3056,freq=1.0), product of:
              0.06781616 = queryWeight, product of:
                1.2953112 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.017950028 = queryNorm
              0.31901568 = fieldWeight in 3056, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.18698049 = weight(abstract_txt:algorithm in 3056) [ClassicSimilarity], result of:
            0.18698049 = score(doc=3056,freq=3.0), product of:
              0.17299348 = queryWeight, product of:
                1.6891832 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.017950028 = queryNorm
              1.0808527 = fieldWeight in 3056, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
          0.6508969 = weight(abstract_txt:compression in 3056) [ClassicSimilarity], result of:
            0.6508969 = score(doc=3056,freq=3.0), product of:
              0.4548527 = queryWeight, product of:
                3.3546166 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.017950028 = queryNorm
              1.431006 = fieldWeight in 3056, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.109375 = fieldNorm(doc=3056)
        0.2 = coord(5/25)
    
  3. Cannane, A.; Williams, H.E.: General-purpose compression for efficient retrieval (2001) 0.20
    0.20324619 = sum of:
      0.20324619 = product of:
        1.016231 = sum of:
          0.027455827 = weight(abstract_txt:text in 5705) [ClassicSimilarity], result of:
            0.027455827 = score(doc=5705,freq=1.0), product of:
              0.086905584 = queryWeight, product of:
                1.1972524 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017950028 = queryNorm
              0.3159271 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.015453155 = weight(abstract_txt:which in 5705) [ClassicSimilarity], result of:
            0.015453155 = score(doc=5705,freq=1.0), product of:
              0.06781616 = queryWeight, product of:
                1.2953112 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.017950028 = queryNorm
              0.22786833 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.07251834 = weight(abstract_txt:technique in 5705) [ClassicSimilarity], result of:
            0.07251834 = score(doc=5705,freq=1.0), product of:
              0.16605677 = queryWeight, product of:
                1.6549702 = boost
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.017950028 = queryNorm
              0.43670815 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.09552757 = weight(abstract_txt:words in 5705) [ClassicSimilarity], result of:
            0.09552757 = score(doc=5705,freq=1.0), product of:
              0.22842357 = queryWeight, product of:
                2.3772671 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.017950028 = queryNorm
              0.41820365 = fieldWeight in 5705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
          0.8052761 = weight(abstract_txt:compression in 5705) [ClassicSimilarity], result of:
            0.8052761 = score(doc=5705,freq=9.0), product of:
              0.4548527 = queryWeight, product of:
                3.3546166 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.017950028 = queryNorm
              1.7704107 = fieldWeight in 5705, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.078125 = fieldNorm(doc=5705)
        0.2 = coord(5/25)
    
  4. Wang, F.L.; Yang, C.C.: Mining Web data for Chinese segmentation (2007) 0.15
    0.15309781 = sum of:
      0.15309781 = product of:
        0.5467779 = sum of:
          0.016358158 = weight(abstract_txt:different in 604) [ClassicSimilarity], result of:
            0.016358158 = score(doc=604,freq=1.0), product of:
              0.071403734 = queryWeight, product of:
                1.0852314 = boost
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.017950028 = queryNorm
              0.22909386 = fieldWeight in 604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.0625 = fieldNorm(doc=604)
          0.021964662 = weight(abstract_txt:text in 604) [ClassicSimilarity], result of:
            0.021964662 = score(doc=604,freq=1.0), product of:
              0.086905584 = queryWeight, product of:
                1.1972524 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017950028 = queryNorm
              0.25274166 = fieldWeight in 604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=604)
          0.042079832 = weight(abstract_txt:language in 604) [ClassicSimilarity], result of:
            0.042079832 = score(doc=604,freq=3.0), product of:
              0.09294804 = queryWeight, product of:
                1.2381749 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.017950028 = queryNorm
              0.45272425 = fieldWeight in 604, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0625 = fieldNorm(doc=604)
          0.03252691 = weight(abstract_txt:proposed in 604) [ClassicSimilarity], result of:
            0.03252691 = score(doc=604,freq=1.0), product of:
              0.112908475 = queryWeight, product of:
                1.3646622 = boost
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.017950028 = queryNorm
              0.2880821 = fieldWeight in 604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.0625 = fieldNorm(doc=604)
          0.15110305 = weight(abstract_txt:algorithm in 604) [ClassicSimilarity], result of:
            0.15110305 = score(doc=604,freq=6.0), product of:
              0.17299348 = queryWeight, product of:
                1.6891832 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.017950028 = queryNorm
              0.87346095 = fieldWeight in 604, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0625 = fieldNorm(doc=604)
          0.12990117 = weight(abstract_txt:character in 604) [ClassicSimilarity], result of:
            0.12990117 = score(doc=604,freq=2.0), product of:
              0.225578 = queryWeight, product of:
                1.9289024 = boost
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.017950028 = queryNorm
              0.57585925 = fieldWeight in 604, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.0625 = fieldNorm(doc=604)
          0.15284412 = weight(abstract_txt:words in 604) [ClassicSimilarity], result of:
            0.15284412 = score(doc=604,freq=4.0), product of:
              0.22842357 = queryWeight, product of:
                2.3772671 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.017950028 = queryNorm
              0.66912585 = fieldWeight in 604, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.0625 = fieldNorm(doc=604)
        0.28 = coord(7/25)
    
  5. Hmeidi, I.I.; Al-Shalabi, R.F.; Al-Taani, A.T.; Najadat, H.; Al-Hazaimeh, S.A.: ¬A novel approach to the extraction of roots from Arabic words using bigrams (2010) 0.14
    0.14463161 = sum of:
      0.14463161 = product of:
        0.6026317 = sum of:
          0.021964662 = weight(abstract_txt:text in 3426) [ClassicSimilarity], result of:
            0.021964662 = score(doc=3426,freq=1.0), product of:
              0.086905584 = queryWeight, product of:
                1.1972524 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017950028 = queryNorm
              0.25274166 = fieldWeight in 3426, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.024294803 = weight(abstract_txt:language in 3426) [ClassicSimilarity], result of:
            0.024294803 = score(doc=3426,freq=1.0), product of:
              0.09294804 = queryWeight, product of:
                1.2381749 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.017950028 = queryNorm
              0.26138046 = fieldWeight in 3426, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.056338258 = weight(abstract_txt:proposed in 3426) [ClassicSimilarity], result of:
            0.056338258 = score(doc=3426,freq=3.0), product of:
              0.112908475 = queryWeight, product of:
                1.3646622 = boost
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.017950028 = queryNorm
              0.4989728 = fieldWeight in 3426, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.06168756 = weight(abstract_txt:algorithm in 3426) [ClassicSimilarity], result of:
            0.06168756 = score(doc=3426,freq=1.0), product of:
              0.17299348 = queryWeight, product of:
                1.6891832 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.017950028 = queryNorm
              0.35658893 = fieldWeight in 3426, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.25115135 = weight(abstract_txt:roots in 3426) [ClassicSimilarity], result of:
            0.25115135 = score(doc=3426,freq=3.0), product of:
              0.30583084 = queryWeight, product of:
                2.2459626 = boost
                7.5860133 = idf(docFreq=60, maxDocs=44218)
                0.017950028 = queryNorm
              0.82121 = fieldWeight in 3426, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5860133 = idf(docFreq=60, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.18719505 = weight(abstract_txt:words in 3426) [ClassicSimilarity], result of:
            0.18719505 = score(doc=3426,freq=6.0), product of:
              0.22842357 = queryWeight, product of:
                2.3772671 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.017950028 = queryNorm
              0.8195085 = fieldWeight in 3426, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
        0.24 = coord(6/25)
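
The score trees above all follow the same arithmetic, which can be reproduced by hand. The sketch below recomputes the first entry (doc 2601) from the values printed in its tree: for each matching term, `queryWeight = boost * idf * queryNorm` and `fieldWeight = sqrt(freq) * idf * fieldNorm`; the per-term products are summed and multiplied by the coordination factor. The function names and the data layout are assumptions for illustration and are not Lucene API calls. Lucene computes in 32-bit floats, so the last digits may differ slightly.

```python
from math import sqrt

QUERY_NORM = 0.017950028  # queryNorm, identical for every term in the listing


def term_score(boost, idf, freq, field_norm, query_norm=QUERY_NORM):
    """One 'weight(abstract_txt:term in doc)' line of the explain tree."""
    query_weight = boost * idf * query_norm
    field_weight = sqrt(freq) * idf * field_norm  # tf = sqrt(termFreq)
    return query_weight * field_weight


def document_score(terms, field_norm, matched, query_terms):
    """Sum of term scores scaled by coord(matched/query_terms)."""
    total = sum(term_score(b, idf, f, field_norm) for b, idf, f in terms)
    return total * (matched / query_terms)


# (boost, idf, termFreq) for doc 2601, copied from the tree above.
# 'coding' has no boost line, which is treated here as boost 1.0.
DOC_2601_TERMS = [
    (1.0,       6.7552447, 1.0),  # coding
    (1.1972524, 4.0438666, 3.0),  # text
    (1.2381749, 4.1820874, 1.0),  # language
    (1.2953112, 2.9167147, 1.0),  # which
    (1.3210918, 8.924298,  1.0),  # turkish
    (1.6549702, 5.5898643, 1.0),  # technique
    (1.6891832, 5.705423,  1.0),  # algorithm
    (3.3546166, 7.5537524, 1.0),  # compression
]

if __name__ == "__main__":
    score = document_score(DOC_2601_TERMS, field_norm=0.078125,
                           matched=8, query_terms=25)
    print(round(score, 5))  # close to the 0.23135117 shown for doc 2601
```

The same function applied to the other entries explains their ranking: entry 3 matches only 5 of 25 query terms, but a term frequency of 9 for "compression" lifts its sum enough to nearly match entry 2.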