Document (#21915)

Author
Lee, K.H.
Ng, M.K.M.
Lu, Q.
Title
Text segmentation for Chinese spell checking
Source
Journal of the American Society for Information Science. 50(1999) no.9, p.751-759
Year
1999
Abstract
Chinese spell checking is different from its counterparts for Western languages because Chinese words in texts are not separated by spaces. Chinese spell checking in this article refers to how to identify the misuse of characters in text composition. In other words, it is error correction at the word level rather than at the character level. Before Chinese sentences are spell checked, the text is segmented into semantic units. Error detection can then be carried out on the segmented text based on a thesaurus and grammar rules. Segmentation is not a trivial process due to ambiguities in the Chinese language and errors in texts. Because it is not practical to define all Chinese words in a dictionary, words not predefined must also be dealt with. The number of word combinations increases exponentially with the length of the sentence. In this article, a Block-of-Combinations (BOC) segmentation method based on frequency of word usage is proposed to reduce the word combinations from exponential growth to linear growth. From experiments carried out on Hong Kong newspapers, BOC can correctly solve 10% more ambiguities than the Maximum Match segmentation method. To make the segmentation more suitable for spell checking, user interaction is also suggested.
Theme
Computational linguistics
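
The abstract names Maximum Match as the baseline that BOC is compared against but does not describe it. As a rough illustration only, the sketch below shows the commonly described forward variant of maximum matching: at each position, take the longest dictionary word that starts there, and fall back to a single character when nothing matches. The function name `max_match`, the `max_len` parameter, and the toy dictionary are assumptions made for this example; they are not taken from the article, and this is not the BOC method.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (illustrative sketch).

    text       -- unsegmented string
    dictionary -- set of known words (hypothetical toy lexicon)
    max_len    -- longest word length to try (assumed value)
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so the loop always advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon: both "研究" and "研究生" are valid words, which creates an ambiguity.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
segments = max_match("研究生命起源", toy_dictionary)
print(segments)
```

On this input the greedy choice picks "研究生" first and leaves "命" stranded, although "研究 / 生命 / 起源" is the intended reading. This is the kind of ambiguity the abstract says a frequency-based method resolves more often than Maximum Match.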

Similar documents (content)

  1. Wang, F.L.; Yang, C.C.: Mining Web data for Chinese segmentation (2007) 0.42
    0.42412013 = sum of:
      0.42412013 = product of:
        1.3253754 = sum of:
          0.0054521174 = weight(abstract_txt:from in 2605) [ClassicSimilarity], result of:
            0.0054521174 = score(doc=2605,freq=1.0), product of:
              0.031280946 = queryWeight, product of:
                1.028223 = boost
                2.7887225 = idf(docFreq=7144, maxDocs=42740)
                0.010909057 = queryNorm
              0.17429516 = fieldWeight in 2605, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7887225 = idf(docFreq=7144, maxDocs=42740)
                0.0625 = fieldNorm(doc=2605)
          0.019417536 = weight(abstract_txt:because in 2605) [ClassicSimilarity], result of:
            0.019417536 = score(doc=2605,freq=1.0), product of:
              0.06372875 = queryWeight, product of:
                1.1983107 = boost
                4.875046 = idf(docFreq=886, maxDocs=42740)
                0.010909057 = queryNorm
              0.30469036 = fieldWeight in 2605, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.875046 = idf(docFreq=886, maxDocs=42740)
                0.0625 = fieldNorm(doc=2605)
          0.04317935 = weight(abstract_txt:texts in 2605) [ClassicSimilarity], result of:
            0.04317935 = score(doc=2605,freq=2.0), product of:
              0.08617476 = queryWeight, product of:
                1.3934513 = boost
                5.668929 = idf(docFreq=400, maxDocs=42740)
                0.010909057 = queryNorm
              0.5010673 = fieldWeight in 2605, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.668929 = idf(docFreq=400, maxDocs=42740)
                0.0625 = fieldNorm(doc=2605)
          0.022267535 = weight(abstract_txt:text in 2605) [ClassicSimilarity], result of:
            0.022267535 = score(doc=2605,freq=1.0), product of:
              0.08796922 = queryWeight, product of:
                1.9910499 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.010909057 = queryNorm
              0.2531287 = fieldWeight in 2605, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=2605)
          0.10314153 = weight(abstract_txt:words in 2605) [ClassicSimilarity], result of:
            0.10314153 = score(doc=2605,freq=4.0), product of:
              0.15398735 = queryWeight, product of:
                2.634264 = boost
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.010909057 = queryNorm
              0.6698052 = fieldWeight in 2605, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.0625 = fieldNorm(doc=2605)
          0.076916605 = weight(abstract_txt:word in 2605) [ClassicSimilarity], result of:
            0.076916605 = score(doc=2605,freq=2.0), product of:
              0.15954605 = queryWeight, product of:
                2.6813889 = boost
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.010909057 = queryNorm
              0.48209658 = fieldWeight in 2605, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.0625 = fieldNorm(doc=2605)
          0.6595264 = weight(abstract_txt:segmentation in 2605) [ClassicSimilarity], result of:
            0.6595264 = score(doc=2605,freq=10.0), product of:
              0.4210569 = queryWeight, product of:
                4.870148 = boost
                7.925221 = idf(docFreq=41, maxDocs=42740)
                0.010909057 = queryNorm
              1.5663594 = fieldWeight in 2605, product of:
                3.1622777 = tf(freq=10.0), with freq of:
                  10.0 = termFreq=10.0
                7.925221 = idf(docFreq=41, maxDocs=42740)
                0.0625 = fieldNorm(doc=2605)
          0.39547428 = weight(abstract_txt:chinese in 2605) [ClassicSimilarity], result of:
            0.39547428 = score(doc=2605,freq=7.0), product of:
              0.3772317 = queryWeight, product of:
                5.4543104 = boost
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.010909057 = queryNorm
              1.0483592 = fieldWeight in 2605, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.0625 = fieldNorm(doc=2605)
        0.32 = coord(8/25)
    
  2. Yang, C.C.; Li, K.W.: A heuristic method based on a statistical approach for Chinese text segmentation (2005) 0.42
    0.422454 = sum of:
      0.422454 = product of:
        1.5087643 = sum of:
          0.04648029 = weight(abstract_txt:method in 581) [ClassicSimilarity], result of:
            0.04648029 = score(doc=581,freq=9.0), product of:
              0.054824043 = queryWeight, product of:
                1.1114433 = boost
                4.5216455 = idf(docFreq=1262, maxDocs=42740)
                0.010909057 = queryNorm
              0.84780854 = fieldWeight in 581, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                4.5216455 = idf(docFreq=1262, maxDocs=42740)
                0.0625 = fieldNorm(doc=581)
          0.05891436 = weight(abstract_txt:text in 581) [ClassicSimilarity], result of:
            0.05891436 = score(doc=581,freq=7.0), product of:
              0.08796922 = queryWeight, product of:
                1.9910499 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.010909057 = queryNorm
              0.6697156 = fieldWeight in 581, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=581)
          0.13702986 = weight(abstract_txt:ambiguities in 581) [ClassicSimilarity], result of:
            0.13702986 = score(doc=581,freq=2.0), product of:
              0.18609704 = queryWeight, product of:
                2.0477245 = boost
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.010909057 = queryNorm
              0.7363355 = fieldWeight in 581, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.0625 = fieldNorm(doc=581)
          0.115315735 = weight(abstract_txt:words in 581) [ClassicSimilarity], result of:
            0.115315735 = score(doc=581,freq=5.0), product of:
              0.15398735 = queryWeight, product of:
                2.634264 = boost
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.010909057 = queryNorm
              0.748865 = fieldWeight in 581, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.0625 = fieldNorm(doc=581)
          0.076916605 = weight(abstract_txt:word in 581) [ClassicSimilarity], result of:
            0.076916605 = score(doc=581,freq=2.0), product of:
              0.15954605 = queryWeight, product of:
                2.6813889 = boost
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.010909057 = queryNorm
              0.48209658 = fieldWeight in 581, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.0625 = fieldNorm(doc=581)
          0.6256817 = weight(abstract_txt:segmentation in 581) [ClassicSimilarity], result of:
            0.6256817 = score(doc=581,freq=9.0), product of:
              0.4210569 = queryWeight, product of:
                4.870148 = boost
                7.925221 = idf(docFreq=41, maxDocs=42740)
                0.010909057 = queryNorm
              1.485979 = fieldWeight in 581, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                7.925221 = idf(docFreq=41, maxDocs=42740)
                0.0625 = fieldNorm(doc=581)
          0.4484257 = weight(abstract_txt:chinese in 581) [ClassicSimilarity], result of:
            0.4484257 = score(doc=581,freq=9.0), product of:
              0.3772317 = queryWeight, product of:
                5.4543104 = boost
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.010909057 = queryNorm
              1.1887276 = fieldWeight in 581, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.0625 = fieldNorm(doc=581)
        0.28 = coord(7/25)
    
  3. Khoo, C.S.G.; Dai, D.; Loh, T.E.: Using statistical and contextual information to identify two- and three-character words in Chinese text (2002) 0.24
    0.24158062 = sum of:
      0.24158062 = product of:
        1.006586 = sum of:
          0.0054521174 = weight(abstract_txt:from in 207) [ClassicSimilarity], result of:
            0.0054521174 = score(doc=207,freq=1.0), product of:
              0.031280946 = queryWeight, product of:
                1.028223 = boost
                2.7887225 = idf(docFreq=7144, maxDocs=42740)
                0.010909057 = queryNorm
              0.17429516 = fieldWeight in 207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7887225 = idf(docFreq=7144, maxDocs=42740)
                0.0625 = fieldNorm(doc=207)
          0.022267535 = weight(abstract_txt:text in 207) [ClassicSimilarity], result of:
            0.022267535 = score(doc=207,freq=1.0), product of:
              0.08796922 = queryWeight, product of:
                1.9910499 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.010909057 = queryNorm
              0.2531287 = fieldWeight in 207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=207)
          0.15471229 = weight(abstract_txt:words in 207) [ClassicSimilarity], result of:
            0.15471229 = score(doc=207,freq=9.0), product of:
              0.15398735 = queryWeight, product of:
                2.634264 = boost
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.010909057 = queryNorm
              1.0047078 = fieldWeight in 207, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.0625 = fieldNorm(doc=207)
          0.054388255 = weight(abstract_txt:word in 207) [ClassicSimilarity], result of:
            0.054388255 = score(doc=207,freq=1.0), product of:
              0.15954605 = queryWeight, product of:
                2.6813889 = boost
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.010909057 = queryNorm
              0.34089378 = fieldWeight in 207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.0625 = fieldNorm(doc=207)
          0.510867 = weight(abstract_txt:segmentation in 207) [ClassicSimilarity], result of:
            0.510867 = score(doc=207,freq=6.0), product of:
              0.4210569 = queryWeight, product of:
                4.870148 = boost
                7.925221 = idf(docFreq=41, maxDocs=42740)
                0.010909057 = queryNorm
              1.2132968 = fieldWeight in 207, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.925221 = idf(docFreq=41, maxDocs=42740)
                0.0625 = fieldNorm(doc=207)
          0.2588987 = weight(abstract_txt:chinese in 207) [ClassicSimilarity], result of:
            0.2588987 = score(doc=207,freq=3.0), product of:
              0.3772317 = queryWeight, product of:
                5.4543104 = boost
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.010909057 = queryNorm
              0.6863122 = fieldWeight in 207, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.0625 = fieldNorm(doc=207)
        0.24 = coord(6/25)
    
  4. Arsenault, C.: Testing the impact of syllable aggregation in romanized fields of Chinese language bibliographic records (2000) 0.22
    0.21995918 = sum of:
      0.21995918 = product of:
        0.68737245 = sum of:
          0.04513835 = weight(abstract_txt:separated in 1088) [ClassicSimilarity], result of:
            0.04513835 = score(doc=1088,freq=1.0), product of:
              0.08876187 = queryWeight, product of:
                8.13653 = idf(docFreq=33, maxDocs=42740)
                0.010909057 = queryNorm
              0.5085331 = fieldWeight in 1088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.13653 = idf(docFreq=33, maxDocs=42740)
                0.0625 = fieldNorm(doc=1088)
          0.0077104582 = weight(abstract_txt:from in 1088) [ClassicSimilarity], result of:
            0.0077104582 = score(doc=1088,freq=2.0), product of:
              0.031280946 = queryWeight, product of:
                1.028223 = boost
                2.7887225 = idf(docFreq=7144, maxDocs=42740)
                0.010909057 = queryNorm
              0.24649057 = fieldWeight in 1088, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.7887225 = idf(docFreq=7144, maxDocs=42740)
                0.0625 = fieldNorm(doc=1088)
          0.01549343 = weight(abstract_txt:method in 1088) [ClassicSimilarity], result of:
            0.01549343 = score(doc=1088,freq=1.0), product of:
              0.054824043 = queryWeight, product of:
                1.1114433 = boost
                4.5216455 = idf(docFreq=1262, maxDocs=42740)
                0.010909057 = queryNorm
              0.28260285 = fieldWeight in 1088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5216455 = idf(docFreq=1262, maxDocs=42740)
                0.0625 = fieldNorm(doc=1088)
          0.031584375 = weight(abstract_txt:carried in 1088) [ClassicSimilarity], result of:
            0.031584375 = score(doc=1088,freq=1.0), product of:
              0.08814293 = queryWeight, product of:
                1.4092742 = boost
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.010909057 = queryNorm
              0.35833132 = fieldWeight in 1088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.0625 = fieldNorm(doc=1088)
          0.022267535 = weight(abstract_txt:text in 1088) [ClassicSimilarity], result of:
            0.022267535 = score(doc=1088,freq=1.0), product of:
              0.08796922 = queryWeight, product of:
                1.9910499 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.010909057 = queryNorm
              0.2531287 = fieldWeight in 1088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=1088)
          0.115315735 = weight(abstract_txt:words in 1088) [ClassicSimilarity], result of:
            0.115315735 = score(doc=1088,freq=5.0), product of:
              0.15398735 = queryWeight, product of:
                2.634264 = boost
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.010909057 = queryNorm
              0.748865 = fieldWeight in 1088, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.0625 = fieldNorm(doc=1088)
          0.054388255 = weight(abstract_txt:word in 1088) [ClassicSimilarity], result of:
            0.054388255 = score(doc=1088,freq=1.0), product of:
              0.15954605 = queryWeight, product of:
                2.6813889 = boost
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.010909057 = queryNorm
              0.34089378 = fieldWeight in 1088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.0625 = fieldNorm(doc=1088)
          0.39547428 = weight(abstract_txt:chinese in 1088) [ClassicSimilarity], result of:
            0.39547428 = score(doc=1088,freq=7.0), product of:
              0.3772317 = queryWeight, product of:
                5.4543104 = boost
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.010909057 = queryNorm
              1.0483592 = fieldWeight in 1088, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.0625 = fieldNorm(doc=1088)
        0.32 = coord(8/25)
    
  5. Leydesdorff, L.; Zhou, P.: Co-word analysis using the Chinese character set (2008) 0.21
    0.20657557 = sum of:
      0.20657557 = product of:
        0.86073154 = sum of:
          0.06770753 = weight(abstract_txt:separated in 3971) [ClassicSimilarity], result of:
            0.06770753 = score(doc=3971,freq=1.0), product of:
              0.08876187 = queryWeight, product of:
                8.13653 = idf(docFreq=33, maxDocs=42740)
                0.010909057 = queryNorm
              0.7627997 = fieldWeight in 3971, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.13653 = idf(docFreq=33, maxDocs=42740)
                0.09375 = fieldNorm(doc=3971)
          0.029126303 = weight(abstract_txt:because in 3971) [ClassicSimilarity], result of:
            0.029126303 = score(doc=3971,freq=1.0), product of:
              0.06372875 = queryWeight, product of:
                1.1983107 = boost
                4.875046 = idf(docFreq=886, maxDocs=42740)
                0.010909057 = queryNorm
              0.45703554 = fieldWeight in 3971, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.875046 = idf(docFreq=886, maxDocs=42740)
                0.09375 = fieldNorm(doc=3971)
          0.06476903 = weight(abstract_txt:texts in 3971) [ClassicSimilarity], result of:
            0.06476903 = score(doc=3971,freq=2.0), product of:
              0.08617476 = queryWeight, product of:
                1.3934513 = boost
                5.668929 = idf(docFreq=400, maxDocs=42740)
                0.010909057 = queryNorm
              0.7516009 = fieldWeight in 3971, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.668929 = idf(docFreq=400, maxDocs=42740)
                0.09375 = fieldNorm(doc=3971)
          0.10939812 = weight(abstract_txt:words in 3971) [ClassicSimilarity], result of:
            0.10939812 = score(doc=3971,freq=2.0), product of:
              0.15398735 = queryWeight, product of:
                2.634264 = boost
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.010909057 = queryNorm
              0.71043575 = fieldWeight in 3971, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.09375 = fieldNorm(doc=3971)
          0.14130484 = weight(abstract_txt:word in 3971) [ClassicSimilarity], result of:
            0.14130484 = score(doc=3971,freq=3.0), product of:
              0.15954605 = queryWeight, product of:
                2.6813889 = boost
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.010909057 = queryNorm
              0.88566804 = fieldWeight in 3971, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.4543004 = idf(docFreq=496, maxDocs=42740)
                0.09375 = fieldNorm(doc=3971)
          0.4484257 = weight(abstract_txt:chinese in 3971) [ClassicSimilarity], result of:
            0.4484257 = score(doc=3971,freq=4.0), product of:
              0.3772317 = queryWeight, product of:
                5.4543104 = boost
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.010909057 = queryNorm
              1.1887276 = fieldWeight in 3971, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.3398805 = idf(docFreq=204, maxDocs=42740)
                0.09375 = fieldNorm(doc=3971)
        0.24 = coord(6/25)
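
The score trees above can be reproduced from their leaf values. The numbers are consistent with the classic TF-IDF composition: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost × idf × queryNorm, fieldWeight = tf × idf × fieldNorm, and a document score equal to coord times the sum of queryWeight × fieldWeight over the matching terms. The sketch below recomputes the score of the first similar document (doc=2605) under those assumptions. The function names are hypothetical, and the idf formula is inferred from the values shown rather than taken from the index configuration.

```python
import math

MAX_DOCS = 42740          # maxDocs as shown in the trees
QUERY_NORM = 0.010909057  # queryNorm as shown in the trees


def classic_idf(doc_freq, max_docs=MAX_DOCS):
    # idf inferred from the trees: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm, query_norm=QUERY_NORM):
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * query_norm          # query side
    field_weight = math.sqrt(freq) * idf * field_norm  # document side
    return query_weight * field_weight


def document_score(terms, field_norm, matched, total):
    # coord(matched/total) scales the sum by the share of query terms matched
    coord = matched / total
    return coord * sum(term_score(f, df, b, field_norm) for f, df, b in terms)


# (freq, docFreq, boost) copied from the tree for doc=2605
doc_2605_terms = [
    (1.0, 7144, 1.028223),   # from
    (1.0, 886, 1.1983107),   # because
    (2.0, 400, 1.3934513),   # texts
    (1.0, 2023, 1.9910499),  # text
    (4.0, 546, 2.634264),    # words
    (2.0, 496, 2.6813889),   # word
    (10.0, 41, 4.870148),    # segmentation
    (7.0, 204, 5.4543104),   # chinese
]

score_2605 = document_score(doc_2605_terms, field_norm=0.0625, matched=8, total=25)
print(score_2605)
```

The result should land close to the 0.42412013 reported for that document. Small differences in the last digits are expected because the trees were computed in single precision, while Python uses double precision.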