Search (4 results, page 1 of 1)

  • author_ss:"Li, K.W."
  1. Yang, C.C.; Li, K.W.: A heuristic method based on a statistical approach for Chinese text segmentation (2005) 0.10
    0.09500122 = sum of:
      0.020073157 = product of:
        0.08029263 = sum of:
          0.08029263 = weight(_text_:authors in 4580) [ClassicSimilarity], result of:
            0.08029263 = score(doc=4580,freq=4.0), product of:
              0.22544144 = queryWeight, product of:
                4.558814 = idf(docFreq=1258, maxDocs=44218)
                0.049451772 = queryNorm
              0.35615736 = fieldWeight in 4580, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.558814 = idf(docFreq=1258, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4580)
        0.25 = coord(1/4)
      0.07492806 = product of:
        0.14985612 = sum of:
          0.14985612 = weight(_text_:k.w in 4580) [ClassicSimilarity], result of:
            0.14985612 = score(doc=4580,freq=2.0), product of:
              0.36626098 = queryWeight, product of:
                7.406428 = idf(docFreq=72, maxDocs=44218)
                0.049451772 = queryNorm
              0.4091512 = fieldWeight in 4580, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.406428 = idf(docFreq=72, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4580)
        0.5 = coord(1/2)
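
    The tree above is Lucene's ClassicSimilarity (TF-IDF) explain output: each weight(...) node multiplies queryWeight (idf * queryNorm) by fieldWeight (tf * idf * fieldNorm, with tf = sqrt(termFreq)), and coord scales the sum by the fraction of query clauses that matched. A minimal Python sketch reproducing the "k.w" term score from the factors shown above (the function name is illustrative, not part of Lucene's API):

      import math

      def explain_term_score(freq, idf, query_norm, field_norm, coord=1.0):
          # ClassicSimilarity term score = coord * queryWeight * fieldWeight
          tf = math.sqrt(freq)                  # 1.4142135 for freq=2.0
          query_weight = idf * query_norm       # 7.406428 * 0.049451772 = 0.36626098
          field_weight = tf * idf * field_norm  # 1.4142135 * 7.406428 * 0.0390625 = 0.4091512
          return coord * query_weight * field_weight

      # Factors copied verbatim from the explain tree for doc 4580, term "k.w":
      print(explain_term_score(freq=2.0, idf=7.406428,
                               query_norm=0.049451772, field_norm=0.0390625,
                               coord=0.5))      # ~0.07492806, as reported above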
    
    Abstract
    The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. The method is developed from statistical information about the association among adjacent characters in Chinese text: mutual information of bi-grams and significance estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentation points in a Chinese sentence. No dictionary is required. Chinese text segmentation is important in Chinese text indexing and thus greatly affects the performance of Chinese information retrieval. Because Chinese text lacks word delimiters, its segmentation is more difficult than English text segmentation. In addition, segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Most studies of word segmentation have focused on resolving segmentation ambiguities; the problem of unknown word identification has drawn much less attention. The experimental results show that the proposed heuristic method is promising for segmenting unknown as well as known words. The authors further investigated the distribution of errors of commission and errors of omission produced by the method and benchmarked it against a previously proposed technique, boundary detection. The heuristic method outperformed the boundary detection method.
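
    The method the abstract outlines rests on association statistics between adjacent characters. Below is a minimal Python sketch of the bi-gram mutual-information component, assuming a raw unsegmented training corpus; the paper's six heuristic rules and the tri-gram significance estimation are not reproduced here, and the boundary threshold is illustrative:

      import math
      from collections import Counter

      def bigram_mutual_information(corpus):
          # Pointwise mutual information for each adjacent character pair,
          # estimated from relative frequencies; no dictionary is consulted.
          chars, bigrams = Counter(), Counter()
          for text in corpus:
              chars.update(text)
              bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
          n_chars, n_bigrams = sum(chars.values()), sum(bigrams.values())
          return {pair: math.log2((count / n_bigrams) /
                                  ((chars[pair[0]] / n_chars) *
                                   (chars[pair[1]] / n_chars)))
                  for pair, count in bigrams.items()}

      def segment(sentence, mi, threshold=0.0):
          # Insert a boundary wherever adjacent characters are weakly
          # associated (MI below the threshold).
          if not sentence:
              return sentence
          out = [sentence[0]]
          for i in range(1, len(sentence)):
              if mi.get(sentence[i - 1:i + 1], float("-inf")) < threshold:
                  out.append(" ")
              out.append(sentence[i])
          return "".join(out)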
  2. Li, K.W.; Yang, C.C.: Conceptual analysis of parallel corpus collected from the Web (2006) 0.10
    0.09500122 = sum of:
      0.020073157 = product of:
        0.08029263 = sum of:
          0.08029263 = weight(_text_:authors in 5051) [ClassicSimilarity], result of:
            0.08029263 = score(doc=5051,freq=4.0), product of:
              0.22544144 = queryWeight, product of:
                4.558814 = idf(docFreq=1258, maxDocs=44218)
                0.049451772 = queryNorm
              0.35615736 = fieldWeight in 5051, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.558814 = idf(docFreq=1258, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5051)
        0.25 = coord(1/4)
      0.07492806 = product of:
        0.14985612 = sum of:
          0.14985612 = weight(_text_:k.w in 5051) [ClassicSimilarity], result of:
            0.14985612 = score(doc=5051,freq=2.0), product of:
              0.36626098 = queryWeight, product of:
                7.406428 = idf(docFreq=72, maxDocs=44218)
                0.049451772 = queryNorm
              0.4091512 = fieldWeight in 5051, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.406428 = idf(docFreq=72, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5051)
        0.5 = coord(1/2)
    
    Abstract
    As illustrated by the World Wide Web, the volume of information in languages other than English has grown significantly in recent years, which highlights the importance of multilingual corpora. Much effort has been devoted to compiling multilingual corpora for cross-lingual information retrieval and machine translation. Existing parallel corpora mostly involve European languages, such as English-French and English-Spanish; parallel corpora between European and Asian languages remain scarce. In the authors' previous work, an alignment method that identifies one-to-one Chinese and English title pairs was developed to construct an English-Chinese parallel corpus automatically from the World Wide Web, achieving 100% precision and 87% recall. Careful analysis of these results has helped the authors understand how the alignment method can be improved. A conceptual analysis was conducted, covering conceptual equivalence and conceptual information alternation in the aligned and nonaligned English-Chinese title pairs obtained by the alignment method. The results not only reflect the characteristics of parallel corpora but also give insight into the strengths and weaknesses of the alignment method. In particular, conceptual alternation, such as omission and addition, is found to have a significant impact on the alignment method's performance.
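
    The reported figures use the standard precision and recall definitions over extracted title pairs. A minimal Python sketch of that evaluation, assuming a gold-standard set of true pairs is available (the tuple representation of a title pair is illustrative):

      def precision_recall(extracted_pairs, gold_pairs):
          # precision = correct / extracted; recall = correct / all true pairs
          extracted, gold = set(extracted_pairs), set(gold_pairs)
          correct = extracted & gold
          precision = len(correct) / len(extracted) if extracted else 0.0
          recall = len(correct) / len(gold) if gold else 0.0
          return precision, recall

      # 100% precision with 87% recall would mean every extracted pair was a
      # true pair, but 13% of the true title pairs were missed.
      p, r = precision_recall(
          extracted_pairs=[("Title A", "标题甲")],
          gold_pairs=[("Title A", "标题甲"), ("Title B", "标题乙")])
      print(p, r)   # 1.0 0.5 on this toy example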
  3. Yang, C.C.; Li, K.W.: Automatic construction of English/Chinese parallel corpora (2003) 0.03
    0.029971223 = product of:
      0.059942447 = sum of:
        0.059942447 = product of:
          0.11988489 = sum of:
            0.11988489 = weight(_text_:k.w in 1683) [ClassicSimilarity], result of:
              0.11988489 = score(doc=1683,freq=2.0), product of:
                0.36626098 = queryWeight, product of:
                  7.406428 = idf(docFreq=72, maxDocs=44218)
                  0.049451772 = queryNorm
                0.32732096 = fieldWeight in 1683, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  7.406428 = idf(docFreq=72, maxDocs=44218)
                  0.03125 = fieldNorm(doc=1683)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
  4. Li, K.W.; Yang, C.C.: Automatic crosslingual thesaurus generated from the Hong Kong SAR Police Department Web Corpus for Crime Analysis (2005) 0.03
    0.029971223 = product of:
      0.059942447 = sum of:
        0.059942447 = product of:
          0.11988489 = sum of:
            0.11988489 = weight(_text_:k.w in 3391) [ClassicSimilarity], result of:
              0.11988489 = score(doc=3391,freq=2.0), product of:
                0.36626098 = queryWeight, product of:
                  7.406428 = idf(docFreq=72, maxDocs=44218)
                  0.049451772 = queryNorm
                0.32732096 = fieldWeight in 3391, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  7.406428 = idf(docFreq=72, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3391)
          0.5 = coord(1/2)
      0.5 = coord(1/2)