Search (2 results, page 1 of 1)

Yang, Y.; Wilbur, J.: Using corpus statistics to remove redundant words in text categorization (1996) 0.00
```
0.0026772693 = product of:
  0.016063616 = sum of:
    0.016063616 = weight(_text_:in in 4199) [ClassicSimilarity], result of:
      0.016063616 = score(doc=4199,freq=18.0), product of:
        0.059380736 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.043654136 = queryNorm
        0.27051896 = fieldWeight in 4199, product of:
          4.2426405 = tf(freq=18.0), with freq of:
            18.0 = termFreq=18.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.046875 = fieldNorm(doc=4199)
  0.16666667 = coord(1/6)
```
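The score breakdown above can be reproduced by hand. The following is a minimal sketch, assuming Lucene's ClassicSimilarity conventions (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))). The function names are hypothetical and chosen only for illustration; the input values are copied from the explanation for doc 4199.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    # Assumed ClassicSimilarity tf: square root of the raw term frequency
    return math.sqrt(freq)


def explain_score(freq, doc_freq, max_docs, query_norm, field_norm,
                  matched_clauses, total_clauses):
    # Hypothetical helper mirroring the nesting of the explain tree
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm                        # queryWeight
    field_weight = classic_tf(freq) * idf * field_norm     # fieldWeight
    coord = matched_clauses / total_clauses                # coord(1/6)
    return query_weight * field_weight * coord


# Values taken from the explanation for doc 4199
score = explain_score(
    freq=18.0,
    doc_freq=30841,
    max_docs=44218,
    query_norm=0.043654136,
    field_norm=0.046875,
    matched_clauses=1,
    total_clauses=6,
)
print(f"{score:.10f}")
```

The result should land close to the reported 0.0026772693. Small differences in the last digits are expected, since Lucene computes these values with 32-bit floats while Python uses 64-bit floats.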
Abstract

This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain-specific stoplists which are much larger than a conventional domain-independent stoplist. In our tests with 3 categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases.

Yang, Y.; Lu, Q.; Zhao, T.: A delimiter-based general approach for Chinese term extraction (2009) 0.00
```
0.0014873719 = product of:
  0.008924231 = sum of:
    0.008924231 = weight(_text_:in in 3315) [ClassicSimilarity], result of:
      0.008924231 = score(doc=3315,freq=8.0), product of:
        0.059380736 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.043654136 = queryNorm
        0.15028831 = fieldWeight in 3315, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3315)
  0.16666667 = coord(1/6)
```
Abstract

This article addresses a two-step approach for term extraction. In the first step on term candidate extraction, a new delimiter-based approach is proposed to identify features of the delimiters of term candidates rather than those of the term candidates themselves. This delimiter-based method is much more stable and domain independent than the previous approaches. In the second step on term verification, an algorithm using link analysis is applied to calculate the relevance between term candidates and the sentences from which the terms are extracted. All information is obtained from the working domain corpus without the need for prior domain knowledge. The approach is not targeted at any specific domain and there is no need for extensive training when applying it to new domains. In other words, the method is not domain dependent and it is especially useful for resource-limited domains. Evaluations of Chinese text in two different domains show quite significant improvements over existing techniques and also verify its efficiency and its relatively domain-independent nature. The proposed method is also very effective for extracting new terms so that it can serve as an efficient tool for updating domain knowledge, especially for expanding lexicons.
