Yang, Y.; Wilbur, J.: Using corpus statistics to remove redundant words in text categorization (1996)
- Abstract
- This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain-specific stoplists which are much larger than a conventional domain-independent stoplist. In our tests with 3 categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases
- Type
- a