Yang, Y.; Wilbur, J.: Using corpus statistics to remove redundant words in text categorization (1996)
- Abstract
- This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain-specific stoplists which are much larger than a conventional domain-independent stoplist. In our tests with 3 categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases
- Type
- a