Search (1 results, page 1 of 1)

  • × author_ss:"Wilbur, J."
  • × author_ss:"Yang, Y."
  1. Yang, Y.; Wilbur, J.: Using corpus statistics to remove redundant words in text categorization (1996) 0.00
    0.0024128247 = product of:
      0.0048256493 = sum of:
        0.0048256493 = product of:
          0.009651299 = sum of:
            0.009651299 = weight(_text_:a in 4199) [ClassicSimilarity], result of:
              0.009651299 = score(doc=4199,freq=14.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.20223314 = fieldWeight in 4199, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4199)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This article studies aggressive word removal in text categorization to reduce the noice in free texts to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain specific stoplists which are much larger than a conventional domain-independent stoplist. In our tests with 3 categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique qords reduced the vocabulary of documents from 8.002 distinct words to 1.045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases
    Type
    a