Search (144 results, page 1 of 8)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.12

0.11997249 = product of:
  0.17995873 = sum of:
    0.080158204 = product of:
      0.2404746 = sum of:
        0.2404746 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.2404746 = score(doc=562,freq=2.0), product of:
            0.427877 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.05046903 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.09980053 = sum of:
      0.058773387 = weight(_text_:classification in 562) [ClassicSimilarity], result of:
        0.058773387 = score(doc=562,freq=6.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.3656675 = fieldWeight in 562, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.046875 = fieldNorm(doc=562)
      0.04102714 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
        0.04102714 = score(doc=562,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.23214069 = fieldWeight in 562, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.046875 = fieldNorm(doc=562)
  0.6666667 = coord(2/3)
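
The score tree above can be reproduced by hand. The following is a minimal sketch, not the search engine's own code: it assumes the classic TF-IDF formulas that the numbers in the tree are consistent with (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The function name `classic_term_score` is hypothetical; all constants are copied from the tree for doc 562.

```python
import math

def classic_term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Score of one term in one document, following the explain tree:
    score = queryWeight * fieldWeight."""
    tf = math.sqrt(freq)                              # tf(freq)
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))   # assumed idf formula
    query_weight = idf * query_norm                   # queryWeight
    field_weight = tf * idf * field_norm              # fieldWeight
    return query_weight * field_weight

# Constants taken from the explain output for doc 562.
MAX_DOCS = 44218
QUERY_NORM = 0.05046903
FIELD_NORM = 0.046875

s_3a = classic_term_score(2.0, 24, MAX_DOCS, QUERY_NORM, FIELD_NORM)
s_classification = classic_term_score(6.0, 4974, MAX_DOCS, QUERY_NORM, FIELD_NORM)
s_22 = classic_term_score(2.0, 3622, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# The "3a" clause matched 1 of 3 sub-clauses (coord 1/3); the outer query
# matched 2 of 3 clauses (coord 2/3).
total = (s_3a * (1 / 3) + (s_classification + s_22)) * (2 / 3)

print(f"classification term: {s_classification:.9f}")  # listing shows 0.058773387
print(f"document score:      {total:.9f}")             # listing shows 0.11997249
```

Small differences in the last digits are expected, since the listing appears to use single-precision floats while this sketch uses Python doubles.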

Abstract: Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well known text corpora support our approach through consistent improvement of the results.
Content: Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32

Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.11

0.10637534 = product of:
  0.159563 = sum of:
    0.068818994 = weight(_text_:interest in 1107) [ClassicSimilarity], result of:
      0.068818994 = score(doc=1107,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.27446008 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.090744 = sum of:
      0.056554716 = weight(_text_:classification in 1107) [ClassicSimilarity], result of:
        0.056554716 = score(doc=1107,freq=8.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.35186368 = fieldWeight in 1107, product of:
            2.828427 = tf(freq=8.0), with freq of:
              8.0 = termFreq=8.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.0390625 = fieldNorm(doc=1107)
      0.034189284 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
        0.034189284 = score(doc=1107,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.19345059 = fieldWeight in 1107, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0390625 = fieldNorm(doc=1107)
  0.6666667 = coord(2/3)

Abstract: Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.
Date: 28.10.2013 19:22:57

Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.10
```
0.09533234 = product of:
  0.1429985 = sum of:
    0.068818994 = weight(_text_:interest in 2765) [ClassicSimilarity], result of:
      0.068818994 = score(doc=2765,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.27446008 = fieldWeight in 2765, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2765)
    0.07417951 = sum of:
      0.039990224 = weight(_text_:classification in 2765) [ClassicSimilarity], result of:
        0.039990224 = score(doc=2765,freq=4.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.24880521 = fieldWeight in 2765, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.0390625 = fieldNorm(doc=2765)
      0.034189284 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
        0.034189284 = score(doc=2765,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.19345059 = fieldWeight in 2765, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0390625 = fieldNorm(doc=2765)
  0.6666667 = coord(2/3)
```
Abstract

Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms statistically significantly (99% confidence) the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.

Date

22. 3.2009 19:14:43

Mostafa, J.; Quiroga, L.M.; Palakal, M.: Filtering medical documents using automated and human classification methods (1998) 0.09

0.090823546 = product of:
  0.13623531 = sum of:
    0.082582794 = weight(_text_:interest in 2326) [ClassicSimilarity], result of:
      0.082582794 = score(doc=2326,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.3293521 = fieldWeight in 2326, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.046875 = fieldNorm(doc=2326)
    0.053652517 = product of:
      0.107305035 = sum of:
        0.107305035 = weight(_text_:classification in 2326) [ClassicSimilarity], result of:
          0.107305035 = score(doc=2326,freq=20.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.66761446 = fieldWeight in 2326, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2326)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: The goal of this research is to clarify the role of document classification in information filtering. An important function of classification, in managing computational complexity, is described and illustrated in the context of an existing filtering system. A parameter called classification homogeneity is presented for analyzing unsupervised automated classification by employing human classification as a control. Two significant components of the automated classification approach, vocabulary discovery and classification scheme generation, are described in detail. Results of classification performance revealed considerable variability in the homogeneity of automatically produced classes. Based on the classification performance, different types of interest profiles were created. Subsequently, these profiles were used to perform filtering sessions. The filtering results showed that with increasing homogeneity, filtering performance improves, and, conversely, with decreasing homogeneity, filtering performance degrades.

Sebastiani, F.: Classification of text, automatic (2006) 0.08

0.08289318 = product of:
  0.12433976 = sum of:
    0.0963466 = weight(_text_:interest in 5003) [ClassicSimilarity], result of:
      0.0963466 = score(doc=5003,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.38424414 = fieldWeight in 5003, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5003)
    0.027993156 = product of:
      0.05598631 = sum of:
        0.05598631 = weight(_text_:classification in 5003) [ClassicSimilarity], result of:
          0.05598631 = score(doc=5003,freq=4.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.34832728 = fieldWeight in 5003, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5003)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: Automatic text classification (ATC) is a discipline at the crossroads of information retrieval (IR), machine learning (ML), and computational linguistics (CL), and consists in the realization of text classifiers, i.e. software systems capable of assigning texts to one or more categories, or classes, from a predefined set. Applications range from the automated indexing of scientific articles, to e-mail routing, spam filtering, authorship attribution, and automated survey coding. This article will focus on the ML approach to ATC, whereby a software system (called the learner) automatically builds a classifier for the categories of interest by generalizing from a "training" set of pre-classified texts.

Mukhopadhyay, S.; Peng, S.; Raje, R.; Palakal, M.; Mostafa, J.: Multi-agent information classification using dynamic acquaintance lists (2003) 0.07
```
0.074646324 = product of:
  0.111969486 = sum of:
    0.082582794 = weight(_text_:interest in 1755) [ClassicSimilarity], result of:
      0.082582794 = score(doc=1755,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.3293521 = fieldWeight in 1755, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.046875 = fieldNorm(doc=1755)
    0.029386694 = product of:
      0.058773387 = sum of:
        0.058773387 = weight(_text_:classification in 1755) [ClassicSimilarity], result of:
          0.058773387 = score(doc=1755,freq=6.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.3656675 = fieldWeight in 1755, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1755)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

There has been considerable interest in recent years in providing automated information services, such as information classification, by means of a society of collaborative agents. These agents augment each other's knowledge structures (e.g., the vocabularies) and assist each other in providing efficient information services to a human user. However, when the number of agents present in the society increases, exhaustive communication and collaboration among agents result in a large communication overhead and increased delays in response time. This paper introduces a method to achieve selective interaction with a relatively small number of potentially useful agents, based on simple agent modeling and acquaintance lists. The key idea presented here is that the acquaintance list of an agent, representing a small number of other agents to be collaborated with, is dynamically adjusted. The best acquaintances are automatically discovered using a learning algorithm, based on the past history of collaboration. Experimental results are presented to demonstrate that such dynamically learned acquaintance lists can lead to high quality of classification, while significantly reducing the delay in response time.

Li, T.; Zhu, S.; Ogihara, M.: Hierarchical document classification using automatically generated hierarchy (2007) 0.07

0.074646324 = product of:
  0.111969486 = sum of:
    0.082582794 = weight(_text_:interest in 4797) [ClassicSimilarity], result of:
      0.082582794 = score(doc=4797,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.3293521 = fieldWeight in 4797, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.046875 = fieldNorm(doc=4797)
    0.029386694 = product of:
      0.058773387 = sum of:
        0.058773387 = weight(_text_:classification in 4797) [ClassicSimilarity], result of:
          0.058773387 = score(doc=4797,freq=6.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.3656675 = fieldWeight in 4797, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=4797)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.

Sebastiani, F.: Machine learning in automated text categorization (2002) 0.07

0.06636614 = product of:
  0.09954921 = sum of:
    0.082582794 = weight(_text_:interest in 3389) [ClassicSimilarity], result of:
      0.082582794 = score(doc=3389,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.3293521 = fieldWeight in 3389, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.046875 = fieldNorm(doc=3389)
    0.016966416 = product of:
      0.03393283 = sum of:
        0.03393283 = weight(_text_:classification in 3389) [ClassicSimilarity], result of:
          0.03393283 = score(doc=3389,freq=2.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.21111822 = fieldWeight in 3389, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=3389)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

Sebastiani, F.: ¬A tutorial on automated text categorisation (1999) 0.07
```
0.06636614 = product of:
  0.09954921 = sum of:
    0.082582794 = weight(_text_:interest in 3390) [ClassicSimilarity], result of:
      0.082582794 = score(doc=3390,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.3293521 = fieldWeight in 3390, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.046875 = fieldNorm(doc=3390)
    0.016966416 = product of:
      0.03393283 = sum of:
        0.03393283 = weight(_text_:classification in 3390) [ClassicSimilarity], result of:
          0.03393283 = score(doc=3390,freq=2.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.21111822 = fieldWeight in 3390, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=3390)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late '80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert knowledge on how to classify documents. In the '90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest. A newer paradigm based on machine learning has superseded the previous approach. Within this paradigm, a general inductive process automatically builds a classifier by "learning", from a set of previously classified documents, the characteristics of one or more categories; the advantages are a very good effectiveness, a considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues of document indexing, classifier construction, and classifier evaluation, will be touched upon.

Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.06
```
0.064730905 = product of:
  0.097096354 = sum of:
    0.068818994 = weight(_text_:interest in 4101) [ClassicSimilarity], result of:
      0.068818994 = score(doc=4101,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.27446008 = fieldWeight in 4101, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.0390625 = fieldNorm(doc=4101)
    0.028277358 = product of:
      0.056554716 = sum of:
        0.056554716 = weight(_text_:classification in 4101) [ClassicSimilarity], result of:
          0.056554716 = score(doc=4101,freq=8.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.35186368 = fieldWeight in 4101, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4101)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

Hierarchical text classification (HTC) approaches have recently attracted a lot of interest on the part of researchers in human language technology and machine learning, since they have been shown to bring about equal, if not better, classification accuracy with respect to their "flat" counterparts while allowing exponential time savings at both learning and classification time. A typical component of HTC methods is a "local" policy for selecting negative examples: Given a category c, its negative training examples are by default identified with the training examples that are negative for c and positive for the categories which are siblings of c in the hierarchy. However, this policy has always been taken for granted and never been subjected to careful scrutiny since first proposed 15 years ago. This article proposes a thorough experimental comparison between this policy and three other policies for the selection of negative examples in HTC contexts, one of which (BEST LOCAL (k)) is being proposed for the first time in this article. We compare these policies on the hierarchical versions of three supervised learning algorithms (boosting, support vector machines, and naïve Bayes) by performing experiments on two standard TC datasets, REUTERS-21578 and RCV1-V2.
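
The default "local" policy described in this abstract (negatives for a category c are the training examples that are negative for c and positive for a sibling of c) can be sketched as follows. This is an illustration only, not the authors' implementation: the function name `sibling_negatives`, the `parent` map, and the toy labels are all hypothetical, and each document is assumed to carry its labels as a plain set of category names.

```python
def sibling_negatives(category, parent, doc_labels):
    """Return the ids of training documents that are negative for
    `category` but positive for at least one of its siblings."""
    # Siblings are the other categories that share the same parent node.
    siblings = {
        c for c, p in parent.items()
        if p == parent[category] and c != category
    }
    return sorted(
        doc_id
        for doc_id, labels in doc_labels.items()
        if category not in labels and labels & siblings
    )

# Hypothetical toy hierarchy: child -> parent.
parent = {
    "sports": "root",
    "politics": "root",
    "football": "sports",
    "tennis": "sports",
}

# Hypothetical training set: document id -> set of positive categories.
doc_labels = {
    "d1": {"football"},
    "d2": {"tennis"},
    "d3": {"politics"},
    "d4": {"football", "tennis"},
}

negatives = sibling_negatives("football", parent, doc_labels)
print(negatives)
```

Here only `d2` qualifies as a negative example for `football`: `d3` belongs to no sibling of `football`, and `d4` is itself positive for `football`.
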

Borko, H.: Research in computer based classification systems (1985) 0.06
```
0.055905145 = product of:
  0.083857715 = sum of:
    0.0481733 = weight(_text_:interest in 3647) [ClassicSimilarity], result of:
      0.0481733 = score(doc=3647,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.19212207 = fieldWeight in 3647, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.02734375 = fieldNorm(doc=3647)
    0.035684414 = product of:
      0.07136883 = sum of:
        0.07136883 = weight(_text_:classification in 3647) [ClassicSimilarity], result of:
          0.07136883 = score(doc=3647,freq=26.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.44403192 = fieldWeight in 3647, product of:
              5.0990195 = tf(freq=26.0), with freq of:
                26.0 = termFreq=26.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3647)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

The selection in this reader by R. M. Needham and K. Sparck Jones reports an early approach to automatic classification that was taken in England. The following selection reviews various approaches that were being pursued in the United States at about the same time. It then discusses a particular approach initiated in the early 1960s by Harold Borko, at that time Head of the Language Processing and Retrieval Research Staff at the System Development Corporation, Santa Monica, California and, since 1966, a member of the faculty at the Graduate School of Library and Information Science, University of California, Los Angeles. As was described earlier, there are two steps in automatic classification, the first being to identify pairs of terms that are similar by virtue of co-occurring as index terms in the same documents, and the second being to form equivalence classes of intersubstitutable terms. To compute similarities, Borko and his associates used a standard correlation formula; to derive classification categories, where Needham and Sparck Jones used clumping, the Borko team used the statistical technique of factor analysis. The fact that documents can be classified automatically, and in any number of ways, is worthy of passing notice. Worthy of serious attention would be a demonstration that a computer-based classification system was effective in the organization and retrieval of documents. One reason for the inclusion of the following selection in the reader is that it addresses the question of evaluation. To evaluate the effectiveness of their automatically derived classification, Borko and his team asked three questions. The first was: Is the classification reliable? In other words, could the categories derived from one sample of texts be used to classify other texts? Reliability was assessed by a case-study comparison of the classes derived from three different samples of abstracts.
The not-so-surprising conclusion reached was that automatically derived classes were reliable only to the extent that the sample from which they were derived was representative of the total document collection. The second evaluation question asked whether the classification was reasonable, in the sense of adequately describing the content of the document collection. The answer was sought by comparing the automatically derived categories with categories in a related classification system that was manually constructed. Here the conclusion was that the automatic method yielded categories that fairly accurately reflected the major area of interest in the sample collection of texts; however, since there were only eleven such categories and they were quite broad, they could not be regarded as suitable for use in a university or any large general library. The third evaluation question asked whether automatic classification was accurate, in the sense of producing results similar to those obtainable by human classifiers. When using human classification as a criterion, automatic classification was found to be 50 percent accurate.

Footnote

Original in: Classification research: Proceedings of the Second International Study Conference held at Hotel Prins Hamlet, Elsinore, Denmark, 14th-18th Sept. 1964. Ed.: Pauline Atherton. Copenhagen: Munksgaard 1965. pp. 220-238.

Fang, H.: Classifying research articles in multidisciplinary sciences journals into subject categories (2015) 0.06
```
0.055305116 = product of:
  0.08295767 = sum of:
    0.068818994 = weight(_text_:interest in 2194) [ClassicSimilarity], result of:
      0.068818994 = score(doc=2194,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.27446008 = fieldWeight in 2194, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2194)
    0.014138679 = product of:
      0.028277358 = sum of:
        0.028277358 = weight(_text_:classification in 2194) [ClassicSimilarity], result of:
          0.028277358 = score(doc=2194,freq=2.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.17593184 = fieldWeight in 2194, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2194)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
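The score explanation above can be reproduced by hand. The following is a minimal sketch, assuming the classic Lucene TF-IDF formulas (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))); the function names are hypothetical, and the constants are copied from the explanation for document 2194.

```python
import math

QUERY_NORM = 0.05046903  # queryNorm, copied from the explanation
MAX_DOCS = 44218         # maxDocs, copied from the explanation


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency (assumed classic TF-IDF formula)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_weight(freq: float, doc_freq: int, field_norm: float) -> float:
    """Weight of one term: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)
    idf = classic_idf(doc_freq, MAX_DOCS)
    query_weight = idf * QUERY_NORM
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


# Term statistics for doc 2194, taken from the explanation.
interest = term_weight(freq=2.0, doc_freq=835, field_norm=0.0390625)
classification = term_weight(freq=2.0, doc_freq=4974, field_norm=0.0390625)

# The classification clause is scaled by coord(1/2); the sum of both
# clauses is then scaled by the outer coord(2/3).
score = (interest + classification * 0.5) * (2.0 / 3.0)

print(f"interest={interest:.6f} classification={classification:.6f} score={score:.6f}")
```

Small differences in the last digits are to be expected, because the search engine computes in single precision while Python uses double precision.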
Abstract

In the Thomson Reuters Web of Science database, the subject categories of a journal are applied to all articles in the journal. However, many articles in multidisciplinary sciences journals may only be represented by a small number of subject categories. To provide more accurate information on the research areas of articles in such journals, we can classify articles in these journals into subject categories as defined by Web of Science based on their references. For an article in a multidisciplinary sciences journal, the method counts the subject categories in all of the article's references indexed by Web of Science, and uses the most numerous subject categories of the references to determine the most appropriate classification of the article. We used articles in an issue of Proceedings of the National Academy of Sciences (PNAS) to validate the correctness of the method by comparing the obtained results with the categories of the articles as defined by PNAS and their content. This study shows that the method provides more precise search results for the subject category of interest in bibliometric investigations through recognition of articles in multidisciplinary sciences journals whose work relates to a particular subject category.
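The counting step described in this abstract can be sketched as follows. This is an illustration only: the function name, the `top_n` parameter, the alphabetical tie-breaking rule, and the sample category labels are assumptions, since the abstract does not state how many categories are kept or how ties are resolved.

```python
from collections import Counter


def classify_by_references(reference_categories, top_n=1):
    """Return the most frequent subject categories among an article's references.

    reference_categories: one list of category labels per indexed reference.
    top_n: how many categories to keep (hypothetical parameter).
    """
    counts = Counter(cat for cats in reference_categories for cat in cats)
    # Sort by descending count, then alphabetically so ties are deterministic
    # (the tie-breaking rule is an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:top_n]]


# Made-up sample data: four references, each with its journal's categories.
refs = [
    ["Cell Biology"],
    ["Cell Biology", "Biochemistry & Molecular Biology"],
    ["Neurosciences"],
    ["Cell Biology"],
]
top = classify_by_references(refs, top_n=2)
print(top)
```

References that are not indexed, and therefore carry no categories, can be passed as empty lists; an article with no indexed references yields an empty result.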

Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.05

0.053279206 = product of:
  0.15983762 = sum of:
    0.15983762 = sum of:
      0.11197262 = weight(_text_:classification in 2560) [ClassicSimilarity], result of:
        0.11197262 = score(doc=2560,freq=16.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.69665456 = fieldWeight in 2560, product of:
            4.0 = tf(freq=16.0), with freq of:
              16.0 = termFreq=16.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.0546875 = fieldNorm(doc=2560)
      0.047864996 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
        0.047864996 = score(doc=2560,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.2708308 = fieldWeight in 2560, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0546875 = fieldNorm(doc=2560)
  0.33333334 = coord(1/3)

Abstract: The proliferation of digital resources and their integration into a traditional library setting has created a pressing need for an automated tool that organizes textual information based on library classification schemes. Automated text classification is a research field concerned with developing tools, methods, and models to automate text classification. This article describes the current popular approach for text classification and major text classification projects and applications that are based on library classification schemes. Related issues and challenges are discussed, and a number of considerations for the challenges are examined.
Date: 22. 9.2008 18:31:54

Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.05

0.04827871 = product of:
  0.14483613 = sum of:
    0.14483613 = sum of:
      0.09697114 = weight(_text_:classification in 5273) [ClassicSimilarity], result of:
        0.09697114 = score(doc=5273,freq=12.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.60332054 = fieldWeight in 5273, product of:
            3.4641016 = tf(freq=12.0), with freq of:
              12.0 = termFreq=12.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.0546875 = fieldNorm(doc=5273)
      0.047864996 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
        0.047864996 = score(doc=5273,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.2708308 = fieldWeight in 5273, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0546875 = fieldNorm(doc=5273)
  0.33333334 = coord(1/3)

Abstract: In text categorization tasks, classification on some class hierarchies yields better results than classification without a hierarchy. Currently, because a large number of documents are divided into several subgroups in a hierarchy, we can appropriately use a hierarchical classification method. However, we have no systematic method to build a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
Date: 22. 7.2006 16:24:52

Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.05
```
0.045411773 = product of:
  0.068117656 = sum of:
    0.041291397 = weight(_text_:interest in 1253) [ClassicSimilarity], result of:
      0.041291397 = score(doc=1253,freq=2.0), product of:
        0.25074318 = queryWeight, product of:
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.05046903 = queryNorm
        0.16467606 = fieldWeight in 1253, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.9682584 = idf(docFreq=835, maxDocs=44218)
          0.0234375 = fieldNorm(doc=1253)
    0.026826259 = product of:
      0.053652517 = sum of:
        0.053652517 = weight(_text_:classification in 1253) [ClassicSimilarity], result of:
          0.053652517 = score(doc=1253,freq=20.0), product of:
            0.16072905 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05046903 = queryNorm
            0.33380723 = fieldWeight in 1253, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increase, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources as well as within a single source.
That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to extract the requisite collection metadata automatically that must be distributed.
We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a 'training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices.
The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
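The abstract mentions geographic classifications built as layers of progressively smaller longitude/latitude boxes. One possible reading is a quadtree-style subdivision, sketched below. This is an assumption for illustration only, not the Pharos implementation: the abstract does not say how the boxes are constructed, and the function name and the quadrant labels are hypothetical.

```python
def box_path(lat: float, lon: float, levels: int) -> list[str]:
    """Return the chain of nested boxes that contain a point.

    Each level splits the current box into four quadrants (an assumed
    scheme) and records the quadrant holding the point, so a longer
    path identifies a smaller box.
    """
    south, north = -90.0, 90.0
    west, east = -180.0, 180.0
    path = []
    for _ in range(levels):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        if lat >= mid_lat:
            ns, south = "N", mid_lat   # keep the northern half
        else:
            ns, north = "S", mid_lat   # keep the southern half
        if lon >= mid_lon:
            ew, west = "E", mid_lon    # keep the eastern half
        else:
            ew, east = "W", mid_lon    # keep the western half
        path.append(ns + ew)
    return path


# An arbitrary illustrative point (latitude 34.41, longitude -119.85).
path = box_path(34.41, -119.85, levels=3)
print(path)
```

Because every box is contained in the box one level above it, the path for a coarser level is always a prefix of the path for a finer level, which is the property that makes such a scheme hierarchical.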

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.04

0.04164443 = product of:
  0.12493329 = sum of:
    0.12493329 = sum of:
      0.056554716 = weight(_text_:classification in 2748) [ClassicSimilarity], result of:
        0.056554716 = score(doc=2748,freq=2.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.35186368 = fieldWeight in 2748, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.078125 = fieldNorm(doc=2748)
      0.06837857 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
        0.06837857 = score(doc=2748,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.38690117 = fieldWeight in 2748, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.078125 = fieldNorm(doc=2748)
  0.33333334 = coord(1/3)

Date: 1. 2.2016 18:25:22

Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.04
```
0.038967755 = product of:
  0.11690326 = sum of:
    0.11690326 = sum of:
      0.07587612 = weight(_text_:classification in 2760) [ClassicSimilarity], result of:
        0.07587612 = score(doc=2760,freq=10.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.4720747 = fieldWeight in 2760, product of:
            3.1622777 = tf(freq=10.0), with freq of:
              10.0 = termFreq=10.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.046875 = fieldNorm(doc=2760)
      0.04102714 = weight(_text_:22 in 2760) [ClassicSimilarity], result of:
        0.04102714 = score(doc=2760,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.23214069 = fieldWeight in 2760, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.046875 = fieldNorm(doc=2760)
  0.33333334 = coord(1/3)
```
Abstract

Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.

Date

22. 3.2009 19:11:54

Automatic classification research at OCLC (2002) 0.04

0.03881132 = product of:
  0.11643395 = sum of:
    0.11643395 = sum of:
      0.06856895 = weight(_text_:classification in 1563) [ClassicSimilarity], result of:
        0.06856895 = score(doc=1563,freq=6.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.42661208 = fieldWeight in 1563, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.0546875 = fieldNorm(doc=1563)
      0.047864996 = weight(_text_:22 in 1563) [ClassicSimilarity], result of:
        0.047864996 = score(doc=1563,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.2708308 = fieldWeight in 1563, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0546875 = fieldNorm(doc=1563)
  0.33333334 = coord(1/3)

Abstract: OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged and classified. Accordingly, OCLC has developed products, sponsored research projects, and encouraged the participation in international standards communities whose outcome has been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
Date: 5. 5.2003 9:22:09

Jenkins, C.: Automatic classification of Web resources using Java and Dewey Decimal Classification (1998) 0.04

0.03881132 = product of:
  0.11643395 = sum of:
    0.11643395 = sum of:
      0.06856895 = weight(_text_:classification in 1673) [ClassicSimilarity], result of:
        0.06856895 = score(doc=1673,freq=6.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.42661208 = fieldWeight in 1673, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.0546875 = fieldNorm(doc=1673)
      0.047864996 = weight(_text_:22 in 1673) [ClassicSimilarity], result of:
        0.047864996 = score(doc=1673,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.2708308 = fieldWeight in 1673, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0546875 = fieldNorm(doc=1673)
  0.33333334 = coord(1/3)

Abstract: The Wolverhampton Web Library (WWLib) is a WWW search engine that provides access to UK-based information. The experimental version, developed in 1995, was a success but highlighted the need for a much higher degree of automation. An interesting feature of the experimental WWLib was that it organised information according to DDC. Discusses the advantages of classification and describes the automatic classifier that is being developed in Java as part of the new, fully automated WWLib.
Date: 1. 8.1996 22:08:06

Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.03

0.033266842 = product of:
  0.09980053 = sum of:
    0.09980053 = sum of:
      0.058773387 = weight(_text_:classification in 2158) [ClassicSimilarity], result of:
        0.058773387 = score(doc=2158,freq=6.0), product of:
          0.16072905 = queryWeight, product of:
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.05046903 = queryNorm
          0.3656675 = fieldWeight in 2158, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            3.1847067 = idf(docFreq=4974, maxDocs=44218)
            0.046875 = fieldNorm(doc=2158)
      0.04102714 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
        0.04102714 = score(doc=2158,freq=2.0), product of:
          0.17673394 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.05046903 = queryNorm
          0.23214069 = fieldWeight in 2158, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.046875 = fieldNorm(doc=2158)
  0.33333334 = coord(1/3)

Abstract: This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
Date: 4. 8.2015 19:22:04
