Search (57 results, page 1 of 3)

  • theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.10
    
    Content
     Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf.
    Date
     8.1.2013 10:22:32
  2. HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.09
    
    Date
     1.2.2016 18:25:22
  3. Ko, Y.; Seo, J.: Text classification from unlabeled documents with bootstrapping and feature projection techniques (2009) 0.05
    
    Abstract
     Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, supervised learning approaches have some problems. The most notable is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficult to obtain because the labeling task must be done by human annotators. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then automatically learns a text classifier by using bootstrapping and feature projection techniques. The results of experiments showed that the proposed method achieved reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.
  4. Zhou, G.D.; Zhang, M.; Ji, D.H.; Zhu, Q.M.: Hierarchical learning strategy in semantic relation extraction (2008) 0.04
    
    Abstract
     This paper proposes a novel hierarchical learning strategy to deal with the data sparseness problem in semantic relation extraction by modeling the commonality among related classes. For each class in the hierarchy, either manually predefined or automatically clustered, a discriminative function is determined in a top-down way. As the upper-level class normally has many more positive training examples than the lower-level class, the corresponding discriminative function can be determined more reliably, and it can guide the learning of the discriminative function in the lower-level class more effectively, which otherwise might suffer from limited training data. In this paper, two classifier learning approaches, i.e. the simple perceptron algorithm and the state-of-the-art Support Vector Machines, are applied using the hierarchical learning strategy. Moreover, several kinds of class hierarchies, either manually predefined or automatically clustered, are explored and compared. Evaluation on the ACE RDC 2003 and 2004 corpora shows that the hierarchical learning strategy substantially improves the performance on least- and medium-frequent relations.
  5. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.03
    
    Abstract
     This paper presents a method that exploits the hierarchical structure of an indexing vocabulary to guide the development and training of machine learning methods for automatic text categorization. We present the design of a hierarchical classifier based on the divide-and-conquer principle. The method is evaluated using backpropagation neural networks as the machine learning algorithm, which learn to assign MeSH categories to a subset of MEDLINE records. Comparisons with traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers, are provided. The results indicate that the use of hierarchical structures improves performance significantly.
  6. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.03
    
    Abstract
     The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
  7. Pong, J.Y.-H.; Kwok, R.C.-W.; Lau, R.Y.-K.; Hao, J.-X.; Wong, P.C.-C.: ¬A comparative study of two automatic document classification methods in a library setting (2008) 0.03
    
    Abstract
     In current library practice, trained human experts usually carry out document cataloguing and indexing based on a manual approach. With the explosive growth in the number of electronic documents available on the Internet and in digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using just a manual approach. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of using automatic document classification methods to categorize library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine learning based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations regarding how to practically apply the KNN algorithm to develop automatic document classification in a library setting are made. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used LCC classification scheme adopted by many large libraries.
  8. Borodin, Y.; Polishchuk, V.; Mahmud, J.; Ramakrishnan, I.V.; Stent, A.: Live and learn from mistakes : a lightweight system for document classification (2013) 0.03
    
    Abstract
    We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
  9. Sebastiani, F.: ¬A tutorial on automated text categorisation (1999) 0.03
    
    Abstract
     The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late '80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert knowledge on how to classify documents. In the '90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest. A newer paradigm based on machine learning has superseded the previous approach. Within this paradigm, a general inductive process automatically builds a classifier by "learning", from a set of previously classified documents, the characteristics of one or more categories; the advantages are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues of document indexing, classifier construction, and classifier evaluation will be touched upon.
  10. Billal, B.; Fonseca, A.; Sadat, F.; Lounis, H.: Semi-supervised learning and social media text analysis towards multi-labeling categorization (2017) 0.03
    
    Abstract
     In traditional text classification, classes are mutually exclusive, i.e. it is not possible to have one text or text fragment classified into more than one class. On the other hand, in multi-label classification an individual text may belong to several classes simultaneously. This type of classification is required by a large number of current applications such as big data classification, image and video annotation. Supervised learning is the most used type of machine learning in the classification task. It requires large quantities of labeled data and the intervention of a human tagger in the creation of the training sets. When the data sets become very large or heavily noisy, this operation can be tedious, prone to error and time consuming. In this case, semi-supervised learning, which requires only few labels, is a better choice. In this paper, we study and evaluate several methods to address the problem of multi-label classification using semi-supervised learning and data from social networks. First, we propose a linguistic pre-processing involving tokenisation, recognition of named entities and hashtag segmentation in order to decrease the noise in this type of massive and unstructured real data, and then we perform a word sense disambiguation using WordNet. Second, several experiments related to multi-label classification and semi-supervised learning are carried out on these data sets and compared to each other. These evaluations compare the results of the approaches considered. This paper proposes a method for combining semi-supervised methods with a graph method for the extraction of subjects in social networks using a multi-label classification approach. Experiments show that the proposed model increases the precision of the classification by 4 percentage points when compared to a baseline.
  11. Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.02
    
    Abstract
    Hierarchical text classification (HTC) approaches have recently attracted a lot of interest on the part of researchers in human language technology and machine learning, since they have been shown to bring about equal, if not better, classification accuracy with respect to their "flat" counterparts while allowing exponential time savings at both learning and classification time. A typical component of HTC methods is a "local" policy for selecting negative examples: Given a category c, its negative training examples are by default identified with the training examples that are negative for c and positive for the categories which are siblings of c in the hierarchy. However, this policy has always been taken for granted and never been subjected to careful scrutiny since first proposed 15 years ago. This article proposes a thorough experimental comparison between this policy and three other policies for the selection of negative examples in HTC contexts, one of which (BEST LOCAL (k)) is being proposed for the first time in this article. We compare these policies on the hierarchical versions of three supervised learning algorithms (boosting, support vector machines, and naïve Bayes) by performing experiments on two standard TC datasets, REUTERS-21578 and RCV1-V2.
  12. Duwairi, R.M.: Machine learning for Arabic text categorization (2006) 0.02
    
    Abstract
    In this article we propose a distance-based classifier for categorizing Arabic text. Each category is represented as a vector of words in an m-dimensional space, and documents are classified on the basis of their closeness to feature vectors of categories. The classifier, in its learning phase, scans the set of training documents to extract features of categories that capture inherent category-specific properties; in its testing phase the classifier uses previously determined category-specific features to categorize unclassified documents. Stemming was used to reduce the dimensionality of feature vectors of documents. The accuracy of the classifier was tested by carrying out several categorization tasks on an in-house collected Arabic corpus. The results show that the proposed classifier is very accurate and robust.
  13. Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.02
    
    Abstract
     Current document-retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis (i.e., the ability to distinguish documents according to style) would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer (genre classifiers should be reusable across multiple topics), which does not arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple-topic domains. We also show how different feature-sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled.
  14. Golub, K.; Hamon, T.; Ardö, A.: Automated classification of textual documents based on a controlled vocabulary in engineering (2007) 0.02
    
    Abstract
     Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to the rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents - instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
  15. Golub, K.: Automated subject classification of textual documents in the context of Web-based hierarchical browsing (2011) 0.02
    
    Abstract
     While automated methods for information organization have been around for several decades now, exponential growth of the World Wide Web has put them into the forefront of research in different communities, within which several approaches can be identified: 1) machine learning (algorithms that allow computers to improve their performance based on learning from pre-existing data); 2) document clustering (algorithms for unsupervised document organization and automated topic extraction); and 3) string matching (algorithms that match given strings within larger text). Here the aim was to automatically organize textual documents into hierarchical structures for subject browsing. The string-matching approach was tested using a controlled vocabulary (containing pre-selected and pre-defined authorized terms, each corresponding to only one concept). The results imply that an appropriate controlled vocabulary, with a sufficient number of entry terms designating classes, could in itself be a solution for automated classification. Then, if the same controlled vocabulary had an appropriate hierarchical structure, it would at the same time provide a good browsing structure for the collection of automatically classified documents.
  16. Malo, P.; Sinha, A.; Wallenius, J.; Korhonen, P.: Concept-based document classification using Wikipedia and value function (2011) 0.02
    
    Abstract
    In this article, we propose a new concept-based method for document classification. The conceptual knowledge associated with the words is drawn from Wikipedia. The purpose is to utilize the abundant semantic relatedness information available in Wikipedia in an efficient value function-based query learning algorithm. The procedure learns the value function by solving a simple linear programming problem formulated using the training documents. The learning involves a step-wise iterative process that helps in generating a value function with an appropriate set of concepts (dimensions) chosen from a collection of concepts. Once the value function is formulated, it is utilized to make a decision between relevance and irrelevance. The value assigned to a particular document from the value function can be further used to rank the documents according to their relevance. Reuters newswire documents have been used to evaluate the efficacy of the procedure. An extensive comparison with other frameworks has been performed. The results are promising.
  17. Schaalje, G.B.; Blades, N.J.; Funai, T.: ¬An open-set size-adjusted Bayesian classifier for authorship attribution (2013) 0.02
    Abstract
    Recent studies of authorship attribution have used machine-learning methods including regularized multinomial logistic regression, neural nets, support vector machines, and the nearest shrunken centroid classifier to identify likely authors of disputed texts. These methods are all limited by an inability to perform open-set classification and account for text and corpus size. We propose a customized Bayesian logit-normal-beta-binomial classification model for supervised authorship attribution. The model is based on the beta-binomial distribution with an explicit inverse relationship between extra-binomial variation and text size. The model internally estimates the relationship of extra-binomial variation to text size, and uses Markov Chain Monte Carlo (MCMC) to produce distributions of posterior authorship probabilities instead of point estimates. We illustrate the method by training the machine-learning methods as well as the open-set Bayesian classifier on undisputed papers of The Federalist, and testing the method on documents historically attributed to Alexander Hamilton, John Jay, and James Madison. The Bayesian classifier was the best classifier of these texts.
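    The building block of such a classifier is the beta-binomial log-probability of observing k occurrences of a word in n tokens, which allows extra-binomial (overdispersed) variation through its two shape parameters. The sketch below shows only that likelihood computation with invented parameters; the full size-adjusted model and the MCMC estimation are not reproduced here.

```python
# Hedged sketch of the beta-binomial likelihood used in overdispersed
# word-count models: log P(k | n, a, b) via log-gamma functions.
from math import lgamma

def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_logpmf(k: int, n: int, a: float, b: float) -> float:
    """Log-pmf of the beta-binomial distribution."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)

# Compare two hypothetical authors with different expected word rates
# (parameters invented): 12 occurrences in 1,000 tokens fits the first
# author's rate (~1.3%) far better than the second's (~0.33%).
ll_author1 = betabinom_logpmf(12, 1000, a=2.0, b=150.0)
ll_author2 = betabinom_logpmf(12, 1000, a=2.0, b=600.0)
```

    Summing such log-likelihoods over many words, and sampling the parameters rather than fixing them, yields the posterior authorship probabilities the abstract refers to.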
  18. Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.02
    Date
    5. 5.2003 14:17:22
  19. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.02
    Abstract
    Purpose - The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language-modeling-based approach for this task. Design/methodology/approach - Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text, and a segmentation-based approach was compared with a non-segmentation-based approach. Findings - There were two findings: statistical language modeling can significantly outperform standard techniques, given the same set of features; and classification with word-level features normally yields improved classification performance, but classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually it stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications - Applying the findings to real web text classification is ongoing work. Originality/value - The paper is highly relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.
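    The language-modeling approach the abstract advocates can be illustrated with character bigrams, which sidestep word segmentation entirely: train one smoothed character-level model per class and assign a new text to the class whose model gives it the highest log-likelihood. The toy English training strings and the flat add-one smoothing below are invented simplifications of what a real system would use.

```python
# Sketch of language-model classification without word segmentation:
# per-class add-one-smoothed character-bigram models (toy training data).
from collections import defaultdict
from math import log

def train_bigram_lm(texts):
    """Count character bigrams; '^' marks the start of a text."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in texts:
        padded = "^" + text
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts

def log_prob(counts, text, vocab_size=10000):
    """Add-one-smoothed log-likelihood of the text under the model."""
    padded = "^" + text
    lp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        row = counts[prev]
        lp += log((row[cur] + 1) / (sum(row.values()) + vocab_size))
    return lp

models = {
    "greeting": train_bigram_lm(["hello there", "hello world"]),
    "farewell": train_bigram_lm(["goodbye now", "goodbye world"]),
}

def classify(text):
    """Assign the class whose language model best explains the text."""
    return max(models, key=lambda c: log_prob(models[c], text))
```

    Replacing bigrams with longer character n-grams, or with word features after segmentation, changes only the counting step, which is what makes the framework convenient for comparing segmented and unsegmented input.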
  20. AlQenaei, Z.M.; Monarchi, D.E.: ¬The use of learning techniques to analyze the results of a manual classification system (2016) 0.02
    Abstract
    Classification is the process of assigning objects to pre-defined classes based on observations or characteristics of those objects, and there are many approaches to performing this task. The overall objective of this study is to demonstrate the use of two learning techniques to analyze the results of a manual classification system. Our sample consisted of 1,026 documents, from the ACM Computing Classification System, classified by their authors as belonging to one of the groups of the classification system: "H.3 Information Storage and Retrieval." A singular value decomposition of the documents' weighted term-frequency matrix was used to represent each document in a 50-dimensional vector space. The analysis of the representation using both supervised (decision tree) and unsupervised (clustering) techniques suggests that two pairs of the ACM classes are closely related to each other in the vector space. Class 1 (Content Analysis and Indexing) is closely related to Class 3 (Information Search and Retrieval), and Class 4 (Systems and Software) is closely related to Class 5 (Online Information Services). Further analysis was performed to test the diffusion of the words in the two classes using both cosine and Euclidean distance.
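    The representation step described above can be sketched as follows: take the singular value decomposition of a weighted term-frequency matrix, keep only the strongest dimensions, and compare documents by cosine similarity in the reduced space. The tiny matrix below is an invented stand-in for the 1,026-document corpus, and two dimensions stand in for the study's fifty.

```python
# Minimal sketch: SVD of a term-frequency matrix, truncation, and
# cosine similarity between reduced document vectors (invented counts).
import numpy as np

# rows = documents, columns = terms
tf = np.array([
    [3, 0, 1, 0],   # doc about indexing
    [2, 1, 0, 0],   # doc about indexing
    [0, 3, 0, 2],   # doc about retrieval
    [0, 2, 1, 3],   # doc about retrieval
], dtype=float)

U, s, Vt = np.linalg.svd(tf, full_matrices=False)
k = 2                            # keep the two strongest dimensions
docs_k = U[:, :k] * s[:k]        # documents in the reduced space

def cosine(a, b):
    """Cosine similarity between two reduced document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_same = cosine(docs_k[0], docs_k[1])    # two indexing docs
sim_diff = cosine(docs_k[0], docs_k[2])    # indexing vs. retrieval
```

    Clustering and decision-tree analysis then operate on `docs_k`; classes whose documents sit close together in this space, like the pairs the abstract identifies, are the ones a learner will tend to confuse.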
