Search (112 results, page 1 of 6)

  • theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.14
    0.14434984 = sum of:
      0.08056292 = product of:
        0.24168874 = sum of:
          0.24168874 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
            0.24168874 = score(doc=562,freq=2.0), product of:
              0.43003735 = queryWeight, product of:
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.050723847 = queryNorm
              0.56201804 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
        0.33333334 = coord(1/3)
      0.04316979 = weight(_text_:based in 562) [ClassicSimilarity], result of:
        0.04316979 = score(doc=562,freq=4.0), product of:
          0.15283063 = queryWeight, product of:
            3.0129938 = idf(docFreq=5906, maxDocs=44218)
            0.050723847 = queryNorm
          0.28246817 = fieldWeight in 562, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            3.0129938 = idf(docFreq=5906, maxDocs=44218)
            0.046875 = fieldNorm(doc=562)
      0.020617142 = product of:
        0.041234285 = sum of:
          0.041234285 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
            0.041234285 = score(doc=562,freq=2.0), product of:
              0.17762627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.050723847 = queryNorm
              0.23214069 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
        0.5 = coord(1/2)
    
    Abstract
    Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well-known text corpora support our approach through consistent improvement of the results.
    Content
    Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf.
    Date
    8. 1.2013 10:22:32
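    The following is a minimal, hypothetical sketch of the idea described in entry 1: a bag-of-words representation is enriched with concept pseudo-tokens drawn from background knowledge, and a boosted ensemble of weak learners (decision stumps via scikit-learn's AdaBoost) performs the classification. The tiny CONCEPTS mapping, the toy corpus, and the choice of AdaBoost over stumps are illustrative assumptions, not the authors' implementation.

      # Sketch: bag-of-words enriched with concept pseudo-tokens, classified by boosting.
      # CONCEPTS is a hypothetical stand-in for background knowledge such as an ontology.
      from sklearn.ensemble import AdaBoostClassifier
      from sklearn.feature_extraction.text import CountVectorizer

      CONCEPTS = {"striker": "SPORT", "goal": "SPORT", "bond": "FINANCE", "yield": "FINANCE"}

      def add_concepts(text: str) -> str:
          # Append a pseudo-token for every concept triggered by a word in the text.
          concepts = [f"CONCEPT_{CONCEPTS[t]}" for t in text.lower().split() if t in CONCEPTS]
          return text + " " + " ".join(concepts)

      docs = ["the striker scored a late goal",
              "the bond yield rose sharply",
              "a goal from the young striker",
              "investors chased the higher yield"]
      labels = ["sport", "finance", "sport", "finance"]

      vectorizer = CountVectorizer()
      X = vectorizer.fit_transform(add_concepts(d) for d in docs)
      clf = AdaBoostClassifier(n_estimators=50).fit(X, labels)  # boosted decision stumps

      test = vectorizer.transform([add_concepts("another goal for the striker")])
      print(clf.predict(test))  # should print ['sport']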
  2. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.08
    0.080615476 = product of:
      0.12092321 = sum of:
        0.025438042 = weight(_text_:based in 2765) [ClassicSimilarity], result of:
          0.025438042 = score(doc=2765,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.16644597 = fieldWeight in 2765, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2765)
        0.09548517 = sum of:
          0.061123267 = weight(_text_:training in 2765) [ClassicSimilarity], result of:
            0.061123267 = score(doc=2765,freq=2.0), product of:
              0.23690371 = queryWeight, product of:
                4.67046 = idf(docFreq=1125, maxDocs=44218)
                0.050723847 = queryNorm
              0.2580089 = fieldWeight in 2765, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.67046 = idf(docFreq=1125, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2765)
          0.034361906 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
            0.034361906 = score(doc=2765,freq=2.0), product of:
              0.17762627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.050723847 = queryNorm
              0.19345059 = fieldWeight in 2765, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2765)
      0.6666667 = coord(2/3)
    
    Abstract
    Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, hidden text is injected into passages of a document. Rather than matching query terms against passages to determine their relevance, the passages are classified using text-mining techniques. Documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms the other document-splitting approaches by 12% to 18% in the passage-detection and passage category-prediction tasks, a statistically significant improvement (99% confidence). Furthermore, we evaluate the effects of feature selection, passage length, ambiguous passages, and training-data category distribution on passage-detection accuracy.
    Date
    22. 3.2009 19:14:43
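    A minimal sketch of the generic passage-detection idea from entry 2: split a document into passages and classify each passage into predetermined categories instead of ranking it against a query. The fixed-size word-window splitter and the naive Bayes model below are assumed simplifications; they are not the paper's keyword-based dynamic passage (KDP) approach.

      # Sketch: split a document into passages and classify each passage independently.
      # The fixed-size window splitter is an assumed simplification, not the paper's KDP method.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline

      train_passages = ["quarterly revenue and profit figures",
                        "the match ended with a dramatic penalty",
                        "central bank interest rate decision",
                        "the league table after ten games"]
      train_labels = ["finance", "sport", "finance", "sport"]

      model = make_pipeline(TfidfVectorizer(), MultinomialNB())
      model.fit(train_passages, train_labels)

      def passages(text: str, size: int = 8):
          # Naive splitter: consecutive windows of `size` words form the passages.
          words = text.split()
          return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

      document = ("the league table after ten games shows a tight race "
                  "meanwhile the central bank kept its interest rate unchanged")
      for p in passages(document):
          print(model.predict([p])[0], "<-", p)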
  3. Yoon, Y.; Lee, G.G.: Efficient implementation of associative classifiers for document classification (2007) 0.07
    0.06924906 = product of:
      0.10387358 = sum of:
        0.03052565 = weight(_text_:based in 909) [ClassicSimilarity], result of:
          0.03052565 = score(doc=909,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 909, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=909)
        0.073347926 = product of:
          0.14669585 = sum of:
            0.14669585 = weight(_text_:training in 909) [ClassicSimilarity], result of:
              0.14669585 = score(doc=909,freq=8.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.6192214 = fieldWeight in 909, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=909)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    In practical text classification tasks, the ability to interpret the classification result is as important as the ability to classify correctly. Associative classifiers have many favorable characteristics such as rapid training, good classification accuracy, and excellent interpretability. However, associative classifiers also have some obstacles to overcome when they are applied to text classification. The target text collection generally has a very high dimensionality, so the training process might take a very long time. We propose a feature selection based on the mutual information between the word and class variables to reduce the space dimension of the associative classifiers. In addition, the training process of the associative classifier produces a huge number of classification rules, which makes prediction on a new document ineffective. We resolve this by introducing a new efficient method for storing and pruning classification rules. This method can also be used when predicting a test document. Experimental results using the 20-newsgroups dataset show many benefits of associative classification in both training and prediction when applied to a real-world problem.
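    The dimensionality-reduction step described above can be sketched with off-the-shelf components: score every word by its mutual information with the class variable and keep only the top-scoring words before training. The associative classifier itself is not reproduced; a logistic regression model stands in, and the toy corpus and the k=8 cut-off are assumptions.

      # Sketch: reduce the word space with mutual information before training a classifier.
      # The associative classifier itself is not reproduced; logistic regression stands in.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.feature_selection import SelectKBest, mutual_info_classif
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      docs = ["parliament passed the budget law",
              "the budget deficit widened this quarter",
              "the striker scored the winning goal",
              "a late goal sealed the cup final"]
      labels = ["politics", "politics", "sport", "sport"]

      model = make_pipeline(
          CountVectorizer(),
          SelectKBest(mutual_info_classif, k=8),   # keep the 8 most informative words
          LogisticRegression(max_iter=1000),
      )
      model.fit(docs, labels)
      print(model.predict(["another goal for the young striker"]))  # should print ['sport']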
  4. Ahmed, M.; Mukhopadhyay, M.; Mukhopadhyay, P.: Automated knowledge organization : AI ML based subject indexing system for libraries (2023) 0.06
    0.06466287 = product of:
      0.096994296 = sum of:
        0.044059984 = weight(_text_:based in 977) [ClassicSimilarity], result of:
          0.044059984 = score(doc=977,freq=6.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.28829288 = fieldWeight in 977, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=977)
        0.052934308 = product of:
          0.105868615 = sum of:
            0.105868615 = weight(_text_:training in 977) [ClassicSimilarity], result of:
              0.105868615 = score(doc=977,freq=6.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.44688457 = fieldWeight in 977, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=977)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The research study as reported here is an attempt to explore the possibilities of an AI/ML-based semi-automated indexing system in a library setup to handle large volumes of documents. It uses the Python virtual environment to install and configure an open source AI environment (named Annif) to feed the LOD (Linked Open Data) dataset of Library of Congress Subject Headings (LCSH) as a standard KOS (Knowledge Organisation System). The framework deployed the Turtle format of LCSH after cleaning the file with Skosify, applied an array of backend algorithms (namely TF-IDF, Omikuji, and NN-Ensemble) to measure relative performance, and selected Snowball as an analyser. The training of Annif was conducted with a large set of bibliographic records populated with subject descriptors (MARC tag 650$a) and indexed by trained LIS professionals. The training dataset is first treated with MarcEdit to export it in a format suitable for OpenRefine, and then in OpenRefine it undergoes many steps to produce a bibliographic record set suitable to train Annif. The framework, after training, has been tested with a bibliographic dataset to measure indexing efficiencies, and finally, the automated indexing framework is integrated with data wrangling software (OpenRefine) to produce suggested headings on a mass scale. The entire framework is based on open-source software, open datasets, and open standards.
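    Annif itself provides the trained TF-IDF, Omikuji, and NN-Ensemble backends mentioned above, so the snippet below is only a hypothetical illustration of the underlying idea and not Annif's API: suggest subject headings for a new record by comparing it, via TF-IDF and cosine similarity, with records that trained LIS professionals have already indexed. The sample records and headings are invented.

      # Sketch of the idea only: suggest subject headings for a new record by comparing it
      # (TF-IDF + cosine similarity) with already-indexed records. Not Annif's API.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # (title, assigned subject heading) pairs, e.g. exported from MARC 650$a fields
      records = [
          ("Introduction to machine learning with python", "Machine learning"),
          ("Deep learning for natural language processing", "Machine learning"),
          ("A history of the medieval university", "Education, Medieval"),
          ("Cataloguing and classification in libraries", "Cataloging"),
      ]
      titles = [t for t, _ in records]
      subjects = [s for _, s in records]

      vec = TfidfVectorizer()
      X = vec.fit_transform(titles)

      def suggest(text: str, top_n: int = 2):
          sims = cosine_similarity(vec.transform([text]), X)[0]
          best = {}
          for subject, sim in zip(subjects, sims):  # keep each subject's best-matching record
              best[subject] = max(best.get(subject, 0.0), float(sim))
          ranked = sorted(best.items(), key=lambda kv: -kv[1])
          return [(s, round(v, 3)) for s, v in ranked[:top_n]]

      print(suggest("Natural language processing with neural networks"))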
  5. Golub, K.; Hamon, T.; Ardö, A.: Automated classification of textual documents based on a controlled vocabulary in engineering (2007) 0.06
    0.0633564 = product of:
      0.0950346 = sum of:
        0.04316979 = weight(_text_:based in 1461) [ClassicSimilarity], result of:
          0.04316979 = score(doc=1461,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.28246817 = fieldWeight in 1461, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=1461)
        0.051864814 = product of:
          0.10372963 = sum of:
            0.10372963 = weight(_text_:training in 1461) [ClassicSimilarity], result of:
              0.10372963 = score(doc=1461,freq=4.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.43785566 = fieldWeight in 1461, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1461)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents - instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
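    A minimal sketch of the string-matching approach described above: terms from a controlled vocabulary are matched against title and abstract, hits are combined with simple term weights, and a cut-off decides which classes are assigned; no training documents are needed. The three-term vocabulary, the weights, and the cut-off value are illustrative assumptions standing in for the Engineering Information thesaurus setup.

      # Sketch: assign classes by matching controlled-vocabulary terms against title + abstract,
      # with simple term weights and a cut-off - no training documents required.
      import re
      from collections import defaultdict

      VOCABULARY = {  # term -> (class, weight); a tiny stand-in for a real thesaurus
          "heat exchanger": ("641.2 Heat Transfer", 2.0),
          "turbulent flow": ("631.1 Fluid Flow", 1.5),
          "finite element": ("921 Mathematics", 1.0),
      }

      def classify(title: str, abstract: str, cutoff: float = 1.0):
          text = f"{title} {abstract}".lower()
          scores = defaultdict(float)
          for term, (cls, weight) in VOCABULARY.items():
              hits = len(re.findall(re.escape(term), text))
              scores[cls] += weight * hits
          return sorted((c for c, s in scores.items() if s >= cutoff),
                        key=lambda c: -scores[c])

      print(classify("Turbulent flow in a compact heat exchanger",
                     "A finite element model of the heat exchanger is presented."))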
  6. Hung, C.-M.; Chien, L.-F.: Web-based text classification in the absence of manually labeled training documents (2007) 0.06
    0.06269788 = product of:
      0.094046816 = sum of:
        0.03052565 = weight(_text_:based in 87) [ClassicSimilarity], result of:
          0.03052565 = score(doc=87,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 87, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=87)
        0.06352116 = product of:
          0.12704232 = sum of:
            0.12704232 = weight(_text_:training in 87) [ClassicSimilarity], result of:
              0.12704232 = score(doc=87,freq=6.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.53626144 = fieldWeight in 87, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=87)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Most text classification techniques assume that manually labeled documents (corpora) can be easily obtained while learning text classifiers. However, labeled training documents are sometimes unavailable or inadequate even when they are available. The goal of this article is to present a self-learned approach to extract high-quality training documents from the Web when the required manually labeled documents are unavailable or of poor quality. To learn a text classifier automatically, we need only a set of user-defined categories and some highly related keywords. Extensive experiments are conducted to evaluate the performance of the proposed approach using the test set from the Reuters-21578 news data set. The experiments show that very promising results can be achieved by using only automatically extracted documents from the Web.
  7. Malo, P.; Sinha, A.; Wallenius, J.; Korhonen, P.: Concept-based document classification using Wikipedia and value function (2011) 0.06
    0.059697293 = product of:
      0.089545935 = sum of:
        0.052871976 = weight(_text_:based in 4948) [ClassicSimilarity], result of:
          0.052871976 = score(doc=4948,freq=6.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.34595144 = fieldWeight in 4948, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=4948)
        0.036673963 = product of:
          0.073347926 = sum of:
            0.073347926 = weight(_text_:training in 4948) [ClassicSimilarity], result of:
              0.073347926 = score(doc=4948,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3096107 = fieldWeight in 4948, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4948)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    In this article, we propose a new concept-based method for document classification. The conceptual knowledge associated with the words is drawn from Wikipedia. The purpose is to utilize the abundant semantic relatedness information available in Wikipedia in an efficient value function-based query learning algorithm. The procedure learns the value function by solving a simple linear programming problem formulated using the training documents. The learning involves a step-wise iterative process that helps in generating a value function with an appropriate set of concepts (dimensions) chosen from a collection of concepts. Once the value function is formulated, it is utilized to make a decision between relevance and irrelevance. The value assigned to a particular document from the value function can be further used to rank the documents according to their relevance. Reuters newswire documents have been used to evaluate the efficacy of the procedure. An extensive comparison with other frameworks has been performed. The results are promising.
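    The value-function step described above can be sketched as a small linear program: learn concept weights so that relevant training documents score at or above +1 and irrelevant ones at or below -1, minimising the total slack. The concept vectors, labels, and the exact LP formulation below are illustrative assumptions, not the authors' Wikipedia-based system.

      # Sketch of the value-function idea as a linear program (scipy), not the authors' system:
      # learn concept weights w so that relevant documents score >= +1 and irrelevant <= -1.
      import numpy as np
      from scipy.optimize import linprog

      # rows = documents over 3 made-up concepts; labels: +1 relevant, -1 irrelevant
      X = np.array([[2.0, 0.0, 1.0],
                    [1.5, 0.5, 0.0],
                    [0.0, 2.0, 0.5],
                    [0.5, 1.5, 0.0]])
      y = np.array([1, 1, -1, -1])

      n, d = X.shape
      # variables: [w_1..w_d, slack_1..slack_n]; minimise the total slack
      c = np.concatenate([np.zeros(d), np.ones(n)])
      # one constraint per document: -y_i * (w . x_i) - slack_i <= -1
      A_ub = np.hstack([-(y[:, None] * X), -np.eye(n)])
      b_ub = -np.ones(n)
      bounds = [(None, None)] * d + [(0, None)] * n

      res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
      w = res.x[:d]
      print("concept weights:", np.round(w, 3))
      print("document scores:", np.round(X @ w, 3))  # the sign gives the relevance decision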
  8. HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.06
    0.056825332 = product of:
      0.085237995 = sum of:
        0.050876085 = weight(_text_:based in 2748) [ClassicSimilarity], result of:
          0.050876085 = score(doc=2748,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.33289194 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
        0.034361906 = product of:
          0.06872381 = sum of:
            0.06872381 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
              0.06872381 = score(doc=2748,freq=2.0), product of:
                0.17762627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050723847 = queryNorm
                0.38690117 = fieldWeight in 2748, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.078125 = fieldNorm(doc=2748)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Date
    1. 2.2016 18:25:22
    Source
    Semantic keyword-based search on structured data sources: First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers. Eds.: J. Cardoso et al
  9. HaCohen-Kerner, Y.; Beck, H.; Yehudai, E.; Rosenstein, M.; Mughaz, D.: Cuisine : classification using stylistic feature sets and/or name-based feature sets (2010) 0.05
    0.054291815 = product of:
      0.08143772 = sum of:
        0.050876085 = weight(_text_:based in 3706) [ClassicSimilarity], result of:
          0.050876085 = score(doc=3706,freq=8.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.33289194 = fieldWeight in 3706, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3706)
        0.030561633 = product of:
          0.061123267 = sum of:
            0.061123267 = weight(_text_:training in 3706) [ClassicSimilarity], result of:
              0.061123267 = score(doc=3706,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2580089 = fieldWeight in 3706, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3706)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classification tasks: ethnic groups of the authors and/or periods of time when the documents were written and/or places where the documents were written. The investigated corpus contains Jewish Law articles written in Hebrew-Aramaic, which present interesting problems for classification. Our system CUISINE (Classification UsIng Stylistic feature sets and/or NamE-based feature sets) achieves accuracies between 90.71% and 98.99% for the seven classification experiments (ethnicity, time, place, ethnicity&time, ethnicity&place, time&place, ethnicity&time&place). For the first six tasks, the stylistic feature sets in general and the quantitative feature set in particular are enough for excellent classification results. In contrast, the name-based feature sets are rather poor for these tasks. However, for the most complex task (ethnicity&time&place), a hill-climbing model using all feature sets succeeds in significantly improving the classification results. Most of the stylistic features (34 of 42) are language-independent and domain-independent. These features might be useful to the community at large, at least for rather simple tasks.
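    A toy sketch of classification with hand-crafted stylistic features, in the spirit of the quantitative feature set described above (the real system uses 42 stylistic and 234 name-based features): compute a few surface statistics per text and train an ordinary classifier on them. The particular features, function-word list, labels, and texts below are invented for illustration.

      # Sketch: a few hand-crafted stylistic features feeding a standard classifier.
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      FUNCTION_WORDS = ["the", "of", "and", "to", "in"]

      def stylistic_features(text):
          words = text.lower().split()
          n = max(len(words), 1)
          avg_word_len = sum(len(w) for w in words) / n
          type_token_ratio = len(set(words)) / n
          fw_rates = [words.count(fw) / n for fw in FUNCTION_WORDS]
          return [avg_word_len, type_token_ratio, *fw_rates]

      texts = ["the ruling of the court of the north in the matter of tithes",
               "and he said to them go to the city and wait",
               "the statutes of the community and the duties of the members",
               "and they went to the river and to the hills"]
      styles = ["legal", "narrative", "legal", "narrative"]

      X = np.array([stylistic_features(t) for t in texts])
      clf = LogisticRegression(max_iter=1000).fit(X, styles)
      # likely "narrative", given the high rate of "and" and the absence of "of"
      print(clf.predict([stylistic_features("and he rose and went to the gate")]))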
  10. Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.05
    0.053229168 = product of:
      0.07984375 = sum of:
        0.04316979 = weight(_text_:based in 3015) [ClassicSimilarity], result of:
          0.04316979 = score(doc=3015,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.28246817 = fieldWeight in 3015, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=3015)
        0.036673963 = product of:
          0.073347926 = sum of:
            0.073347926 = weight(_text_:training in 3015) [ClassicSimilarity], result of:
              0.073347926 = score(doc=3015,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3096107 = fieldWeight in 3015, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3015)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
  11. Han, K.; Rezapour, R.; Nakamura, K.; Devkota, D.; Miller, D.C.; Diesner, J.: An expert-in-the-loop method for domain-specific document categorization based on small training data (2023) 0.05
    0.052797 = product of:
      0.0791955 = sum of:
        0.035974823 = weight(_text_:based in 967) [ClassicSimilarity], result of:
          0.035974823 = score(doc=967,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23539014 = fieldWeight in 967, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=967)
        0.043220676 = product of:
          0.08644135 = sum of:
            0.08644135 = weight(_text_:training in 967) [ClassicSimilarity], result of:
              0.08644135 = score(doc=967,freq=4.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3648797 = fieldWeight in 967, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=967)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in-depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio-ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.
  12. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.05
    0.05226637 = product of:
      0.078399554 = sum of:
        0.03561326 = weight(_text_:based in 1595) [ClassicSimilarity], result of:
          0.03561326 = score(doc=1595,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23302436 = fieldWeight in 1595, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1595)
        0.04278629 = product of:
          0.08557258 = sum of:
            0.08557258 = weight(_text_:training in 1595) [ClassicSimilarity], result of:
              0.08557258 = score(doc=1595,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3612125 = fieldWeight in 1595, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1595)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    This paper presents a method that exploits the hierarchical structure of an indexing vocabulary to guide the development and training of machine learning methods for automatic text categorization. We present the design of a hierarchical classifier based on the divide-and-conquer principle. The method is evaluated using backpropagation neural networks as the machine learning algorithm, which learn to assign MeSH categories to a subset of MEDLINE records. Comparisons with the traditional Rocchio algorithm adapted for text categorization, as well as with flat neural network classifiers, are provided. The results indicate that the use of hierarchical structures improves performance significantly.
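    The divide-and-conquer design described above can be sketched as one classifier per node of a small hierarchy: a root classifier picks the top-level branch, and a branch-specific classifier refines the decision. The paper evaluates the approach with backpropagation neural networks; logistic regression is used here only to keep the sketch short, and the two-level hierarchy and documents are invented.

      # Sketch of the divide-and-conquer idea: one classifier per node of a small hierarchy.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      train = {
          ("diseases", "cardiology"): ["myocardial infarction risk factors"],
          ("diseases", "oncology"):   ["tumour growth and metastasis"],
          ("chemicals", "enzymes"):   ["kinase inhibition assay results"],
          ("chemicals", "polymers"):  ["polymer chain cross linking"],
      }
      docs = [d for texts in train.values() for d in texts]
      top  = [k[0] for k, texts in train.items() for _ in texts]
      leaf = [k[1] for k, texts in train.items() for _ in texts]

      root_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)).fit(docs, top)
      node_clf = {}
      for branch in set(top):
          idx = [i for i, t in enumerate(top) if t == branch]
          pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
          node_clf[branch] = pipe.fit([docs[i] for i in idx], [leaf[i] for i in idx])

      def classify(text: str):
          branch = root_clf.predict([text])[0]          # decide the top-level category first,
          return branch, node_clf[branch].predict([text])[0]  # then refine within that branch

      print(classify("kinase activity and enzyme inhibition"))  # likely ('chemicals', 'enzymes')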
  13. Maghsoodi, N.; Homayounpour, M.M.: Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection (2011) 0.05
    0.052248236 = product of:
      0.07837235 = sum of:
        0.025438042 = weight(_text_:based in 4775) [ClassicSimilarity], result of:
          0.025438042 = score(doc=4775,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.16644597 = fieldWeight in 4775, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4775)
        0.052934308 = product of:
          0.105868615 = sum of:
            0.105868615 = weight(_text_:training in 4775) [ClassicSimilarity], result of:
              0.105868615 = score(doc=4775,freq=6.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.44688457 = fieldWeight in 4775, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4775)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The progressive increase of information content has recently made it necessary to create a system for automatic classification of documents. In this article, a system is presented for the categorization of multiclass Farsi documents that requires fewer training examples and can help to compensate for the shortcomings of the standard training dataset. The new idea proposed in the present article is based on extending the feature vector by adding some words extracted from a thesaurus and then filtering the new feature vector by applying secondary feature selection to discard inappropriate features. In fact, a phase of secondary feature selection is applied to choose more appropriate features among those added from the thesaurus, to enhance the effect of using a thesaurus on the efficiency of the classifier. To evaluate the proposed system, a corpus was gathered from the Farsi Wikipedia website and some articles in the Hamshahri newspaper, the Roshd periodical, and the Soroush magazine. In addition to studying the role of the thesaurus and applying secondary feature selection, the effects of varying the number of categories, the size of the training dataset, and the average number of words in the test data are also examined. As the results indicate, classification efficiency improves by applying this approach, especially when the available data are not sufficient for some text categories.
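    A compact sketch of the two ideas described above: each document's feature vector is extended with words from a thesaurus, and a secondary feature-selection pass then discards the less useful features before classification. The mini-thesaurus, the chi-squared criterion, and the k=8 cut-off are illustrative assumptions rather than the paper's exact configuration.

      # Sketch: expand documents with thesaurus terms, then apply a secondary feature-selection
      # pass (chi-squared here) so that only the useful added features are kept.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.feature_selection import SelectKBest, chi2
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline

      THESAURUS = {"football": ["soccer"], "economy": ["economics", "finance"]}

      def expand(text: str) -> str:
          # Append thesaurus synonyms of any word that appears in the text.
          extra = [syn for w in text.lower().split() for syn in THESAURUS.get(w, [])]
          return text + " " + " ".join(extra)

      docs = ["the national football league results",
              "growth of the economy slowed down",
              "football fans filled the stadium",
              "the economy and the labour market"]
      labels = ["sport", "economy", "sport", "economy"]

      model = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=8), MultinomialNB())
      model.fit([expand(d) for d in docs], labels)
      print(model.predict([expand("soccer match in the new stadium")]))  # should print ['sport']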
  14. Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.05
    0.04961206 = product of:
      0.07441809 = sum of:
        0.050364755 = weight(_text_:based in 2560) [ClassicSimilarity], result of:
          0.050364755 = score(doc=2560,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.3295462 = fieldWeight in 2560, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
        0.024053333 = product of:
          0.048106667 = sum of:
            0.048106667 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
              0.048106667 = score(doc=2560,freq=2.0), product of:
                0.17762627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2708308 = fieldWeight in 2560, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2560)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The proliferation of digital resources and their integration into a traditional library setting has created a pressing need for an automated tool that organizes textual information based on library classification schemes. Automated text classification is a research field of developing tools, methods, and models to automate text classification. This article describes the current popular approach for text classification and major text classification projects and applications that are based on library classification schemes. Related issues and challenges are discussed, and a number of considerations for the challenges are examined.
    Date
    22. 9.2008 18:31:54
  15. Duwairi, R.M.: Machine learning for Arabic text categorization (2006) 0.04
    0.044799745 = product of:
      0.06719962 = sum of:
        0.03052565 = weight(_text_:based in 5115) [ClassicSimilarity], result of:
          0.03052565 = score(doc=5115,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 5115, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=5115)
        0.036673963 = product of:
          0.073347926 = sum of:
            0.073347926 = weight(_text_:training in 5115) [ClassicSimilarity], result of:
              0.073347926 = score(doc=5115,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3096107 = fieldWeight in 5115, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5115)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    In this article we propose a distance-based classifier for categorizing Arabic text. Each category is represented as a vector of words in an m-dimensional space, and documents are classified on the basis of their closeness to feature vectors of categories. The classifier, in its learning phase, scans the set of training documents to extract features of categories that capture inherent category-specific properties; in its testing phase the classifier uses previously determined category-specific features to categorize unclassified documents. Stemming was used to reduce the dimensionality of feature vectors of documents. The accuracy of the classifier was tested by carrying out several categorization tasks on an in-house collected Arabic corpus. The results show that the proposed classifier is very accurate and robust.
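    A minimal sketch of the distance-based classifier described above: each category is represented by the centroid of its training documents' TF-IDF vectors, and a new document is assigned to the category whose centroid is closest by cosine similarity. English toy documents are used here in place of the Arabic corpus, and the stemming step is omitted.

      # Sketch: centroid-based classification, assigning a document to the closest category vector.
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      docs = ["the team won the championship match",
              "parliament debated the new election law",
              "the coach praised the young players",
              "the minister announced the reform bill"]
      labels = ["sport", "politics", "sport", "politics"]

      vec = TfidfVectorizer()
      X = vec.fit_transform(docs).toarray()
      categories = sorted(set(labels))
      centroids = np.array([X[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
                            for c in categories])

      def classify(text: str) -> str:
          sims = cosine_similarity(vec.transform([text]).toarray(), centroids)[0]
          return categories[int(sims.argmax())]

      print(classify("the players celebrated the championship"))  # should print 'sport'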
  16. Ko, Y.; Seo, J.: Text classification from unlabeled documents with bootstrapping and feature projection techniques (2009) 0.04
    0.044799745 = product of:
      0.06719962 = sum of:
        0.03052565 = weight(_text_:based in 2452) [ClassicSimilarity], result of:
          0.03052565 = score(doc=2452,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 2452, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=2452)
        0.036673963 = product of:
          0.073347926 = sum of:
            0.073347926 = weight(_text_:training in 2452) [ClassicSimilarity], result of:
              0.073347926 = score(doc=2452,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3096107 = fieldWeight in 2452, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2452)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, an approach generally known as supervised learning. However, the supervised learning approaches have some problems. The most notable problem is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficult to generate because the labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then it automatically learns a text classifier by using bootstrapping and feature projection techniques. The results of experiments showed that the proposed method achieved reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.
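    The bootstrapping step described above can be sketched as follows (feature projection is not reproduced): documents containing a category's title word are seed-labeled, an initial classifier is trained on those seeds, and it then labels the remaining unlabeled documents. The title words, toy documents, and single bootstrapping round are illustrative assumptions.

      # Sketch of the bootstrapping idea only: seed-label unlabeled documents via category
      # title words, train an initial classifier, then label the rest with it.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline

      title_words = {"sport": "football", "politics": "election"}
      unlabeled = ["the football season starts with a big match next week",
                   "the election campaign has begun with a televised debate",
                   "a thrilling match in the league",        # no title word present
                   "the candidates held another debate"]      # no title word present

      seed_docs, seed_labels, rest = [], [], []
      for doc in unlabeled:
          hits = [c for c, w in title_words.items() if w in doc]
          if hits:
              seed_docs.append(doc)
              seed_labels.append(hits[0])
          else:
              rest.append(doc)

      model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(seed_docs, seed_labels)
      for doc in rest:                      # one bootstrapping round over the remaining docs
          print(model.predict([doc])[0], "<-", doc)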
  17. Schaalje, G.B.; Blades, N.J.; Funai, T.: An open-set size-adjusted Bayesian classifier for authorship attribution (2013) 0.04
    0.044799745 = product of:
      0.06719962 = sum of:
        0.03052565 = weight(_text_:based in 1041) [ClassicSimilarity], result of:
          0.03052565 = score(doc=1041,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 1041, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=1041)
        0.036673963 = product of:
          0.073347926 = sum of:
            0.073347926 = weight(_text_:training in 1041) [ClassicSimilarity], result of:
              0.073347926 = score(doc=1041,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3096107 = fieldWeight in 1041, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1041)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Recent studies of authorship attribution have used machine-learning methods including regularized multinomial logistic regression, neural nets, support vector machines, and the nearest shrunken centroid classifier to identify likely authors of disputed texts. These methods are all limited by an inability to perform open-set classification and account for text and corpus size. We propose a customized Bayesian logit-normal-beta-binomial classification model for supervised authorship attribution. The model is based on the beta-binomial distribution with an explicit inverse relationship between extra-binomial variation and text size. The model internally estimates the relationship of extra-binomial variation to text size, and uses Markov Chain Monte Carlo (MCMC) to produce distributions of posterior authorship probabilities instead of point estimates. We illustrate the method by training the machine-learning methods as well as the open-set Bayesian classifier on undisputed papers of The Federalist, and testing the method on documents historically attributed to Alexander Hamilton, John Jay, and James Madison. The Bayesian classifier was the best classifier of these texts.
  18. Mu, T.; Goulermas, J.Y.; Korkontzelos, I.; Ananiadou, S.: Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities (2016) 0.04
    0.04435764 = product of:
      0.06653646 = sum of:
        0.035974823 = weight(_text_:based in 2496) [ClassicSimilarity], result of:
          0.035974823 = score(doc=2496,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23539014 = fieldWeight in 2496, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2496)
        0.030561633 = product of:
          0.061123267 = sum of:
            0.061123267 = weight(_text_:training in 2496) [ClassicSimilarity], result of:
              0.061123267 = score(doc=2496,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2580089 = fieldWeight in 2496, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2496)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases using a ranking scheme with multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL is demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.
  19. Jenkins, C.: Automatic classification of Web resources using Java and Dewey Decimal Classification (1998) 0.04
    0.039777733 = product of:
      0.059666596 = sum of:
        0.03561326 = weight(_text_:based in 1673) [ClassicSimilarity], result of:
          0.03561326 = score(doc=1673,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23302436 = fieldWeight in 1673, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1673)
        0.024053333 = product of:
          0.048106667 = sum of:
            0.048106667 = weight(_text_:22 in 1673) [ClassicSimilarity], result of:
              0.048106667 = score(doc=1673,freq=2.0), product of:
                0.17762627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2708308 = fieldWeight in 1673, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1673)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The Wolverhampton Web Library (WWLib) is a WWW search engine that provides access to UK-based information. The experimental version, developed in 1995, was a success but highlighted the need for a much higher degree of automation. An interesting feature of the experimental WWLib was that it organised information according to DDC. Discusses the advantages of classification and describes the automatic classifier that is being developed in Java as part of the new, fully automated WWLib.
    Date
    1. 8.1996 22:08:06
  20. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.04
    0.037638705 = product of:
      0.056458056 = sum of:
        0.03052565 = weight(_text_:based in 1253) [ClassicSimilarity], result of:
          0.03052565 = score(doc=1253,freq=8.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 1253, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
        0.025932407 = product of:
          0.051864814 = sum of:
            0.051864814 = weight(_text_:training in 1253) [ClassicSimilarity], result of:
              0.051864814 = score(doc=1253,freq=4.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.21892783 = fieldWeight in 1253, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0234375 = fieldNorm(doc=1253)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1.000.000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources as well as within a single source. That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to extract the requisite collection metadata automatically that must be distributed.
    We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a `training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
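    The prototype workflow described above (catalog records as a training set, LSI for retrieval, user queries mapped to classification categories) can be sketched with standard components: build an LSI space over category-labeled records and rank categories by how similar their best-matching records are to the query. The six records, the LCC-style labels, and the three-dimensional LSI space are invented for illustration and are not the Pharos implementation.

      # Sketch: LSI (TF-IDF + truncated SVD) over category-labeled catalog records,
      # then rank classification categories for a user query by record similarity.
      import numpy as np
      from sklearn.decomposition import TruncatedSVD
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity
      from sklearn.pipeline import make_pipeline

      records = ["prostate cancer diagnosis and treatment",
                 "oncology nursing and patient care",
                 "satellite remote sensing of vegetation",
                 "aerial photography and image interpretation",
                 "investment banking and securities markets",
                 "portfolio theory and asset pricing"]
      categories = ["RC Internal medicine", "RC Internal medicine",
                    "G Geography", "G Geography", "HG Finance", "HG Finance"]

      lsi = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=3))
      X = lsi.fit_transform(records)

      def rank_categories(query: str, top_n: int = 2):
          sims = cosine_similarity(lsi.transform([query]), X)[0]
          seen, ranked = set(), []
          for i in np.argsort(sims)[::-1]:   # best-matching records vote for their category
              if categories[i] not in seen:
                  seen.add(categories[i])
                  ranked.append((categories[i], round(float(sims[i]), 3)))
              if len(ranked) == top_n:
                  break
          return ranked

      print(rank_categories("remote sensing of coastal vegetation"))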

Languages

  • e 105
  • d 6
  • chi 1

Types

  • a 101
  • el 13
  • m 1
  • r 1
  • s 1
  • x 1