Search (102 results, page 3 of 6)

Reiner, U.: Automatische DDC-Klassifizierung bibliografischer Titeldatensätze der Deutschen Nationalbibliografie (2009) 0.00

```
0.003180102 = product of:
  0.012720408 = sum of:
    0.012720408 = product of:
      0.025440816 = sum of:
        0.025440816 = weight(_text_:22 in 3284) [ClassicSimilarity], result of:
          0.025440816 = score(doc=3284,freq=2.0), product of:
            0.16438834 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04694356 = queryNorm
            0.15476047 = fieldWeight in 3284, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03125 = fieldNorm(doc=3284)
      0.5 = coord(1/2)
  0.25 = coord(1/4)
```

Date: 22. 1.2010 14:41:24
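
The score trees in this listing follow the arithmetic of Lucene's ClassicSimilarity: tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1)), and the final value is queryWeight × fieldWeight scaled by the coord factors. The sketch below is only an illustration of that arithmetic, using the numbers from the first record (doc 3284). The function names are hypothetical and are not part of Lucene, and small differences from the listed values are to be expected because Lucene computes in single precision.

```python
import math

def classic_idf(doc_freq, max_docs):
    # idf as used by Lucene's ClassicSimilarity: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def classic_term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    # Score of one term in one document: queryWeight * fieldWeight
    tf = math.sqrt(freq)
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight

def apply_coords(score, coords):
    # Multiply by each coord factor (fraction of query clauses that matched)
    for c in coords:
        score *= c
    return score

# Values copied from the explain tree of doc 3284 (term "22")
term_score = classic_term_score(
    freq=2.0, doc_freq=3622, max_docs=44218,
    query_norm=0.04694356, field_norm=0.03125,
)
final_score = apply_coords(term_score, [0.5, 0.25])
print(round(term_score, 6), round(final_score, 6))
```

The same functions reproduce the other records by swapping in their freq, docFreq, fieldNorm, and coord values.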

Malo, P.; Sinha, A.; Wallenius, J.; Korhonen, P.: Concept-based document classification using Wikipedia and value function (2011) 0.00
```
0.003058225 = product of:
  0.0122329 = sum of:
    0.0122329 = product of:
      0.0489316 = sum of:
        0.0489316 = weight(_text_:based in 4948) [ClassicSimilarity], result of:
          0.0489316 = score(doc=4948,freq=6.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.34595144 = fieldWeight in 4948, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=4948)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

In this article, we propose a new concept-based method for document classification. The conceptual knowledge associated with the words is drawn from Wikipedia. The purpose is to utilize the abundant semantic relatedness information available in Wikipedia in an efficient value function-based query learning algorithm. The procedure learns the value function by solving a simple linear programming problem formulated using the training documents. The learning involves a step-wise iterative process that helps in generating a value function with an appropriate set of concepts (dimensions) chosen from a collection of concepts. Once the value function is formulated, it is utilized to make a decision between relevance and irrelevance. The value assigned to a particular document from the value function can be further used to rank the documents according to their relevance. Reuters newswire documents have been used to evaluate the efficacy of the procedure. An extensive comparison with other frameworks has been performed. The results are promising.
Khoo, C.S.G.; Ou, S.: Machine versus human clustering of concepts across documents (2008) 0.00
```
0.0029427784 = product of:
  0.011771114 = sum of:
    0.011771114 = product of:
      0.047084454 = sum of:
        0.047084454 = weight(_text_:based in 2286) [ClassicSimilarity], result of:
          0.047084454 = score(doc=2286,freq=8.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.33289194 = fieldWeight in 2286, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2286)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Content

An automated method for clustering terms/concepts from a set of documents on the same topic was developed for the purpose of multidocument summarization. The clustering method makes use of a combination of lexical overlap between multiword terms, syntactic constraints and semantic consideration based on a manually constructed taxonomy to generate hierarchically organized clusters of terms. This study evaluates the machine-generated clusters by calculating the proportion of overlap with two sets of human-generated clusters for 15 topics. It was found that the overlap between machine-generated clusters and individual human-generated clusters is higher than that between two human-generated clusters. A qualitative analysis of the human clustering found that clusters formed are either semantic-conceptual based or lexical based (similar to machine clustering). The semantic-conceptual based clusters that were formed tended to be different for different human coders. This has raised questions about whether machine-generated clustering can be evaluated by comparing with human clustering.
HaCohen-Kerner, Y.; Beck, H.; Yehudai, E.; Rosenstein, M.; Mughaz, D.: Cuisine : classification using stylistic feature sets and/or name-based feature sets (2010) 0.00
```
0.0029427784 = product of:
  0.011771114 = sum of:
    0.011771114 = product of:
      0.047084454 = sum of:
        0.047084454 = weight(_text_:based in 3706) [ClassicSimilarity], result of:
          0.047084454 = score(doc=3706,freq=8.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.33289194 = fieldWeight in 3706, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3706)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classification tasks: ethnic groups of the authors and/or periods of time when the documents were written and/or places where the documents were written. The investigated corpus contains Jewish Law articles written in Hebrew-Aramaic, which present interesting problems for classification. Our system CUISINE (Classification UsIng Stylistic feature sets and/or NamE-based feature sets) achieves accuracy results between 90.71 and 98.99% for the seven classification experiments (ethnicity, time, place, ethnicity&time, ethnicity&place, time&place, ethnicity&time&place). For the first six tasks, the stylistic feature sets in general and the quantitative feature set in particular are enough for excellent classification results. In contrast, the name-based feature sets are rather poor for these tasks. However, for the most complex task (ethnicity&time&place), a hill-climbing model using all feature sets succeeds in significantly improving the classification results. Most of the stylistic features (34 of 42) are language-independent and domain-independent. These features might be useful to the community at large, at least for rather simple tasks.
Yilmaz, T.; Ozcan, R.; Altingovde, I.S.; Ulusoy, Ö.: Improving educational web search for question-like queries through subject classification (2019) 0.00
```
0.0029427784 = product of:
  0.011771114 = sum of:
    0.011771114 = product of:
      0.047084454 = sum of:
        0.047084454 = weight(_text_:based in 5041) [ClassicSimilarity], result of:
          0.047084454 = score(doc=5041,freq=8.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.33289194 = fieldWeight in 5041, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5041)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

Students use general web search engines as their primary source of research while trying to find answers to school-related questions. Although search engines are highly relevant for the general population, they may return results that are out of educational context. Another rising trend, social community question answering websites, is the second choice for students who try to get answers from other peers online. We attempt to discover possible improvements in educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions. This classifier is built by an ensemble method that employs several regular learning algorithms and retrieval-based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results and obtain 83.5% accuracy. Although our work is entirely based on the Turkish language, the features could easily be mapped to other languages as well. In order to find out whether search engine ranking can be improved in the education domain using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, varying query length and based on factoid and non-factoid queries. We show that some of the methods significantly improve the rankings in the education domain.
Adamson, G.W.; Boreham, J.: ¬The use of an association measure based on character structure to identify semantically related pairs of words and document titles (1974) 0.00
```
0.0029132022 = product of:
  0.011652809 = sum of:
    0.011652809 = product of:
      0.046611235 = sum of:
        0.046611235 = weight(_text_:based in 398) [ClassicSimilarity], result of:
          0.046611235 = score(doc=398,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.3295462 = fieldWeight in 398, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=398)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

An automatic classification technique has been developed, based on the character structure of words. Dice's similarity coefficient is computed from the number of matching digrams in pairs of character strings, and used to cluster sets of character strings. A sample of words from a chemical data base was chosen to contain certain stems derived from the names of chemical elements. They were successfully clustered into groups of semantically related words. Each cluster is characterised by the root word from which all its members are derived. A second example of titles from Mathematical Reviews was clustered into well-defined classes, which compare favourably with the subject groupings of Mathematical Reviews.
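
As a rough illustration of the association measure described in this abstract, the sketch below computes Dice's coefficient over adjacent character pairs (digrams) of two strings. It assumes that each string is reduced to its set of distinct digrams and that case is ignored. These are choices made for the example, and the function names are hypothetical rather than taken from the paper.

```python
def digrams(text):
    # Set of distinct adjacent character pairs, case-folded
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_coefficient(a, b):
    # Dice's coefficient: 2 * |shared digrams| / (|digrams(a)| + |digrams(b)|)
    da, db = digrams(a), digrams(b)
    if not da and not db:
        return 0.0  # strings too short to form any digram
    return 2 * len(da & db) / (len(da) + len(db))

# "night" and "nacht" share only the digram "ht": 2 * 1 / (4 + 4)
print(dice_coefficient("night", "nacht"))
# Words built on a common stem score much higher
print(dice_coefficient("phosphate", "phosphite"))
```

A clustering step, as in the abstract, would then group strings whose pairwise coefficient exceeds some threshold. The threshold and the clustering method are not specified here.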
Calado, P.; Cristo, M.; Gonçalves, M.A.; Moura, E.S. de; Ribeiro-Neto, B.; Ziviani, N.: Link-based similarity measures for the classification of Web documents (2006) 0.00
```
0.002548521 = product of:
  0.010194084 = sum of:
    0.010194084 = product of:
      0.040776335 = sum of:
        0.040776335 = weight(_text_:based in 4921) [ClassicSimilarity], result of:
          0.040776335 = score(doc=4921,freq=6.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28829288 = fieldWeight in 4921, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4921)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.00
```
0.002548521 = product of:
  0.010194084 = sum of:
    0.010194084 = product of:
      0.040776335 = sum of:
        0.040776335 = weight(_text_:based in 831) [ClassicSimilarity], result of:
          0.040776335 = score(doc=831,freq=6.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28829288 = fieldWeight in 831, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=831)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

Purpose - The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task. Design/methodology/approach - Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation-based approach was compared with the non-segmentation-based approach. Findings - There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications - Applying the findings to real web text classification is ongoing work. Originality/value - The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification, web search.
Pong, J.Y.-H.; Kwok, R.C.-W.; Lau, R.Y.-K.; Hao, J.-X.; Wong, P.C.-C.: ¬A comparative study of two automatic document classification methods in a library setting (2008) 0.00
```
0.002548521 = product of:
  0.010194084 = sum of:
    0.010194084 = product of:
      0.040776335 = sum of:
        0.040776335 = weight(_text_:based in 2532) [ClassicSimilarity], result of:
          0.040776335 = score(doc=2532,freq=6.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28829288 = fieldWeight in 2532, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2532)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

In current library practice, trained human experts usually carry out document cataloguing and indexing based on a manual approach. With the explosive growth in the number of electronic documents available on the Internet and digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using just a manual approach. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of using automatic document classification methods to categorize library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine learning based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations regarding how to practically apply the KNN algorithm to develop automatic document classification in a library setting are made. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used LCC classification scheme adopted by many large libraries.
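
To make the k-nearest neighbours idea in this abstract concrete, here is a minimal sketch of KNN text classification with cosine similarity over raw term counts. It is not the system described in the paper: the training titles, the use of the LCC classes QD and Z as labels, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words vector: lowercase token -> count
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, training, k=3):
    # Rank training documents by similarity, then take a majority vote of the top k
    query = vectorize(text)
    ranked = sorted(training,
                    key=lambda item: cosine(query, vectorize(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training titles labelled with LCC classes (QD = chemistry, Z = library science)
TRAINING = [
    ("organic chemistry reactions laboratory", "QD"),
    ("inorganic chemistry compounds", "QD"),
    ("chemistry of polymers", "QD"),
    ("library cataloguing rules", "Z"),
    ("library classification and indexing", "Z"),
    ("bibliography and library science", "Z"),
]

print(knn_classify("introduction to polymer chemistry", TRAINING))
print(knn_classify("indexing rules for a library", TRAINING))
```

A practical system would add term weighting, stemming, and a tuned value of k. Those choices are left open here.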
Borodin, Y.; Polishchuk, V.; Mahmud, J.; Ramakrishnan, I.V.; Stent, A.: Live and learn from mistakes : a lightweight system for document classification (2013) 0.00
```
0.002548521 = product of:
  0.010194084 = sum of:
    0.010194084 = product of:
      0.040776335 = sum of:
        0.040776335 = weight(_text_:based in 2722) [ClassicSimilarity], result of:
          0.040776335 = score(doc=2722,freq=6.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28829288 = fieldWeight in 2722, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2722)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
Ahmed, M.; Mukhopadhyay, M.; Mukhopadhyay, P.: Automated knowledge organization : AI ML based subject indexing system for libraries (2023) 0.00
```
0.002548521 = product of:
  0.010194084 = sum of:
    0.010194084 = product of:
      0.040776335 = sum of:
        0.040776335 = weight(_text_:based in 977) [ClassicSimilarity], result of:
          0.040776335 = score(doc=977,freq=6.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28829288 = fieldWeight in 977, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=977)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

The research study as reported here is an attempt to explore the possibilities of an AI/ML-based semi-automated indexing system in a library setup to handle large volumes of documents. It uses the Python virtual environment to install and configure an open source AI environment (named Annif) to feed the LOD (Linked Open Data) dataset of Library of Congress Subject Headings (LCSH) as a standard KOS (Knowledge Organisation System). The framework deployed the Turtle format of LCSH after cleaning the file with Skosify, applied an array of backend algorithms (namely TF-IDF, Omikuji, and NN-Ensemble) to measure relative performance, and selected Snowball as an analyser. The training of Annif was conducted with a large set of bibliographic records populated with subject descriptors (MARC tag 650$a) and indexed by trained LIS professionals. The training dataset is first treated with MarcEdit to export it in a format suitable for OpenRefine, and then in OpenRefine it undergoes many steps to produce a bibliographic record set suitable to train Annif. The framework, after training, has been tested with a bibliographic dataset to measure indexing efficiencies, and finally, the automated indexing framework is integrated with data wrangling software (OpenRefine) to produce suggested headings on a mass scale. The entire framework is based on open-source software, open datasets, and open standards.
Larson, R.R.: Experiments in automatic Library of Congress Classification (1992) 0.00
```
0.0024970302 = product of:
  0.009988121 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 1054) [ClassicSimilarity], result of:
          0.039952483 = score(doc=1054,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 1054, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=1054)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

This article presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records. The method used in this study was based on partial match retrieval techniques using various elements of new records (i.e., those to be classified) as "queries", and a test database of classification clusters generated from previously classified MARC records. Sixty individual methods for automatic classification were tested on a set of 283 new records, using all combinations of four different partial match methods, five query types, and three representations of search terms. The results indicate that if the best method for a particular case can be determined, then up to 86% of the new records may be correctly classified. The single method with the best accuracy was able to select the correct classification for about 46% of the new records.
Mukhopadhyay, S.; Peng, S.; Raje, R.; Palakal, M.; Mostafa, J.: Multi-agent information classification using dynamic acquaintance lists (2003) 0.00
```
0.0024970302 = product of:
  0.009988121 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 1755) [ClassicSimilarity], result of:
          0.039952483 = score(doc=1755,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 1755, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=1755)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

There has been considerable interest in recent years in providing automated information services, such as information classification, by means of a society of collaborative agents. These agents augment each other's knowledge structures (e.g., the vocabularies) and assist each other in providing efficient information services to a human user. However, when the number of agents present in the society increases, exhaustive communication and collaboration among agents result in a large communication overhead and increased delays in response time. This paper introduces a method to achieve selective interaction with a relatively small number of potentially useful agents, based on simple agent modeling and acquaintance lists. The key idea presented here is that the acquaintance list of an agent, representing a small number of other agents to be collaborated with, is dynamically adjusted. The best acquaintances are automatically discovered using a learning algorithm, based on the past history of collaboration. Experimental results are presented to demonstrate that such dynamically learned acquaintance lists can lead to high quality of classification, while significantly reducing the delay in response time.
Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.00
```
0.0024970302 = product of:
  0.009988121 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 1808) [ClassicSimilarity], result of:
          0.039952483 = score(doc=1808,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 1808, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=1808)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

Hierarchical text classification or simply hierarchical classification refers to assigning a document to one or more suitable categories from a hierarchical category space. In our literature survey, we have found that the existing hierarchical classification experiments used a variety of measures to evaluate performance. These performance measures often assume independence between categories and do not consider documents misclassified into categories that are similar or not far from the correct categories in the category tree. In this paper, we therefore propose new performance measures for hierarchical classification. The proposed performance measures consist of category similarity measures and distance-based measures that consider the contributions of misclassified documents. Our experiments on hierarchical classification methods based on SVM classifiers and binary Naive Bayes classifiers showed that SVM classifiers perform better than Naive Bayes classifiers on the Reuters-21578 collection according to the extended measures. A new classifier-centric measure called blocking measure is also defined to examine the performance of subtree classifiers in a top-down level-based hierarchical classification method.
Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: ¬A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 0.00
```
0.0024970302 = product of:
  0.009988121 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 1998) [ClassicSimilarity], result of:
          0.039952483 = score(doc=1998,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 1998, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=1998)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

Since millions seek health information online, it is vital for this information to be comprehensible. Most studies use readability formulas, which ignore vocabulary, and conclude that online health information is too difficult. We developed a vocabulary-based, naïve Bayes classifier to distinguish between three difficulty levels in text. It proved 98% accurate in a 250-document evaluation. We compared our classifier with readability formulas for 90 new documents with different origins and asked representative human evaluators, an expert and a consumer, to judge each document. Average readability grade levels for educational and commercial pages were 10th grade or higher, too difficult according to current literature. In contrast, the classifier showed that 70-90% of these pages were written at an intermediate, appropriate level, indicating that vocabulary usage is frequently appropriate in text considered too difficult by readability formula evaluations. The expert considered the pages more difficult for a consumer than the consumer did.
Montesi, M.; Navarrete, T.: Classifying web genres in context : A case study documenting the web genres used by a software engineer (2008) 0.00
```
0.0024970302 = product of:
  0.009988121 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 2100) [ClassicSimilarity], result of:
          0.039952483 = score(doc=2100,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 2100, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=2100)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

This case study analyzes the Internet-based resources that a software engineer uses in his daily work. Methodologically, we studied the web browser history of the participant, classifying all the web pages he had seen over a period of 12 days into web genres. We interviewed him before and after the analysis of the web browser history. In the first interview, he spoke about his general information behavior; in the second, he commented on each web genre, explaining why and how he used them. As a result, three approaches allow us to describe the set of 23 web genres obtained: (a) the purposes they serve for the participant; (b) the role they play in the various work and search phases; and (c) the way they are used in combination with each other. Further observations concern the way the participant assesses quality of web-based resources, and his information behavior as a software engineer.
Golub, K.: Automated subject classification of textual documents in the context of Web-based hierarchical browsing (2011) 0.00
```
0.0024970302 = product of:
  0.009988121 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 4558) [ClassicSimilarity], result of:
          0.039952483 = score(doc=4558,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 4558, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=4558)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

While automated methods for information organization have been around for several decades now, exponential growth of the World Wide Web has put them into the forefront of research in different communities, within which several approaches can be identified: 1) machine learning (algorithms that allow computers to improve their performance based on learning from pre-existing data); 2) document clustering (algorithms for unsupervised document organization and automated topic extraction); and 3) string matching (algorithms that match given strings within larger text). Here the aim was to automatically organize textual documents into hierarchical structures for subject browsing. The string-matching approach was tested using a controlled vocabulary (containing pre-selected and pre-defined authorized terms, each corresponding to only one concept). The results imply that an appropriate controlled vocabulary, with a sufficient number of entry terms designating classes, could in itself be a solution for automated classification. Then, if the same controlled vocabulary had an appropriate hierarchical structure, it would at the same time provide a good browsing structure for the collection of automatically classified documents.
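The string-matching approach described in this abstract can be illustrated with a small sketch. Everything below is an assumption for illustration: the vocabulary entries, the class labels, and the `classify` function are hypothetical, and the sketch ignores the hierarchical structure the abstract discusses.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: each entry term designates exactly one class.
VOCABULARY = {
    "naive bayes": "Machine learning",
    "support vector machine": "Machine learning",
    "document clustering": "Information organization",
    "controlled vocabulary": "Information organization",
}


def classify(text, vocabulary=VOCABULARY):
    """Return (class, hit count) pairs, most frequently matched class first."""
    # Normalize to lowercase alphanumeric tokens separated by single spaces.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    counts = Counter()
    for term, label in vocabulary.items():
        # Whole-word match of the entry term inside the normalized text.
        hits = re.findall(r"\b" + re.escape(term) + r"\b", normalized)
        if hits:
            counts[label] += len(hits)
    return counts.most_common()


result = classify(
    "A naive Bayes classifier and document clustering; document clustering again."
)
print(result)
```

A document that matches no entry term gets an empty result, which is one reason the abstract stresses a sufficient number of entry terms per class.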
Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.00
```
0.0024970302 = product of:
  0.009988121 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 3015) [ClassicSimilarity], result of:
          0.039952483 = score(doc=3015,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 3015, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=3015)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question of whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
Cheng, P.T.K.; Wu, A.K.W.: ACS: an automatic classification system (1995) 0.00
```
0.0023542228 = product of:
  0.009416891 = sum of:
    0.009416891 = product of:
      0.037667565 = sum of:
        0.037667565 = weight(_text_:based in 2188) [ClassicSimilarity], result of:
          0.037667565 = score(doc=2188,freq=2.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.26631355 = fieldWeight in 2188, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0625 = fieldNorm(doc=2188)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

In this paper, we introduce ACS, an automatic classification system for school libraries. First, various approaches towards automatic classification, namely (i) rule-based, (ii) browse and search, and (iii) partial match, are critically reviewed. The central issues of scheme selection, text analysis and similarity measures are discussed. A novel approach towards detecting book-class similarity with Modified Overlap Coefficient (MOC) is also proposed. Finally, the design and implementation of ACS are presented. The test result of over 80% correctness in automatic classification and a cost reduction of 75% compared to manual classification suggest that ACS is highly adoptable.
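The abstract does not say how the Modified Overlap Coefficient changes the standard measure, so the sketch below shows only the ordinary, unmodified overlap coefficient, |A ∩ B| / min(|A|, |B|), as a point of reference. The function name and the example term sets are hypothetical.

```python
def overlap_coefficient(a, b):
    """Standard overlap coefficient between two collections of terms."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # convention chosen here for empty input
    return len(a & b) / min(len(a), len(b))


# Hypothetical term sets for a book and a class description.
book_terms = {"library", "classification", "school"}
class_terms = {"classification", "school", "system", "design"}

similarity = overlap_coefficient(book_terms, class_terms)
print(similarity)
```

Dividing by the smaller set means a short book title fully contained in a long class description scores 1.0, which is the usual motivation for this measure over Jaccard similarity.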
Losee, R.M.; Haas, S.W.: Sublanguage terms : dictionaries, usage, and automatic classification (1995) 0.00
```
0.0023542228 = product of:
  0.009416891 = sum of:
    0.009416891 = product of:
      0.037667565 = sum of:
        0.037667565 = weight(_text_:based in 2650) [ClassicSimilarity], result of:
          0.037667565 = score(doc=2650,freq=2.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.26631355 = fieldWeight in 2650, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0625 = fieldNorm(doc=2650)
      0.25 = coord(1/4)
  0.25 = coord(1/4)
```
Abstract

The use of terms from natural and social science titles and abstracts is studied from the perspective of sublanguages and their specialized dictionaries. Explores different notions of sublanguage distinctiveness. Object methods for separating hard and soft sciences are suggested based on measures of sublanguage use, dictionary characteristics, and sublanguage distinctiveness. Abstracts were automatically classified with a high degree of accuracy by using a formula that considers the degree of uniqueness of terms in each sublanguage. This may prove useful for text filtering of information retrieval systems.
