Search (30 results, page 1 of 2)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.12

0.11659854 = product of:
  0.29149634 = sum of:
    0.24901254 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
      0.24901254 = score(doc=562,freq=2.0), product of:
        0.4430686 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.052260913 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.042483795 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
      0.042483795 = score(doc=562,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.23214069 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
  0.4 = coord(2/5)

Content: Cf. http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32
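
The score tree for this first hit can be checked by hand. The sketch below is a minimal reconstruction, assuming the classic TF-IDF formulas that the `[ClassicSimilarity]` label suggests: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final coord factor of matched clauses over total clauses. The function names are illustrative and not part of any Lucene API; `queryNorm` and `fieldNorm` are copied from the listing rather than derived.

```python
import math

# Constants copied from the explain output above.
MAX_DOCS = 44218
QUERY_NORM = 0.052260913


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency, assumed to be 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_weight(freq: float, doc_freq: int, field_norm: float,
                query_norm: float = QUERY_NORM) -> float:
    """Weight of one matching term: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)                      # tf(freq=2.0) = 1.4142135
    term_idf = idf(doc_freq)
    query_weight = term_idf * query_norm      # idf * queryNorm
    field_weight = tf * term_idf * field_norm # tf * idf * fieldNorm
    return query_weight * field_weight


def document_score(term_weights, matched: int, total: int) -> float:
    """Sum of the term weights scaled by coord(matched/total)."""
    return sum(term_weights) * (matched / total)


# Document 562 matches two of the five query clauses: "3a" and "22".
w_3a = term_weight(freq=2.0, doc_freq=24, field_norm=0.046875)
w_22 = term_weight(freq=2.0, doc_freq=3622, field_norm=0.046875)
score_562 = document_score([w_3a, w_22], matched=2, total=5)

# Should come out close to the 0.11659854 shown in the listing; small
# differences are expected because Lucene computes in 32-bit floats.
print(f"{score_562:.8f}")
```

The same functions explain why later hits that match only the common term "22" rank lower: their coord factor drops to 1/5, and the rare term "3a" (docFreq=24) no longer contributes its large idf.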

Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.03
```
0.033387464 = product of:
  0.08346866 = sum of:
    0.04098487 = weight(_text_:it in 2760) [ClassicSimilarity], result of:
      0.04098487 = score(doc=2760,freq=4.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.27114958 = fieldWeight in 2760, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=2760)
    0.042483795 = weight(_text_:22 in 2760) [ClassicSimilarity], result of:
      0.042483795 = score(doc=2760,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.23214069 = fieldWeight in 2760, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.046875 = fieldNorm(doc=2760)
  0.4 = coord(2/5)
```
Abstract

Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.

Date

22. 3.2009 19:11:54
Khoo, C.S.G.; Ng, K.; Ou, S.: An exploratory study of human clustering of Web pages (2003) 0.02
```
0.019057194 = product of:
  0.047642983 = sum of:
    0.019320453 = weight(_text_:it in 2741) [ClassicSimilarity], result of:
      0.019320453 = score(doc=2741,freq=2.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.12782113 = fieldWeight in 2741, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.03125 = fieldNorm(doc=2741)
    0.02832253 = weight(_text_:22 in 2741) [ClassicSimilarity], result of:
      0.02832253 = score(doc=2741,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.15476047 = fieldWeight in 2741, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.03125 = fieldNorm(doc=2741)
  0.4 = coord(2/5)
```
Abstract

This study seeks to find out how human beings cluster Web pages naturally. Twenty Web pages retrieved by the Northern Light search engine for each of 10 queries were sorted by 3 subjects into categories that were natural or meaningful to them. It was found that different subjects clustered the same set of Web pages quite differently and created different categories. The average inter-subject similarity of the clusters created was a low 0.27. Subjects created an average of 5.4 clusters for each sorting. The categories constructed can be divided into 10 types. About 1/3 of the categories created were topical. Another 20% of the categories relate to the degree of relevance or usefulness. The rest of the categories were subject-independent categories such as format, purpose, authoritativeness and direction to other sources. The authors plan to develop automatic methods for categorizing Web pages using the common categories created by the subjects. It is hoped that the techniques developed can be used by Web search engines to automatically organize Web pages retrieved into categories that are natural to users.

1. Introduction

The World Wide Web is an increasingly important source of information for people globally because of its ease of access, the ease of publishing, its ability to transcend geographic and national boundaries, its flexibility and heterogeneity and its dynamic nature. However, Web users also find it increasingly difficult to locate relevant and useful information in this vast information storehouse. Web search engines, despite their scope and power, appear to be quite ineffective. They retrieve too many pages, and though they attempt to rank retrieved pages in order of probable relevance, often the relevant documents do not appear in the top-ranked 10 or 20 documents displayed. Several studies have found that users do not know how to use the advanced features of Web search engines, and do not know how to formulate and re-formulate queries. Users also typically exert minimal effort in performing, evaluating and refining their searches, and are unwilling to scan more than 10 or 20 items retrieved (Jansen, Spink, Bateman & Saracevic, 1998). This suggests that the conventional ranked-list display of search results does not satisfy user requirements, and that better ways of presenting and summarizing search results have to be developed. One promising approach is to group retrieved pages into clusters or categories to allow users to navigate immediately to the "promising" clusters where the most useful Web pages are likely to be located. This approach has been adopted by a number of search engines (notably Northern Light) and search agents.

Date

12. 9.2004 9:56:22

Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.02

0.016993519 = product of:
  0.08496759 = sum of:
    0.08496759 = weight(_text_:22 in 1046) [ClassicSimilarity], result of:
      0.08496759 = score(doc=1046,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.46428138 = fieldWeight in 1046, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.09375 = fieldNorm(doc=1046)
  0.2 = coord(1/5)

Date: 5. 5.2003 14:17:22

Drori, O.; Alon, N.: Using document classification for displaying search results (2003) 0.01
```
0.011592272 = product of:
  0.057961356 = sum of:
    0.057961356 = weight(_text_:it in 1565) [ClassicSimilarity], result of:
      0.057961356 = score(doc=1565,freq=8.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.38346338 = fieldWeight in 1565, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=1565)
  0.2 = coord(1/5)
```
Abstract

In this paper, four self-developed user interfaces that display document search results using different methods were compared. In order to create the four interfaces, two information elements were used: document categories and lines from the document. A user study compared the four interfaces. It was found that the category addition to the interface was beneficial in both measurable and subjective measures. It was also found that displaying the relevant lines from the document increased the effectiveness and shortened the search time in all cases and tasks. It was found that the participants preferred the interface containing categories and relevant lines to all other interfaces checked. It was also the fastest in the objective time measurement. Another sub-research that was conducted showed that the most important parameter for the users was the confidence level that the answer was accurate, and the least important parameter was the feeling of comfort while conducting a search.

Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.01

0.009912886 = product of:
  0.04956443 = sum of:
    0.04956443 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
      0.04956443 = score(doc=5273,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.2708308 = fieldWeight in 5273, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5273)
  0.2 = coord(1/5)

Date: 22. 7.2006 16:24:52

Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.01

0.009912886 = product of:
  0.04956443 = sum of:
    0.04956443 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
      0.04956443 = score(doc=2560,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.2708308 = fieldWeight in 2560, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2560)
  0.2 = coord(1/5)

Date: 22. 9.2008 18:31:54

Pfeffer, M.: Automatische Vergabe von RVK-Notationen mittels fallbasiertem Schließen (2009) 0.01

0.008496759 = product of:
  0.042483795 = sum of:
    0.042483795 = weight(_text_:22 in 3051) [ClassicSimilarity], result of:
      0.042483795 = score(doc=3051,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.23214069 = fieldWeight in 3051, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.046875 = fieldNorm(doc=3051)
  0.2 = coord(1/5)

Date: 22. 8.2009 19:51:28

Golub, K.; Hamon, T.; Ardö, A.: Automated classification of textual documents based on a controlled vocabulary in engineering (2007) 0.01
```
0.008196974 = product of:
  0.04098487 = sum of:
    0.04098487 = weight(_text_:it in 1461) [ClassicSimilarity], result of:
      0.04098487 = score(doc=1461,freq=4.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.27114958 = fieldWeight in 1461, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=1461)
  0.2 = coord(1/5)
```
Abstract

Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents - instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 0.01
```
0.008196974 = product of:
  0.04098487 = sum of:
    0.04098487 = weight(_text_:it in 1998) [ClassicSimilarity], result of:
      0.04098487 = score(doc=1998,freq=4.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.27114958 = fieldWeight in 1998, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=1998)
  0.2 = coord(1/5)
```
Abstract

Since millions seek health information online, it is vital for this information to be comprehensible. Most studies use readability formulas, which ignore vocabulary, and conclude that online health information is too difficult. We developed a vocabulary-based, naïve Bayes classifier to distinguish between three difficulty levels in text. It proved 98% accurate in a 250-document evaluation. We compared our classifier with readability formulas for 90 new documents with different origins and asked representative human evaluators, an expert and a consumer, to judge each document. Average readability grade levels for educational and commercial pages were 10th grade or higher, too difficult according to current literature. In contrast, the classifier showed that 70-90% of these pages were written at an intermediate, appropriate level, indicating that vocabulary usage is frequently appropriate in text considered too difficult by readability formula evaluations. The expert considered the pages more difficult for a consumer than the consumer did.
Xu, Y.; Bernard, A.: Knowledge organization through statistical computation : a new approach (2009) 0.01
```
0.008196974 = product of:
  0.04098487 = sum of:
    0.04098487 = weight(_text_:it in 3252) [ClassicSimilarity], result of:
      0.04098487 = score(doc=3252,freq=4.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.27114958 = fieldWeight in 3252, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=3252)
  0.2 = coord(1/5)
```
Abstract

Knowledge organization (KO) is an interdisciplinary issue which includes some problems in knowledge classification such as how to classify newly emerged knowledge. With the great complexity and ambiguity of knowledge, it is becoming sometimes inefficient to classify knowledge by logical reasoning. This paper attempts to propose a statistical approach to knowledge organization in order to resolve the problems in classifying complex and mass knowledge. By integrating the classification process into a mathematical model, a knowledge classifier, based on the maximum entropy theory, is constructed and the experimental results show that the classification results acquired from the classifier are reliable. The approach proposed in this paper is quite formal and is not dependent on specific contexts, so it could easily be adapted to the use of knowledge classification in other domains within KO.

Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.01

0.0070806327 = product of:
  0.035403162 = sum of:
    0.035403162 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
      0.035403162 = score(doc=2765,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.19345059 = fieldWeight in 2765, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2765)
  0.2 = coord(1/5)

Date: 22. 3.2009 19:14:43

Rooney, N.; Patterson, D.; Galushka, M.; Dobrynin, V.; Smirnova, E.: An investigation into the stability of contextual document clustering (2008) 0.01
```
0.006830811 = product of:
  0.034154054 = sum of:
    0.034154054 = weight(_text_:it in 1356) [ClassicSimilarity], result of:
      0.034154054 = score(doc=1356,freq=4.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.22595796 = fieldWeight in 1356, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1356)
  0.2 = coord(1/5)
```
Abstract

In this article, we assess the effectiveness of Contextual Document Clustering (CDC) as a means of indexing within a dynamic and rapidly changing environment. We simulate a dynamic environment, by splitting two chronologically ordered datasets into time-ordered segments and assessing how the technique performs under two different scenarios. The first is when new documents are added incrementally without reclustering [incremental CDC (iCDC)], and the second is when reclustering is performed [nonincremental CDC (nCDC)]. The datasets are very large, are independent of each other, and belong to two very different domains. We show that CDC itself is effective at clustering very large document corpora, and that, significantly, it lends itself to a very simple, efficient incremental document addition process that is seen to be very stable over time despite the size of the corpus growing considerably. It was seen to be effective at incrementally clustering new documents even when the corpus grew to six times its original size. This is in contrast to what other researchers have found when applying similar simple incremental approaches to document clustering. The stability of iCDC is accounted for by the unique manner in which CDC discovers cluster themes.
Chung, Y.-M.; Noh, Y.-H.: Developing a specialized directory system by automatically classifying Web documents (2003) 0.01
```
0.005796136 = product of:
  0.028980678 = sum of:
    0.028980678 = weight(_text_:it in 1566) [ClassicSimilarity], result of:
      0.028980678 = score(doc=1566,freq=2.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.19173169 = fieldWeight in 1566, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=1566)
  0.2 = coord(1/5)
```
Abstract

This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77 and 60%, respectively. The third experiment employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting achieved a precision ratio of 96%. This implies that it is possible to enhance the classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier.
Wu, K.J.; Chen, M.-C.; Sun, Y.: Automatic topics discovery from hyperlinked documents (2004) 0.01
```
0.005796136 = product of:
  0.028980678 = sum of:
    0.028980678 = weight(_text_:it in 2563) [ClassicSimilarity], result of:
      0.028980678 = score(doc=2563,freq=2.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.19173169 = fieldWeight in 2563, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=2563)
  0.2 = coord(1/5)
```
Abstract

Topic discovery is an important means for marketing, e-Business and social science studies. As well, it can be applied to various purposes, such as identifying a group with certain properties and observing the emergence and diminishment of a certain cyber community. Previous topic discovery work (J.M. Kleinberg, Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, p. 668) requires manual judgment of usefulness of outcomes and is thus incapable of handling the explosive growth of the Internet. In this paper, we propose the Automatic Topic Discovery (ATD) method, which combines a method of base set construction, a clustering algorithm and an iterative principal eigenvector computation method to discover the topics relevant to a given query without using manual examination. Given a query, ATD returns with topics associated with the query and top representative pages for each topic. Our experiments show that the ATD method performs better than the traditional eigenvector method in terms of computation time and topic discovery quality.
Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.01
```
0.005796136 = product of:
  0.028980678 = sum of:
    0.028980678 = weight(_text_:it in 6010) [ClassicSimilarity], result of:
      0.028980678 = score(doc=6010,freq=2.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.19173169 = fieldWeight in 6010, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=6010)
  0.2 = coord(1/5)
```
Abstract

Current document-retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis (i.e., the ability to distinguish documents according to style) would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer (genre classifiers should be reusable across multiple topics), which does not arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple-topic domains. We also show how different feature-sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled.
Liu, R.-L.: Dynamic category profiling for text filtering and classification (2007) 0.01
```
0.005796136 = product of:
  0.028980678 = sum of:
    0.028980678 = weight(_text_:it in 900) [ClassicSimilarity], result of:
      0.028980678 = score(doc=900,freq=2.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.19173169 = fieldWeight in 900, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=900)
  0.2 = coord(1/5)
```
Abstract

Information is often represented in text form and classified into categories. Unfortunately, automatic classifiers often misclassify documents. One of the reasons is that the documents for training the classifiers are mainly from the categories, leading the classifiers to derive category profiles for distinguishing each category from others, rather than measuring the extent to which a document's content overlaps that of a category. To tackle the problem, we present a technique, DP4FC, that selects suitable features to construct category profiles to distinguish relevant documents from irrelevant documents. More specifically, DP4FC is associated with various classifiers. Upon receiving a document, it helps the classifiers to create dynamic category profiles with respect to the document, and accordingly make proper decisions in filtering and classification. Theoretical analysis and empirical results show that DP4FC may significantly promote different classifiers' performances under various environments.
Hagedorn, K.; Chapman, S.; Newman, D.: Enhancing search and browse using automated clustering of subject metadata (2007) 0.01
```
0.005796136 = product of:
  0.028980678 = sum of:
    0.028980678 = weight(_text_:it in 1168) [ClassicSimilarity], result of:
      0.028980678 = score(doc=1168,freq=2.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.19173169 = fieldWeight in 1168, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=1168)
  0.2 = coord(1/5)
```
Abstract

The Web puzzle of online information resources often hinders end-users from effective and efficient access to these resources. Clustering resources into appropriate subject-based groupings may help alleviate these difficulties, but will it work with heterogeneous material? The University of Michigan and the University of California Irvine joined forces to test automatically enhancing metadata records using the Topic Modeling algorithm on the varied OAIster corpus. We created labels for the resulting clusters of metadata records, matched the clusters to an in-house classification system, and developed a prototype that would showcase methods for search and retrieval using the enhanced records. Results indicated that while the algorithm was somewhat time-intensive to run and using a local classification scheme had its drawbacks, precise clustering of records was achieved and the prototype interface proved that faceted classification could be powerful in helping end-users find resources.
Ko, Y.; Seo, J.: Text classification from unlabeled documents with bootstrapping and feature projection techniques (2009) 0.01
```
0.005796136 = product of:
  0.028980678 = sum of:
    0.028980678 = weight(_text_:it in 2452) [ClassicSimilarity], result of:
      0.028980678 = score(doc=2452,freq=2.0), product of:
        0.15115225 = queryWeight, product of:
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.052260913 = queryNorm
        0.19173169 = fieldWeight in 2452, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.892262 = idf(docFreq=6664, maxDocs=44218)
          0.046875 = fieldNorm(doc=2452)
  0.2 = coord(1/5)
```
Abstract

Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, the supervised learning approaches have some problems. The most notable problem is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficult to generate because the labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then it automatically learns a text classifier by using bootstrapping and feature projection techniques. The results of experiments showed that the proposed method achieved reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.

Reiner, U.: Automatische DDC-Klassifizierung bibliografischer Titeldatensätze der Deutschen Nationalbibliografie (2009) 0.01

0.0056645065 = product of:
  0.02832253 = sum of:
    0.02832253 = weight(_text_:22 in 3284) [ClassicSimilarity], result of:
      0.02832253 = score(doc=3284,freq=2.0), product of:
        0.18300882 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.052260913 = queryNorm
        0.15476047 = fieldWeight in 3284, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.03125 = fieldNorm(doc=3284)
  0.2 = coord(1/5)

Date: 22. 1.2010 14:41:24
