Search (153 results, page 1 of 8)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.06

0.06056131 = product of:
  0.090841964 = sum of:
    0.072331384 = product of:
      0.21699414 = sum of:
        0.21699414 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.21699414 = score(doc=562,freq=2.0), product of:
            0.38609818 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.045541126 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.018510582 = product of:
      0.037021164 = sum of:
        0.037021164 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.037021164 = score(doc=562,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
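
The explain tree above can be reproduced by hand. The following is a minimal sketch, assuming the classic TF-IDF formulas that match the numbers shown here: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = idf * queryNorm, and fieldWeight = tf * idf * fieldNorm. The function names are illustrative and not part of any library; queryNorm and fieldNorm are copied from the tree rather than derived. Newer Lucene versions use a slightly different idf numerator, so this sketch only claims to match the output listed on this page.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    # Assumed idf variant: it reproduces idf(docFreq=24, maxDocs=44218) = 8.478011.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float) -> float:
    tf = math.sqrt(freq)                        # tf(freq=2.0) = 1.4142135
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm        # "queryWeight, product of:"
    field_weight = tf * term_idf * field_norm   # "fieldWeight in 562, product of:"
    return query_weight * field_weight          # "score(doc=562,freq=2.0), product of:"


# Values copied from the explain tree for doc 562.
MAX_DOCS = 44218
QUERY_NORM = 0.045541126
FIELD_NORM = 0.046875

score_3a = term_score(2.0, 24, MAX_DOCS, QUERY_NORM, FIELD_NORM)
score_22 = term_score(2.0, 3622, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# Each clause is scaled by its own coord factor; the outer coord(2/3)
# scales the sum of the matching clauses.
total = (score_3a * (1 / 3) + score_22 * (1 / 2)) * (2 / 3)
print(f"3a: {score_3a:.8f}  22: {score_22:.8f}  total: {total:.8f}")
```

The printed total should land close to the 0.06056131 reported for this hit; small differences in the last digits are expected because the search engine computes in single precision.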

Content: Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32

Jenkins, C.: Automatic classification of Web resources using Java and Dewey Decimal Classification (1998) 0.03

0.029919475 = product of:
  0.044879213 = sum of:
    0.023283537 = weight(_text_:to in 1673) [ClassicSimilarity], result of:
      0.023283537 = score(doc=1673,freq=8.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.28121543 = fieldWeight in 1673, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1673)
    0.021595677 = product of:
      0.043191355 = sum of:
        0.043191355 = weight(_text_:22 in 1673) [ClassicSimilarity], result of:
          0.043191355 = score(doc=1673,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.2708308 = fieldWeight in 1673, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1673)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
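
Because every inner node of an explain tree is labelled "product of:", "sum of:" or "result of:", its arithmetic can be checked mechanically. The sketch below is one hypothetical way to do that: it parses the indentation (two spaces per level, as in this listing) into nested nodes and reports any node whose value disagrees with its children. `parse_explain` and `find_inconsistencies` are names invented for this example, and the sample text is the `_text_:22` branch of the tree above.

```python
import math

EXPLAIN = """\
0.021595677 = product of:
  0.043191355 = sum of:
    0.043191355 = weight(_text_:22 in 1673) [ClassicSimilarity], result of:
      0.043191355 = score(doc=1673,freq=2.0), product of:
        0.15947726 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.045541126 = queryNorm
        0.2708308 = fieldWeight in 1673, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1673)
  0.5 = coord(1/2)
"""


def parse_explain(text):
    """Turn indented 'value = description' lines into a list of nested nodes."""
    root = {"value": None, "desc": "root", "children": []}
    stack = [(-1, root)]  # (indent, node) pairs for the current path
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        value_str, desc = line.strip().split(" = ", 1)
        node = {"value": float(value_str), "desc": desc, "children": []}
        while stack[-1][0] >= indent:  # climb back up to this line's parent
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((indent, node))
    return root["children"]


def find_inconsistencies(node, rel_tol=1e-5):
    """Return descriptions of nodes whose value does not follow from their children."""
    kids = [child["value"] for child in node["children"]]
    expected = None
    if kids and node["desc"].endswith("product of:"):
        expected = math.prod(kids)
    elif kids and node["desc"].endswith("sum of:"):
        expected = sum(kids)
    elif kids and node["desc"].endswith("result of:"):
        expected = kids[0]
    problems = []
    if expected is not None and not math.isclose(expected, node["value"], rel_tol=rel_tol):
        problems.append(f"{node['desc']} expected {expected}, found {node['value']}")
    for child in node["children"]:
        problems.extend(find_inconsistencies(child, rel_tol))
    return problems


nodes = parse_explain(EXPLAIN)
problems = [p for n in nodes for p in find_inconsistencies(n)]
print(problems or "tree is internally consistent")
```

The relative tolerance is deliberately loose, since the listed values are rounded single-precision numbers. Nodes such as "tf(freq=2.0), with freq of:" are skipped because their value is a function of the child rather than a product or sum.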

Abstract: The Wolverhampton Web Library (WWLib) is a WWW search engine that provides access to UK based information. The experimental version, developed in 1995, was a success but highlighted the need for a much higher degree of automation. An interesting feature of the experimental WWLib was that it organised information according to DDC. Discusses the advantages of classification and describes the automatic classifier that is being developed in Java as part of the new, fully automated WWLib.
Date: 1. 8.1996 22:08:06
Footnote: Contribution to a special issue devoted to the Proceedings of the 7th International World Wide Web Conference, held 14-18 April 1998, Brisbane, Australia; cf. also: http://www7.scu.edu.au/programme/posters/1846/com1846.htm.

Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.03

0.028635468 = product of:
  0.0429532 = sum of:
    0.02444262 = weight(_text_:to in 2158) [ClassicSimilarity], result of:
      0.02444262 = score(doc=2158,freq=12.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.29521468 = fieldWeight in 2158, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.046875 = fieldNorm(doc=2158)
    0.018510582 = product of:
      0.037021164 = sum of:
        0.037021164 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
          0.037021164 = score(doc=2158,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.23214069 = fieldWeight in 2158, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
Date: 4. 8.2015 19:22:04

Khoo, C.S.G.; Ng, K.; Ou, S.: ¬An exploratory study of human clustering of Web pages (2003) 0.03
```
0.028060667 = product of:
  0.042091 = sum of:
    0.029750613 = weight(_text_:to in 2741) [ClassicSimilarity], result of:
      0.029750613 = score(doc=2741,freq=40.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.3593239 = fieldWeight in 2741, product of:
          6.3245554 = tf(freq=40.0), with freq of:
            40.0 = termFreq=40.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.03125 = fieldNorm(doc=2741)
    0.012340387 = product of:
      0.024680775 = sum of:
        0.024680775 = weight(_text_:22 in 2741) [ClassicSimilarity], result of:
          0.024680775 = score(doc=2741,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.15476047 = fieldWeight in 2741, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03125 = fieldNorm(doc=2741)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

This study seeks to find out how human beings cluster Web pages naturally. Twenty Web pages retrieved by the Northern Light search engine for each of 10 queries were sorted by 3 subjects into categories that were natural or meaningful to them. It was found that different subjects clustered the same set of Web pages quite differently and created different categories. The average inter-subject similarity of the clusters created was a low 0.27. Subjects created an average of 5.4 clusters for each sorting. The categories constructed can be divided into 10 types. About 1/3 of the categories created were topical. Another 20% of the categories relate to the degree of relevance or usefulness. The rest of the categories were subject-independent categories such as format, purpose, authoritativeness and direction to other sources. The authors plan to develop automatic methods for categorizing Web pages using the common categories created by the subjects. It is hoped that the techniques developed can be used by Web search engines to automatically organize Web pages retrieved into categories that are natural to users. 1. Introduction The World Wide Web is an increasingly important source of information for people globally because of its ease of access, the ease of publishing, its ability to transcend geographic and national boundaries, its flexibility and heterogeneity and its dynamic nature. However, Web users also find it increasingly difficult to locate relevant and useful information in this vast information storehouse. Web search engines, despite their scope and power, appear to be quite ineffective. They retrieve too many pages, and though they attempt to rank retrieved pages in order of probable relevance, often the relevant documents do not appear in the top-ranked 10 or 20 documents displayed. Several studies have found that users do not know how to use the advanced features of Web search engines, and do not know how to formulate and re-formulate queries.
Users also typically exert minimal effort in performing, evaluating and refining their searches, and are unwilling to scan more than 10 or 20 items retrieved (Jansen, Spink, Bateman & Saracevic, 1998). This suggests that the conventional ranked-list display of search results does not satisfy user requirements, and that better ways of presenting and summarizing search results have to be developed. One promising approach is to group retrieved pages into clusters or categories to allow users to navigate immediately to the "promising" clusters where the most useful Web pages are likely to be located. This approach has been adopted by a number of search engines (notably Northern Light) and search agents.

Date

12. 9.2004 9:56:22

Yoon, Y.; Lee, C.; Lee, G.G.: ¬An effective procedure for constructing a hierarchical text classification system (2006) 0.03

0.027839875 = product of:
  0.04175981 = sum of:
    0.020164136 = weight(_text_:to in 5273) [ClassicSimilarity], result of:
      0.020164136 = score(doc=5273,freq=6.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.24353972 = fieldWeight in 5273, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5273)
    0.021595677 = product of:
      0.043191355 = sum of:
        0.043191355 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
          0.043191355 = score(doc=5273,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.2708308 = fieldWeight in 5273, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5273)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: In text categorization tasks, classification on some class hierarchies has better results than in cases without the hierarchy. Currently, because a large number of documents are divided into several subgroups in a hierarchy, we can appropriately use a hierarchical classification method. However, we have no systematic method to build a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
Date: 22. 7.2006 16:24:52

Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.03
```
0.025963604 = product of:
  0.038945407 = sum of:
    0.023519924 = weight(_text_:to in 2765) [ClassicSimilarity], result of:
      0.023519924 = score(doc=2765,freq=16.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.28407046 = fieldWeight in 2765, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2765)
    0.015425485 = product of:
      0.03085097 = sum of:
        0.03085097 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
          0.03085097 = score(doc=2765,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.19345059 = fieldWeight in 2765, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2765)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms statistically significantly (99% confidence) the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.

Date

22. 3.2009 19:14:43

Automatic classification research at OCLC (2002) 0.03

0.025373083 = product of:
  0.038059622 = sum of:
    0.016463947 = weight(_text_:to in 1563) [ClassicSimilarity], result of:
      0.016463947 = score(doc=1563,freq=4.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.19884932 = fieldWeight in 1563, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1563)
    0.021595677 = product of:
      0.043191355 = sum of:
        0.043191355 = weight(_text_:22 in 1563) [ClassicSimilarity], result of:
          0.043191355 = score(doc=1563,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.2708308 = fieldWeight in 1563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1563)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged and classified. Accordingly, OCLC has developed products, sponsored research projects, and encouraged the participation in international standards communities whose outcome has been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
Date: 5. 5.2003 9:22:09

Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.02
```
0.023862889 = product of:
  0.035794333 = sum of:
    0.02036885 = weight(_text_:to in 1107) [ClassicSimilarity], result of:
      0.02036885 = score(doc=1107,freq=12.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.24601223 = fieldWeight in 1107, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.015425485 = product of:
      0.03085097 = sum of:
        0.03085097 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.03085097 = score(doc=1107,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.

Date

28.10.2013 19:22:57

Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.02

0.023862753 = product of:
  0.035794128 = sum of:
    0.017283546 = weight(_text_:to in 2760) [ClassicSimilarity], result of:
      0.017283546 = score(doc=2760,freq=6.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.20874833 = fieldWeight in 2760, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.046875 = fieldNorm(doc=2760)
    0.018510582 = product of:
      0.037021164 = sum of:
        0.037021164 = weight(_text_:22 in 2760) [ClassicSimilarity], result of:
          0.037021164 = score(doc=2760,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.23214069 = fieldWeight in 2760, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2760)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.
Date: 22. 3.2009 19:11:54

Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.02

0.022158299 = product of:
  0.033237446 = sum of:
    0.011641769 = weight(_text_:to in 2560) [ClassicSimilarity], result of:
      0.011641769 = score(doc=2560,freq=2.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.14060771 = fieldWeight in 2560, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2560)
    0.021595677 = product of:
      0.043191355 = sum of:
        0.043191355 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
          0.043191355 = score(doc=2560,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.2708308 = fieldWeight in 2560, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Abstract: The proliferation of digital resources and their integration into a traditional library setting has created a pressing need for an automated tool that organizes textual information based on library classification schemes. Automated text classification is a research field of developing tools, methods, and models to automate text classification. This article describes the current popular approach for text classification and major text classification projects and applications that are based on library classification schemes. Related issues and challenges are discussed, and a number of considerations for the challenges are examined.
Date: 22. 9.2008 18:31:54

Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.02
```
0.018992826 = product of:
  0.02848924 = sum of:
    0.0099786585 = weight(_text_:to in 690) [ClassicSimilarity], result of:
      0.0099786585 = score(doc=690,freq=2.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.12052089 = fieldWeight in 690, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.018510582 = product of:
      0.037021164 = sum of:
        0.037021164 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
          0.037021164 = score(doc=690,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.23214069 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded on singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between latent semantic indexing (LSI) term subspace and LSI document subspace. LSISSM does feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.

Date

23. 3.2013 13:22:36

Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.01

0.012340388 = product of:
  0.037021164 = sum of:
    0.037021164 = product of:
      0.07404233 = sum of:
        0.07404233 = weight(_text_:22 in 1046) [ClassicSimilarity], result of:
          0.07404233 = score(doc=1046,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.46428138 = fieldWeight in 1046, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.09375 = fieldNorm(doc=1046)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Date: 5. 5.2003 14:17:22

Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.01
```
0.010518431 = product of:
  0.03155529 = sum of:
    0.03155529 = weight(_text_:to in 6010) [ClassicSimilarity], result of:
      0.03155529 = score(doc=6010,freq=20.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.38112053 = fieldWeight in 6010, product of:
          4.472136 = tf(freq=20.0), with freq of:
            20.0 = termFreq=20.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.046875 = fieldNorm(doc=6010)
  0.33333334 = coord(1/3)
```
Abstract

Current document-retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis (i.e., the ability to distinguish documents according to style) would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer (genre classifiers should be reusable across multiple topics), which does not arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple-topic domains. We also show how different feature-sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled.

Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.01

0.010283656 = product of:
  0.03085097 = sum of:
    0.03085097 = product of:
      0.06170194 = sum of:
        0.06170194 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
          0.06170194 = score(doc=611,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.38690117 = fieldWeight in 611, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=611)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Date: 22. 8.2009 12:54:24

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.01

0.010283656 = product of:
  0.03085097 = sum of:
    0.03085097 = product of:
      0.06170194 = sum of:
        0.06170194 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.06170194 = score(doc=2748,freq=2.0), product of:
            0.15947726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.045541126 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Date: 1. 2.2016 18:25:22

Cosh, K.J.; Burns, R.; Daniel, T.: Content clouds : classifying content in Web 2.0 (2008) 0.01
```
0.00940797 = product of:
  0.02822391 = sum of:
    0.02822391 = weight(_text_:to in 2013) [ClassicSimilarity], result of:
      0.02822391 = score(doc=2013,freq=16.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.34088457 = fieldWeight in 2013, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.046875 = fieldNorm(doc=2013)
  0.33333334 = coord(1/3)
```
Abstract

Purpose - With increasing amounts of user generated content being produced electronically in the form of wikis, blogs, forums, etc., the purpose of this paper is to investigate a new approach to classifying ad hoc content. Design/methodology/approach - The approach applies natural language processing (NLP) tools to automatically extract the content of some text, visualizing the results in a content cloud. Findings - Content clouds share the visual simplicity of a tag cloud, but display the details of an article at a different level of abstraction, providing a complementary classification. Research limitations/implications - Provides the general approach to creating a content cloud. In the future, the process can be refined and enhanced by further evaluation of results. Further work is also required to better identify closely related articles. Practical implications - Being able to automatically classify the content generated by web users will enable others to find more appropriate content. Originality/value - The approach is original. Other researchers have produced a cloud simply by using skiplists to filter unwanted words; this paper's approach improves on this by applying appropriate NLP techniques.
Pong, J.Y.-H.; Kwok, R.C.-W.; Lau, R.Y.-K.; Hao, J.-X.; Wong, P.C.-C.: ¬A comparative study of two automatic document classification methods in a library setting (2008) 0.01
```
0.009193186 = product of:
  0.027579557 = sum of:
    0.027579557 = weight(_text_:to in 2532) [ClassicSimilarity], result of:
      0.027579557 = score(doc=2532,freq=22.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.33310217 = fieldWeight in 2532, product of:
          4.690416 = tf(freq=22.0), with freq of:
            22.0 = termFreq=22.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2532)
  0.33333334 = coord(1/3)
```
Abstract

In current library practice, trained human experts usually carry out document cataloguing and indexing based on a manual approach. With the explosive growth in the number of electronic documents available on the Internet and digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using just a manual approach. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of using automatic document classification methods to categorize library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine learning based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations regarding how to practically apply the KNN algorithm to develop automatic document classification in a library setting are made. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used LCC classification scheme adopted by many large libraries.
AlQenaei, Z.M.; Monarchi, D.E.: ¬The use of learning techniques to analyze the results of a manual classification system (2016) 0.01
```
0.009193186 = product of:
  0.027579557 = sum of:
    0.027579557 = weight(_text_:to in 2836) [ClassicSimilarity], result of:
      0.027579557 = score(doc=2836,freq=22.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.33310217 = fieldWeight in 2836, product of:
          4.690416 = tf(freq=22.0), with freq of:
            22.0 = termFreq=22.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2836)
  0.33333334 = coord(1/3)
```
Abstract

Classification is the process of assigning objects to pre-defined classes based on observations or characteristics of those objects, and there are many approaches to performing this task. The overall objective of this study is to demonstrate the use of two learning techniques to analyze the results of a manual classification system. Our sample consisted of 1,026 documents, from the ACM Computing Classification System, classified by their authors as belonging to one of the groups of the classification system: "H.3 Information Storage and Retrieval." A singular value decomposition of the documents' weighted term-frequency matrix was used to represent each document in a 50-dimensional vector space. The analysis of the representation using both supervised (decision tree) and unsupervised (clustering) techniques suggests that two pairs of the ACM classes are closely related to each other in the vector space. Class 1 (Content Analysis and Indexing) is closely related to Class 3 (Information Search and Retrieval), and Class 4 (Systems and Software) is closely related to Class 5 (Online Information Services). Further analysis was performed to test the diffusion of the words in the two classes using both cosine and Euclidean distance.
Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.01
```
0.009109228 = product of:
  0.027327683 = sum of:
    0.027327683 = weight(_text_:to in 1253) [ClassicSimilarity], result of:
      0.027327683 = score(doc=1253,freq=60.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.33006006 = fieldWeight in 1253, product of:
          7.745967 = tf(freq=60.0), with freq of:
            60.0 = termFreq=60.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.0234375 = fieldNorm(doc=1253)
  0.33333334 = coord(1/3)
```
Abstract

Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increase, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources and within a single source.
That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to automatically extract the requisite collection metadata that must be distributed.
We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a `training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. 
The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
Liu, R.-L.: Dynamic category profiling for text filtering and classification (2007) 0.01
```
0.00880035 = product of:
  0.026401049 = sum of:
    0.026401049 = weight(_text_:to in 900) [ClassicSimilarity], result of:
      0.026401049 = score(doc=900,freq=14.0), product of:
        0.08279609 = queryWeight, product of:
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.045541126 = queryNorm
        0.3188683 = fieldWeight in 900, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.818051 = idf(docFreq=19512, maxDocs=44218)
          0.046875 = fieldNorm(doc=900)
  0.33333334 = coord(1/3)
```
Abstract

Information is often represented in text form and classified into categories. Unfortunately, automatic classifiers often misclassify documents. One of the reasons is that the documents for training the classifiers are mainly from the categories, leading the classifiers to derive category profiles for distinguishing each category from others, rather than measuring the extent to which a document's content overlaps that of a category. To tackle the problem, we present a technique, DP4FC, that selects suitable features to construct category profiles to distinguish relevant documents from irrelevant documents. More specifically, DP4FC is associated with various classifiers. Upon receiving a document, it helps the classifiers to create dynamic category profiles with respect to the document, and accordingly make proper decisions in filtering and classification. Theoretical analysis and empirical results show that DP4FC may significantly promote different classifiers' performances under various environments.
