Search (65 results, page 1 of 4)

  • year_i:[2000 TO 2010}
  • theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.19
    0.18589748 = sum of:
      0.08280347 = product of:
        0.2484104 = sum of:
          0.2484104 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
            0.2484104 = score(doc=562,freq=2.0), product of:
              0.4419972 = queryWeight, product of:
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.05213454 = queryNorm
              0.56201804 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
        0.33333334 = coord(1/3)
      0.10309401 = sum of:
        0.060712952 = weight(_text_:classification in 562) [ClassicSimilarity], result of:
          0.060712952 = score(doc=562,freq=6.0), product of:
            0.16603322 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.05213454 = queryNorm
            0.3656675 = fieldWeight in 562, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.04238106 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.04238106 = score(doc=562,freq=2.0), product of:
            0.18256627 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.05213454 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
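     The score breakdown above is Lucene's ClassicSimilarity (TF-IDF) explain output: queryWeight = idf x queryNorm, fieldWeight = sqrt(tf) x idf x fieldNorm, each leaf weight is the product of the two, and partial sums are scaled by coord. A minimal Python sketch re-deriving the first leaf from the printed factors (variable names are ours, not Lucene's API):

        import math

        # Factors copied from the leaf for term "_text_:3a" in doc 562.
        max_docs, doc_freq = 44218, 24
        idf = math.log(max_docs / (doc_freq + 1)) + 1   # 8.478011
        query_norm = 0.05213454                         # queryNorm
        tf = math.sqrt(2.0)                             # 1.4142135 = tf(freq=2.0)
        field_norm = 0.046875                           # fieldNorm(doc=562)

        query_weight = idf * query_norm                 # 0.4419972
        field_weight = tf * idf * field_norm            # 0.56201804
        weight = query_weight * field_weight            # 0.2484104
        print(weight * (1 / 3))                         # 0.08280347 after coord(1/3)
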
    
    Abstract
     Document representations for text classification are typically based on the classical bag-of-words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for the actual classification. Experimental evaluations on two well-known text corpora support our approach with consistently improved results.
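     As a rough sketch of the idea (not the authors' code; the concept map, corpus, and labels below are invented for illustration), concept tokens drawn from background knowledge can be appended to the bag-of-words before boosting weak learners over the combined features:

        from sklearn.ensemble import AdaBoostClassifier
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.pipeline import make_pipeline

        # Toy stand-in for background knowledge: surface terms -> broader concepts.
        concepts = {"soccer": "sport", "golf": "sport", "bond": "finance", "stock": "finance"}

        def add_concepts(text):
            # Append a concept token for each known term, enriching the bag-of-words.
            tokens = text.lower().split()
            return " ".join(tokens + ["c_" + concepts[t] for t in tokens if t in concepts])

        docs = ["soccer match won", "golf open results", "stock market falls", "bond yields rise"]
        labels = ["sports", "sports", "economy", "economy"]

        # Boosted decision stumps over the combined term+concept representation.
        clf = make_pipeline(CountVectorizer(), AdaBoostClassifier(n_estimators=50))
        clf.fit([add_concepts(d) for d in docs], labels)
        print(clf.predict([add_concepts("golf stock")]))
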
    Content
     Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf.
    Date
    8. 1.2013 10:22:32
  2. Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.08
    0.08255619 = product of:
      0.16511238 = sum of:
        0.16511238 = sum of:
          0.115667805 = weight(_text_:classification in 2560) [ClassicSimilarity], result of:
            0.115667805 = score(doc=2560,freq=16.0), product of:
              0.16603322 = queryWeight, product of:
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.05213454 = queryNorm
              0.69665456 = fieldWeight in 2560, product of:
                4.0 = tf(freq=16.0), with freq of:
                  16.0 = termFreq=16.0
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2560)
          0.04944457 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
            0.04944457 = score(doc=2560,freq=2.0), product of:
              0.18256627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.05213454 = queryNorm
              0.2708308 = fieldWeight in 2560, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2560)
      0.5 = coord(1/2)
    
    Abstract
     The proliferation of digital resources and their integration into the traditional library setting have created a pressing need for automated tools that organize textual information based on library classification schemes. Automated text classification is the research field that develops tools, methods, and models to automate this process. This article describes the currently popular approach to text classification and the major text classification projects and applications that are based on library classification schemes. Related issues and challenges are discussed, and a number of considerations for addressing the challenges are examined.
    Date
    22. 9.2008 18:31:54
  3. Yoon, Y.; Lee, C.; Lee, G.G.: ¬An effective procedure for constructing a hierarchical text classification system (2006) 0.07
    0.07480791 = product of:
      0.14961582 = sum of:
        0.14961582 = sum of:
          0.10017125 = weight(_text_:classification in 5273) [ClassicSimilarity], result of:
            0.10017125 = score(doc=5273,freq=12.0), product of:
              0.16603322 = queryWeight, product of:
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.05213454 = queryNorm
              0.60332054 = fieldWeight in 5273, product of:
                3.4641016 = tf(freq=12.0), with freq of:
                  12.0 = termFreq=12.0
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5273)
          0.04944457 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
            0.04944457 = score(doc=5273,freq=2.0), product of:
              0.18256627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.05213454 = queryNorm
              0.2708308 = fieldWeight in 5273, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5273)
      0.5 = coord(1/2)
    
    Abstract
     In text categorization tasks, classification over a class hierarchy often yields better results than classification without one. Because a large number of documents are divided into several subgroups in a hierarchy, a hierarchical classification method can be applied appropriately. However, there has been no systematic method for building a hierarchical classification system that performs well on large collections of practical data. In this article, we introduce a new evaluation scheme for internal-node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for constructing classifiers applied to hierarchy trees with many levels.
    Date
    22. 7.2006 16:24:52
  4. Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.06
    0.06038057 = product of:
      0.12076114 = sum of:
        0.12076114 = sum of:
          0.078380086 = weight(_text_:classification in 2760) [ClassicSimilarity], result of:
            0.078380086 = score(doc=2760,freq=10.0), product of:
              0.16603322 = queryWeight, product of:
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.05213454 = queryNorm
              0.4720747 = fieldWeight in 2760, product of:
                3.1622777 = tf(freq=10.0), with freq of:
                  10.0 = termFreq=10.0
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.046875 = fieldNorm(doc=2760)
          0.04238106 = weight(_text_:22 in 2760) [ClassicSimilarity], result of:
            0.04238106 = score(doc=2760,freq=2.0), product of:
              0.18256627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.05213454 = queryNorm
              0.23214069 = fieldWeight in 2760, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=2760)
      0.5 = coord(1/2)
    
    Abstract
    Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.
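     A minimal sketch of the gating idea (our simplification, not the CRHTC algorithm; the toy hierarchy and keyword contexts are invented): a document is admitted to a category only if it also matches the context defined by every ancestor:

        # Each category's context of discussion (COD) is approximated here by a
        # keyword set; real CODs would be derived from ancestor category contents.
        hierarchy = {"science": None, "physics": "science", "optics": "physics"}
        keywords = {"science": {"study", "theory"},
                    "physics": {"energy", "matter"},
                    "optics":  {"light", "lens"}}

        def classify(doc, category="optics"):
            tokens = set(doc.lower().split())
            node = category
            while node is not None:            # walk up: every ancestor must match
                if not (keywords[node] & tokens):
                    return False
                node = hierarchy[node]
            return True

        print(classify("a theory of light and matter"))  # True
        print(classify("light lens catalogue"))          # False: no ancestor context
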
    Date
    22. 3.2009 19:11:54
  5. Automatic classification research at OCLC (2002) 0.06
    0.060138173 = product of:
      0.12027635 = sum of:
        0.12027635 = sum of:
          0.070831776 = weight(_text_:classification in 1563) [ClassicSimilarity], result of:
            0.070831776 = score(doc=1563,freq=6.0), product of:
              0.16603322 = queryWeight, product of:
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.05213454 = queryNorm
              0.42661208 = fieldWeight in 1563, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1563)
          0.04944457 = weight(_text_:22 in 1563) [ClassicSimilarity], result of:
            0.04944457 = score(doc=1563,freq=2.0), product of:
              0.18256627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.05213454 = queryNorm
              0.2708308 = fieldWeight in 1563, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1563)
      0.5 = coord(1/2)
    
    Abstract
     OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged, and classified. Accordingly, OCLC has developed products, sponsored research projects, and encouraged participation in international standards communities whose outcomes have been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
    Date
    5. 5.2003 9:22:09
  6. Pfeffer, M.: Automatische Vergabe von RVK-Notationen mittels fallbasiertem Schließen (2009) 0.04
    0.03871685 = product of:
      0.0774337 = sum of:
        0.0774337 = sum of:
          0.03505264 = weight(_text_:classification in 3051) [ClassicSimilarity], result of:
            0.03505264 = score(doc=3051,freq=2.0), product of:
              0.16603322 = queryWeight, product of:
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.05213454 = queryNorm
              0.21111822 = fieldWeight in 3051, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.046875 = fieldNorm(doc=3051)
          0.04238106 = weight(_text_:22 in 3051) [ClassicSimilarity], result of:
            0.04238106 = score(doc=3051,freq=2.0), product of:
              0.18256627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.05213454 = queryNorm
              0.23214069 = fieldWeight in 3051, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=3051)
      0.5 = coord(1/2)
    
    Date
    22. 8.2009 19:51:28
    Footnote
     See also the presentations at: http://www.bibliothek.uni-regensburg.de/Systematik/pdf/Anw2008_PPT1.pdf. http://blog.bib.uni-mannheim.de/Classification/wp-content/uploads/2007/10/hu-berlin-2007-2.pdf. Full texts at:
  7. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.04
    0.03831374 = product of:
      0.07662748 = sum of:
        0.07662748 = sum of:
          0.04130993 = weight(_text_:classification in 2765) [ClassicSimilarity], result of:
            0.04130993 = score(doc=2765,freq=4.0), product of:
              0.16603322 = queryWeight, product of:
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.05213454 = queryNorm
              0.24880521 = fieldWeight in 2765, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2765)
          0.03531755 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
            0.03531755 = score(doc=2765,freq=2.0), product of:
              0.18256627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.05213454 = queryNorm
              0.19345059 = fieldWeight in 2765, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2765)
      0.5 = coord(1/2)
    
    Abstract
     Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage-detection techniques are deployed to detect hidden paragraphs in documents; that is, to hide information, hidden text is injected into a document's passages. Rather than matching query terms against passages to determine their relevance, the passages are classified using text-mining techniques. Documents with hidden passages are defined as infected. Simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages, i.e., labeling passages with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP statistically significantly (99% confidence) outperforms the other document-splitting approaches by 12% to 18% on the passage-detection and passage category-prediction tasks. Furthermore, we evaluate the effects of feature selection, passage length, ambiguous passages, and training-data category distribution on passage-detection accuracy.
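     A hedged sketch of the fixed-window document-splitting baseline the article compares against (not KDP itself; the training snippets and labels are invented): split the document into short passages and classify each one:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Train a passage-level classifier on labelled snippets (toy data).
        train = ["quarterly earnings report", "player scores goal", "launch codes attached"]
        y = ["business", "sports", "sensitive"]
        vec = TfidfVectorizer().fit(train)
        clf = LogisticRegression().fit(vec.transform(train), y)

        def detect(document, size=3):
            # Fixed-size windows; KDP instead grows passages around keywords.
            words = document.split()
            passages = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
            return [(p, clf.predict(vec.transform([p]))[0]) for p in passages]

        doc = "player scores goal today launch codes attached below quarterly earnings report"
        for passage, label in detect(doc):
            print(label, "<-", passage)
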
    Date
    22. 3.2009 19:14:43
  8. Koch, T.; Ardö, A.: Automatic classification of full-text HTML-documents from one specific subject area : DESIRE II D3.6a, Working Paper 2 (2000) 0.03
    0.028620359 = product of:
      0.057240717 = sum of:
        0.057240717 = product of:
          0.114481434 = sum of:
            0.114481434 = weight(_text_:classification in 1667) [ClassicSimilarity], result of:
              0.114481434 = score(doc=1667,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.6895092 = fieldWeight in 1667, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1667)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Content
    1 Introduction / 2 Method overview / 3 Ei thesaurus preprocessing / 4 Automatic classification process: 4.1 Matching -- 4.2 Weighting -- 4.3 Preparation for display / 5 Results of the classification process / 6 Evaluations / 7 Software / 8 Other applications / 9 Experiments with universal classification systems / References / Appendix A: Ei classification service: Software / Appendix B: Use of the classification software as subject filter in a WWW harvester.
  9. Yi, K.: Challenges in automated classification using library classification schemes (2006) 0.03
    0.028620359 = product of:
      0.057240717 = sum of:
        0.057240717 = product of:
          0.114481434 = sum of:
            0.114481434 = weight(_text_:classification in 5810) [ClassicSimilarity], result of:
              0.114481434 = score(doc=5810,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.6895092 = fieldWeight in 5810, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5810)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Major library classification schemes have long been the standard classification framework for information sources in the traditional library environment, and text classification (TC) has become a popular and attractive tool for organizing digital information. This paper gives an overview of previous projects and studies on TC using major library classification schemes and summarizes the research challenges for TC.
  10. Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.03
    0.02633002 = product of:
      0.05266004 = sum of:
        0.05266004 = product of:
          0.10532008 = sum of:
            0.10532008 = weight(_text_:classification in 3614) [ClassicSimilarity], result of:
              0.10532008 = score(doc=3614,freq=26.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.63433135 = fieldWeight in 3614, product of:
                  5.0990195 = tf(freq=26.0), with freq of:
                    26.0 = termFreq=26.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3614)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification system for browsing. The classification algorithm was evaluated by the users, who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited to browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success in browsing was shown to be correlated with, and dependent on, classification correctness. Research limitations/implications - Further research should address the problem of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from the start; allowing searches for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesauri terms); and, when searching for class captions, returning the hierarchical tree expanded around the class whose caption contains the search term. The need for improvements to classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
    Object
    Engineering Index Classification
  11. Yoon, Y.; Lee, G.G.: Efficient implementation of associative classifiers for document classification (2007) 0.02
    0.024785958 = product of:
      0.049571916 = sum of:
        0.049571916 = product of:
          0.09914383 = sum of:
            0.09914383 = weight(_text_:classification in 909) [ClassicSimilarity], result of:
              0.09914383 = score(doc=909,freq=16.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.5971325 = fieldWeight in 909, product of:
                  4.0 = tf(freq=16.0), with freq of:
                    16.0 = termFreq=16.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=909)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     In practical text classification tasks, the ability to interpret the classification result is as important as the ability to classify exactly. Associative classifiers have many favorable characteristics, such as rapid training, good classification accuracy, and excellent interpretability. However, associative classifiers also have some obstacles to overcome when applied to text classification. The target text collection generally has very high dimensionality, so the training process might take a very long time. We propose a feature selection based on the mutual information between the word and class variables to reduce the space dimension of the associative classifiers. In addition, the training process of the associative classifier produces a huge number of classification rules, which makes prediction for a new document ineffective. We resolve this by introducing a new efficient method for storing and pruning classification rules; this method can also be used when predicting a test document. Experimental results using the 20-newsgroups dataset show many benefits of associative classification in both training and prediction when applied to a real-world problem.
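     A small sketch of the feature-scoring step under stated assumptions (pointwise mutual information estimated from document frequencies; the paper's exact estimator may differ, and the corpus is invented):

        import math
        from collections import Counter

        docs = [("cheap pills online", "spam"), ("meeting agenda attached", "ham"),
                ("cheap offer online", "spam"), ("project meeting notes", "ham")]

        N = len(docs)
        joint, word_df, cls_df = Counter(), Counter(), Counter()
        for text, c in docs:
            cls_df[c] += 1
            for w in set(text.split()):
                joint[(w, c)] += 1
                word_df[w] += 1

        def mi(w, c):
            # Pointwise MI between word occurrence and class membership.
            p_wc = joint[(w, c)] / N
            if p_wc == 0:
                return 0.0
            return p_wc * math.log2(p_wc / ((word_df[w] / N) * (cls_df[c] / N)))

        # Keep the words most informative about any class.
        scores = {w: max(mi(w, c) for c in cls_df) for w in word_df}
        print(sorted(scores, key=scores.get, reverse=True)[:4])
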
  12. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.02
    0.023185141 = product of:
      0.046370283 = sum of:
        0.046370283 = product of:
          0.092740566 = sum of:
            0.092740566 = weight(_text_:classification in 1808) [ClassicSimilarity], result of:
              0.092740566 = score(doc=1808,freq=14.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.55856633 = fieldWeight in 1808, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1808)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Hierarchical text classification, or simply hierarchical classification, refers to assigning a document to one or more suitable categories from a hierarchical category space. In our literature survey, we found that existing hierarchical classification experiments used a variety of measures to evaluate performance. These performance measures often assume independence between categories and do not consider documents misclassified into categories that are similar or not far from the correct categories in the category tree. In this paper, we therefore propose new performance measures for hierarchical classification. The proposed performance measures consist of category-similarity measures and distance-based measures that consider the contributions of misclassified documents. Our experiments on hierarchical classification methods based on SVM classifiers and binary Naive Bayes classifiers showed that SVM classifiers perform better than Naive Bayes classifiers on the Reuters-21578 collection according to the extended measures. A new classifier-centric measure called the blocking measure is also defined to examine the performance of subtree classifiers in a top-down level-based hierarchical classification method.
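     A sketch of one distance-based measure in this spirit (our formulation, not the paper's exact definition; the toy hierarchy is invented): a misclassification is penalised by the tree distance between the predicted and true categories:

        parent = {"root": None, "sci": "root", "physics": "sci", "chem": "sci", "arts": "root"}

        def ancestors(node):
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path

        def tree_distance(a, b):
            pa, pb = ancestors(a), ancestors(b)
            lca = next(x for x in pa if x in pb)      # lowest common ancestor
            return pa.index(lca) + pb.index(lca)

        def distance_score(pairs, max_dist=4):
            # Full credit for exact hits, decaying linearly with tree distance.
            return sum(max(0.0, 1 - tree_distance(p, t) / max_dist)
                       for p, t in pairs) / len(pairs)

        print(tree_distance("physics", "chem"))   # 2: siblings under "sci"
        print(distance_score([("physics", "physics"), ("physics", "chem")]))
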
  13. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.02
    0.023092953 = product of:
      0.046185907 = sum of:
        0.046185907 = product of:
          0.092371814 = sum of:
            0.092371814 = weight(_text_:classification in 831) [ClassicSimilarity], result of:
              0.092371814 = score(doc=831,freq=20.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.55634534 = fieldWeight in 831, product of:
                  4.472136 = tf(freq=20.0), with freq of:
                    20.0 = termFreq=20.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=831)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Purpose - The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as for Chinese and Japanese, where no word-boundary information is available in written text. The paper advocates a simple language-modeling-based approach for this task. Design/methodology/approach - Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text, and a segmentation-based approach was compared with the non-segmentation-based approach. Findings - There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word-level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications - Applying the findings to real web text classification is ongoing work. Originality/value - The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.
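     A minimal sketch of the language-modeling classification rule without word segmentation (add-one-smoothed character bigrams; the paper's models are richer, and the training snippets are invented): score a document under each class's model and pick the most likely class:

        import math
        from collections import Counter

        def ngrams(text, n=2):
            return [text[i:i + n] for i in range(len(text) - n + 1)]

        class CharLM:
            # Character-bigram bag model with add-one smoothing (toy version).
            def __init__(self, texts):
                self.counts = Counter(g for t in texts for g in ngrams(t))
                self.total = sum(self.counts.values())
                self.vocab = len(self.counts) + 1
            def logprob(self, text):
                return sum(math.log((self.counts[g] + 1) / (self.total + self.vocab))
                           for g in ngrams(text))

        models = {"sport": CharLM(["足球比赛", "篮球冠军"]),
                  "economy": CharLM(["股票市场", "经济增长"])}
        print(max(models, key=lambda c: models[c].logprob("足球冠军")))  # sport
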
  14. Li, T.; Zhu, S.; Ogihara, M.: Text categorization via generalized discriminant analysis (2008) 0.02
    0.021907898 = product of:
      0.043815795 = sum of:
        0.043815795 = product of:
          0.08763159 = sum of:
            0.08763159 = weight(_text_:classification in 2119) [ClassicSimilarity], result of:
              0.08763159 = score(doc=2119,freq=18.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.5277955 = fieldWeight in 2119, product of:
                  4.2426405 = tf(freq=18.0), with freq of:
                    18.0 = termFreq=18.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2119)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Text categorization is an important research area and has been receiving much attention due to the growth of online information and of the Internet. Automated text categorization is generally cast as a multi-class classification problem. Much previous work focused on binary document classification problems. Support vector machines (SVMs) excel at binary classification, but the elegant theory behind the large-margin hyperplane cannot be easily extended to multi-class text classification. In addition, training time and scaling are also important concerns. On the other hand, other techniques that naturally extend to multi-class classification are generally not as accurate as SVM. This paper presents a simple and efficient solution to multi-class text categorization. Classification problems are first formulated as optimization via discriminant analysis. Text categorization is then cast as the problem of finding coordinate transformations that reflect the inherent similarity from the data. While most previous approaches decompose a multi-class classification problem into multiple independent binary classification tasks, the proposed approach enables direct multi-class classification. By using generalized singular value decomposition (GSVD), a coordinate transformation that reflects the inherent class structure indicated by the generalized singular values is identified. Extensive experiments demonstrate the efficiency and effectiveness of the proposed approach.
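     The paper computes the transformation with GSVD; as a hedged stand-in, scikit-learn's LinearDiscriminantAnalysis with its SVD solver illustrates the same direct multi-class projection (the corpus and labels below are invented):

        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.feature_extraction.text import TfidfVectorizer

        docs = ["stocks fell sharply", "bond market rallies",
                "team wins final", "coach praises striker",
                "new vaccine trial", "hospital study results"]
        labels = ["econ", "econ", "sport", "sport", "health", "health"]

        X = TfidfVectorizer().fit_transform(docs).toarray()  # LDA needs dense input
        lda = LinearDiscriminantAnalysis(solver="svd")       # SVD copes with singular scatter
        Z = lda.fit_transform(X, labels)
        print(Z.shape)  # (6, 2): one direct mapping, at most n_classes - 1 dimensions
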
  15. Prabowo, R.; Jackson, M.; Burden, P.; Knoell, H.-D.: Ontology-based automatic classification for the Web pages : design, implementation and evaluation (2002) 0.02
    0.02146527 = product of:
      0.04293054 = sum of:
        0.04293054 = product of:
          0.08586108 = sum of:
            0.08586108 = weight(_text_:classification in 3383) [ClassicSimilarity], result of:
              0.08586108 = score(doc=3383,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.5171319 = fieldWeight in 3383, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3383)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     In recent years, we have witnessed continual growth in the use of ontologies to provide a mechanism for machine reasoning. This paper describes an automatic classifier, which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion and mapped into DDC and LCC. Secondly, we propose a formal definition of a DDC-LCC and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, an experiment evaluating the accuracy of the classifier is presented. The experiment shows that our approach results in improved classification in terms of accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.
  16. Ko, Y.; Seo, J.: Text classification from unlabeled documents with bootstrapping and feature projection techniques (2009) 0.02
    0.02146527 = product of:
      0.04293054 = sum of:
        0.04293054 = product of:
          0.08586108 = sum of:
            0.08586108 = weight(_text_:classification in 2452) [ClassicSimilarity], result of:
              0.08586108 = score(doc=2452,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.5171319 = fieldWeight in 2452, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2452)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, supervised learning approaches have some problems. The most notable is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficult to generate because the labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then automatically learns a text classifier using bootstrapping and feature projection techniques. The results of experiments showed that the proposed method achieved reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.
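     A simplified sketch of the bootstrapping step (our reduction, not the authors' method; the title words and documents are invented): take unlabeled documents containing a category's title word as machine-labeled seeds, train on them, and classify the rest:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB

        title_words = {"sports": "game", "economy": "market"}
        unlabeled = ["the game ended in a draw", "market prices rose again",
                     "fans left the game early", "traders watched the market",
                     "a quiet afternoon downtown"]

        # Seed-label the documents that contain a category's title word.
        seeds = [(d, cat) for d in unlabeled
                 for cat, w in title_words.items() if w in d.split()]

        vec = TfidfVectorizer().fit(unlabeled)
        clf = MultinomialNB().fit(vec.transform([d for d, _ in seeds]),
                                  [c for _, c in seeds])
        rest = [d for d in unlabeled if d not in dict(seeds)]
        print(list(zip(rest, clf.predict(vec.transform(rest)))))
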
  17. Choi, B.; Peng, X.: Dynamic and hierarchical classification of Web pages (2004) 0.02
    0.02146527 = product of:
      0.04293054 = sum of:
        0.04293054 = product of:
          0.08586108 = sum of:
            0.08586108 = weight(_text_:classification in 2555) [ClassicSimilarity], result of:
              0.08586108 = score(doc=2555,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.5171319 = fieldWeight in 2555, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2555)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Automatic classification of Web pages is an effective way to organise the vast amount of information and to assist in retrieving relevant information from the Internet. Although many automatic classification systems have been proposed, most of them ignore the conflict between the fixed number of categories and the growing number of Web pages being added into the systems. They also require searching through all existing categories to make any classification. This article proposes a dynamic and hierarchical classification system that is capable of adding new categories as required, organising the Web pages into a tree structure, and classifying Web pages by searching through only one path of the tree. The proposed single-path search technique reduces the search complexity from O(n) to O(log(n)). Test results show that the system improves the accuracy of classification by 6 percent in comparison to related systems. The dynamic-category expansion technique also achieves satisfying results for adding new categories into the system as required.
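     A minimal sketch of the single-path idea (our toy version; the tree and term profiles are invented): descend from the root, following only the best-matching child at each node, so classification visits one root-to-leaf path instead of every category:

        tree = {"root": ["science", "arts"],
                "science": ["physics", "biology"],
                "arts": ["music", "film"]}
        profile = {"science": {"theory", "experiment"}, "arts": {"style", "scene"},
                   "physics": {"quantum", "energy"}, "biology": {"cell", "gene"},
                   "music": {"melody", "chord"}, "film": {"camera", "actor"}}

        def classify(doc):
            tokens, node = set(doc.lower().split()), "root"
            while node in tree:  # descend one path until a leaf is reached
                node = max(tree[node], key=lambda c: len(profile[c] & tokens))
            # A dynamic variant would add a new child when no class matches well.
            return node

        print(classify("a quantum energy experiment"))  # physics
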
  18. Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.02
    0.02119053 = product of:
      0.04238106 = sum of:
        0.04238106 = product of:
          0.08476212 = sum of:
            0.08476212 = weight(_text_:22 in 1046) [ClassicSimilarity], result of:
              0.08476212 = score(doc=1046,freq=2.0), product of:
                0.18256627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05213454 = queryNorm
                0.46428138 = fieldWeight in 1046, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=1046)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    5. 5.2003 14:17:22
  19. Wang, J.: ¬An extensive study on automated Dewey Decimal Classification (2009) 0.02
    0.020654965 = product of:
      0.04130993 = sum of:
        0.04130993 = product of:
          0.08261986 = sum of:
            0.08261986 = weight(_text_:classification in 3172) [ClassicSimilarity], result of:
              0.08261986 = score(doc=3172,freq=16.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.49761042 = fieldWeight in 3172, product of:
                  4.0 = tf(freq=16.0), with freq of:
                    16.0 = termFreq=16.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3172)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first analyze statistically the document and category distributions over the DDC and discuss the obstacles that bibliographic corpora and library classification schemes impose on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable for real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
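     One simple way to realize the flattening step (a hedged sketch under our own assumptions, not the paper's balanced-virtual-tree algorithm; the DDC fragment is only illustrative): fold every class deeper than a cutoff into its ancestor at that depth:

        # Child -> parent links for a tiny DDC-like fragment.
        parent = {"5": None, "53": "5", "539": "53", "539.7": "539"}

        def fold_to_depth(cls, depth=2):
            chain = []
            node = cls
            while node is not None:
                chain.append(node)
                node = parent[node]
            chain.reverse()                      # root ... leaf
            return chain[min(depth, len(chain)) - 1]

        print(fold_to_depth("539.7"))            # '53': deep classes fold upward
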
  20. Wu, M.; Fuller, M.; Wilkinson, R.: Using clustering and classification approaches in interactive retrieval (2001) 0.02
    0.020447372 = product of:
      0.040894743 = sum of:
        0.040894743 = product of:
          0.081789486 = sum of:
            0.081789486 = weight(_text_:classification in 2666) [ClassicSimilarity], result of:
              0.081789486 = score(doc=2666,freq=2.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.49260917 = fieldWeight in 2666, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.109375 = fieldNorm(doc=2666)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    

Languages

  • e 59
  • d 5
  • a 1

Types

  • a 55
  • el 11
  • m 1
  • r 1
  • s 1
  • x 1