Search (57 results, page 1 of 3)

  • theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.42
    0.41826025 = product of:
      0.5019123 = sum of:
        0.060683887 = product of:
          0.18205166 = sum of:
            0.18205166 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.18205166 = score(doc=562,freq=2.0), product of:
                0.32392493 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.038207654 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.33333334 = coord(1/3)
        0.18205166 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.18205166 = score(doc=562,freq=2.0), product of:
            0.32392493 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.038207654 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.18205166 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.18205166 = score(doc=562,freq=2.0), product of:
            0.32392493 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.038207654 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.06677184 = weight(_text_:propose in 562) [ClassicSimilarity], result of:
          0.06677184 = score(doc=562,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.010353219 = product of:
          0.031059656 = sum of:
            0.031059656 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
              0.031059656 = score(doc=562,freq=2.0), product of:
                0.13379669 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.038207654 = queryNorm
                0.23214069 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.33333334 = coord(1/3)
      0.8333333 = coord(5/6)
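     The breakdown above is Lucene's ClassicSimilarity (TF-IDF) explanation: each term weight is queryWeight × fieldWeight, with queryWeight = idf × queryNorm and fieldWeight = tf × idf × fieldNorm, and partial sums are scaled by the coord factors. A minimal Python sketch that reproduces the figures reported for doc 562 (the helper function is illustrative, not Lucene API):

     import math

     def term_weight(freq, idf, query_norm, field_norm):
         """One ClassicSimilarity term weight: queryWeight * fieldWeight."""
         tf = math.sqrt(freq)                   # 1.4142135 for freq=2.0
         query_weight = idf * query_norm        # 0.32392493
         field_weight = tf * idf * field_norm   # 0.56201804
         return query_weight * field_weight     # 0.18205166

     # Values copied from the explanation of doc 562 above.
     w = term_weight(freq=2.0, idf=8.478011,
                     query_norm=0.038207654, field_norm=0.046875)

     # The document score combines the five matching clauses with coord factors.
     clauses = [w / 3,                # _text_:3a, inner coord(1/3)
                w, w,                 # the two _text_:2f clauses
                0.06677184,           # _text_:propose
                0.031059656 / 3]      # _text_:22, inner coord(1/3)
     print(round(sum(clauses) * 5 / 6, 8))   # outer coord(5/6) -> ~0.41826025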
    
    Abstract
     Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for the actual classification. Experimental evaluations on two well-known text corpora support our approach through consistent improvement of the results.
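     A minimal sketch of the term-plus-concept idea, assuming scikit-learn and treating concepts as precomputed pseudo-tokens appended to each document (an approximation, not the authors' exact pipeline); boosted decision stumps stand in for the weak learners:

     from sklearn.ensemble import AdaBoostClassifier
     from sklearn.feature_extraction.text import CountVectorizer
     from sklearn.pipeline import make_pipeline

     docs = ["stocks fell sharply after the report",
             "the team won the championship game"]
     concepts = [["CONCEPT_finance", "CONCEPT_economy"], ["CONCEPT_sports"]]
     labels = ["economy", "sports"]

     # Append concept identifiers as extra tokens so terms and concepts
     # share one bag-of-features representation.
     augmented = [d + " " + " ".join(c) for d, c in zip(docs, concepts)]

     # Boosting over decision stumps (scikit-learn's default weak learner).
     model = make_pipeline(CountVectorizer(),
                           AdaBoostClassifier(n_estimators=50))
     model.fit(augmented, labels)
     print(model.predict(["the game report"]))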
    Content
     Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
    Date
    8. 1.2013 10:22:32
  2. Liu, X.; Yu, S.; Janssens, F.; Glänzel, W.; Moreau, Y.; Moor, B.de: Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database (2010) 0.03
    0.03495895 = product of:
      0.104876846 = sum of:
        0.09442965 = weight(_text_:propose in 3464) [ClassicSimilarity], result of:
          0.09442965 = score(doc=3464,freq=4.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.48135406 = fieldWeight in 3464, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=3464)
        0.0104471985 = product of:
          0.031341594 = sum of:
            0.031341594 = weight(_text_:29 in 3464) [ClassicSimilarity], result of:
              0.031341594 = score(doc=3464,freq=2.0), product of:
                0.13440257 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.038207654 = queryNorm
                0.23319192 = fieldWeight in 3464, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3464)
          0.33333334 = coord(1/3)
      0.33333334 = coord(2/6)
    
    Abstract
     We propose a new hybrid clustering framework to incorporate text mining with bibliometrics in journal set analysis. The framework integrates two different approaches: clustering ensemble and kernel-fusion clustering. To improve the flexibility and the efficiency of processing large-scale data, we propose an information-based weighting scheme to leverage the effect of multiple data sources in hybrid clustering. Three different algorithms are extended by the proposed weighting scheme and they are employed on a large journal set retrieved from the Web of Science (WoS) database. The clustering performance of the proposed algorithms is systematically evaluated using multiple evaluation methods and cross-compared with alternative methods. Experimental results demonstrate that the proposed weighted hybrid clustering strategy is superior to other methods in clustering performance and efficiency. The proposed approach also provides a more refined structural mapping of journal sets, which is useful for monitoring and detecting new trends in different scientific fields.
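     A minimal sketch of the kernel-fusion step, assuming NumPy and scikit-learn: a text-similarity kernel and a citation-based kernel are combined with source weights before clustering. The weights here are fixed by hand; the paper derives them with an information-based scheme.

     import numpy as np
     from sklearn.cluster import SpectralClustering

     rng = np.random.default_rng(0)
     n = 20                                  # toy "journal" set
     K_text = rng.random((n, n))             # stand-in text-similarity kernel
     K_text = (K_text + K_text.T) / 2
     K_cite = rng.random((n, n))             # stand-in bibliometric kernel
     K_cite = (K_cite + K_cite.T) / 2

     w_text, w_cite = 0.6, 0.4               # hand-picked source weights
     K_fused = w_text * K_text + w_cite * K_cite
     np.fill_diagonal(K_fused, 1.0)          # each journal fully similar to itself

     labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                                 random_state=0).fit_predict(K_fused)
     print(labels)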
    Date
    1. 6.2010 9:29:57
  3. Li, T.; Zhu, S.; Ogihara, M.: Hierarchical document classification using automatically generated hierarchy (2007) 0.03
    0.025739681 = product of:
      0.07721904 = sum of:
        0.06677184 = weight(_text_:propose in 4797) [ClassicSimilarity], result of:
          0.06677184 = score(doc=4797,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 4797, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=4797)
        0.0104471985 = product of:
          0.031341594 = sum of:
            0.031341594 = weight(_text_:29 in 4797) [ClassicSimilarity], result of:
              0.031341594 = score(doc=4797,freq=2.0), product of:
                0.13440257 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.038207654 = queryNorm
                0.23319192 = fieldWeight in 4797, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4797)
          0.33333334 = coord(1/3)
      0.33333334 = coord(2/6)
    
    Abstract
     Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using the generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.
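     A minimal sketch of the two-step idea, assuming scikit-learn and synthetic vectors in place of real document features: project documents with linear discriminant analysis, then cluster the category centroids to obtain intermediate hierarchy levels.

     import numpy as np
     from sklearn.cluster import AgglomerativeClustering
     from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

     rng = np.random.default_rng(1)
     X = rng.normal(size=(120, 50))          # stand-in document vectors
     y = rng.integers(0, 6, size=120)        # six flat categories

     # 1) Project documents onto a low-dimensional discriminant space.
     Z = LinearDiscriminantAnalysis(n_components=3).fit_transform(X, y)

     # 2) Represent each flat category by its centroid in that space.
     centroids = np.vstack([Z[y == c].mean(axis=0) for c in range(6)])

     # 3) Cluster the centroids to form intermediate hierarchy levels.
     print(AgglomerativeClustering(n_clusters=2).fit_predict(centroids))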
    Source
    Journal of intelligent information systems. 29(2007) no.2, S.211-230
  4. Pfeffer, M.: Automatische Vergabe von RVK-Notationen mittels fallbasiertem Schließen (2009) 0.02
    0.023433086 = product of:
      0.07029925 = sum of:
        0.059946038 = weight(_text_:forschung in 3051) [ClassicSimilarity], result of:
          0.059946038 = score(doc=3051,freq=2.0), product of:
            0.1858777 = queryWeight, product of:
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.038207654 = queryNorm
            0.32250258 = fieldWeight in 3051, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.046875 = fieldNorm(doc=3051)
        0.010353219 = product of:
          0.031059656 = sum of:
            0.031059656 = weight(_text_:22 in 3051) [ClassicSimilarity], result of:
              0.031059656 = score(doc=3051,freq=2.0), product of:
                0.13379669 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.038207654 = queryNorm
                0.23214069 = fieldWeight in 3051, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3051)
          0.33333334 = coord(1/3)
      0.33333334 = coord(2/6)
    
    Abstract
     The classification of bibliographic units is indispensable for systematic access to a library's holdings and for their shelf arrangement. Until now this task has been carried out manually by subject experts, either individually according to a self-developed classification scheme or cooperatively according to a shared one. This thesis presents a method for automating the classification process. It employs case-based reasoning, a technique developed in artificial intelligence research. For every work for which bibliographic data are available, the method returns one or more candidate classifications. In experiments, the results of the automatic classification are compared with those produced by subject experts. These experiments demonstrate the high quality of the automatic classification and show that the method can significantly relieve subject experts of classification work. Even the nearly complete reclassification of a library catalogue is possible, with certain limitations.
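     A minimal sketch of the case-based idea, assuming scikit-learn and a toy case base with illustrative RVK notations: the most similar already-classified record is retrieved and its notations are reused (the actual system draws on much richer bibliographic data).

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.metrics.pairwise import cosine_similarity

     # Toy case base: classified titles with illustrative RVK notations.
     case_base = {
         "Einführung in die Informatik": ["ST 110"],
         "Lehrbuch der organischen Chemie": ["VK 5010"],
         "Grundlagen der Bibliothekswissenschaft": ["AN 50100"],
     }
     titles = list(case_base)

     vec = TfidfVectorizer()
     X = vec.fit_transform(titles)

     def suggest_notations(new_title):
         # Retrieve the most similar solved case and reuse its notations.
         sims = cosine_similarity(vec.transform([new_title]), X).ravel()
         best = int(sims.argmax())
         return case_base[titles[best]], float(sims[best])

     print(suggest_notations("Informatik für Einsteiger"))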
    Date
    22. 8.2009 19:51:28
  5. Ibekwe-SanJuan, F.; SanJuan, E.: From term variants to research topics (2002) 0.02
    0.021449735 = product of:
      0.064349204 = sum of:
        0.055643205 = weight(_text_:propose in 1853) [ClassicSimilarity], result of:
          0.055643205 = score(doc=1853,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 1853, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1853)
        0.008706 = product of:
          0.026117997 = sum of:
            0.026117997 = weight(_text_:29 in 1853) [ClassicSimilarity], result of:
              0.026117997 = score(doc=1853,freq=2.0), product of:
                0.13440257 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.038207654 = queryNorm
                0.19432661 = fieldWeight in 1853, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1853)
          0.33333334 = coord(1/3)
      0.33333334 = coord(2/6)
    
    Abstract
     In a scientific and technological watch (STW) task, an expert user needs to survey the evolution of research topics in his area of specialisation in order to detect interesting changes. The majority of methods proposing evaluation metrics (bibliometrics and scientometrics studies) for STW rely solely on statistical data analysis methods (co-citation analysis, co-word analysis). Such methods usually work on structured databases where the units of analysis (words, keywords) are already attributed to documents by human indexers. The advent of huge amounts of unstructured textual data has rendered necessary the integration of natural language processing (NLP) techniques to first extract meaningful units from texts. We propose a method for STW which is NLP-oriented. The method not only analyses texts linguistically in order to extract terms from them, but also uses linguistic relations (syntactic variations) as the basis for clustering. Terms and variation relations are formalised as weighted di-graphs which the clustering algorithm, CPCL (Classification by Preferential Clustered Link), will seek to reduce in order to produce classes. These classes ideally represent the research topics present in the corpus. The results of the classification are subjected to validation by an expert in STW.
    Source
    Knowledge organization. 29(2002) nos.3/4, S.181-197
  6. Ma, Z.; Sun, A.; Cong, G.: On predicting the popularity of newly emerging hashtags in Twitter (2013) 0.02
    0.021449735 = product of:
      0.064349204 = sum of:
        0.055643205 = weight(_text_:propose in 967) [ClassicSimilarity], result of:
          0.055643205 = score(doc=967,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 967, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=967)
        0.008706 = product of:
          0.026117997 = sum of:
            0.026117997 = weight(_text_:29 in 967) [ClassicSimilarity], result of:
              0.026117997 = score(doc=967,freq=2.0), product of:
                0.13440257 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.038207654 = queryNorm
                0.19432661 = fieldWeight in 967, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=967)
          0.33333334 = coord(1/3)
      0.33333334 = coord(2/6)
    
    Abstract
     Because of Twitter's popularity and the viral nature of information dissemination on Twitter, predicting which Twitter topics will become popular in the near future becomes a task of considerable economic importance. Many Twitter topics are annotated by hashtags. In this article, we propose methods to predict the popularity of new hashtags on Twitter by formulating the problem as a classification task. We use five standard classification models (i.e., Naïve Bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression) for prediction. The main challenge is the identification of effective features for describing new hashtags. We extract 7 content features from a hashtag string and the collection of tweets containing the hashtag and 11 contextual features from the social graph formed by users who have adopted the hashtag. We conducted experiments on a Twitter data set consisting of 31 million tweets from 2 million Singapore-based users. The experimental results show that the standard classifiers using the extracted features significantly outperform the baseline methods that do not use these features. Among the five classifiers, the logistic regression model performs the best in terms of the Micro-F1 measure. We also observe that contextual features are more effective than content features.
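     A minimal sketch of the classification setup, assuming scikit-learn; the features below are illustrative stand-ins, not the paper's 7 content and 11 contextual features.

     import numpy as np
     from sklearn.linear_model import LogisticRegression

     def features(tag, n_adopters, mean_followers):
         return [len(tag),                          # content: string length
                 sum(ch.isdigit() for ch in tag),   # content: digits in the tag
                 n_adopters,                        # context: early adopters
                 mean_followers]                    # context: their audience size

     X = np.array([features("#win", 120, 900.0),
                   features("#obscuretag2013", 3, 40.0),
                   features("#finale", 80, 650.0),
                   features("#tbd", 2, 15.0)])
     y = np.array([1, 0, 1, 0])                     # 1 = became popular

     clf = LogisticRegression(max_iter=1000).fit(X, y)
     print(clf.predict([features("#newtag", 90, 700.0)]))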
    Date
    25. 6.2013 19:05:29
  7. Wang, J.: ¬An extensive study on automated Dewey Decimal Classification (2009) 0.01
    0.01311523 = product of:
      0.07869138 = sum of:
        0.07869138 = weight(_text_:propose in 3172) [ClassicSimilarity], result of:
          0.07869138 = score(doc=3172,freq=4.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.40112838 = fieldWeight in 3172, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3172)
      0.16666667 = coord(1/6)
    
    Abstract
     In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first analyze statistically the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
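     A minimal sketch of the reshaping step under a strong simplification: classes with too few training documents are folded into their parents, which flattens the hierarchy and evens out the category distribution (class names and the threshold are illustrative, not the paper's algorithm).

     # Toy DDC-like fragment: training-document counts and parent links.
     doc_counts = {"500": 40, "510": 3, "510.1": 1, "520": 25, "530": 2}
     parent = {"500": None, "510": "500", "510.1": "510",
               "520": "500", "530": "500"}
     MIN_DOCS = 5

     def depth(node):
         d = 0
         while parent[node] is not None:
             node, d = parent[node], d + 1
         return d

     # Fold sparse classes into their parents, deepest classes first.
     merged = dict(doc_counts)
     for node in sorted(doc_counts, key=depth, reverse=True):
         if node in merged and merged[node] < MIN_DOCS and parent[node]:
             merged[parent[node]] = merged.get(parent[node], 0) + merged.pop(node)

     print(merged)   # {'500': 46, '520': 25}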
  8. Dang, E.K.F.; Luk, R.W.P.; Ho, K.S.; Chan, S.C.F.; Lee, D.L.: ¬A new measure of clustering effectiveness : algorithms and experimental studies (2008) 0.01
    0.012983414 = product of:
      0.077900484 = sum of:
        0.077900484 = weight(_text_:propose in 1367) [ClassicSimilarity], result of:
          0.077900484 = score(doc=1367,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3970968 = fieldWeight in 1367, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1367)
      0.16666667 = coord(1/6)
    
    Abstract
    We propose a new optimal clustering effectiveness measure, called CS1, based on a combination of clusters rather than selecting a single optimal cluster as in the traditional MK1 measure. For hierarchical clustering, we present an algorithm to compute CS1, defined by seeking the optimal combinations of disjoint clusters obtained by cutting the hierarchical structure at a certain similarity level. By reformulating the optimization to a 0-1 linear fractional programming problem, we demonstrate that an exact solution can be obtained by a linear time algorithm. We further discuss how our approach can be generalized to more general problems involving overlapping clusters, and we show how optimal estimates can be obtained by greedy algorithms.
  9. Walther, R.: Möglichkeiten und Grenzen automatischer Klassifikationen von Web-Dokumenten (2001) 0.01
    0.011656174 = product of:
      0.06993704 = sum of:
        0.06993704 = weight(_text_:forschung in 1562) [ClassicSimilarity], result of:
          0.06993704 = score(doc=1562,freq=2.0), product of:
            0.1858777 = queryWeight, product of:
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.038207654 = queryNorm
            0.376253 = fieldWeight in 1562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1562)
      0.16666667 = coord(1/6)
    
    Abstract
     Automatic classification of web and other text documents makes it possible to provide organized access to internal and external information. Research on automatic classification has intensified in recent years, resulting in various methods that are used in practice today, individually or in combination. Besides general principles, this licentiate thesis examines several methods of automatic classification in more detail and discusses their possibilities and limitations. It also presents the results of a survey among vendors of software solutions for the automatic classification of text documents. The findings serve myax internet AG as a basis for developing its own classification product.
  10. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 1808) [ClassicSimilarity], result of:
          0.06677184 = score(doc=1808,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 1808, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=1808)
      0.16666667 = coord(1/6)
    
    Abstract
     Hierarchical text classification or simply hierarchical classification refers to assigning a document to one or more suitable categories from a hierarchical category space. In our literature survey, we have found that the existing hierarchical classification experiments used a variety of measures to evaluate performance. These performance measures often assume independence between categories and do not consider documents misclassified into categories that are similar or not far from the correct categories in the category tree. In this paper, we therefore propose new performance measures for hierarchical classification. The proposed performance measures consist of category similarity measures and distance-based measures that consider the contributions of misclassified documents. Our experiments on hierarchical classification methods based on SVM classifiers and binary Naive Bayes classifiers showed that SVM classifiers perform better than Naive Bayes classifiers on the Reuters-21578 collection according to the extended measures. A new classifier-centric measure called blocking measure is also defined to examine the performance of subtree classifiers in a top-down level-based hierarchical classification method.
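     A minimal sketch of a distance-based measure under simple assumptions (toy category tree, linear discount by tree distance); the paper's category-similarity and blocking measures are defined differently.

     # Toy category tree given by parent links.
     parent = {"root": None, "sports": "root", "soccer": "sports",
               "tennis": "sports", "economy": "root", "stocks": "economy"}

     def path_to_root(node):
         path = []
         while node is not None:
             path.append(node)
             node = parent[node]
         return path

     def tree_distance(a, b):
         pa, pb = path_to_root(a), path_to_root(b)
         lca = next(n for n in pa if n in pb)   # lowest common ancestor
         return pa.index(lca) + pb.index(lca)

     def distance_based_accuracy(true, pred, max_dist=4):
         # Full credit for exact matches, partial credit for near misses.
         scores = [max(0.0, 1.0 - tree_distance(t, p) / max_dist)
                   for t, p in zip(true, pred)]
         return sum(scores) / len(scores)

     print(distance_based_accuracy(["soccer", "stocks"], ["tennis", "sports"]))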
  11. Wu, K.J.; Chen, M.-C.; Sun, Y.: Automatic topics discovery from hyperlinked documents (2004) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 2563) [ClassicSimilarity], result of:
          0.06677184 = score(doc=2563,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 2563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=2563)
      0.16666667 = coord(1/6)
    
    Abstract
    Topic discovery is an important means for marketing, e-Business and social science studies. As well, it can be applied to various purposes, such as identifying a group with certain properties and observing the emergence and diminishment of a certain cyber community. Previous topic discovery work (J.M. Kleinberg, Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, p. 668) requires manual judgment of usefulness of outcomes and is thus incapable of handling the explosive growth of the Internet. In this paper, we propose the Automatic Topic Discovery (ATD) method, which combines a method of base set construction, a clustering algorithm and an iterative principal eigenvector computation method to discover the topics relevant to a given query without using manual examination. Given a query, ATD returns with topics associated with the query and top representative pages for each topic. Our experiments show that the ATD method performs better than the traditional eigenvector method in terms of computation time and topic discovery quality.
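     A minimal sketch of the iterative principal-eigenvector step on a toy link matrix, assuming NumPy and HITS-style authority scoring; the base-set construction and clustering steps of ATD are omitted.

     import numpy as np

     # Toy hyperlink adjacency matrix (row i links to column j).
     A = np.array([[0, 1, 1, 0],
                   [0, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 0, 1, 0]], dtype=float)

     def principal_eigenvector(M, iters=100, tol=1e-10):
         v = np.ones(M.shape[0]) / M.shape[0]
         for _ in range(iters):
             nxt = M @ v                    # power iteration step
             nxt /= np.linalg.norm(nxt)
             if np.linalg.norm(nxt - v) < tol:
                 break
             v = nxt
         return v

     # Authority-style scores: principal eigenvector of A^T A.
     print(principal_eigenvector(A.T @ A))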
  12. Prabowo, R.; Jackson, M.; Burden, P.; Knoell, H.-D.: Ontology-based automatic classification for the Web pages : design, implementation and evaluation (2002) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 3383) [ClassicSimilarity], result of:
          0.06677184 = score(doc=3383,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 3383, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=3383)
      0.16666667 = coord(1/6)
    
    Abstract
     In recent years, we have witnessed the continual growth in the use of ontologies in order to provide a mechanism to enable machine reasoning. This paper describes an automatic classifier, which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion, and mapped into DDC and LCC. Secondly, we propose the formal definition of a DDC-LCC and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, an experiment in which the accuracy of the classifier was evaluated is presented. The experiment shows that our approach results in an improved classification in terms of accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.
  13. Duwairi, R.M.: Machine learning for Arabic text categorization (2006) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 5115) [ClassicSimilarity], result of:
          0.06677184 = score(doc=5115,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 5115, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=5115)
      0.16666667 = coord(1/6)
    
    Abstract
    In this article we propose a distance-based classifier for categorizing Arabic text. Each category is represented as a vector of words in an m-dimensional space, and documents are classified on the basis of their closeness to feature vectors of categories. The classifier, in its learning phase, scans the set of training documents to extract features of categories that capture inherent category-specific properties; in its testing phase the classifier uses previously determined category-specific features to categorize unclassified documents. Stemming was used to reduce the dimensionality of feature vectors of documents. The accuracy of the classifier was tested by carrying out several categorization tasks on an in-house collected Arabic corpus. The results show that the proposed classifier is very accurate and robust.
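     A minimal sketch of the distance-based classifier, assuming scikit-learn: each category is represented by the centroid of its training vectors and a document is assigned to the closest centroid by cosine similarity (the Arabic-specific stemming step is omitted).

     import numpy as np
     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.metrics.pairwise import cosine_similarity

     train = [("markets and stocks fell", "economy"),
              ("shares rose on strong earnings", "economy"),
              ("the team scored late to win", "sports"),
              ("the coach praised the players", "sports")]
     texts, labels = zip(*train)

     vec = TfidfVectorizer()
     X = vec.fit_transform(texts).toarray()

     # Each category's feature vector = mean of its training vectors.
     classes = sorted(set(labels))
     centroids = np.vstack([X[[lab == c for lab in labels]].mean(axis=0)
                            for c in classes])

     def classify(doc):
         sims = cosine_similarity(vec.transform([doc]), centroids).ravel()
         return classes[int(sims.argmax())]

     print(classify("stocks and shares slipped"))   # closest centroid: economy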
  14. Yoon, Y.; Lee, G.G.: Efficient implementation of associative classifiers for document classification (2007) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 909) [ClassicSimilarity], result of:
          0.06677184 = score(doc=909,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 909, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=909)
      0.16666667 = coord(1/6)
    
    Abstract
    In practical text classification tasks, the ability to interpret the classification result is as important as the ability to classify exactly. Associative classifiers have many favorable characteristics such as rapid training, good classification accuracy, and excellent interpretation. However, associative classifiers also have some obstacles to overcome when they are applied in the area of text classification. The target text collection generally has a very high dimension, thus the training process might take a very long time. We propose a feature selection based on the mutual information between the word and class variables to reduce the space dimension of the associative classifiers. In addition, the training process of the associative classifier produces a huge amount of classification rules, which makes the prediction with a new document ineffective. We resolve this by introducing a new efficient method for storing and pruning classification rules. This method can also be used when predicting a test document. Experimental results using the 20-newsgroups dataset show many benefits of the associative classification in both training and predicting when applied to a real world problem.
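     A minimal sketch of the mutual-information feature selection step over a binary word-presence model (toy data; the rule generation and pruning parts of the associative classifier are not shown).

     import math

     docs = [("stocks fell again", "economy"),
             ("shares and stocks rose", "economy"),
             ("the team won the final", "sports"),
             ("players celebrate the win", "sports")]
     N = len(docs)
     vocab = sorted({w for text, _ in docs for w in text.split()})
     classes = {c for _, c in docs}

     def mutual_information(word):
         mi = 0.0
         for present in (False, True):
             for c in classes:
                 n_wc = sum((word in t.split()) == present and lab == c
                            for t, lab in docs)
                 n_w = sum((word in t.split()) == present for t, _ in docs)
                 n_c = sum(lab == c for _, lab in docs)
                 if n_wc:
                     mi += (n_wc / N) * math.log2(N * n_wc / (n_w * n_c))
         return mi

     # Keep the words most informative about the class variable.
     print(sorted(vocab, key=mutual_information, reverse=True)[:5])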
  15. Denoyer, L.; Gallinari, P.: Bayesian network model for semi-structured document classification (2004) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 995) [ClassicSimilarity], result of:
          0.06677184 = score(doc=995,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 995, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=995)
      0.16666667 = coord(1/6)
    
    Abstract
     Recently, a new community has started to emerge around the development of new information research methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of information content (text, image, etc.). We consider here the task of structured document classification. We propose a generative model able to handle both structure and content which is based on Bayesian networks. We then show how to transform this generative model into a discriminant classifier using the method of the Fisher kernel. The model is then extended for dealing with different types of content information (here text and images). The model was tested on three databases: the classical WebKB corpus composed of HTML pages, the new INEX corpus which has become a reference in the field of ad-hoc retrieval for XML documents, and a multimedia corpus of Web pages.
  16. Ko, Y.; Seo, J.: Text classification from unlabeled documents with bootstrapping and feature projection techniques (2009) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 2452) [ClassicSimilarity], result of:
          0.06677184 = score(doc=2452,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 2452, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=2452)
      0.16666667 = coord(1/6)
    
    Abstract
     Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, the supervised learning approaches have some problems. The most notable problem is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficult to generate because the labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then it automatically learns a text classifier by using bootstrapping and feature projection techniques. The results of experiments showed that the proposed method achieved reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.
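     A minimal sketch of the bootstrapping idea, assuming scikit-learn: documents containing a category's title word become seed examples, a classifier is trained on the seeds, and the remaining unlabeled documents are then labeled (the feature-projection step is omitted).

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.naive_bayes import MultinomialNB

     title_words = {"economy": "stocks", "sports": "football"}
     unlabeled = ["stocks slid after the report",
                  "football fans filled the stadium",
                  "the index of stocks recovered",
                  "a late goal decided the football match",
                  "analysts expect volatile trading",
                  "the coach changed the lineup"]

     # 1) Seed labels: documents containing a category's title word.
     seeds = [(d, c) for d in unlabeled
              for c, w in title_words.items() if w in d]

     vec = TfidfVectorizer()
     clf = MultinomialNB().fit(vec.fit_transform([d for d, _ in seeds]),
                               [c for _, c in seeds])

     # 2) Label the documents that matched no title word.
     rest = [d for d in unlabeled
             if all(w not in d for w in title_words.values())]
     print(list(zip(rest, clf.predict(vec.transform(rest)))))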
  17. Xu, Y.; Bernard, A.: Knowledge organization through statistical computation : a new approach (2009) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 3252) [ClassicSimilarity], result of:
          0.06677184 = score(doc=3252,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 3252, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=3252)
      0.16666667 = coord(1/6)
    
    Abstract
     Knowledge organization (KO) is an interdisciplinary issue which includes some problems in knowledge classification, such as how to classify newly emerged knowledge. Given the great complexity and ambiguity of knowledge, it sometimes becomes inefficient to classify knowledge by logical reasoning. This paper proposes a statistical approach to knowledge organization in order to resolve the problems of classifying complex and massive knowledge. By integrating the classification process into a mathematical model, a knowledge classifier, based on the maximum entropy theory, is constructed, and the experimental results show that the classification results acquired from the classifier are reliable. The approach proposed in this paper is quite formal and is not dependent on specific contexts, so it could easily be adapted to the use of knowledge classification in other domains within KO.
  18. Malo, P.; Sinha, A.; Wallenius, J.; Korhonen, P.: Concept-based document classification using Wikipedia and value function (2011) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 4948) [ClassicSimilarity], result of:
          0.06677184 = score(doc=4948,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 4948, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=4948)
      0.16666667 = coord(1/6)
    
    Abstract
    In this article, we propose a new concept-based method for document classification. The conceptual knowledge associated with the words is drawn from Wikipedia. The purpose is to utilize the abundant semantic relatedness information available in Wikipedia in an efficient value function-based query learning algorithm. The procedure learns the value function by solving a simple linear programming problem formulated using the training documents. The learning involves a step-wise iterative process that helps in generating a value function with an appropriate set of concepts (dimensions) chosen from a collection of concepts. Once the value function is formulated, it is utilized to make a decision between relevance and irrelevance. The value assigned to a particular document from the value function can be further used to rank the documents according to their relevance. Reuters newswire documents have been used to evaluate the efficacy of the procedure. An extensive comparison with other frameworks has been performed. The results are promising.
  19. Schaalje, G.B.; Blades, N.J.; Funai, T.: ¬An open-set size-adjusted Bayesian classifier for authorship attribution (2013) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 1041) [ClassicSimilarity], result of:
          0.06677184 = score(doc=1041,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 1041, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=1041)
      0.16666667 = coord(1/6)
    
    Abstract
    Recent studies of authorship attribution have used machine-learning methods including regularized multinomial logistic regression, neural nets, support vector machines, and the nearest shrunken centroid classifier to identify likely authors of disputed texts. These methods are all limited by an inability to perform open-set classification and account for text and corpus size. We propose a customized Bayesian logit-normal-beta-binomial classification model for supervised authorship attribution. The model is based on the beta-binomial distribution with an explicit inverse relationship between extra-binomial variation and text size. The model internally estimates the relationship of extra-binomial variation to text size, and uses Markov Chain Monte Carlo (MCMC) to produce distributions of posterior authorship probabilities instead of point estimates. We illustrate the method by training the machine-learning methods as well as the open-set Bayesian classifier on undisputed papers of The Federalist, and testing the method on documents historically attributed to Alexander Hamilton, John Jay, and James Madison. The Bayesian classifier was the best classifier of these texts.
  20. Koch, T.; Vizine-Goetz, D.: DDC and knowledge organization in the digital library : Research and development. Demonstration pages (1999) 0.01
    0.009991007 = product of:
      0.059946038 = sum of:
        0.059946038 = weight(_text_:forschung in 942) [ClassicSimilarity], result of:
          0.059946038 = score(doc=942,freq=2.0), product of:
            0.1858777 = queryWeight, product of:
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.038207654 = queryNorm
            0.32250258 = fieldWeight in 942, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.046875 = fieldNorm(doc=942)
      0.16666667 = coord(1/6)
    
    Abstract
     The workshop gives an insight into current research and development on knowledge organization in digital libraries. Diane Vizine-Goetz of the OCLC Office of Research in Dublin, Ohio, presents OCLC's research projects on adapting and further developing the Dewey Decimal Classification as a knowledge organization instrument for large digital document collections. Traugott Koch, NetLab, Universität Lund, Sweden, demonstrates the approaches and solutions of the EU project DESIRE for applying intellectual and, above all, automatic classification in subject information services on the Internet.

Languages

  • e 48
  • d 9

Types

  • a 50
  • el 5
  • x 3