Search (69 results, page 1 of 4)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.36

0.36031273 = product of:
  0.6305472 = sum of:
    0.047687992 = product of:
      0.14306398 = sum of:
        0.14306398 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.14306398 = score(doc=562,freq=2.0), product of:
            0.25455406 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03002521 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.14306398 = score(doc=562,freq=2.0), product of:
        0.25455406 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.03002521 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.03496567 = weight(_text_:classification in 562) [ClassicSimilarity], result of:
      0.03496567 = score(doc=562,freq=6.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.3656675 = fieldWeight in 562, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.07153199 = product of:
      0.14306398 = sum of:
        0.14306398 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.14306398 = score(doc=562,freq=2.0), product of:
            0.25455406 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03002521 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
    0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.14306398 = score(doc=562,freq=2.0), product of:
        0.25455406 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.03002521 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.03496567 = weight(_text_:classification in 562) [ClassicSimilarity], result of:
      0.03496567 = score(doc=562,freq=6.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.3656675 = fieldWeight in 562, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.14306398 = score(doc=562,freq=2.0), product of:
        0.25455406 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.03002521 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.0122040035 = product of:
      0.024408007 = sum of:
        0.024408007 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.024408007 = score(doc=562,freq=2.0), product of:
            0.10514317 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03002521 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
  0.5714286 = coord(8/14)

Abstract: Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well known text corpora support our approach through consistent improvement of the results.
Content: Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32
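
The per-term figures in the explanation above can be reproduced by hand. The following is a minimal sketch, assuming the standard ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are illustrative and not part of any Lucene API. The inputs are taken from the `_text_:3a` clause of doc 562.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, as ClassicSimilarity defines it."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """score = queryWeight * fieldWeight, mirroring the explain tree."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm            # idf * queryNorm
    field_weight = tf(freq) * term_idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


# Values copied from the explanation for term "3a" in doc 562.
score_3a = term_score(
    freq=2.0,
    doc_freq=24,
    max_docs=44218,
    query_norm=0.03002521,
    field_norm=0.046875,
)
print(f"{score_3a:.5f}")
```

Lucene computes these values in 32-bit floats, so the last digits may differ slightly from a double-precision calculation.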

Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.09

0.085803576 = product of:
  0.24025 = sum of:
    0.06661515 = weight(_text_:classification in 2560) [ClassicSimilarity], result of:
      0.06661515 = score(doc=2560,freq=16.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.69665456 = fieldWeight in 2560, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2560)
    0.057587784 = product of:
      0.11517557 = sum of:
        0.11517557 = weight(_text_:schemes in 2560) [ClassicSimilarity], result of:
          0.11517557 = score(doc=2560,freq=6.0), product of:
            0.16067243 = queryWeight, product of:
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.03002521 = queryNorm
            0.71683466 = fieldWeight in 2560, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
      0.5 = coord(1/2)
    0.035193928 = weight(_text_:bibliographic in 2560) [ClassicSimilarity], result of:
      0.035193928 = score(doc=2560,freq=2.0), product of:
        0.11688946 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.03002521 = queryNorm
        0.30108726 = fieldWeight in 2560, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2560)
    0.06661515 = weight(_text_:classification in 2560) [ClassicSimilarity], result of:
      0.06661515 = score(doc=2560,freq=16.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.69665456 = fieldWeight in 2560, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2560)
    0.014238005 = product of:
      0.02847601 = sum of:
        0.02847601 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
          0.02847601 = score(doc=2560,freq=2.0), product of:
            0.10514317 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03002521 = queryNorm
            0.2708308 = fieldWeight in 2560, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
      0.5 = coord(1/2)
  0.35714287 = coord(5/14)

Abstract: The proliferation of digital resources and their integration into a traditional library setting has created a pressing need for an automated tool that organizes textual information based on library classification schemes. Automated text classification is a research field of developing tools, methods, and models to automate text classification. This article describes the current popular approach for text classification and major text classification projects and applications that are based on library classification schemes. Related issues and challenges are discussed, and a number of considerations for the challenges are examined.
Date: 22. 9.2008 18:31:54
Source: International cataloguing and bibliographic control. 36(2007) no.4, pp. 78-82
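
The top-level score of a record is the sum of its matching clause scores multiplied by a coordination factor. The sketch below recomputes the 0.0858 shown for doc 2560, assuming coord(m/n) is simply m divided by n; `coord_sum` is a hypothetical helper name, and the clause values are copied from the explanation above.

```python
def coord_sum(clause_scores, matched, total):
    """Sum the clause scores, then scale by coord(matched/total)."""
    return sum(clause_scores) * (matched / total)


# Nested clauses ("schemes", "22") each carry their own coord(1/2).
schemes = coord_sum([0.11517557], matched=1, total=2)
term_22 = coord_sum([0.02847601], matched=1, total=2)

# Top level for doc 2560: five matching clauses out of fourteen.
doc_2560 = coord_sum(
    [
        0.06661515,   # classification
        schemes,      # schemes, after coord(1/2)
        0.035193928,  # bibliographic
        0.06661515,   # classification (second clause)
        term_22,      # 22, after coord(1/2)
    ],
    matched=5,
    total=14,
)
print(f"{doc_2560:.4f}")
```

This shows why a record matching few of the 14 query clauses ranks low even when its individual term weights are high.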

Wang, J.: An extensive study on automated Dewey Decimal Classification (2009) 0.05

0.0483401 = product of:
  0.16919035 = sum of:
    0.04758225 = weight(_text_:classification in 3172) [ClassicSimilarity], result of:
      0.04758225 = score(doc=3172,freq=16.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.49761042 = fieldWeight in 3172, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3172)
    0.0237488 = product of:
      0.0474976 = sum of:
        0.0474976 = weight(_text_:schemes in 3172) [ClassicSimilarity], result of:
          0.0474976 = score(doc=3172,freq=2.0), product of:
            0.16067243 = queryWeight, product of:
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.03002521 = queryNorm
            0.2956176 = fieldWeight in 3172, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3172)
      0.5 = coord(1/2)
    0.05027704 = weight(_text_:bibliographic in 3172) [ClassicSimilarity], result of:
      0.05027704 = score(doc=3172,freq=8.0), product of:
        0.11688946 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.03002521 = queryNorm
        0.43012467 = fieldWeight in 3172, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3172)
    0.04758225 = weight(_text_:classification in 3172) [ClassicSimilarity], result of:
      0.04758225 = score(doc=3172,freq=16.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.49761042 = fieldWeight in 3172, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3172)
  0.2857143 = coord(4/14)

Abstract: In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first analyze statistically the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.

Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.05

0.047508016 = product of:
  0.16627805 = sum of:
    0.021217827 = weight(_text_:subject in 3614) [ClassicSimilarity], result of:
      0.021217827 = score(doc=3614,freq=2.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.19758089 = fieldWeight in 3614, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3614)
    0.06065571 = weight(_text_:classification in 3614) [ClassicSimilarity], result of:
      0.06065571 = score(doc=3614,freq=26.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.63433135 = fieldWeight in 3614, product of:
          5.0990195 = tf(freq=26.0), with freq of:
            26.0 = termFreq=26.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3614)
    0.0237488 = product of:
      0.0474976 = sum of:
        0.0474976 = weight(_text_:schemes in 3614) [ClassicSimilarity], result of:
          0.0474976 = score(doc=3614,freq=2.0), product of:
            0.16067243 = queryWeight, product of:
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.03002521 = queryNorm
            0.2956176 = fieldWeight in 3614, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
      0.5 = coord(1/2)
    0.06065571 = weight(_text_:classification in 3614) [ClassicSimilarity], result of:
      0.06065571 = score(doc=3614,freq=26.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.63433135 = fieldWeight in 3614, product of:
          5.0990195 = tf(freq=26.0), with freq of:
            26.0 = termFreq=26.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3614)
  0.2857143 = coord(4/14)

Abstract: Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification systems for browsing. The classification algorithm was evaluated by the users who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success of browsing showed to be correlated and dependent on classification correctness. Research limitations/implications - Further research should address problems of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from start; allowing for searching for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesauri terms); when searching for class captions, returning the hierarchical tree expanded around the class in which caption the search term is found. The need for improvements of classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
Object: Engineering Index Classification

Reiner, U.: DDC-based search in the data of the German National Bibliography (2008) 0.04

0.041978236 = product of:
  0.14692383 = sum of:
    0.036007844 = weight(_text_:subject in 2166) [ClassicSimilarity], result of:
      0.036007844 = score(doc=2166,freq=4.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.33530587 = fieldWeight in 2166, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.046875 = fieldNorm(doc=2166)
    0.04037488 = weight(_text_:classification in 2166) [ClassicSimilarity], result of:
      0.04037488 = score(doc=2166,freq=8.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.42223644 = fieldWeight in 2166, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=2166)
    0.030166224 = weight(_text_:bibliographic in 2166) [ClassicSimilarity], result of:
      0.030166224 = score(doc=2166,freq=2.0), product of:
        0.11688946 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.03002521 = queryNorm
        0.2580748 = fieldWeight in 2166, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.046875 = fieldNorm(doc=2166)
    0.04037488 = weight(_text_:classification in 2166) [ClassicSimilarity], result of:
      0.04037488 = score(doc=2166,freq=8.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.42223644 = fieldWeight in 2166, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=2166)
  0.2857143 = coord(4/14)

Abstract: In 2004, the German National Library began to classify title records of the German National Bibliography according to subject groups based on the divisions of the Dewey Decimal Classification (DDC). Since 2006, all titles of the main series of the German National Bibliography are classified in strict compliance with the DDC. On this basis, an enhanced DDC-based search can be realized - e.g., searching the data of the German National Bibliography for title records using number components of synthesized classification numbers or searching for DDC numbers using unclassified title records. This paper gives an account of the current research and development of the DDC-based search. The work is conducted in the VZG project Colibri that focuses on the automatic analysis of DDC-synthesized numbers and the automatic classification of bibliographic title records.
Source: New perspectives on subject indexing and classification: essays in honour of Magda Heiner-Freiling. Ed.: K. Knull-Schlomann et al.

Yi, K.: Challenges in automated classification using library classification schemes (2006) 0.04

0.039771687 = product of:
  0.1856012 = sum of:
    0.0659319 = weight(_text_:classification in 5810) [ClassicSimilarity], result of:
      0.0659319 = score(doc=5810,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.6895092 = fieldWeight in 5810, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0625 = fieldNorm(doc=5810)
    0.053737402 = product of:
      0.107474804 = sum of:
        0.107474804 = weight(_text_:schemes in 5810) [ClassicSimilarity], result of:
          0.107474804 = score(doc=5810,freq=4.0), product of:
            0.16067243 = queryWeight, product of:
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.03002521 = queryNorm
            0.66890633 = fieldWeight in 5810, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.0625 = fieldNorm(doc=5810)
      0.5 = coord(1/2)
    0.0659319 = weight(_text_:classification in 5810) [ClassicSimilarity], result of:
      0.0659319 = score(doc=5810,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.6895092 = fieldWeight in 5810, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0625 = fieldNorm(doc=5810)
  0.21428572 = coord(3/14)

Abstract: A major library classification scheme has long been standard classification framework for information sources in traditional library environment, and text classification (TC) becomes a popular and attractive tool of organizing digital information. This paper gives an overview of previous projects and studies on TC using major library classification schemes, and summarizes a discussion of TC research challenges.

Koch, T.; Ardö, A.: Automatic classification of full-text HTML-documents from one specific subject area : DESIRE II D3.6a, Working Paper 2 (2000) 0.04

0.038544483 = product of:
  0.17987426 = sum of:
    0.048010457 = weight(_text_:subject in 1667) [ClassicSimilarity], result of:
      0.048010457 = score(doc=1667,freq=4.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.4470745 = fieldWeight in 1667, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.0625 = fieldNorm(doc=1667)
    0.0659319 = weight(_text_:classification in 1667) [ClassicSimilarity], result of:
      0.0659319 = score(doc=1667,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.6895092 = fieldWeight in 1667, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0625 = fieldNorm(doc=1667)
    0.0659319 = weight(_text_:classification in 1667) [ClassicSimilarity], result of:
      0.0659319 = score(doc=1667,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.6895092 = fieldWeight in 1667, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0625 = fieldNorm(doc=1667)
  0.21428572 = coord(3/14)

Content: 1 Introduction / 2 Method overview / 3 Ei thesaurus preprocessing / 4 Automatic classification process: 4.1 Matching -- 4.2 Weighting -- 4.3 Preparation for display / 5 Results of the classification process / 6 Evaluations / 7 Software / 8 Other applications / 9 Experiments with universal classification systems / References / Appendix A: Ei classification service: Software / Appendix B: Use of the classification software as subject filter in a WWW harvester.

Golub, K.; Hamon, T.; Ardö, A.: Automated classification of textual documents based on a controlled vocabulary in engineering (2007) 0.04

0.03848849 = product of:
  0.1347097 = sum of:
    0.02546139 = weight(_text_:subject in 1461) [ClassicSimilarity], result of:
      0.02546139 = score(doc=1461,freq=2.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.23709705 = fieldWeight in 1461, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.046875 = fieldNorm(doc=1461)
    0.04037488 = weight(_text_:classification in 1461) [ClassicSimilarity], result of:
      0.04037488 = score(doc=1461,freq=8.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.42223644 = fieldWeight in 1461, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=1461)
    0.02849856 = product of:
      0.05699712 = sum of:
        0.05699712 = weight(_text_:schemes in 1461) [ClassicSimilarity], result of:
          0.05699712 = score(doc=1461,freq=2.0), product of:
            0.16067243 = queryWeight, product of:
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.03002521 = queryNorm
            0.35474116 = fieldWeight in 1461, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.046875 = fieldNorm(doc=1461)
      0.5 = coord(1/2)
    0.04037488 = weight(_text_:classification in 1461) [ClassicSimilarity], result of:
      0.04037488 = score(doc=1461,freq=8.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.42223644 = fieldWeight in 1461, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=1461)
  0.2857143 = coord(4/14)

Abstract: Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents - instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.

Automatic classification research at OCLC (2002) 0.04

0.03687797 = product of:
  0.12907289 = sum of:
    0.04079328 = weight(_text_:classification in 1563) [ClassicSimilarity], result of:
      0.04079328 = score(doc=1563,freq=6.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.42661208 = fieldWeight in 1563, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1563)
    0.03324832 = product of:
      0.06649664 = sum of:
        0.06649664 = weight(_text_:schemes in 1563) [ClassicSimilarity], result of:
          0.06649664 = score(doc=1563,freq=2.0), product of:
            0.16067243 = queryWeight, product of:
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.03002521 = queryNorm
            0.41386467 = fieldWeight in 1563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1563)
      0.5 = coord(1/2)
    0.04079328 = weight(_text_:classification in 1563) [ClassicSimilarity], result of:
      0.04079328 = score(doc=1563,freq=6.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.42661208 = fieldWeight in 1563, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1563)
    0.014238005 = product of:
      0.02847601 = sum of:
        0.02847601 = weight(_text_:22 in 1563) [ClassicSimilarity], result of:
          0.02847601 = score(doc=1563,freq=2.0), product of:
            0.10514317 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03002521 = queryNorm
            0.2708308 = fieldWeight in 1563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1563)
      0.5 = coord(1/2)
  0.2857143 = coord(4/14)

Abstract: OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged and classified. Accordingly, OCLC has developed products, sponsored research projects, and encouraged the participation in international standards communities whose outcome has been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification requires expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
Date: 5. 5.2003 9:22:09

Godby, C.J.; Stuler, J.: The Library of Congress Classification as a knowledge base for automatic subject categorization : subject access issues (2003) 0.03

0.032918133 = product of:
  0.15361795 = sum of:
    0.059409913 = weight(_text_:subject in 3962) [ClassicSimilarity], result of:
      0.059409913 = score(doc=3962,freq=8.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.5532265 = fieldWeight in 3962, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.0546875 = fieldNorm(doc=3962)
    0.047104023 = weight(_text_:classification in 3962) [ClassicSimilarity], result of:
      0.047104023 = score(doc=3962,freq=8.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.49260917 = fieldWeight in 3962, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=3962)
    0.047104023 = weight(_text_:classification in 3962) [ClassicSimilarity], result of:
      0.047104023 = score(doc=3962,freq=8.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.49260917 = fieldWeight in 3962, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=3962)
  0.21428572 = coord(3/14)
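The arithmetic in the listing above can be reproduced by hand. The following is a minimal sketch based on a reading of the printed values, not on Lucene source code: it assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), both of which agree with the figures shown, and it takes queryNorm and fieldNorm as given. The function names are hypothetical.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed formula; it reproduces 3.576596 for docFreq=3361, maxDocs=44218.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as laid out in the explain tree.
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm
    field_weight = tf * term_idf * field_norm
    return query_weight * field_weight

# Constants copied from the listing for doc 3962.
MAX_DOCS = 44218
QUERY_NORM = 0.03002521
FIELD_NORM = 0.0546875

subject = term_score(8.0, 3361, MAX_DOCS, QUERY_NORM, FIELD_NORM)
classification = term_score(8.0, 4974, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# The "classification" clause appears twice in the sum; coord(3/14) scales the total.
total = (subject + 2 * classification) * (3 / 14)
print(subject, classification, total)
```

Small differences in the last digits are to be expected, since the listing appears to use single-precision arithmetic.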

Abstract: This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
Source: Subject retrieval in a networked environment: Proceedings of the IFLA Satellite Meeting held in Dublin, OH, 14-16 August 2001 and sponsored by the IFLA Classification and Indexing Section, the IFLA Information Technology Section and OCLC. Ed.: I.C. McIlwaine

Godby, C. J.; Stuler, J.: The Library of Congress Classification as a knowledge base for automatic subject categorization (2001) 0.03

0.032580506 = product of:
  0.15204236 = sum of:
    0.058800567 = weight(_text_:subject in 1567) [ClassicSimilarity], result of:
      0.058800567 = score(doc=1567,freq=6.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.5475522 = fieldWeight in 1567, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.0625 = fieldNorm(doc=1567)
    0.046620894 = weight(_text_:classification in 1567) [ClassicSimilarity], result of:
      0.046620894 = score(doc=1567,freq=6.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.48755667 = fieldWeight in 1567, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0625 = fieldNorm(doc=1567)
    0.046620894 = weight(_text_:classification in 1567) [ClassicSimilarity], result of:
      0.046620894 = score(doc=1567,freq=6.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.48755667 = fieldWeight in 1567, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0625 = fieldNorm(doc=1567)
  0.21428572 = coord(3/14)

Abstract: This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
Footnote: Paper, IFLA Preconference "Subject Retrieval in a Networked Environment", Dublin, OH, August 2001.

Chung, Y.-M.; Noh, Y.-H.: Developing a specialized directory system by automatically classifying Web documents (2003) 0.03

0.03025795 = product of:
  0.14120376 = sum of:
    0.05092278 = weight(_text_:subject in 1566) [ClassicSimilarity], result of:
      0.05092278 = score(doc=1566,freq=8.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.4741941 = fieldWeight in 1566, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.046875 = fieldNorm(doc=1566)
    0.045140486 = weight(_text_:classification in 1566) [ClassicSimilarity], result of:
      0.045140486 = score(doc=1566,freq=10.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.4720747 = fieldWeight in 1566, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=1566)
    0.045140486 = weight(_text_:classification in 1566) [ClassicSimilarity], result of:
      0.045140486 = score(doc=1566,freq=10.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.4720747 = fieldWeight in 1566, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=1566)
  0.21428572 = coord(3/14)

Abstract: This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77 and 60%, respectively. The third experiment employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting achieved a precision ratio of 96%. This implies that it is possible to enhance the classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier.

Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.03

0.027775463 = product of:
  0.12961882 = sum of:
    0.057690408 = weight(_text_:classification in 5273) [ClassicSimilarity], result of:
      0.057690408 = score(doc=5273,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.60332054 = fieldWeight in 5273, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5273)
    0.057690408 = weight(_text_:classification in 5273) [ClassicSimilarity], result of:
      0.057690408 = score(doc=5273,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.60332054 = fieldWeight in 5273, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5273)
    0.014238005 = product of:
      0.02847601 = sum of:
        0.02847601 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
          0.02847601 = score(doc=5273,freq=2.0), product of:
            0.10514317 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03002521 = queryNorm
            0.2708308 = fieldWeight in 5273, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5273)
      0.5 = coord(1/2)
  0.21428572 = coord(3/14)
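This listing nests one product inside the outer sum: the clause for the term "22" is first scaled by its own coord(1/2) and only then added to the two "classification" clauses. A minimal sketch of that combination follows, using leaf values copied from the listing; the variable and function names are hypothetical.

```python
def coord(matched, total):
    # Fraction of clauses that matched, as printed in coord(m/n) lines.
    return matched / total

# Leaf scores copied from the explain listing for doc 5273.
classification_clause = 0.057690408  # appears twice in the outer sum
term_22_weight = 0.02847601          # the only matching clause of the inner sum

inner = term_22_weight * coord(1, 2)             # 0.014238005 in the listing
outer_sum = 2 * classification_clause + inner    # 0.12961882 in the listing
score = outer_sum * coord(3, 14)                 # 0.027775463 in the listing
print(inner, outer_sum, score)
```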

Abstract: In text categorization tasks, classification on some class hierarchies has better results than in cases without the hierarchy. Currently, because a large number of documents are divided into several subgroups in a hierarchy, we can appropriately use a hierarchical classification method. However, we have no systematic method to build a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
Date: 22. 7.2006 16:24:52

Prabowo, R.; Jackson, M.; Burden, P.; Knoell, H.-D.: Ontology-based automatic classification for the Web pages : design, implementation and evaluation (2002) 0.03

0.027299229 = product of:
  0.1273964 = sum of:
    0.049448926 = weight(_text_:classification in 3383) [ClassicSimilarity], result of:
      0.049448926 = score(doc=3383,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.5171319 = fieldWeight in 3383, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=3383)
    0.02849856 = product of:
      0.05699712 = sum of:
        0.05699712 = weight(_text_:schemes in 3383) [ClassicSimilarity], result of:
          0.05699712 = score(doc=3383,freq=2.0), product of:
            0.16067243 = queryWeight, product of:
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.03002521 = queryNorm
            0.35474116 = fieldWeight in 3383, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.3512506 = idf(docFreq=569, maxDocs=44218)
              0.046875 = fieldNorm(doc=3383)
      0.5 = coord(1/2)
    0.049448926 = weight(_text_:classification in 3383) [ClassicSimilarity], result of:
      0.049448926 = score(doc=3383,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.5171319 = fieldWeight in 3383, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=3383)
  0.21428572 = coord(3/14)

Abstract: In recent years, we have witnessed the continual growth in the use of ontologies in order to provide a mechanism to enable machine reasoning. This paper describes an automatic classifier, which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion, and mapped into DDC and LCC. Secondly, we propose the formal definition of a DDC-LCC and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, an experiment in which the accuracy of the classifier was evaluated is presented. The experiment shows that our approach results in an improved classification in terms of accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.

Golub, K.: Automated subject classification of textual Web pages, based on a controlled vocabulary : challenges and recommendations (2006) 0.03

0.025019486 = product of:
  0.1167576 = sum of:
    0.036007844 = weight(_text_:subject in 5897) [ClassicSimilarity], result of:
      0.036007844 = score(doc=5897,freq=4.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.33530587 = fieldWeight in 5897, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.046875 = fieldNorm(doc=5897)
    0.04037488 = weight(_text_:classification in 5897) [ClassicSimilarity], result of:
      0.04037488 = score(doc=5897,freq=8.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.42223644 = fieldWeight in 5897, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=5897)
    0.04037488 = weight(_text_:classification in 5897) [ClassicSimilarity], result of:
      0.04037488 = score(doc=5897,freq=8.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.42223644 = fieldWeight in 5897, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=5897)
  0.21428572 = coord(3/14)

Abstract: The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.
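The string-to-string matching mentioned in this abstract can be illustrated with a small sketch. This is not the study's implementation: the term list, class labels, and scoring rule (a plain count of whole-word matches per class) are assumptions made for illustration only, and none of the entries are taken from the Ei thesaurus.

```python
import re
from collections import Counter

# Hypothetical term list mapping terms to class labels.
TERM_LIST = {
    "heat transfer": "Thermodynamics",
    "turbine": "Mechanical engineering",
    "concrete": "Civil engineering",
}

def classify(text, term_list):
    """Count case-insensitive whole-word matches of each term, tallied per class."""
    lowered = text.lower()
    scores = Counter()
    for term, label in term_list.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            scores[label] += hits
    # Classes ordered from most to fewest matches.
    return scores.most_common()

page = "Turbine blades and heat transfer: cooling a gas turbine."
ranked = classify(page, TERM_LIST)
print(ranked)
```

A matcher this simple misses word variants and ambiguous terms, which is the kind of term-list problem the abstract says the study examines.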

Sebastiani, F.: Classification of text, automatic (2006) 0.02

0.024849901 = product of:
  0.1159662 = sum of:
    0.033307575 = weight(_text_:classification in 5003) [ClassicSimilarity], result of:
      0.033307575 = score(doc=5003,freq=4.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.34832728 = fieldWeight in 5003, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5003)
    0.033307575 = weight(_text_:classification in 5003) [ClassicSimilarity], result of:
      0.033307575 = score(doc=5003,freq=4.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.34832728 = fieldWeight in 5003, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5003)
    0.04935105 = product of:
      0.0987021 = sum of:
        0.0987021 = weight(_text_:texts in 5003) [ClassicSimilarity], result of:
          0.0987021 = score(doc=5003,freq=4.0), product of:
            0.16460659 = queryWeight, product of:
              5.4822793 = idf(docFreq=499, maxDocs=44218)
              0.03002521 = queryNorm
            0.5996243 = fieldWeight in 5003, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.4822793 = idf(docFreq=499, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5003)
      0.5 = coord(1/2)
  0.21428572 = coord(3/14)

Abstract: Automatic text classification (ATC) is a discipline at the crossroads of information retrieval (IR), machine learning (ML), and computational linguistics (CL), and consists in the realization of text classifiers, i.e. software systems capable of assigning texts to one or more categories, or classes, from a predefined set. Applications range from the automated indexing of scientific articles, to e-mail routing, spam filtering, authorship attribution, and automated survey coding. This article will focus on the ML approach to ATC, whereby a software system (called the learner) automatically builds a classifier for the categories of interest by generalizing from a "training" set of pre-classified texts.

Lindholm, J.; Schönthal, T.; Jansson, K.: Experiences of harvesting Web resources in engineering using automatic classification (2003) 0.02

0.023588596 = product of:
  0.110080115 = sum of:
    0.033948522 = weight(_text_:subject in 4088) [ClassicSimilarity], result of:
      0.033948522 = score(doc=4088,freq=2.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.31612942 = fieldWeight in 4088, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.0625 = fieldNorm(doc=4088)
    0.0380658 = weight(_text_:classification in 4088) [ClassicSimilarity], result of:
      0.0380658 = score(doc=4088,freq=4.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.39808834 = fieldWeight in 4088, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0625 = fieldNorm(doc=4088)
    0.0380658 = weight(_text_:classification in 4088) [ClassicSimilarity], result of:
      0.0380658 = score(doc=4088,freq=4.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.39808834 = fieldWeight in 4088, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0625 = fieldNorm(doc=4088)
  0.21428572 = coord(3/14)

Abstract: The authors describe the background and the work involved in setting up Engine-e, a Web index that uses automatic classification as a means for the selection of resources in engineering. Considerations in offering a robot-generated Web index as a successor to a manually indexed quality-controlled subject gateway are also discussed.

Hagedorn, K.; Chapman, S.; Newman, D.: Enhancing search and browse using automated clustering of subject metadata (2007) 0.02

0.022701254 = product of:
  0.10593919 = sum of:
    0.036007844 = weight(_text_:subject in 1168) [ClassicSimilarity], result of:
      0.036007844 = score(doc=1168,freq=4.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.33530587 = fieldWeight in 1168, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.046875 = fieldNorm(doc=1168)
    0.03496567 = weight(_text_:classification in 1168) [ClassicSimilarity], result of:
      0.03496567 = score(doc=1168,freq=6.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.3656675 = fieldWeight in 1168, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=1168)
    0.03496567 = weight(_text_:classification in 1168) [ClassicSimilarity], result of:
      0.03496567 = score(doc=1168,freq=6.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.3656675 = fieldWeight in 1168, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=1168)
  0.21428572 = coord(3/14)

Abstract: The Web puzzle of online information resources often hinders end-users from effective and efficient access to these resources. Clustering resources into appropriate subject-based groupings may help alleviate these difficulties, but will it work with heterogeneous material? The University of Michigan and the University of California Irvine joined forces to test automatically enhancing metadata records using the Topic Modeling algorithm on the varied OAIster corpus. We created labels for the resulting clusters of metadata records, matched the clusters to an in-house classification system, and developed a prototype that would showcase methods for search and retrieval using the enhanced records. Results indicated that while the algorithm was somewhat time-intensive to run and using a local classification scheme had its drawbacks, precise clustering of records was achieved and the prototype interface proved that faceted classification could be powerful in helping end-users find resources.

Golub, K.: Automated subject classification of textual web documents (2006) 0.02

0.022207009 = product of:
  0.1036327 = sum of:
    0.021217827 = weight(_text_:subject in 5600) [ClassicSimilarity], result of:
      0.021217827 = score(doc=5600,freq=2.0), product of:
        0.10738805 = queryWeight, product of:
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.03002521 = queryNorm
        0.19758089 = fieldWeight in 5600, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.576596 = idf(docFreq=3361, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5600)
    0.041207436 = weight(_text_:classification in 5600) [ClassicSimilarity], result of:
      0.041207436 = score(doc=5600,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.43094325 = fieldWeight in 5600, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5600)
    0.041207436 = weight(_text_:classification in 5600) [ClassicSimilarity], result of:
      0.041207436 = score(doc=5600,freq=12.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.43094325 = fieldWeight in 5600, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5600)
  0.21428572 = coord(3/14)

Abstract: Purpose - To provide an integrated perspective to similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and point to problems with the approaches and automated classification as such. Design/methodology/approach - A range of works dealing with automated classification of full-text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages. Findings - Provides major similarities and differences between the three approaches: document pre-processing and utilization of web-specific document characteristics is common to all the approaches; major differences are in applied algorithms, employment or not of the vector space model and of controlled vocabularies. Problems of automated classification are recognized. Research limitations/implications - The paper does not attempt to provide an exhaustive bibliography of related resources. Practical implications - As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community have the information on how similar tasks are conducted in different communities. Originality/value - To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.

Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.02

0.021961067 = product of:
  0.10248498 = sum of:
    0.045140486 = weight(_text_:classification in 2760) [ClassicSimilarity], result of:
      0.045140486 = score(doc=2760,freq=10.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.4720747 = fieldWeight in 2760, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=2760)
    0.045140486 = weight(_text_:classification in 2760) [ClassicSimilarity], result of:
      0.045140486 = score(doc=2760,freq=10.0), product of:
        0.09562149 = queryWeight, product of:
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.03002521 = queryNorm
        0.4720747 = fieldWeight in 2760, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          3.1847067 = idf(docFreq=4974, maxDocs=44218)
          0.046875 = fieldNorm(doc=2760)
    0.0122040035 = product of:
      0.024408007 = sum of:
        0.024408007 = weight(_text_:22 in 2760) [ClassicSimilarity], result of:
          0.024408007 = score(doc=2760,freq=2.0), product of:
            0.10514317 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03002521 = queryNorm
            0.23214069 = fieldWeight in 2760, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2760)
      0.5 = coord(1/2)
  0.21428572 = coord(3/14)

Abstract: Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.
Date: 22. 3.2009 19:11:54
