Search (133 results, page 1 of 7)

  • Filter: type_ss:"a"
  • Filter: theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.36
    0.36031273 = product of:
      0.6305472 = sum of:
        0.047687992 = product of:
          0.14306398 = sum of:
            0.14306398 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.14306398 = score(doc=562,freq=2.0), product of:
                0.25455406 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.03002521 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.33333334 = coord(1/3)
        0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.14306398 = score(doc=562,freq=2.0), product of:
            0.25455406 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03002521 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.03496567 = weight(_text_:classification in 562) [ClassicSimilarity], result of:
          0.03496567 = score(doc=562,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 562, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.07153199 = product of:
          0.14306398 = sum of:
            0.14306398 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.14306398 = score(doc=562,freq=2.0), product of:
                0.25455406 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.03002521 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.5 = coord(1/2)
        0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.14306398 = score(doc=562,freq=2.0), product of:
            0.25455406 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03002521 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.03496567 = weight(_text_:classification in 562) [ClassicSimilarity], result of:
          0.03496567 = score(doc=562,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 562, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.14306398 = score(doc=562,freq=2.0), product of:
            0.25455406 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03002521 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.0122040035 = product of:
          0.024408007 = sum of:
            0.024408007 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
              0.024408007 = score(doc=562,freq=2.0), product of:
                0.10514317 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03002521 = queryNorm
                0.23214069 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.5 = coord(1/2)
      0.5714286 = coord(8/14)
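    (Aside on reading the scoring breakdown: the explain tree above follows Lucene's ClassicSimilarity. The short Python sketch below is ours, not part of the retrieval system; it re-derives the per-term numbers from the constants shown in the tree, assuming the standard ClassicSimilarity formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). queryNorm and fieldNorm are simply copied from the explain output, and the helper name is hypothetical.)

      import math

      def classic_term_score(freq, doc_freq, max_docs, query_norm, field_norm):
          # Lucene ClassicSimilarity building blocks, matching the explain tree above
          tf = math.sqrt(freq)                               # 1.4142135 for freq = 2.0
          idf = 1.0 + math.log(max_docs / (doc_freq + 1.0))  # 8.478011 for docFreq = 24, maxDocs = 44218
          query_weight = idf * query_norm                    # 0.25455406
          field_weight = tf * idf * field_norm               # 0.56201804 (fieldWeight in doc 562)
          return query_weight * field_weight                 # 0.14306398

      term = classic_term_score(2.0, 24, 44218, 0.03002521, 0.046875)
      print(round(term, 8))                                  # ~0.14306398

      # The document score sums the matching clause weights listed above and scales
      # them by the coordination factor coord(8/14), since 8 of 14 query clauses matched.
      clauses = [0.047687992, 0.14306398, 0.03496567, 0.07153199,
                 0.14306398, 0.03496567, 0.14306398, 0.0122040035]
      print(round(sum(clauses) * 8 / 14, 8))                 # ~0.36031273, the score shown for result 1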
    
    Abstract
    Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well known text corpora support our approach through consistent improvement of the results.
    Content
    Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf
    Date
    8. 1.2013 10:22:32
  2. Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.09
    0.085803576 = product of:
      0.24025 = sum of:
        0.06661515 = weight(_text_:classification in 2560) [ClassicSimilarity], result of:
          0.06661515 = score(doc=2560,freq=16.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.69665456 = fieldWeight in 2560, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
        0.057587784 = product of:
          0.11517557 = sum of:
            0.11517557 = weight(_text_:schemes in 2560) [ClassicSimilarity], result of:
              0.11517557 = score(doc=2560,freq=6.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.71683466 = fieldWeight in 2560, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2560)
          0.5 = coord(1/2)
        0.035193928 = weight(_text_:bibliographic in 2560) [ClassicSimilarity], result of:
          0.035193928 = score(doc=2560,freq=2.0), product of:
            0.11688946 = queryWeight, product of:
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.03002521 = queryNorm
            0.30108726 = fieldWeight in 2560, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
        0.06661515 = weight(_text_:classification in 2560) [ClassicSimilarity], result of:
          0.06661515 = score(doc=2560,freq=16.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.69665456 = fieldWeight in 2560, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
        0.014238005 = product of:
          0.02847601 = sum of:
            0.02847601 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
              0.02847601 = score(doc=2560,freq=2.0), product of:
                0.10514317 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03002521 = queryNorm
                0.2708308 = fieldWeight in 2560, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2560)
          0.5 = coord(1/2)
      0.35714287 = coord(5/14)
    
    Abstract
    The proliferation of digital resources and their integration into a traditional library setting have created a pressing need for an automated tool that organizes textual information based on library classification schemes. Automated text classification is the research field concerned with developing tools, methods, and models for automating text classification. This article describes the current popular approach to text classification and major text classification projects and applications that are based on library classification schemes. Related issues and challenges are discussed, and a number of considerations for the challenges are examined.
    Date
    22. 9.2008 18:31:54
    Source
    International cataloguing and bibliographic control. 36(2007) no.4, S.78-82
  3. Golub, K.; Hansson, J.; Soergel, D.; Tudhope, D.: Managing classification in libraries : a methodological outline for evaluating automatic subject indexing and classification in Swedish library catalogues (2015) 0.05
    0.04855399 = product of:
      0.16993895 = sum of:
        0.051972847 = weight(_text_:subject in 2300) [ClassicSimilarity], result of:
          0.051972847 = score(doc=2300,freq=12.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.48397237 = fieldWeight in 2300, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2300)
        0.041207436 = weight(_text_:classification in 2300) [ClassicSimilarity], result of:
          0.041207436 = score(doc=2300,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.43094325 = fieldWeight in 2300, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2300)
        0.035551235 = weight(_text_:bibliographic in 2300) [ClassicSimilarity], result of:
          0.035551235 = score(doc=2300,freq=4.0), product of:
            0.11688946 = queryWeight, product of:
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.03002521 = queryNorm
            0.30414405 = fieldWeight in 2300, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2300)
        0.041207436 = weight(_text_:classification in 2300) [ClassicSimilarity], result of:
          0.041207436 = score(doc=2300,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.43094325 = fieldWeight in 2300, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2300)
      0.2857143 = coord(4/14)
    
    Abstract
    Subject terms play a crucial role in resource discovery but require substantial effort to produce. Automatic subject classification and indexing address problems of scale and sustainability and can be used to enrich existing bibliographic records, establish more connections across and between resources and enhance consistency of bibliographic data. The paper aims to put forward a complex methodological framework to evaluate automatic classification tools of Swedish textual documents based on the Dewey Decimal Classification (DDC) recently introduced to Swedish libraries. Three major complementary approaches are suggested: a quality-built gold standard, retrieval effects, and domain analysis. The gold standard is built based on input from at least two catalogue librarians, end users expert in the subject, end users inexperienced in the subject, and automated tools. Retrieval effects are studied through a combination of assigned and free tasks, including factual and comprehensive types. The study also takes into consideration the different role and character of subject terms in various knowledge domains, such as scientific disciplines. As a theoretical framework, domain analysis is used and applied in relation to the implementation of DDC in Swedish libraries and chosen domains of knowledge within the DDC itself.
    Source
    Classification and authority control: expanding resource discovery: proceedings of the International UDC Seminar 2015, 29-30 October 2015, Lisbon, Portugal. Eds.: Slavic, A. and M.I. Cordeiro
  4. Wang, J.: ¬An extensive study on automated Dewey Decimal Classification (2009) 0.05
    0.0483401 = product of:
      0.16919035 = sum of:
        0.04758225 = weight(_text_:classification in 3172) [ClassicSimilarity], result of:
          0.04758225 = score(doc=3172,freq=16.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49761042 = fieldWeight in 3172, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3172)
        0.0237488 = product of:
          0.0474976 = sum of:
            0.0474976 = weight(_text_:schemes in 3172) [ClassicSimilarity], result of:
              0.0474976 = score(doc=3172,freq=2.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.2956176 = fieldWeight in 3172, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3172)
          0.5 = coord(1/2)
        0.05027704 = weight(_text_:bibliographic in 3172) [ClassicSimilarity], result of:
          0.05027704 = score(doc=3172,freq=8.0), product of:
            0.11688946 = queryWeight, product of:
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.03002521 = queryNorm
            0.43012467 = fieldWeight in 3172, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3172)
        0.04758225 = weight(_text_:classification in 3172) [ClassicSimilarity], result of:
          0.04758225 = score(doc=3172,freq=16.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49761042 = fieldWeight in 3172, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3172)
      0.2857143 = coord(4/14)
    
    Abstract
    In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first analyze statistically the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
  5. Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.05
    0.047508016 = product of:
      0.16627805 = sum of:
        0.021217827 = weight(_text_:subject in 3614) [ClassicSimilarity], result of:
          0.021217827 = score(doc=3614,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.19758089 = fieldWeight in 3614, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
        0.06065571 = weight(_text_:classification in 3614) [ClassicSimilarity], result of:
          0.06065571 = score(doc=3614,freq=26.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.63433135 = fieldWeight in 3614, product of:
              5.0990195 = tf(freq=26.0), with freq of:
                26.0 = termFreq=26.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
        0.0237488 = product of:
          0.0474976 = sum of:
            0.0474976 = weight(_text_:schemes in 3614) [ClassicSimilarity], result of:
              0.0474976 = score(doc=3614,freq=2.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.2956176 = fieldWeight in 3614, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3614)
          0.5 = coord(1/2)
        0.06065571 = weight(_text_:classification in 3614) [ClassicSimilarity], result of:
          0.06065571 = score(doc=3614,freq=26.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.63433135 = fieldWeight in 3614, product of:
              5.0990195 = tf(freq=26.0), with freq of:
                26.0 = termFreq=26.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
      0.2857143 = coord(4/14)
    
    Abstract
    Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification system for browsing. The classification algorithm was evaluated by the users who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success of browsing was shown to be correlated with, and dependent on, classification correctness. Research limitations/implications - Further research should address problems of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from the start; allowing for searching for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesauri terms); when searching for class captions, returning the hierarchical tree expanded around the class in whose caption the search term is found. The need for improvements of classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
    Object
    Engineering Index Classification
  6. Desale, S.K.; Kumbhar, R.: Research on automatic classification of documents in library environment : a literature review (2013) 0.05
    0.04728191 = product of:
      0.16548668 = sum of:
        0.053410944 = weight(_text_:classification in 1071) [ClassicSimilarity], result of:
          0.053410944 = score(doc=1071,freq=14.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.55856633 = fieldWeight in 1071, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1071)
        0.02849856 = product of:
          0.05699712 = sum of:
            0.05699712 = weight(_text_:schemes in 1071) [ClassicSimilarity], result of:
              0.05699712 = score(doc=1071,freq=2.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.35474116 = fieldWeight in 1071, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1071)
          0.5 = coord(1/2)
        0.030166224 = weight(_text_:bibliographic in 1071) [ClassicSimilarity], result of:
          0.030166224 = score(doc=1071,freq=2.0), product of:
            0.11688946 = queryWeight, product of:
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.03002521 = queryNorm
            0.2580748 = fieldWeight in 1071, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.046875 = fieldNorm(doc=1071)
        0.053410944 = weight(_text_:classification in 1071) [ClassicSimilarity], result of:
          0.053410944 = score(doc=1071,freq=14.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.55856633 = fieldWeight in 1071, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1071)
      0.2857143 = coord(4/14)
    
    Abstract
    This paper aims to provide an overview of automatic classification research, which focuses on issues related to the automatic classification of documents in a library environment. The review covers literature published in mainstream library and information science studies. The review was done on literature published in both academic and professional LIS journals and other documents. This review reveals that basically three types of research are being done on automatic classification: 1) hierarchical classification using different library classification schemes, 2) text categorization and document categorization using different type of classifiers with or without using training documents, and 3) automatic bibliographic classification. Predominantly this research is directed towards solving problems of organization of digital documents in an online environment. However, very little research is devoted towards solving the problems of arrangement of physical documents.
  7. Reiner, U.: DDC-based search in the data of the German National Bibliography (2008) 0.04
    0.041978236 = product of:
      0.14692383 = sum of:
        0.036007844 = weight(_text_:subject in 2166) [ClassicSimilarity], result of:
          0.036007844 = score(doc=2166,freq=4.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.33530587 = fieldWeight in 2166, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=2166)
        0.04037488 = weight(_text_:classification in 2166) [ClassicSimilarity], result of:
          0.04037488 = score(doc=2166,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 2166, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2166)
        0.030166224 = weight(_text_:bibliographic in 2166) [ClassicSimilarity], result of:
          0.030166224 = score(doc=2166,freq=2.0), product of:
            0.11688946 = queryWeight, product of:
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.03002521 = queryNorm
            0.2580748 = fieldWeight in 2166, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.046875 = fieldNorm(doc=2166)
        0.04037488 = weight(_text_:classification in 2166) [ClassicSimilarity], result of:
          0.04037488 = score(doc=2166,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 2166, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2166)
      0.2857143 = coord(4/14)
    
    Abstract
    In 2004, the German National Library began to classify title records of the German National Bibliography according to subject groups based on the divisions of the Dewey Decimal Classification (DDC). Since 2006, all titles of the main series of the German National Bibliography have been classified in strict compliance with the DDC. On this basis, an enhanced DDC-based search can be realized - e.g., searching the data of the German National Bibliography for title records using number components of synthesized classification numbers or searching for DDC numbers using unclassified title records. This paper gives an account of the current research and development of the DDC-based search. The work is conducted in the VZG project Colibri that focuses on the automatic analysis of DDC-synthesized numbers and the automatic classification of bibliographic title records.
    Source
    New perspectives on subject indexing and classification: essays in honour of Magda Heiner-Freiling. Ed.: K. Knull-Schlomann, et al.
  8. Qu, B.; Cong, G.; Li, C.; Sun, A.; Chen, H.: ¬An evaluation of classification models for question topic categorization (2012) 0.04
    0.040373825 = product of:
      0.14130838 = sum of:
        0.021217827 = weight(_text_:subject in 237) [ClassicSimilarity], result of:
          0.021217827 = score(doc=237,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.19758089 = fieldWeight in 237, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0390625 = fieldNorm(doc=237)
        0.04758225 = weight(_text_:classification in 237) [ClassicSimilarity], result of:
          0.04758225 = score(doc=237,freq=16.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49761042 = fieldWeight in 237, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=237)
        0.04758225 = weight(_text_:classification in 237) [ClassicSimilarity], result of:
          0.04758225 = score(doc=237,freq=16.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49761042 = fieldWeight in 237, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=237)
        0.024926046 = product of:
          0.04985209 = sum of:
            0.04985209 = weight(_text_:texts in 237) [ClassicSimilarity], result of:
              0.04985209 = score(doc=237,freq=2.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.302856 = fieldWeight in 237, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=237)
          0.5 = coord(1/2)
      0.2857143 = coord(4/14)
    
    Abstract
    We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset comprises 3.9 million questions and these questions are organized into more than 1,000 categories in a hierarchy. To the best of our knowledge, this is the first systematic evaluation of the performance of different classification methods on question topic classification as well as short texts. Specifically, we empirically evaluate the following in classifying questions into CQA categories: (a) the usefulness of n-gram features and bag-of-word features; (b) the performance of three standard classification algorithms (naive Bayes, maximum entropy, and support vector machines); (c) the performance of the state-of-the-art hierarchical classification algorithms; (d) the effect of training data size on performance; and (e) the effectiveness of the different components of CQA data, including subject, content, asker, and the best answer. The experimental results show what aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the experimental findings from this study will be useful in real-world classification problems.
  9. Golub, K.; Hamon, T.; Ardö, A.: Automated classification of textual documents based on a controlled vocabulary in engineering (2007) 0.04
    0.03848849 = product of:
      0.1347097 = sum of:
        0.02546139 = weight(_text_:subject in 1461) [ClassicSimilarity], result of:
          0.02546139 = score(doc=1461,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.23709705 = fieldWeight in 1461, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=1461)
        0.04037488 = weight(_text_:classification in 1461) [ClassicSimilarity], result of:
          0.04037488 = score(doc=1461,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 1461, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1461)
        0.02849856 = product of:
          0.05699712 = sum of:
            0.05699712 = weight(_text_:schemes in 1461) [ClassicSimilarity], result of:
              0.05699712 = score(doc=1461,freq=2.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.35474116 = fieldWeight in 1461, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1461)
          0.5 = coord(1/2)
        0.04037488 = weight(_text_:classification in 1461) [ClassicSimilarity], result of:
          0.04037488 = score(doc=1461,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 1461, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1461)
      0.2857143 = coord(4/14)
    
    Abstract
    Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents - instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
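    (A toy illustration of the string-matching idea described in this abstract - our sketch, not the authors' system: controlled-vocabulary terms are looked up in title and abstract, and candidate classes are ranked by a simple weighted count. The vocabulary, field boost, and cut-off below are invented for illustration; the actual system additionally used the weighting schemes, exclusions, and vocabulary enrichment mentioned above.)

      from collections import defaultdict

      def match_classes(title, abstract, vocabulary, title_boost=2.0, cutoff=1.0):
          # vocabulary maps a controlled-vocabulary term to its class caption
          fields = [(title.lower(), title_boost), (abstract.lower(), 1.0)]
          scores = defaultdict(float)
          for term, cls in vocabulary.items():
              for text, boost in fields:
                  if term.lower() in text:
                      scores[cls] += boost
          # keep only classes whose accumulated weight reaches the cut-off
          return sorted(((s, c) for c, s in scores.items() if s >= cutoff), reverse=True)

      vocab = {"finite element": "Structural analysis",
               "heat transfer": "Thermodynamics",
               "neural network": "Computer applications"}
      print(match_classes("Finite element models of heat transfer in beams",
                          "We study heat transfer using finite element methods.", vocab))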
  10. Borko, H.: Research in computer based classification systems (1985) 0.04
    0.037140485 = product of:
      0.1299917 = sum of:
        0.014852478 = weight(_text_:subject in 3647) [ClassicSimilarity], result of:
          0.014852478 = score(doc=3647,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.13830662 = fieldWeight in 3647, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3647)
        0.042458992 = weight(_text_:classification in 3647) [ClassicSimilarity], result of:
          0.042458992 = score(doc=3647,freq=26.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.44403192 = fieldWeight in 3647, product of:
              5.0990195 = tf(freq=26.0), with freq of:
                26.0 = termFreq=26.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3647)
        0.042458992 = weight(_text_:classification in 3647) [ClassicSimilarity], result of:
          0.042458992 = score(doc=3647,freq=26.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.44403192 = fieldWeight in 3647, product of:
              5.0990195 = tf(freq=26.0), with freq of:
                26.0 = termFreq=26.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3647)
        0.030221226 = product of:
          0.06044245 = sum of:
            0.06044245 = weight(_text_:texts in 3647) [ClassicSimilarity], result of:
              0.06044245 = score(doc=3647,freq=6.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.3671934 = fieldWeight in 3647, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=3647)
          0.5 = coord(1/2)
      0.2857143 = coord(4/14)
    
    Abstract
    The selection in this reader by R. M. Needham and K. Sparck Jones reports an early approach to automatic classification that was taken in England. The following selection reviews various approaches that were being pursued in the United States at about the same time. It then discusses a particular approach initiated in the early 1960s by Harold Borko, at that time Head of the Language Processing and Retrieval Research Staff at the System Development Corporation, Santa Monica, California and, since 1966, a member of the faculty at the Graduate School of Library and Information Science, University of California, Los Angeles. As was described earlier, there are two steps in automatic classification, the first being to identify pairs of terms that are similar by virtue of co-occurring as index terms in the same documents, and the second being to form equivalence classes of intersubstitutable terms. To compute similarities, Borko and his associates used a standard correlation formula; to derive classification categories, where Needham and Sparck Jones used clumping, the Borko team used the statistical technique of factor analysis. The fact that documents can be classified automatically, and in any number of ways, is worthy of passing notice. Worthy of serious attention would be a demonstration that a computer-based classification system was effective in the organization and retrieval of documents. One reason for the inclusion of the following selection in the reader is that it addresses the question of evaluation. To evaluate the effectiveness of their automatically derived classification, Borko and his team asked three questions. The first was: Is the classification reliable? In other words, could the categories derived from one sample of texts be used to classify other texts? Reliability was assessed by a case-study comparison of the classes derived from three different samples of abstracts. The not-so-surprising conclusion reached was that automatically derived classes were reliable only to the extent that the sample from which they were derived was representative of the total document collection. The second evaluation question asked whether the classification was reasonable, in the sense of adequately describing the content of the document collection. The answer was sought by comparing the automatically derived categories with categories in a related classification system that was manually constructed. Here the conclusion was that the automatic method yielded categories that fairly accurately reflected the major area of interest in the sample collection of texts; however, since there were only eleven such categories and they were quite broad, they could not be regarded as suitable for use in a university or any large general library. The third evaluation question asked whether automatic classification was accurate, in the sense of producing results similar to those obtainable by human classifiers. When using human classification as a criterion, automatic classification was found to be 50 percent accurate.
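    (The first of the two steps described here - measuring term-term similarity from co-occurrence in a term-document matrix - is easy to sketch. The tiny matrix and term labels below are invented for illustration, and the factor-analysis step that would group correlated terms into categories is omitted.)

      import numpy as np

      # rows = index terms, columns = documents; 1 means the term indexes that document
      term_doc = np.array([
          [1, 1, 0, 0, 1],   # "retrieval"
          [1, 1, 0, 0, 0],   # "indexing"
          [0, 0, 1, 1, 0],   # "circuits"
          [0, 0, 1, 1, 1],   # "amplifier"
      ])
      terms = ["retrieval", "indexing", "circuits", "amplifier"]

      corr = np.corrcoef(term_doc)   # standard correlation between term rows
      for i in range(len(terms)):
          for j in range(i + 1, len(terms)):
              print(f"{terms[i]:>10} {terms[j]:>10}  r = {corr[i, j]:+.2f}")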
    Footnote
    Original in: Classification research: Proceedings of the Second International Study Conference held at Hotel Prins Hamlet, Elsinore, Denmark, 14th-18th Sept. 1964. Ed.: Pauline Atherton. Copenhagen: Munksgaard 1965. S.220-238.
    Source
    Theory of subject analysis: a sourcebook. Ed.: L.M. Chan, et al
  11. Bianchini, C.; Bargioni, S.: Automated classification using linked open data : a case study on faceted classification and Wikidata (2021) 0.04
    0.036646504 = product of:
      0.17101702 = sum of:
        0.029704956 = weight(_text_:subject in 724) [ClassicSimilarity], result of:
          0.029704956 = score(doc=724,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.27661324 = fieldWeight in 724, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0546875 = fieldNorm(doc=724)
        0.07065603 = weight(_text_:classification in 724) [ClassicSimilarity], result of:
          0.07065603 = score(doc=724,freq=18.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.7389137 = fieldWeight in 724, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=724)
        0.07065603 = weight(_text_:classification in 724) [ClassicSimilarity], result of:
          0.07065603 = score(doc=724,freq=18.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.7389137 = fieldWeight in 724, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=724)
      0.21428572 = coord(3/14)
    
    Abstract
    The Wikidata gadget, CCLitBox, for the automated classification of literary authors and works by a faceted classification and using Linked Open Data (LOD) is presented. The tool reproduces the classification algorithm of class O Literature of the Colon Classification and uses data freely available in Wikidata to create Colon Classification class numbers. CCLitBox is totally free and enables any user to classify literary authors and their works; it is easily accessible to everybody; it uses LOD from Wikidata but missing data for classification can be freely added if necessary; it is readymade for any cooperative and networked project.
    Footnote
    Part of a special issue: Artificial intelligence (AI) and automated processes for subject access
    Source
    Cataloging and classification quarterly. 59(2021) no.8, p.835-852
  12. Suominen, A.; Toivanen, H.: Map of science with topic modeling : comparison of unsupervised learning and human-assigned subject classification (2016) 0.04
    0.036394715 = product of:
      0.1273815 = sum of:
        0.021217827 = weight(_text_:subject in 3121) [ClassicSimilarity], result of:
          0.021217827 = score(doc=3121,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.19758089 = fieldWeight in 3121, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3121)
        0.041207436 = weight(_text_:classification in 3121) [ClassicSimilarity], result of:
          0.041207436 = score(doc=3121,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.43094325 = fieldWeight in 3121, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3121)
        0.0237488 = product of:
          0.0474976 = sum of:
            0.0474976 = weight(_text_:schemes in 3121) [ClassicSimilarity], result of:
              0.0474976 = score(doc=3121,freq=2.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.2956176 = fieldWeight in 3121, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3121)
          0.5 = coord(1/2)
        0.041207436 = weight(_text_:classification in 3121) [ClassicSimilarity], result of:
          0.041207436 = score(doc=3121,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.43094325 = fieldWeight in 3121, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3121)
      0.2857143 = coord(4/14)
    
    Abstract
    The delineation of coordinates is fundamental for the cartography of science, and accurate and credible classification of scientific knowledge presents a persistent challenge in this regard. We present a map of Finnish science based on unsupervised-learning classification, and discuss the advantages and disadvantages of this approach vis-à-vis those generated by human reasoning. We conclude that from theoretical and practical perspectives there exist several challenges for human reasoning-based classification frameworks of scientific knowledge, as they typically try to fit new-to-the-world knowledge into historical models of scientific knowledge, and cannot easily be deployed for new large-scale data sets. Automated classification schemes, in contrast, generate classification models only from the available text corpus, thereby identifying credibly novel bodies of knowledge. They also lend themselves to versatile large-scale data analysis, and enable a range of Big Data possibilities. However, we also argue that it is neither possible nor fruitful to declare one or another method a superior approach in terms of realism to classify scientific knowledge, and we believe that the merits of each approach are dependent on the practical objectives of analysis.
  13. Schiminovich, S.: Automatic classification and retrieval of documents by means of a bibliographic pattern discovery algorithm (1971) 0.04
    0.03527055 = product of:
      0.1645959 = sum of:
        0.047104023 = weight(_text_:classification in 4846) [ClassicSimilarity], result of:
          0.047104023 = score(doc=4846,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49260917 = fieldWeight in 4846, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.109375 = fieldNorm(doc=4846)
        0.070387855 = weight(_text_:bibliographic in 4846) [ClassicSimilarity], result of:
          0.070387855 = score(doc=4846,freq=2.0), product of:
            0.11688946 = queryWeight, product of:
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.03002521 = queryNorm
            0.6021745 = fieldWeight in 4846, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.109375 = fieldNorm(doc=4846)
        0.047104023 = weight(_text_:classification in 4846) [ClassicSimilarity], result of:
          0.047104023 = score(doc=4846,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49260917 = fieldWeight in 4846, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.109375 = fieldNorm(doc=4846)
      0.21428572 = coord(3/14)
    
  14. Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.03
    0.033885635 = product of:
      0.15813297 = sum of:
        0.03364573 = weight(_text_:classification in 1107) [ClassicSimilarity], result of:
          0.03364573 = score(doc=1107,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.35186368 = fieldWeight in 1107, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
        0.03364573 = weight(_text_:classification in 1107) [ClassicSimilarity], result of:
          0.03364573 = score(doc=1107,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.35186368 = fieldWeight in 1107, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
        0.09084152 = sum of:
          0.07050151 = weight(_text_:texts in 1107) [ClassicSimilarity], result of:
            0.07050151 = score(doc=1107,freq=4.0), product of:
              0.16460659 = queryWeight, product of:
                5.4822793 = idf(docFreq=499, maxDocs=44218)
                0.03002521 = queryNorm
              0.42830306 = fieldWeight in 1107, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.4822793 = idf(docFreq=499, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1107)
          0.020340007 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
            0.020340007 = score(doc=1107,freq=2.0), product of:
              0.10514317 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.03002521 = queryNorm
              0.19345059 = fieldWeight in 1107, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1107)
      0.21428572 = coord(3/14)
    
    Abstract
    Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.
    Date
    28.10.2013 19:22:57
  15. Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.03
    0.033034686 = product of:
      0.15416187 = sum of:
        0.03496567 = weight(_text_:classification in 2158) [ClassicSimilarity], result of:
          0.03496567 = score(doc=2158,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 2158, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
        0.03496567 = weight(_text_:classification in 2158) [ClassicSimilarity], result of:
          0.03496567 = score(doc=2158,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 2158, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
        0.08423052 = sum of:
          0.059822515 = weight(_text_:texts in 2158) [ClassicSimilarity], result of:
            0.059822515 = score(doc=2158,freq=2.0), product of:
              0.16460659 = queryWeight, product of:
                5.4822793 = idf(docFreq=499, maxDocs=44218)
                0.03002521 = queryNorm
              0.36342722 = fieldWeight in 2158, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4822793 = idf(docFreq=499, maxDocs=44218)
                0.046875 = fieldNorm(doc=2158)
          0.024408007 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
            0.024408007 = score(doc=2158,freq=2.0), product of:
              0.10514317 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.03002521 = queryNorm
              0.23214069 = fieldWeight in 2158, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=2158)
      0.21428572 = coord(3/14)
    
    Abstract
    This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
    Date
    4. 8.2015 19:22:04
  16. Godby, C.J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization : subject access issues (2003) 0.03
    0.032918133 = product of:
      0.15361795 = sum of:
        0.059409913 = weight(_text_:subject in 3962) [ClassicSimilarity], result of:
          0.059409913 = score(doc=3962,freq=8.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.5532265 = fieldWeight in 3962, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3962)
        0.047104023 = weight(_text_:classification in 3962) [ClassicSimilarity], result of:
          0.047104023 = score(doc=3962,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49260917 = fieldWeight in 3962, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3962)
        0.047104023 = weight(_text_:classification in 3962) [ClassicSimilarity], result of:
          0.047104023 = score(doc=3962,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49260917 = fieldWeight in 3962, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3962)
      0.21428572 = coord(3/14)
    
    Abstract
    This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
    Source
    Subject retrieval in a networked environment: Proceedings of the IFLA Satellite Meeting held in Dublin, OH, 14-16 August 2001 and sponsored by the IFLA Classification and Indexing Section, the IFLA Information Technology Section and OCLC. Ed.: I.C. McIlwaine
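     As an illustration of the log-likelihood filtering mentioned in the abstract, a minimal sketch of Dunning's G2 statistic for a term/class contingency table; terms with large scores are kept as class vocabulary. The counts below are invented, and this is not the code used in the paper.

     import math

     def log_likelihood(a, b, c, d):
         # a = term count in the class, b = term count elsewhere,
         # c = other tokens in the class, d = other tokens elsewhere.
         def ll(k, n, p):
             return k * math.log(k / (n * p)) if k > 0 else 0.0   # a zero count contributes nothing
         n1, n2 = a + c, b + d
         p = (a + b) / (n1 + n2)                                  # pooled term probability
         return 2.0 * (ll(a, n1, p) + ll(b, n2, p))

     # Assumed toy counts: a heading term occurs 40 times in 1,000 class tokens
     # and 5 times in 9,000 tokens of all other classes.
     print(round(log_likelihood(40, 5, 960, 8995), 1))            # large value -> keep the term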
  17. Kragelj, M.; Borstnar, M.K.: Automatic classification of older electronic texts into the Universal Decimal Classification-UDC (2021) 0.03
    0.031475578 = product of:
      0.14688602 = sum of:
        0.040374875 = weight(_text_:classification in 175) [ClassicSimilarity], result of:
          0.040374875 = score(doc=175,freq=18.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4222364 = fieldWeight in 175, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03125 = fieldNorm(doc=175)
        0.040374875 = weight(_text_:classification in 175) [ClassicSimilarity], result of:
          0.040374875 = score(doc=175,freq=18.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4222364 = fieldWeight in 175, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03125 = fieldNorm(doc=175)
        0.06613628 = product of:
          0.13227256 = sum of:
            0.13227256 = weight(_text_:texts in 175) [ClassicSimilarity], result of:
              0.13227256 = score(doc=175,freq=22.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.8035678 = fieldWeight in 175, product of:
                  4.690416 = tf(freq=22.0), with freq of:
                    22.0 = termFreq=22.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03125 = fieldNorm(doc=175)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
     Purpose: The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.
     Design/methodology/approach: The general research approach is inherent to design science research, in which the problem of UDC assignment of the old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was used for classification of old texts on a corpus of 200,000 items. Human experts evaluated the performance of the model.
     Findings: Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.
     Research limitations/implications: The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.
     Practical implications: The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.
     Social implications: The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These contribute to making knowledge more widely available and useable.
     Originality/value: These findings contribute to the field of automated classification of bibliographical information with the usage of full texts, especially in cases in which the texts are old, unstructured and in which archaic language and vocabulary are used.
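     A minimal sketch, under assumed details, of the kind of supervised pipeline the abstract describes: train on texts that librarians have already assigned top-level UDC classes, then rank candidate classes for an unlabelled older text so a librarian can review the recommendation. The corpus, labels, and model choice here are placeholders, not the authors' model.

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.linear_model import LogisticRegression
     from sklearn.pipeline import make_pipeline

     # Assumed toy corpus; the study used ~70,000 bibliographically processed texts.
     labelled_texts = [
         "A treatise on differential equations and their applications.",
         "An introduction to the history of the medieval Slavic lands.",
         "Experiments on the electrical conductivity of metals.",
     ]
     udc_labels = ["51", "93/94", "53"]   # 51 Mathematics, 93/94 History, 53 Physics

     model = make_pipeline(TfidfVectorizer(sublinear_tf=True), LogisticRegression(max_iter=1000))
     model.fit(labelled_texts, udc_labels)

     # Rank UDC classes for an old, unlabelled text; a librarian reviews the suggestion.
     old_text = "On the motion of bodies and the laws governing falling objects."
     for label, p in sorted(zip(model.classes_, model.predict_proba([old_text])[0]), key=lambda x: -x[1]):
         print(label, round(float(p), 2))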
  18. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.03
    0.030435055 = product of:
      0.10652269 = sum of:
        0.018003922 = weight(_text_:subject in 1253) [ClassicSimilarity], result of:
          0.018003922 = score(doc=1253,freq=4.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.16765293 = fieldWeight in 1253, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
        0.031919144 = weight(_text_:classification in 1253) [ClassicSimilarity], result of:
          0.031919144 = score(doc=1253,freq=20.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.33380723 = fieldWeight in 1253, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
        0.024680478 = product of:
          0.049360957 = sum of:
            0.049360957 = weight(_text_:schemes in 1253) [ClassicSimilarity], result of:
              0.049360957 = score(doc=1253,freq=6.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.30721486 = fieldWeight in 1253, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.0234375 = fieldNorm(doc=1253)
          0.5 = coord(1/2)
        0.031919144 = weight(_text_:classification in 1253) [ClassicSimilarity], result of:
          0.031919144 = score(doc=1253,freq=20.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.33380723 = fieldWeight in 1253, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
      0.2857143 = coord(4/14)
    
    Abstract
     Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources and within a single source. That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to automatically extract the requisite collection metadata that must be distributed.
    We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a `training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
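     A hedged sketch of the LSI matching step described above: project category-labelled catalog records and a user query into a latent semantic space and rank the LCC categories by similarity. The records, category codes, and dimensionality are toy values, not the Pharos implementation.

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.decomposition import TruncatedSVD
     from sklearn.metrics.pairwise import cosine_similarity

     records = [
         "prostate cancer diagnosis and oncology treatment",        # RC  Internal medicine
         "satellite remote sensing of vegetation and land cover",   # G   Geography
         "investment banking and corporate finance markets",        # HG  Finance
     ]
     lcc_categories = ["RC", "G", "HG"]

     vec = TfidfVectorizer()
     X = vec.fit_transform(records)
     lsi = TruncatedSVD(n_components=2, random_state=0)             # latent semantic space
     X_lsi = lsi.fit_transform(X)

     query = "prostate cancer"
     q_lsi = lsi.transform(vec.transform([query]))
     for cat, s in sorted(zip(lcc_categories, cosine_similarity(q_lsi, X_lsi)[0]), key=lambda x: -x[1]):
         print(cat, round(float(s), 3))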
  19. Wu, M.; Liu, Y.-H.; Brownlee, R.; Zhang, X.: Evaluating utility and automatic classification of subject metadata from Research Data Australia (2021) 0.03
    0.03041722 = product of:
      0.14194703 = sum of:
        0.07201569 = weight(_text_:subject in 453) [ClassicSimilarity], result of:
          0.07201569 = score(doc=453,freq=16.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.67061174 = fieldWeight in 453, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=453)
        0.03496567 = weight(_text_:classification in 453) [ClassicSimilarity], result of:
          0.03496567 = score(doc=453,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 453, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=453)
        0.03496567 = weight(_text_:classification in 453) [ClassicSimilarity], result of:
          0.03496567 = score(doc=453,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 453, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=453)
      0.21428572 = coord(3/14)
    
    Abstract
     In this paper, we present a case study of how well subject metadata (comprising headings from an international classification scheme) has been deployed in a national data catalogue, and how often data seekers use subject metadata when searching for data. Through an analysis of user search behaviour as recorded in search logs, we find evidence that users utilise the subject metadata for data discovery. Since approximately half of the records ingested by the catalogue did not include subject metadata at the time of harvest, we experimented with automatic subject classification approaches in order to enrich these records and to provide additional support for user search and data discovery. Our results show that automatic methods work well for well-represented categories of subject metadata, and these categories tend to have features that distinguish them from the other categories. Our findings raise implications for data catalogue providers; they should invest more effort to enhance the quality of data records by providing an adequate description of these records for under-represented subject categories.
  20. Chung, Y.-M.; Noh, Y.-H.: Developing a specialized directory system by automatically classifying Web documents (2003) 0.03
    0.03025795 = product of:
      0.14120376 = sum of:
        0.05092278 = weight(_text_:subject in 1566) [ClassicSimilarity], result of:
          0.05092278 = score(doc=1566,freq=8.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.4741941 = fieldWeight in 1566, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
        0.045140486 = weight(_text_:classification in 1566) [ClassicSimilarity], result of:
          0.045140486 = score(doc=1566,freq=10.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4720747 = fieldWeight in 1566, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
        0.045140486 = weight(_text_:classification in 1566) [ClassicSimilarity], result of:
          0.045140486 = score(doc=1566,freq=10.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4720747 = fieldWeight in 1566, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
      0.21428572 = coord(3/14)
    
    Abstract
     This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77 and 60%, respectively. The third experiment employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting achieved a precision ratio of 96%. This implies that it is possible to enhance the classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier.
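     A minimal sketch, under assumed details, of the hybrid approach the abstract suggests: try a representative-term dictionary per DDC category first and fall back to a kNN classifier when the dictionary match is too weak. The DDC numbers, term lists, and thresholds are invented for illustration, not the authors' system.

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.neighbors import KNeighborsClassifier
     from sklearn.pipeline import make_pipeline

     # Toy representative-term dictionary keyed by DDC class (terms are assumed).
     ddc_terms = {
         "332": {"bank", "credit", "interest", "loan"},     # financial economics
         "338": {"industry", "production", "prices"},       # production
     }

     def dictionary_classify(text, min_hits=2):
         hits = {ddc: sum(t in text.lower() for t in terms) for ddc, terms in ddc_terms.items()}
         best = max(hits, key=hits.get)
         return best if hits[best] >= min_hits else None

     # kNN fallback trained on a toy labelled collection.
     train_docs = ["central bank raises interest rates on loans",
                   "industrial production and consumer prices rise"]
     train_ddc = ["332", "338"]
     knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
     knn.fit(train_docs, train_ddc)

     doc = "a web page about commodity prices and industrial output"
     print(dictionary_classify(doc) or knn.predict([doc])[0])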

Languages

  • e 128
  • d 4