Search (151 results, page 2 of 8)

  • theme_ss:"Automatisches Klassifizieren"
  1. Kragelj, M.; Borstnar, M.K.: Automatic classification of older electronic texts into the Universal Decimal Classification-UDC (2021) 0.03
    0.031475578 = product of:
      0.14688602 = sum of:
        0.040374875 = weight(_text_:classification in 175) [ClassicSimilarity], result of:
          0.040374875 = score(doc=175,freq=18.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4222364 = fieldWeight in 175, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03125 = fieldNorm(doc=175)
        0.040374875 = weight(_text_:classification in 175) [ClassicSimilarity], result of:
          0.040374875 = score(doc=175,freq=18.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4222364 = fieldWeight in 175, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03125 = fieldNorm(doc=175)
        0.06613628 = product of:
          0.13227256 = sum of:
            0.13227256 = weight(_text_:texts in 175) [ClassicSimilarity], result of:
              0.13227256 = score(doc=175,freq=22.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.8035678 = fieldWeight in 175, product of:
                  4.690416 = tf(freq=22.0), with freq of:
                    22.0 = termFreq=22.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03125 = fieldNorm(doc=175)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
    Purpose: The purpose of this study is to develop a model for the automated classification of old digitised texts into the Universal Decimal Classification (UDC), using machine-learning methods.
    Design/methodology/approach: The general research approach is that of design science research, in which the problem of assigning UDC notations to old digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was then used to classify a corpus of 200,000 old texts. Human experts evaluated the performance of the model.
    Findings: The results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text, and that the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.
    Research limitations/implications: The main limitations of this study were the unavailability of labelled older texts and the limited availability of librarians.
    Practical implications: The classification model can provide recommendations to librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in library databases.
    Social implications: The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. Both contribute to making knowledge more widely available and usable.
    Originality/value: These findings contribute to the field of automated classification of bibliographic information using full texts, especially in cases in which the texts are old and unstructured and in which archaic language and vocabulary are used.
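    The setup the abstract describes lends itself to a short sketch: train a supervised text classifier on labelled scholarly texts, then use it to recommend UDC notations for unlabelled older texts. The model choice (TF-IDF features with a linear SVM), the toy data, and all names below are illustrative assumptions, not details taken from the paper.

      # A minimal sketch, assuming scikit-learn; the paper's actual model,
      # features, and corpus are not reproduced here.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC

      # Toy stand-ins for the 70,000 bibliographically processed texts.
      train_texts = ["an essay on medieval agriculture",
                     "a treatise on prime numbers"]
      train_udc = ["63", "511"]  # top-level UDC notations as class labels

      model = make_pipeline(TfidfVectorizer(sublinear_tf=True), LinearSVC())
      model.fit(train_texts, train_udc)

      # Recommend a UDC notation for an old, digitised text.
      print(model.predict(["observations on crop rotation in farm records"]))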
  2. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.03
    0.030435055 = product of:
      0.10652269 = sum of:
        0.018003922 = weight(_text_:subject in 1253) [ClassicSimilarity], result of:
          0.018003922 = score(doc=1253,freq=4.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.16765293 = fieldWeight in 1253, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
        0.031919144 = weight(_text_:classification in 1253) [ClassicSimilarity], result of:
          0.031919144 = score(doc=1253,freq=20.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.33380723 = fieldWeight in 1253, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
        0.024680478 = product of:
          0.049360957 = sum of:
            0.049360957 = weight(_text_:schemes in 1253) [ClassicSimilarity], result of:
              0.049360957 = score(doc=1253,freq=6.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.30721486 = fieldWeight in 1253, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.0234375 = fieldNorm(doc=1253)
          0.5 = coord(1/2)
        0.031919144 = weight(_text_:classification in 1253) [ClassicSimilarity], result of:
          0.031919144 = score(doc=1253,freq=20.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.33380723 = fieldWeight in 1253, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
      0.2857143 = coord(4/14)
    
    Abstract
    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increase, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR).
    Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources and within a single source. That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats.
    Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to automatically extract the requisite collection metadata that must be distributed.
    We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a 'training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image features. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface; rather, it is intended merely to offer a view of the process and to suggest the "look and feel" of the prototype. The demo works as follows. First, supply it with a few keywords of interest. The system will then use those terms to return the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. The first, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. Having shown this demonstration to many people, we suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
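    The LSI step described above can be pictured compactly: catalog records serve as the training set, a latent space is built over them, and a user query is projected into that space to find the closest records and, through them, candidate LCC categories. The corpus, the LCC numbers, and the use of scikit-learn are illustrative assumptions, not the Pharos implementation.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD
      from sklearn.metrics.pairwise import cosine_similarity

      # Toy catalog records with hypothetical LCC numbers.
      records = [
          "prostate cancer diagnosis and treatment",
          "remote sensing of coastal environments",
          "investment banking and securities markets",
      ]
      lcc = ["RC280", "G70.4", "HG4534"]

      tfidf = TfidfVectorizer()
      X = tfidf.fit_transform(records)
      lsi = TruncatedSVD(n_components=2, random_state=0)  # tiny latent space
      X_lsi = lsi.fit_transform(X)

      # Project a user query into the LSI space and rank the records.
      query = lsi.transform(tfidf.transform(["prostate cancer"]))
      sims = cosine_similarity(query, X_lsi)[0]
      print(lcc[sims.argmax()])  # most relevant LCC category for the query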
  3. Wu, M.; Liu, Y.-H.; Brownlee, R.; Zhang, X.: Evaluating utility and automatic classification of subject metadata from Research Data Australia (2021) 0.03
    0.03041722 = product of:
      0.14194703 = sum of:
        0.07201569 = weight(_text_:subject in 453) [ClassicSimilarity], result of:
          0.07201569 = score(doc=453,freq=16.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.67061174 = fieldWeight in 453, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=453)
        0.03496567 = weight(_text_:classification in 453) [ClassicSimilarity], result of:
          0.03496567 = score(doc=453,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 453, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=453)
        0.03496567 = weight(_text_:classification in 453) [ClassicSimilarity], result of:
          0.03496567 = score(doc=453,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 453, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=453)
      0.21428572 = coord(3/14)
    
    Abstract
    In this paper, we present a case study of how well subject metadata (comprising headings from an international classification scheme) has been deployed in a national data catalogue, and how often data seekers use subject metadata when searching for data. Through an analysis of user search behaviour as recorded in search logs, we find evidence that users utilise the subject metadata for data discovery. Since approximately half of the records ingested by the catalogue did not include subject metadata at the time of harvest, we experimented with automatic subject classification approaches in order to enrich these records and to provide additional support for user search and data discovery. Our results show that automatic methods work well for well-represented categories of subject metadata, and these categories tend to have features that distinguish them from the other categories. Our findings have implications for data catalogue providers: they should invest more effort in enhancing the quality of data records by providing an adequate description of these records for under-represented subject categories.
  4. Chung, Y.-M.; Noh, Y.-H.: Developing a specialized directory system by automatically classifying Web documents (2003) 0.03
    0.03025795 = product of:
      0.14120376 = sum of:
        0.05092278 = weight(_text_:subject in 1566) [ClassicSimilarity], result of:
          0.05092278 = score(doc=1566,freq=8.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.4741941 = fieldWeight in 1566, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
        0.045140486 = weight(_text_:classification in 1566) [ClassicSimilarity], result of:
          0.045140486 = score(doc=1566,freq=10.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4720747 = fieldWeight in 1566, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
        0.045140486 = weight(_text_:classification in 1566) [ClassicSimilarity], result of:
          0.045140486 = score(doc=1566,freq=10.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4720747 = fieldWeight in 1566, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
      0.21428572 = coord(3/14)
    
    Abstract
    This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments, using the representative term dictionary, resulted in relatively high precision ratios of 77% and 60%, respectively. The third experiment, employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting, achieved a precision ratio of 96%. This implies that it is possible to enhance classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier.
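    The kNN experiment can be sketched in a few lines of scikit-learn; the dictionary-based matching step is omitted, and the documents, DDC categories, and parameters below are invented for illustration.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.pipeline import make_pipeline

      docs = ["labour market statistics", "central bank interest rates",
              "stock exchange listings", "unemployment insurance policy"]
      ddc = ["331", "332.1", "332.6", "368.44"]  # hypothetical DDC categories

      knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
      knn.fit(docs, ddc)
      print(knn.predict(["quarterly report on unemployment figures"]))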
  5. Ardö, A.; Koch, T.: Automatic classification applied to full-text Internet documents in a robot-generated subject index (1999) 0.03
    0.028215542 = product of:
      0.13167253 = sum of:
        0.05092278 = weight(_text_:subject in 382) [ClassicSimilarity], result of:
          0.05092278 = score(doc=382,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.4741941 = fieldWeight in 382, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.09375 = fieldNorm(doc=382)
        0.04037488 = weight(_text_:classification in 382) [ClassicSimilarity], result of:
          0.04037488 = score(doc=382,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 382, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.09375 = fieldNorm(doc=382)
        0.04037488 = weight(_text_:classification in 382) [ClassicSimilarity], result of:
          0.04037488 = score(doc=382,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 382, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.09375 = fieldNorm(doc=382)
      0.21428572 = coord(3/14)
    
  6. Yoon, Y.; Lee, C.; Lee, G.G.: ¬An effective procedure for constructing a hierarchical text classification system (2006) 0.03
    0.027775463 = product of:
      0.12961882 = sum of:
        0.057690408 = weight(_text_:classification in 5273) [ClassicSimilarity], result of:
          0.057690408 = score(doc=5273,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.60332054 = fieldWeight in 5273, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5273)
        0.057690408 = weight(_text_:classification in 5273) [ClassicSimilarity], result of:
          0.057690408 = score(doc=5273,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.60332054 = fieldWeight in 5273, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5273)
        0.014238005 = product of:
          0.02847601 = sum of:
            0.02847601 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
              0.02847601 = score(doc=5273,freq=2.0), product of:
                0.10514317 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03002521 = queryNorm
                0.2708308 = fieldWeight in 5273, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5273)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
    In text categorization tasks, classification over a class hierarchy often yields better results than classification without one. Because a large number of documents can be divided into several subgroups in a hierarchy, a hierarchical classification method can be applied appropriately. However, there has been no systematic method for building a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
    Date
    22. 7.2006 16:24:52
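    One common way to realize the hierarchical setup the abstract describes is to train an independent classifier at each internal node and route each document top-down through the tree; the tree, data, and models below are illustrative assumptions, not the authors' system.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      # Internal node -> (training texts, child-category labels).
      training = {
          "root": (["football match report", "quarterly earnings call",
                    "league table update", "stock market rally"],
                   ["sports", "finance", "sports", "finance"]),
      }

      node_clf = {}
      for node, (texts, child_labels) in training.items():
          clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
          clf.fit(texts, child_labels)
          node_clf[node] = clf

      def classify(doc, node="root"):
          # Descend until a node with no classifier (a leaf) is reached.
          while node in node_clf:
              node = node_clf[node].predict([doc])[0]
          return node

      print(classify("midfielder scores twice in the cup final"))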
  7. Prabowo, R.; Jackson, M.; Burden, P.; Knoell, H.-D.: Ontology-based automatic classification for the Web pages : design, implementation and evaluation (2002) 0.03
    0.027299229 = product of:
      0.1273964 = sum of:
        0.049448926 = weight(_text_:classification in 3383) [ClassicSimilarity], result of:
          0.049448926 = score(doc=3383,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.5171319 = fieldWeight in 3383, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=3383)
        0.02849856 = product of:
          0.05699712 = sum of:
            0.05699712 = weight(_text_:schemes in 3383) [ClassicSimilarity], result of:
              0.05699712 = score(doc=3383,freq=2.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.35474116 = fieldWeight in 3383, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3383)
          0.5 = coord(1/2)
        0.049448926 = weight(_text_:classification in 3383) [ClassicSimilarity], result of:
          0.049448926 = score(doc=3383,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.5171319 = fieldWeight in 3383, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=3383)
      0.21428572 = coord(3/14)
    
    Abstract
    In recent years, we have witnessed continual growth in the use of ontologies to provide a mechanism for machine reasoning. This paper describes an automatic classifier which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion and mapped into DDC and LCC. Secondly, we propose the formal definition of a DDC-LCC and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, an experiment in which the accuracy of the classifier was evaluated is presented. The experiment shows that our approach results in improved classification accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.
  8. Shafer, K.E.: Evaluating Scorpion results (1998) 0.03
    0.027279545 = product of:
      0.12730454 = sum of:
        0.060013074 = weight(_text_:subject in 1569) [ClassicSimilarity], result of:
          0.060013074 = score(doc=1569,freq=4.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.55884314 = fieldWeight in 1569, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.078125 = fieldNorm(doc=1569)
        0.03364573 = weight(_text_:classification in 1569) [ClassicSimilarity], result of:
          0.03364573 = score(doc=1569,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.35186368 = fieldWeight in 1569, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.078125 = fieldNorm(doc=1569)
        0.03364573 = weight(_text_:classification in 1569) [ClassicSimilarity], result of:
          0.03364573 = score(doc=1569,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.35186368 = fieldWeight in 1569, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.078125 = fieldNorm(doc=1569)
      0.21428572 = coord(3/14)
    
    Abstract
    Scorpion is a research project at OCLC that builds tools for automatic subject assignment by combining library science and information retrieval techniques. A thesis of Scorpion is that the Dewey Decimal Classification (Dewey) can be used to perform automatic subject assignment for electronic items.
  9. Wartena, C.; Sommer, M.: Automatic classification of scientific records using the German Subject Heading Authority File (SWD) (2012) 0.03
    0.025535407 = product of:
      0.11916523 = sum of:
        0.03675035 = weight(_text_:subject in 472) [ClassicSimilarity], result of:
          0.03675035 = score(doc=472,freq=6.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.34222013 = fieldWeight in 472, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0390625 = fieldNorm(doc=472)
        0.041207436 = weight(_text_:classification in 472) [ClassicSimilarity], result of:
          0.041207436 = score(doc=472,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.43094325 = fieldWeight in 472, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=472)
        0.041207436 = weight(_text_:classification in 472) [ClassicSimilarity], result of:
          0.041207436 = score(doc=472,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.43094325 = fieldWeight in 472, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=472)
      0.21428572 = coord(3/14)
    
    Abstract
    The following paper deals with an automatic text classification method which does not require training documents. For this method the German Subject Heading Authority File (SWD), provided by the linked-data service of the German National Library, is used. Recently the SWD was enriched with notations of the Dewey Decimal Classification (DDC). As a consequence, it became possible to utilize the subject headings as textual representations for the notations of the DDC. Basically, we derive the classification of a text from the classification of the words in the text, as given by the thesaurus. The method was tested by classifying 3826 OAI records from 7 different repositories. Mean reciprocal rank and recall were chosen as evaluation measures. Direct comparison to a machine-learning method has shown that this method is definitely competitive. Thus we can conclude that the enriched version of the SWD provides high-quality information with a broad coverage for the classification of German scientific articles.
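    The training-free idea admits a very small sketch: since the SWD links subject headings to DDC notations, a text's classification can be derived by aggregating the notations of the headings found in it. The heading table below is a tiny invented stand-in for the real SWD linked data.

      from collections import Counter

      # Invented stand-in for SWD subject headings linked to DDC notations.
      swd_to_ddc = {
          "photosynthese": "572.46",
          "chlorophyll": "572.46",
          "zinspolitik": "332.46",
      }

      def classify(text):
          text = text.lower()
          hits = [ddc for heading, ddc in swd_to_ddc.items() if heading in text]
          return Counter(hits).most_common()  # ranked DDC notations

      print(classify("Der Einfluss von Chlorophyll auf die Photosynthese"))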
  10. Golub, K.: Automated subject classification of textual Web pages, based on a controlled vocabulary : challenges and recommendations (2006) 0.03
    0.025019486 = product of:
      0.1167576 = sum of:
        0.036007844 = weight(_text_:subject in 5897) [ClassicSimilarity], result of:
          0.036007844 = score(doc=5897,freq=4.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.33530587 = fieldWeight in 5897, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=5897)
        0.04037488 = weight(_text_:classification in 5897) [ClassicSimilarity], result of:
          0.04037488 = score(doc=5897,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 5897, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=5897)
        0.04037488 = weight(_text_:classification in 5897) [ClassicSimilarity], result of:
          0.04037488 = score(doc=5897,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 5897, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=5897)
      0.21428572 = coord(3/14)
    
    Abstract
    The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.
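    The string-to-string matching approach reduces to tallying, per class, the term-list entries that occur in the page text; the three-entry term list and the class codes below are invented stand-ins for the Ei-derived list.

      term_list = {
          "heat exchanger": "641.2",   # hypothetical Ei classification codes
          "thermodynamics": "641.1",
          "fluid flow": "631.1",
      }

      def classify(page_text):
          text = page_text.lower()
          scores = {}
          for term, cls in term_list.items():
              if term in text:
                  scores[cls] = scores.get(cls, 0) + 1
          return sorted(scores, key=scores.get, reverse=True)

      print(classify("Compact heat exchanger design under turbulent fluid flow"))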
  11. Sebastiani, F.: Classification of text, automatic (2006) 0.02
    0.024849901 = product of:
      0.1159662 = sum of:
        0.033307575 = weight(_text_:classification in 5003) [ClassicSimilarity], result of:
          0.033307575 = score(doc=5003,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.34832728 = fieldWeight in 5003, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5003)
        0.033307575 = weight(_text_:classification in 5003) [ClassicSimilarity], result of:
          0.033307575 = score(doc=5003,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.34832728 = fieldWeight in 5003, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5003)
        0.04935105 = product of:
          0.0987021 = sum of:
            0.0987021 = weight(_text_:texts in 5003) [ClassicSimilarity], result of:
              0.0987021 = score(doc=5003,freq=4.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.5996243 = fieldWeight in 5003, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5003)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
    Automatic text classification (ATC) is a discipline at the crossroads of information retrieval (IR), machine learning (ML), and computational linguistics (CL), and consists in the realization of text classifiers, i.e. software systems capable of assigning texts to one or more categories, or classes, from a predefined set. Applications range from the automated indexing of scientific articles, to e-mail routing, spam filtering, authorship attribution, and automated survey coding. This article will focus on the ML approach to ATC, whereby a software system (called the learner) automatically builds a classifier for the categories of interest by generalizing from a "training" set of pre-classified texts.
  12. Larson, R.R.: Experiments in automatic Library of Congress Classification (1992) 0.02
    0.024801936 = product of:
      0.11574237 = sum of:
        0.02546139 = weight(_text_:subject in 1054) [ClassicSimilarity], result of:
          0.02546139 = score(doc=1054,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.23709705 = fieldWeight in 1054, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=1054)
        0.045140486 = weight(_text_:classification in 1054) [ClassicSimilarity], result of:
          0.045140486 = score(doc=1054,freq=10.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4720747 = fieldWeight in 1054, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1054)
        0.045140486 = weight(_text_:classification in 1054) [ClassicSimilarity], result of:
          0.045140486 = score(doc=1054,freq=10.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4720747 = fieldWeight in 1054, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1054)
      0.21428572 = coord(3/14)
    
    Abstract
    This article presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records. The method used in this study was based on partial-match retrieval techniques using various elements of new records (i.e., those to be classified) as "queries", and a test database of classification clusters generated from previously classified MARC records. Sixty individual methods for automatic classification were tested on a set of 283 new records, using all combinations of four different partial-match methods, five query types, and three representations of search terms. The results indicate that if the best method for a particular case can be determined, then up to 86% of the new records may be correctly classified. The single method with the best accuracy was able to select the correct classification for about 46% of the new records.
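    The partial-match idea can be sketched by representing each classification cluster by the vocabulary of its previously classified records and matching a new record against the nearest cluster; the records, LCC numbers, and centroid-based matching below are assumptions for illustration, not any of the study's sixty tested methods.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.neighbors import NearestCentroid

      # Previously classified records define the clusters (invented data).
      titles = ["introduction to organic chemistry",
                "organic reaction mechanisms",
                "medieval european history",
                "a history of the crusades"]
      lcc = ["QD251", "QD251", "D117", "D117"]

      vec = TfidfVectorizer()
      X = vec.fit_transform(titles).toarray()
      clf = NearestCentroid().fit(X, lcc)

      new_record = vec.transform(["advanced organic chemistry"]).toarray()
      print(clf.predict(new_record))  # best-matching classification cluster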
  13. Ko, Y.: ¬A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.02
    0.024449104 = product of:
      0.11409582 = sum of:
        0.028549349 = weight(_text_:classification in 2339) [ClassicSimilarity], result of:
          0.028549349 = score(doc=2339,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.29856625 = fieldWeight in 2339, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
        0.05699712 = product of:
          0.11399424 = sum of:
            0.11399424 = weight(_text_:schemes in 2339) [ClassicSimilarity], result of:
              0.11399424 = score(doc=2339,freq=8.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.7094823 = fieldWeight in 2339, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2339)
          0.5 = coord(1/2)
        0.028549349 = weight(_text_:classification in 2339) [ClassicSimilarity], result of:
          0.028549349 = score(doc=2339,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.29856625 = fieldWeight in 2339, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
      0.21428572 = coord(3/14)
    
    Abstract
    Text classification (TC) is a core technique for text mining and information retrieval, and has been applied in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain high TC performance. Although term weighting is one of the important modules for TC, and although TC has peculiarities that distinguish it from information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been carried over to TC unchanged. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that exploits class information, using positive and negative class distributions. The proposed scheme, log tf-TRR, consistently performs better than other schemes that use class information, as well as traditional schemes such as tf-idf.
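    The ingredients of the scheme can be illustrated, with the caveat that the formula below is only a guess at its general shape (a log-scaled term frequency multiplied by a term relevance ratio built from smoothed positive/negative class probabilities) and is not quoted from the paper.

      import math
      from collections import Counter

      pos_docs = [["cheap", "pills", "offer"], ["offer", "win", "prize"]]
      neg_docs = [["meeting", "agenda"], ["project", "report", "meeting"]]

      def term_prob(term, docs):
          counts = Counter(t for d in docs for t in d)
          total = sum(counts.values())
          return (counts[term] + 1) / (total + len(counts))  # add-one smoothing

      def log_tf_trr(term, tf):
          # Odds-style ratio of the term's positive vs. negative class probability.
          trr = term_prob(term, pos_docs) / term_prob(term, neg_docs)
          return math.log(1 + tf) * math.log(2 + trr)

      print(log_tf_trr("offer", tf=3))    # boosted: concentrated in the positive class
      print(log_tf_trr("meeting", tf=3))  # dampened: concentrated in the negative class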
  14. Koch, T.; Vizine-Goetz, D.: Automatic classification and content navigation support for Web services : DESIRE II cooperates with OCLC (1998) 0.02
    0.023848182 = product of:
      0.11129151 = sum of:
        0.029704956 = weight(_text_:subject in 1568) [ClassicSimilarity], result of:
          0.029704956 = score(doc=1568,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.27661324 = fieldWeight in 1568, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1568)
        0.04079328 = weight(_text_:classification in 1568) [ClassicSimilarity], result of:
          0.04079328 = score(doc=1568,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42661208 = fieldWeight in 1568, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1568)
        0.04079328 = weight(_text_:classification in 1568) [ClassicSimilarity], result of:
          0.04079328 = score(doc=1568,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42661208 = fieldWeight in 1568, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1568)
      0.21428572 = coord(3/14)
    
    Abstract
    Emerging standards in knowledge representation and organization are preparing the way for distributed vocabulary support in Internet search services. NetLab researchers are exploring several innovative solutions for searching and browsing in the subject-based Internet gateway, Electronic Engineering Library, Sweden (EELS). The implementation of the EELS service is described; specifically, the generation of the robot-gathered database 'All Engineering' and the automated application of the Ei thesaurus and classification scheme. NetLab and OCLC researchers are collaborating to investigate advanced solutions to automated classification in the DESIRE II context. A plan for furthering the development of distributed vocabulary support in Internet search services is offered.
  15. Lindholm, J.; Schönthal, T.; Jansson , K.: Experiences of harvesting Web resources in engineering using automatic classification (2003) 0.02
    0.023588596 = product of:
      0.110080115 = sum of:
        0.033948522 = weight(_text_:subject in 4088) [ClassicSimilarity], result of:
          0.033948522 = score(doc=4088,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.31612942 = fieldWeight in 4088, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0625 = fieldNorm(doc=4088)
        0.0380658 = weight(_text_:classification in 4088) [ClassicSimilarity], result of:
          0.0380658 = score(doc=4088,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.39808834 = fieldWeight in 4088, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0625 = fieldNorm(doc=4088)
        0.0380658 = weight(_text_:classification in 4088) [ClassicSimilarity], result of:
          0.0380658 = score(doc=4088,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.39808834 = fieldWeight in 4088, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0625 = fieldNorm(doc=4088)
      0.21428572 = coord(3/14)
    
    Abstract
    The authors describe the background and the work involved in setting up Engine-e, a Web index that uses automatic classification as a means of selecting resources in engineering. Considerations in offering a robot-generated Web index as a successor to a manually indexed, quality-controlled subject gateway are also discussed.
  16. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.02
    0.023410352 = product of:
      0.10924831 = sum of:
        0.04037488 = weight(_text_:classification in 316) [ClassicSimilarity], result of:
          0.04037488 = score(doc=316,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 316, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=316)
        0.02849856 = product of:
          0.05699712 = sum of:
            0.05699712 = weight(_text_:schemes in 316) [ClassicSimilarity], result of:
              0.05699712 = score(doc=316,freq=2.0), product of:
                0.16067243 = queryWeight, product of:
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.03002521 = queryNorm
                0.35474116 = fieldWeight in 316, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.3512506 = idf(docFreq=569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=316)
          0.5 = coord(1/2)
        0.04037488 = weight(_text_:classification in 316) [ClassicSimilarity], result of:
          0.04037488 = score(doc=316,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 316, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=316)
      0.21428572 = coord(3/14)
    
    Abstract
    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increase, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC) [10], within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR).
  17. Koch, T.; Vizine-Goetz, D.: DDC and knowledge organization in the digital library : Research and development. Demonstration pages (1999) 0.02
    0.02275953 = product of:
      0.10621114 = sum of:
        0.02546139 = weight(_text_:subject in 942) [ClassicSimilarity], result of:
          0.02546139 = score(doc=942,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.23709705 = fieldWeight in 942, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=942)
        0.04037488 = weight(_text_:classification in 942) [ClassicSimilarity], result of:
          0.04037488 = score(doc=942,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 942, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=942)
        0.04037488 = weight(_text_:classification in 942) [ClassicSimilarity], result of:
          0.04037488 = score(doc=942,freq=8.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 942, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=942)
      0.21428572 = coord(3/14)
    
    Abstract
    The workshop offers an insight into current research and development on knowledge organization in digital libraries. Diane Vizine-Goetz of the OCLC Office of Research in Dublin, Ohio, presents OCLC's research projects on adapting and further developing the Dewey Decimal Classification as a knowledge organization instrument for large digital document collections. Traugott Koch, NetLab, Lund University, Sweden, demonstrates the approaches and solutions of the EU project DESIRE for the use of intellectual and, above all, automatic classification in subject information services on the Internet.
    Content
    1. Increased Importance of Knowledge Organization in Internet Services - 2. Quality Subject Service and the role of classification - 3. Developing the DDC into a knowledge organization instrument for the digital library. OCLC site - 4. DESIRE's Barefoot Solutions of Automatic Classification - 5. Advanced Classification Solutions in DESIRE and CORC - 6. Future directions of research and development - 7. General references
  18. Hagedorn, K.; Chapman, S.; Newman, D.: Enhancing search and browse using automated clustering of subject metadata (2007) 0.02
    0.022701254 = product of:
      0.10593919 = sum of:
        0.036007844 = weight(_text_:subject in 1168) [ClassicSimilarity], result of:
          0.036007844 = score(doc=1168,freq=4.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.33530587 = fieldWeight in 1168, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=1168)
        0.03496567 = weight(_text_:classification in 1168) [ClassicSimilarity], result of:
          0.03496567 = score(doc=1168,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 1168, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1168)
        0.03496567 = weight(_text_:classification in 1168) [ClassicSimilarity], result of:
          0.03496567 = score(doc=1168,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 1168, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=1168)
      0.21428572 = coord(3/14)
    
    Abstract
    The Web puzzle of online information resources often hinders end-users from effective and efficient access to these resources. Clustering resources into appropriate subject-based groupings may help alleviate these difficulties, but will it work with heterogeneous material? The University of Michigan and the University of California Irvine joined forces to test automatically enhancing metadata records using the Topic Modeling algorithm on the varied OAIster corpus. We created labels for the resulting clusters of metadata records, matched the clusters to an in-house classification system, and developed a prototype that would showcase methods for search and retrieval using the enhanced records. Results indicated that while the algorithm was somewhat time-intensive to run and using a local classification scheme had its drawbacks, precise clustering of records was achieved and the prototype interface proved that faceted classification could be powerful in helping end-users find resources.
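    The workflow can be miniaturized as follows: fit a topic model over metadata records and treat each record's dominant topic as its cluster, to be labelled and mapped afterwards. LDA via scikit-learn and the four toy records are illustrative assumptions; the study's Topic Modeling algorithm ran over the far larger OAIster corpus.

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import LatentDirichletAllocation

      records = ["civil war letters and diaries",
                 "confederate army correspondence",
                 "marine biology specimen photographs",
                 "deep sea fish species images"]

      vec = CountVectorizer(stop_words="english")
      X = vec.fit_transform(records)
      lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

      clusters = lda.transform(X).argmax(axis=1)  # dominant topic per record
      print(list(clusters))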
  19. Golub, K.: Automated subject classification of textual web documents (2006) 0.02
    0.022207009 = product of:
      0.1036327 = sum of:
        0.021217827 = weight(_text_:subject in 5600) [ClassicSimilarity], result of:
          0.021217827 = score(doc=5600,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.19758089 = fieldWeight in 5600, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5600)
        0.041207436 = weight(_text_:classification in 5600) [ClassicSimilarity], result of:
          0.041207436 = score(doc=5600,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.43094325 = fieldWeight in 5600, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5600)
        0.041207436 = weight(_text_:classification in 5600) [ClassicSimilarity], result of:
          0.041207436 = score(doc=5600,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.43094325 = fieldWeight in 5600, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5600)
      0.21428572 = coord(3/14)
    
    Abstract
    Purpose: To provide an integrated perspective on similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and to point to problems with these approaches and with automated classification as such.
    Design/methodology/approach: A range of works dealing with automated classification of full-text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application, and characteristics of web pages.
    Findings: Provides major similarities and differences between the three approaches: document pre-processing and the utilization of web-specific document characteristics are common to all of them; major differences lie in the applied algorithms, in the employment or not of the vector space model, and in the use of controlled vocabularies. Problems of automated classification are recognized.
    Research limitations/implications: The paper does not attempt to provide an exhaustive bibliography of related resources.
    Practical implications: As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community gain information on how similar tasks are conducted in other communities.
    Originality/value: To the author's knowledge, no review paper on automated text classification has attempted to discuss more than one community's approach from an integrated perspective.
  20. Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.02
    0.021961067 = product of:
      0.10248498 = sum of:
        0.045140486 = weight(_text_:classification in 2760) [ClassicSimilarity], result of:
          0.045140486 = score(doc=2760,freq=10.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4720747 = fieldWeight in 2760, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2760)
        0.045140486 = weight(_text_:classification in 2760) [ClassicSimilarity], result of:
          0.045140486 = score(doc=2760,freq=10.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.4720747 = fieldWeight in 2760, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2760)
        0.0122040035 = product of:
          0.024408007 = sum of:
            0.024408007 = weight(_text_:22 in 2760) [ClassicSimilarity], result of:
              0.024408007 = score(doc=2760,freq=2.0), product of:
                0.10514317 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03002521 = queryNorm
                0.23214069 = fieldWeight in 2760, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2760)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
    Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.
    Date
    22. 3.2009 19:11:54
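    The COD idea the abstract describes can be rendered as a toy rule: a category's context of discussion is collected from its ancestor categories, and a document is admitted only if it matches both the category's own terms and that context. The hierarchy, term sets, and matching rule below are invented for illustration, not the CRHTC algorithm itself.

      hierarchy = {"ball games": None, "football": "ball games"}
      category_terms = {
          "ball games": {"ball", "team", "match"},
          "football": {"goal", "penalty", "keeper"},
      }

      def cod(category):
          # Collect the terms of all ancestor categories as the COD.
          terms, parent = set(), hierarchy[category]
          while parent is not None:
              terms |= category_terms[parent]
              parent = hierarchy[parent]
          return terms

      def matches(doc_tokens, category):
          tokens = set(doc_tokens)
          return bool(tokens & category_terms[category]) and bool(tokens & cod(category))

      print(matches(["late", "goal", "decides", "match"], "football"))     # True
      print(matches(["goal", "setting", "in", "management"], "football"))  # False: off-context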

Languages

  • e 142
  • d 7
  • a 1

Types

  • a 133
  • el 21
  • s 2
  • m 1
  • r 1
  • x 1