Search (20 results, page 1 of 1)

  • type_ss:"el"
  • theme_ss:"Automatisches Klassifizieren"
  1. Automatic classification research at OCLC (2002) 0.06
    0.060138173 = product of:
      0.12027635 = sum of:
        0.12027635 = sum of:
          0.070831776 = weight(_text_:classification in 1563) [ClassicSimilarity], result of:
            0.070831776 = score(doc=1563,freq=6.0), product of:
              0.16603322 = queryWeight, product of:
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.05213454 = queryNorm
              0.42661208 = fieldWeight in 1563, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.1847067 = idf(docFreq=4974, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1563)
          0.04944457 = weight(_text_:22 in 1563) [ClassicSimilarity], result of:
            0.04944457 = score(doc=1563,freq=2.0), product of:
              0.18256627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.05213454 = queryNorm
              0.2708308 = fieldWeight in 1563, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1563)
      0.5 = coord(1/2)
    
    Abstract
    OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged and classified. Accordingly, OCLC has developed products, sponsored research projects, and encouraged participation in international standards communities whose outcome has been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
    Date
    5. 5.2003 9:22:09
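The score breakdown shown for result 1 can be reproduced by hand. The following is a minimal sketch, assuming the standard Lucene ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are hypothetical, and queryNorm, fieldNorm and the coord factor are copied from the listing rather than derived.

```python
import math

def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as used by Lucene's ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float) -> float:
    """Score of one query term in one document: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    tf = math.sqrt(freq)                    # tf(freq) in the listing
    query_weight = idf * query_norm         # queryWeight in the listing
    field_weight = tf * idf * field_norm    # fieldWeight in the listing
    return query_weight * field_weight

# Values copied from the explain tree of result 1 (doc 1563).
MAX_DOCS = 44218
QUERY_NORM = 0.05213454
FIELD_NORM = 0.0546875
COORD = 0.5  # coord(1/2), taken as given from the listing

score_classification = term_score(6.0, 4974, MAX_DOCS, QUERY_NORM, FIELD_NORM)
score_22 = term_score(2.0, 3622, MAX_DOCS, QUERY_NORM, FIELD_NORM)
total = (score_classification + score_22) * COORD

print(f"classification: {score_classification:.6f}")
print(f"22:             {score_22:.6f}")
print(f"total:          {total:.6f}")
```

The results should agree with the listed values to about six decimal places; small differences are expected because Lucene computes in single precision. The same functions apply to the other results, where fieldNorm explains why a record with a higher term frequency can still rank lower.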
  2. Koch, T.; Ardö, A.: Automatic classification of full-text HTML-documents from one specific subject area : DESIRE II D3.6a, Working Paper 2 (2000) 0.03
    0.028620359 = product of:
      0.057240717 = sum of:
        0.057240717 = product of:
          0.114481434 = sum of:
            0.114481434 = weight(_text_:classification in 1667) [ClassicSimilarity], result of:
              0.114481434 = score(doc=1667,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.6895092 = fieldWeight in 1667, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1667)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Content
    1 Introduction / 2 Method overview / 3 Ei thesaurus preprocessing / 4 Automatic classification process: 4.1 Matching -- 4.2 Weighting -- 4.3 Preparation for display / 5 Results of the classification process / 6 Evaluations / 7 Software / 8 Other applications / 9 Experiments with universal classification systems / References / Appendix A: Ei classification service: Software / Appendix B: Use of the classification software as subject filter in a WWW harvester.
  3. Yi, K.: Challenges in automated classification using library classification schemes (2006) 0.03
    0.028620359 = product of:
      0.057240717 = sum of:
        0.057240717 = product of:
          0.114481434 = sum of:
            0.114481434 = weight(_text_:classification in 5810) [ClassicSimilarity], result of:
              0.114481434 = score(doc=5810,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.6895092 = fieldWeight in 5810, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5810)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    A major library classification scheme has long been a standard classification framework for information sources in the traditional library environment, and text classification (TC) has become a popular and attractive tool for organizing digital information. This paper gives an overview of previous projects and studies on TC using major library classification schemes, and summarizes a discussion of TC research challenges.
  4. Prabowo, R.; Jackson, M.; Burden, P.; Knoell, H.-D.: Ontology-based automatic classification for the Web pages : design, implementation and evaluation (2002) 0.02
    0.02146527 = product of:
      0.04293054 = sum of:
        0.04293054 = product of:
          0.08586108 = sum of:
            0.08586108 = weight(_text_:classification in 3383) [ClassicSimilarity], result of:
              0.08586108 = score(doc=3383,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.5171319 = fieldWeight in 3383, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3383)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    In recent years, we have witnessed continual growth in the use of ontologies to provide a mechanism for machine reasoning. This paper describes an automatic classifier, which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion, and mapped into DDC and LCC. Secondly, we propose the formal definition of a DDC-LCC and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, an experiment in which the accuracy of the classifier was evaluated is presented. The experiment shows that our approach results in improved classification in terms of accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.
  5. Godby, C. J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization (2001) 0.02
    0.02023765 = product of:
      0.0404753 = sum of:
        0.0404753 = product of:
          0.0809506 = sum of:
            0.0809506 = weight(_text_:classification in 1567) [ClassicSimilarity], result of:
              0.0809506 = score(doc=1567,freq=6.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.48755667 = fieldWeight in 1567, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1567)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
  6. Autonomy, Inc.: Automatic classification (n.d.) 0.02
    0.02023765 = product of:
      0.0404753 = sum of:
        0.0404753 = product of:
          0.0809506 = sum of:
            0.0809506 = weight(_text_:classification in 1666) [ClassicSimilarity], result of:
              0.0809506 = score(doc=1666,freq=6.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.48755667 = fieldWeight in 1666, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1666)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Autonomy's Classification solutions remove the necessity for organizations to rely on human intervention or manual processing of information, such as manual tagging, typically required to make most other e-business applications work. Autonomy's ability to consistently and accurately classify data automatically is a unique infrastructure solution that overcomes the predicaments surrounding the exponential growth of unstructured data.
    Source
    http://www.autonomy.com/Content/Products/IDOL/f/Classification#01
  7. Adams, K.C.: Word wranglers : Automatic classification tools transform enterprise documents from "bags of words" into knowledge resources (2003) 0.02
    0.017887725 = product of:
      0.03577545 = sum of:
        0.03577545 = product of:
          0.0715509 = sum of:
            0.0715509 = weight(_text_:classification in 1665) [ClassicSimilarity], result of:
              0.0715509 = score(doc=1665,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.43094325 = fieldWeight in 1665, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1665)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Taxonomies are an important part of any knowledge management (KM) system, and automatic classification software is emerging as a "killer app" for consumer and enterprise portals. A number of companies such as Inxight Software, Mohomine, Metacode, and others claim to interpret the semantic content of any textual document and automatically classify text on the fly. The promise that software could automatically produce a Yahoo-style directory is a siren call not many IT managers are able to resist. KM needs have grown more complex due to the increasing amount of digital information, the declining effectiveness of keyword searching, and heterogeneous document formats in corporate databases. This environment requires innovative KM tools, and automatic classification technology is an example of this new kind of software. These products can be divided into three categories according to their underlying technology - rules-based, catalog-by-example, and statistical clustering. Evolving trends in this market include framing classification as a cyborg (computer- and human-based) activity and the increasing use of extensible markup language (XML) and support vector machine (SVM) technology. In this article, we'll survey the rapidly changing automatic classification software market and examine the features and capabilities of leading classification products.
  8. Wartena, C.; Sommer, M.: Automatic classification of scientific records using the German Subject Heading Authority File (SWD) (2012) 0.02
    0.017887725 = product of:
      0.03577545 = sum of:
        0.03577545 = product of:
          0.0715509 = sum of:
            0.0715509 = weight(_text_:classification in 472) [ClassicSimilarity], result of:
              0.0715509 = score(doc=472,freq=12.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.43094325 = fieldWeight in 472, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=472)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The following paper deals with an automatic text classification method which does not require training documents. For this method the German Subject Heading Authority File (SWD), provided by the linked data service of the German National Library, is used. Recently the SWD was enriched with notations of the Dewey Decimal Classification (DDC). In consequence it became possible to utilize the subject headings as textual representations for the notations of the DDC. Basically, we derive the classification of a text from the classification of the words in the text given by the thesaurus. The method was tested by classifying 3826 OAI-Records from 7 different repositories. Mean reciprocal rank and recall were chosen as evaluation measures. Direct comparison to a machine learning method has shown that this method is definitely competitive. Thus we can conclude that the enriched version of the SWD provides high quality information with a broad coverage for classification of German scientific articles.
  9. Koch, T.; Vizine-Goetz, D.: Automatic classification and content navigation support for Web services : DESIRE II cooperates with OCLC (1998) 0.02
    0.017707944 = product of:
      0.035415888 = sum of:
        0.035415888 = product of:
          0.070831776 = sum of:
            0.070831776 = weight(_text_:classification in 1568) [ClassicSimilarity], result of:
              0.070831776 = score(doc=1568,freq=6.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.42661208 = fieldWeight in 1568, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1568)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Emerging standards in knowledge representation and organization are preparing the way for distributed vocabulary support in Internet search services. NetLab researchers are exploring several innovative solutions for searching and browsing in the subject-based Internet gateway, Electronic Engineering Library, Sweden (EELS). The implementation of the EELS service is described, specifically, the generation of the robot-gathered database 'All' engineering and the automated application of the Ei thesaurus and classification scheme. NetLab and OCLC researchers are collaborating to investigate advanced solutions to automated classification in the DESIRE II context. A plan for furthering the development of distributed vocabulary support in Internet search services is offered.
  10. Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.02
    0.017658776 = product of:
      0.03531755 = sum of:
        0.03531755 = product of:
          0.0706351 = sum of:
            0.0706351 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
              0.0706351 = score(doc=611,freq=2.0), product of:
                0.18256627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05213454 = queryNorm
                0.38690117 = fieldWeight in 611, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.078125 = fieldNorm(doc=611)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    22. 8.2009 12:54:24
  11. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.02
    0.01752632 = product of:
      0.03505264 = sum of:
        0.03505264 = product of:
          0.07010528 = sum of:
            0.07010528 = weight(_text_:classification in 316) [ClassicSimilarity], result of:
              0.07010528 = score(doc=316,freq=8.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.42223644 = fieldWeight in 316, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=316)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC) [10], within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR).
  12. Koch, T.; Vizine-Goetz, D.: DDC and knowledge organization in the digital library : Research and development. Demonstration pages (1999) 0.02
    0.01752632 = product of:
      0.03505264 = sum of:
        0.03505264 = product of:
          0.07010528 = sum of:
            0.07010528 = weight(_text_:classification in 942) [ClassicSimilarity], result of:
              0.07010528 = score(doc=942,freq=8.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.42223644 = fieldWeight in 942, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=942)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The workshop gives an insight into current research and development on knowledge organization in digital libraries. Diane Vizine-Goetz of the OCLC Office of Research in Dublin, Ohio, presents OCLC's research projects on adapting and further developing the Dewey Decimal Classification as a knowledge organization instrument for large digital document collections. Traugott Koch, NetLab, Lund University in Sweden, demonstrates the approaches and solutions of the EU project DESIRE for the use of intellectual and, above all, automatic classification in subject information services on the Internet.
    Content
    1. Increased Importance of Knowledge Organization in Internet Services - 2. Quality Subject Service and the role of classification - 3. Developing the DDC into a knowledge organization instrument for the digital library. OCLC site - 4. DESIRE's Barefoot Solutions of Automatic Classification - 5. Advanced Classification Solutions in DESIRE and CORC - 6. Future directions of research and development - 7. General references
  13. Lindholm, J.; Schönthal, T.; Jansson, K.: Experiences of harvesting Web resources in engineering using automatic classification (2003) 0.02
    0.016523972 = product of:
      0.033047944 = sum of:
        0.033047944 = product of:
          0.06609589 = sum of:
            0.06609589 = weight(_text_:classification in 4088) [ClassicSimilarity], result of:
              0.06609589 = score(doc=4088,freq=4.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.39808834 = fieldWeight in 4088, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0625 = fieldNorm(doc=4088)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The authors describe the background and the work involved in setting up Engine-e, a Web index that uses automatic classification as a means for the selection of resources in Engineering. Considerations in offering a robot-generated Web index as a successor to a manually indexed quality-controlled subject gateway are also discussed.
  14. Hagedorn, K.; Chapman, S.; Newman, D.: Enhancing search and browse using automated clustering of subject metadata (2007) 0.02
    0.015178238 = product of:
      0.030356476 = sum of:
        0.030356476 = product of:
          0.060712952 = sum of:
            0.060712952 = weight(_text_:classification in 1168) [ClassicSimilarity], result of:
              0.060712952 = score(doc=1168,freq=6.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.3656675 = fieldWeight in 1168, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1168)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The Web puzzle of online information resources often hinders end-users from effective and efficient access to these resources. Clustering resources into appropriate subject-based groupings may help alleviate these difficulties, but will it work with heterogeneous material? The University of Michigan and the University of California Irvine joined forces to test automatically enhancing metadata records using the Topic Modeling algorithm on the varied OAIster corpus. We created labels for the resulting clusters of metadata records, matched the clusters to an in-house classification system, and developed a prototype that would showcase methods for search and retrieval using the enhanced records. Results indicated that while the algorithm was somewhat time-intensive to run and using a local classification scheme had its drawbacks, precise clustering of records was achieved and the prototype interface proved that faceted classification could be powerful in helping end-users find resources.
  15. Shafer, K.E.: Evaluating Scorpion results (1998) 0.01
    0.014605265 = product of:
      0.02921053 = sum of:
        0.02921053 = product of:
          0.05842106 = sum of:
            0.05842106 = weight(_text_:classification in 1569) [ClassicSimilarity], result of:
              0.05842106 = score(doc=1569,freq=2.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.35186368 = fieldWeight in 1569, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.078125 = fieldNorm(doc=1569)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Scorpion is a research project at OCLC that builds tools for automatic subject assignment by combining library science and information retrieval techniques. A thesis of Scorpion is that the Dewey Decimal Classification (Dewey) can be used to perform automatic subject assignment for electronic items.
  16. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.01
    0.013855773 = product of:
      0.027711546 = sum of:
        0.027711546 = product of:
          0.055423092 = sum of:
            0.055423092 = weight(_text_:classification in 1253) [ClassicSimilarity], result of:
              0.055423092 = score(doc=1253,freq=20.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.33380723 = fieldWeight in 1253, product of:
                  4.472136 = tf(freq=20.0), with freq of:
                    20.0 = termFreq=20.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0234375 = fieldNorm(doc=1253)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources as well as within a single source.
    That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to automatically extract the requisite collection metadata that must be distributed.
    We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a `training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. 
    The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
  17. Koch, T.; Ardö, A.; Noodén, L.: ¬The construction of a robot-generated subject index : DESIRE II D3.6a, Working Paper 1 (1999) 0.01
    0.00876316 = product of:
      0.01752632 = sum of:
        0.01752632 = product of:
          0.03505264 = sum of:
            0.03505264 = weight(_text_:classification in 1668) [ClassicSimilarity], result of:
              0.03505264 = score(doc=1668,freq=2.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.21111822 = fieldWeight in 1668, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1668)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
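    The arithmetic in the score breakdown above can be retraced with a short sketch. It assumes the standard formulas of Lucene's ClassicSimilarity (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))); the function name is hypothetical, and queryNorm and fieldNorm are simply copied from the listing rather than derived.

```python
import math


def classic_term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Recompute one weight(...) node of a ClassicSimilarity explain tree."""
    tf = math.sqrt(freq)                               # 1.4142135 for freq=2.0
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))    # about 3.1847 here
    query_weight = idf * query_norm                    # queryWeight
    field_weight = tf * idf * field_norm               # fieldWeight
    return query_weight * field_weight


# Values taken from the breakdown for doc 1668, term "classification".
term_score = classic_term_score(
    freq=2.0,
    doc_freq=4974,
    max_docs=44218,
    query_norm=0.05213454,
    field_norm=0.046875,
)

# Two coord(1/2) factors are applied on the way up the tree.
final_score = term_score * 0.5 * 0.5

print(f"term score:  {term_score:.8f}")
print(f"final score: {final_score:.8f}")
```

    Up to floating-point rounding (Lucene computes in single precision), the two printed values should agree with the 0.03505264 and 0.00876316 shown in the listing.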
    
    Abstract
    This working paper describes the creation of a test database on which to carry out the automatic classification tasks of the DESIRE II work package D3.6a. It is an improved version of NetLab's existing "All" Engineering database, created after a comparative study of the outcome of two different approaches to collecting the documents. These two methods were selected from seven different general methodologies for building robot-generated subject indices, which are presented in this paper. We found a surprisingly low overlap between the Engineering link collections we used as seed pages for the robot, and subsequently an even more surprisingly low overlap between the resources collected by the two different approaches. This was in spite of starting the harvesting process from basically the same services. An intellectual evaluation of the contents of both databases showed almost exactly the same percentage of relevant documents (77%), indicating that the main difference between those approaches was the coverage of the resulting database.
  18. Sebastiani, F.: ¬A tutorial on automated text categorisation (1999) 0.01
    0.00876316 = product of:
      0.01752632 = sum of:
        0.01752632 = product of:
          0.03505264 = sum of:
            0.03505264 = weight(_text_:classification in 3390) [ClassicSimilarity], result of:
              0.03505264 = score(doc=3390,freq=2.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.21111822 = fieldWeight in 3390, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3390)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late '80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert knowledge on how to classify documents. In the '90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest. A newer paradigm based on machine learning has superseded the previous approach. Within this paradigm, a general inductive process automatically builds a classifier by "learning", from a set of previously classified documents, the characteristics of one or more categories; the advantages are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues of document indexing, classifier construction, and classifier evaluation will be touched upon.
  19. Reiner, U.: VZG-Projekt Colibri : Bewertung von automatisch DDC-klassifizierten Titeldatensätzen der Deutschen Nationalbibliothek (DNB) (2009) 0.01
    0.0073026326 = product of:
      0.014605265 = sum of:
        0.014605265 = product of:
          0.02921053 = sum of:
            0.02921053 = weight(_text_:classification in 2675) [ClassicSimilarity], result of:
              0.02921053 = score(doc=2675,freq=2.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.17593184 = fieldWeight in 2675, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2675)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Since 2003, the VZG project Colibri/DDC has been working on automatic methods for the Dewey Decimal Classification (DDC). The project aims at uniform DDC indexing of bibliographic title records and at supporting DDC experts and DDC laypersons, e.g. in the analysis and synthesis of DDC notations, in their quality control, and in DDC-based searching. This report concentrates on the first larger automatic DDC classification run and on the first automatic and intellectual evaluation using the classification component vc_dcl1. It is based on the 25,653 title records (12 weekly/monthly deliveries) from series A, B and H of the Deutsche Nationalbibliografie, which the Deutsche Nationalbibliothek (DNB) made available in November 2007. After the automatic DDC classification and the automatic evaluation are explained in chapter 2, chapter 3 addresses the DNB report "Colibri_Auswertung_DDC_Endbericht_Sommer_2008". Issues are clarified and questions are raised whose answers will set the course for the further classification tests. Chapter 4 offers further considerations, going beyond chapter 3, on continuing the automatic DDC classification. The report is intended to deepen understanding of the automatic methods.
  20. Reiner, U.: Automatische DDC-Klassifizierung bibliografischer Titeldatensätze der Deutschen Nationalbibliografie (2009) 0.01
    0.0070635104 = product of:
      0.014127021 = sum of:
        0.014127021 = product of:
          0.028254041 = sum of:
            0.028254041 = weight(_text_:22 in 3284) [ClassicSimilarity], result of:
              0.028254041 = score(doc=3284,freq=2.0), product of:
                0.18256627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05213454 = queryNorm
                0.15476047 = fieldWeight in 3284, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3284)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    22. 1.2010 14:41:24