Search (31 results, page 1 of 2)

  • year_i:[2000 TO 2010}
  • theme_ss:"Automatisches Klassifizieren"
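  The two lines above are Solr filter queries; the half-open range syntax [2000 TO 2010} is deliberate (lower bound inclusive, upper bound exclusive), so only the years 2000 to 2009 match. A minimal sketch of how this result page could be requested, assuming a standard Solr select handler; the host, core name and paging parameters are illustrative guesses, and only the two fq values are taken from this page:

    # Hypothetical reproduction of this search as a Solr request.
    import requests

    params = {
        "q": "*:*",
        "fq": ['year_i:[2000 TO 2010}',                     # 2000 inclusive, 2010 exclusive
               'theme_ss:"Automatisches Klassifizieren"'],  # theme facet filter
        "rows": 20,           # 20 hits per page, so 31 results span 2 pages
        "start": 0,           # offset 0 = page 1
        "debugQuery": "true"  # emits the score explanations shown below
    }
    r = requests.get("http://localhost:8983/solr/literature/select", params=params)
    print(r.json()["response"]["numFound"])   # expected: 31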
  1. Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.10
    0.103261515 = product of:
      0.13768202 = sum of:
        0.060314562 = weight(_text_:digital in 2560) [ClassicSimilarity], result of:
          0.060314562 = score(doc=2560,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.30507088 = fieldWeight in 2560, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
        0.053599782 = weight(_text_:library in 2560) [ClassicSimilarity], result of:
          0.053599782 = score(doc=2560,freq=8.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.40671125 = fieldWeight in 2560, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
        0.023767682 = product of:
          0.047535364 = sum of:
            0.047535364 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
              0.047535364 = score(doc=2560,freq=2.0), product of:
                0.17551683 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050121464 = queryNorm
                0.2708308 = fieldWeight in 2560, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2560)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
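    The tree above is Lucene's ClassicSimilarity (TF-IDF) explanation: each term weight is queryWeight x fieldWeight, where queryWeight = idf x queryNorm and fieldWeight = sqrt(tf) x idf x fieldNorm, and coord() scales the sum by the fraction of query clauses that matched. A minimal sketch that re-derives the reported score; the constants are copied from the tree above, only the helper function is mine:

      # Recomputing the score of result 1 from the explain values above.
      import math

      QUERY_NORM = 0.050121464

      def term_weight(freq, idf, field_norm):
          query_weight = idf * QUERY_NORM                    # idf * queryNorm
          field_weight = math.sqrt(freq) * idf * field_norm  # tf * idf * fieldNorm
          return query_weight * field_weight

      w_digital = term_weight(2.0, 3.944552, 0.0546875)      # ~0.060314562
      w_library = term_weight(8.0, 2.6293786, 0.0546875)     # ~0.053599782
      w_22 = term_weight(2.0, 3.5018296, 0.0546875) * 0.5    # inner coord(1/2)

      print((w_digital + w_library + w_22) * 0.75)           # coord(3/4) -> ~0.103261515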
    
    Abstract
    The proliferation of digital resources and their integration into traditional library settings have created a pressing need for automated tools that organize textual information according to library classification schemes. Automated text classification is the research field that develops tools, methods, and models for automating this task. This article describes the currently popular approaches to text classification and the major text classification projects and applications that are based on library classification schemes. Related issues and challenges are discussed, and a number of considerations for addressing those challenges are examined.
    Date
    22. 9.2008 18:31:54
  2. Yi, K.: Challenges in automated classification using library classification schemes (2006) 0.07
    0.065093905 = product of:
      0.13018781 = sum of:
        0.068930924 = weight(_text_:digital in 5810) [ClassicSimilarity], result of:
          0.068930924 = score(doc=5810,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.34865242 = fieldWeight in 5810, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.0625 = fieldNorm(doc=5810)
        0.061256893 = weight(_text_:library in 5810) [ClassicSimilarity], result of:
          0.061256893 = score(doc=5810,freq=8.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.46481284 = fieldWeight in 5810, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.0625 = fieldNorm(doc=5810)
      0.5 = coord(2/4)
    
    Abstract
    Major library classification schemes have long been the standard classification frameworks for information sources in the traditional library environment, while text classification (TC) has become a popular and attractive tool for organizing digital information. This paper gives an overview of previous projects and studies on TC using major library classification schemes and discusses the research challenges TC faces.
  3. Pong, J.Y.-H.; Kwok, R.C.-W.; Lau, R.Y.-K.; Hao, J.-X.; Wong, P.C.-C.: A comparative study of two automatic document classification methods in a library setting (2008) 0.05
    0.051808305 = product of:
      0.10361661 = sum of:
        0.043081827 = weight(_text_:digital in 2532) [ClassicSimilarity], result of:
          0.043081827 = score(doc=2532,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.21790776 = fieldWeight in 2532, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2532)
        0.060534786 = weight(_text_:library in 2532) [ClassicSimilarity], result of:
          0.060534786 = score(doc=2532,freq=20.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.45933354 = fieldWeight in 2532, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2532)
      0.5 = coord(2/4)
    
    Abstract
    In current library practice, trained human experts usually carry out document cataloguing and indexing based on a manual approach. With the explosive growth in the number of electronic documents available on the Internet and in digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using just a manual approach. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of automatic document classification methods for categorizing library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine-learning-based automatic document classification system to alleviate the manual categorization problem encountered in the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system that enhances current library practice. Moreover, we make some concrete recommendations on how to practically apply the KNN algorithm to automatic document classification in a library setting. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used LCC classification scheme adopted by many large libraries.
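    As a rough illustration of the approach described above (a sketch in its spirit, not the authors' actual system), a TF-IDF representation combined with a k-nearest-neighbours classifier takes only a few lines; the toy records, the LCC-style labels and the choice of k are invented:

      # Minimal KNN text classification sketch; training data is invented.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.pipeline import make_pipeline

      train_texts = ["introduction to linear algebra",
                     "organic chemistry laboratory methods",
                     "a political history of medieval europe",
                     "matrices and applied linear algebra"]
      train_labels = ["QA", "QD", "D", "QA"]   # hypothetical LCC main classes

      model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
      model.fit(train_texts, train_labels)
      print(model.predict(["applied linear algebra for engineers"]))  # likely ['QA']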
  4. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.05
    0.04998924 = product of:
      0.09997848 = sum of:
        0.079606175 = product of:
          0.23881851 = sum of:
            0.23881851 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.23881851 = score(doc=562,freq=2.0), product of:
                0.42493033 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.050121464 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.33333334 = coord(1/3)
        0.0203723 = product of:
          0.0407446 = sum of:
            0.0407446 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
              0.0407446 = score(doc=562,freq=2.0), product of:
                0.17551683 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050121464 = queryNorm
                0.23214069 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Content
    Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf.
    Date
    8. 1.2013 10:22:32
  5. Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.04
    0.043343633 = product of:
      0.08668727 = sum of:
        0.045942668 = weight(_text_:library in 1046) [ClassicSimilarity], result of:
          0.045942668 = score(doc=1046,freq=2.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.34860963 = fieldWeight in 1046, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.09375 = fieldNorm(doc=1046)
        0.0407446 = product of:
          0.0814892 = sum of:
            0.0814892 = weight(_text_:22 in 1046) [ClassicSimilarity], result of:
              0.0814892 = score(doc=1046,freq=2.0), product of:
                0.17551683 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050121464 = queryNorm
                0.46428138 = fieldWeight in 1046, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=1046)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Date
    5. 5.2003 14:17:22
    Source
    Journal of library administration. 34(2001) nos.3/4, pp.221-228
  6. Reiner, U.: DDC-based search in the data of the German National Bibliography (2008) 0.03
    0.02628516 = product of:
      0.05257032 = sum of:
        0.022971334 = weight(_text_:library in 2166) [ClassicSimilarity], result of:
          0.022971334 = score(doc=2166,freq=2.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.17430481 = fieldWeight in 2166, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.046875 = fieldNorm(doc=2166)
        0.029598987 = product of:
          0.059197973 = sum of:
            0.059197973 = weight(_text_:project in 2166) [ClassicSimilarity], result of:
              0.059197973 = score(doc=2166,freq=2.0), product of:
                0.21156175 = queryWeight, product of:
                  4.220981 = idf(docFreq=1764, maxDocs=44218)
                  0.050121464 = queryNorm
                0.27981415 = fieldWeight in 2166, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.220981 = idf(docFreq=1764, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2166)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    In 2004, the German National Library began to classify title records of the German National Bibliography according to subject groups based on the divisions of the Dewey Decimal Classification (DDC). Since 2006, all titles of the main series of the German National Bibliography have been classified in strict compliance with the DDC. On this basis, an enhanced DDC-based search can be realized, e.g., searching the data of the German National Bibliography for title records using number components of synthesized classification numbers, or searching for DDC numbers using unclassified title records. This paper gives an account of the current research and development of the DDC-based search. The work is conducted in the VZG project Colibri, which focuses on the automatic analysis of DDC-synthesized numbers and the automatic classification of bibliographic title records.
  7. Automatic classification research at OCLC (2002) 0.03
    0.025283787 = product of:
      0.050567575 = sum of:
        0.026799891 = weight(_text_:library in 1563) [ClassicSimilarity], result of:
          0.026799891 = score(doc=1563,freq=2.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.20335563 = fieldWeight in 1563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1563)
        0.023767682 = product of:
          0.047535364 = sum of:
            0.047535364 = weight(_text_:22 in 1563) [ClassicSimilarity], result of:
              0.047535364 = score(doc=1563,freq=2.0), product of:
                0.17551683 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050121464 = queryNorm
                0.2708308 = fieldWeight in 1563, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1563)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged, and classified. Accordingly, OCLC has developed products, sponsored research projects, and encouraged participation in international standards communities whose outcomes have been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
    Date
    5. 5.2003 9:22:09
  8. Cui, H.; Heidorn, P.B.; Zhang, H.: ¬An approach to automatic classification of text for information retrieval (2002) 0.02
    0.015078641 = product of:
      0.060314562 = sum of:
        0.060314562 = weight(_text_:digital in 174) [ClassicSimilarity], result of:
          0.060314562 = score(doc=174,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.30507088 = fieldWeight in 174, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.0546875 = fieldNorm(doc=174)
      0.25 = coord(1/4)
    
    Source
    Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries : JCDL 2002 ; July 14 - 18, 2002, Portland, Oregon, USA. Ed. by Gary Marchionini
  9. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.01
    0.012924549 = product of:
      0.051698197 = sum of:
        0.051698197 = weight(_text_:digital in 3389) [ClassicSimilarity], result of:
          0.051698197 = score(doc=3389,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.26148933 = fieldWeight in 3389, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.046875 = fieldNorm(doc=3389)
      0.25 = coord(1/4)
    
    Abstract
    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
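    Of the three problems the survey names, classifier evaluation is the easiest to make concrete. A toy sketch of the usual per-category bookkeeping and micro-averaged precision, recall and F1; all counts are invented:

      # Micro-averaged evaluation over per-category confusion counts.
      categories = {                 # category: (TP, FP, FN), invented numbers
          "sports":   (80, 10, 20),
          "politics": (45,  5, 15),
      }
      tp = sum(c[0] for c in categories.values())
      fp = sum(c[1] for c in categories.values())
      fn = sum(c[2] for c in categories.values())
      precision = tp / (tp + fp)                        # 125/140 ~ 0.893
      recall = tp / (tp + fn)                           # 125/160 ~ 0.781
      f1 = 2 * precision * recall / (precision + recall)
      print(round(precision, 3), round(recall, 3), round(f1, 3))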
  10. Golub, K.; Hamon, T.; Ardö, A.: Automated classification of textual documents based on a controlled vocabulary in engineering (2007) 0.01
    0.012924549 = product of:
      0.051698197 = sum of:
        0.051698197 = weight(_text_:digital in 1461) [ClassicSimilarity], result of:
          0.051698197 = score(doc=1461,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.26148933 = fieldWeight in 1461, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.046875 = fieldNorm(doc=1461)
      0.25 = coord(1/4)
    
    Abstract
    Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to the rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents - instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
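    The string-matching approach lends itself to a compact sketch: controlled-vocabulary terms are matched against title and abstract, hits are weighted per field, and classes scoring below a cut-off are dropped. The vocabulary, field weights and cut-off below are illustrative assumptions, not values from the paper:

      # Vocabulary-based classification sketch; all terms, class captions,
      # weights and the cutoff are invented for illustration.
      import re
      from collections import Counter

      vocabulary = {                        # controlled term -> class caption
          "heat exchangers": "641.1 Heat and thermodynamics",
          "fluid dynamics":  "631.1 Fluid flow",
          "turbulence":      "631.1 Fluid flow",
      }
      field_weights = {"title": 3.0, "abstract": 1.0}

      def classify(record, cutoff=2.0):
          scores = Counter()
          for field, weight in field_weights.items():
              text = record.get(field, "").lower()
              for term, cls in vocabulary.items():
                  scores[cls] += weight * len(re.findall(re.escape(term), text))
          return [(cls, s) for cls, s in scores.most_common() if s >= cutoff]

      record = {"title": "Turbulence in heat exchangers",
                "abstract": "Fluid dynamics of turbulent flow near the wall."}
      print(classify(record))   # 631.1 outranks 641.1 for this record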
  11. Frank, E.; Paynter, G.W.: Predicting Library of Congress Classifications from Library of Congress Subject Headings (2004) 0.01
    0.012841367 = product of:
      0.05136547 = sum of:
        0.05136547 = weight(_text_:library in 2218) [ClassicSimilarity], result of:
          0.05136547 = score(doc=2218,freq=10.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.38975742 = fieldWeight in 2218, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.046875 = fieldNorm(doc=2218)
      0.25 = coord(1/4)
    
    Abstract
    This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to a work given its set of Library of Congress Subject Headings (LCSH). LCCs are organized in a tree: The root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a model that maps from sets of LCSH to classifications from the LCC tree. We present empirical results for our technique, showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.
  12. Shafer, K.E.: Automatic Subject Assignment via the Scorpion System (2001) 0.01
    0.011485667 = product of:
      0.045942668 = sum of:
        0.045942668 = weight(_text_:library in 1043) [ClassicSimilarity], result of:
          0.045942668 = score(doc=1043,freq=2.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.34860963 = fieldWeight in 1043, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.09375 = fieldNorm(doc=1043)
      0.25 = coord(1/4)
    
    Source
    Journal of library administration. 34(2001) nos.1/2, pp.187-189
  13. Godby, C. J.; Stuler, J.: The Library of Congress Classification as a knowledge base for automatic subject categorization (2001) 0.01
    0.010828791 = product of:
      0.043315165 = sum of:
        0.043315165 = weight(_text_:library in 1567) [ClassicSimilarity], result of:
          0.043315165 = score(doc=1567,freq=4.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.32867232 = fieldWeight in 1567, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.0625 = fieldNorm(doc=1567)
      0.25 = coord(1/4)
    
    Abstract
    This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
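    The log-likelihood statistic mentioned here is commonly Dunning's G2 computed over a 2x2 contingency table of heading/class co-occurrence counts; the sketch below assumes that formulation (the abstract does not spell it out) and uses invented counts:

      # Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.
      import math

      def g2(k11, k12, k21, k22):
          def h(*counts):                # sum of k*log(k/N) over nonzero cells
              n = sum(counts)
              return sum(k * math.log(k / n) for k in counts if k > 0)
          return 2 * (h(k11, k12, k21, k22)
                      - h(k11 + k12, k21 + k22)    # row totals
                      - h(k11 + k21, k12 + k22))   # column totals

      # heading and class co-occur 30x, heading without class 10x,
      # class without heading 15x, neither 945x (all counts invented)
      print(g2(30, 10, 15, 945))         # large value = strong association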
  14. Adams, K.C.: Word wranglers : Automatic classification tools transform enterprise documents from "bags of words" into knowledge resources (2003) 0.01
    0.010770457 = product of:
      0.043081827 = sum of:
        0.043081827 = weight(_text_:digital in 1665) [ClassicSimilarity], result of:
          0.043081827 = score(doc=1665,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.21790776 = fieldWeight in 1665, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1665)
      0.25 = coord(1/4)
    
    Abstract
    Taxonomies are an important part of any knowledge management (KM) system, and automatic classification software is emerging as a "killer app" for consumer and enterprise portals. A number of companies such as Inxight Software, Mohomine, Metacode, and others claim to interpret the semantic content of any textual document and automatically classify text on the fly. The promise that software could automatically produce a Yahoo-style directory is a siren call not many IT managers are able to resist. KM needs have grown more complex due to the increasing amount of digital information, the declining effectiveness of keyword searching, and heterogeneous document formats in corporate databases. This environment requires innovative KM tools, and automatic classification technology is an example of this new kind of software. These products can be divided into three categories according to their underlying technology - rules-based, catalog-by-example, and statistical clustering. Evolving trends in this market include framing classification as a cyborg (computer- and human-based) activity and the increasing use of extensible markup language (XML) and support vector machine (SVM) technology. In this article, we'll survey the rapidly changing automatic classification software market and examine the features and capabilities of leading classification products.
  15. Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.01
    0.010770457 = product of:
      0.043081827 = sum of:
        0.043081827 = weight(_text_:digital in 2804) [ClassicSimilarity], result of:
          0.043081827 = score(doc=2804,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.21790776 = fieldWeight in 2804, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2804)
      0.25 = coord(1/4)
    
    Abstract
    With the increasing number of digital documents, the ability to automatically classify those documents both efficiently and accurately is becoming more critical and difficult. One of the major problems in text classification is the high dimensionality of feature space. We present the ambiguity measure (AM) feature-selection algorithm, which selects the most unambiguous features from the feature set. Unambiguous features are those features whose presence in a document indicate a strong degree of confidence that a document belongs to only one specific category. We apply AM feature selection on a naïve Bayes text classifier. We favorably show the effectiveness of our approach in outperforming eight existing feature-selection methods, using five benchmark datasets with a statistical significance of at least 95% confidence. The support vector machine (SVM) text classifier is shown to perform consistently better than the naïve Bayes text classifier. The drawback, however, is the time complexity in training a model. We further explore the effect of using the AM feature-selection method on an SVM text classifier. Our results indicate that the training time for the SVM algorithm can be reduced by more than 50%, while still improving the accuracy of the text classifier. We favorably show the effectiveness of our approach by demonstrating that it statistically significantly (99% confidence) outperforms eight existing feature-selection methods using four standard benchmark datasets.
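    Read literally, a term is unambiguous when nearly all of its occurrences fall into a single category, which suggests AM(t) = max over categories c of tf(t,c)/tf(t). The sketch below implements that reading; it is a paraphrase of the abstract, not the paper's exact definition, and the counts and cut-off are invented:

      # Ambiguity-measure-style feature selection: keep only terms whose
      # occurrences concentrate in one category.
      def ambiguity_measure(term_category_counts):
          return {term: max(by_cat.values()) / sum(by_cat.values())
                  for term, by_cat in term_category_counts.items()}

      counts = {"touchdown": {"sports": 98, "politics": 2},
                "budget":    {"sports": 30, "politics": 35}}
      am = ambiguity_measure(counts)
      print(am)                                        # touchdown 0.98, budget ~0.54
      print([t for t, s in am.items() if s >= 0.9])    # ['touchdown']; 'budget' dropped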
  16. Shafer, K.E.: Evaluating Scorpion Results (2001) 0.01
    0.00957139 = product of:
      0.03828556 = sum of:
        0.03828556 = weight(_text_:library in 4085) [ClassicSimilarity], result of:
          0.03828556 = score(doc=4085,freq=2.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.29050803 = fieldWeight in 4085, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.078125 = fieldNorm(doc=4085)
      0.25 = coord(1/4)
    
    Source
    Journal of library administration. 34(2001) nos.3/4, pp.237-244
  17. Godby, C.J.; Stuler, J.: The Library of Congress Classification as a knowledge base for automatic subject categorization : subject access issues (2003) 0.01
    0.009475192 = product of:
      0.03790077 = sum of:
        0.03790077 = weight(_text_:library in 3962) [ClassicSimilarity], result of:
          0.03790077 = score(doc=3962,freq=4.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.28758827 = fieldWeight in 3962, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3962)
      0.25 = coord(1/4)
    
    Abstract
    This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
  18. Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.01
    0.008488459 = product of:
      0.033953834 = sum of:
        0.033953834 = product of:
          0.06790767 = sum of:
            0.06790767 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
              0.06790767 = score(doc=611,freq=2.0), product of:
                0.17551683 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050121464 = queryNorm
                0.38690117 = fieldWeight in 611, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.078125 = fieldNorm(doc=611)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Date
    22. 8.2009 12:54:24
  19. Wang, J.: An extensive study on automated Dewey Decimal Classification (2009) 0.01
    0.008289068 = product of:
      0.033156272 = sum of:
        0.033156272 = weight(_text_:library in 3172) [ClassicSimilarity], result of:
          0.033156272 = score(doc=3172,freq=6.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.25158736 = fieldWeight in 3172, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3172)
      0.25 = coord(1/4)
    
    Abstract
    In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first analyze statistically the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
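    One plausible reading of the "balanced virtual tree" step is a bottom-up merge of sparsely populated classes into their parents, which flattens the hierarchy and evens out the training-data distribution at the same time. The sketch below is that reading only; the toy DDC-like tree, the counts and the threshold are invented:

      # Fold classes with too few training documents into their parents,
      # producing a flat set of "virtual" classes. Purely illustrative.
      def merge_sparse(tree, doc_counts, min_docs=50):
          """tree: {class: subtree dict}; doc_counts: {class: n training docs}."""
          virtual = {}
          def visit(cls, subtree):
              n = doc_counts.get(cls, 0)
              for child, grandchildren in subtree.items():
                  m = visit(child, grandchildren)
                  if m < min_docs:
                      n += m                 # child too sparse: merge upwards
                  else:
                      virtual[child] = m     # child survives as a virtual class
              return n
          for cls, subtree in tree.items():  # top-level classes are always kept
              virtual[cls] = visit(cls, subtree)
          return virtual

      tree = {"500": {"510": {"516": {}}, "530": {}}}
      docs = {"500": 10, "510": 120, "516": 8, "530": 60}
      print(merge_sparse(tree, docs))   # {'510': 128, '530': 60, '500': 10}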
  20. Golub, K.: Automated subject classification of textual web documents (2006) 0.01
    0.0067679947 = product of:
      0.027071979 = sum of:
        0.027071979 = weight(_text_:library in 5600) [ClassicSimilarity], result of:
          0.027071979 = score(doc=5600,freq=4.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.2054202 = fieldWeight in 5600, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5600)
      0.25 = coord(1/4)
    
    Abstract
    Purpose - To provide an integrated perspective to similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and point to problems with the approaches and automated classification as such. Design/methodology/approach - A range of works dealing with automated classification of full-text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages. Findings - Provides major similarities and differences between the three approaches: document pre-processing and utilization of web-specific document characteristics is common to all the approaches; major differences are in applied algorithms, employment or not of the vector space model and of controlled vocabularies. Problems of automated classification are recognized. Research limitations/implications - The paper does not attempt to provide an exhaustive bibliography of related resources. Practical implications - As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community have the information on how similar tasks are conducted in different communities. Originality/value - To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.