Search (64 results, page 2 of 4)

Savic, D.: Designing an expert system for classifying office documents (1994) 0.01

0.013953096 = product of:
  0.055812385 = sum of:
    0.055812385 = product of:
      0.11162477 = sum of:
        0.11162477 = weight(_text_:project in 2655) [ClassicSimilarity], result of:
          0.11162477 = score(doc=2655,freq=4.0), product of:
            0.21156175 = queryWeight, product of:
              4.220981 = idf(docFreq=1764, maxDocs=44218)
              0.050121464 = queryNorm
            0.52762264 = fieldWeight in 2655, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.220981 = idf(docFreq=1764, maxDocs=44218)
              0.0625 = fieldNorm(doc=2655)
      0.5 = coord(1/2)
  0.25 = coord(1/4)

Abstract: Can records management benefit from artificial intelligence technology, in particular from expert systems? Gives an answer to this question by showing an example of a small scale prototype project in automatic classification of office documents. Project methodology and basic elements of an expert system's approach are elaborated to give guidelines to potential users of this promising technology
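Note on the score breakdowns: the explain trees in this listing follow Lucene's ClassicSimilarity (TF-IDF) scheme, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The sketch below recomputes the score for doc 2655 from the values printed above. The function names are illustrative and are not part of Lucene; Lucene itself works in 32-bit floats, so the last digits may differ slightly.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float) -> float:
    """Score of one term in one document: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm                     # 0.21156175 in the listing
    field_weight = classic_tf(freq) * idf * field_norm  # 0.52762264 in the listing
    return query_weight * field_weight


# Values copied from the explain tree for doc 2655 (term "project").
raw = term_score(freq=4.0, doc_freq=1764, max_docs=44218,
                 query_norm=0.050121464, field_norm=0.0625)

# The two coord factors (1/2 and 1/4) scale the score by the share of
# query clauses that matched at each nesting level.
final = raw * 0.5 * 0.25
print(round(final, 6))  # close to the 0.013953096 shown in the listing
```

The same three inputs (freq, docFreq, fieldNorm) account for every difference between the scores of the entries that follow; queryNorm and maxDocs are constant across this result page.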

Sebastiani, F.: Machine learning in automated text categorization (2002) 0.01
```
0.012924549 = product of:
  0.051698197 = sum of:
    0.051698197 = weight(_text_:digital in 3389) [ClassicSimilarity], result of:
      0.051698197 = score(doc=3389,freq=2.0), product of:
        0.19770671 = queryWeight, product of:
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.050121464 = queryNorm
        0.26148933 = fieldWeight in 3389, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.046875 = fieldNorm(doc=3389)
  0.25 = coord(1/4)
```
Abstract

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Golub, K.; Hamon, T.; Ardö, A.: Automated classification of textual documents based on a controlled vocabulary in engineering (2007) 0.01
```
0.012924549 = product of:
  0.051698197 = sum of:
    0.051698197 = weight(_text_:digital in 1461) [ClassicSimilarity], result of:
      0.051698197 = score(doc=1461,freq=2.0), product of:
        0.19770671 = queryWeight, product of:
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.050121464 = queryNorm
        0.26148933 = fieldWeight in 1461, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.046875 = fieldNorm(doc=1461)
  0.25 = coord(1/4)
```
Abstract

Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to the rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents - instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
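
Note: the general idea of vocabulary-based string matching with term weights, excluded terms, and a cut-off can be sketched as follows. This is a minimal illustration and not the authors' implementation: the vocabulary entries, class labels, weights, excluded terms, and cut-off value are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini-vocabulary: term -> (class label, weight).
VOCABULARY = {
    "heat transfer": ("CLASS_HEAT", 2.0),
    "fluid flow": ("CLASS_FLUID", 2.0),
    "thermodynamics": ("CLASS_THERMO", 1.0),
}
EXCLUDED_TERMS = {"thermodynamics"}  # hypothetical: terms judged too general
CUT_OFF = 2.0                        # hypothetical minimum score per class


def classify(text, vocabulary=VOCABULARY, excluded=EXCLUDED_TERMS,
             cut_off=CUT_OFF):
    """Return (class, score) pairs whose summed term weights reach the cut-off."""
    normalized = text.lower()
    scores = defaultdict(float)
    for term, (class_label, weight) in vocabulary.items():
        if term in excluded:
            continue
        # Whole-word match of the vocabulary term in the title/abstract text.
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, normalized))
        scores[class_label] += hits * weight
    kept = [(label, score) for label, score in scores.items() if score >= cut_off]
    return sorted(kept, key=lambda pair: -pair[1])


sample = "Heat transfer in fluid flow: heat transfer coefficients and thermodynamics"
print(classify(sample))
```

Raising the cut-off or extending the exclusion set trades recall for precision, which is the kind of tuning the abstract reports.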
Kasprzik, A.: Automatisierte und semiautomatisierte Klassifizierung : eine Analyse aktueller Projekte (2014) 0.01
```
0.012924549 = product of:
  0.051698197 = sum of:
    0.051698197 = weight(_text_:digital in 2470) [ClassicSimilarity], result of:
      0.051698197 = score(doc=2470,freq=2.0), product of:
        0.19770671 = queryWeight, product of:
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.050121464 = queryNorm
        0.26148933 = fieldWeight in 2470, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.046875 = fieldNorm(doc=2470)
  0.25 = coord(1/4)
```
Abstract

The rapid growth in the number of digitally available documents, combined with the shortage of time and staff at academic libraries, suggests the use of semi-automatic or fully automatic methods for verbal and classificatory subject indexing. After a brief general introduction to the common methodology, this article examines a number of automated classification projects from the period 2007-2012 and from the German-speaking region. Most of the projects presented use machine learning methods from artificial intelligence, mostly work with adapted versions of commercial software, and generally refer to the Dewey Decimal Classification (DDC). The data basis consists of metadata records, abstracts, tables of contents, and full texts in various data formats. The concluding analysis arranges the projects according to a number of different criteria and summarizes the current situation and the greatest challenges for automated classification methods.
Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: ¬The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.01
```
0.012924549 = product of:
  0.051698197 = sum of:
    0.051698197 = weight(_text_:digital in 3015) [ClassicSimilarity], result of:
      0.051698197 = score(doc=3015,freq=2.0), product of:
        0.19770671 = queryWeight, product of:
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.050121464 = queryNorm
        0.26148933 = fieldWeight in 3015, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.046875 = fieldNorm(doc=3015)
  0.25 = coord(1/4)
```
Abstract

We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question of whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
Frank, E.; Paynter, G.W.: Predicting Library of Congress Classifications from Library of Congress Subject Headings (2004) 0.01
```
0.012841367 = product of:
  0.05136547 = sum of:
    0.05136547 = weight(_text_:library in 2218) [ClassicSimilarity], result of:
      0.05136547 = score(doc=2218,freq=10.0), product of:
        0.1317883 = queryWeight, product of:
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.050121464 = queryNorm
        0.38975742 = fieldWeight in 2218, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.046875 = fieldNorm(doc=2218)
  0.25 = coord(1/4)
```
Abstract

This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to a work given its set of Library of Congress Subject Headings (LCSH). LCCs are organized in a tree: The root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a model that maps from sets of LCSH to classifications from the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.

Subramanian, S.; Shafer, K.E.: Clustering (1998) 0.01

0.012332911 = product of:
  0.049331643 = sum of:
    0.049331643 = product of:
      0.098663285 = sum of:
        0.098663285 = weight(_text_:project in 1103) [ClassicSimilarity], result of:
          0.098663285 = score(doc=1103,freq=2.0), product of:
            0.21156175 = queryWeight, product of:
              4.220981 = idf(docFreq=1764, maxDocs=44218)
              0.050121464 = queryNorm
            0.4663569 = fieldWeight in 1103, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.220981 = idf(docFreq=1764, maxDocs=44218)
              0.078125 = fieldNorm(doc=1103)
      0.5 = coord(1/2)
  0.25 = coord(1/4)

Abstract: This article presents our exploration of computer science clustering algorithms as they relate to the Scorpion system. Scorpion is a research project at OCLC that explores the indexing and cataloging of electronic resources. For a more complete description of the Scorpion, please visit the Scorpion Web site at <http://purl.oclc.org/scorpion>

Shafer, K.E.: Automatic Subject Assignment via the Scorpion System (2001) 0.01

0.011485667 = product of:
  0.045942668 = sum of:
    0.045942668 = weight(_text_:library in 1043) [ClassicSimilarity], result of:
      0.045942668 = score(doc=1043,freq=2.0), product of:
        0.1317883 = queryWeight, product of:
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.050121464 = queryNorm
        0.34860963 = fieldWeight in 1043, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.09375 = fieldNorm(doc=1043)
  0.25 = coord(1/4)

Source: Journal of library administration. 34(2001) nos.1/2, pp.187-189

Godby, C. J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization (2001) 0.01

0.010828791 = product of:
  0.043315165 = sum of:
    0.043315165 = weight(_text_:library in 1567) [ClassicSimilarity], result of:
      0.043315165 = score(doc=1567,freq=4.0), product of:
        0.1317883 = queryWeight, product of:
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.050121464 = queryNorm
        0.32867232 = fieldWeight in 1567, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.0625 = fieldNorm(doc=1567)
  0.25 = coord(1/4)

Abstract: This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic

Adams, K.C.: Word wranglers : Automatic classification tools transform enterprise documents from "bags of words" into knowledge resources (2003) 0.01
```
0.010770457 = product of:
  0.043081827 = sum of:
    0.043081827 = weight(_text_:digital in 1665) [ClassicSimilarity], result of:
      0.043081827 = score(doc=1665,freq=2.0), product of:
        0.19770671 = queryWeight, product of:
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.050121464 = queryNorm
        0.21790776 = fieldWeight in 1665, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1665)
  0.25 = coord(1/4)
```
Abstract

Taxonomies are an important part of any knowledge management (KM) system, and automatic classification software is emerging as a "killer app" for consumer and enterprise portals. A number of companies such as Inxight Software, Mohomine, Metacode, and others claim to interpret the semantic content of any textual document and automatically classify text on the fly. The promise that software could automatically produce a Yahoo-style directory is a siren call not many IT managers are able to resist. KM needs have grown more complex due to the increasing amount of digital information, the declining effectiveness of keyword searching, and heterogeneous document formats in corporate databases. This environment requires innovative KM tools, and automatic classification technology is an example of this new kind of software. These products can be divided into three categories according to their underlying technology - rules-based, catalog-by-example, and statistical clustering. Evolving trends in this market include framing classification as a cyborg (computer- and human-based) activity and the increasing use of extensible markup language (XML) and support vector machine (SVM) technology. In this article, we'll survey the rapidly changing automatic classification software market and examine the features and capabilities of leading classification products.
Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.01
```
0.010770457 = product of:
  0.043081827 = sum of:
    0.043081827 = weight(_text_:digital in 2804) [ClassicSimilarity], result of:
      0.043081827 = score(doc=2804,freq=2.0), product of:
        0.19770671 = queryWeight, product of:
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.050121464 = queryNorm
        0.21790776 = fieldWeight in 2804, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.944552 = idf(docFreq=2326, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2804)
  0.25 = coord(1/4)
```
Abstract

With the increasing number of digital documents, the ability to automatically classify those documents both efficiently and accurately is becoming more critical and difficult. One of the major problems in text classification is the high dimensionality of feature space. We present the ambiguity measure (AM) feature-selection algorithm, which selects the most unambiguous features from the feature set. Unambiguous features are those features whose presence in a document indicates a strong degree of confidence that a document belongs to only one specific category. We apply AM feature selection on a naïve Bayes text classifier. We favorably show the effectiveness of our approach in outperforming eight existing feature-selection methods, using five benchmark datasets with a statistical significance of at least 95% confidence. The support vector machine (SVM) text classifier is shown to perform consistently better than the naïve Bayes text classifier. The drawback, however, is the time complexity in training a model. We further explore the effect of using the AM feature-selection method on an SVM text classifier. Our results indicate that the training time for the SVM algorithm can be reduced by more than 50%, while still improving the accuracy of the text classifier. We favorably show the effectiveness of our approach by demonstrating that it statistically significantly (99% confidence) outperforms eight existing feature-selection methods using four standard benchmark datasets.
Sojka, P.; Lee, M.; Rehurek, R.; Hatlapatka, R.; Kucbel, M.; Bouche, T.; Goutorbe, C.; Anghelache, R.; Wojciechowski, K.: Toolset for entity and semantic associations : Final Release (2013) 0.01
```
0.010464822 = product of:
  0.041859288 = sum of:
    0.041859288 = product of:
      0.083718576 = sum of:
        0.083718576 = weight(_text_:project in 1057) [ClassicSimilarity], result of:
          0.083718576 = score(doc=1057,freq=4.0), product of:
            0.21156175 = queryWeight, product of:
              4.220981 = idf(docFreq=1764, maxDocs=44218)
              0.050121464 = queryNorm
            0.39571697 = fieldWeight in 1057, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.220981 = idf(docFreq=1764, maxDocs=44218)
              0.046875 = fieldNorm(doc=1057)
      0.5 = coord(1/2)
  0.25 = coord(1/4)
```
Abstract

In this document we describe the final release of the toolset for entity and semantic associations, integrating two versions (language dependent and language independent) of Unsupervised Document Similarity implemented by MU (using gensim tool) and Citation Indexing, Resolution and Matching (UJF/CMD). We give a brief description of tools, the rationale behind decisions made, and provide elementary evaluation. Tools are integrated in the main project result, EuDML website, and they deliver the needed functionality for exploratory searching and browsing the collected documents. EuDML users and content providers thus benefit from millions of algorithmically generated similarity and citation links, developed using state of the art machine learning and matching methods.

Content

See also: https://is.muni.cz/repo/1076213/en/Lee-Sojka-Rehurek-Bolikowski/Toolset-for-Entity-and-Semantic-Associations-Initial-Release-Deliverable-82-of-project-EuDML?lang=en.

Shafer, K.E.: Evaluating Scorpion Results (2001) 0.01

0.00957139 = product of:
  0.03828556 = sum of:
    0.03828556 = weight(_text_:library in 4085) [ClassicSimilarity], result of:
      0.03828556 = score(doc=4085,freq=2.0), product of:
        0.1317883 = queryWeight, product of:
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.050121464 = queryNorm
        0.29050803 = fieldWeight in 4085, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.078125 = fieldNorm(doc=4085)
  0.25 = coord(1/4)

Source: Journal of library administration. 34(2001) nos.3/4, pp.237-244

Godby, C.J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization : subject access issues (2003) 0.01

0.009475192 = product of:
  0.03790077 = sum of:
    0.03790077 = weight(_text_:library in 3962) [ClassicSimilarity], result of:
      0.03790077 = score(doc=3962,freq=4.0), product of:
        0.1317883 = queryWeight, product of:
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.050121464 = queryNorm
        0.28758827 = fieldWeight in 3962, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.0546875 = fieldNorm(doc=3962)
  0.25 = coord(1/4)

Abstract: This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.

Bianchini, C.; Bargioni, S.: Automated classification using linked open data : a case study on faceted classification and Wikidata (2021) 0.01
```
0.008633038 = product of:
  0.034532152 = sum of:
    0.034532152 = product of:
      0.069064304 = sum of:
        0.069064304 = weight(_text_:project in 724) [ClassicSimilarity], result of:
          0.069064304 = score(doc=724,freq=2.0), product of:
            0.21156175 = queryWeight, product of:
              4.220981 = idf(docFreq=1764, maxDocs=44218)
              0.050121464 = queryNorm
            0.32644984 = fieldWeight in 724, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.220981 = idf(docFreq=1764, maxDocs=44218)
              0.0546875 = fieldNorm(doc=724)
      0.5 = coord(1/2)
  0.25 = coord(1/4)
```
Abstract

The Wikidata gadget, CCLitBox, for the automated classification of literary authors and works by a faceted classification and using Linked Open Data (LOD) is presented. The tool reproduces the classification algorithm of class O Literature of the Colon Classification and uses data freely available in Wikidata to create Colon Classification class numbers. CCLitBox is totally free and enables any user to classify literary authors and their works; it is easily accessible to everybody; it uses LOD from Wikidata but missing data for classification can be freely added if necessary; it is readymade for any cooperative and networked project.

Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.01

0.008488459 = product of:
  0.033953834 = sum of:
    0.033953834 = product of:
      0.06790767 = sum of:
        0.06790767 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
          0.06790767 = score(doc=611,freq=2.0), product of:
            0.17551683 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.050121464 = queryNorm
            0.38690117 = fieldWeight in 611, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=611)
      0.5 = coord(1/2)
  0.25 = coord(1/4)

Date: 22. 8.2009 12:54:24

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.01

0.008488459 = product of:
  0.033953834 = sum of:
    0.033953834 = product of:
      0.06790767 = sum of:
        0.06790767 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.06790767 = score(doc=2748,freq=2.0), product of:
            0.17551683 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.050121464 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
  0.25 = coord(1/4)

Date: 1. 2.2016 18:25:22

Wang, J.: ¬An extensive study on automated Dewey Decimal Classification (2009) 0.01
```
0.008289068 = product of:
  0.033156272 = sum of:
    0.033156272 = weight(_text_:library in 3172) [ClassicSimilarity], result of:
      0.033156272 = score(doc=3172,freq=6.0), product of:
        0.1317883 = queryWeight, product of:
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.050121464 = queryNorm
        0.25158736 = fieldWeight in 3172, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3172)
  0.25 = coord(1/4)
```
Abstract

In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first analyze statistically the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
Ahmed, M.; Mukhopadhyay, M.; Mukhopadhyay, P.: Automated knowledge organization : AI ML based subject indexing system for libraries (2023) 0.01
```
0.008289068 = product of:
  0.033156272 = sum of:
    0.033156272 = weight(_text_:library in 977) [ClassicSimilarity], result of:
      0.033156272 = score(doc=977,freq=6.0), product of:
        0.1317883 = queryWeight, product of:
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.050121464 = queryNorm
        0.25158736 = fieldWeight in 977, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.0390625 = fieldNorm(doc=977)
  0.25 = coord(1/4)
```
Abstract

The research study as reported here is an attempt to explore the possibilities of an AI/ML-based semi-automated indexing system in a library setup to handle large volumes of documents. It uses the Python virtual environment to install and configure an open source AI environment (named Annif) to feed the LOD (Linked Open Data) dataset of Library of Congress Subject Headings (LCSH) as a standard KOS (Knowledge Organisation System). The framework deployed the Turtle format of LCSH after cleaning the file with Skosify, applied an array of backend algorithms (namely TF-IDF, Omikuji, and NN-Ensemble) to measure relative performance, and selected Snowball as an analyser. The training of Annif was conducted with a large set of bibliographic records populated with subject descriptors (MARC tag 650$a) and indexed by trained LIS professionals. The training dataset is first treated with MarcEdit to export it in a format suitable for OpenRefine, and then in OpenRefine it undergoes many steps to produce a bibliographic record set suitable to train Annif. The framework, after training, has been tested with a bibliographic dataset to measure indexing efficiencies, and finally, the automated indexing framework is integrated with data wrangling software (OpenRefine) to produce suggested headings on a mass scale. The entire framework is based on open-source software, open datasets, and open standards.

Source

DESIDOC journal of library and information technology. 43(2023) no.1, pp.45-54
Larson, R.R.: Experiments in automatic Library of Congress Classification (1992) 0.01
```
0.008121594 = product of:
  0.032486375 = sum of:
    0.032486375 = weight(_text_:library in 1054) [ClassicSimilarity], result of:
      0.032486375 = score(doc=1054,freq=4.0), product of:
        0.1317883 = queryWeight, product of:
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.050121464 = queryNorm
        0.24650425 = fieldWeight in 1054, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          2.6293786 = idf(docFreq=8668, maxDocs=44218)
          0.046875 = fieldNorm(doc=1054)
  0.25 = coord(1/4)
```
Abstract

This article presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records. The method used in this study was based on partial match retrieval techniques using various elements of new records (i.e., those to be classified) as "queries", and a test database of classification clusters generated from previously classified MARC records. Sixty individual methods for automatic classification were tested on a set of 283 new records, using all combinations of four different partial match methods, five query types, and three representations of search terms. The results indicate that if the best method for a particular case can be determined, then up to 86% of the new records may be correctly classified. The single method with the best accuracy was able to select the correct classification for about 46% of the new records.
