Search (7 results, page 1 of 1)

  • language_ss:"e"
  • theme_ss:"Automatisches Indexieren"
  • type_ss:"el"
  1. Suominen, O.; Koskenniemi, I.: Annif Analyzer Shootout : comparing text lemmatization methods for automated subject indexing (2022) 0.03
    0.02826747 = product of:
      0.13191485 = sum of:
        0.03675035 = weight(_text_:subject in 658) [ClassicSimilarity], result of:
          0.03675035 = score(doc=658,freq=6.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.34222013 = fieldWeight in 658, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0390625 = fieldNorm(doc=658)
        0.04758225 = weight(_text_:classification in 658) [ClassicSimilarity], result of:
          0.04758225 = score(doc=658,freq=16.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49761042 = fieldWeight in 658, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=658)
        0.04758225 = weight(_text_:classification in 658) [ClassicSimilarity], result of:
          0.04758225 = score(doc=658,freq=16.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.49761042 = fieldWeight in 658, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=658)
      0.21428572 = coord(3/14)
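The indented breakdown above is Lucene's ClassicSimilarity (TF-IDF) explain output: for each matching query term, fieldWeight = tf x idf x fieldNorm is multiplied by queryWeight = idf x queryNorm, the per-term scores are summed, and the sum is scaled by the coord factor (matching clauses / total clauses). A minimal sketch that re-computes the top result's score from the figures shown (the helper functions are illustrative, not the Lucene API):

```python
import math

# ClassicSimilarity building blocks (illustrative re-computation, not the Lucene API)
def idf(doc_freq, max_docs):
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq, doc_freq, max_docs, field_norm):
    tf = math.sqrt(freq)                                  # tf(freq) = sqrt(termFreq)
    return tf * idf(doc_freq, max_docs) * field_norm

query_norm = 0.03002521                                   # taken from the explain output above

# weight(_text_:subject in 658): freq=6, docFreq=3361, fieldNorm=0.0390625
qw_subject = idf(3361, 44218) * query_norm                # queryWeight  ~= 0.10738805
fw_subject = field_weight(6, 3361, 44218, 0.0390625)      # fieldWeight  ~= 0.34222013
w_subject = qw_subject * fw_subject                       #              ~= 0.03675035

# weight(_text_:classification in 658), counted twice in the sum: freq=16, docFreq=4974
qw_class = idf(4974, 44218) * query_norm                  # queryWeight  ~= 0.09562149
fw_class = field_weight(16, 4974, 44218, 0.0390625)       # fieldWeight  ~= 0.49761042
w_class = qw_class * fw_class                             #              ~= 0.04758225

# sum of the matching clauses, then coord(3/14): 3 of 14 query clauses matched
score = (w_subject + 2 * w_class) * (3 / 14)
print(round(score, 8))                                    # ~= 0.02826747
```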
    
    Abstract
    Automated text classification is an important function for many AI systems relevant to libraries, including automated subject indexing and classification. When implemented using the traditional natural language processing (NLP) paradigm, one key part of the process is the normalization of words using stemming or lemmatization, which reduces the amount of linguistic variation and often improves the quality of classification. In this paper, we compare the output of seven different text lemmatization algorithms as well as two baseline methods. We measure how the choice of method affects the quality of text classification using example corpora in three languages. The experiments have been performed using the open source Annif toolkit for automated subject indexing and classification, but should also generalize to other NLP toolkits and similar text classification tasks. The results show that lemmatization methods in most cases outperform the baseline methods in text classification, particularly for Finnish and Swedish text, but not for English, where the baseline methods are most effective. The differences between lemmatization methods are quite small. The systematic comparison will help optimize text classification pipelines and inform the further development of the Annif toolkit to incorporate a wider choice of normalization methods.
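As a rough, hedged illustration of the kind of comparison the paper describes (not the Annif pipeline itself), one can hold the classifier fixed and swap only the normalization step. `load_corpus` below is a hypothetical loader, and the English-only stemmer and lemmatizer stand in for the paper's multilingual analyzers:

```python
# Sketch: measuring the effect of stemming vs. lemmatization on text classification.
# Not the Annif toolkit; load_corpus() is a hypothetical loader, English-only for brevity.
from nltk.stem import WordNetLemmatizer            # requires nltk.download("wordnet")
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

normalizers = {
    "baseline (none)": lambda tokens: tokens,
    "stemming": lambda tokens: [stemmer.stem(t) for t in tokens],
    "lemmatization": lambda tokens: [lemmatizer.lemmatize(t) for t in tokens],
}

texts, labels = load_corpus()   # hypothetical: documents and their gold subject labels

for name, normalize in normalizers.items():
    vectorizer = TfidfVectorizer(
        tokenizer=lambda doc, fn=normalize: fn(doc.lower().split()),  # naive whitespace tokenizer
        token_pattern=None,
    )
    pipeline = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1_micro")
    print(f"{name}: mean micro-F1 = {scores.mean():.3f}")
```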
  2. Toepfer, M.; Seifert, C.: Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints 0.03
    0.025746947 = product of:
      0.09011431 = sum of:
        0.021217827 = weight(_text_:subject in 4309) [ClassicSimilarity], result of:
          0.021217827 = score(doc=4309,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.19758089 = fieldWeight in 4309, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4309)
        0.016822865 = weight(_text_:classification in 4309) [ClassicSimilarity], result of:
          0.016822865 = score(doc=4309,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.17593184 = fieldWeight in 4309, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4309)
        0.016822865 = weight(_text_:classification in 4309) [ClassicSimilarity], result of:
          0.016822865 = score(doc=4309,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.17593184 = fieldWeight in 4309, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4309)
        0.035250753 = product of:
          0.07050151 = sum of:
            0.07050151 = weight(_text_:texts in 4309) [ClassicSimilarity], result of:
              0.07050151 = score(doc=4309,freq=4.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.42830306 = fieldWeight in 4309, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4309)
          0.5 = coord(1/2)
      0.2857143 = coord(4/14)
    
    Abstract
    Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document-level. Therefore, we propose a novel approach that detects documents rather than concepts where quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of the previously unseen data where considerable gains in document-level recall can be achieved, while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.
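A heavily simplified sketch of the filtering idea follows; synthetic data and a linear regressor stand in for the paper's content-based indicators and deep, multi-layered regression architecture:

```python
# Sketch: document-level quality estimation with a precision-oriented acceptance threshold.
# Synthetic data and Ridge regression stand in for the paper's indicators and deep regressor.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Four toy content-based indicators per document (e.g. text length, number of suggested
# concepts, mean and minimum concept confidence) and a document-level quality target
# (e.g. recall of the suggested subjects against gold annotations).
n_docs = 500
X = rng.normal(size=(n_docs, 4))
y = np.clip(0.6 + 0.2 * X[:, 2] - 0.1 * X[:, 0] + rng.normal(0, 0.05, n_docs), 0.0, 1.0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
quality_model = Ridge().fit(X_train, y_train)

# Accept only documents whose predicted quality clears the threshold; the rest would fall
# back to intellectual indexing, so precision is upheld while recall is gained on the subset.
threshold = 0.7
accepted = quality_model.predict(X_test) >= threshold
print(f"accepted {accepted.mean():.0%} of unseen documents, "
      f"mean true quality on the accepted subset: {y_test[accepted].mean():.2f}")
```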
  3. Junger, U.: Can indexing be automated? : the example of the Deutsche Nationalbibliothek (2012) 0.02
    0.022824414 = product of:
      0.10651393 = sum of:
        0.059409913 = weight(_text_:subject in 1717) [ClassicSimilarity], result of:
          0.059409913 = score(doc=1717,freq=8.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.5532265 = fieldWeight in 1717, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1717)
        0.023552012 = weight(_text_:classification in 1717) [ClassicSimilarity], result of:
          0.023552012 = score(doc=1717,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24630459 = fieldWeight in 1717, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1717)
        0.023552012 = weight(_text_:classification in 1717) [ClassicSimilarity], result of:
          0.023552012 = score(doc=1717,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24630459 = fieldWeight in 1717, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1717)
      0.21428572 = coord(3/14)
    
    Abstract
    The German subject headings authority file (Schlagwortnormdatei/SWD) provides a broad controlled vocabulary for indexing documents on all subjects. Traditionally, it has been used for intellectual subject cataloguing, primarily of books. The Deutsche Nationalbibliothek (DNB, German National Library) has been working on developing and implementing procedures for the automated assignment of subject headings to online publications. The paper sketches this project, its results, and its problems.
    Content
    Paper for the conference "Beyond libraries - subject metadata in the digital environment and semantic web", IFLA Satellite Post-Conference, 17-18 August 2012, Tallinn. Cf.: http://www.nlib.ee/index.php?id=17763.
    Source
    Cataloguing & Classification Quarterly 52(2014) no.1, S.102-109
  4. Gábor, K.; Zargayouna, H.; Tellier, I.; Buscaldi, D.; Charnois, T.: ¬A typology of semantic relations dedicated to scientific literature analysis (2016) 0.01
    0.0067291465 = product of:
      0.047104023 = sum of:
        0.023552012 = weight(_text_:classification in 2933) [ClassicSimilarity], result of:
          0.023552012 = score(doc=2933,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24630459 = fieldWeight in 2933, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2933)
        0.023552012 = weight(_text_:classification in 2933) [ClassicSimilarity], result of:
          0.023552012 = score(doc=2933,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24630459 = fieldWeight in 2933, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2933)
      0.14285715 = coord(2/14)
    
    Abstract
    We propose a method for improving access to scientific literature by analyzing the content of research papers beyond citation links and topic tracking. Our model relies on a typology of explicit semantic relations. These relations are instantiated in the abstract/introduction part of the papers and can be identified automatically using textual data and external ontologies. Preliminary results show promising precision in unsupervised relationship classification.
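A toy illustration of pattern-based relation detection over abstract sentences; the surface patterns and relation labels below are invented for illustration and are much cruder than the paper's typology:

```python
# Sketch: unsupervised, pattern-based detection of semantic relations in abstracts.
# Patterns and relation labels are illustrative, far simpler than the paper's typology.
import re

RELATION_PATTERNS = {
    "USED_FOR":     re.compile(r"(?P<arg1>\w[\w\s-]*?) (?:is|are) used (?:for|to) (?P<arg2>\w[\w\s-]*)", re.I),
    "EVALUATED_ON": re.compile(r"(?P<arg1>\w[\w\s-]*?) (?:is|are|was|were) evaluated on (?P<arg2>\w[\w\s-]*)", re.I),
    "IMPROVES":     re.compile(r"(?P<arg1>\w[\w\s-]*?) improves (?P<arg2>\w[\w\s-]*)", re.I),
}

def extract_relations(abstract: str):
    """Return (relation, arg1, arg2) triples found by surface patterns in one abstract."""
    triples = []
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        for relation, pattern in RELATION_PATTERNS.items():
            match = pattern.search(sentence)
            if match:
                triples.append((relation, match["arg1"].strip(), match["arg2"].strip()))
    return triples

print(extract_relations("A recurrent model is used for caption generation. "
                        "The typology improves access to scientific literature."))
```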
  5. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D.: ¬A picture is worth a thousand (coherent) words : building a natural description of images (2014) 0.00
    0.0033645732 = product of:
      0.023552012 = sum of:
        0.011776006 = weight(_text_:classification in 1874) [ClassicSimilarity], result of:
          0.011776006 = score(doc=1874,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.12315229 = fieldWeight in 1874, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.02734375 = fieldNorm(doc=1874)
        0.011776006 = weight(_text_:classification in 1874) [ClassicSimilarity], result of:
          0.011776006 = score(doc=1874,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.12315229 = fieldWeight in 1874, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.02734375 = fieldNorm(doc=1874)
      0.14285715 = coord(2/14)
    
    Content
    "People can summarize a complex scene in a few words without thinking twice. It's much more difficult for computers. But we've just gotten a bit closer -- we've developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images. Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what's going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language. Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human readable sequence of words to describe it? This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German. Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN's last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN's rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image.
  6. Wolfe, E.W.: A case study in automated metadata enhancement : Natural Language Processing in the humanities (2019) 0.00
    0.0021217826 = product of:
      0.029704956 = sum of:
        0.029704956 = weight(_text_:subject in 5236) [ClassicSimilarity], result of:
          0.029704956 = score(doc=5236,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.27661324 = fieldWeight in 5236, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5236)
      0.071428575 = coord(1/14)
    
    Abstract
    The Black Book Interactive Project at the University of Kansas (KU) is developing an expanded corpus of novels by African American authors, with an emphasis on lesser known writers and a goal of expanding research in this field. Using a custom metadata schema with an emphasis on race-related elements, each novel is analyzed for a variety of elements such as literary style, targeted content analysis, historical context, and other areas. Librarians at KU have worked to develop a variety of computational text analysis processes designed to assist with specific aspects of this metadata collection, including text mining and natural language processing, automated subject extraction based on word sense disambiguation, harvesting data from Wikidata, and other actions.
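As a hedged illustration of one of the steps mentioned above, harvesting data from Wikidata, the sketch below runs a SPARQL query against the public endpoint; the author name and the selected properties are placeholders, not the project's actual metadata schema:

```python
# Sketch: harvesting author data from Wikidata to enrich metadata records.
# The author name and the selected properties are placeholders, not the BBIP schema.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def wikidata_author_facts(author_name: str):
    query = """
    SELECT ?person ?birth ?occupationLabel WHERE {
      ?person rdfs:label "%s"@en ;
              wdt:P31 wd:Q5 .                       # instance of: human
      OPTIONAL { ?person wdt:P569 ?birth . }        # date of birth
      OPTIONAL { ?person wdt:P106 ?occupation . }   # occupation
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    } LIMIT 10
    """ % author_name
    response = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                            headers={"User-Agent": "metadata-enrichment-sketch/0.1"})
    response.raise_for_status()
    return response.json()["results"]["bindings"]

for row in wikidata_author_facts("Zora Neale Hurston"):   # placeholder author
    print({key: value["value"] for key, value in row.items()})
```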
  7. Mongin, L.; Fu, Y.Y.; Mostafa, J.: Open Archives data Service prototype and automated subject indexing using D-Lib archive content as a testbed (2003) 0.00
    0.0018186709 = product of:
      0.02546139 = sum of:
        0.02546139 = weight(_text_:subject in 1167) [ClassicSimilarity], result of:
          0.02546139 = score(doc=1167,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.23709705 = fieldWeight in 1167, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=1167)
      0.071428575 = coord(1/14)