Search (7 results, page 1 of 1)

  • theme_ss:"Automatisches Indexieren"
  • type_ss:"a"
  • year_i:[2020 TO 2030}
  1. Chou, C.; Chu, T.: An analysis of BERT (NLP) for assisted subject indexing for Project Gutenberg (2022) 0.01
    Abstract
    In light of AI (Artificial Intelligence) and NLP (Natural Language Processing) technologies, this article examines the feasibility of using AI/NLP models to enhance the subject indexing of digital resources. While BERT (Bidirectional Encoder Representations from Transformers) models are widely used in scholarly communities, the authors assess whether BERT models can support machine-assisted indexing of the Project Gutenberg collection by suggesting Library of Congress Subject Headings filtered by selected Library of Congress Classification subclass labels. The findings of this study are informative for further research on using BERT models to assist with automatic subject indexing in digital library collections.
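
    A minimal Python sketch of the approach: score a text excerpt against a pre-filtered list of candidate subject headings. The Hugging Face transformers zero-shot pipeline stands in here for the authors' fine-tuned BERT setup; the model choice, candidate headings, and acceptance threshold are illustrative assumptions, not the paper's actual configuration.

    # Machine-assisted subject indexing sketch: suggest LCSH candidates for a
    # book excerpt. Assumes: pip install transformers torch
    from transformers import pipeline

    # An off-the-shelf NLI model serves as a stand-in for a fine-tuned BERT
    # classifier (hypothetical choice, not the paper's model).
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    # Hypothetical candidate headings, e.g. pre-filtered by an LCC subclass,
    # mirroring the article's filtering by LCC subclass labels.
    candidate_headings = [
        "Whaling -- Fiction",
        "Sea stories",
        "Ship captains -- Fiction",
        "Natural history",
    ]

    excerpt = ("Call me Ishmael. Some years ago, never mind how long "
               "precisely, having little or no money in my purse...")

    result = classifier(excerpt, candidate_headings, multi_label=True)
    for heading, score in zip(result["labels"], result["scores"]):
        if score >= 0.5:  # illustrative acceptance threshold
            print(f"suggest: {heading}  ({score:.2f})")
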
  2. Suominen, O.; Koskenniemi, I.: Annif Analyzer Shootout : comparing text lemmatization methods for automated subject indexing (2022) 0.01
    Abstract
    Automated text classification is an important function for many AI systems relevant to libraries, including automated subject indexing and classification. When implemented using the traditional natural language processing (NLP) paradigm, one key part of the process is the normalization of words using stemming or lemmatization, which reduces the amount of linguistic variation and often improves the quality of classification. In this paper, we compare the output of seven different text lemmatization algorithms as well as two baseline methods. We measure how the choice of method affects the quality of text classification using example corpora in three languages. The experiments were performed using the open source Annif toolkit for automated subject indexing and classification, but the findings should also generalize to other NLP toolkits and similar text classification tasks. The results show that lemmatization methods in most cases outperform the baseline methods, particularly for Finnish and Swedish text, but not for English, where the baseline methods are most effective. The differences between lemmatization methods are quite small. The systematic comparison will help optimize text classification pipelines and inform the further development of the Annif toolkit to incorporate a wider choice of normalization methods.
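
    The normalization step the paper benchmarks can be shown in a few lines. This standalone Python sketch contrasts a lowercase-only baseline with Snowball stemming, using NLTK rather than Annif itself; treating these two as the paper's baseline methods is an assumption about its exact configuration.

    # Compare two normalization methods on the same token stream; the output
    # tokens are what a downstream classifier would actually see.
    # Assumes: pip install nltk
    from nltk.stem.snowball import SnowballStemmer

    sentence = "Libraries are automating the indexing of digitized collections"
    tokens = sentence.lower().split()

    # Baseline: lowercased surface forms only.
    baseline = list(tokens)

    # Snowball stemming collapses inflectional variants (e.g. "indexing" ->
    # "index"), so the classifier sees fewer distinct features.
    stemmer = SnowballStemmer("english")
    stemmed = [stemmer.stem(t) for t in tokens]

    print("baseline:", baseline)
    print("stemmed: ", stemmed)
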
  3. Lepsky, K.: Automatisches Indexieren (2023) 0.00
    Date
    24.11.2022 13:29:16
  4. Matthews, P.; Glitre, K.: Genre analysis of movies using a topic model of plot summaries (2021) 0.00
    Abstract
    Genre plays an important role in the description, navigation, and discovery of movies, but it is rarely studied at large scale using quantitative methods. Such methods allow an analysis of how genre labels are applied, how genres are composed and how their ingredients change over time, and how genres compare. We apply unsupervised topic modeling to a large collection of textual movie summaries and then use the model's topic proportions to investigate key questions in genre, including recognizability, mapping, canonicity, and change over time. We find that many genres can be predicted quite easily from their lexical signatures, and that this defines their position on the genre landscape. We find significant changes in genre composition between periods for westerns, science fiction, and road movies, reflecting changes in production and consumption values. We show that, in terms of canonicity, canonical examples are often at the high end of the topic distribution profile for their genre rather than central, as might be predicted by categorization theory.
    Source
    Journal of the Association for Information Science and Technology. 72(2021) no.12, S.1511-1527
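
    The pipeline described in the abstract above can be sketched compactly in Python: fit a topic model to plot summaries, then read each movie's topic-proportion vector as its lexical signature on the genre landscape. The toy corpus, topic count, and the use of scikit-learn's LDA are illustrative assumptions; the paper works at a far larger scale.

    # Unsupervised topic modeling of plot summaries with LDA.
    # Assumes: pip install scikit-learn
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    summaries = [
        "A gunslinger rides into a frontier town to face the cattle baron",
        "A starship crew explores a derelict alien vessel near a black hole",
        "Two friends drive across the country after losing their jobs",
        "An outlaw and a sheriff settle their feud at high noon in the desert",
    ]

    # Bag-of-words features over the summaries.
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(summaries)

    # Fit a small LDA model; rows of doc_topics are per-movie topic proportions.
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    doc_topics = lda.fit_transform(X)

    # Averaging the vectors of movies sharing a genre label would give that
    # genre's profile; distances between profiles map the genre landscape.
    for summary, props in zip(summaries, doc_topics):
        print([round(p, 2) for p in props], "-", summary[:40])
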
  5. Lowe, D.B.; Dollinger, I.; Koster, T.; Herbert, B.E.: Text mining for type of research classification (2021) 0.00
    Abstract
    This project brought together undergraduate students in Computer Science with librarians to mine abstracts of articles from the Texas A&M University Libraries' institutional repository, OAKTrust, in order to probe the creation of new metadata that improves discovery and use. The mining task consisted simply of classifying the articles into two categories of research type: basic research ("for understanding," "curiosity-based," or "knowledge-based") and applied research ("use-based"). These categories are fundamental for funders in particular but are also important to researchers. The mining-to-classification steps took several iterations, but ultimately we achieved good results with BERT (Bidirectional Encoder Representations from Transformers). The project and its workflows offer a preview of what may lie ahead in crafting metadata with text mining techniques to enhance discoverability.
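
    The classification step described above is a standard two-label fine-tuning task. A minimal Python sketch follows, assuming the Hugging Face transformers and PyTorch packages; the in-line toy abstracts, label scheme, and hyperparameters are illustrative stand-ins, not the project's actual OAKTrust data or training setup.

    # Fine-tune BERT to label abstracts as basic (0) or applied (1) research.
    # Assumes: pip install transformers torch
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # 0 = basic, 1 = applied

    abstracts = [
        "We investigate the fundamental structure of protein folding pathways.",
        "We develop a sensor network to monitor irrigation on working farms.",
    ]
    labels = torch.tensor([0, 1])
    enc = tokenizer(abstracts, padding=True, truncation=True,
                    return_tensors="pt")

    # A few gradient steps on the toy batch stand in for real training.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):
        loss = model(**enc, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Predicted label is the argmax over the two logits.
    model.eval()
    with torch.no_grad():
        print(model(**enc).logits.argmax(dim=-1).tolist())
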
  6. Zhang, Y.; Zhang, C.; Li, J.: Joint modeling of characters, words, and conversation contexts for microblog keyphrase extraction (2020) 0.00
    Source
    Journal of the Association for Information Science and Technology. 71(2020) no.5, S.553-567
  7. Yang, T.-H.; Hsieh, Y.-L.; Liu, S.-H.; Chang, Y.-C.; Hsu, W.-L.: A flexible template generation and matching method with applications for publication reference metadata extraction (2021) 0.00
    Source
    Journal of the Association for Information Science and Technology. 72(2021) no.1, S.32-45