Document (#43660)

Author
Suominen, O.
Koskenniemi, I.
Title
Annif Analyzer Shootout : comparing text lemmatization methods for automated subject indexing
Source
Code4Lib journal. Issue 54(2022), [http://journal.code4lib.org]
Year
2022
Abstract
Automated text classification is an important function for many AI systems relevant to libraries, including automated subject indexing and classification. When implemented using the traditional natural language processing (NLP) paradigm, one key part of the process is the normalization of words using stemming or lemmatization, which reduces the amount of linguistic variation and often improves the quality of classification. In this paper, we compare the output of seven different text lemmatization algorithms as well as two baseline methods. We measure how the choice of method affects the quality of text classification using example corpora in three languages. The experiments have been performed using the open source Annif toolkit for automated subject indexing and classification, but the findings should also generalize to other NLP toolkits and similar text classification tasks. The results show that lemmatization methods in most cases outperform baseline methods in text classification, particularly for Finnish and Swedish text, but not for English, where baseline methods are most effective. The differences between lemmatization methods are quite small. The systematic comparison will help optimize text classification pipelines and inform the further development of the Annif toolkit to incorporate a wider choice of normalization methods.
Content
Cf.: https://journal.code4lib.org/articles/16719.
Theme
Automatic indexing

Similar documents (author)

  1. Suominen, V.: Linguistic / semiotic conditions of information retrieval / documentation in the light of a Saussurean conception of language : 'organising knowledge' or 'communication concerning documents'? (1998) 6.01
    6.010904 = sum of:
      6.010904 = weight(author_txt:suominen in 81) [ClassicSimilarity], result of:
        6.010904 = fieldWeight in 81, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.625 = fieldNorm(doc=81)
    
  2. Suominen, A.; Toivanen, H.: Map of science with topic modeling : comparison of unsupervised learning and human-assigned subject classification (2016) 4.81
    4.808723 = sum of:
      4.808723 = weight(author_txt:suominen in 3121) [ClassicSimilarity], result of:
        4.808723 = fieldWeight in 3121, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.5 = fieldNorm(doc=3121)
    
  3. Suominen, O.; Hyvönen, N.: From MARC silos to Linked Data silos? (2017) 4.81
    4.808723 = sum of:
      4.808723 = weight(author_txt:suominen in 3732) [ClassicSimilarity], result of:
        4.808723 = fieldWeight in 3732, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.5 = fieldNorm(doc=3732)
    
  4. Suominen, V.; Tuomi, P.: Literacies, hermeneutics, and literature (2015) 4.81
    4.808723 = sum of:
      4.808723 = weight(author_txt:suominen in 5543) [ClassicSimilarity], result of:
        4.808723 = fieldWeight in 5543, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.5 = fieldNorm(doc=5543)
    
  5. Friman, M.; Jansson, P.; Suominen, V.: Chaos or order? : Aby Warburg's library of cultural history and its classification (1995) 3.61
    3.606542 = sum of:
      3.606542 = weight(author_txt:suominen in 1089) [ClassicSimilarity], result of:
        3.606542 = fieldWeight in 1089, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.375 = fieldNorm(doc=1089)
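
The author scores above can be reproduced by hand. The following is a minimal sketch, assuming the standard Lucene ClassicSimilarity definitions (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))); the function names are illustrative and are not part of any library API. The fieldNorm values are taken from the listing as given, not recomputed.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explain output."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the listing: author_txt:suominen, docFreq=7, maxDocs=44218.
# The three distinct fieldNorm values seen above are 0.625, 0.5 and 0.375.
for norm in (0.625, 0.5, 0.375):
    print(norm, field_weight(1.0, 7, 44218, norm))
```

Because every hit matches the single author term exactly once, the ranking in this list is driven entirely by fieldNorm; in ClassicSimilarity a larger fieldNorm generally corresponds to a shorter field.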
    

Similar documents (content)

  1. Kettunen, K.; Kunttu, T.; Järvelin, K.: To stem or lemmatize a highly inflectional language in a probabilistic IR environment? (2005) 0.32
    0.3187919 = sum of:
      0.3187919 = product of:
        1.1385425 = sum of:
          0.045232806 = weight(abstract_txt:stemming in 4395) [ClassicSimilarity], result of:
            0.045232806 = score(doc=4395,freq=2.0), product of:
              0.07852138 = queryWeight, product of:
                1.0604115 = boost
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.009941478 = queryNorm
              0.5760572 = fieldWeight in 4395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.07247957 = weight(abstract_txt:finnish in 4395) [ClassicSimilarity], result of:
            0.07247957 = score(doc=4395,freq=4.0), product of:
              0.08534 = queryWeight, product of:
                1.105495 = boost
                7.7650614 = idf(docFreq=50, maxDocs=44218)
                0.009941478 = queryNorm
              0.8493036 = fieldWeight in 4395, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.7650614 = idf(docFreq=50, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.009494896 = weight(abstract_txt:most in 4395) [ClassicSimilarity], result of:
            0.009494896 = score(doc=4395,freq=1.0), product of:
              0.044024967 = queryWeight, product of:
                1.12291 = boost
                3.943693 = idf(docFreq=2328, maxDocs=44218)
                0.009941478 = queryNorm
              0.2156707 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.943693 = idf(docFreq=2328, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.033266943 = weight(abstract_txt:choice in 4395) [ClassicSimilarity], result of:
            0.033266943 = score(doc=4395,freq=1.0), product of:
              0.10155801 = queryWeight, product of:
                1.7055032 = boost
                5.989777 = idf(docFreq=300, maxDocs=44218)
                0.009941478 = queryNorm
              0.32756594 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.989777 = idf(docFreq=300, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.018185616 = weight(abstract_txt:using in 4395) [ClassicSimilarity], result of:
            0.018185616 = score(doc=4395,freq=2.0), product of:
              0.067898095 = queryWeight, product of:
                1.9721467 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.009941478 = queryNorm
              0.2678369 = fieldWeight in 4395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.038634308 = weight(abstract_txt:methods in 4395) [ClassicSimilarity], result of:
            0.038634308 = score(doc=4395,freq=1.0), product of:
              0.17036368 = queryWeight, product of:
                4.132549 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.009941478 = queryNorm
              0.2267755 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.9212484 = weight(abstract_txt:lemmatization in 4395) [ClassicSimilarity], result of:
            0.9212484 = score(doc=4395,freq=6.0), product of:
              0.6943093 = queryWeight, product of:
                7.050858 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.009941478 = queryNorm
              1.3268559 = fieldWeight in 4395, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
        0.28 = coord(7/25)
    
  2. Airio, E.; Kettunen, K.: Does dictionary based bilingual retrieval work in a non-normalized index? (2009) 0.16
    0.16000608 = sum of:
      0.16000608 = product of:
        0.666692 = sum of:
          0.036553625 = weight(abstract_txt:stemming in 4224) [ClassicSimilarity], result of:
            0.036553625 = score(doc=4224,freq=1.0), product of:
              0.07852138 = queryWeight, product of:
                1.0604115 = boost
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.009941478 = queryNorm
              0.4655245 = fieldWeight in 4224, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.0625 = fieldNorm(doc=4224)
          0.08283379 = weight(abstract_txt:finnish in 4224) [ClassicSimilarity], result of:
            0.08283379 = score(doc=4224,freq=4.0), product of:
              0.08534 = queryWeight, product of:
                1.105495 = boost
                7.7650614 = idf(docFreq=50, maxDocs=44218)
                0.009941478 = queryNorm
              0.9706327 = fieldWeight in 4224, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.7650614 = idf(docFreq=50, maxDocs=44218)
                0.0625 = fieldNorm(doc=4224)
          0.08478912 = weight(abstract_txt:swedish in 4224) [ClassicSimilarity], result of:
            0.08478912 = score(doc=4224,freq=4.0), product of:
              0.08667776 = queryWeight, product of:
                1.114126 = boost
                7.825686 = idf(docFreq=47, maxDocs=44218)
                0.009941478 = queryNorm
              0.97821075 = fieldWeight in 4224, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.825686 = idf(docFreq=47, maxDocs=44218)
                0.0625 = fieldNorm(doc=4224)
          0.01085131 = weight(abstract_txt:most in 4224) [ClassicSimilarity], result of:
            0.01085131 = score(doc=4224,freq=1.0), product of:
              0.044024967 = queryWeight, product of:
                1.12291 = boost
                3.943693 = idf(docFreq=2328, maxDocs=44218)
                0.009941478 = queryNorm
              0.24648081 = fieldWeight in 4224, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.943693 = idf(docFreq=2328, maxDocs=44218)
                0.0625 = fieldNorm(doc=4224)
          0.021837773 = weight(abstract_txt:indexing in 4224) [ClassicSimilarity], result of:
            0.021837773 = score(doc=4224,freq=1.0), product of:
              0.08033046 = queryWeight, product of:
                1.8577241 = boost
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.009941478 = queryNorm
              0.27184922 = fieldWeight in 4224, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.0625 = fieldNorm(doc=4224)
          0.42982638 = weight(abstract_txt:lemmatization in 4224) [ClassicSimilarity], result of:
            0.42982638 = score(doc=4224,freq=1.0), product of:
              0.6943093 = queryWeight, product of:
                7.050858 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.009941478 = queryNorm
              0.6190705 = fieldWeight in 4224, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.0625 = fieldNorm(doc=4224)
        0.24 = coord(6/25)
    
  3. Hahn, J.: Semi-automated methods for BIBFRAME work entity description (2021) 0.14
    0.13622697 = sum of:
      0.13622697 = product of:
        0.6811348 = sum of:
          0.03357429 = weight(abstract_txt:subject in 725) [ClassicSimilarity], result of:
            0.03357429 = score(doc=725,freq=2.0), product of:
              0.06481493 = queryWeight, product of:
                1.6687014 = boost
                3.9070187 = idf(docFreq=2415, maxDocs=44218)
                0.009941478 = queryNorm
              0.5180024 = fieldWeight in 725, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9070187 = idf(docFreq=2415, maxDocs=44218)
                0.09375 = fieldNorm(doc=725)
          0.03275666 = weight(abstract_txt:indexing in 725) [ClassicSimilarity], result of:
            0.03275666 = score(doc=725,freq=1.0), product of:
              0.08033046 = queryWeight, product of:
                1.8577241 = boost
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.009941478 = queryNorm
              0.40777382 = fieldWeight in 725, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.09375 = fieldNorm(doc=725)
          0.16172992 = weight(abstract_txt:automated in 725) [ClassicSimilarity], result of:
            0.16172992 = score(doc=725,freq=3.0), product of:
              0.1777515 = queryWeight, product of:
                3.1909287 = boost
                5.6033173 = idf(docFreq=442, maxDocs=44218)
                0.009941478 = queryNorm
              0.90986526 = fieldWeight in 725, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.6033173 = idf(docFreq=442, maxDocs=44218)
                0.09375 = fieldNorm(doc=725)
          0.066230245 = weight(abstract_txt:methods in 725) [ClassicSimilarity], result of:
            0.066230245 = score(doc=725,freq=1.0), product of:
              0.17036368 = queryWeight, product of:
                4.132549 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.009941478 = queryNorm
              0.388758 = fieldWeight in 725, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.09375 = fieldNorm(doc=725)
          0.3868437 = weight(abstract_txt:annif in 725) [ClassicSimilarity], result of:
            0.3868437 = score(doc=725,freq=1.0), product of:
              0.41658556 = queryWeight, product of:
                4.2305145 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.009941478 = queryNorm
              0.9286057 = fieldWeight in 725, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.09375 = fieldNorm(doc=725)
        0.2 = coord(5/25)
    
  4. Ahlgren, P.; Kekäläinen, J.: Indexing strategies for Swedish full text retrieval under different user scenarios (2007) 0.13
    0.12551855 = sum of:
      0.12551855 = product of:
        0.522994 = sum of:
          0.05995496 = weight(abstract_txt:swedish in 896) [ClassicSimilarity], result of:
            0.05995496 = score(doc=896,freq=2.0), product of:
              0.08667776 = queryWeight, product of:
                1.114126 = boost
                7.825686 = idf(docFreq=47, maxDocs=44218)
                0.009941478 = queryNorm
              0.69169945 = fieldWeight in 896, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.825686 = idf(docFreq=47, maxDocs=44218)
                0.0625 = fieldNorm(doc=896)
          0.06915156 = weight(abstract_txt:analyzer in 896) [ClassicSimilarity], result of:
            0.06915156 = score(doc=896,freq=1.0), product of:
              0.12010717 = queryWeight, product of:
                1.3114898 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.009941478 = queryNorm
              0.5757488 = fieldWeight in 896, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.0625 = fieldNorm(doc=896)
          0.043675546 = weight(abstract_txt:indexing in 896) [ClassicSimilarity], result of:
            0.043675546 = score(doc=896,freq=4.0), product of:
              0.08033046 = queryWeight, product of:
                1.8577241 = boost
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.009941478 = queryNorm
              0.54369843 = fieldWeight in 896, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.0625 = fieldNorm(doc=896)
          0.15446983 = weight(abstract_txt:normalization in 896) [ClassicSimilarity], result of:
            0.15446983 = score(doc=896,freq=4.0), product of:
              0.16289961 = queryWeight, product of:
                2.1600087 = boost
                7.5860133 = idf(docFreq=60, maxDocs=44218)
                0.009941478 = queryNorm
              0.94825166 = fieldWeight in 896, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.5860133 = idf(docFreq=60, maxDocs=44218)
                0.0625 = fieldNorm(doc=896)
          0.14894448 = weight(abstract_txt:baseline in 896) [ClassicSimilarity], result of:
            0.14894448 = score(doc=896,freq=3.0), product of:
              0.20031673 = queryWeight, product of:
                2.9335918 = boost
                6.8685737 = idf(docFreq=124, maxDocs=44218)
                0.009941478 = queryNorm
              0.7435449 = fieldWeight in 896, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.8685737 = idf(docFreq=124, maxDocs=44218)
                0.0625 = fieldNorm(doc=896)
          0.046797574 = weight(abstract_txt:text in 896) [ClassicSimilarity], result of:
            0.046797574 = score(doc=896,freq=1.0), product of:
              0.18515971 = queryWeight, product of:
                4.6057324 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.009941478 = queryNorm
              0.25274166 = fieldWeight in 896, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=896)
        0.24 = coord(6/25)
    
  5. Galvez, C.; Moya-Anegón, F. de: An evaluation of conflation accuracy using finite-state transducers (2006) 0.12
    0.11958055 = sum of:
      0.11958055 = product of:
        0.74737847 = sum of:
          0.018370247 = weight(abstract_txt:using in 5599) [ClassicSimilarity], result of:
            0.018370247 = score(doc=5599,freq=1.0), product of:
              0.067898095 = queryWeight, product of:
                1.9721467 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.009941478 = queryNorm
              0.27055615 = fieldWeight in 5599, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.078125 = fieldNorm(doc=5599)
          0.13653332 = weight(abstract_txt:normalization in 5599) [ClassicSimilarity], result of:
            0.13653332 = score(doc=5599,freq=2.0), product of:
              0.16289961 = queryWeight, product of:
                2.1600087 = boost
                7.5860133 = idf(docFreq=60, maxDocs=44218)
                0.009941478 = queryNorm
              0.83814394 = fieldWeight in 5599, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5860133 = idf(docFreq=60, maxDocs=44218)
                0.078125 = fieldNorm(doc=5599)
          0.055191867 = weight(abstract_txt:methods in 5599) [ClassicSimilarity], result of:
            0.055191867 = score(doc=5599,freq=1.0), product of:
              0.17036368 = queryWeight, product of:
                4.132549 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.009941478 = queryNorm
              0.32396498 = fieldWeight in 5599, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.078125 = fieldNorm(doc=5599)
          0.537283 = weight(abstract_txt:lemmatization in 5599) [ClassicSimilarity], result of:
            0.537283 = score(doc=5599,freq=1.0), product of:
              0.6943093 = queryWeight, product of:
                7.050858 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.009941478 = queryNorm
              0.7738381 = fieldWeight in 5599, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.078125 = fieldNorm(doc=5599)
        0.16 = coord(4/25)
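
The content scores above combine several query terms. The following is a minimal sketch of how the listed factors fit together, assuming the usual ClassicSimilarity structure (per-term score = queryWeight * fieldWeight, document score = sum of term scores * coord); the function names are illustrative, and the boost, idf, queryNorm and fieldNorm values are copied from the listing, not derived.

```python
def term_score(boost: float, idf: float, query_norm: float,
               tf: float, field_norm: float) -> float:
    """Score contribution of one matched query term."""
    query_weight = boost * idf * query_norm  # query-side factor
    field_weight = tf * idf * field_norm     # document-side factor
    return query_weight * field_weight


def document_score(term_scores: list[float], matched: int, total_clauses: int) -> float:
    """Sum of term scores scaled by coord = matched / total query clauses."""
    return sum(term_scores) * (matched / total_clauses)


# Entry 5 (doc=5599): four of the 25 query terms match.
QUERY_NORM = 0.009941478
FIELD_NORM = 0.078125
scores = [
    term_score(1.9721467, 3.4631186, QUERY_NORM, 1.0, FIELD_NORM),        # using
    term_score(2.1600087, 7.5860133, QUERY_NORM, 1.4142135, FIELD_NORM),  # normalization
    term_score(4.132549, 4.146752, QUERY_NORM, 1.0, FIELD_NORM),          # methods
    term_score(7.050858, 9.905128, QUERY_NORM, 1.0, FIELD_NORM),          # lemmatization
]
print(document_score(scores, matched=4, total_clauses=25))
```

The coord factor penalizes documents that match only a few of the 25 query terms, which is why entry 5 ranks last even though its raw sum is larger than that of entries 2 to 4.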