Document (#40955)

Author
Grün, S.
Title
Mehrwortbegriffe und Latent Semantic Analysis : Bewertung automatisch extrahierter Mehrwortgruppen mit LSA
Imprint
Düsseldorf : Heinrich-Heine-Universität / Philosophische Fakultät / Institut für Sprache und Information
Year
2017
Pages
67 pp.
Abstract
This study examines the potential of multi-word terms for information retrieval. The aim is to use Latent Semantic Analysis (LSA) to weight candidates that human assessors judged positively higher than candidates judged negatively, so that the positive candidates are preferred in an information retrieval ranking. The collection used was a version of the social science GIRT database (German Indexing and Retrieval Testdatabase). Candidates for multi-word terms were identified with the automatic indexing system Lingo. The core functions required were lemmatization, identification of compounds, algorithmic multi-word recognition, and weighting of index terms by the LSA model. The multi-word candidates recognized by Lingo and weighted by LSA were then evaluated in three steps. First, positive and negative multi-word candidates were selected by intellectual (manual) assessment. Second, the yield was calculated to obtain the proportion of positive multi-word candidates. Third, R-precision was used to determine how many positively judged multi-word candidates reached position k of the ranking. The yield of positive multi-word candidates averaged about 39%, while R-precision reached an average of 54%. The LSA model thus achieves an ambivalent result with a positive tendency.
Footnote
Master's thesis, Information Science and Language Technology program, Institut für Sprache und Information, Philosophische Fakultät, Heinrich-Heine-Universität Düsseldorf
Theme
Automatisches Indexieren
Object
Lingo
Latent Semantic Indexing
GIRT
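
The two evaluation measures named in the abstract, yield and R-precision, can be illustrated with a minimal sketch. The record does not give the thesis's exact formulas, so this assumes the common textbook definitions: yield is the share of candidates judged positive, and R-precision is the share of positive candidates among the top R ranks, where R is the total number of positive candidates. The function names and the sample ranking are hypothetical and are not data from the thesis.

```python
from typing import Sequence


def yield_rate(judgments: Sequence[bool]) -> float:
    """Share of all candidates that were judged positive (assumed definition)."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


def r_precision(ranked_judgments: Sequence[bool]) -> float:
    """Precision at rank R, where R is the number of positive candidates.

    ranked_judgments lists the manual judgments in ranking order,
    best-weighted candidate first (assumed standard definition).
    """
    r = sum(ranked_judgments)
    if r == 0:
        return 0.0
    return sum(ranked_judgments[:r]) / r


# Invented example: ten candidates ordered by their weight,
# True = judged positive, False = judged negative.
ranking = [True, False, True, True, False, False, True, False, False, False]

print("yield:", yield_rate(ranking))
print("R-precision:", r_precision(ranking))
```

In this invented ranking, four of ten candidates are positive and three of them sit in the top four ranks, so a weighting that pushes positive candidates upward raises R-precision while the yield stays fixed.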

Similar documents (content)

  1. Grün, S.: Bildung von Komposita-Indextermen auf der Basis einer algorithmischen Mehrwortgruppenanalyse mit Lingo (2015) 0.39
    0.3892117 = sum of:
      0.3892117 = product of:
        1.3900418 = sum of:
          0.10033656 = weight(abstract_txt:bewerteten in 1335) [ClassicSimilarity], result of:
            0.10033656 = score(doc=1335,freq=1.0), product of:
              0.1351951 = queryWeight, product of:
                1.0568289 = boost
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.013466296 = queryNorm
              0.74216115 = fieldWeight in 1335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.078125 = fieldNorm(doc=1335)
          0.24264093 = weight(abstract_txt:komposita in 1335) [ClassicSimilarity], result of:
            0.24264093 = score(doc=1335,freq=5.0), product of:
              0.14244293 = queryWeight, product of:
                1.0847874 = boost
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.013466296 = queryNorm
              1.7034256 = fieldWeight in 1335, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.078125 = fieldNorm(doc=1335)
          0.22748089 = weight(abstract_txt:mehrwortgruppen in 1335) [ClassicSimilarity], result of:
            0.22748089 = score(doc=1335,freq=4.0), product of:
              0.14698222 = queryWeight, product of:
                1.1019365 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.013466296 = queryNorm
              1.5476762 = fieldWeight in 1335, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.078125 = fieldNorm(doc=1335)
          0.13205723 = weight(abstract_txt:positiv in 1335) [ClassicSimilarity], result of:
            0.13205723 = score(doc=1335,freq=1.0), product of:
              0.20456892 = queryWeight, product of:
                1.8384804 = boost
                8.2629 = idf(docFreq=30, maxDocs=44218)
                0.013466296 = queryNorm
              0.6455391 = fieldWeight in 1335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.2629 = idf(docFreq=30, maxDocs=44218)
                0.078125 = fieldNorm(doc=1335)
          0.18298846 = weight(abstract_txt:lingo in 1335) [ClassicSimilarity], result of:
            0.18298846 = score(doc=1335,freq=1.0), product of:
              0.25426152 = queryWeight, product of:
                2.049649 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.013466296 = queryNorm
              0.71968603 = fieldWeight in 1335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.078125 = fieldNorm(doc=1335)
          0.088077165 = weight(abstract_txt:wurde in 1335) [ClassicSimilarity], result of:
            0.088077165 = score(doc=1335,freq=3.0), product of:
              0.13641956 = queryWeight, product of:
                2.1232078 = boost
                4.771292 = idf(docFreq=1017, maxDocs=44218)
                0.013466296 = queryNorm
              0.6456344 = fieldWeight in 1335, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.771292 = idf(docFreq=1017, maxDocs=44218)
                0.078125 = fieldNorm(doc=1335)
          0.41646057 = weight(abstract_txt:kandidaten in 1335) [ClassicSimilarity], result of:
            0.41646057 = score(doc=1335,freq=1.0), product of:
              0.5542735 = queryWeight, product of:
                4.2797284 = boost
                9.617446 = idf(docFreq=7, maxDocs=44218)
                0.013466296 = queryNorm
              0.751363 = fieldWeight in 1335, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.617446 = idf(docFreq=7, maxDocs=44218)
                0.078125 = fieldNorm(doc=1335)
        0.28 = coord(7/25)
    
  2. Bredack, J.: Terminologieextraktion von Mehrwortgruppen in kunsthistorischen Fachtexten (2013) 0.16
    0.1567954 = sum of:
      0.1567954 = product of:
        0.6533142 = sum of:
          0.06469619 = weight(abstract_txt:algorithmische in 1054) [ClassicSimilarity], result of:
            0.06469619 = score(doc=1054,freq=2.0), product of:
              0.12713076 = queryWeight, product of:
                1.0248245 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.013466296 = queryNorm
              0.50889486 = fieldWeight in 1054, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1054)
          0.20504849 = weight(abstract_txt:mehrwortgruppen in 1054) [ClassicSimilarity], result of:
            0.20504849 = score(doc=1054,freq=13.0), product of:
              0.14698222 = queryWeight, product of:
                1.1019365 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.013466296 = queryNorm
              1.3950564 = fieldWeight in 1054, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1054)
          0.010419757 = weight(abstract_txt:retrieval in 1054) [ClassicSimilarity], result of:
            0.010419757 = score(doc=1054,freq=2.0), product of:
              0.054276314 = queryWeight, product of:
                1.1598183 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.013466296 = queryNorm
              0.19197613 = fieldWeight in 1054, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1054)
          0.18298846 = weight(abstract_txt:lingo in 1054) [ClassicSimilarity], result of:
            0.18298846 = score(doc=1054,freq=4.0), product of:
              0.25426152 = queryWeight, product of:
                2.049649 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.013466296 = queryNorm
              0.71968603 = fieldWeight in 1054, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1054)
          0.03595735 = weight(abstract_txt:wurde in 1054) [ClassicSimilarity], result of:
            0.03595735 = score(doc=1054,freq=2.0), product of:
              0.13641956 = queryWeight, product of:
                2.1232078 = boost
                4.771292 = idf(docFreq=1017, maxDocs=44218)
                0.013466296 = queryNorm
              0.26357913 = fieldWeight in 1054, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.771292 = idf(docFreq=1017, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1054)
          0.15420389 = weight(abstract_txt:positiven in 1054) [ClassicSimilarity], result of:
            0.15420389 = score(doc=1054,freq=1.0), product of:
              0.45368916 = queryWeight, product of:
                3.8719823 = boost
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.013466296 = queryNorm
              0.33988887 = fieldWeight in 1054, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1054)
        0.24 = coord(6/25)
    
  3. Lepsky, K.; Vorhauer, J.: Lingo - ein open source System für die Automatische Indexierung deutschsprachiger Dokumente (2006) 0.07
    0.0679978 = sum of:
      0.0679978 = product of:
        0.5666483 = sum of:
          0.109793074 = weight(abstract_txt:algorithmische in 3581) [ClassicSimilarity], result of:
            0.109793074 = score(doc=3581,freq=1.0), product of:
              0.12713076 = queryWeight, product of:
                1.0248245 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.013466296 = queryNorm
              0.8636232 = fieldWeight in 3581, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.09375 = fieldNorm(doc=3581)
          0.017682914 = weight(abstract_txt:retrieval in 3581) [ClassicSimilarity], result of:
            0.017682914 = score(doc=3581,freq=1.0), product of:
              0.054276314 = queryWeight, product of:
                1.1598183 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.013466296 = queryNorm
              0.3257943 = fieldWeight in 3581, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.09375 = fieldNorm(doc=3581)
          0.4391723 = weight(abstract_txt:lingo in 3581) [ClassicSimilarity], result of:
            0.4391723 = score(doc=3581,freq=4.0), product of:
              0.25426152 = queryWeight, product of:
                2.049649 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.013466296 = queryNorm
              1.7272464 = fieldWeight in 3581, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.09375 = fieldNorm(doc=3581)
        0.12 = coord(3/25)
    
  4. Bredack, J.; Lepsky, K.: Automatische Extraktion von Fachterminologie aus Volltexten (2014) 0.07
    0.06630844 = sum of:
      0.06630844 = product of:
        0.55257034 = sum of:
          0.22519457 = weight(abstract_txt:mehrwortgruppen in 4872) [ClassicSimilarity], result of:
            0.22519457 = score(doc=4872,freq=2.0), product of:
              0.14698222 = queryWeight, product of:
                1.1019365 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.013466296 = queryNorm
              1.5321212 = fieldWeight in 4872, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.109375 = fieldNorm(doc=4872)
          0.25618383 = weight(abstract_txt:lingo in 4872) [ClassicSimilarity], result of:
            0.25618383 = score(doc=4872,freq=1.0), product of:
              0.25426152 = queryWeight, product of:
                2.049649 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.013466296 = queryNorm
              1.0075604 = fieldWeight in 4872, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.109375 = fieldNorm(doc=4872)
          0.07119192 = weight(abstract_txt:wurde in 4872) [ClassicSimilarity], result of:
            0.07119192 = score(doc=4872,freq=1.0), product of:
              0.13641956 = queryWeight, product of:
                2.1232078 = boost
                4.771292 = idf(docFreq=1017, maxDocs=44218)
                0.013466296 = queryNorm
              0.52186006 = fieldWeight in 4872, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.771292 = idf(docFreq=1017, maxDocs=44218)
                0.109375 = fieldNorm(doc=4872)
        0.12 = coord(3/25)
    
  5. Informations- und Kommunikationsutopien (2008) 0.05
    0.054090407 = sum of:
      0.054090407 = product of:
        0.4507534 = sum of:
          0.09149423 = weight(abstract_txt:negativen in 213) [ClassicSimilarity], result of:
            0.09149423 = score(doc=213,freq=1.0), product of:
              0.12713076 = queryWeight, product of:
                1.0248245 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.013466296 = queryNorm
              0.71968603 = fieldWeight in 213, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.078125 = fieldNorm(doc=213)
          0.050851375 = weight(abstract_txt:wurde in 213) [ClassicSimilarity], result of:
            0.050851375 = score(doc=213,freq=1.0), product of:
              0.13641956 = queryWeight, product of:
                2.1232078 = boost
                4.771292 = idf(docFreq=1017, maxDocs=44218)
                0.013466296 = queryNorm
              0.3727572 = fieldWeight in 213, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.771292 = idf(docFreq=1017, maxDocs=44218)
                0.078125 = fieldNorm(doc=213)
          0.30840778 = weight(abstract_txt:positiven in 213) [ClassicSimilarity], result of:
            0.30840778 = score(doc=213,freq=1.0), product of:
              0.45368916 = queryWeight, product of:
                3.8719823 = boost
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.013466296 = queryNorm
              0.67977774 = fieldWeight in 213, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.078125 = fieldNorm(doc=213)
        0.12 = coord(3/25)
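
The score breakdowns above are labeled ClassicSimilarity, Lucene's TF-IDF scoring. As a rough sketch of how the printed factors combine, the following recomputes the total for entry 3 (doc=3581) from the values shown in the record. It assumes the usual ClassicSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the helper names are hypothetical, and Python's double precision differs slightly from Lucene's single-precision floats, so the result only agrees up to rounding.

```python
import math

MAX_DOCS = 44218          # maxDocs as printed in the record
QUERY_NORM = 0.013466296  # queryNorm as printed in the record


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def term_score(freq: float, doc_freq: int, boost: float, field_norm: float) -> float:
    """score = queryWeight * fieldWeight for a single query term."""
    i = idf(doc_freq)
    query_weight = boost * i * QUERY_NORM
    field_weight = tf(freq) * i * field_norm
    return query_weight * field_weight


# Matching terms for doc=3581: (term, freq, docFreq, boost), fieldNorm = 0.09375
FIELD_NORM = 0.09375
terms = [
    ("algorithmische", 1.0, 11, 1.0248245),
    ("retrieval", 1.0, 3720, 1.1598183),
    ("lingo", 4.0, 11, 2.049649),
]

raw_sum = sum(term_score(f, df, b, FIELD_NORM) for _, f, df, b in terms)
coord = 3 / 25            # 3 of the 25 query terms matched
total = raw_sum * coord

print(round(total, 7))    # compare with 0.0679978 shown for entry 3
```

The coord factor explains why entry 1 ranks far ahead of the others: it matches 7 of the 25 query terms, while entries 3 to 5 match only 3 each.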