Document (#40956)

Author
Grün, S.
Title
Mehrwortbegriffe und Latent Semantic Analysis : Bewertung automatisch extrahierter Mehrwortgruppen mit LSA
Imprint
Düsseldorf : Heinrich-Heine-Universität / Philosophische Fakultät / Institut für Sprache und Information
Year
2017
Pages
67 pp.
Abstract
This study examines the potential of multi-word terms for information retrieval. The aim of the work is to use Latent Semantic Analysis (LSA) to assign higher weights to candidates that were judged positive in an intellectual (manual) assessment than to candidates judged negative, so that the positive candidates are preferred in an information retrieval ranking. A version of the social science GIRT database (German Indexing and Retrieval Testdatabase) served as the collection. The automatic indexing system Lingo was used to identify candidates for multi-word terms. The core functions required were lemmatization, identification of compounds, algorithmic multi-word recognition, and weighting of index terms by the LSA model. The multi-word candidates recognized by Lingo and weighted by LSA were then evaluated in three steps. First, positive and negative multi-word candidates were selected intellectually. Second, the yield was calculated to obtain the proportion of positive multi-word candidates. Third, R-precision was used to calculate how many of the positively judged multi-word candidates made it to position k of the ranking. The yield of positive multi-word candidates averaged about 39%, while R-precision reached an average of 54%. The LSA model thus achieves an ambivalent result with a positive tendency.
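The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the textbook definitions: yield as the share of positively judged candidates among all candidates, and R-precision as the precision at rank R, where R is the number of positively judged candidates. The abstract does not spell out the exact procedure used in the thesis, so reading "position k" as k = R is an assumption, and the function names and toy data are hypothetical.

```python
def yield_rate(judgements):
    """Share of candidates judged positive (True) among all candidates."""
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)


def r_precision(ranked_judgements):
    """Precision at rank R, where R is the number of positive candidates.

    `ranked_judgements` must be ordered by descending weight
    (here: the LSA weight of each multi-word candidate).
    """
    r = sum(ranked_judgements)
    if r == 0:
        return 0.0
    return sum(ranked_judgements[:r]) / r


# Hypothetical toy ranking, not data from the thesis:
# True = candidate judged positive, False = judged negative.
ranked = [True, True, False, True, False, False, True, False, False, False]

print(yield_rate(ranked))   # 0.4
print(r_precision(ranked))  # 0.75
```

In this toy ranking four of ten candidates are positive, and three of them appear within the first four positions, which gives a yield of 0.4 and an R-precision of 0.75.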
Footnote
Master's thesis, degree program in Information Science and Language Technology (Informationswissenschaft und Sprachtechnologie), Institut für Sprache und Information, Philosophische Fakultät, Heinrich-Heine-Universität Düsseldorf
Theme
Automatic indexing (Automatisches Indexieren)
Object
Lingo
Latent Semantic Indexing
GIRT

Similar documents (content)

  1. Grün, S.: Bildung von Komposita-Indextermen auf der Basis einer algorithmischen Mehrwortgruppenanalyse mit Lingo (2015) 0.39
    0.38879168 = sum of:
      0.38879168 = product of:
        1.3885417 = sum of:
          0.099840246 = weight(abstract_txt:bewerteten in 2336) [ClassicSimilarity], result of:
            0.099840246 = score(doc=2336,freq=1.0), product of:
              0.13500953 = queryWeight, product of:
                1.0521685 = boost
                9.465666 = idf(docFreq=8, maxDocs=42740)
                0.013555887 = queryNorm
              0.7395052 = fieldWeight in 2336, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.465666 = idf(docFreq=8, maxDocs=42740)
                0.078125 = fieldNorm(doc=2336)
          0.24150772 = weight(abstract_txt:komposita in 2336) [ClassicSimilarity], result of:
            0.24150772 = score(doc=2336,freq=5.0), product of:
              0.14227371 = queryWeight, product of:
                1.0801036 = boost
                9.71698 = idf(docFreq=6, maxDocs=42740)
                0.013555887 = queryNorm
              1.6974866 = fieldWeight in 2336, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.71698 = idf(docFreq=6, maxDocs=42740)
                0.078125 = fieldNorm(doc=2336)
          0.22645546 = weight(abstract_txt:mehrwortgruppen in 2336) [ClassicSimilarity], result of:
            0.22645546 = score(doc=2336,freq=4.0), product of:
              0.1468236 = queryWeight, product of:
                1.0972384 = boost
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.013555887 = queryNorm
              1.5423642 = fieldWeight in 2336, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.078125 = fieldNorm(doc=2336)
          0.13440788 = weight(abstract_txt:positiv in 2336) [ClassicSimilarity], result of:
            0.13440788 = score(doc=2336,freq=1.0), product of:
              0.20738968 = queryWeight, product of:
                1.8442154 = boost
                8.295595 = idf(docFreq=28, maxDocs=42740)
                0.013555887 = queryNorm
              0.64809334 = fieldWeight in 2336, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.295595 = idf(docFreq=28, maxDocs=42740)
                0.078125 = fieldNorm(doc=2336)
          0.18202206 = weight(abstract_txt:lingo in 2336) [ClassicSimilarity], result of:
            0.18202206 = score(doc=2336,freq=1.0), product of:
              0.25385556 = queryWeight, product of:
                2.0403817 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.013555887 = queryNorm
              0.71703005 = fieldWeight in 2336, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.078125 = fieldNorm(doc=2336)
          0.089853115 = weight(abstract_txt:wurde in 2336) [ClassicSimilarity], result of:
            0.089853115 = score(doc=2336,freq=3.0), product of:
              0.13851464 = queryWeight, product of:
                2.1314783 = boost
                4.793876 = idf(docFreq=961, maxDocs=42740)
                0.013555887 = queryNorm
              0.6486904 = fieldWeight in 2336, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.793876 = idf(docFreq=961, maxDocs=42740)
                0.078125 = fieldNorm(doc=2336)
          0.41445524 = weight(abstract_txt:kandidaten in 2336) [ClassicSimilarity], result of:
            0.41445524 = score(doc=2336,freq=1.0), product of:
              0.55356133 = queryWeight, product of:
                4.261043 = boost
                9.583449 = idf(docFreq=7, maxDocs=42740)
                0.013555887 = queryNorm
              0.748707 = fieldWeight in 2336, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.583449 = idf(docFreq=7, maxDocs=42740)
                0.078125 = fieldNorm(doc=2336)
        0.28 = coord(7/25)
    
  2. Bredack, J.: Terminologieextraktion von Mehrwortgruppen in kunsthistorischen Fachtexten (2013) 0.16
    0.15780222 = sum of:
      0.15780222 = product of:
        0.65750927 = sum of:
          0.06826643 = weight(abstract_txt:algorithmische in 3055) [ClassicSimilarity], result of:
            0.06826643 = score(doc=3055,freq=2.0), product of:
              0.13202073 = queryWeight, product of:
                1.040457 = boost
                9.360306 = idf(docFreq=9, maxDocs=42740)
                0.013555887 = queryNorm
              0.5170887 = fieldWeight in 3055, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.360306 = idf(docFreq=9, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3055)
          0.2041242 = weight(abstract_txt:mehrwortgruppen in 3055) [ClassicSimilarity], result of:
            0.2041242 = score(doc=3055,freq=13.0), product of:
              0.1468236 = queryWeight, product of:
                1.0972384 = boost
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.013555887 = queryNorm
              1.3902683 = fieldWeight in 3055, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3055)
          0.010387077 = weight(abstract_txt:retrieval in 3055) [ClassicSimilarity], result of:
            0.010387077 = score(doc=3055,freq=2.0), product of:
              0.05426752 = queryWeight, product of:
                1.1554035 = boost
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.013555887 = queryNorm
              0.19140504 = fieldWeight in 3055, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3055)
          0.18202206 = weight(abstract_txt:lingo in 3055) [ClassicSimilarity], result of:
            0.18202206 = score(doc=3055,freq=4.0), product of:
              0.25385556 = queryWeight, product of:
                2.0403817 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.013555887 = queryNorm
              0.71703005 = fieldWeight in 3055, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3055)
          0.036682382 = weight(abstract_txt:wurde in 3055) [ClassicSimilarity], result of:
            0.036682382 = score(doc=3055,freq=2.0), product of:
              0.13851464 = queryWeight, product of:
                2.1314783 = boost
                4.793876 = idf(docFreq=961, maxDocs=42740)
                0.013555887 = queryNorm
              0.26482674 = fieldWeight in 3055, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.793876 = idf(docFreq=961, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3055)
          0.15602715 = weight(abstract_txt:positiven in 3055) [ClassicSimilarity], result of:
            0.15602715 = score(doc=3055,freq=1.0), product of:
              0.4581427 = queryWeight, product of:
                3.8764434 = boost
                8.7184515 = idf(docFreq=18, maxDocs=42740)
                0.013555887 = queryNorm
              0.34056452 = fieldWeight in 3055, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.7184515 = idf(docFreq=18, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3055)
        0.24 = coord(6/25)
    
  3. Engel, F.: Expertensuche in semantisch integrierten Datenbeständen (2015) 0.08
    0.07844104 = sum of:
      0.07844104 = product of:
        0.6536753 = sum of:
          0.018182602 = weight(abstract_txt:semantic in 4284) [ClassicSimilarity], result of:
            0.018182602 = score(doc=4284,freq=2.0), product of:
              0.06097669 = queryWeight, product of:
                4.4981704 = idf(docFreq=1292, maxDocs=42740)
                0.013555887 = queryNorm
              0.29818937 = fieldWeight in 4284, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4981704 = idf(docFreq=1292, maxDocs=42740)
                0.046875 = fieldNorm(doc=4284)
          0.07944269 = weight(abstract_txt:berechnung in 4284) [ClassicSimilarity], result of:
            0.07944269 = score(doc=4284,freq=2.0), product of:
              0.12934583 = queryWeight, product of:
                1.0298626 = boost
                9.264996 = idf(docFreq=10, maxDocs=42740)
                0.013555887 = queryNorm
              0.6141882 = fieldWeight in 4284, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.264996 = idf(docFreq=10, maxDocs=42740)
                0.046875 = fieldNorm(doc=4284)
          0.55605006 = weight(abstract_txt:kandidaten in 4284) [ClassicSimilarity], result of:
            0.55605006 = score(doc=4284,freq=5.0), product of:
              0.55356133 = queryWeight, product of:
                4.261043 = boost
                9.583449 = idf(docFreq=7, maxDocs=42740)
                0.013555887 = queryNorm
              1.0044959 = fieldWeight in 4284, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.583449 = idf(docFreq=7, maxDocs=42740)
                0.046875 = fieldNorm(doc=4284)
        0.12 = coord(3/25)
    
  4. Lepsky, K.; Vorhauer, J.: Lingo - ein open source System für die Automatische Indexierung deutschsprachiger Dokumente (2006) 0.07
    0.06843988 = sum of:
      0.06843988 = product of:
        0.57033235 = sum of:
          0.115851976 = weight(abstract_txt:algorithmische in 4582) [ClassicSimilarity], result of:
            0.115851976 = score(doc=4582,freq=1.0), product of:
              0.13202073 = queryWeight, product of:
                1.040457 = boost
                9.360306 = idf(docFreq=9, maxDocs=42740)
                0.013555887 = queryNorm
              0.87752867 = fieldWeight in 4582, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.360306 = idf(docFreq=9, maxDocs=42740)
                0.09375 = fieldNorm(doc=4582)
          0.017627452 = weight(abstract_txt:retrieval in 4582) [ClassicSimilarity], result of:
            0.017627452 = score(doc=4582,freq=1.0), product of:
              0.05426752 = queryWeight, product of:
                1.1554035 = boost
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.013555887 = queryNorm
              0.3248251 = fieldWeight in 4582, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.09375 = fieldNorm(doc=4582)
          0.43685293 = weight(abstract_txt:lingo in 4582) [ClassicSimilarity], result of:
            0.43685293 = score(doc=4582,freq=4.0), product of:
              0.25385556 = queryWeight, product of:
                2.0403817 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.013555887 = queryNorm
              1.720872 = fieldWeight in 4582, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.09375 = fieldNorm(doc=4582)
        0.12 = coord(3/25)
    
  5. Bredack, J.; Lepsky, K.: Automatische Extraktion von Fachterminologie aus Volltexten (2014) 0.07
    0.06619653 = sum of:
      0.06619653 = product of:
        0.55163777 = sum of:
          0.22417946 = weight(abstract_txt:mehrwortgruppen in 1873) [ClassicSimilarity], result of:
            0.22417946 = score(doc=1873,freq=2.0), product of:
              0.1468236 = queryWeight, product of:
                1.0972384 = boost
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.013555887 = queryNorm
              1.5268626 = fieldWeight in 1873, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.109375 = fieldNorm(doc=1873)
          0.25483087 = weight(abstract_txt:lingo in 1873) [ClassicSimilarity], result of:
            0.25483087 = score(doc=1873,freq=1.0), product of:
              0.25385556 = queryWeight, product of:
                2.0403817 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.013555887 = queryNorm
              1.003842 = fieldWeight in 1873, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.109375 = fieldNorm(doc=1873)
          0.07262741 = weight(abstract_txt:wurde in 1873) [ClassicSimilarity], result of:
            0.07262741 = score(doc=1873,freq=1.0), product of:
              0.13851464 = queryWeight, product of:
                2.1314783 = boost
                4.793876 = idf(docFreq=961, maxDocs=42740)
                0.013555887 = queryNorm
              0.5243302 = fieldWeight in 1873, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.793876 = idf(docFreq=961, maxDocs=42740)
                0.109375 = fieldNorm(doc=1873)
        0.12 = coord(3/25)
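
The score trees above can be recomputed by hand. The following Python sketch reads the formulas off the labels in the listing: tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1)), queryWeight as boost × idf × queryNorm, fieldWeight as tf × idf × fieldNorm, and the final score as the sum of term scores times coord. The function names are hypothetical, and the formulas are inferred from the listing rather than taken from the search engine's source. They reproduce the printed values up to rounding.

```python
import math

# Constants copied from the listing above.
QUERY_NORM = 0.013555887
MAX_DOCS = 42740


def idf(doc_freq, max_docs):
    """Inverse document frequency as it appears in the listing."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one query term in one document: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq, MAX_DOCS)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = tf * term_idf * field_norm
    return query_weight * field_weight


# Term "komposita" in document 2336 (hit 1): freq=5, docFreq=6.
komposita = term_score(freq=5.0, doc_freq=6, boost=1.0801036, field_norm=0.078125)
print(round(komposita, 4))  # 0.2415

# Final score of hit 1: sum of the seven matching term scores,
# scaled by coord(7/25) because 7 of 25 query terms matched.
term_scores = [0.099840246, 0.24150772, 0.22645546, 0.13440788,
               0.18202206, 0.089853115, 0.41445524]
total = sum(term_scores) * (7 / 25)
print(round(total, 4))  # 0.3888
```

The coord factor explains why hit 3 ranks well below hit 1 despite a high score for "kandidaten": only 3 of the 25 query terms match, so its term sum is scaled by 0.12 instead of 0.28.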