Document (#38538)

Author
Nagy T., I.
Title
Detecting multiword expressions and named entities in natural language texts
Imprint
Szeged : University of Szeged, Faculty of Science and Informatics, Doctoral School of Computer Science
Year
2014
Pages
XVIII,
Abstract
Multiword expressions (MWEs) are lexical items that can be decomposed into single words and display lexical, syntactic, semantic, pragmatic and/or statistical idiosyncrasy (Sag et al., 2002; Kim, 2008; Calzolari et al., 2002). The proper treatment of multiword expressions such as rock 'n' roll and make a decision is essential for many natural language processing (NLP) applications like information extraction and retrieval, terminology extraction and machine translation, and it is important to identify multiword expressions in context. For example, in machine translation we must know that MWEs form one semantic unit, hence their parts should not be translated separately. For this, multiword expressions should be identified first in the text to be translated. The chief aim of this thesis is to develop machine learning-based approaches for the automatic detection of different types of multiword expressions in English and Hungarian natural language texts. In our investigations, we pay attention to the characteristics of different types of multiword expressions such as nominal compounds, multiword named entities and light verb constructions, and we apply novel methods to identify MWEs in raw texts. In the thesis it will be demonstrated that nominal compounds and multiword named entities may require a similar approach for their automatic detection as they behave in the same way from a linguistic point of view. Furthermore, it will be shown that the automatic detection of light verb constructions can be carried out using two effective machine learning-based approaches.
In this thesis, we focused on the automatic detection of multiword expressions in natural language texts. On the basis of the main contributions, we can argue that:
- Supervised machine learning methods can be successfully applied for the automatic detection of different types of multiword expressions in natural language texts.
- Machine learning-based multiword expression detection can be successfully carried out for English as well as for Hungarian.
- Our supervised machine learning-based model was successfully applied to the automatic detection of nominal compounds from English raw texts.
- We developed a Wikipedia-based dictionary labeling method to automatically detect English nominal compounds.
- A prior knowledge of nominal compounds can enhance Named Entity Recognition, while previously identified named entities can assist the nominal compound identification process.
- The machine learning-based method can also provide acceptable results when it is trained on an automatically generated silver standard corpus.
- As named entities form one semantic unit and may consist of more than one word and function as a noun, we can treat them in a similar way to nominal compounds.
- Our sequence labeling-based tool can be successfully applied for identifying verbal light verb constructions in two typologically different languages, namely English and Hungarian.
- Domain adaptation techniques may help diminish the distance between domains in the automatic detection of light verb constructions.
- Our syntax-based method can be successfully applied for the full-coverage identification of light verb constructions. As a first step, a data-driven candidate extraction method can be utilized. Afterwards, a machine learning approach that makes use of an extended and rich feature set selects LVCs among the extracted candidates.
- When a precise syntactic parser is available for the given domain, the full-coverage identification can be performed better. In other cases, the usage of the sequence labeling method is recommended.
Content
Cf.: http://doktori.bibl.u-szeged.hu/2434/1/main.pdf.
Theme
Computational linguistics

Similar documents (content)

  1. Ramisch, C.: Multiword expressions acquisition : a generic and open framework (2015) 0.41
    0.40975374 = sum of:
      0.40975374 = product of:
        1.2804805 = sum of:
          0.029495623 = weight(abstract_txt:language in 2650) [ClassicSimilarity], result of:
            0.029495623 = score(doc=2650,freq=4.0), product of:
              0.056249022 = queryWeight, product of:
                1.1253704 = boost
                4.195006 = idf(docFreq=1744, maxDocs=42596)
                0.011914805 = queryNorm
              0.52437574 = fieldWeight in 2650, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.195006 = idf(docFreq=1744, maxDocs=42596)
                0.0625 = fieldNorm(doc=2650)
          0.026609056 = weight(abstract_txt:natural in 2650) [ClassicSimilarity], result of:
            0.026609056 = score(doc=2650,freq=1.0), product of:
              0.08336484 = queryWeight, product of:
                1.370028 = boost
                5.107008 = idf(docFreq=700, maxDocs=42596)
                0.011914805 = queryNorm
              0.319188 = fieldWeight in 2650, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.107008 = idf(docFreq=700, maxDocs=42596)
                0.0625 = fieldNorm(doc=2650)
          0.06189835 = weight(abstract_txt:texts in 2650) [ClassicSimilarity], result of:
            0.06189835 = score(doc=2650,freq=2.0), product of:
              0.123442985 = queryWeight, product of:
                1.8262568 = boost
                5.6730638 = idf(docFreq=397, maxDocs=42596)
                0.011914805 = queryNorm
              0.5014327 = fieldWeight in 2650, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.6730638 = idf(docFreq=397, maxDocs=42596)
                0.0625 = fieldNorm(doc=2650)
          0.039280646 = weight(abstract_txt:automatic in 2650) [ClassicSimilarity], result of:
            0.039280646 = score(doc=2650,freq=1.0), product of:
              0.12090892 = queryWeight, product of:
                1.9522309 = boost
                5.1980476 = idf(docFreq=639, maxDocs=42596)
                0.011914805 = queryNorm
              0.32487798 = fieldWeight in 2650, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1980476 = idf(docFreq=639, maxDocs=42596)
                0.0625 = fieldNorm(doc=2650)
          0.12567246 = weight(abstract_txt:constructions in 2650) [ClassicSimilarity], result of:
            0.12567246 = score(doc=2650,freq=1.0), product of:
              0.2346695 = queryWeight, product of:
                2.2986155 = boost
                8.568473 = idf(docFreq=21, maxDocs=42596)
                0.011914805 = queryNorm
              0.53552955 = fieldWeight in 2650, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.568473 = idf(docFreq=21, maxDocs=42596)
                0.0625 = fieldNorm(doc=2650)
          0.054827645 = weight(abstract_txt:machine in 2650) [ClassicSimilarity], result of:
            0.054827645 = score(doc=2650,freq=1.0), product of:
              0.1642053 = queryWeight, product of:
                2.579692 = boost
                5.342351 = idf(docFreq=553, maxDocs=42596)
                0.011914805 = queryNorm
              0.33389693 = fieldWeight in 2650, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.342351 = idf(docFreq=553, maxDocs=42596)
                0.0625 = fieldNorm(doc=2650)
          0.25722295 = weight(abstract_txt:expressions in 2650) [ClassicSimilarity], result of:
            0.25722295 = score(doc=2650,freq=5.0), product of:
              0.26911458 = queryWeight, product of:
                3.3025002 = boost
                6.839234 = idf(docFreq=123, maxDocs=42596)
                0.011914805 = queryNorm
              0.955812 = fieldWeight in 2650, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.839234 = idf(docFreq=123, maxDocs=42596)
                0.0625 = fieldNorm(doc=2650)
          0.6854738 = weight(abstract_txt:multiword in 2650) [ClassicSimilarity], result of:
            0.6854738 = score(doc=2650,freq=5.0), product of:
              0.569339 = queryWeight, product of:
                5.5466285 = boost
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.011914805 = queryNorm
              1.2039819 = fieldWeight in 2650, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.0625 = fieldNorm(doc=2650)
        0.32 = coord(8/25)
    
  2. Gödert, W.: Detecting multiword phrases in mathematical text corpora (2012) 0.24
    0.24494024 = sum of:
      0.24494024 = product of:
        1.0205843 = sum of:
          0.03923327 = weight(abstract_txt:method in 1467) [ClassicSimilarity], result of:
            0.03923327 = score(doc=1467,freq=2.0), product of:
              0.06541271 = queryWeight, product of:
                1.2135818 = boost
                4.5238285 = idf(docFreq=1255, maxDocs=42596)
                0.011914805 = queryNorm
              0.59978056 = fieldWeight in 1467, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5238285 = idf(docFreq=1255, maxDocs=42596)
                0.09375 = fieldNorm(doc=1467)
          0.022399258 = weight(abstract_txt:based in 1467) [ClassicSimilarity], result of:
            0.022399258 = score(doc=1467,freq=2.0), product of:
              0.052652806 = queryWeight, product of:
                1.3772372 = boost
                3.2086759 = idf(docFreq=4678, maxDocs=42596)
                0.011914805 = queryNorm
              0.42541432 = fieldWeight in 1467, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.2086759 = idf(docFreq=4678, maxDocs=42596)
                0.09375 = fieldNorm(doc=1467)
          0.09689487 = weight(abstract_txt:named in 1467) [ClassicSimilarity], result of:
            0.09689487 = score(doc=1467,freq=1.0), product of:
              0.15058081 = queryWeight, product of:
                1.8412926 = boost
                6.863725 = idf(docFreq=120, maxDocs=42596)
                0.011914805 = queryNorm
              0.6434742 = fieldWeight in 1467, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.863725 = idf(docFreq=120, maxDocs=42596)
                0.09375 = fieldNorm(doc=1467)
          0.05892097 = weight(abstract_txt:automatic in 1467) [ClassicSimilarity], result of:
            0.05892097 = score(doc=1467,freq=1.0), product of:
              0.12090892 = queryWeight, product of:
                1.9522309 = boost
                5.1980476 = idf(docFreq=639, maxDocs=42596)
                0.011914805 = queryNorm
              0.48731697 = fieldWeight in 1467, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1980476 = idf(docFreq=639, maxDocs=42596)
                0.09375 = fieldNorm(doc=1467)
          0.15283841 = weight(abstract_txt:detection in 1467) [ClassicSimilarity], result of:
            0.15283841 = score(doc=1467,freq=1.0), product of:
              0.23865145 = queryWeight, product of:
                2.9321084 = boost
                6.831202 = idf(docFreq=124, maxDocs=42596)
                0.011914805 = queryNorm
              0.6404252 = fieldWeight in 1467, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.831202 = idf(docFreq=124, maxDocs=42596)
                0.09375 = fieldNorm(doc=1467)
          0.6502975 = weight(abstract_txt:multiword in 1467) [ClassicSimilarity], result of:
            0.6502975 = score(doc=1467,freq=2.0), product of:
              0.569339 = queryWeight, product of:
                5.5466285 = boost
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.011914805 = queryNorm
              1.1421975 = fieldWeight in 1467, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.09375 = fieldNorm(doc=1467)
        0.24 = coord(6/25)
    
  3. Snajder, J.; Almic, P.: Modeling semantic compositionality of Croatian multiword expressions (2015) 0.24
    0.23895665 = sum of:
      0.23895665 = product of:
        0.99565274 = sum of:
          0.022121718 = weight(abstract_txt:language in 3921) [ClassicSimilarity], result of:
            0.022121718 = score(doc=3921,freq=1.0), product of:
              0.056249022 = queryWeight, product of:
                1.1253704 = boost
                4.195006 = idf(docFreq=1744, maxDocs=42596)
                0.011914805 = queryNorm
              0.39328182 = fieldWeight in 3921, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.195006 = idf(docFreq=1744, maxDocs=42596)
                0.09375 = fieldNorm(doc=3921)
          0.039913584 = weight(abstract_txt:natural in 3921) [ClassicSimilarity], result of:
            0.039913584 = score(doc=3921,freq=1.0), product of:
              0.08336484 = queryWeight, product of:
                1.370028 = boost
                5.107008 = idf(docFreq=700, maxDocs=42596)
                0.011914805 = queryNorm
              0.478782 = fieldWeight in 3921, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.107008 = idf(docFreq=700, maxDocs=42596)
                0.09375 = fieldNorm(doc=3921)
          0.027433375 = weight(abstract_txt:based in 3921) [ClassicSimilarity], result of:
            0.027433375 = score(doc=3921,freq=3.0), product of:
              0.052652806 = queryWeight, product of:
                1.3772372 = boost
                3.2086759 = idf(docFreq=4678, maxDocs=42596)
                0.011914805 = queryNorm
              0.521024 = fieldWeight in 3921, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.2086759 = idf(docFreq=4678, maxDocs=42596)
                0.09375 = fieldNorm(doc=3921)
          0.2738039 = weight(abstract_txt:mwes in 3921) [ClassicSimilarity], result of:
            0.2738039 = score(doc=3921,freq=3.0), product of:
              0.17601061 = queryWeight, product of:
                1.5419953 = boost
                9.580074 = idf(docFreq=7, maxDocs=42596)
                0.011914805 = queryNorm
              1.5556102 = fieldWeight in 3921, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.580074 = idf(docFreq=7, maxDocs=42596)
                0.09375 = fieldNorm(doc=3921)
          0.1725504 = weight(abstract_txt:expressions in 3921) [ClassicSimilarity], result of:
            0.1725504 = score(doc=3921,freq=1.0), product of:
              0.26911458 = queryWeight, product of:
                3.3025002 = boost
                6.839234 = idf(docFreq=123, maxDocs=42596)
                0.011914805 = queryNorm
              0.6411782 = fieldWeight in 3921, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.839234 = idf(docFreq=123, maxDocs=42596)
                0.09375 = fieldNorm(doc=3921)
          0.4598298 = weight(abstract_txt:multiword in 3921) [ClassicSimilarity], result of:
            0.4598298 = score(doc=3921,freq=1.0), product of:
              0.569339 = queryWeight, product of:
                5.5466285 = boost
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.011914805 = queryNorm
              0.8076556 = fieldWeight in 3921, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.09375 = fieldNorm(doc=3921)
        0.24 = coord(6/25)
    
  4. Cruys, T. van de; Moirón, B.V.: Semantics-based multiword expression extraction (2007) 0.23
    0.22899054 = sum of:
      0.22899054 = product of:
        0.95412725 = sum of:
          0.043115146 = weight(abstract_txt:extraction in 3920) [ClassicSimilarity], result of:
            0.043115146 = score(doc=3920,freq=1.0), product of:
              0.07402404 = queryWeight, product of:
                6.212778 = idf(docFreq=231, maxDocs=42596)
                0.011914805 = queryNorm
              0.58244795 = fieldWeight in 3920, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.212778 = idf(docFreq=231, maxDocs=42596)
                0.09375 = fieldNorm(doc=3920)
          0.03923327 = weight(abstract_txt:method in 3920) [ClassicSimilarity], result of:
            0.03923327 = score(doc=3920,freq=2.0), product of:
              0.06541271 = queryWeight, product of:
                1.2135818 = boost
                4.5238285 = idf(docFreq=1255, maxDocs=42596)
                0.011914805 = queryNorm
              0.59978056 = fieldWeight in 3920, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5238285 = idf(docFreq=1255, maxDocs=42596)
                0.09375 = fieldNorm(doc=3920)
          0.015838668 = weight(abstract_txt:based in 3920) [ClassicSimilarity], result of:
            0.015838668 = score(doc=3920,freq=1.0), product of:
              0.052652806 = queryWeight, product of:
                1.3772372 = boost
                3.2086759 = idf(docFreq=4678, maxDocs=42596)
                0.011914805 = queryNorm
              0.30081338 = fieldWeight in 3920, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2086759 = idf(docFreq=4678, maxDocs=42596)
                0.09375 = fieldNorm(doc=3920)
          0.22355995 = weight(abstract_txt:mwes in 3920) [ClassicSimilarity], result of:
            0.22355995 = score(doc=3920,freq=2.0), product of:
              0.17601061 = queryWeight, product of:
                1.5419953 = boost
                9.580074 = idf(docFreq=7, maxDocs=42596)
                0.011914805 = queryNorm
              1.2701504 = fieldWeight in 3920, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.580074 = idf(docFreq=7, maxDocs=42596)
                0.09375 = fieldNorm(doc=3920)
          0.1725504 = weight(abstract_txt:expressions in 3920) [ClassicSimilarity], result of:
            0.1725504 = score(doc=3920,freq=1.0), product of:
              0.26911458 = queryWeight, product of:
                3.3025002 = boost
                6.839234 = idf(docFreq=123, maxDocs=42596)
                0.011914805 = queryNorm
              0.6411782 = fieldWeight in 3920, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.839234 = idf(docFreq=123, maxDocs=42596)
                0.09375 = fieldNorm(doc=3920)
          0.4598298 = weight(abstract_txt:multiword in 3920) [ClassicSimilarity], result of:
            0.4598298 = score(doc=3920,freq=1.0), product of:
              0.569339 = queryWeight, product of:
                5.5466285 = boost
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.011914805 = queryNorm
              0.8076556 = fieldWeight in 3920, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.09375 = fieldNorm(doc=3920)
        0.24 = coord(6/25)
    
  5. Nissim, M.; Zaninello, A.: Modeling the internal variability of multiword expressions through a pattern-based method (2013) 0.21
    0.21094625 = sum of:
      0.21094625 = product of:
        0.75337946 = sum of:
          0.04064935 = weight(abstract_txt:extraction in 1991) [ClassicSimilarity], result of:
            0.04064935 = score(doc=1991,freq=2.0), product of:
              0.07402404 = queryWeight, product of:
                6.212778 = idf(docFreq=231, maxDocs=42596)
                0.011914805 = queryNorm
              0.5491372 = fieldWeight in 1991, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.212778 = idf(docFreq=231, maxDocs=42596)
                0.0625 = fieldNorm(doc=1991)
          0.026155513 = weight(abstract_txt:method in 1991) [ClassicSimilarity], result of:
            0.026155513 = score(doc=1991,freq=2.0), product of:
              0.06541271 = queryWeight, product of:
                1.2135818 = boost
                4.5238285 = idf(docFreq=1255, maxDocs=42596)
                0.011914805 = queryNorm
              0.3998537 = fieldWeight in 1991, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5238285 = idf(docFreq=1255, maxDocs=42596)
                0.0625 = fieldNorm(doc=1991)
          0.014932838 = weight(abstract_txt:based in 1991) [ClassicSimilarity], result of:
            0.014932838 = score(doc=1991,freq=2.0), product of:
              0.052652806 = queryWeight, product of:
                1.3772372 = boost
                3.2086759 = idf(docFreq=4678, maxDocs=42596)
                0.011914805 = queryNorm
              0.28360954 = fieldWeight in 1991, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.2086759 = idf(docFreq=4678, maxDocs=42596)
                0.0625 = fieldNorm(doc=1991)
          0.21077433 = weight(abstract_txt:mwes in 1991) [ClassicSimilarity], result of:
            0.21077433 = score(doc=1991,freq=4.0), product of:
              0.17601061 = queryWeight, product of:
                1.5419953 = boost
                9.580074 = idf(docFreq=7, maxDocs=42596)
                0.011914805 = queryNorm
              1.1975093 = fieldWeight in 1991, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.580074 = idf(docFreq=7, maxDocs=42596)
                0.0625 = fieldNorm(doc=1991)
          0.039280646 = weight(abstract_txt:automatic in 1991) [ClassicSimilarity], result of:
            0.039280646 = score(doc=1991,freq=1.0), product of:
              0.12090892 = queryWeight, product of:
                1.9522309 = boost
                5.1980476 = idf(docFreq=639, maxDocs=42596)
                0.011914805 = queryNorm
              0.32487798 = fieldWeight in 1991, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1980476 = idf(docFreq=639, maxDocs=42596)
                0.0625 = fieldNorm(doc=1991)
          0.1150336 = weight(abstract_txt:expressions in 1991) [ClassicSimilarity], result of:
            0.1150336 = score(doc=1991,freq=1.0), product of:
              0.26911458 = queryWeight, product of:
                3.3025002 = boost
                6.839234 = idf(docFreq=123, maxDocs=42596)
                0.011914805 = queryNorm
              0.42745212 = fieldWeight in 1991, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.839234 = idf(docFreq=123, maxDocs=42596)
                0.0625 = fieldNorm(doc=1991)
          0.3065532 = weight(abstract_txt:multiword in 1991) [ClassicSimilarity], result of:
            0.3065532 = score(doc=1991,freq=1.0), product of:
              0.569339 = queryWeight, product of:
                5.5466285 = boost
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.011914805 = queryNorm
              0.53843707 = fieldWeight in 1991, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.614993 = idf(docFreq=20, maxDocs=42596)
                0.0625 = fieldNorm(doc=1991)
        0.28 = coord(7/25)
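
The score breakdowns in the list above are labeled [ClassicSimilarity], and their arithmetic can be reproduced by hand: each term score is queryWeight (boost × idf × queryNorm) times fieldWeight (tf × idf × fieldNorm), with tf = sqrt(termFreq), and the document score is the sum of the term scores multiplied by the coord factor. The following is a minimal sketch in Python that recomputes the score of the first entry (doc 2650) from the numbers printed in its breakdown. The function names `term_score` and `document_score` are hypothetical and chosen only for this illustration; they are not part of any library, and a missing boost line in a breakdown is assumed to mean a boost of 1.0.

```python
import math

# Values copied from the breakdown of entry 1 (doc 2650) above.
QUERY_NORM = 0.011914805
FIELD_NORM = 0.0625

# (term, termFreq, idf, boost) as printed in the listing.
TERMS_DOC_2650 = [
    ("language",      4.0, 4.195006,  1.1253704),
    ("natural",       1.0, 5.107008,  1.370028),
    ("texts",         2.0, 5.6730638, 1.8262568),
    ("automatic",     1.0, 5.1980476, 1.9522309),
    ("constructions", 1.0, 8.568473,  2.2986155),
    ("machine",       1.0, 5.342351,  2.579692),
    ("expressions",   5.0, 6.839234,  3.3025002),
    ("multiword",     5.0, 8.614993,  5.5466285),
]


def term_score(freq, idf, boost, query_norm, field_norm):
    """Score of one term: queryWeight * fieldWeight (hypothetical helper)."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm  # tf = sqrt(termFreq)
    return query_weight * field_weight


def document_score(terms, matched, total, query_norm, field_norm):
    """Sum of the term scores scaled by coord = matched / total."""
    raw = sum(term_score(f, idf, b, query_norm, field_norm)
              for _, f, idf, b in terms)
    return raw * (matched / total)


if __name__ == "__main__":
    score = document_score(TERMS_DOC_2650, 8, 25, QUERY_NORM, FIELD_NORM)
    # Compare with the 0.40975374 reported for entry 1 in the listing.
    print(f"recomputed score: {score:.6f}")
```

Small differences in the last digits are expected, since the listing shows single-precision values while Python computes in double precision.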