Document (#39911)

Author
Snajder, J.
Dalbelo Basic, B.D.
Tadic, M.
Title
Automatic acquisition of inflectional lexica for morphological normalisation
Source
Information processing and management. 44(2008) no.5, pp.1720-1731
Year
2008
Abstract
Due to natural language morphology, words can take on various morphological forms. Morphological normalisation - often used in information retrieval and text mining systems - conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
Theme
Computational linguistics
Automatic indexing

Similar documents (content)

  1. Malenica, M.; Smuc, T.; Snajder, J.; Basic, B.D.: Language morphology offset : text classification on a Croatian-English parallel corpus (2008) 0.36
    0.36059126 = sum of:
      0.36059126 = product of:
        1.8029563 = sum of:
          0.0151439365 = weight(abstract_txt:language in 2035) [ClassicSimilarity], result of:
            0.0151439365 = score(doc=2035,freq=1.0), product of:
              0.046350632 = queryWeight, product of:
                1.2449285 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.008902626 = queryNorm
              0.32672557 = fieldWeight in 2035, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.078125 = fieldNorm(doc=2035)
          0.121377476 = weight(abstract_txt:croatian in 2035) [ClassicSimilarity], result of:
            0.121377476 = score(doc=2035,freq=2.0), product of:
              0.11694147 = queryWeight, product of:
                1.3982533 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.008902626 = queryNorm
              1.0379336 = fieldWeight in 2035, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.078125 = fieldNorm(doc=2035)
          0.057825454 = weight(abstract_txt:languages in 2035) [ClassicSimilarity], result of:
            0.057825454 = score(doc=2035,freq=4.0), product of:
              0.07133278 = queryWeight, product of:
                1.5444049 = boost
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.008902626 = queryNorm
              0.81064343 = fieldWeight in 2035, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.078125 = fieldNorm(doc=2035)
          0.76857215 = weight(abstract_txt:normalisation in 2035) [ClassicSimilarity], result of:
            0.76857215 = score(doc=2035,freq=3.0), product of:
              0.5978963 = queryWeight, product of:
                7.069676 = boost
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.008902626 = queryNorm
              1.2854607 = fieldWeight in 2035, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.078125 = fieldNorm(doc=2035)
          0.8400373 = weight(abstract_txt:morphological in 2035) [ClassicSimilarity], result of:
            0.8400373 = score(doc=2035,freq=5.0), product of:
              0.5985882 = queryWeight, product of:
                8.369793 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.008902626 = queryNorm
              1.4033642 = fieldWeight in 2035, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=2035)
        0.2 = coord(5/25)
    
  2. Kettunen, K.; Kunttu, T.; Järvelin, K.: To stem or lemmatize a highly inflectional language in a probabilistic IR environment? (2005) 0.27
    0.2650821 = sum of:
      0.2650821 = product of:
        0.82838154 = sum of:
          0.007769956 = weight(abstract_txt:used in 4395) [ClassicSimilarity], result of:
            0.007769956 = score(doc=4395,freq=2.0), product of:
              0.029906584 = queryWeight, product of:
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.008902626 = queryNorm
              0.25980753 = fieldWeight in 4395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.04234773 = weight(abstract_txt:stemming in 4395) [ClassicSimilarity], result of:
            0.04234773 = score(doc=4395,freq=2.0), product of:
              0.07351306 = queryWeight, product of:
                1.1086229 = boost
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.008902626 = queryNorm
              0.5760572 = fieldWeight in 4395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.021201512 = weight(abstract_txt:language in 4395) [ClassicSimilarity], result of:
            0.021201512 = score(doc=4395,freq=4.0), product of:
              0.046350632 = queryWeight, product of:
                1.2449285 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.008902626 = queryNorm
              0.45741582 = fieldWeight in 4395, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.058268674 = weight(abstract_txt:morphologically in 4395) [ClassicSimilarity], result of:
            0.058268674 = score(doc=4395,freq=1.0), product of:
              0.11458063 = queryWeight, product of:
                1.3840673 = boost
                9.298992 = idf(docFreq=10, maxDocs=44218)
                0.008902626 = queryNorm
              0.5085386 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.298992 = idf(docFreq=10, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.019176 = weight(abstract_txt:complex in 4395) [ClassicSimilarity], result of:
            0.019176 = score(doc=4395,freq=1.0), product of:
              0.06881289 = queryWeight, product of:
                1.516881 = boost
                5.095657 = idf(docFreq=735, maxDocs=44218)
                0.008902626 = queryNorm
              0.27866873 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.095657 = idf(docFreq=735, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.020238908 = weight(abstract_txt:languages in 4395) [ClassicSimilarity], result of:
            0.020238908 = score(doc=4395,freq=1.0), product of:
              0.07133278 = queryWeight, product of:
                1.5444049 = boost
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.008902626 = queryNorm
              0.2837252 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.015228499 = weight(abstract_txt:approach in 4395) [ClassicSimilarity], result of:
            0.015228499 = score(doc=4395,freq=1.0), product of:
              0.07434969 = queryWeight, product of:
                2.229827 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.008902626 = queryNorm
              0.20482263 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.64415026 = weight(abstract_txt:morphological in 4395) [ClassicSimilarity], result of:
            0.64415026 = score(doc=4395,freq=6.0), product of:
              0.5985882 = queryWeight, product of:
                8.369793 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.008902626 = queryNorm
              1.0761158 = fieldWeight in 4395, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
        0.32 = coord(8/25)
    
  3. Pirkola, A.: Morphological typology of languages for IR (2001) 0.25
    0.24656063 = sum of:
      0.24656063 = product of:
        1.027336 = sum of:
          0.00784884 = weight(abstract_txt:used in 4476) [ClassicSimilarity], result of:
            0.00784884 = score(doc=4476,freq=1.0), product of:
              0.029906584 = queryWeight, product of:
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.008902626 = queryNorm
              0.26244524 = fieldWeight in 4476, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.04277766 = weight(abstract_txt:stemming in 4476) [ClassicSimilarity], result of:
            0.04277766 = score(doc=4476,freq=1.0), product of:
              0.07351306 = queryWeight, product of:
                1.1086229 = boost
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.008902626 = queryNorm
              0.5819056 = fieldWeight in 4476, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.021416761 = weight(abstract_txt:language in 4476) [ClassicSimilarity], result of:
            0.021416761 = score(doc=4476,freq=2.0), product of:
              0.046350632 = queryWeight, product of:
                1.2449285 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.008902626 = queryNorm
              0.46205974 = fieldWeight in 4476, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.05007831 = weight(abstract_txt:languages in 4476) [ClassicSimilarity], result of:
            0.05007831 = score(doc=4476,freq=3.0), product of:
              0.07133278 = queryWeight, product of:
                1.5444049 = boost
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.008902626 = queryNorm
              0.7020378 = fieldWeight in 4476, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.15386221 = weight(abstract_txt:morphology in 4476) [ClassicSimilarity], result of:
            0.15386221 = score(doc=4476,freq=1.0), product of:
              0.21742915 = queryWeight, product of:
                2.696345 = boost
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.008902626 = queryNorm
              0.707643 = fieldWeight in 4476, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.7513522 = weight(abstract_txt:morphological in 4476) [ClassicSimilarity], result of:
            0.7513522 = score(doc=4476,freq=4.0), product of:
              0.5985882 = queryWeight, product of:
                8.369793 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.008902626 = queryNorm
              1.2552071 = fieldWeight in 4476, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
        0.24 = coord(6/25)
    
  4. Ekmekcioglu, F.C.; Lynch, M.F.; Willett, P.: Development and evaluation of conflation techniques for the implementation of a document retrieval system for Turkish text databases (1995) 0.19
    0.1927049 = sum of:
      0.1927049 = product of:
        0.96352446 = sum of:
          0.043050185 = weight(abstract_txt:corpora in 5797) [ClassicSimilarity], result of:
            0.043050185 = score(doc=5797,freq=1.0), product of:
              0.0653756 = queryWeight, product of:
                1.0454649 = boost
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.008902626 = queryNorm
              0.65850544 = fieldWeight in 5797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.0725961 = weight(abstract_txt:stemming in 5797) [ClassicSimilarity], result of:
            0.0725961 = score(doc=5797,freq=2.0), product of:
              0.07351306 = queryWeight, product of:
                1.1086229 = boost
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.008902626 = queryNorm
              0.9875266 = fieldWeight in 5797, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.025700115 = weight(abstract_txt:language in 5797) [ClassicSimilarity], result of:
            0.025700115 = score(doc=5797,freq=2.0), product of:
              0.046350632 = queryWeight, product of:
                1.2449285 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.008902626 = queryNorm
              0.55447173 = fieldWeight in 5797, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.18463464 = weight(abstract_txt:morphology in 5797) [ClassicSimilarity], result of:
            0.18463464 = score(doc=5797,freq=1.0), product of:
              0.21742915 = queryWeight, product of:
                2.696345 = boost
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.008902626 = queryNorm
              0.8491715 = fieldWeight in 5797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.63754344 = weight(abstract_txt:morphological in 5797) [ClassicSimilarity], result of:
            0.63754344 = score(doc=5797,freq=2.0), product of:
              0.5985882 = queryWeight, product of:
                8.369793 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.008902626 = queryNorm
              1.0650785 = fieldWeight in 5797, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
        0.2 = coord(5/25)
    
  5. Kettunen, K.: Reductive and generative approaches to management of morphological variation of keywords in monolingual information retrieval : an overview (2009) 0.16
    0.15857333 = sum of:
      0.15857333 = product of:
        0.6607222 = sum of:
          0.00784884 = weight(abstract_txt:used in 2835) [ClassicSimilarity], result of:
            0.00784884 = score(doc=2835,freq=1.0), product of:
              0.029906584 = queryWeight, product of:
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.008902626 = queryNorm
              0.26244524 = fieldWeight in 2835, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.078125 = fieldNorm(doc=2835)
          0.043799493 = weight(abstract_txt:variants in 2835) [ClassicSimilarity], result of:
            0.043799493 = score(doc=2835,freq=1.0), product of:
              0.07467912 = queryWeight, product of:
                1.1173807 = boost
                7.5072327 = idf(docFreq=65, maxDocs=44218)
                0.008902626 = queryNorm
              0.58650255 = fieldWeight in 2835, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5072327 = idf(docFreq=65, maxDocs=44218)
                0.078125 = fieldNorm(doc=2835)
          0.0151439365 = weight(abstract_txt:language in 2835) [ClassicSimilarity], result of:
            0.0151439365 = score(doc=2835,freq=1.0), product of:
              0.046350632 = queryWeight, product of:
                1.2449285 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.008902626 = queryNorm
              0.32672557 = fieldWeight in 2835, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.078125 = fieldNorm(doc=2835)
          0.040888768 = weight(abstract_txt:languages in 2835) [ClassicSimilarity], result of:
            0.040888768 = score(doc=2835,freq=2.0), product of:
              0.07133278 = queryWeight, product of:
                1.5444049 = boost
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.008902626 = queryNorm
              0.57321143 = fieldWeight in 2835, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.078125 = fieldNorm(doc=2835)
          0.021754995 = weight(abstract_txt:approach in 2835) [ClassicSimilarity], result of:
            0.021754995 = score(doc=2835,freq=1.0), product of:
              0.07434969 = queryWeight, product of:
                2.229827 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.008902626 = queryNorm
              0.29260373 = fieldWeight in 2835, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.078125 = fieldNorm(doc=2835)
          0.5312862 = weight(abstract_txt:morphological in 2835) [ClassicSimilarity], result of:
            0.5312862 = score(doc=2835,freq=2.0), product of:
              0.5985882 = queryWeight, product of:
                8.369793 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.008902626 = queryNorm
              0.8875654 = fieldWeight in 2835, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=2835)
        0.24 = coord(6/25)
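
The score trees above can be read as a TF-IDF computation in the style of Lucene's ClassicSimilarity: each term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is scaled by the coord factor (matched terms / query terms). The following is a minimal sketch that recomputes the score of the first entry (doc=2035) from the values printed in its tree. The function names and the tuple layout are illustrative assumptions, not part of the catalog or of Lucene's API; queryNorm and fieldNorm are copied from the listing rather than derived, and terms shown without a boost line are assumed to have boost 1.0.

```python
import math

def idf(doc_freq, max_docs):
    # Matches the idf(docFreq=..., maxDocs=...) lines: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

def document_score(terms, matched, query_terms, max_docs, query_norm, field_norm):
    # Sum the per-term scores, then apply coord(matched/query_terms).
    total = sum(
        term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost)
        for freq, doc_freq, boost in terms
    )
    return total * matched / query_terms

# Values copied from the listing above.
MAX_DOCS = 44218
QUERY_NORM = 0.008902626

# (termFreq, docFreq, boost) for the five matched terms of doc=2035
DOC_2035_TERMS = [
    (1.0, 1834, 1.2449285),  # language
    (2.0, 9, 1.3982533),     # croatian
    (4.0, 670, 1.5444049),   # languages
    (3.0, 8, 7.069676),      # normalisation
    (5.0, 38, 8.369793),     # morphological
]

score_2035 = document_score(
    DOC_2035_TERMS, matched=5, query_terms=25,
    max_docs=MAX_DOCS, query_norm=QUERY_NORM, field_norm=0.078125,
)
# The listing reports 0.36059126 for this entry; small differences are
# expected because Lucene computes in 32-bit floats.
print(score_2035)
```

The same functions apply to the other entries by substituting their termFreq, docFreq, boost, fieldNorm, and coord values.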