Document (#39912)

Author
Snajder, J.
Dalbelo Basic, B.D.
Tadic, M.
Title
Automatic acquisition of inflectional lexica for morphological normalisation
Source
Information processing and management. 44(2008) no.5, p.1720-1731
Year
2008
Abstract
Due to natural language morphology, words can take on various morphological forms. Morphological normalisation - often used in information retrieval and text mining systems - conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of automatically acquiring an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
Theme
Computational linguistics
Automatic indexing

Similar documents (content)

  1. Malenica, M.; Smuc, T.; Snajder, J.; Basic, B.D.: Language morphology offset : text classification on a Croatian-English parallel corpus (2008) 0.36
    0.35909975 = sum of:
      0.35909975 = product of:
        1.7954987 = sum of:
          0.01529965 = weight(abstract_txt:language in 4036) [ClassicSimilarity], result of:
            0.01529965 = score(doc=4036,freq=1.0), product of:
              0.04671467 = queryWeight, product of:
                1.2447486 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.008952277 = queryNorm
              0.32751274 = fieldWeight in 4036, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.078125 = fieldNorm(doc=4036)
          0.12088852 = weight(abstract_txt:croatian in 4036) [ClassicSimilarity], result of:
            0.12088852 = score(doc=4036,freq=2.0), product of:
              0.116744295 = queryWeight, product of:
                1.3914187 = boost
                9.37226 = idf(docFreq=9, maxDocs=43254)
                0.008952277 = queryNorm
              1.0354983 = fieldWeight in 4036, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.37226 = idf(docFreq=9, maxDocs=43254)
                0.078125 = fieldNorm(doc=4036)
          0.058121286 = weight(abstract_txt:languages in 4036) [ClassicSimilarity], result of:
            0.058121286 = score(doc=4036,freq=4.0), product of:
              0.07164773 = queryWeight, product of:
                1.5415452 = boost
                5.1917377 = idf(docFreq=653, maxDocs=43254)
                0.008952277 = queryNorm
              0.811209 = fieldWeight in 4036, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.1917377 = idf(docFreq=653, maxDocs=43254)
                0.078125 = fieldNorm(doc=4036)
          0.76553583 = weight(abstract_txt:normalisation in 4036) [ClassicSimilarity], result of:
            0.76553583 = score(doc=4036,freq=3.0), product of:
              0.59691924 = queryWeight, product of:
                7.0353026 = boost
                9.47762 = idf(docFreq=8, maxDocs=43254)
                0.008952277 = queryNorm
              1.2824781 = fieldWeight in 4036, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.47762 = idf(docFreq=8, maxDocs=43254)
                0.078125 = fieldNorm(doc=4036)
          0.8356534 = weight(abstract_txt:morphological in 4036) [ClassicSimilarity], result of:
            0.8356534 = score(doc=4036,freq=5.0), product of:
              0.5971028 = queryWeight, product of:
                8.3255625 = boost
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.008952277 = queryNorm
              1.3995135 = fieldWeight in 4036, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.078125 = fieldNorm(doc=4036)
        0.2 = coord(5/25)
    
  2. Kettunen, K.; Kunttu, T.; Järvelin, K.: To stem or lemmatize a highly inflectional language in a probabilistic IR environment? (2005) 0.26
    0.26427898 = sum of:
      0.26427898 = product of:
        0.8258718 = sum of:
          0.007853253 = weight(abstract_txt:used in 396) [ClassicSimilarity], result of:
            0.007853253 = score(doc=396,freq=2.0), product of:
              0.030150186 = queryWeight, product of:
                3.3678792 = idf(docFreq=4051, maxDocs=43254)
                0.008952277 = queryNorm
              0.2604711 = fieldWeight in 396, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.3678792 = idf(docFreq=4051, maxDocs=43254)
                0.0546875 = fieldNorm(doc=396)
          0.042594347 = weight(abstract_txt:stemming in 396) [ClassicSimilarity], result of:
            0.042594347 = score(doc=396,freq=2.0), product of:
              0.0738723 = queryWeight, product of:
                1.1068298 = boost
                7.4553375 = idf(docFreq=67, maxDocs=43254)
                0.008952277 = queryNorm
              0.5765943 = fieldWeight in 396, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.4553375 = idf(docFreq=67, maxDocs=43254)
                0.0546875 = fieldNorm(doc=396)
          0.021419508 = weight(abstract_txt:language in 396) [ClassicSimilarity], result of:
            0.021419508 = score(doc=396,freq=4.0), product of:
              0.04671467 = queryWeight, product of:
                1.2447486 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.008952277 = queryNorm
              0.45851782 = fieldWeight in 396, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.0546875 = fieldNorm(doc=396)
          0.058029752 = weight(abstract_txt:morphologically in 396) [ClassicSimilarity], result of:
            0.058029752 = score(doc=396,freq=1.0), product of:
              0.11438193 = queryWeight, product of:
                1.3772688 = boost
                9.27695 = idf(docFreq=10, maxDocs=43254)
                0.008952277 = queryNorm
              0.5073332 = fieldWeight in 396, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.27695 = idf(docFreq=10, maxDocs=43254)
                0.0546875 = fieldNorm(doc=396)
          0.019440165 = weight(abstract_txt:complex in 396) [ClassicSimilarity], result of:
            0.019440165 = score(doc=396,freq=1.0), product of:
              0.06951314 = queryWeight, product of:
                1.518408 = boost
                5.1138144 = idf(docFreq=706, maxDocs=43254)
                0.008952277 = queryNorm
              0.27966172 = fieldWeight in 396, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1138144 = idf(docFreq=706, maxDocs=43254)
                0.0546875 = fieldNorm(doc=396)
          0.02034245 = weight(abstract_txt:languages in 396) [ClassicSimilarity], result of:
            0.02034245 = score(doc=396,freq=1.0), product of:
              0.07164773 = queryWeight, product of:
                1.5415452 = boost
                5.1917377 = idf(docFreq=653, maxDocs=43254)
                0.008952277 = queryNorm
              0.28392315 = fieldWeight in 396, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1917377 = idf(docFreq=653, maxDocs=43254)
                0.0546875 = fieldNorm(doc=396)
          0.015403544 = weight(abstract_txt:approach in 396) [ClassicSimilarity], result of:
            0.015403544 = score(doc=396,freq=1.0), product of:
              0.07499357 = queryWeight, product of:
                2.2303963 = boost
                3.7558525 = idf(docFreq=2748, maxDocs=43254)
                0.008952277 = queryNorm
              0.20539819 = fieldWeight in 396, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7558525 = idf(docFreq=2748, maxDocs=43254)
                0.0546875 = fieldNorm(doc=396)
          0.6407888 = weight(abstract_txt:morphological in 396) [ClassicSimilarity], result of:
            0.6407888 = score(doc=396,freq=6.0), product of:
              0.5971028 = queryWeight, product of:
                8.3255625 = boost
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.008952277 = queryNorm
              1.0731633 = fieldWeight in 396, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.0546875 = fieldNorm(doc=396)
        0.32 = coord(8/25)
    
  3. Pirkola, A.: Morphological typology of languages for IR (2001) 0.25
    0.24565552 = sum of:
      0.24565552 = product of:
        1.0235647 = sum of:
          0.007932983 = weight(abstract_txt:used in 477) [ClassicSimilarity], result of:
            0.007932983 = score(doc=477,freq=1.0), product of:
              0.030150186 = queryWeight, product of:
                3.3678792 = idf(docFreq=4051, maxDocs=43254)
                0.008952277 = queryNorm
              0.26311556 = fieldWeight in 477, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3678792 = idf(docFreq=4051, maxDocs=43254)
                0.078125 = fieldNorm(doc=477)
          0.04302679 = weight(abstract_txt:stemming in 477) [ClassicSimilarity], result of:
            0.04302679 = score(doc=477,freq=1.0), product of:
              0.0738723 = queryWeight, product of:
                1.1068298 = boost
                7.4553375 = idf(docFreq=67, maxDocs=43254)
                0.008952277 = queryNorm
              0.58244824 = fieldWeight in 477, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4553375 = idf(docFreq=67, maxDocs=43254)
                0.078125 = fieldNorm(doc=477)
          0.021636972 = weight(abstract_txt:language in 477) [ClassicSimilarity], result of:
            0.021636972 = score(doc=477,freq=2.0), product of:
              0.04671467 = queryWeight, product of:
                1.2447486 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.008952277 = queryNorm
              0.46317294 = fieldWeight in 477, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.078125 = fieldNorm(doc=477)
          0.050334513 = weight(abstract_txt:languages in 477) [ClassicSimilarity], result of:
            0.050334513 = score(doc=477,freq=3.0), product of:
              0.07164773 = queryWeight, product of:
                1.5415452 = boost
                5.1917377 = idf(docFreq=653, maxDocs=43254)
                0.008952277 = queryNorm
              0.70252764 = fieldWeight in 477, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.1917377 = idf(docFreq=653, maxDocs=43254)
                0.078125 = fieldNorm(doc=477)
          0.15320222 = weight(abstract_txt:morphology in 477) [ClassicSimilarity], result of:
            0.15320222 = score(doc=477,freq=1.0), product of:
              0.21702461 = queryWeight, product of:
                2.682931 = boost
                9.035788 = idf(docFreq=13, maxDocs=43254)
                0.008952277 = queryNorm
              0.70592093 = fieldWeight in 477, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.035788 = idf(docFreq=13, maxDocs=43254)
                0.078125 = fieldNorm(doc=477)
          0.7474312 = weight(abstract_txt:morphological in 477) [ClassicSimilarity], result of:
            0.7474312 = score(doc=477,freq=4.0), product of:
              0.5971028 = queryWeight, product of:
                8.3255625 = boost
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.008952277 = queryNorm
              1.251763 = fieldWeight in 477, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.078125 = fieldNorm(doc=477)
        0.24 = coord(6/25)
    
  4. Ekmekcioglu, F.C.; Lynch, M.F.; Willett, P.: Development and evaluation of conflation techniques for the implementation of a document retrieval system for Turkish text databases (1995) 0.19
    0.19232811 = sum of:
      0.19232811 = product of:
        0.96164054 = sum of:
          0.044598263 = weight(abstract_txt:corpora in 6866) [ClassicSimilarity], result of:
            0.044598263 = score(doc=6866,freq=1.0), product of:
              0.06700082 = queryWeight, product of:
                1.0540957 = boost
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.008952277 = queryNorm
              0.66563755 = fieldWeight in 6866, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.09375 = fieldNorm(doc=6866)
          0.073018886 = weight(abstract_txt:stemming in 6866) [ClassicSimilarity], result of:
            0.073018886 = score(doc=6866,freq=2.0), product of:
              0.0738723 = queryWeight, product of:
                1.1068298 = boost
                7.4553375 = idf(docFreq=67, maxDocs=43254)
                0.008952277 = queryNorm
              0.9884474 = fieldWeight in 6866, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.4553375 = idf(docFreq=67, maxDocs=43254)
                0.09375 = fieldNorm(doc=6866)
          0.025964366 = weight(abstract_txt:language in 6866) [ClassicSimilarity], result of:
            0.025964366 = score(doc=6866,freq=2.0), product of:
              0.04671467 = queryWeight, product of:
                1.2447486 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.008952277 = queryNorm
              0.55580753 = fieldWeight in 6866, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.09375 = fieldNorm(doc=6866)
          0.18384264 = weight(abstract_txt:morphology in 6866) [ClassicSimilarity], result of:
            0.18384264 = score(doc=6866,freq=1.0), product of:
              0.21702461 = queryWeight, product of:
                2.682931 = boost
                9.035788 = idf(docFreq=13, maxDocs=43254)
                0.008952277 = queryNorm
              0.8471051 = fieldWeight in 6866, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.035788 = idf(docFreq=13, maxDocs=43254)
                0.09375 = fieldNorm(doc=6866)
          0.63421637 = weight(abstract_txt:morphological in 6866) [ClassicSimilarity], result of:
            0.63421637 = score(doc=6866,freq=2.0), product of:
              0.5971028 = queryWeight, product of:
                8.3255625 = boost
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.008952277 = queryNorm
              1.0621561 = fieldWeight in 6866, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.09375 = fieldNorm(doc=6866)
        0.2 = coord(5/25)
    
  5. Kettunen, K.: Reductive and generative approaches to management of morphological variation of keywords in monolingual information retrieval : an overview (2009) 0.16
    0.1580789 = sum of:
      0.1580789 = product of:
        0.6586621 = sum of:
          0.007932983 = weight(abstract_txt:used in 4836) [ClassicSimilarity], result of:
            0.007932983 = score(doc=4836,freq=1.0), product of:
              0.030150186 = queryWeight, product of:
                3.3678792 = idf(docFreq=4051, maxDocs=43254)
                0.008952277 = queryNorm
              0.26311556 = fieldWeight in 4836, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3678792 = idf(docFreq=4051, maxDocs=43254)
                0.078125 = fieldNorm(doc=4836)
          0.043812733 = weight(abstract_txt:variants in 4836) [ClassicSimilarity], result of:
            0.043812733 = score(doc=4836,freq=1.0), product of:
              0.07476917 = queryWeight, product of:
                1.1135284 = boost
                7.500458 = idf(docFreq=64, maxDocs=43254)
                0.008952277 = queryNorm
              0.58597326 = fieldWeight in 4836, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.500458 = idf(docFreq=64, maxDocs=43254)
                0.078125 = fieldNorm(doc=4836)
          0.01529965 = weight(abstract_txt:language in 4836) [ClassicSimilarity], result of:
            0.01529965 = score(doc=4836,freq=1.0), product of:
              0.04671467 = queryWeight, product of:
                1.2447486 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.008952277 = queryNorm
              0.32751274 = fieldWeight in 4836, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.078125 = fieldNorm(doc=4836)
          0.041097954 = weight(abstract_txt:languages in 4836) [ClassicSimilarity], result of:
            0.041097954 = score(doc=4836,freq=2.0), product of:
              0.07164773 = queryWeight, product of:
                1.5415452 = boost
                5.1917377 = idf(docFreq=653, maxDocs=43254)
                0.008952277 = queryNorm
              0.5736114 = fieldWeight in 4836, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.1917377 = idf(docFreq=653, maxDocs=43254)
                0.078125 = fieldNorm(doc=4836)
          0.022005063 = weight(abstract_txt:approach in 4836) [ClassicSimilarity], result of:
            0.022005063 = score(doc=4836,freq=1.0), product of:
              0.07499357 = queryWeight, product of:
                2.2303963 = boost
                3.7558525 = idf(docFreq=2748, maxDocs=43254)
                0.008952277 = queryNorm
              0.29342598 = fieldWeight in 4836, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7558525 = idf(docFreq=2748, maxDocs=43254)
                0.078125 = fieldNorm(doc=4836)
          0.52851367 = weight(abstract_txt:morphological in 4836) [ClassicSimilarity], result of:
            0.52851367 = score(doc=4836,freq=2.0), product of:
              0.5971028 = queryWeight, product of:
                8.3255625 = boost
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.008952277 = queryNorm
              0.8851301 = fieldWeight in 4836, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.078125 = fieldNorm(doc=4836)
        0.24 = coord(6/25)
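
Note on the score breakdowns: each tree above follows the same arithmetic. A term's score is queryWeight (boost x idf x queryNorm) times fieldWeight (tf x idf x fieldNorm), where tf is the square root of the term frequency; the term scores are summed and multiplied by the coord factor (matched terms / query terms). The following is a minimal sketch that recomputes the first entry (doc 4036) from the figures listed in its tree. The function and variable names are hypothetical and chosen only for illustration; this is not the catalogue's own code, and small differences in the last decimal places are to be expected because the listed values are rounded.

```python
import math

def term_score(freq, idf, field_norm, query_norm, boost=1.0):
    """Score of one matched term, following the layout of the trees above."""
    query_weight = boost * idf * query_norm            # "queryWeight, product of"
    field_weight = math.sqrt(freq) * idf * field_norm  # "fieldWeight ..., product of"
    return query_weight * field_weight

def document_score(term_scores, matched, total):
    """Sum of term scores scaled by the coord factor (matched / total)."""
    return sum(term_scores) * (matched / total)

# Values copied from the tree of entry 1 (doc 4036).
QUERY_NORM = 0.008952277
FIELD_NORM = 0.078125

# (termFreq, idf, boost) for the five matched abstract terms.
terms = [
    (1.0, 4.192163, 1.2447486),   # language
    (2.0, 9.37226, 1.3914187),    # croatian
    (4.0, 5.1917377, 1.5415452),  # languages
    (3.0, 9.47762, 7.0353026),    # normalisation
    (5.0, 8.011283, 8.3255625),   # morphological
]

scores = [term_score(f, idf, FIELD_NORM, QUERY_NORM, b) for f, idf, b in terms]
total = document_score(scores, matched=5, total=25)
print(round(total, 2))
```

Terms whose tree shows no boost line (for example "used" in entries 2, 3 and 5) correspond to the default boost of 1.0 in this sketch.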