Document (#39688)

Author
Brychcín, T.
Konopík, M.
Title
HPS: High precision stemmer
Source
Information processing and management. 51(2015) no.1, pp. 68-91
Year
2015
Abstract
Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. The stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities of solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above-mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (the words that are not present in the training set). Moreover, it was discovered that our approach demands very little text data for training when compared with competing unsupervised algorithms.
Content
Cf.: doi: 10.1016/j.ipm.2014.08.006.
Footnote
See also: http://liks.fav.zcu.cz/HPS/.
Theme
Computational linguistics

Similar documents (content)

  1. Savoy, J.: Searching strategies for the Hungarian language (2008) 0.33
    0.32654184 = sum of:
      0.32654184 = product of:
        1.1662209 = sum of:
          0.034161706 = weight(abstract_txt:when in 4038) [ClassicSimilarity], result of:
            0.034161706 = score(doc=4038,freq=3.0), product of:
              0.06053956 = queryWeight, product of:
                1.017458 = boost
                4.1701303 = idf(docFreq=1794, maxDocs=42740)
                0.01426833 = queryNorm
              0.5642873 = fieldWeight in 4038, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1701303 = idf(docFreq=1794, maxDocs=42740)
                0.078125 = fieldNorm(doc=4038)
          0.111524455 = weight(abstract_txt:hungarian in 4038) [ClassicSimilarity], result of:
            0.111524455 = score(doc=4038,freq=1.0), product of:
              0.1525071 = queryWeight, product of:
                1.1418968 = boost
                9.360306 = idf(docFreq=9, maxDocs=42740)
                0.01426833 = queryNorm
              0.7312739 = fieldWeight in 4038, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.360306 = idf(docFreq=9, maxDocs=42740)
                0.078125 = fieldNorm(doc=4038)
          0.032555263 = weight(abstract_txt:very in 4038) [ClassicSimilarity], result of:
            0.032555263 = score(doc=4038,freq=1.0), product of:
              0.084553994 = queryWeight, product of:
                1.2024413 = boost
                4.928299 = idf(docFreq=840, maxDocs=42740)
                0.01426833 = queryNorm
              0.38502336 = fieldWeight in 4038, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.928299 = idf(docFreq=840, maxDocs=42740)
                0.078125 = fieldNorm(doc=4038)
          0.05984155 = weight(abstract_txt:compared in 4038) [ClassicSimilarity], result of:
            0.05984155 = score(doc=4038,freq=3.0), product of:
              0.08797274 = queryWeight, product of:
                1.2265095 = boost
                5.026944 = idf(docFreq=761, maxDocs=42740)
                0.01426833 = queryNorm
              0.68022835 = fieldWeight in 4038, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.026944 = idf(docFreq=761, maxDocs=42740)
                0.078125 = fieldNorm(doc=4038)
          0.037931796 = weight(abstract_txt:approach in 4038) [ClassicSimilarity], result of:
            0.037931796 = score(doc=4038,freq=3.0), product of:
              0.074309714 = queryWeight, product of:
                1.3805918 = boost
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.01426833 = queryNorm
              0.5104554 = fieldWeight in 4038, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.078125 = fieldNorm(doc=4038)
          0.21026741 = weight(abstract_txt:stemmer in 4038) [ClassicSimilarity], result of:
            0.21026741 = score(doc=4038,freq=1.0), product of:
              0.2932477 = queryWeight, product of:
                2.2393098 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.01426833 = queryNorm
              0.71703005 = fieldWeight in 4038, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.078125 = fieldNorm(doc=4038)
          0.67993873 = weight(abstract_txt:stemming in 4038) [ClassicSimilarity], result of:
            0.67993873 = score(doc=4038,freq=3.0), product of:
              0.6750699 = queryWeight, product of:
                6.3563128 = boost
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.01426833 = queryNorm
              1.0072123 = fieldWeight in 4038, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.078125 = fieldNorm(doc=4038)
        0.28 = coord(7/25)
    
  2. Kraaij, W.; Pohlmann, R.: Evaluation of a Dutch stemming algorithm (1995) 0.26
    0.26418754 = sum of:
      0.26418754 = product of:
        1.3209376 = sum of:
          0.07490132 = weight(abstract_txt:stem in 5867) [ClassicSimilarity], result of:
            0.07490132 = score(doc=5867,freq=1.0), product of:
              0.11695971 = queryWeight, product of:
                8.197155 = idf(docFreq=31, maxDocs=42740)
                0.01426833 = queryNorm
              0.64040273 = fieldWeight in 5867, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.197155 = idf(docFreq=31, maxDocs=42740)
                0.078125 = fieldNorm(doc=5867)
          0.08876706 = weight(abstract_txt:words in 5867) [ClassicSimilarity], result of:
            0.08876706 = score(doc=5867,freq=2.0), product of:
              0.14993683 = queryWeight, product of:
                1.9610859 = boost
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.01426833 = queryNorm
              0.59202975 = fieldWeight in 5867, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.078125 = fieldNorm(doc=5867)
          0.297363 = weight(abstract_txt:stemmer in 5867) [ClassicSimilarity], result of:
            0.297363 = score(doc=5867,freq=2.0), product of:
              0.2932477 = queryWeight, product of:
                2.2393098 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.01426833 = queryNorm
              1.0140336 = fieldWeight in 5867, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.078125 = fieldNorm(doc=5867)
          0.17996745 = weight(abstract_txt:algorithm in 5867) [ClassicSimilarity], result of:
            0.17996745 = score(doc=5867,freq=2.0), product of:
              0.28476456 = queryWeight, product of:
                3.4890711 = boost
                5.7200913 = idf(docFreq=380, maxDocs=42740)
                0.01426833 = queryNorm
              0.6319868 = fieldWeight in 5867, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7200913 = idf(docFreq=380, maxDocs=42740)
                0.078125 = fieldNorm(doc=5867)
          0.67993873 = weight(abstract_txt:stemming in 5867) [ClassicSimilarity], result of:
            0.67993873 = score(doc=5867,freq=3.0), product of:
              0.6750699 = queryWeight, product of:
                6.3563128 = boost
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.01426833 = queryNorm
              1.0072123 = fieldWeight in 5867, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.078125 = fieldNorm(doc=5867)
        0.2 = coord(5/25)
    
  3. Fox, B.; Fox, C.J.: Efficient stemmer generation (2002) 0.23
    0.23338553 = sum of:
      0.23338553 = product of:
        1.4586596 = sum of:
          0.07267188 = weight(abstract_txt:algorithms in 3586) [ClassicSimilarity], result of:
            0.07267188 = score(doc=3586,freq=1.0), product of:
              0.115401715 = queryWeight, product of:
                1.4047627 = boost
                5.757529 = idf(docFreq=366, maxDocs=42740)
                0.01426833 = queryNorm
              0.6297297 = fieldWeight in 3586, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.757529 = idf(docFreq=366, maxDocs=42740)
                0.109375 = fieldNorm(doc=3586)
          0.6582411 = weight(abstract_txt:stemmer in 3586) [ClassicSimilarity], result of:
            0.6582411 = score(doc=3586,freq=5.0), product of:
              0.2932477 = queryWeight, product of:
                2.2393098 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.01426833 = queryNorm
              2.244659 = fieldWeight in 3586, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.109375 = fieldNorm(doc=3586)
          0.17815867 = weight(abstract_txt:algorithm in 3586) [ClassicSimilarity], result of:
            0.17815867 = score(doc=3586,freq=1.0), product of:
              0.28476456 = queryWeight, product of:
                3.4890711 = boost
                5.7200913 = idf(docFreq=380, maxDocs=42740)
                0.01426833 = queryNorm
              0.62563497 = fieldWeight in 3586, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7200913 = idf(docFreq=380, maxDocs=42740)
                0.109375 = fieldNorm(doc=3586)
          0.5495879 = weight(abstract_txt:stemming in 3586) [ClassicSimilarity], result of:
            0.5495879 = score(doc=3586,freq=1.0), product of:
              0.6750699 = queryWeight, product of:
                6.3563128 = boost
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.01426833 = queryNorm
              0.81412 = fieldWeight in 3586, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.109375 = fieldNorm(doc=3586)
        0.16 = coord(4/25)
    
  4. Kettunen, K.; Kunttu, T.; Järvelin, K.: To stem or lemmatize a highly inflectional language in a probabilistic IR environment? (2005) 0.23
    0.22923565 = sum of:
      0.22923565 = product of:
        0.81869876 = sum of:
          0.11723911 = weight(abstract_txt:stem in 396) [ClassicSimilarity], result of:
            0.11723911 = score(doc=396,freq=5.0), product of:
              0.11695971 = queryWeight, product of:
                8.197155 = idf(docFreq=31, maxDocs=42740)
                0.01426833 = queryNorm
              1.0023888 = fieldWeight in 396, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.197155 = idf(docFreq=31, maxDocs=42740)
                0.0546875 = fieldNorm(doc=396)
          0.022788683 = weight(abstract_txt:very in 396) [ClassicSimilarity], result of:
            0.022788683 = score(doc=396,freq=1.0), product of:
              0.084553994 = queryWeight, product of:
                1.2024413 = boost
                4.928299 = idf(docFreq=840, maxDocs=42740)
                0.01426833 = queryNorm
              0.26951635 = fieldWeight in 396, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.928299 = idf(docFreq=840, maxDocs=42740)
                0.0546875 = fieldNorm(doc=396)
          0.024184674 = weight(abstract_txt:compared in 396) [ClassicSimilarity], result of:
            0.024184674 = score(doc=396,freq=1.0), product of:
              0.08797274 = queryWeight, product of:
                1.2265095 = boost
                5.026944 = idf(docFreq=761, maxDocs=42740)
                0.01426833 = queryNorm
              0.27491102 = fieldWeight in 396, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.026944 = idf(docFreq=761, maxDocs=42740)
                0.0546875 = fieldNorm(doc=396)
          0.015329952 = weight(abstract_txt:approach in 396) [ClassicSimilarity], result of:
            0.015329952 = score(doc=396,freq=1.0), product of:
              0.074309714 = queryWeight, product of:
                1.3805918 = boost
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.01426833 = queryNorm
              0.2062981 = fieldWeight in 396, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.0546875 = fieldNorm(doc=396)
          0.04238497 = weight(abstract_txt:tests in 396) [ClassicSimilarity], result of:
            0.04238497 = score(doc=396,freq=1.0), product of:
              0.12787801 = queryWeight, product of:
                1.4787501 = boost
                6.060772 = idf(docFreq=270, maxDocs=42740)
                0.01426833 = queryNorm
              0.33144847 = fieldWeight in 396, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.060772 = idf(docFreq=270, maxDocs=42740)
                0.0546875 = fieldNorm(doc=396)
          0.20815411 = weight(abstract_txt:stemmer in 396) [ClassicSimilarity], result of:
            0.20815411 = score(doc=396,freq=2.0), product of:
              0.2932477 = queryWeight, product of:
                2.2393098 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.01426833 = queryNorm
              0.7098235 = fieldWeight in 396, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.0546875 = fieldNorm(doc=396)
          0.3886173 = weight(abstract_txt:stemming in 396) [ClassicSimilarity], result of:
            0.3886173 = score(doc=396,freq=2.0), product of:
              0.6750699 = queryWeight, product of:
                6.3563128 = boost
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.01426833 = queryNorm
              0.5756697 = fieldWeight in 396, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.0546875 = fieldNorm(doc=396)
        0.28 = coord(7/25)
    
  5. Martins, A.L.; Souza, R.R.; Ribeiro de Mello, H.: The use of noun phrases in information retrieval : proposing a mechanism for automatic classification (2014) 0.23
    0.22654316 = sum of:
      0.22654316 = product of:
        0.6292865 = sum of:
          0.024184674 = weight(abstract_txt:compared in 3442) [ClassicSimilarity], result of:
            0.024184674 = score(doc=3442,freq=1.0), product of:
              0.08797274 = queryWeight, product of:
                1.2265095 = boost
                5.026944 = idf(docFreq=761, maxDocs=42740)
                0.01426833 = queryNorm
              0.27491102 = fieldWeight in 3442, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.026944 = idf(docFreq=761, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
          0.024375835 = weight(abstract_txt:second in 3442) [ClassicSimilarity], result of:
            0.024375835 = score(doc=3442,freq=1.0), product of:
              0.0884357 = queryWeight, product of:
                1.2297325 = boost
                5.040154 = idf(docFreq=751, maxDocs=42740)
                0.01426833 = queryNorm
              0.27563342 = fieldWeight in 3442, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.040154 = idf(docFreq=751, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
          0.02691272 = weight(abstract_txt:tasks in 3442) [ClassicSimilarity], result of:
            0.02691272 = score(doc=3442,freq=1.0), product of:
              0.09446981 = queryWeight, product of:
                1.2709936 = boost
                5.2092657 = idf(docFreq=634, maxDocs=42740)
                0.01426833 = queryNorm
              0.2848817 = fieldWeight in 3442, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2092657 = idf(docFreq=634, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
          0.015329952 = weight(abstract_txt:approach in 3442) [ClassicSimilarity], result of:
            0.015329952 = score(doc=3442,freq=1.0), product of:
              0.074309714 = queryWeight, product of:
                1.3805918 = boost
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.01426833 = queryNorm
              0.2062981 = fieldWeight in 3442, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
          0.0599414 = weight(abstract_txt:tests in 3442) [ClassicSimilarity], result of:
            0.0599414 = score(doc=3442,freq=2.0), product of:
              0.12787801 = queryWeight, product of:
                1.4787501 = boost
                6.060772 = idf(docFreq=270, maxDocs=42740)
                0.01426833 = queryNorm
              0.4687389 = fieldWeight in 3442, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.060772 = idf(docFreq=270, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
          0.038566716 = weight(abstract_txt:training in 3442) [ClassicSimilarity], result of:
            0.038566716 = score(doc=3442,freq=1.0), product of:
              0.13745487 = queryWeight, product of:
                1.8776842 = boost
                5.130556 = idf(docFreq=686, maxDocs=42740)
                0.01426833 = queryNorm
              0.2805773 = fieldWeight in 3442, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.130556 = idf(docFreq=686, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
          0.0761019 = weight(abstract_txt:words in 3442) [ClassicSimilarity], result of:
            0.0761019 = score(doc=3442,freq=3.0), product of:
              0.14993683 = queryWeight, product of:
                1.9610859 = boost
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.01426833 = queryNorm
              0.5075598 = fieldWeight in 3442, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.358442 = idf(docFreq=546, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
          0.089079335 = weight(abstract_txt:algorithm in 3442) [ClassicSimilarity], result of:
            0.089079335 = score(doc=3442,freq=1.0), product of:
              0.28476456 = queryWeight, product of:
                3.4890711 = boost
                5.7200913 = idf(docFreq=380, maxDocs=42740)
                0.01426833 = queryNorm
              0.31281748 = fieldWeight in 3442, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7200913 = idf(docFreq=380, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
          0.27479395 = weight(abstract_txt:stemming in 3442) [ClassicSimilarity], result of:
            0.27479395 = score(doc=3442,freq=1.0), product of:
              0.6750699 = queryWeight, product of:
                6.3563128 = boost
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.01426833 = queryNorm
              0.40706 = fieldWeight in 3442, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.0546875 = fieldNorm(doc=3442)
        0.36 = coord(9/25)
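
Note on reading the score trees: each similarity value above can be recomputed from the leaf numbers in its tree. The sketch below does this for entry 3 (doc=3586). It assumes the standard Lucene ClassicSimilarity formulas, which agree with the values shown: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm. The function and variable names are illustrative only and are not part of the catalog or of Lucene.

```python
import math

# Constants copied from the explain output above.
MAX_DOCS = 42740
QUERY_NORM = 0.01426833


def idf(doc_freq, max_docs=MAX_DOCS):
    """ClassicSimilarity inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq, field_norm):
    """Score contribution of one matching query term in one document."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, matched, total):
    """Sum of the term scores, scaled by the coord factor matched/total."""
    return sum(term_score(*t) for t in terms) * (matched / total)


# (boost, docFreq, freq, fieldNorm) for each term matched in doc 3586.
FOX_TERMS = [
    (1.4047627, 366, 1.0, 0.109375),  # algorithms
    (2.2393098, 11, 5.0, 0.109375),   # stemmer
    (3.4890711, 380, 1.0, 0.109375),  # algorithm
    (6.3563128, 67, 1.0, 0.109375),   # stemming
]

score = document_score(FOX_TERMS, matched=4, total=25)
print(round(score, 4))  # prints 0.2334, in line with the listed 0.23338553
```

Terms listed without a boost line (for example "stem" in entries 2 and 4) can be treated as having boost 1.0 in this sketch. Small differences in the last digits are expected, because Lucene computes these values in single precision.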