Document (#34036)

Author
Malenica, M.
Smuc, T.
Snajder, J.
Basic, B.D.
Title
Language morphology offset : text classification on a Croatian-English parallel corpus
Source
Information processing and management. 44(2008) no.1, pp.325-339
Year
2008
Abstract
We investigate how, and to what extent, the morphological complexity of a language influences text classification using support vector machines (SVM). The Croatian-English parallel corpus provides the basis for a direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance based on a series of parallel experiments on both languages, carried out over a wide range of feature subset sizes obtained by different feature selection methods, and applying different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments have shown that the improvements in SVM classifier performance are statistically significant; they are greater for small and medium numbers of features, especially for Croatian, whereas for large numbers of features the improvements are rather small and may be negligible in practice for both languages.
Theme
Automatic classification (Automatisches Klassifizieren)

Similar documents (content)

  1. Snajder, J.; Dalbelo Basic, B.D.; Tadic, M.: Automatic acquisition of inflectional lexica for morphological normalisation (2008) 0.57
    0.5741457 = sum of:
      0.5741457 = product of:
        2.0505202 = sum of:
          0.13035008 = weight(abstract_txt:morphology in 2910) [ClassicSimilarity], result of:
            0.13035008 = score(doc=2910,freq=2.0), product of:
              0.13025132 = queryWeight, product of:
                1.0167993 = boost
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.014142387 = queryNorm
              1.0007583 = fieldWeight in 2910, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.05128509 = weight(abstract_txt:complexity in 2910) [ClassicSimilarity], result of:
            0.05128509 = score(doc=2910,freq=1.0), product of:
              0.11101679 = queryWeight, product of:
                1.327558 = boost
                5.913062 = idf(docFreq=324, maxDocs=44218)
                0.014142387 = queryNorm
              0.461958 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.913062 = idf(docFreq=324, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.036940802 = weight(abstract_txt:performance in 2910) [ClassicSimilarity], result of:
            0.036940802 = score(doc=2910,freq=1.0), product of:
              0.10211649 = queryWeight, product of:
                1.5593828 = boost
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.014142387 = queryNorm
              0.3617516 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.09797794 = weight(abstract_txt:languages in 2910) [ClassicSimilarity], result of:
            0.09797794 = score(doc=2910,freq=2.0), product of:
              0.1709281 = queryWeight, product of:
                2.3295977 = boost
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.014142387 = queryNorm
              0.57321143 = fieldWeight in 2910, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.30848828 = weight(abstract_txt:croatian in 2910) [ClassicSimilarity], result of:
            0.30848828 = score(doc=2910,freq=1.0), product of:
              0.42032394 = queryWeight, product of:
                3.1637115 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.014142387 = queryNorm
              0.7339299 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.6379691 = weight(abstract_txt:normalisation in 2910) [ClassicSimilarity], result of:
            0.6379691 = score(doc=2910,freq=4.0), product of:
              0.42980495 = queryWeight, product of:
                3.1991937 = boost
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.014142387 = queryNorm
              1.4843223 = fieldWeight in 2910, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.78750896 = weight(abstract_txt:morphological in 2910) [ClassicSimilarity], result of:
            0.78750896 = score(doc=2910,freq=6.0), product of:
              0.5122648 = queryWeight, product of:
                4.5089607 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.014142387 = queryNorm
              1.5373085 = fieldWeight in 2910, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
        0.28 = coord(7/25)
    
  2. Kettunen, K.; Kunttu, T.; Järvelin, K.: To stem or lemmatize a highly inflectional language in a probabilistic IR environment? (2005) 0.21
    0.21258521 = sum of:
      0.21258521 = product of:
        0.7592329 = sum of:
          0.02191457 = weight(abstract_txt:compared in 4395) [ClassicSimilarity], result of:
            0.02191457 = score(doc=4395,freq=1.0), product of:
              0.079888545 = queryWeight, product of:
                1.1261635 = boost
                5.0160327 = idf(docFreq=796, maxDocs=44218)
                0.014142387 = queryNorm
              0.27431428 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.0160327 = idf(docFreq=796, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.026349746 = weight(abstract_txt:small in 4395) [ClassicSimilarity], result of:
            0.026349746 = score(doc=4395,freq=1.0), product of:
              0.09033308 = queryWeight, product of:
                1.1975195 = boost
                5.333859 = idf(docFreq=579, maxDocs=44218)
                0.014142387 = queryNorm
              0.29169542 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.333859 = idf(docFreq=579, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.055122364 = weight(abstract_txt:statistically in 4395) [ClassicSimilarity], result of:
            0.055122364 = score(doc=4395,freq=1.0), product of:
              0.1477569 = queryWeight, product of:
                1.5315567 = boost
                6.82169 = idf(docFreq=130, maxDocs=44218)
                0.014142387 = queryNorm
              0.37306118 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.82169 = idf(docFreq=130, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.02585856 = weight(abstract_txt:performance in 4395) [ClassicSimilarity], result of:
            0.02585856 = score(doc=4395,freq=1.0), product of:
              0.10211649 = queryWeight, product of:
                1.5593828 = boost
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.014142387 = queryNorm
              0.2532261 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.030234804 = weight(abstract_txt:different in 4395) [ClassicSimilarity], result of:
            0.030234804 = score(doc=4395,freq=2.0), product of:
              0.10665242 = queryWeight, product of:
                2.0573802 = boost
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.014142387 = queryNorm
              0.28348917 = fieldWeight in 4395, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.04849661 = weight(abstract_txt:languages in 4395) [ClassicSimilarity], result of:
            0.04849661 = score(doc=4395,freq=1.0), product of:
              0.1709281 = queryWeight, product of:
                2.3295977 = boost
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.014142387 = queryNorm
              0.2837252 = fieldWeight in 4395, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
          0.55125624 = weight(abstract_txt:morphological in 4395) [ClassicSimilarity], result of:
            0.55125624 = score(doc=4395,freq=6.0), product of:
              0.5122648 = queryWeight, product of:
                4.5089607 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.014142387 = queryNorm
              1.0761158 = fieldWeight in 4395, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4395)
        0.28 = coord(7/25)
    
  3. Pirkola, A.: Morphological typology of languages for IR (2001) 0.19
    0.18992911 = sum of:
      0.18992911 = product of:
        0.9496455 = sum of:
          0.09217143 = weight(abstract_txt:morphology in 4476) [ClassicSimilarity], result of:
            0.09217143 = score(doc=4476,freq=1.0), product of:
              0.13025132 = queryWeight, product of:
                1.0167993 = boost
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.014142387 = queryNorm
              0.707643 = fieldWeight in 4476, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.05128509 = weight(abstract_txt:complexity in 4476) [ClassicSimilarity], result of:
            0.05128509 = score(doc=4476,freq=1.0), product of:
              0.11101679 = queryWeight, product of:
                1.327558 = boost
                5.913062 = idf(docFreq=324, maxDocs=44218)
                0.014142387 = queryNorm
              0.461958 = fieldWeight in 4476, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.913062 = idf(docFreq=324, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.04319258 = weight(abstract_txt:different in 4476) [ClassicSimilarity], result of:
            0.04319258 = score(doc=4476,freq=2.0), product of:
              0.10665242 = queryWeight, product of:
                2.0573802 = boost
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.014142387 = queryNorm
              0.40498453 = fieldWeight in 4476, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.11999799 = weight(abstract_txt:languages in 4476) [ClassicSimilarity], result of:
            0.11999799 = score(doc=4476,freq=3.0), product of:
              0.1709281 = queryWeight, product of:
                2.3295977 = boost
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.014142387 = queryNorm
              0.7020378 = fieldWeight in 4476, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
          0.6429984 = weight(abstract_txt:morphological in 4476) [ClassicSimilarity], result of:
            0.6429984 = score(doc=4476,freq=4.0), product of:
              0.5122648 = queryWeight, product of:
                4.5089607 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.014142387 = queryNorm
              1.2552071 = fieldWeight in 4476, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=4476)
        0.2 = coord(5/25)
    
  4. Yang, C.C.; Li, K.W.: Automatic construction of English/Chinese parallel corpora (2003) 0.15
    0.14794475 = sum of:
      0.14794475 = product of:
        0.41095763 = sum of:
          0.015343646 = weight(abstract_txt:large in 1683) [ClassicSimilarity], result of:
            0.015343646 = score(doc=1683,freq=1.0), product of:
              0.06299145 = queryWeight, product of:
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.014142387 = queryNorm
              0.243583 = fieldWeight in 1683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.026298763 = weight(abstract_txt:experiments in 1683) [ClassicSimilarity], result of:
            0.026298763 = score(doc=1683,freq=1.0), product of:
              0.090216525 = queryWeight, product of:
                1.1967467 = boost
                5.3304167 = idf(docFreq=581, maxDocs=44218)
                0.014142387 = queryNorm
              0.29150715 = fieldWeight in 1683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3304167 = idf(docFreq=581, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.073675066 = weight(abstract_txt:english in 1683) [ClassicSimilarity], result of:
            0.073675066 = score(doc=1683,freq=6.0), product of:
              0.0986641 = queryWeight, product of:
                1.2515228 = boost
                5.574394 = idf(docFreq=455, maxDocs=44218)
                0.014142387 = queryNorm
              0.7467262 = fieldWeight in 1683, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.574394 = idf(docFreq=455, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.014422839 = weight(abstract_txt:both in 1683) [ClassicSimilarity], result of:
            0.014422839 = score(doc=1683,freq=1.0), product of:
              0.06919268 = queryWeight, product of:
                1.2836154 = boost
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.014142387 = queryNorm
              0.20844458 = fieldWeight in 1683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.06821411 = weight(abstract_txt:corpus in 1683) [ClassicSimilarity], result of:
            0.06821411 = score(doc=1683,freq=3.0), product of:
              0.118087776 = queryWeight, product of:
                1.3691835 = boost
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.014142387 = queryNorm
              0.577656 = fieldWeight in 1683, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.02585856 = weight(abstract_txt:performance in 1683) [ClassicSimilarity], result of:
            0.02585856 = score(doc=1683,freq=1.0), product of:
              0.10211649 = queryWeight, product of:
                1.5593828 = boost
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.014142387 = queryNorm
              0.2532261 = fieldWeight in 1683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.021379236 = weight(abstract_txt:different in 1683) [ClassicSimilarity], result of:
            0.021379236 = score(doc=1683,freq=1.0), product of:
              0.10665242 = queryWeight, product of:
                2.0573802 = boost
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.014142387 = queryNorm
              0.20045713 = fieldWeight in 1683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.09718084 = weight(abstract_txt:parallel in 1683) [ClassicSimilarity], result of:
            0.09718084 = score(doc=1683,freq=2.0), product of:
              0.19591609 = queryWeight, product of:
                2.159931 = boost
                6.4136834 = idf(docFreq=196, maxDocs=44218)
                0.014142387 = queryNorm
              0.496033 = fieldWeight in 1683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.4136834 = idf(docFreq=196, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.06858457 = weight(abstract_txt:languages in 1683) [ClassicSimilarity], result of:
            0.06858457 = score(doc=1683,freq=2.0), product of:
              0.1709281 = queryWeight, product of:
                2.3295977 = boost
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.014142387 = queryNorm
              0.40124804 = fieldWeight in 1683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.188118 = idf(docFreq=670, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
        0.36 = coord(9/25)
    
  5. Mao, J.; Cui, H.: Identifying bacterial biotope entities using sequence labeling : performance and feature analysis (2018) 0.14
    0.13582057 = sum of:
      0.13582057 = product of:
        0.37727934 = sum of:
          0.017535595 = weight(abstract_txt:large in 4462) [ClassicSimilarity], result of:
            0.017535595 = score(doc=4462,freq=1.0), product of:
              0.06299145 = queryWeight, product of:
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.014142387 = queryNorm
              0.27838057 = fieldWeight in 4462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
          0.03711914 = weight(abstract_txt:features in 4462) [ClassicSimilarity], result of:
            0.03711914 = score(doc=4462,freq=4.0), product of:
              0.0654204 = queryWeight, product of:
                1.0190976 = boost
                4.5391517 = idf(docFreq=1283, maxDocs=44218)
                0.014142387 = queryNorm
              0.56739396 = fieldWeight in 4462, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.5391517 = idf(docFreq=1283, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
          0.035419293 = weight(abstract_txt:compared in 4462) [ClassicSimilarity], result of:
            0.035419293 = score(doc=4462,freq=2.0), product of:
              0.079888545 = queryWeight, product of:
                1.1261635 = boost
                5.0160327 = idf(docFreq=796, maxDocs=44218)
                0.014142387 = queryNorm
              0.44335884 = fieldWeight in 4462, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.0160327 = idf(docFreq=796, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
          0.03005573 = weight(abstract_txt:experiments in 4462) [ClassicSimilarity], result of:
            0.03005573 = score(doc=4462,freq=1.0), product of:
              0.090216525 = queryWeight, product of:
                1.1967467 = boost
                5.3304167 = idf(docFreq=581, maxDocs=44218)
                0.014142387 = queryNorm
              0.33315104 = fieldWeight in 4462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3304167 = idf(docFreq=581, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
          0.016483244 = weight(abstract_txt:both in 4462) [ClassicSimilarity], result of:
            0.016483244 = score(doc=4462,freq=1.0), product of:
              0.06919268 = queryWeight, product of:
                1.2836154 = boost
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.014142387 = queryNorm
              0.23822238 = fieldWeight in 4462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
          0.04500964 = weight(abstract_txt:corpus in 4462) [ClassicSimilarity], result of:
            0.04500964 = score(doc=4462,freq=1.0), product of:
              0.118087776 = queryWeight, product of:
                1.3691835 = boost
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.014142387 = queryNorm
              0.3811541 = fieldWeight in 4462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
          0.05910528 = weight(abstract_txt:performance in 4462) [ClassicSimilarity], result of:
            0.05910528 = score(doc=4462,freq=4.0), product of:
              0.10211649 = queryWeight, product of:
                1.5593828 = boost
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.014142387 = queryNorm
              0.5788025 = fieldWeight in 4462, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
          0.07539048 = weight(abstract_txt:classifier in 4462) [ClassicSimilarity], result of:
            0.07539048 = score(doc=4462,freq=1.0), product of:
              0.16655037 = queryWeight, product of:
                1.6260428 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.014142387 = queryNorm
              0.45265874 = fieldWeight in 4462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
          0.06116095 = weight(abstract_txt:feature in 4462) [ClassicSimilarity], result of:
            0.06116095 = score(doc=4462,freq=1.0), product of:
              0.1658369 = queryWeight, product of:
                1.9872175 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014142387 = queryNorm
              0.36880183 = fieldWeight in 4462, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=4462)
        0.36 = coord(9/25)
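
The score trees above can be checked by hand. The following is a minimal sketch, assuming the usual Lucene ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))), which agree with the figures printed in the listing. The function names and the `DOC_2910_TERMS` table are illustrative and not part of the catalog; every number in the table is copied from the explanation for the first similar document (doc 2910).

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the listing
QUERY_NORM = 0.014142387  # queryNorm, as printed in the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, assuming the ClassicSimilarity form."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, query_term_count):
    """Sum the term scores, then scale by coord = matched / total query terms."""
    total = sum(term_score(f, df, b, field_norm) for f, df, b in terms)
    coord = len(terms) / query_term_count
    return total * coord


# (termFreq, docFreq, boost) for the seven terms matched in doc 2910.
DOC_2910_TERMS = [
    (2.0, 13, 1.0167993),    # morphology
    (1.0, 324, 1.327558),    # complexity
    (1.0, 1171, 1.5593828),  # performance
    (2.0, 670, 2.3295977),   # languages
    (1.0, 9, 3.1637115),     # croatian
    (4.0, 8, 3.1991937),     # normalisation
    (6.0, 38, 4.5089607),    # morphological
]

if __name__ == "__main__":
    score = document_score(DOC_2910_TERMS, field_norm=0.078125, query_term_count=25)
    print(f"doc 2910: {score:.7f}")
```

Up to single-precision rounding in the original listing, the result should agree with the 0.5741457 shown for entry 1. The same function applies to the other entries once their term tables and fieldNorm values are substituted; terms listed without a boost line (such as "large" in doc 1683) would presumably take a boost of 1.0.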