Document (#34450)

Author
Sánchez-de-Madariaga, R.
Fernández-del-Castillo, J.R.
Title
The bootstrapping of the Yarowsky algorithm in real corpora
Source
Information processing and management. 45(2009) no.1, pp.55-69
Year
2009
Abstract
The Yarowsky bootstrapping algorithm resolves the homograph-level word sense disambiguation (WSD) problem, which is the sense granularity level required for real natural language processing (NLP) applications. At the same time, it resolves the knowledge acquisition bottleneck problem affecting most WSD algorithms and can be easily applied to foreign-language corpora. However, this paper shows that the Yarowsky algorithm is significantly less accurate when applied to real, domain-fluctuating corpora. This paper also introduces a new bootstrapping methodology that performs much better when applied to these corpora. Even so, it does not reach the accuracy achieved in non-domain-fluctuating corpora, owing to ambiguities inherent in domain fluctuation.
Theme
Retrieval algorithms

Similar documents (author)

  1. Castillo, M. Davey => Davey Castillo, M.: 0.99
    0.9904157 = sum of:
      0.9904157 = product of:
        2.971247 = sum of:
          2.971247 = weight(author_txt:castillo in 2447) [ClassicSimilarity], result of:
            2.971247 = score(doc=2447,freq=2.0), product of:
              0.6331673 = queryWeight, product of:
                1.0926499 = boost
                8.848589 = idf(docFreq=16, maxDocs=43556)
                0.06548826 = queryNorm
              4.6926727 = fieldWeight in 2447, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.848589 = idf(docFreq=16, maxDocs=43556)
                0.375 = fieldNorm(doc=2447)
        0.33333334 = coord(1/3)
    
  2. Castillo, J. Ruiz -> Ruiz-Castillo, J.: 0.99
    0.9904157 = sum of:
      0.9904157 = product of:
        2.971247 = sum of:
          2.971247 = weight(author_txt:castillo in 227) [ClassicSimilarity], result of:
            2.971247 = score(doc=227,freq=2.0), product of:
              0.6331673 = queryWeight, product of:
                1.0926499 = boost
                8.848589 = idf(docFreq=16, maxDocs=43556)
                0.06548826 = queryNorm
              4.6926727 = fieldWeight in 227, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.848589 = idf(docFreq=16, maxDocs=43556)
                0.375 = fieldNorm(doc=227)
        0.33333334 = coord(1/3)
    
  3. Castillo, J. Ruiz- => Ruiz-Castillo, J.: 0.99
    0.9904157 = sum of:
      0.9904157 = product of:
        2.971247 = sum of:
          2.971247 = weight(author_txt:castillo in 2885) [ClassicSimilarity], result of:
            2.971247 = score(doc=2885,freq=2.0), product of:
              0.6331673 = queryWeight, product of:
                1.0926499 = boost
                8.848589 = idf(docFreq=16, maxDocs=43556)
                0.06548826 = queryNorm
              4.6926727 = fieldWeight in 2885, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.848589 = idf(docFreq=16, maxDocs=43556)
                0.375 = fieldNorm(doc=2885)
        0.33333334 = coord(1/3)
    
  4. Moreno Fernández, L.M. -> Fernández, L.M.M.: 0.97
    0.97082055 = sum of:
      0.97082055 = product of:
        2.9124615 = sum of:
          2.9124615 = weight(author_txt:fernández in 5948) [ClassicSimilarity], result of:
            2.9124615 = score(doc=5948,freq=2.0), product of:
              0.5637695 = queryWeight, product of:
                1.031033 = boost
                8.349598 = idf(docFreq=27, maxDocs=43556)
                0.06548826 = queryNorm
              5.16605 = fieldWeight in 5948, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.349598 = idf(docFreq=27, maxDocs=43556)
                0.4375 = fieldNorm(doc=5948)
        0.33333334 = coord(1/3)
    
  5. Sánchez, M.F.: Semantically enhanced Information Retrieval : an ontology-based approach (2006) 0.89
    0.89476323 = sum of:
      0.89476323 = product of:
        2.6842897 = sum of:
          2.6842897 = weight(author_txt:sánchez in 1325) [ClassicSimilarity], result of:
            2.6842897 = score(doc=1325,freq=1.0), product of:
              0.53034246 = queryWeight, product of:
                8.098284 = idf(docFreq=35, maxDocs=43556)
                0.06548826 = queryNorm
              5.061427 = fieldWeight in 1325, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.098284 = idf(docFreq=35, maxDocs=43556)
                0.625 = fieldNorm(doc=1325)
        0.33333334 = coord(1/3)
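
The score trees above all follow the same arithmetic, so a short Python sketch may help when reading them. It assumes the usual Lucene ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the idf values printed in the listing. The function names are illustrative only and are not part of any library; the numeric inputs are copied from entry 1 (doc 2447).

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """idf as assumed for ClassicSimilarity: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_term_score(freq: float, doc_freq: int, max_docs: int,
                       boost: float, query_norm: float,
                       field_norm: float) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm            # "queryWeight" line
    field_weight = math.sqrt(freq) * idf * field_norm  # "fieldWeight" line
    return query_weight * field_weight


# Values copied from the explain tree for author_txt:castillo in doc 2447.
term_score = classic_term_score(
    freq=2.0, doc_freq=16, max_docs=43556,
    boost=1.0926499, query_norm=0.06548826, field_norm=0.375,
)

# coord(1/3): one of the three author query terms matched this record.
final_score = term_score * (1 / 3)

print(f"term score  = {term_score:.3f}")
print(f"final score = {final_score:.3f}")
```

The two printed values should agree with the 2.971247 and 0.9904157 lines of entry 1 up to small floating-point differences, since Lucene computes in single precision.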
    

Similar documents (content)

  1. Tsujii, J.-I.: Automatic acquisition of semantic collocation from corpora (1995) 0.13
    0.13409403 = sum of:
      0.13409403 = product of:
        0.8380877 = sum of:
          0.05956202 = weight(abstract_txt:acquisition in 4775) [ClassicSimilarity], result of:
            0.05956202 = score(doc=4775,freq=1.0), product of:
              0.076209426 = queryWeight, product of:
                6.252457 = idf(docFreq=227, maxDocs=43556)
                0.012188716 = queryNorm
              0.78155714 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.252457 = idf(docFreq=227, maxDocs=43556)
                0.125 = fieldNorm(doc=4775)
          0.10999508 = weight(abstract_txt:real in 4775) [ClassicSimilarity], result of:
            0.10999508 = score(doc=4775,freq=1.0), product of:
              0.16544424 = queryWeight, product of:
                2.5520084 = boost
                5.3187747 = idf(docFreq=579, maxDocs=43556)
                0.012188716 = queryNorm
              0.66484684 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3187747 = idf(docFreq=579, maxDocs=43556)
                0.125 = fieldNorm(doc=4775)
          0.23674594 = weight(abstract_txt:algorithm in 4775) [ClassicSimilarity], result of:
            0.23674594 = score(doc=4775,freq=3.0), product of:
              0.19122767 = queryWeight, product of:
                2.7436686 = boost
                5.7182236 = idf(docFreq=388, maxDocs=43556)
                0.012188716 = queryNorm
              1.2380317 = fieldWeight in 4775, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.7182236 = idf(docFreq=388, maxDocs=43556)
                0.125 = fieldNorm(doc=4775)
          0.43178466 = weight(abstract_txt:corpora in 4775) [ClassicSimilarity], result of:
            0.43178466 = score(doc=4775,freq=1.0), product of:
              0.48812443 = queryWeight, product of:
                5.6590815 = boost
                7.0766325 = idf(docFreq=99, maxDocs=43556)
                0.012188716 = queryNorm
              0.88457906 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0766325 = idf(docFreq=99, maxDocs=43556)
                0.125 = fieldNorm(doc=4775)
        0.16 = coord(4/25)
    
  2. Yang, C.C.; Li, K.W.: Automatic construction of English/Chinese parallel corpora (2003) 0.12
    0.11659356 = sum of:
      0.11659356 = product of:
        0.48580652 = sum of:
          0.009038283 = weight(abstract_txt:paper in 2681) [ClassicSimilarity], result of:
            0.009038283 = score(doc=2681,freq=1.0), product of:
              0.047399923 = queryWeight, product of:
                1.11532 = boost
                3.486745 = idf(docFreq=3622, maxDocs=43556)
                0.012188716 = queryNorm
              0.19068137 = fieldWeight in 2681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.486745 = idf(docFreq=3622, maxDocs=43556)
                0.0546875 = fieldNorm(doc=2681)
          0.031372298 = weight(abstract_txt:language in 2681) [ClassicSimilarity], result of:
            0.031372298 = score(doc=2681,freq=4.0), product of:
              0.068453856 = queryWeight, product of:
                1.3403234 = boost
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.012188716 = queryNorm
              0.45829847 = fieldWeight in 2681, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.0546875 = fieldNorm(doc=2681)
          0.033967905 = weight(abstract_txt:level in 2681) [ClassicSimilarity], result of:
            0.033967905 = score(doc=2681,freq=3.0), product of:
              0.07944364 = queryWeight, product of:
                1.4439104 = boost
                4.5139937 = idf(docFreq=1296, maxDocs=43556)
                0.012188716 = queryNorm
              0.42757237 = fieldWeight in 2681, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5139937 = idf(docFreq=1296, maxDocs=43556)
                0.0546875 = fieldNorm(doc=2681)
          0.048694853 = weight(abstract_txt:domain in 2681) [ClassicSimilarity], result of:
            0.048694853 = score(doc=2681,freq=2.0), product of:
              0.1323517 = queryWeight, product of:
                2.282554 = boost
                4.75719 = idf(docFreq=1016, maxDocs=43556)
                0.012188716 = queryNorm
              0.36792013 = fieldWeight in 2681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.75719 = idf(docFreq=1016, maxDocs=43556)
                0.0546875 = fieldNorm(doc=2681)
          0.035538796 = weight(abstract_txt:applied in 2681) [ClassicSimilarity], result of:
            0.035538796 = score(doc=2681,freq=1.0), product of:
              0.13517176 = queryWeight, product of:
                2.3067434 = boost
                4.8076043 = idf(docFreq=966, maxDocs=43556)
                0.012188716 = queryNorm
              0.26291585 = fieldWeight in 2681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8076043 = idf(docFreq=966, maxDocs=43556)
                0.0546875 = fieldNorm(doc=2681)
          0.3271944 = weight(abstract_txt:corpora in 2681) [ClassicSimilarity], result of:
            0.3271944 = score(doc=2681,freq=3.0), product of:
              0.48812443 = queryWeight, product of:
                5.6590815 = boost
                7.0766325 = idf(docFreq=99, maxDocs=43556)
                0.012188716 = queryNorm
              0.6703094 = fieldWeight in 2681, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.0766325 = idf(docFreq=99, maxDocs=43556)
                0.0546875 = fieldNorm(doc=2681)
        0.24 = coord(6/25)
    
  3. Markó, K.G.: Foundation, implementation and evaluation of the MorphoSaurus system (2008) 0.10
    0.10357391 = sum of:
      0.10357391 = product of:
        0.2877053 = sum of:
          0.018613132 = weight(abstract_txt:acquisition in 1413) [ClassicSimilarity], result of:
            0.018613132 = score(doc=1413,freq=1.0), product of:
              0.076209426 = queryWeight, product of:
                6.252457 = idf(docFreq=227, maxDocs=43556)
                0.012188716 = queryNorm
              0.2442366 = fieldWeight in 1413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.252457 = idf(docFreq=227, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
          0.022342041 = weight(abstract_txt:inherent in 1413) [ClassicSimilarity], result of:
            0.022342041 = score(doc=1413,freq=1.0), product of:
              0.08607512 = queryWeight, product of:
                1.0627582 = boost
                6.6448503 = idf(docFreq=153, maxDocs=43556)
                0.012188716 = queryNorm
              0.25956446 = fieldWeight in 1413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6448503 = idf(docFreq=153, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
          0.042551506 = weight(abstract_txt:disambiguation in 1413) [ClassicSimilarity], result of:
            0.042551506 = score(doc=1413,freq=2.0), product of:
              0.10496931 = queryWeight, product of:
                1.1736182 = boost
                7.3379974 = idf(docFreq=76, maxDocs=43556)
                0.012188716 = queryNorm
              0.40537092 = fieldWeight in 1413, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.3379974 = idf(docFreq=76, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
          0.022408783 = weight(abstract_txt:language in 1413) [ClassicSimilarity], result of:
            0.022408783 = score(doc=1413,freq=4.0), product of:
              0.068453856 = queryWeight, product of:
                1.3403234 = boost
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.012188716 = queryNorm
              0.32735604 = fieldWeight in 1413, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
          0.030128123 = weight(abstract_txt:sense in 1413) [ClassicSimilarity], result of:
            0.030128123 = score(doc=1413,freq=1.0), product of:
              0.13236925 = queryWeight, product of:
                1.863821 = boost
                5.8267307 = idf(docFreq=348, maxDocs=43556)
                0.012188716 = queryNorm
              0.22760667 = fieldWeight in 1413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8267307 = idf(docFreq=348, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
          0.049189236 = weight(abstract_txt:domain in 1413) [ClassicSimilarity], result of:
            0.049189236 = score(doc=1413,freq=4.0), product of:
              0.1323517 = queryWeight, product of:
                2.282554 = boost
                4.75719 = idf(docFreq=1016, maxDocs=43556)
                0.012188716 = queryNorm
              0.3716555 = fieldWeight in 1413, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.75719 = idf(docFreq=1016, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
          0.025384856 = weight(abstract_txt:applied in 1413) [ClassicSimilarity], result of:
            0.025384856 = score(doc=1413,freq=1.0), product of:
              0.13517176 = queryWeight, product of:
                2.3067434 = boost
                4.8076043 = idf(docFreq=966, maxDocs=43556)
                0.012188716 = queryNorm
              0.18779704 = fieldWeight in 1413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8076043 = idf(docFreq=966, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
          0.034373462 = weight(abstract_txt:real in 1413) [ClassicSimilarity], result of:
            0.034373462 = score(doc=1413,freq=1.0), product of:
              0.16544424 = queryWeight, product of:
                2.5520084 = boost
                5.3187747 = idf(docFreq=579, maxDocs=43556)
                0.012188716 = queryNorm
              0.20776464 = fieldWeight in 1413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3187747 = idf(docFreq=579, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
          0.042714164 = weight(abstract_txt:algorithm in 1413) [ClassicSimilarity], result of:
            0.042714164 = score(doc=1413,freq=1.0), product of:
              0.19122767 = queryWeight, product of:
                2.7436686 = boost
                5.7182236 = idf(docFreq=388, maxDocs=43556)
                0.012188716 = queryNorm
              0.22336811 = fieldWeight in 1413, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7182236 = idf(docFreq=388, maxDocs=43556)
                0.0390625 = fieldNorm(doc=1413)
        0.36 = coord(9/25)
    
  4. Snajder, J.; Dalbelo Basic, B.D.; Tadic, M.: Automatic acquisition of inflectional lexica for morphological normalisation (2008) 0.10
    0.10305449 = sum of:
      0.10305449 = product of:
        0.42939374 = sum of:
          0.037226263 = weight(abstract_txt:acquisition in 4908) [ClassicSimilarity], result of:
            0.037226263 = score(doc=4908,freq=1.0), product of:
              0.076209426 = queryWeight, product of:
                6.252457 = idf(docFreq=227, maxDocs=43556)
                0.012188716 = queryNorm
              0.4884732 = fieldWeight in 4908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.252457 = idf(docFreq=227, maxDocs=43556)
                0.078125 = fieldNorm(doc=4908)
          0.012911832 = weight(abstract_txt:paper in 4908) [ClassicSimilarity], result of:
            0.012911832 = score(doc=4908,freq=1.0), product of:
              0.047399923 = queryWeight, product of:
                1.11532 = boost
                3.486745 = idf(docFreq=3622, maxDocs=43556)
                0.012188716 = queryNorm
              0.27240196 = fieldWeight in 4908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.486745 = idf(docFreq=3622, maxDocs=43556)
                0.078125 = fieldNorm(doc=4908)
          0.031690806 = weight(abstract_txt:language in 4908) [ClassicSimilarity], result of:
            0.031690806 = score(doc=4908,freq=2.0), product of:
              0.068453856 = queryWeight, product of:
                1.3403234 = boost
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.012188716 = queryNorm
              0.46295136 = fieldWeight in 4908, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.078125 = fieldNorm(doc=4908)
          0.026929695 = weight(abstract_txt:problem in 4908) [ClassicSimilarity], result of:
            0.026929695 = score(doc=4908,freq=1.0), product of:
              0.077376075 = queryWeight, product of:
                1.4249972 = boost
                4.454867 = idf(docFreq=1375, maxDocs=43556)
                0.012188716 = queryNorm
              0.34803647 = fieldWeight in 4908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454867 = idf(docFreq=1375, maxDocs=43556)
                0.078125 = fieldNorm(doc=4908)
          0.050769713 = weight(abstract_txt:applied in 4908) [ClassicSimilarity], result of:
            0.050769713 = score(doc=4908,freq=1.0), product of:
              0.13517176 = queryWeight, product of:
                2.3067434 = boost
                4.8076043 = idf(docFreq=966, maxDocs=43556)
                0.012188716 = queryNorm
              0.37559408 = fieldWeight in 4908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8076043 = idf(docFreq=966, maxDocs=43556)
                0.078125 = fieldNorm(doc=4908)
          0.26986542 = weight(abstract_txt:corpora in 4908) [ClassicSimilarity], result of:
            0.26986542 = score(doc=4908,freq=1.0), product of:
              0.48812443 = queryWeight, product of:
                5.6590815 = boost
                7.0766325 = idf(docFreq=99, maxDocs=43556)
                0.012188716 = queryNorm
              0.5528619 = fieldWeight in 4908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0766325 = idf(docFreq=99, maxDocs=43556)
                0.078125 = fieldNorm(doc=4908)
        0.24 = coord(6/25)
    
  5. Suakkaphong, N.; Zhang, Z.; Chen, H.: Disease named entity recognition using semisupervised learning and conditional random fields (2011) 0.10
    0.100055255 = sum of:
      0.100055255 = product of:
        0.62534535 = sum of:
          0.03935139 = weight(abstract_txt:domain in 1365) [ClassicSimilarity], result of:
            0.03935139 = score(doc=1365,freq=1.0), product of:
              0.1323517 = queryWeight, product of:
                2.282554 = boost
                4.75719 = idf(docFreq=1016, maxDocs=43556)
                0.012188716 = queryNorm
              0.2973244 = fieldWeight in 1365, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.75719 = idf(docFreq=1016, maxDocs=43556)
                0.0625 = fieldNorm(doc=1365)
          0.04061577 = weight(abstract_txt:applied in 1365) [ClassicSimilarity], result of:
            0.04061577 = score(doc=1365,freq=1.0), product of:
              0.13517176 = queryWeight, product of:
                2.3067434 = boost
                4.8076043 = idf(docFreq=966, maxDocs=43556)
                0.012188716 = queryNorm
              0.30047527 = fieldWeight in 1365, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8076043 = idf(docFreq=966, maxDocs=43556)
                0.0625 = fieldNorm(doc=1365)
          0.06834266 = weight(abstract_txt:algorithm in 1365) [ClassicSimilarity], result of:
            0.06834266 = score(doc=1365,freq=1.0), product of:
              0.19122767 = queryWeight, product of:
                2.7436686 = boost
                5.7182236 = idf(docFreq=388, maxDocs=43556)
                0.012188716 = queryNorm
              0.35738897 = fieldWeight in 1365, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7182236 = idf(docFreq=388, maxDocs=43556)
                0.0625 = fieldNorm(doc=1365)
          0.47703555 = weight(abstract_txt:bootstrapping in 1365) [ClassicSimilarity], result of:
            0.47703555 = score(doc=1365,freq=2.0), product of:
              0.5543448 = queryWeight, product of:
                4.6713915 = boost
                9.735892 = idf(docFreq=6, maxDocs=43556)
                0.012188716 = queryNorm
              0.86053944 = fieldWeight in 1365, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.735892 = idf(docFreq=6, maxDocs=43556)
                0.0625 = fieldNorm(doc=1365)
        0.16 = coord(4/25)
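
For the content-based matches, several abstract terms contribute to one record, and the coord factor scales the sum by the share of query terms that matched. The sketch below recomputes entry 5 (doc 1365) under the same assumed ClassicSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The helper name and constants are illustrative, and the figure of 25 query terms is read off the coord(4/25) line rather than known independently.

```python
import math

MAX_DOCS = 43556
QUERY_NORM = 0.012188716   # queryNorm shared by all abstract_txt terms
QUERY_TERMS = 25           # denominator of coord(4/25) in the listing
FIELD_NORM = 0.0625        # fieldNorm(doc=1365)


def term_weight(freq: float, doc_freq: int, boost: float) -> float:
    """queryWeight * fieldWeight for one matching term (assumed formulas)."""
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * FIELD_NORM
    return query_weight * field_weight


# (term, freq, docFreq, boost) copied from the explain tree for doc 1365.
matches = [
    ("domain", 1.0, 1016, 2.282554),
    ("applied", 1.0, 966, 2.3067434),
    ("algorithm", 1.0, 388, 2.7436686),
    ("bootstrapping", 2.0, 6, 4.6713915),
]

raw_sum = sum(term_weight(freq, df, boost) for _, freq, df, boost in matches)
coord = len(matches) / QUERY_TERMS   # 4 / 25 = 0.16
doc_score = raw_sum * coord

for term, freq, df, boost in matches:
    print(f"{term:14s} {term_weight(freq, df, boost):.4f}")
print(f"sum = {raw_sum:.4f}, coord = {coord:.2f}, score = {doc_score:.4f}")
```

The rare term "bootstrapping" (docFreq=6) supplies roughly three quarters of the raw sum, which is why this record ranks among the top five although only 4 of the 25 query terms match.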