Document (#34452)

Author
Sánchez-de-Madariaga, R.
Fernández-del-Castillo, J.R.
Title
The bootstrapping of the Yarowsky algorithm in real corpora
Source
Information processing and management. 45(2009) no.1, pp.55-69
Year
2009
Abstract
The Yarowsky bootstrapping algorithm resolves the word sense disambiguation (WSD) problem at the homograph level, the sense granularity required for real natural language processing (NLP) applications. At the same time, it resolves the knowledge acquisition bottleneck that affects most WSD algorithms, and it can be easily applied to foreign-language corpora. However, this paper shows that the Yarowsky algorithm is significantly less accurate when applied to domain-fluctuating, real corpora. The paper also introduces a new bootstrapping methodology that performs much better on these corpora. Even so, the accuracy achieved on non-domain-fluctuating corpora is not reached, because of inherent domain-fluctuation ambiguities.
Theme
Retrieval algorithms

Similar documents (author)

  1. Castillo, M. Davey => Davey Castillo, M.: 0.99
    0.9919521 = sum of:
      0.9919521 = product of:
        2.9758563 = sum of:
          2.9758563 = weight(author_txt:castillo in 2447) [ClassicSimilarity], result of:
            2.9758563 = score(doc=2447,freq=2.0), product of:
              0.63307023 = queryWeight, product of:
                1.0924778 = boost
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.06537708 = queryNorm
              4.700673 = fieldWeight in 2447, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.375 = fieldNorm(doc=2447)
        0.33333334 = coord(1/3)
    
  2. Castillo, J. Ruiz -> Ruiz-Castillo, J.: 0.99
    0.9919521 = sum of:
      0.9919521 = product of:
        2.9758563 = sum of:
          2.9758563 = weight(author_txt:castillo in 8230) [ClassicSimilarity], result of:
            2.9758563 = score(doc=8230,freq=2.0), product of:
              0.63307023 = queryWeight, product of:
                1.0924778 = boost
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.06537708 = queryNorm
              4.700673 = fieldWeight in 8230, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.375 = fieldNorm(doc=8230)
        0.33333334 = coord(1/3)
    
  3. Castillo, J. Ruiz- => Ruiz-Castillo, J.: 0.99
    0.9919521 = sum of:
      0.9919521 = product of:
        2.9758563 = sum of:
          2.9758563 = weight(author_txt:castillo in 2819) [ClassicSimilarity], result of:
            2.9758563 = score(doc=2819,freq=2.0), product of:
              0.63307023 = queryWeight, product of:
                1.0924778 = boost
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.06537708 = queryNorm
              4.700673 = fieldWeight in 2819, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.375 = fieldNorm(doc=2819)
        0.33333334 = coord(1/3)
    
  4. Moreno Fernández, L.M. -> Fernández, L.M.M.: 0.97
    0.9726233 = sum of:
      0.9726233 = product of:
        2.9178698 = sum of:
          2.9178698 = weight(author_txt:fernández in 5951) [ClassicSimilarity], result of:
            2.9178698 = score(doc=5951,freq=2.0), product of:
              0.5637978 = queryWeight, product of:
                1.0309755 = boost
                8.364683 = idf(docFreq=27, maxDocs=44218)
                0.06537708 = queryNorm
              5.1753836 = fieldWeight in 5951, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.364683 = idf(docFreq=27, maxDocs=44218)
                0.4375 = fieldNorm(doc=5951)
        0.33333334 = coord(1/3)
    
  5. Sánchez, M.F.: Semantically enhanced Information Retrieval : an ontology-based approach (2006) 0.90
    0.896575 = sum of:
      0.896575 = product of:
        2.689725 = sum of:
          2.689725 = weight(author_txt:sánchez in 4327) [ClassicSimilarity], result of:
            2.689725 = score(doc=4327,freq=1.0), product of:
              0.5304283 = queryWeight, product of:
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.06537708 = queryNorm
              5.070855 = fieldWeight in 4327, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.625 = fieldNorm(doc=4327)
        0.33333334 = coord(1/3)
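
The score trees above can be checked by hand. The following is a minimal sketch, assuming the classic TF-IDF formulas that reproduce the values printed in the listing: tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are illustrative and not part of any library, and the input numbers are copied from entry 1 (doc=2447).

```python
import math

def classic_idf(doc_freq, max_docs):
    # Assumed idf form; it matches the idf values printed in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

def term_score(freq, doc_freq, max_docs, field_norm, query_norm, boost=1.0):
    # score = queryWeight * fieldWeight, as in the explain tree.
    idf = classic_idf(doc_freq, max_docs)
    tf = math.sqrt(freq)
    query_weight = boost * idf * query_norm   # boost * idf * queryNorm
    field_weight = tf * idf * field_norm      # tf * idf * fieldNorm
    return query_weight * field_weight

# Values taken from entry 1 (author_txt:castillo in doc 2447).
score = term_score(freq=2.0, doc_freq=16, max_docs=44218,
                   field_norm=0.375, query_norm=0.06537708,
                   boost=1.0924778)

# coord(1/3): one of three query terms matched.
total = score * (1.0 / 3.0)
print(round(score, 5), round(total, 5))
```

The result agrees with the listed 2.9758563 and 0.9919521 up to rounding, since the listing uses single-precision floats while this sketch uses doubles.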
    

Similar documents (content)

  1. Tsujii, J.-I.: Automatic acquisition of semantic collocation from corpora (1995) 0.13
    0.13249512 = sum of:
      0.13249512 = product of:
        0.8280945 = sum of:
          0.059289213 = weight(abstract_txt:acquisition in 4709) [ClassicSimilarity], result of:
            0.059289213 = score(doc=4709,freq=1.0), product of:
              0.07604469 = queryWeight, product of:
                6.237302 = idf(docFreq=234, maxDocs=44218)
                0.0121919215 = queryNorm
              0.7796627 = fieldWeight in 4709, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.237302 = idf(docFreq=234, maxDocs=44218)
                0.125 = fieldNorm(doc=4709)
          0.10964261 = weight(abstract_txt:real in 4709) [ClassicSimilarity], result of:
            0.10964261 = score(doc=4709,freq=1.0), product of:
              0.16523871 = queryWeight, product of:
                2.5531838 = boost
                5.308326 = idf(docFreq=594, maxDocs=44218)
                0.0121919215 = queryNorm
              0.6635407 = fieldWeight in 4709, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.308326 = idf(docFreq=594, maxDocs=44218)
                0.125 = fieldNorm(doc=4709)
          0.23579293 = weight(abstract_txt:algorithm in 4709) [ClassicSimilarity], result of:
            0.23579293 = score(doc=4709,freq=3.0), product of:
              0.19088523 = queryWeight, product of:
                2.7441783 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0121919215 = queryNorm
              1.2352602 = fieldWeight in 4709, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.125 = fieldNorm(doc=4709)
          0.4233697 = weight(abstract_txt:corpora in 4709) [ClassicSimilarity], result of:
            0.4233697 = score(doc=4709,freq=1.0), product of:
              0.48219383 = queryWeight, product of:
                5.6306868 = boost
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.0121919215 = queryNorm
              0.8780073 = fieldWeight in 4709, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.125 = fieldNorm(doc=4709)
        0.16 = coord(4/25)
    
  2. Yang, C.C.; Li, K.W.: Automatic construction of English/Chinese parallel corpora (2003) 0.11
    0.11477414 = sum of:
      0.11477414 = product of:
        0.4782256 = sum of:
          0.008912433 = weight(abstract_txt:paper in 1683) [ClassicSimilarity], result of:
            0.008912433 = score(doc=1683,freq=1.0), product of:
              0.04700102 = queryWeight, product of:
                1.1118193 = boost
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.0121919215 = queryNorm
              0.18962212 = fieldWeight in 1683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.031275395 = weight(abstract_txt:language in 1683) [ClassicSimilarity], result of:
            0.031275395 = score(doc=1683,freq=4.0), product of:
              0.0683741 = queryWeight, product of:
                1.3409925 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0121919215 = queryNorm
              0.45741582 = fieldWeight in 1683, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.033697654 = weight(abstract_txt:level in 1683) [ClassicSimilarity], result of:
            0.033697654 = score(doc=1683,freq=3.0), product of:
              0.079092585 = queryWeight, product of:
                1.4422761 = boost
                4.497956 = idf(docFreq=1337, maxDocs=44218)
                0.0121919215 = queryNorm
              0.42605326 = fieldWeight in 1683, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.497956 = idf(docFreq=1337, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.048163977 = weight(abstract_txt:domain in 1683) [ClassicSimilarity], result of:
            0.048163977 = score(doc=1683,freq=2.0), product of:
              0.13150585 = queryWeight, product of:
                2.2777114 = boost
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.0121919215 = queryNorm
              0.3662497 = fieldWeight in 1683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.035358295 = weight(abstract_txt:applied in 1683) [ClassicSimilarity], result of:
            0.035358295 = score(doc=1683,freq=1.0), product of:
              0.13483451 = queryWeight, product of:
                2.3063579 = boost
                4.79515 = idf(docFreq=993, maxDocs=44218)
                0.0121919215 = queryNorm
              0.26223475 = fieldWeight in 1683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.79515 = idf(docFreq=993, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
          0.32081783 = weight(abstract_txt:corpora in 1683) [ClassicSimilarity], result of:
            0.32081783 = score(doc=1683,freq=3.0), product of:
              0.48219383 = queryWeight, product of:
                5.6306868 = boost
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.0121919215 = queryNorm
              0.6653296 = fieldWeight in 1683, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1683)
        0.24 = coord(6/25)
    
  3. Markó, K.G.: Foundation, implementation and evaluation of the MorphoSaurus system (2008) 0.10
    0.103159145 = sum of:
      0.103159145 = product of:
        0.28655317 = sum of:
          0.018527878 = weight(abstract_txt:acquisition in 4415) [ClassicSimilarity], result of:
            0.018527878 = score(doc=4415,freq=1.0), product of:
              0.07604469 = queryWeight, product of:
                6.237302 = idf(docFreq=234, maxDocs=44218)
                0.0121919215 = queryNorm
              0.2436446 = fieldWeight in 4415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.237302 = idf(docFreq=234, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
          0.022044491 = weight(abstract_txt:inherent in 4415) [ClassicSimilarity], result of:
            0.022044491 = score(doc=4415,freq=1.0), product of:
              0.0853857 = queryWeight, product of:
                1.0596395 = boost
                6.609291 = idf(docFreq=161, maxDocs=44218)
                0.0121919215 = queryNorm
              0.25817543 = fieldWeight in 4415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.609291 = idf(docFreq=161, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
          0.042704172 = weight(abstract_txt:disambiguation in 4415) [ClassicSimilarity], result of:
            0.042704172 = score(doc=4415,freq=2.0), product of:
              0.10531461 = queryWeight, product of:
                1.1768196 = boost
                7.3401785 = idf(docFreq=77, maxDocs=44218)
                0.0121919215 = queryNorm
              0.4054914 = fieldWeight in 4415, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.3401785 = idf(docFreq=77, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
          0.022339566 = weight(abstract_txt:language in 4415) [ClassicSimilarity], result of:
            0.022339566 = score(doc=4415,freq=4.0), product of:
              0.0683741 = queryWeight, product of:
                1.3409925 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0121919215 = queryNorm
              0.32672557 = fieldWeight in 4415, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
          0.030222647 = weight(abstract_txt:sense in 4415) [ClassicSimilarity], result of:
            0.030222647 = score(doc=4415,freq=1.0), product of:
              0.13276495 = queryWeight, product of:
                1.8686254 = boost
                5.8275905 = idf(docFreq=353, maxDocs=44218)
                0.0121919215 = queryNorm
              0.22764026 = fieldWeight in 4415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8275905 = idf(docFreq=353, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
          0.048652966 = weight(abstract_txt:domain in 4415) [ClassicSimilarity], result of:
            0.048652966 = score(doc=4415,freq=4.0), product of:
              0.13150585 = queryWeight, product of:
                2.2777114 = boost
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.0121919215 = queryNorm
              0.3699681 = fieldWeight in 4415, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
          0.025255926 = weight(abstract_txt:applied in 4415) [ClassicSimilarity], result of:
            0.025255926 = score(doc=4415,freq=1.0), product of:
              0.13483451 = queryWeight, product of:
                2.3063579 = boost
                4.79515 = idf(docFreq=993, maxDocs=44218)
                0.0121919215 = queryNorm
              0.18731055 = fieldWeight in 4415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.79515 = idf(docFreq=993, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
          0.034263317 = weight(abstract_txt:real in 4415) [ClassicSimilarity], result of:
            0.034263317 = score(doc=4415,freq=1.0), product of:
              0.16523871 = queryWeight, product of:
                2.5531838 = boost
                5.308326 = idf(docFreq=594, maxDocs=44218)
                0.0121919215 = queryNorm
              0.20735648 = fieldWeight in 4415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.308326 = idf(docFreq=594, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
          0.042542227 = weight(abstract_txt:algorithm in 4415) [ClassicSimilarity], result of:
            0.042542227 = score(doc=4415,freq=1.0), product of:
              0.19088523 = queryWeight, product of:
                2.7441783 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0121919215 = queryNorm
              0.22286808 = fieldWeight in 4415, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4415)
        0.36 = coord(9/25)
    
  4. Snajder, J.; Dalbelo Basic, B.D.; Tadic, M.: Automatic acquisition of inflectional lexica for morphological normalisation (2008) 0.10
    0.101665035 = sum of:
      0.101665035 = product of:
        0.4236043 = sum of:
          0.037055757 = weight(abstract_txt:acquisition in 2910) [ClassicSimilarity], result of:
            0.037055757 = score(doc=2910,freq=1.0), product of:
              0.07604469 = queryWeight, product of:
                6.237302 = idf(docFreq=234, maxDocs=44218)
                0.0121919215 = queryNorm
              0.4872892 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.237302 = idf(docFreq=234, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.012732047 = weight(abstract_txt:paper in 2910) [ClassicSimilarity], result of:
            0.012732047 = score(doc=2910,freq=1.0), product of:
              0.04700102 = queryWeight, product of:
                1.1118193 = boost
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.0121919215 = queryNorm
              0.27088875 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.031592917 = weight(abstract_txt:language in 2910) [ClassicSimilarity], result of:
            0.031592917 = score(doc=2910,freq=2.0), product of:
              0.0683741 = queryWeight, product of:
                1.3409925 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0121919215 = queryNorm
              0.46205974 = fieldWeight in 2910, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.027105663 = weight(abstract_txt:problem in 2910) [ClassicSimilarity], result of:
            0.027105663 = score(doc=2910,freq=1.0), product of:
              0.07778248 = queryWeight, product of:
                1.4302813 = boost
                4.460548 = idf(docFreq=1388, maxDocs=44218)
                0.0121919215 = queryNorm
              0.3484803 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.460548 = idf(docFreq=1388, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.050511852 = weight(abstract_txt:applied in 2910) [ClassicSimilarity], result of:
            0.050511852 = score(doc=2910,freq=1.0), product of:
              0.13483451 = queryWeight, product of:
                2.3063579 = boost
                4.79515 = idf(docFreq=993, maxDocs=44218)
                0.0121919215 = queryNorm
              0.3746211 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.79515 = idf(docFreq=993, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
          0.26460606 = weight(abstract_txt:corpora in 2910) [ClassicSimilarity], result of:
            0.26460606 = score(doc=2910,freq=1.0), product of:
              0.48219383 = queryWeight, product of:
                5.6306868 = boost
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.0121919215 = queryNorm
              0.5487546 = fieldWeight in 2910, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.078125 = fieldNorm(doc=2910)
        0.24 = coord(6/25)
    
  5. Suakkaphong, N.; Zhang, Z.; Chen, H.: Disease named entity recognition using semisupervised learning and conditional random fields (2011) 0.10
    0.100471474 = sum of:
      0.100471474 = product of:
        0.62794673 = sum of:
          0.038922373 = weight(abstract_txt:domain in 4367) [ClassicSimilarity], result of:
            0.038922373 = score(doc=4367,freq=1.0), product of:
              0.13150585 = queryWeight, product of:
                2.2777114 = boost
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.0121919215 = queryNorm
              0.29597446 = fieldWeight in 4367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.04040948 = weight(abstract_txt:applied in 4367) [ClassicSimilarity], result of:
            0.04040948 = score(doc=4367,freq=1.0), product of:
              0.13483451 = queryWeight, product of:
                2.3063579 = boost
                4.79515 = idf(docFreq=993, maxDocs=44218)
                0.0121919215 = queryNorm
              0.29969686 = fieldWeight in 4367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.79515 = idf(docFreq=993, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.06806756 = weight(abstract_txt:algorithm in 4367) [ClassicSimilarity], result of:
            0.06806756 = score(doc=4367,freq=1.0), product of:
              0.19088523 = queryWeight, product of:
                2.7441783 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0121919215 = queryNorm
              0.35658893 = fieldWeight in 4367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.4805473 = weight(abstract_txt:bootstrapping in 4367) [ClassicSimilarity], result of:
            0.4805473 = score(doc=4367,freq=2.0), product of:
              0.55756176 = queryWeight, product of:
                4.689998 = boost
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.0121919215 = queryNorm
              0.8618728 = fieldWeight in 4367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
        0.16 = coord(4/25)
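
For the content-based matches, several term scores are summed before the coordination factor is applied. The sketch below recomputes entry 1 of this list (doc=4709) under the same assumed formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The helper name is illustrative. The term "acquisition" shows no boost line in the listing, so a boost of 1.0 is assumed for it.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.0121919215   # queryNorm shared by all abstract_txt terms
FIELD_NORM = 0.125          # fieldNorm(doc=4709)
QUERY_TERMS = 25            # denominator of coord(4/25)

# (term, freq, docFreq, boost), copied from the explain tree for doc 4709.
# The boost of 1.0 for "acquisition" is an assumption (no boost line shown).
matched_terms = [
    ("acquisition", 1.0, 234, 1.0),
    ("real",        1.0, 594, 2.5531838),
    ("algorithm",   3.0, 399, 2.7441783),
    ("corpora",     1.0, 106, 5.6306868),
]

def term_score(freq, doc_freq, boost):
    # queryWeight * fieldWeight for a single matched term.
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1.0))
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * FIELD_NORM
    return query_weight * field_weight

raw_sum = sum(term_score(f, df, b) for _, f, df, b in matched_terms)
coord = len(matched_terms) / QUERY_TERMS   # 4 of 25 query terms matched
final_score = raw_sum * coord
print(round(raw_sum, 5), round(final_score, 5))
```

Because coord scales the sum by the fraction of query terms matched, a document matching more terms (for example entry 3 with coord(9/25)) can rank close to one with fewer but heavier matches.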