Search (2 results, page 1 of 1)

  • author_ss:"Jurafsky, D."
  1. Levin, M.; Krawczyk, S.; Bethard, S.; Jurafsky, D.: Citation-based bootstrapping for large-scale author disambiguation (2012) 0.00
    0.0032601836 = product of:
      0.022821285 = sum of:
        0.017699862 = weight(_text_:web in 246) [ClassicSimilarity], result of:
          0.017699862 = score(doc=246,freq=2.0), product of:
            0.098177016 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.030083254 = queryNorm
            0.18028519 = fieldWeight in 246, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.0390625 = fieldNorm(doc=246)
        0.005121422 = weight(_text_:information in 246) [ClassicSimilarity], result of:
          0.005121422 = score(doc=246,freq=2.0), product of:
            0.052810486 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.030083254 = queryNorm
            0.09697737 = fieldWeight in 246, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=246)
      0.14285715 = coord(2/14)
    We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first "bootstrap" stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with B3 F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1.
    Journal of the American Society for Information Science and Technology. 63(2012) no.5, S.1030-1047
  2. Jurafsky, D.; Martin, J.H.: Speech and language processing : an introduction to natural language processing, computational linguistics and speech recognition (2009) 0.00
    0.0017879561 = product of:
      0.025031384 = sum of:
        0.025031384 = weight(_text_:web in 1081) [ClassicSimilarity], result of:
          0.025031384 = score(doc=1081,freq=4.0), product of:
            0.098177016 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.030083254 = queryNorm
            0.25496176 = fieldWeight in 1081, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1081)
      0.071428575 = coord(1/14)
    For undergraduate or advanced undergraduate courses in Classical Natural Language Processing, Statistical Natural Language Processing, Speech Recognition, Computational Linguistics, and Human Language Processing. An explosion of Web-based language techniques, merging of distinct fields, availability of phone-based dialogue systems, and much more make this an exciting time in speech and language processing. The first of its kind to thoroughly cover language technology at all levels and with all modern technologies, this text takes an empirical approach to the subject, based on applying statistical and other machine-learning algorithms to large corpora. The authors cover areas that traditionally are taught in different courses, to describe a unified vision of speech and language processing. Emphasis is on practical applications and scientific evaluation. An accompanying Website contains teaching materials for instructors, with pointers to language processing resources on the Web. The Second Edition offers a significant amount of new and extended material.
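
The score breakdowns listed under each result follow the TF-IDF arithmetic of Lucene's ClassicSimilarity. As a rough sketch, the Python below recomputes both relevance scores from the leaf values in the listing (termFreq, docFreq, maxDocs, queryNorm, fieldNorm, coord). The function names are illustrative and not part of Lucene, and the idf formula is the one that reproduces the figures shown here, which is assumed to match the Lucene version that produced this output.

```python
import math

def idf(doc_freq, max_docs):
    # Inverse document frequency as it appears in the listing:
    # idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    # score = queryWeight * fieldWeight
    tf = math.sqrt(freq)                          # tf(freq) = sqrt(termFreq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm          # idf * queryNorm
    field_weight = tf * term_idf * field_norm     # tf * idf * fieldNorm
    return query_weight * field_weight

def document_score(term_scores, matched, total_clauses):
    # Final score = coord(matched/total) * sum of matching term scores
    return sum(term_scores) * matched / total_clauses

# Values copied from the explain output above
MAX_DOCS = 44218
QUERY_NORM = 0.030083254
FIELD_NORM = 0.0390625

# Result 1 (doc 246): "web" and "information" each occur twice, coord(2/14)
web_246 = term_score(2.0, 4597, MAX_DOCS, QUERY_NORM, FIELD_NORM)
info_246 = term_score(2.0, 20772, MAX_DOCS, QUERY_NORM, FIELD_NORM)
score_1 = document_score([web_246, info_246], matched=2, total_clauses=14)

# Result 2 (doc 1081): "web" occurs four times, coord(1/14)
web_1081 = term_score(4.0, 4597, MAX_DOCS, QUERY_NORM, FIELD_NORM)
score_2 = document_score([web_1081], matched=1, total_clauses=14)

print(f"result 1: {score_1:.10f}")
print(f"result 2: {score_2:.10f}")
```

Result 2 has the larger single term weight for "web" but matches only one of the 14 query clauses, so the coord factor halves its total relative to result 1. Small differences from the listed figures in the last digits are expected, since Lucene computes these values in 32-bit floats.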