Search (94 results, page 2 of 5)

  • language_ss:"e"
  • theme_ss:"Computerlinguistik"
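
The number at the end of each hit is the relevance score assigned by the engine's ClassicSimilarity ranking (Lucene's classic TF-IDF), rounded to two decimals. Behind each score, the engine's explain output decomposes a single-term match as

    score = queryWeight · fieldWeight · coord
          = (idf · queryNorm) · (√tf · idf · fieldNorm) · coord

For the first hit below, the matched token has tf = 2 and idf = 3.5018 (docFreq = 3622 of maxDocs = 44218), with queryNorm = 0.0436, fieldNorm = 0.1094, and two nested coord(1/3) factors:

    (3.5018 · 0.0436) · (√2 · 3.5018 · 0.1094) · 1/9 ≈ 0.0092, displayed as 0.01
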
  1. Baayen, R.H.; Lieber, R.: Word frequency distributions and lexical semantics (1997) 0.01
    Date
    28. 2.1999 10:48:22
  2. Martínez, F.; Martín, M.T.; Rivas, V.M.; Díaz, M.C.; Ureña, L.A.: Using neural networks for multiword recognition in IR (2003) 0.01
    Abstract
    In this paper, a supervised neural network has been used to classify pairs of terms as being multiwords or non-multiwords. Classification is based on the values yielded by different estimators, currently available in the literature, used as inputs for the neural network. Lists of multiwords and non-multiwords have been built to train the net. Afterward, many other pairs of terms have been classified using the trained net. Results obtained in this classification have been used to perform information retrieval tasks. Experiments show that detecting multiwords results in better performance of the IR methods. (A minimal sketch of this setup appears under "Illustrative sketches" below.)
  3. Meng, K.; Ba, Z.; Ma, Y.; Li, G.: A network coupling approach to detecting hierarchical linkages between science and technology (2024) 0.01
    Abstract
    Detecting science-technology hierarchical linkages is beneficial for understanding deep interactions between science and technology (S&T). Previous studies have mainly focused on linear linkages between S&T but ignored their structural linkages. In this paper, we propose a network coupling approach to inspect hierarchical interactions of S&T by integrating their knowledge linkages and structural linkages. S&T knowledge networks are first enhanced with bidirectional encoder representations from transformers (BERT) knowledge alignment, and then their hierarchical structures are identified based on K-core decomposition. Hierarchical coupling preferences and strengths of the S&T networks over time are further calculated based on similarities of coupling nodes' degree distribution and similarities of coupling edges' weight distribution. Extensive experimental results indicate that our approach is feasible and robust in identifying the coupling hierarchy with superior performance compared to other isomorphism and dissimilarity algorithms. Our research broadens the perspective of S&T linkage measurement by identifying patterns and paths in the interaction of hierarchical S&T knowledge. (A K-core sketch appears under "Illustrative sketches" below.)
  4. Chen, H.: Knowledge-based document retrieval framework and design (1992) 0.01
    Abstract
    Presents research on the design of knowledge-based document retrieval systems in which a semantic network was adopted to represent subject knowledge and classification scheme knowledge, while experts' search strategies and user modelling capability were modelled as procedural knowledge. These functionalities were incorporated into a prototype knowledge-based retrieval system, Metacat. Describes the system's design, based on the blackboard architecture, which was able to create a user profile, identify task requirements, suggest heuristics-based search strategies, perform semantic-based search assistance, and assist online query refinement.
  5. Mustafa El Hadi, W.: Evaluating human language technology : general applications to information access and management (2002) 0.01
    Source
    Knowledge organization. 29(2002) nos.3/4, pp.124-134
  6. Byrne, C.C.; McCracken, S.A.: An adaptive thesaurus employing semantic distance, relational inheritance and nominal compound interpretation for linguistic support of information retrieval (1999) 0.01
    Date
    15. 3.2000 10:22:37
  7. Boleda, G.; Evert, S.: Multiword expressions : a pain in the neck of lexical semantics (2009) 0.01
    Date
    1. 3.2013 14:56:22
  8. Lian, T.; Yu, C.; Wang, W.; Yuan, Q.; Hou, Z.: Doctoral dissertations on tourism in China : a co-word analysis (2016) 0.01
    Abstract
    The aim of this paper is to map the foci of research in doctoral dissertations on tourism in China. In the paper, co-word analysis is applied, with keywords coming from six public dissertation databases, i.e. CDFD, Wanfang Data, NLC, CALIS, ISTIC, and NSTL, as well as some university libraries providing doctoral dissertations on tourism. Altogether we have examined 928 doctoral dissertations on tourism written between 1989 and 2013. Doctoral dissertations on tourism in China involve 36 first-level disciplines and 102 secondary-level disciplines. We collect the top 68 keywords of practical significance in tourism which are mentioned at least four times. These keywords are classified into 12 categories based on co-word analysis, including cluster analysis, strategic diagrams analysis, and social network analysis. According to the strategic diagram of the 12 categories, we identify the mature and immature areas in tourism study. Social network maps are drawn for the original co-occurrence matrix and for a k-cores analysis of the binary matrix. The paper provides valuable insight into the study of tourism by analyzing doctoral dissertations on tourism in China.
  9. Griffiths, T.L.; Steyvers, M.: A probabilistic approach to semantic representation (2002) 0.01
    Date
    29. 6.2015 14:55:01
    29. 6.2015 16:09:05
  10. Czejdo, B.D.; Tucci, R.P.: A dataflow graphical language for database applications (1994) 0.01
    Date
    20.10.2000 13:29:46
  11. Whitelock, P.; Kilby, K.: Linguistic and computational techniques in machine translation system design : 2nd ed (1995) 0.01
    Date
    29. 3.1996 18:28:09
  12. Rau, L.F.: Conceptual information extraction and retrieval from natural language input (198) 0.01
    Date
    16. 8.1998 13:29:20
  13. Liu, S.; Liu, F.; Yu, C.; Meng, W.: An effective approach to document retrieval via utilizing WordNet and recognizing phrases (2004) 0.01
    Date
    10.10.2005 10:29:08
  14. Snajder, J.: Distributional semantics of multi-word expressions (2013) 0.01
    Date
    29. 4.2016 12:04:50
  15. Hutchins, J.: From first conception to first demonstration : the nascent years of machine translation, 1947-1954. A chronology (1997) 0.01
    Date
    31. 7.1996 9:22:19
  16. Yang, Y.; Wilbur, J.: Using corpus statistics to remove redundant words in text categorization (1996) 0.01
    Abstract
    This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain-specific stoplists which are much larger than a conventional domain-independent stoplist. In our tests with 3 categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases. (A stoplist-construction sketch appears under "Illustrative sketches" below.)
  17. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thorough evaluation of various methods (2000) 0.01
    Abstract
    (Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical pattern recognition, or neural network approaches are used to construct classifiers automatically. In this paper we thoroughly evaluate a wide variety of these methods on a document classification task for German text. We evaluate different feature construction and selection methods and various classifiers. Our main results are: (1) feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); (2) surprisingly, our morphological analysis does not improve classification quality compared to a letter 5-gram approach; (3) Support Vector Machines are significantly better than all other classification methods. (A 5-gram/SVM sketch appears under "Illustrative sketches" below.)
  18. Clark, M.; Kim, Y.; Kruschwitz, U.; Song, D.; Albakour, D.; Dignum, S.; Beresi, U.C.; Fasli, M.; De Roeck, A.: Automatically structuring domain knowledge from text : an overview of current research (2012) 0.01
    Date
    29. 1.2016 18:29:51
  19. Warner, J.: Analogies between linguistics and information theory (2007) 0.01
    Abstract
    An analogy is established between the syntagm and paradigm from Saussurean linguistics and the message and messages for selection from the information theory initiated by Claude Shannon. The analogy is pursued both as an end in itself and for its analytic value in understanding patterns of retrieval from full-text systems. The multivalency of individual words when isolated from their syntagm is contrasted with the relative stability of meaning of multiword sequences, when searching ordinary written discourse. The syntagm is understood as the linear sequence of oral and written language. Saussure's understanding of the word, as a unit that compels recognition by the mind, is endorsed, although not regarded as final. The lesser multivalency of multiword sequences is understood as the greater determination of signification by the extended syntagm. The paradigm is primarily understood as the network of associations a word acquires when considered apart from the syntagm. The restriction of information theory to expression or signals, and its focus on the combinatorial aspects of the message, is sustained. The message in the model of communication in information theory can include sequences of written language. Shannon's understanding of the written word, as a cohesive group of letters, with strong internal statistical influences, is added to the Saussurean conception. Sequences of more than one word are regarded as weakly correlated concatenations of cohesive units.
  20. Levin, M.; Krawczyk, S.; Bethard, S.; Jurafsky, D.: Citation-based bootstrapping for large-scale author disambiguation (2012) 0.01
    Abstract
    We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first "bootstrap" stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with a B³ F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1. (A two-stage bootstrap sketch appears under "Illustrative sketches" below.)
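
Illustrative sketches

A minimal sketch of the multiword-classification setup in result 2 (Martínez et al.), assuming PMI, the Dice coefficient, and raw pair frequency as the estimator inputs (the abstract does not list the exact estimators), with scikit-learn's MLPClassifier standing in for the paper's neural network:

    # Sketch: association-measure features for a candidate term pair feed a
    # small supervised net that predicts multiword vs. non-multiword.
    import math
    from sklearn.neural_network import MLPClassifier

    def features(pair_count, w1_count, w2_count, n_tokens):
        """PMI, Dice and raw pair frequency (assumed estimator choices)."""
        pmi = math.log2(pair_count * n_tokens / (w1_count * w2_count))
        dice = 2 * pair_count / (w1_count + w2_count)
        return [pmi, dice, pair_count]

    # Toy training lists of multiword (1) and non-multiword (0) pairs.
    X = [features(50, 60, 70, 10_000),    # strongly associated pair
         features(2, 900, 800, 10_000)]   # frequent words, rarely together
    y = [1, 0]

    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    net.fit(X, y)
    print(net.predict([features(30, 40, 90, 10_000)]))  # classify a new pair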
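
A sketch of the hierarchical-structure step in result 3 (Meng et al.): K-core decomposition identifies the layers of each knowledge network, and layers are then compared via their degree distributions. The cosine similarity and the random stand-in graphs below are assumptions, not the paper's exact coupling measure:

    # Sketch of the K-core step: group each network's nodes into layers by
    # core number, then score layer-to-layer coupling as the similarity of
    # their degree distributions (cosine similarity is an assumption here).
    import networkx as nx
    import numpy as np

    def core_layers(g):
        """Map core number k -> set of nodes in the k-th core layer."""
        layers = {}
        for node, k in nx.core_number(g).items():
            layers.setdefault(k, set()).add(node)
        return layers

    def degree_hist(g, nodes, bins=10):
        """Normalized histogram of degrees over the given nodes."""
        if not nodes:
            return np.zeros(bins)
        hist, _ = np.histogram([g.degree(n) for n in nodes],
                               bins=bins, range=(0, bins))
        return hist / hist.sum() if hist.sum() else hist.astype(float)

    def coupling_strength(g1, g2, k):
        """Cosine similarity of the k-th layers' degree distributions."""
        h1 = degree_hist(g1, core_layers(g1).get(k, set()))
        h2 = degree_hist(g2, core_layers(g2).get(k, set()))
        denom = np.linalg.norm(h1) * np.linalg.norm(h2)
        return float(h1 @ h2 / denom) if denom else 0.0

    science = nx.gnm_random_graph(100, 300, seed=1)     # stand-in for the
    technology = nx.gnm_random_graph(100, 280, seed=2)  # S&T networks
    print(coupling_strength(science, technology, k=3))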
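
A sketch of corpus-driven stop word removal in the spirit of result 16 (Yang & Wilbur); the document-frequency cutoff is an assumed stand-in for their stop word identification method:

    # Sketch: derive a domain-specific stoplist from document frequencies,
    # then strip those words before categorization.
    from collections import Counter

    def build_stoplist(docs, df_cutoff=0.5):
        """Words appearing in more than df_cutoff of all documents are
        assumed to carry little category signal within the domain."""
        df = Counter(word for doc in docs for word in set(doc))
        return {w for w, n in df.items() if n / len(docs) > df_cutoff}

    docs = [["query", "expansion", "improves", "retrieval"],
            ["retrieval", "of", "structured", "documents"],
            ["documents", "retrieval", "ranking"]]
    stop = build_stoplist(docs)   # high-DF words of this toy "domain"
    pruned = [[w for w in d if w not in stop] for d in docs]
    print(stop, pruned)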
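
A sketch of one method pair evaluated in result 17 (Goller et al.): letter 5-gram features feeding a Support Vector Machine. scikit-learn stands in for their implementation, and the toy corpus is an assumption:

    # Sketch: letter 5-grams sidestep German morphology (compounding,
    # inflection), which the paper found a full morphological analysis
    # does not improve upon; an SVM does the classification.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(5, 5)),
        LinearSVC(),
    )
    docs = ["Lieferung verspätet", "Rechnung fehlerhaft",
            "Lieferung beschädigt", "Rechnung doppelt bezahlt"]
    labels = ["logistics", "billing", "logistics", "billing"]
    clf.fit(docs, labels)
    print(clf.predict(["verspätete Lieferung"]))  # expected: ['logistics']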
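
A sketch of the two-stage self-supervised scheme in result 20 (Levin et al.): high-precision rules bootstrap labelled pairs, then a feature-based classifier generalizes to the unlabelled pairs. The specific rules and features below (shared e-mail, shared coauthor, venue match) are hypothetical stand-ins:

    # Sketch: stage 1 labels pairs with high-precision rules; stage 2 trains
    # a pair classifier on those labels and scores the remaining pairs.
    from sklearn.linear_model import LogisticRegression

    def pair_features(a, b):
        """Cheap pairwise features (assumed feature set)."""
        return [float(a["surname"] == b["surname"]),
                float(bool(set(a["coauthors"]) & set(b["coauthors"]))),
                float(a["venue"] == b["venue"])]

    def bootstrap_label(a, b):
        """High-precision rule: 1 = same author, 0 = different, None = unknown."""
        if a["email"] and a["email"] == b["email"]:
            return 1
        if a["surname"] != b["surname"]:
            return 0
        return None

    mentions = [
        {"surname": "Liu", "coauthors": ["Yu"], "venue": "JASIST", "email": "l@x"},
        {"surname": "Liu", "coauthors": ["Yu"], "venue": "JASIST", "email": "l@x"},
        {"surname": "Chen", "coauthors": ["Ng"], "venue": "SIGIR", "email": None},
        {"surname": "Liu", "coauthors": ["Ng"], "venue": "SIGIR", "email": None},
    ]

    X, y, unlabelled = [], [], []
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            feats = pair_features(mentions[i], mentions[j])
            label = bootstrap_label(mentions[i], mentions[j])
            if label is None:
                unlabelled.append(feats)
            else:
                X.append(feats)
                y.append(label)

    clf = LogisticRegression().fit(X, y)
    print(clf.predict(unlabelled))  # decisions for the unresolved pairs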

Types

  • a 83
  • el 6
  • m 4
  • s 3
  • p 2
  • x 1