Search (3079 results, page 2 of 154)

  • Active filter: type_ss:"a"
  1. Fox, K.L.; Frieder, O.; Knepper, M.M.; Snowberg, E.J.: SENTINEL: a multiple engine information retrieval and visualization system (1999) 0.05
    0.048865046 = product of:
      0.24432522 = sum of:
        0.24432522 = weight(_text_:grams in 3547) [ClassicSimilarity], result of:
          0.24432522 = score(doc=3547,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.6233048 = fieldWeight in 3547, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3547)
      0.2 = coord(1/5)
    
    Abstract
    We describe a prototype information retrieval system, SENTINEL, under development at Harris Corporation's Information Systems Division. SENTINEL is a fusion of multiple information retrieval technologies, integrating n-grams, a vector space model, and a neural network training rule. One of the primary advantages of SENTINEL is its 3-dimensional visualization capability, which is based fully upon the mathematical representation of information within SENTINEL. The 3-dimensional visualization capability provides users with an intuitive understanding, so that relevance/query refinement techniques can be better utilized, resulting in higher retrieval precision.
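    The score breakdown shown above for this result is a ClassicSimilarity (TF-IDF) explain tree. As a quick check, its arithmetic can be reproduced directly; the sketch below is illustrative, the variable names are mine, and the constants are copied from the breakdown for the term "grams" in doc 3547.

```python
# Minimal sketch reproducing the ClassicSimilarity breakdown displayed for result 1.
import math

tf_value   = math.sqrt(2.0)     # tf(freq=2.0) = sqrt(freq) = 1.4142135
idf        = 8.059301           # idf(docFreq=37, maxDocs=44218)
query_norm = 0.04863741         # queryNorm
field_norm = 0.0546875          # fieldNorm(doc=3547)
coord      = 1 / 5              # coord(1/5): one of five query clauses matched

query_weight = idf * query_norm                # 0.39198354
field_weight = tf_value * idf * field_norm     # 0.6233048
score        = query_weight * field_weight * coord

print(round(score, 9))          # ~0.048865046, the value shown in the result list

# The idf itself follows ClassicSimilarity's formula idf = 1 + ln(maxDocs / (docFreq + 1)):
assert abs((1 + math.log(44218 / (37 + 1))) - idf) < 1e-4
```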
  2. WordHoard (n.d.) 0.05
    0.048865046 = product of:
      0.24432522 = sum of:
        0.24432522 = weight(_text_:grams in 3922) [ClassicSimilarity], result of:
          0.24432522 = score(doc=3922,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.6233048 = fieldWeight in 3922, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3922)
      0.2 = coord(1/5)
    
    Abstract
    WordHoard defines a multiword unit as a special type of collocate in which the component words comprise a meaningful phrase. For example, "Knight of the Round Table" is a meaningful multiword unit or phrase. WordHoard uses the notion of a pseudo-bigram to generalize the computation of bigram (two word) statistical measures to phrases (n-grams) longer than two words, and to allow comparisons of these measures for phrases with different word counts. WordHoard applies the localmaxs algorithm of Silva et al. to the pseudo-bigrams to identify potential compositional phrases that "stand out" in a text. WordHoard can also filter two and three word phrases using the word class filters suggested by Justeson and Katz.
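    The pseudo-bigram idea described above can be illustrated in a few lines: an n-gram is treated as a "bigram" whose two halves range over every possible split point, so a two-word association measure also applies to longer phrases. The sketch below uses symmetric conditional probability with an averaged split as the association measure; this is one common formulation, offered as an assumption, not WordHoard's exact measure set or the full localmaxs algorithm.

```python
# Sketch of a pseudo-bigram association measure for word n-grams.
from collections import Counter

def ngram_probs(tokens, max_n):
    """Approximate relative frequencies of all 1..max_n word n-grams."""
    total = len(tokens)
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(total - n + 1)
    )
    return {g: c / total for g, c in counts.items()}

def pseudo_bigram_scp(phrase, probs):
    """SCP with a 'fair' split: average p(prefix) * p(suffix) over all split points."""
    n = len(phrase)
    p_phrase = probs.get(phrase, 0.0)
    if n < 2 or p_phrase == 0.0:
        return 0.0
    avg_split = sum(
        probs.get(phrase[:i], 0.0) * probs.get(phrase[i:], 0.0)
        for i in range(1, n)
    ) / (n - 1)
    return p_phrase ** 2 / avg_split if avg_split else 0.0

tokens = "knight of the round table spoke to the knight of the round table".split()
probs = ngram_probs(tokens, max_n=5)
print(pseudo_bigram_scp(("knight", "of", "the", "round", "table"), probs))
```

    Phrases whose score "stands out" against the scores of contained and containing phrases are the candidates a localmaxs-style selection would keep.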
  3. WordHoard: finding multiword units (20??) 0.05
    0.048865046 = product of:
      0.24432522 = sum of:
        0.24432522 = weight(_text_:grams in 1123) [ClassicSimilarity], result of:
          0.24432522 = score(doc=1123,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.6233048 = fieldWeight in 1123, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1123)
      0.2 = coord(1/5)
    
    Abstract
    WordHoard defines a multiword unit as a special type of collocate in which the component words comprise a meaningful phrase. For example, "Knight of the Round Table" is a meaningful multiword unit or phrase. WordHoard uses the notion of a pseudo-bigram to generalize the computation of bigram (two word) statistical measures to phrases (n-grams) longer than two words, and to allow comparisons of these measures for phrases with different word counts. WordHoard applies the localmaxs algorithm of Silva et al. to the pseudo-bigrams to identify potential compositional phrases that "stand out" in a text. WordHoard can also filter two and three word phrases using the word class filters suggested by Justeson and Katz.
  4. Mas, S.; Marleau, Y.: Proposition of a faceted classification model to support corporate information organization and digital records management (2009) 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 2918) [ClassicSimilarity], result of:
          0.23174728 = score(doc=2918,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 2918, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=2918)
      0.2 = coord(1/5)
    
    Footnote
    Cf.: http://ieeexplore.ieee.org/Xplore/login.jsp?reload=true&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F4755313%2F4755314%2F04755480.pdf%3Farnumber%3D4755480&authDecision=-203.
  5. Li, L.; Shang, Y.; Zhang, W.: Improvement of HITS-based algorithms on Web documents 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 2514) [ClassicSimilarity], result of:
          0.23174728 = score(doc=2514,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 2514, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=2514)
      0.2 = coord(1/5)
    
    Content
    Cf.: http://delab.csd.auth.gr/~dimitris/courses/ir_spring06/page_rank_computing/p527-li.pdf. Cf. also: http://www2002.org/CDROM/refereed/643/.
  6. Zeng, Q.; Yu, M.; Yu, W.; Xiong, J.; Shi, Y.; Jiang, M.: Faceted hierarchy : a new graph type to organize scientific concepts and a construction method (2019) 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 400) [ClassicSimilarity], result of:
          0.23174728 = score(doc=400,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 400, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=400)
      0.2 = coord(1/5)
    
    Content
    Cf.: https://aclanthology.org/D19-5317.pdf.
  7. Suchenwirth, L.: Sacherschliessung in Zeiten von Corona : neue Herausforderungen und Chancen (2019) 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 484) [ClassicSimilarity], result of:
          0.23174728 = score(doc=484,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 484, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=484)
      0.2 = coord(1/5)
    
    Footnote
    https://journals.univie.ac.at/index.php/voebm/article/download/5332/5271/
  8. Noever, D.; Ciolino, M.: ¬The Turing deception (2022) 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 862) [ClassicSimilarity], result of:
          0.23174728 = score(doc=862,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 862, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=862)
      0.2 = coord(1/5)
    
    Source
    https://arxiv.org/abs/2212.06721
  9. Pearce, C.; Nicholas, C.: TELLTALE: Experiments in a dynamic hypertext environment for degraded and multilingual data (1996) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 4071) [ClassicSimilarity], result of:
          0.2094216 = score(doc=4071,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 4071, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=4071)
      0.2 = coord(1/5)
    
    Abstract
    Methods and tools for finding documents relevant to a user's needs in document corpora can be found in the information retrieval, library science, and hypertext communities. Typically, these systems provide retrieval capabilities for fairly static corpora, their algorithms are dependent on the language for which they are written, e.g. English, and they do not perform well when presented with misspelled words or text that has been degraded by OCR techniques. In this article, we present experimentation results for the TELLTALE system. TELLTALE is a dynamic hypertext environment that provides full-text search from a hypertext-style user interface for text corpora that may be garbled by OCR or transmission errors, and that may contain languages other than English. TELLTALE uses several techniques based on n-grams (n-character sequences of text). With these results we show that the dynamic linkage mechanisms in TELLTALE are tolerant of garbles in up to 30% of the characters in the body of the texts.
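    The character n-gram matching described in the abstract can be sketched as follows: documents are compared via overlapping character sequences rather than whole words, so a misspelling or OCR garble perturbs only a few n-grams instead of destroying entire terms. The parameter choices (n = 5, cosine similarity) are illustrative assumptions, not TELLTALE's published settings.

```python
# Sketch: character n-gram profiles tolerate garbled text.
from collections import Counter
from math import sqrt

def char_ngrams(text, n=5):
    text = " ".join(text.lower().split())        # normalise whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

clean   = "information retrieval for multilingual document collections"
garbled = "inforrnation retr1eval for multilingual docurnent collections"  # OCR-style errors

print(cosine(char_ngrams(clean), char_ngrams(garbled)))    # remains well above zero
print(cosine(char_ngrams(clean), char_ngrams("completely unrelated text")))
```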
  10. Haas, S.W.; Grams, E.S.: Readers, authors, and page structure : a discussion of four questions arising from a content analysis of Web pages (2000) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 4387) [ClassicSimilarity], result of:
          0.2094216 = score(doc=4387,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 4387, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=4387)
      0.2 = coord(1/5)
    
  11. Westerman, S.J.; Cribbin, T.; Collins, J.: Human assessments of document similarity (2010) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 3915) [ClassicSimilarity], result of:
          0.2094216 = score(doc=3915,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 3915, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=3915)
      0.2 = coord(1/5)
    
    Object
    n-grams
  12. Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: ¬The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 3015) [ClassicSimilarity], result of:
          0.2094216 = score(doc=3015,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 3015, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=3015)
      0.2 = coord(1/5)
    
    Abstract
    We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question of whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
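    The pipeline type named in the abstract (n-gram feature extraction followed by automatic text classification) can be reduced to a schematic sketch: word bigram counts and a nearest-centroid classifier. The actual study uses richer features and classifiers; the toy texts, labels, and classifier here are illustrative assumptions only.

```python
# Schematic sketch: n-gram features + a trivial text classifier.
from collections import Counter
from math import sqrt

def bigram_features(text):
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

training = {   # toy "discipline" samples
    "computational linguistics": "we parse the corpus and tag each sentence with part of speech labels",
    "microelectronics": "the circuit layout minimises gate delay on the silicon wafer",
}
centroids = {label: bigram_features(text) for label, text in training.items()}

def classify(text):
    feats = bigram_features(text)
    return max(centroids, key=lambda label: cosine(feats, centroids[label]))

print(classify("we tag each sentence in the corpus"))   # -> computational linguistics
```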
  13. Lin, Y.-R.; Margolin, D.; Lazer, D.: Uncovering social semantics from textual traces : a theory-driven approach and evidence from public statements of U.S. Members of Congress (2016) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 3078) [ClassicSimilarity], result of:
          0.2094216 = score(doc=3078,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 3078, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=3078)
      0.2 = coord(1/5)
    
    Abstract
    The increasing abundance of digital textual archives provides an opportunity for understanding human social systems. Yet the literature has not adequately considered the disparate social processes by which texts are produced. Drawing on communication theory, we identify three common processes by which documents might be detectably similar in their textual features: authors sharing subject matter, sharing goals, and sharing sources. We hypothesize that these processes produce distinct, detectable relationships between authors in different kinds of textual overlap. We develop a novel n-gram extraction technique to capture such signatures based on n-grams of different lengths. We test the hypothesis on a corpus where the author attributes are observable: the public statements of the members of the U.S. Congress. This article presents the first empirical finding that shows different social relationships are detectable through the structure of overlapping textual features. Our study has important implications for designing text modeling techniques to make sense of social phenomena from aggregate digital traces.
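    The overlap signature the abstract mentions can be illustrated with a small sketch: count the distinct word n-grams of several lengths that two statements share, which can then be related to a hypothesised social tie. The lengths, the overlap measure, and the sample statements are illustrative assumptions, not the article's exact technique.

```python
# Sketch: shared word n-grams of several lengths between two texts.
from collections import Counter

def ngrams(text, n):
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def shared_ngram_profile(text_a, text_b, lengths=(2, 3, 4)):
    """For each n, the number of distinct n-grams the two texts share."""
    return {n: len(set(ngrams(text_a, n)) & set(ngrams(text_b, n))) for n in lengths}

statement_a = "we must secure the border and support our small businesses"
statement_b = "our plan will secure the border and cut taxes for small businesses"
print(shared_ngram_profile(statement_a, statement_b))   # e.g. {2: 4, 3: 2, 4: 1}
```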
  14. Donsbach, W.: Wahrheit in den Medien : über den Sinn eines methodischen Objektivitätsbegriffes (2001) 0.04
    0.03862455 = product of:
      0.19312274 = sum of:
        0.19312274 = weight(_text_:3a in 5895) [ClassicSimilarity], result of:
          0.19312274 = score(doc=5895,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.46834838 = fieldWeight in 5895, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5895)
      0.2 = coord(1/5)
    
    Source
    Politische Meinung. 381(2001) Nr.1, S.65-74 [https://www.dgfe.de/fileadmin/OrdnerRedakteure/Sektionen/Sek02_AEW/KWF/Publikationen_Reihe_1989-2003/Band_17/Bd_17_1994_355-406_A.pdf]
  15. Malsburg, C. von der: ¬The correlation theory of brain function (1981) 0.04
    0.03862455 = product of:
      0.19312274 = sum of:
        0.19312274 = weight(_text_:3a in 76) [ClassicSimilarity], result of:
          0.19312274 = score(doc=76,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.46834838 = fieldWeight in 76, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.0390625 = fieldNorm(doc=76)
      0.2 = coord(1/5)
    
    Source
    http://cogprints.org/1380/1/vdM_correlation.pdf
  16. Ackermann, E.: Piaget's constructivism, Papert's constructionism : what's the difference? (2001) 0.04
    0.03862455 = product of:
      0.19312274 = sum of:
        0.19312274 = weight(_text_:3a in 692) [ClassicSimilarity], result of:
          0.19312274 = score(doc=692,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.46834838 = fieldWeight in 692, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.0390625 = fieldNorm(doc=692)
      0.2 = coord(1/5)
    
    Content
    Cf.: https://www.semanticscholar.org/paper/Piaget-%E2%80%99-s-Constructivism-%2C-Papert-%E2%80%99-s-%3A-What-%E2%80%99-s-Ackermann/89cbcc1e740a4591443ff4765a6ae8df0fdf5554. Further references to related contributions are listed there. Also in: Learning Group Publication 5(2001) no.3, p.438.
  17. Liu, X.; Croft, W.B.: Statistical language modeling for information retrieval (2004) 0.03
    0.034903605 = product of:
      0.17451802 = sum of:
        0.17451802 = weight(_text_:grams in 4277) [ClassicSimilarity], result of:
          0.17451802 = score(doc=4277,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.44521773 = fieldWeight in 4277, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4277)
      0.2 = coord(1/5)
    
    Abstract
    This chapter reviews research and applications in statistical language modeling for information retrieval (IR), which has emerged within the past several years as a new probabilistic framework for describing information retrieval processes. Generally speaking, statistical language modeling, or more simply language modeling (LM), involves estimating a probability distribution that captures statistical regularities of natural language use. Applied to information retrieval, language modeling refers to the problem of estimating the likelihood that a query and a document could have been generated by the same language model, given the language model of the document either with or without a language model of the query. The roots of statistical language modeling date to the beginning of the twentieth century when Markov tried to model letter sequences in works of Russian literature (Manning & Schütze, 1999). Zipf (1929, 1932, 1949, 1965) studied the statistical properties of text and discovered that the frequency of words decays as a power function of each word's rank. However, it was Shannon's (1951) work that inspired later research in this area. In 1951, eager to explore the applications of his newly founded information theory to human language, Shannon used a prediction game involving n-grams to investigate the information content of English text. He evaluated n-gram models' performance by comparing their cross-entropy on texts with the true entropy estimated using predictions made by human subjects. For many years, statistical language models have been used primarily for automatic speech recognition. Since 1980, when the first significant language model was proposed (Rosenfeld, 2000), statistical language modeling has become a fundamental component of speech recognition, machine translation, and spelling correction.
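    The query-likelihood idea the chapter surveys can be sketched concretely: each document is treated as a unigram language model, smoothed against the whole collection, and documents are ranked by the probability of generating the query. The Jelinek-Mercer mixing weight and toy documents below are illustrative choices, not a recommendation from the chapter.

```python
# Sketch: unigram query-likelihood ranking with Jelinek-Mercer smoothing.
from collections import Counter
from math import log

docs = {
    "d1": "statistical language modeling for information retrieval",
    "d2": "neural networks for image recognition",
}
doc_tokens  = {d: t.split() for d, t in docs.items()}
collection  = [w for toks in doc_tokens.values() for w in toks]
coll_counts = Counter(collection)

def query_log_likelihood(query, doc_id, lam=0.5):
    toks, counts = doc_tokens[doc_id], Counter(doc_tokens[doc_id])
    score = 0.0
    for w in query.split():
        p_doc  = counts[w] / len(toks)
        p_coll = coll_counts[w] / len(collection)
        score += log(lam * p_doc + (1 - lam) * p_coll + 1e-12)  # tiny floor for unseen words
    return score

query = "language modeling retrieval"
print(sorted(docs, key=lambda d: query_log_likelihood(query, d), reverse=True))  # d1 first
```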
  18. Dannenberg, R.B.; Birmingham, W.P.; Pardo, B.; Hu, N.; Meek, C.; Tzanetakis, G.: ¬A comparative evaluation of search techniques for query-by-humming using the MUSART testbed (2007) 0.03
    0.034903605 = product of:
      0.17451802 = sum of:
        0.17451802 = weight(_text_:grams in 269) [ClassicSimilarity], result of:
          0.17451802 = score(doc=269,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.44521773 = fieldWeight in 269, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=269)
      0.2 = coord(1/5)
    
    Abstract
    Query-by-humming systems offer content-based searching for melodies and require no special musical training or knowledge. Many such systems have been built, but there has not been much useful evaluation and comparison in the literature due to the lack of shared databases and queries. The MUSART project testbed allows various search algorithms to be compared using a shared framework that automatically runs experiments and summarizes results. Using this testbed, the authors compared algorithms based on string alignment, melodic contour matching, a hidden Markov model, n-grams, and CubyHum. Retrieval performance is very sensitive to distance functions and the representation of pitch and rhythm, which raises questions about some previously published conclusions. Some algorithms are particularly sensitive to the quality of queries. Our queries, which are taken from human subjects in a realistic setting, are quite difficult, especially for n-gram models. Finally, simulations on query-by-humming performance as a function of database size indicate that retrieval performance falls only slowly as the database size increases.
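    One of the compared algorithm families, n-grams, can be sketched in the melodic setting: a melody is reduced to its sequence of pitch intervals, interval trigrams are counted, and database candidates are ranked by overlap with the (possibly transposed or truncated) query. The representation and overlap measure are illustrative assumptions, not the MUSART algorithms themselves.

```python
# Sketch: melodic n-gram matching over pitch-interval sequences.
from collections import Counter

def interval_ngrams(midi_pitches, n=3):
    intervals = [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]
    return Counter(tuple(intervals[i:i + n]) for i in range(len(intervals) - n + 1))

def overlap(query, candidate, n=3):
    q, c = interval_ngrams(query, n), interval_ngrams(candidate, n)
    return sum(min(q[g], c[g]) for g in q)

database = {
    "tune_a": [60, 62, 64, 65, 67, 65, 64, 62, 60],
    "tune_b": [60, 60, 67, 67, 69, 69, 67],
}
hummed = [62, 64, 66, 67, 69, 67, 66]   # same contour as tune_a, transposed and shortened
print(max(database, key=lambda name: overlap(hummed, database[name])))   # -> tune_a
```

    Working on intervals rather than absolute pitches is what makes the match insensitive to the singer's key.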
  19. Hmeidi, I.I.; Al-Shalabi, R.F.; Al-Taani, A.T.; Najadat, H.; Al-Hazaimeh, S.A.: ¬A novel approach to the extraction of roots from Arabic words using bigrams (2010) 0.03
    0.034903605 = product of:
      0.17451802 = sum of:
        0.17451802 = weight(_text_:grams in 3426) [ClassicSimilarity], result of:
          0.17451802 = score(doc=3426,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.44521773 = fieldWeight in 3426, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3426)
      0.2 = coord(1/5)
    
    Abstract
    Root extraction is one of the most important topics in information retrieval (IR), natural language processing (NLP), text summarization, and many other important fields. In the last two decades, several algorithms have been proposed to extract Arabic roots. Most of these algorithms dealt with triliteral roots only, and some with fixed length words only. In this study, a novel approach to the extraction of roots from Arabic words using bigrams is proposed. Two similarity measures are used, the dissimilarity measure called the Manhattan distance, and Dice's measure of similarity. The proposed algorithm is tested on the Holy Qu'ran and on a corpus of 242 abstracts from the Proceedings of the Saudi Arabian National Computer Conferences. The two files used contain a wide range of data: the Holy Qu'ran contains most of the ancient Arabic words while the other file contains some modern Arabic words and some words borrowed from foreign languages in addition to the original Arabic words. The results of this study showed that combining N-grams with the Dice measure gives better results than using the Manhattan distance measure.
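    The bigram comparison the abstract describes can be sketched briefly: a word and a candidate root are both decomposed into character bigrams and compared with Dice's coefficient, and the best-scoring root is selected. Transliterated strings stand in for Arabic script here, and the candidate list is invented for the example.

```python
# Sketch: Dice's coefficient over character bigrams for root selection.
def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

def dice(a, b):
    bg_a, bg_b = bigrams(a), bigrams(b)
    shared = sum(1 for g in bg_a if g in bg_b)
    return 2 * shared / (len(bg_a) + len(bg_b)) if (bg_a and bg_b) else 0.0

candidate_roots = ["ktb", "drs", "slm"]    # hypothetical triliteral roots
word = "maktab"                            # derived from the root "ktb"
best = max(candidate_roots, key=lambda r: dice(word, r))
print(best, dice(word, best))              # -> ktb 0.2857...
```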
  20. Yu, L.-C.; Wu, C.-H.; Chang, R.-Y.; Liu, C.-H.; Hovy, E.H.: Annotation and verification of sense pools in OntoNotes (2010) 0.03
    0.034903605 = product of:
      0.17451802 = sum of:
        0.17451802 = weight(_text_:grams in 4236) [ClassicSimilarity], result of:
          0.17451802 = score(doc=4236,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.44521773 = fieldWeight in 4236, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4236)
      0.2 = coord(1/5)
    
    Abstract
    The paper describes OntoNotes, a multilingual (English, Chinese and Arabic) corpus with large-scale semantic annotations, including predicate-argument structure, word senses, ontology linking, and coreference. The underlying semantic model of OntoNotes involves word senses that are grouped into so-called sense pools, i.e., sets of near-synonymous senses of words. Such information is useful for many applications, including query expansion for information retrieval (IR) systems, (near-)duplicate detection for text summarization systems, and alternative word selection for writing support systems. Although a sense pool provides a set of near-synonymous senses of words, there is still no knowledge about whether two words in a pool are interchangeable in practical use. Therefore, this paper devises an unsupervised algorithm that incorporates Google n-grams and a statistical test to determine whether a word in a pool can be substituted by other words in the same pool. The n-gram features are used to measure the degree of context mismatch for a substitution. The statistical test is then applied to determine whether the substitution is adequate based on the degree of mismatch. The proposed method is compared with a supervised method, namely Linear Discriminant Analysis (LDA). Experimental results show that the proposed unsupervised method can achieve comparable performance with the supervised method.
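    The context-mismatch check described in the abstract can be sketched as follows: the n-gram around a target word is looked up in a reference count table, first with the original word and then with a near-synonym substituted, and the substitution is accepted if the substituted n-gram is still reasonably frequent. The toy count table and the simple ratio threshold below stand in for the Google n-grams and the statistical test used in the paper.

```python
# Sketch: accepting or rejecting a substitution via n-gram context mismatch.
reference_trigram_counts = {   # hypothetical counts
    ("strong", "coffee", "please"): 900,
    ("powerful", "coffee", "please"): 12,
    ("strong", "argument", "here"): 300,
    ("powerful", "argument", "here"): 280,
}

def substitute_word(context, target, replacement):
    return tuple(replacement if w == target else w for w in context)

def substitutable(context, target, substitute, min_ratio=0.2):
    c_orig = reference_trigram_counts.get(tuple(context), 0)
    c_sub  = reference_trigram_counts.get(substitute_word(context, target, substitute), 0)
    return c_orig > 0 and c_sub / c_orig >= min_ratio

print(substitutable(["strong", "coffee", "please"], "strong", "powerful"))    # False
print(substitutable(["strong", "argument", "here"], "strong", "powerful"))    # True
```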

Types

  • el 73
  • b 34
  • p 1