Yu, L.-C.; Wu, C.-H.; Chang, R.-Y.; Liu, C.-H.; Hovy, E.H.: Annotation and verification of sense pools in OntoNotes (2010)
- Abstract
- The paper describes OntoNotes, a multilingual (English, Chinese and Arabic) corpus with large-scale semantic annotations, including predicate-argument structure, word senses, ontology linking, and coreference. The underlying semantic model of OntoNotes groups word senses into so-called sense pools, i.e., sets of near-synonymous senses of words. Such information is useful for many applications, including query expansion for information retrieval (IR) systems, (near-)duplicate detection for text summarization systems, and alternative word selection for writing support systems. Although a sense pool provides a set of near-synonymous senses of words, it does not indicate whether two words in a pool are interchangeable in practical use. Therefore, this paper devises an unsupervised algorithm that combines Google n-grams with a statistical test to determine whether a word in a pool can be substituted by other words in the same pool. The n-gram features measure the degree of context mismatch for a substitution. The statistical test then determines, based on that degree of mismatch, whether the substitution is adequate. The proposed method is compared with a supervised method, namely Linear Discriminant Analysis (LDA). Experimental results show that the proposed unsupervised method achieves performance comparable to that of the supervised method.
- Source
- Information processing and management. 46(2010) no.4, pp. 436-447
- Type
- a