Sánchez, D.; Batet, M.; Valls, A.; Gibert, K.: Ontology-driven web-based semantic similarity (2010)
- Abstract
- Estimating the degree of semantic similarity/distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval or data mining. In the past, many similarity measures have been proposed, exploiting explicit knowledge (such as the structure of a taxonomy) or implicit knowledge (such as information distribution). In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, their excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating concept statistical distribution from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities in order to compute accurate concept appearance probabilities. This limits the applicability of those measures to other ontologies (like specific domain ontologies) and massive corpora (like the Web). In this paper, several of the presented issues are analyzed. Modifications of classical similarity measures are also proposed. They are based on a contextualized and scalable version of IC computation in the Web by exploiting taxonomical knowledge. The goal is to avoid the measures' dependency on corpus pre-processing in order to achieve reliable results and minimize language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities.
- Source
- Journal of intelligent information systems. 35(2010) no.x, pp.383-413