Xu, J.; Weischedel, R.: Empirical studies on the impact of lexical resources on CLIR performance (2005)
- Abstract
- In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include the following. One can achieve acceptable CLIR performance using only a bilingual term list (70-80% on Chinese and Arabic corpora). However, if both a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance. If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially compensate for the lack of parallel text. Although stemming is normally useful, it hurt performance in our empirical studies with Arabic, a highly inflected language, when a very large Arabic-English parallel corpus was available.
- Type
- a