Muneer, I.; Sharjeel, M.; Iqbal, M.; Adeel Nawab, R.M.; Rayson, P.: CLEU - A Cross-language english-urdu corpus and benchmark for text reuse experiments (2019)
0.00
0.0020714647 = product of:
0.0041429293 = sum of:
0.0041429293 = product of:
0.008285859 = sum of:
0.008285859 = weight(_text_:a in 5299) [ClassicSimilarity], result of:
0.008285859 = score(doc=5299,freq=12.0), product of:
0.053105544 = queryWeight, product of:
1.153047 = idf(docFreq=37942, maxDocs=44218)
0.046056706 = queryNorm
0.15602624 = fieldWeight in 5299, product of:
3.4641016 = tf(freq=12.0), with freq of:
12.0 = termFreq=12.0
1.153047 = idf(docFreq=37942, maxDocs=44218)
0.0390625 = fieldNorm(doc=5299)
0.5 = coord(1/2)
0.5 = coord(1/2)
- Abstract
- Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories that is near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1=0.732 binary, f1=0.552 ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
- Type
- a