This database contains more than 40,000 documents on topics from the fields of descriptive cataloguing, subject indexing, and information retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft / Powered by litecat, BIS Oldenburg (as of 4 June 2021)
1. Muneer, I. ; Sharjeel, M. ; Iqbal, M. ; Adeel Nawab, R.M. ; Rayson, P.: CLEU - A cross-language English-Urdu corpus and benchmark for text reuse experiments.
In: Journal of the Association for Information Science and Technology. 70(2019) no.7, pp.729-741.
Abstract: Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories that is near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1=0.732 binary, f1=0.552 ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
Content: Cf.: https://onlinelibrary.wiley.com/doi/10.1002/asi.24074.
2. Rayson, P. ; Piao, S. ; Sharoff, S. ; Evert, S. ; Moiron, B.V.: Multiword expressions : hard going or plain sailing?
In: Language resources and evaluation. 44(2010) no.1, pp.1-5.
Abstract: Over the past two decades or so, Multi-Word Expressions (MWEs; also called Multi-word Units) have been an increasingly important concern for Computational Linguistics and Natural Language Processing (NLP). The term MWE has been used to refer to various types of linguistic units and expressions, including idioms, noun compounds, phrasal verbs, light verbs and other habitual collocations. However, while there is no universally agreed definition for MWE as yet, most researchers use the term to refer to those frequently occurring phrasal units which are subject to certain level of semantic opaqueness, or non-compositionality. Non-compositional MWEs pose tough challenges for automatic analysis because their interpretation cannot be achieved by directly combining the semantics of their constituents, thereby causing the "pain in the neck of NLP".