Document (#33022)

Author
Xu, J.
Weischedel, R.
Title
Empirical studies on the impact of lexical resources on CLIR performance
Source
Information processing and management. 41(2005) no.3, pp.475-488
Year
2005
Abstract
In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora in Chinese, Spanish and Arabic. Our findings include the following. Acceptable CLIR performance can be achieved using only a bilingual term list (70-80% on Chinese and Arabic corpora). However, if both a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance. If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially compensate for the lack of parallel text. Although stemming is normally useful, it hurt performance in our empirical studies on Arabic, a highly inflected language, when a very large Arabic-English parallel corpus was available.
Theme
Multilingual problems

Similar documents (content)

  1. Larkey, L.S.; Connell, M.E.: Structured queries, language modelling, and relevance modelling in cross-language information retrieval (2005) 0.33
    0.32786155 = sum of:
      0.32786155 = product of:
        1.0245674 = sum of:
          0.07087058 = weight(abstract_txt:pseudo in 3023) [ClassicSimilarity], result of:
            0.07087058 = score(doc=3023,freq=2.0), product of:
              0.10071368 = queryWeight, product of:
                1.1043615 = boost
                7.9612727 = idf(docFreq=40, maxDocs=43254)
                0.011454988 = queryNorm
              0.70368373 = fieldWeight in 3023, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.9612727 = idf(docFreq=40, maxDocs=43254)
                0.0625 = fieldNorm(doc=3023)
          0.05260929 = weight(abstract_txt:lingual in 3023) [ClassicSimilarity], result of:
            0.05260929 = score(doc=3023,freq=1.0), product of:
              0.104031 = queryWeight, product of:
                1.1224021 = boost
                8.091326 = idf(docFreq=35, maxDocs=43254)
                0.011454988 = queryNorm
              0.50570786 = fieldWeight in 3023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.091326 = idf(docFreq=35, maxDocs=43254)
                0.0625 = fieldNorm(doc=3023)
          0.035844587 = weight(abstract_txt:language in 3023) [ClassicSimilarity], result of:
            0.035844587 = score(doc=3023,freq=6.0), product of:
              0.055850845 = queryWeight, product of:
                1.1630461 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.011454988 = queryNorm
              0.6417913 = fieldWeight in 3023, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.0625 = fieldNorm(doc=3023)
          0.064817294 = weight(abstract_txt:corpus in 3023) [ClassicSimilarity], result of:
            0.064817294 = score(doc=3023,freq=2.0), product of:
              0.11955885 = queryWeight, product of:
                1.7016605 = boost
                6.1335816 = idf(docFreq=254, maxDocs=43254)
                0.011454988 = queryNorm
              0.54213715 = fieldWeight in 3023, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.1335816 = idf(docFreq=254, maxDocs=43254)
                0.0625 = fieldNorm(doc=3023)
          0.13243553 = weight(abstract_txt:bilingual in 3023) [ClassicSimilarity], result of:
            0.13243553 = score(doc=3023,freq=1.0), product of:
              0.27765015 = queryWeight, product of:
                3.1759717 = boost
                7.6317935 = idf(docFreq=56, maxDocs=43254)
                0.011454988 = queryNorm
              0.4769871 = fieldWeight in 3023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6317935 = idf(docFreq=56, maxDocs=43254)
                0.0625 = fieldNorm(doc=3023)
          0.17419776 = weight(abstract_txt:arabic in 3023) [ClassicSimilarity], result of:
            0.17419776 = score(doc=3023,freq=1.0), product of:
              0.3668621 = queryWeight, product of:
                4.215494 = boost
                7.5973077 = idf(docFreq=58, maxDocs=43254)
                0.011454988 = queryNorm
              0.47483173 = fieldWeight in 3023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5973077 = idf(docFreq=58, maxDocs=43254)
                0.0625 = fieldNorm(doc=3023)
          0.22216809 = weight(abstract_txt:parallel in 3023) [ClassicSimilarity], result of:
            0.22216809 = score(doc=3023,freq=2.0), product of:
              0.39199692 = queryWeight, product of:
                5.336838 = boost
                6.4121547 = idf(docFreq=192, maxDocs=43254)
                0.011454988 = queryNorm
              0.56675977 = fieldWeight in 3023, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.4121547 = idf(docFreq=192, maxDocs=43254)
                0.0625 = fieldNorm(doc=3023)
          0.2716242 = weight(abstract_txt:clir in 3023) [ClassicSimilarity], result of:
            0.2716242 = score(doc=3023,freq=1.0), product of:
              0.53140235 = queryWeight, product of:
                5.6723604 = boost
                8.178337 = idf(docFreq=32, maxDocs=43254)
                0.011454988 = queryNorm
              0.51114607 = fieldWeight in 3023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.178337 = idf(docFreq=32, maxDocs=43254)
                0.0625 = fieldNorm(doc=3023)
        0.32 = coord(8/25)
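
The nested explanation above follows Lucene's ClassicSimilarity (TF-IDF) scoring, and its figures can be recomputed by hand. The sketch below reproduces the `pseudo` term weight for doc 3023 from the values printed in the tree. The function names `idf`, `tf`, and `term_score` are illustrative and not part of Lucene's API. The formula idf = 1 + ln(maxDocs / (docFreq + 1)) is assumed here because it matches the idf values listed. Lucene computes in 32-bit floats, so the last digits may differ slightly.

```python
import math

def idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency, assumed form: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def tf(freq: float) -> float:
    # Term frequency factor: square root of the raw term count.
    return math.sqrt(freq)

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in the explanation tree.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm      # boost * idf * queryNorm
    field_weight = tf(freq) * term_idf * field_norm   # tf * idf * fieldNorm
    return query_weight * field_weight

# Inputs copied from the "abstract_txt:pseudo in 3023" branch above.
pseudo_3023 = term_score(
    freq=2.0,
    doc_freq=40,
    max_docs=43254,
    boost=1.1043615,
    query_norm=0.011454988,
    field_norm=0.0625,
)
print(pseudo_3023)  # compare with 0.07087058 in the tree
```

Because idf enters both queryWeight and fieldWeight, rare terms such as `clir` (docFreq=32) contribute far more to the score than common ones such as `language` (docFreq=1776).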
    
  2. Yang, C.C.; Li, K.W.: Automatic construction of English/Chinese parallel corpora (2003) 0.26
    0.2642633 = sum of:
      0.2642633 = product of:
        0.82582283 = sum of:
          0.06510068 = weight(abstract_txt:lingual in 3684) [ClassicSimilarity], result of:
            0.06510068 = score(doc=3684,freq=2.0), product of:
              0.104031 = queryWeight, product of:
                1.1224021 = boost
                8.091326 = idf(docFreq=35, maxDocs=43254)
                0.011454988 = queryNorm
              0.62578154 = fieldWeight in 3684, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.091326 = idf(docFreq=35, maxDocs=43254)
                0.0546875 = fieldNorm(doc=3684)
          0.025608608 = weight(abstract_txt:language in 3684) [ClassicSimilarity], result of:
            0.025608608 = score(doc=3684,freq=4.0), product of:
              0.055850845 = queryWeight, product of:
                1.1630461 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.011454988 = queryNorm
              0.45851782 = fieldWeight in 3684, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.0546875 = fieldNorm(doc=3684)
          0.06946156 = weight(abstract_txt:corpus in 3684) [ClassicSimilarity], result of:
            0.06946156 = score(doc=3684,freq=3.0), product of:
              0.11955885 = queryWeight, product of:
                1.7016605 = boost
                6.1335816 = idf(docFreq=254, maxDocs=43254)
                0.011454988 = queryNorm
              0.5809822 = fieldWeight in 3684, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.1335816 = idf(docFreq=254, maxDocs=43254)
                0.0546875 = fieldNorm(doc=3684)
          0.08786954 = weight(abstract_txt:chinese in 3684) [ClassicSimilarity], result of:
            0.08786954 = score(doc=3684,freq=4.0), product of:
              0.12705682 = queryWeight, product of:
                1.754208 = boost
                6.322987 = idf(docFreq=210, maxDocs=43254)
                0.011454988 = queryNorm
              0.6915767 = fieldWeight in 3684, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.322987 = idf(docFreq=210, maxDocs=43254)
                0.0546875 = fieldNorm(doc=3684)
          0.11588109 = weight(abstract_txt:bilingual in 3684) [ClassicSimilarity], result of:
            0.11588109 = score(doc=3684,freq=1.0), product of:
              0.27765015 = queryWeight, product of:
                3.1759717 = boost
                7.6317935 = idf(docFreq=56, maxDocs=43254)
                0.011454988 = queryNorm
              0.4173637 = fieldWeight in 3684, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6317935 = idf(docFreq=56, maxDocs=43254)
                0.0546875 = fieldNorm(doc=3684)
          0.05201201 = weight(abstract_txt:performance in 3684) [ClassicSimilarity], result of:
            0.05201201 = score(doc=3684,freq=1.0), product of:
              0.20506991 = queryWeight, product of:
                3.8600566 = boost
                4.6378174 = idf(docFreq=1137, maxDocs=43254)
                0.011454988 = queryNorm
              0.25363064 = fieldWeight in 3684, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6378174 = idf(docFreq=1137, maxDocs=43254)
                0.0546875 = fieldNorm(doc=3684)
          0.2154923 = weight(abstract_txt:corpora in 3684) [ClassicSimilarity], result of:
            0.2154923 = score(doc=3684,freq=3.0), product of:
              0.3204177 = queryWeight, product of:
                3.9396288 = boost
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.011454988 = queryNorm
              0.67253554 = fieldWeight in 3684, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.0546875 = fieldNorm(doc=3684)
          0.19439706 = weight(abstract_txt:parallel in 3684) [ClassicSimilarity], result of:
            0.19439706 = score(doc=3684,freq=2.0), product of:
              0.39199692 = queryWeight, product of:
                5.336838 = boost
                6.4121547 = idf(docFreq=192, maxDocs=43254)
                0.011454988 = queryNorm
              0.4959148 = fieldWeight in 3684, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.4121547 = idf(docFreq=192, maxDocs=43254)
                0.0546875 = fieldNorm(doc=3684)
        0.32 = coord(8/25)
    
  3. Chen, J.: A lexical knowledge base approach for English-Chinese cross-language information retrieval (2006) 0.24
    0.24295893 = sum of:
      0.24295893 = product of:
        0.8677105 = sum of:
          0.02069488 = weight(abstract_txt:language in 924) [ClassicSimilarity], result of:
            0.02069488 = score(doc=924,freq=2.0), product of:
              0.055850845 = queryWeight, product of:
                1.1630461 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.011454988 = queryNorm
              0.37053835 = fieldWeight in 924, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.0625 = fieldNorm(doc=924)
          0.021207742 = weight(abstract_txt:resources in 924) [ClassicSimilarity], result of:
            0.021207742 = score(doc=924,freq=2.0), product of:
              0.05676981 = queryWeight, product of:
                1.1725755 = boost
                4.226511 = idf(docFreq=1716, maxDocs=43254)
                0.011454988 = queryNorm
              0.37357432 = fieldWeight in 924, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.226511 = idf(docFreq=1716, maxDocs=43254)
                0.0625 = fieldNorm(doc=924)
          0.015836002 = weight(abstract_txt:available in 924) [ClassicSimilarity], result of:
            0.015836002 = score(doc=924,freq=1.0), product of:
              0.058870107 = queryWeight, product of:
                1.1940691 = boost
                4.3039846 = idf(docFreq=1588, maxDocs=43254)
                0.011454988 = queryNorm
              0.26899904 = fieldWeight in 924, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3039846 = idf(docFreq=1588, maxDocs=43254)
                0.0625 = fieldNorm(doc=924)
          0.028635236 = weight(abstract_txt:empirical in 924) [ClassicSimilarity], result of:
            0.028635236 = score(doc=924,freq=1.0), product of:
              0.087377235 = queryWeight, product of:
                1.4547261 = boost
                5.243514 = idf(docFreq=620, maxDocs=43254)
                0.011454988 = queryNorm
              0.32771963 = fieldWeight in 924, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.243514 = idf(docFreq=620, maxDocs=43254)
                0.0625 = fieldNorm(doc=924)
          0.07100931 = weight(abstract_txt:chinese in 924) [ClassicSimilarity], result of:
            0.07100931 = score(doc=924,freq=2.0), product of:
              0.12705682 = queryWeight, product of:
                1.754208 = boost
                6.322987 = idf(docFreq=210, maxDocs=43254)
                0.011454988 = queryNorm
              0.55887836 = fieldWeight in 924, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.322987 = idf(docFreq=210, maxDocs=43254)
                0.0625 = fieldNorm(doc=924)
          0.102957085 = weight(abstract_txt:performance in 924) [ClassicSimilarity], result of:
            0.102957085 = score(doc=924,freq=3.0), product of:
              0.20506991 = queryWeight, product of:
                3.8600566 = boost
                4.6378174 = idf(docFreq=1137, maxDocs=43254)
                0.011454988 = queryNorm
              0.50205845 = fieldWeight in 924, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6378174 = idf(docFreq=1137, maxDocs=43254)
                0.0625 = fieldNorm(doc=924)
          0.6073702 = weight(abstract_txt:clir in 924) [ClassicSimilarity], result of:
            0.6073702 = score(doc=924,freq=5.0), product of:
              0.53140235 = queryWeight, product of:
                5.6723604 = boost
                8.178337 = idf(docFreq=32, maxDocs=43254)
                0.011454988 = queryNorm
              1.1429573 = fieldWeight in 924, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.178337 = idf(docFreq=32, maxDocs=43254)
                0.0625 = fieldNorm(doc=924)
        0.28 = coord(7/25)
    
  4. Perea-Ortega, J.M.; Martín-Valdivia, M.T.; Ureña-López, L.A.; Martínez-Cámara, E.: Improving polarity classification of bilingual parallel corpora combining machine learning and semantic orientation approaches (2013) 0.24
    0.2380439 = sum of:
      0.2380439 = product of:
        0.8501568 = sum of:
          0.014996139 = weight(abstract_txt:resources in 2510) [ClassicSimilarity], result of:
            0.014996139 = score(doc=2510,freq=1.0), product of:
              0.05676981 = queryWeight, product of:
                1.1725755 = boost
                4.226511 = idf(docFreq=1716, maxDocs=43254)
                0.011454988 = queryNorm
              0.26415694 = fieldWeight in 2510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.226511 = idf(docFreq=1716, maxDocs=43254)
                0.0625 = fieldNorm(doc=2510)
          0.045832746 = weight(abstract_txt:corpus in 2510) [ClassicSimilarity], result of:
            0.045832746 = score(doc=2510,freq=1.0), product of:
              0.11955885 = queryWeight, product of:
                1.7016605 = boost
                6.1335816 = idf(docFreq=254, maxDocs=43254)
                0.011454988 = queryNorm
              0.38334885 = fieldWeight in 2510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1335816 = idf(docFreq=254, maxDocs=43254)
                0.0625 = fieldNorm(doc=2510)
          0.13243553 = weight(abstract_txt:bilingual in 2510) [ClassicSimilarity], result of:
            0.13243553 = score(doc=2510,freq=1.0), product of:
              0.27765015 = queryWeight, product of:
                3.1759717 = boost
                7.6317935 = idf(docFreq=56, maxDocs=43254)
                0.011454988 = queryNorm
              0.4769871 = fieldWeight in 2510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6317935 = idf(docFreq=56, maxDocs=43254)
                0.0625 = fieldNorm(doc=2510)
          0.0594423 = weight(abstract_txt:performance in 2510) [ClassicSimilarity], result of:
            0.0594423 = score(doc=2510,freq=1.0), product of:
              0.20506991 = queryWeight, product of:
                3.8600566 = boost
                4.6378174 = idf(docFreq=1137, maxDocs=43254)
                0.011454988 = queryNorm
              0.2898636 = fieldWeight in 2510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6378174 = idf(docFreq=1137, maxDocs=43254)
                0.0625 = fieldNorm(doc=2510)
          0.20108424 = weight(abstract_txt:corpora in 2510) [ClassicSimilarity], result of:
            0.20108424 = score(doc=2510,freq=2.0), product of:
              0.3204177 = queryWeight, product of:
                3.9396288 = boost
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.011454988 = queryNorm
              0.6275691 = fieldWeight in 2510, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.0625 = fieldNorm(doc=2510)
          0.17419776 = weight(abstract_txt:arabic in 2510) [ClassicSimilarity], result of:
            0.17419776 = score(doc=2510,freq=1.0), product of:
              0.3668621 = queryWeight, product of:
                4.215494 = boost
                7.5973077 = idf(docFreq=58, maxDocs=43254)
                0.011454988 = queryNorm
              0.47483173 = fieldWeight in 2510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5973077 = idf(docFreq=58, maxDocs=43254)
                0.0625 = fieldNorm(doc=2510)
          0.22216809 = weight(abstract_txt:parallel in 2510) [ClassicSimilarity], result of:
            0.22216809 = score(doc=2510,freq=2.0), product of:
              0.39199692 = queryWeight, product of:
                5.336838 = boost
                6.4121547 = idf(docFreq=192, maxDocs=43254)
                0.011454988 = queryNorm
              0.56675977 = fieldWeight in 2510, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.4121547 = idf(docFreq=192, maxDocs=43254)
                0.0625 = fieldNorm(doc=2510)
        0.28 = coord(7/25)
    
  5. Dadashkarimi, J.; Shakery, A.; Faili, H.; Zamani, H.: An expectation-maximization algorithm for query translation based on pseudo-relevant documents (2017) 0.24
    0.23516685 = sum of:
      0.23516685 = product of:
        0.8398816 = sum of:
          0.07594857 = weight(abstract_txt:pseudo in 4761) [ClassicSimilarity], result of:
            0.07594857 = score(doc=4761,freq=3.0), product of:
              0.10071368 = queryWeight, product of:
                1.1043615 = boost
                7.9612727 = idf(docFreq=40, maxDocs=43254)
                0.011454988 = queryNorm
              0.7541039 = fieldWeight in 4761, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.9612727 = idf(docFreq=40, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4761)
          0.025608608 = weight(abstract_txt:language in 4761) [ClassicSimilarity], result of:
            0.025608608 = score(doc=4761,freq=4.0), product of:
              0.055850845 = queryWeight, product of:
                1.1630461 = boost
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.011454988 = queryNorm
              0.45851782 = fieldWeight in 4761, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.192163 = idf(docFreq=1776, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4761)
          0.05836486 = weight(abstract_txt:term in 4761) [ClassicSimilarity], result of:
            0.05836486 = score(doc=4761,freq=4.0), product of:
              0.110722825 = queryWeight, product of:
                2.005609 = boost
                4.819436 = idf(docFreq=948, maxDocs=43254)
                0.011454988 = queryNorm
              0.52712584 = fieldWeight in 4761, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.819436 = idf(docFreq=948, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4761)
          0.11588109 = weight(abstract_txt:bilingual in 4761) [ClassicSimilarity], result of:
            0.11588109 = score(doc=4761,freq=1.0), product of:
              0.27765015 = queryWeight, product of:
                3.1759717 = boost
                7.6317935 = idf(docFreq=56, maxDocs=43254)
                0.011454988 = queryNorm
              0.4173637 = fieldWeight in 4761, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6317935 = idf(docFreq=56, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4761)
          0.05201201 = weight(abstract_txt:performance in 4761) [ClassicSimilarity], result of:
            0.05201201 = score(doc=4761,freq=1.0), product of:
              0.20506991 = queryWeight, product of:
                3.8600566 = boost
                4.6378174 = idf(docFreq=1137, maxDocs=43254)
                0.011454988 = queryNorm
              0.25363064 = fieldWeight in 4761, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6378174 = idf(docFreq=1137, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4761)
          0.17594871 = weight(abstract_txt:corpora in 4761) [ClassicSimilarity], result of:
            0.17594871 = score(doc=4761,freq=2.0), product of:
              0.3204177 = queryWeight, product of:
                3.9396288 = boost
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.011454988 = queryNorm
              0.5491229 = fieldWeight in 4761, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4761)
          0.3361178 = weight(abstract_txt:clir in 4761) [ClassicSimilarity], result of:
            0.3361178 = score(doc=4761,freq=2.0), product of:
              0.53140235 = queryWeight, product of:
                5.6723604 = boost
                8.178337 = idf(docFreq=32, maxDocs=43254)
                0.011454988 = queryNorm
              0.63251096 = fieldWeight in 4761, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.178337 = idf(docFreq=32, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4761)
        0.28 = coord(7/25)
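
Each document's final score is the sum of its matching term weights multiplied by the coordination factor coord(matched/total), which scales the score by the fraction of the 25 query terms found in the document. As a check, the sketch below recombines the seven term weights listed for doc 4761. The helper name `document_score` is hypothetical, and the inputs are copied from the explanation tree above.

```python
def document_score(term_weights, total_query_terms):
    # coord = number of matching query terms / total number of query terms.
    coord = len(term_weights) / total_query_terms
    # Final score = (sum of per-term weights) * coord.
    return sum(term_weights) * coord

# Per-term weights for doc 4761, in the order shown in the tree:
# pseudo, language, term, bilingual, performance, corpora, clir.
weights_4761 = [
    0.07594857,
    0.025608608,
    0.05836486,
    0.11588109,
    0.05201201,
    0.17594871,
    0.3361178,
]

score_4761 = document_score(weights_4761, total_query_terms=25)
print(score_4761)  # compare with 0.23516685 in the tree
```

The same calculation explains the ranking: doc 3023 matches 8 of the 25 terms (coord 0.32), so its term sum is discounted less than that of documents matching only 7 (coord 0.28).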