Search (191 results, page 1 of 10)

  • theme_ss:"Retrievalalgorithmen"
  1. Efron, M.: Linear time series models for term weighting in information retrieval (2010) 0.29
    0.29163787 = product of:
      0.38885048 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 3688) [ClassicSimilarity], result of:
              0.028250674 = score(doc=3688,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 3688, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3688)
          0.25 = coord(1/4)
        0.1659598 = weight(_text_:term in 3688) [ClassicSimilarity], result of:
          0.1659598 = score(doc=3688,freq=12.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7576688 = fieldWeight in 3688, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=3688)
        0.215828 = weight(_text_:frequency in 3688) [ClassicSimilarity], result of:
          0.215828 = score(doc=3688,freq=8.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7807447 = fieldWeight in 3688, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=3688)
      0.75 = coord(3/4)
    
    Abstract
    Common measures of term importance in information retrieval (IR) rely on counts of term frequency; rare terms receive higher weight in document ranking than common terms receive. However, realistic scenarios yield additional information about terms in a collection. Of interest in this article is the temporal behavior of terms as a collection changes over time. We propose capturing each term's collection frequency at discrete time intervals over the lifespan of a corpus and analyzing the resulting time series. We hypothesize the collection frequency of a weakly discriminative term x at time t is predictable by a linear model of the term's prior observations. On the other hand, a linear time series model for a strong discriminator's collection frequency will yield a poor fit to the data. Operationalizing this hypothesis, we induce three time-based measures of term importance and test these against state-of-the-art term weighting models.
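The score explanation tree for this first hit can be checked by hand. The sketch below is a minimal reconstruction, assuming the usual Lucene ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); `queryNorm`, `fieldNorm`, and the two coord factors are copied from the tree rather than derived, and the helper names `idf` and `term_score` are made up for this illustration.

```python
import math

# Constants copied from the explanation tree for doc 3688.
MAX_DOCS = 44218
QUERY_NORM = 0.04694356
FIELD_NORM = 0.046875


def idf(doc_freq, max_docs=MAX_DOCS):
    """Assumed ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, field_norm=FIELD_NORM, query_norm=QUERY_NORM):
    """score = queryWeight * fieldWeight, as laid out in the tree."""
    term_idf = idf(doc_freq)
    query_weight = term_idf * query_norm                     # idf * queryNorm
    field_weight = math.sqrt(freq) * term_idf * field_norm   # tf * idf * fieldNorm
    return query_weight * field_weight


# "based" sits in a nested clause that carries its own coord(1/4).
based = term_score(freq=2.0, doc_freq=5906) * 0.25
term = term_score(freq=12.0, doc_freq=1130)
frequency = term_score(freq=8.0, doc_freq=332)

# Three of the four top-level clauses matched, hence the outer coord(3/4).
total = (based + term + frequency) * 0.75
print(total)
```

The result should agree with the listed 0.29163787 up to small rounding differences, because the search engine works in single precision while this sketch uses Python floats.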
  2. Losada, D.E.; Barreiro, A.: Embedding term similarity and inverse document frequency into a logical model of information retrieval (2003) 0.27
    0.26751098 = product of:
      0.35668132 = sum of:
        0.12775593 = weight(_text_:term in 1422) [ClassicSimilarity], result of:
          0.12775593 = score(doc=1422,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.58325374 = fieldWeight in 1422, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=1422)
        0.20348458 = weight(_text_:frequency in 1422) [ClassicSimilarity], result of:
          0.20348458 = score(doc=1422,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7360931 = fieldWeight in 1422, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0625 = fieldNorm(doc=1422)
        0.025440816 = product of:
          0.05088163 = sum of:
            0.05088163 = weight(_text_:22 in 1422) [ClassicSimilarity], result of:
              0.05088163 = score(doc=1422,freq=2.0), product of:
                0.16438834 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04694356 = queryNorm
                0.30952093 = fieldWeight in 1422, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1422)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    We propose a novel approach to incorporate term similarity and inverse document frequency into a logical model of information retrieval. The ability of the logic to handle expressive representations along with the use of such classical notions are promising characteristics for IR systems. The approach proposed here has been efficiently implemented and experiments against test collections are presented.
    Date
    22. 3.2003 19:27:23
  3. Witschel, H.F.: Global term weights in distributed environments (2008) 0.25
    0.2514086 = sum of:
      0.0070626684 = product of:
        0.028250674 = sum of:
          0.028250674 = weight(_text_:based in 2096) [ClassicSimilarity], result of:
            0.028250674 = score(doc=2096,freq=2.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.19973516 = fieldWeight in 2096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.046875 = fieldNorm(doc=2096)
        0.25 = coord(1/4)
      0.117351316 = weight(_text_:term in 2096) [ClassicSimilarity], result of:
        0.117351316 = score(doc=2096,freq=6.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.5357528 = fieldWeight in 2096, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.046875 = fieldNorm(doc=2096)
      0.107914 = weight(_text_:frequency in 2096) [ClassicSimilarity], result of:
        0.107914 = score(doc=2096,freq=2.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.39037234 = fieldWeight in 2096, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.046875 = fieldNorm(doc=2096)
      0.019080611 = product of:
        0.038161222 = sum of:
          0.038161222 = weight(_text_:22 in 2096) [ClassicSimilarity], result of:
            0.038161222 = score(doc=2096,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.23214069 = fieldWeight in 2096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=2096)
        0.5 = coord(1/2)
    
    Abstract
    This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view on the collection is not available. In particular, the two options of either sampling documents or of using a reference corpus independent of the target retrieval collection are compared using standard IR test collections. In addition, the possibility of pruning term lists based on frequency is evaluated. The results show that very good retrieval performance can be reached when just the most frequent terms of a collection - an "extended stop word list" - are known and all terms which are not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus, but some "domain-specific stop words" need to be added. A good solution for achieving this is to mix estimates from small samples of the target retrieval collection with ones derived from a reference corpus.
    Date
    1. 8.2008 9:44:22
  4. Nunes, S.; Ribeiro, C.; David, G.: Term weighting based on document revision history (2011) 0.24
    0.2428341 = product of:
      0.3237788 = sum of:
        0.008323434 = product of:
          0.033293735 = sum of:
            0.033293735 = weight(_text_:based in 4946) [ClassicSimilarity], result of:
              0.033293735 = score(doc=4946,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23539014 = fieldWeight in 4946, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4946)
          0.25 = coord(1/4)
        0.15969492 = weight(_text_:term in 4946) [ClassicSimilarity], result of:
          0.15969492 = score(doc=4946,freq=16.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7290672 = fieldWeight in 4946, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4946)
        0.15576044 = weight(_text_:frequency in 4946) [ClassicSimilarity], result of:
          0.15576044 = score(doc=4946,freq=6.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.5634539 = fieldWeight in 4946, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4946)
      0.75 = coord(3/4)
    
    Abstract
    In real-world information retrieval systems, the underlying document collection is rarely stable or definitive. This work is focused on the study of signals extracted from the content of documents at different points in time for the purpose of weighting individual terms in a document. The basic idea behind our proposals is that terms that have existed for a longer time in a document should have a greater weight. We propose 4 term weighting functions that use each document's history to estimate a current term score. To evaluate this thesis, we conduct 3 independent experiments using a collection of documents sampled from Wikipedia. In the first experiment, we use data from Wikipedia to judge each set of terms. In a second experiment, we use an external collection of tags from a popular social bookmarking service as a gold standard. In the third experiment, we crowdsource user judgments to collect feedback on term preference. Across all experiments, results consistently support our thesis. We show that temporally aware measures, specifically the proposed revision term frequency and revision term frequency span, outperform a term-weighting measure based on raw term frequency alone.
  5. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.23
    0.23006546 = product of:
      0.30675396 = sum of:
        0.010194084 = product of:
          0.040776335 = sum of:
            0.040776335 = weight(_text_:based in 1283) [ClassicSimilarity], result of:
              0.040776335 = score(doc=1283,freq=6.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28829288 = fieldWeight in 1283, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1283)
          0.25 = coord(1/4)
        0.16938202 = weight(_text_:term in 1283) [ClassicSimilarity], result of:
          0.16938202 = score(doc=1283,freq=18.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7732925 = fieldWeight in 1283, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1283)
        0.12717786 = weight(_text_:frequency in 1283) [ClassicSimilarity], result of:
          0.12717786 = score(doc=1283,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 1283, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1283)
      0.75 = coord(3/4)
    
    Abstract
    While term independence is a widely held assumption in most of the established information retrieval approaches, it is clearly not true and various works in the past have investigated a relaxation of the assumption. One approach is to use n-grams in document representation instead of unigrams. However, the majority of early works on n-grams obtained only modest performance improvement. On the other hand, the use of information based on supporting terms or "contexts" of queries has been found to be promising. In particular, recent studies showed that using new context-dependent term weights improved the performance of relevance feedback (RF) retrieval compared with using traditional bag-of-words BM25 term weights. Calculation of the new term weights requires an estimation of the local probability of relevance of each query term occurrence. In previous studies, the estimation of this probability was based on unigrams that occur in the neighborhood of a query term. We explore an integration of the n-gram and context approaches by computing context-dependent term weights based on a mixture of unigrams and bigrams. Extensive experiments are performed using the title queries of the Text Retrieval Conference (TREC)-6, TREC-7, TREC-8, and TREC-2005 collections, for RF with relevance judgment of either the top 10 or top 20 documents of an initial retrieval. We identify some crucial elements needed in the use of bigrams in our methods, such as proper inverse document frequency (IDF) weighting of the bigrams and noise reduction by pruning bigrams with large document frequency values. We show that enhancing context-dependent term weights with bigrams is effective in further improving retrieval performance.
  6. Aizawa, A.: An information-theoretic perspective of tf-idf measures (2003) 0.23
    0.22742893 = product of:
      0.30323857 = sum of:
        0.009416891 = product of:
          0.037667565 = sum of:
            0.037667565 = weight(_text_:based in 4155) [ClassicSimilarity], result of:
              0.037667565 = score(doc=4155,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.26631355 = fieldWeight in 4155, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0625 = fieldNorm(doc=4155)
          0.25 = coord(1/4)
        0.09033708 = weight(_text_:term in 4155) [ClassicSimilarity], result of:
          0.09033708 = score(doc=4155,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.41242266 = fieldWeight in 4155, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=4155)
        0.20348458 = weight(_text_:frequency in 4155) [ClassicSimilarity], result of:
          0.20348458 = score(doc=4155,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7360931 = fieldWeight in 4155, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0625 = fieldNorm(doc=4155)
      0.75 = coord(3/4)
    
    Abstract
    This paper presents a mathematical definition of the "probability-weighted amount of information" (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency - inverse document frequency measures that are commonly used in today's information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation.
  7. Lee, D.L.; Ren, L.: Document ranking on weight-partitioned signature files (1996) 0.19
    0.18807726 = product of:
      0.3761545 = sum of:
        0.15808989 = weight(_text_:term in 2417) [ClassicSimilarity], result of:
          0.15808989 = score(doc=2417,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.72173965 = fieldWeight in 2417, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2417)
        0.21806462 = weight(_text_:frequency in 2417) [ClassicSimilarity], result of:
          0.21806462 = score(doc=2417,freq=6.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.78883547 = fieldWeight in 2417, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2417)
      0.5 = coord(2/4)
    
    Abstract
    Proposes the weight partitioned signature file, a signature file organization for supporting document ranking. It uses multiple signature files each corresponding to one term frequency to represent terms with different term frequencies. Words with the same term frequency in a document are grouped together and hashed into the signature file corresponding to that term frequency. Investigates the effect of false drops on retrieval effectiveness. Analyses the performance of the weight partitioned signature file under different search strategies and configurations. Obtains an optimal formula for storage allocation to minimise the effect of false drops on document ranks. Analytical results are supported by experiments on document collections.
  8. Pan, M.; Huang, J.X.; He, T.; Mao, Z.; Ying, Z.; Tu, X.: A simple kernel co-occurrence-based enhancement for pseudo-relevance feedback (2020) 0.18
    0.17999947 = product of:
      0.2399993 = sum of:
        0.011771114 = product of:
          0.047084454 = sum of:
            0.047084454 = weight(_text_:based in 5678) [ClassicSimilarity], result of:
              0.047084454 = score(doc=5678,freq=8.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.33289194 = fieldWeight in 5678, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5678)
          0.25 = coord(1/4)
        0.13829985 = weight(_text_:term in 5678) [ClassicSimilarity], result of:
          0.13829985 = score(doc=5678,freq=12.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.6313907 = fieldWeight in 5678, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5678)
        0.08992833 = weight(_text_:frequency in 5678) [ClassicSimilarity], result of:
          0.08992833 = score(doc=5678,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.32531026 = fieldWeight in 5678, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5678)
      0.75 = coord(3/4)
    
    Abstract
    Pseudo-relevance feedback is a well-studied query expansion technique in which it is assumed that the top-ranked documents in an initial set of retrieval results are relevant and expansion terms are then extracted from those documents. When selecting expansion terms, most traditional models do not simultaneously consider term frequency and the co-occurrence relationships between candidate terms and query terms. Intuitively, however, a term that has a higher co-occurrence with a query term is more likely to be related to the query topic. In this article, we propose a kernel co-occurrence-based framework to enhance retrieval performance by integrating term co-occurrence information into the Rocchio model and a relevance language model (RM3). Specifically, a kernel co-occurrence-based Rocchio method (KRoc) and a kernel co-occurrence-based RM3 method (KRM3) are proposed. In our framework, co-occurrence information is incorporated into both the factor of the term discrimination power and the factor of the within-document term weight to boost retrieval performance. The results of a series of experiments show that our proposed methods significantly outperform the corresponding strong baselines over all data sets in terms of the mean average precision and over most data sets in terms of P@10. A direct comparison of standard Text Retrieval Conference data sets indicates that our proposed methods are at least comparable to state-of-the-art approaches.
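As background for the abstract above, here is a minimal sketch of plain Rocchio-style query expansion under pseudo-relevance feedback. It is not the kernel co-occurrence variant (KRoc) the article proposes: it uses only the classic positive-feedback part of Rocchio, and the function name `rocchio_expand`, the dict-based vectors, and the default `alpha`, `beta`, and `top_n` values are assumptions chosen for illustration.

```python
from collections import Counter


def rocchio_expand(query, feedback_docs, alpha=1.0, beta=0.75, top_n=5):
    """Expand a query from documents that are assumed to be relevant.

    query:         dict mapping term -> weight
    feedback_docs: list of dicts mapping term -> weight (the top-ranked
                   documents of an initial retrieval run)
    """
    # Centroid of the feedback documents (mean weight per term).
    centroid = Counter()
    for doc in feedback_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(feedback_docs)

    # Candidate expansion terms: the top_n strongest centroid terms
    # that are not already in the query (ties broken alphabetically).
    candidates = sorted(
        (t for t in centroid if t not in query),
        key=lambda t: (-centroid[t], t),
    )[:top_n]

    # q' = alpha * q + beta * centroid, restricted to query + candidates.
    expanded = {t: alpha * w for t, w in query.items()}
    for t in list(query) + candidates:
        expanded[t] = expanded.get(t, 0.0) + beta * centroid[t]
    return expanded


if __name__ == "__main__":
    q = {"term": 1.0}
    docs = [{"term": 2.0, "weight": 4.0}, {"weight": 2.0, "idf": 1.0}]
    print(rocchio_expand(q, docs, top_n=1))
```

Because every top-ranked document is treated as relevant, there is no negative-feedback term here; the article's contribution is to bring co-occurrence information into weights of this kind.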
  9. Yang, L.; Ji, D.; Leong, M.: Document reranking by term distribution and maximal marginal relevance for Chinese information retrieval (2007) 0.17
    0.17424598 = product of:
      0.23232798 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 907) [ClassicSimilarity], result of:
              0.028250674 = score(doc=907,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 907, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=907)
          0.25 = coord(1/4)
        0.117351316 = weight(_text_:term in 907) [ClassicSimilarity], result of:
          0.117351316 = score(doc=907,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5357528 = fieldWeight in 907, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=907)
        0.107914 = weight(_text_:frequency in 907) [ClassicSimilarity], result of:
          0.107914 = score(doc=907,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.39037234 = fieldWeight in 907, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=907)
      0.75 = coord(3/4)
    
    Abstract
    In this paper, we propose a document reranking method for Chinese information retrieval. The method is based on a term weighting scheme, which integrates local and global distribution of terms as well as document frequency, document positions and term length. The weighting scheme allows randomly setting a larger portion of the retrieved documents as relevance feedback, and removes the worry that very few relevant documents appear among the top retrieved documents. It also helps to improve the performance of maximal marginal relevance (MMR) in document reranking. The method was evaluated by MAP (mean average precision), a recall-oriented measure. Significance tests showed that our method achieves significant improvement over standard baselines and outperforms relevant methods consistently.
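The abstract above mentions maximal marginal relevance (MMR) without spelling it out. The sketch below shows the standard greedy MMR selection rule as it is generally formulated, not the authors' term-distribution weighting; the function name `mmr_rerank`, the trade-off parameter `lam`, and the way similarities are passed in are assumptions made for this example.

```python
def mmr_rerank(query_sim, doc_sim, lam=0.7, k=None):
    """Greedy MMR reranking.

    query_sim: dict mapping doc_id -> similarity to the query
    doc_sim:   function (doc_id, doc_id) -> similarity between two documents
    lam:       1.0 ranks purely by relevance; lower values favour novelty
    k:         number of documents to return (default: all)
    """
    remaining = set(query_sim)
    selected = []
    k = len(remaining) if k is None else min(k, len(remaining))

    while len(selected) < k:
        def mmr(doc_id):
            # Redundancy = highest similarity to anything already selected.
            redundancy = max(
                (doc_sim(doc_id, s) for s in selected), default=0.0
            )
            return lam * query_sim[doc_id] - (1.0 - lam) * redundancy

        # sorted() makes tie-breaking deterministic (first id wins).
        best = max(sorted(remaining), key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    sims = {"a": 0.9, "b": 0.85, "c": 0.5}
    near_duplicates = {frozenset(("a", "b")): 0.95}

    def pair_sim(x, y):
        return near_duplicates.get(frozenset((x, y)), 0.1)

    print(mmr_rerank(sims, pair_sim, lam=0.5))
```

In the usage example, document "b" is nearly identical to "a", so with `lam=0.5` it is pushed below the less relevant but more novel "c".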
  10. Liu, R.-L.; Huang, Y.-C.: Ranker enhancement for proximity-based ranking of biomedical texts (2011) 0.17
    0.16837625 = product of:
      0.22450167 = sum of:
        0.008323434 = product of:
          0.033293735 = sum of:
            0.033293735 = weight(_text_:based in 4947) [ClassicSimilarity], result of:
              0.033293735 = score(doc=4947,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23539014 = fieldWeight in 4947, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4947)
          0.25 = coord(1/4)
        0.12624991 = weight(_text_:term in 4947) [ClassicSimilarity], result of:
          0.12624991 = score(doc=4947,freq=10.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5763782 = fieldWeight in 4947, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4947)
        0.08992833 = weight(_text_:frequency in 4947) [ClassicSimilarity], result of:
          0.08992833 = score(doc=4947,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.32531026 = fieldWeight in 4947, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4947)
      0.75 = coord(3/4)
    
    Abstract
    Biomedical decision making often requires relevant evidence from the biomedical literature. Retrieval of the evidence calls for a system that receives a natural language query for a biomedical information need and, among the huge amount of texts retrieved for the query, ranks relevant texts higher for further processing. However, state-of-the-art text rankers have weaknesses in dealing with biomedical queries, which often consist of several correlating concepts and prefer those texts that completely talk about the concepts. In this article, we present a technique, Proximity-Based Ranker Enhancer (PRE), to enhance text rankers by term-proximity information. PRE assesses the term frequency (TF) of each term in the text by integrating three types of term proximity to measure the contextual completeness of query terms appearing in nearby areas in the text being ranked. Therefore, PRE may serve as a preprocessor for (or supplement to) those rankers that consider TF in ranking, without the need to change the algorithms and development processes of the rankers. Empirical evaluation shows that PRE significantly improves various kinds of text rankers, and when compared with several state-of-the-art techniques that enhance rankers by term-proximity information, PRE may more stably and significantly enhance the rankers.
  11. Robertson, S.: Understanding inverse document frequency : on theoretical arguments for IDF (2004) 0.16
    0.1598883 = product of:
      0.2131844 = sum of:
        0.00823978 = product of:
          0.03295912 = sum of:
            0.03295912 = weight(_text_:based in 4421) [ClassicSimilarity], result of:
              0.03295912 = score(doc=4421,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23302436 = fieldWeight in 4421, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4421)
          0.25 = coord(1/4)
        0.079044946 = weight(_text_:term in 4421) [ClassicSimilarity], result of:
          0.079044946 = score(doc=4421,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.36086982 = fieldWeight in 4421, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4421)
        0.12589967 = weight(_text_:frequency in 4421) [ClassicSimilarity], result of:
          0.12589967 = score(doc=4421,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.45543438 = fieldWeight in 4421, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4421)
      0.75 = coord(3/4)
    
    Abstract
    The term-weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
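For orientation, IDF and TF*IDF in their commonly stated form are given below, alongside the idf variant that the score explanations on this page appear to use (assumed here to be the Lucene ClassicSimilarity definition). The symbols $N$ (number of documents in the collection) and $n_t$ (number of documents containing term $t$) are notation introduced for this note and are not taken from the abstract.

```latex
\[
  \mathrm{idf}(t) = \log\frac{N}{n_t},
  \qquad
  \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\]
% Variant assumed for the explanation trees on this page
% (maxDocs = N, docFreq = n_t):
\[
  \mathrm{idf}_{\text{classic}}(t) = 1 + \ln\frac{N}{n_t + 1},
  \qquad
  \text{e.g. } 1 + \ln\frac{44218}{332 + 1} \approx 5.8887
  \text{ for the term ``frequency''.}
\]
```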
  12. Khoo, C.S.G.; Wan, K.-W.: A simple relevancy-ranking strategy for an interface to Boolean OPACs (2004) 0.15
    0.14550374 = sum of:
      0.0058264043 = product of:
        0.023305617 = sum of:
          0.023305617 = weight(_text_:based in 2509) [ClassicSimilarity], result of:
            0.023305617 = score(doc=2509,freq=4.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.1647731 = fieldWeight in 2509, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.02734375 = fieldNorm(doc=2509)
        0.25 = coord(1/4)
      0.039522473 = weight(_text_:term in 2509) [ClassicSimilarity], result of:
        0.039522473 = score(doc=2509,freq=2.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.18043491 = fieldWeight in 2509, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.02734375 = fieldNorm(doc=2509)
      0.08902451 = weight(_text_:frequency in 2509) [ClassicSimilarity], result of:
        0.08902451 = score(doc=2509,freq=4.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.32204074 = fieldWeight in 2509, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.02734375 = fieldNorm(doc=2509)
      0.011130357 = product of:
        0.022260714 = sum of:
          0.022260714 = weight(_text_:22 in 2509) [ClassicSimilarity], result of:
            0.022260714 = score(doc=2509,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.1354154 = fieldWeight in 2509, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.02734375 = fieldNorm(doc=2509)
        0.5 = coord(1/2)
    
    Abstract
    A relevancy-ranking algorithm for a natural language interface to Boolean online public access catalogs (OPACs) was formulated and compared with that currently used in a knowledge-based search interface called the E-Referencer, being developed by the authors. The algorithm makes use of seven well-known ranking criteria: breadth of match, section weighting, proximity of query words, variant word forms (stemming), document frequency, term frequency and document length. The algorithm converts a natural language query into a series of increasingly broader Boolean search statements. In a small experiment with ten subjects in which the algorithm was simulated by hand, the algorithm obtained good results with a mean overall precision of 0.42 and mean average precision of 0.62, representing a 27 percent improvement in precision and 41 percent improvement in average precision compared to the E-Referencer. The usefulness of each step in the algorithm was analyzed and suggestions are made for improving the algorithm.
    Content
    "Most Web search engines accept natural language queries, perform some kind of fuzzy matching and produce ranked output, displaying first the documents that are most likely to be relevant. On the other hand, most library online public access catalogs (OPACs) on the Web are still Boolean retrieval systems that perform exact matching, and require users to express their search requests precisely in a Boolean search language and to refine their search statements to improve the search results. It is well-documented that users have difficulty searching Boolean OPACs effectively (e.g. Borgman, 1996; Ensor, 1992; Wallace, 1993). One approach to making OPACs easier to use is to develop a natural language search interface that acts as a middleware between the user's Web browser and the OPAC system. The search interface can accept a natural language query from the user and reformulate it as a series of Boolean search statements that are then submitted to the OPAC. The records retrieved by the OPAC are ranked by the search interface before forwarding them to the user's Web browser. The user, then, does not need to interact directly with the Boolean OPAC but with the natural language search interface or search intermediary. The search interface interacts with the OPAC system on the user's behalf. The advantage of this approach is that no modification to the OPAC or library system is required. Furthermore, the search interface can access multiple OPACs, acting as a meta search engine, and integrate search results from various OPACs before sending them to the user. The search interface needs to incorporate a method for converting the user's natural language query into a series of Boolean search statements, and for ranking the OPAC records retrieved. The purpose of this study was to develop a relevancy-ranking algorithm for a search interface to Boolean OPAC systems. This is part of an on-going effort to develop a knowledge-based search interface to OPACs called the E-Referencer (Khoo et al., 1998, 1999; Poo et al., 2000). E-Referencer v. 2 that has been implemented applies a repertoire of initial search strategies and reformulation strategies to retrieve records from OPACs using the Z39.50 protocol, and also assists users in mapping query keywords to the Library of Congress subject headings."
    Source
    Electronic library. 22(2004) no.2, S.112-120
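
The abstract of entry 12 describes converting a natural language query into a series of increasingly broader Boolean search statements. The listing does not give the paper's actual procedure, so the following is only a minimal sketch of one plausible reading of the "breadth of match" criterion: start with all query words ANDed together, then relax to every combination of fewer words. The function name `broaden` and the output syntax are hypothetical.

```python
from itertools import combinations


def broaden(query_terms):
    """Return Boolean statements from narrowest to broadest.

    Hypothetical illustration only: the first statement requires every
    term, and each later statement requires one term fewer.
    """
    # Lower-case and de-duplicate while preserving word order.
    terms = list(dict.fromkeys(t.lower() for t in query_terms))
    statements = []
    for size in range(len(terms), 0, -1):
        clauses = [" AND ".join(combo) for combo in combinations(terms, size)]
        if size > 1 and len(clauses) > 1:
            # Parenthesise multi-term clauses before ORing them together.
            clauses = [f"({clause})" for clause in clauses]
        statements.append(" OR ".join(clauses))
    return statements


statements = broaden(["term", "weighting", "retrieval"])
for statement in statements:
    print(statement)
```

A search interface built this way could submit the statements in order and rank records by the first (narrowest) statement that retrieved them.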
  13. Dang, E.K.F.; Luk, R.W.P.; Allan, J.; Ho, K.S.; Chung, K.F.L.; Lee, D.L.: A new context-dependent term weight computed by boost and discount using relevance information (2010) 0.14
    0.1434364 = product of:
      0.2868728 = sum of:
        0.15969492 = weight(_text_:term in 4120) [ClassicSimilarity], result of:
          0.15969492 = score(doc=4120,freq=16.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7290672 = fieldWeight in 4120, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4120)
        0.12717786 = weight(_text_:frequency in 4120) [ClassicSimilarity], result of:
          0.12717786 = score(doc=4120,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 4120, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4120)
      0.5 = coord(2/4)
    
    Abstract
    We studied the effectiveness of a new class of context-dependent term weights for information retrieval. Unlike the traditional term frequency-inverse document frequency (TF-IDF), the new weighting of a term t in a document d depends not only on the occurrence statistics of t alone but also on the terms found within a text window (or "document-context") centered on t. We introduce a Boost and Discount (B&D) procedure which utilizes partial relevance information to compute the context-dependent term weights of query terms according to a logistic regression model. We investigate the effectiveness of the new term weights compared with the context-independent BM25 weights in the setting of relevance feedback. We performed experiments with title queries of the TREC-6, -7, -8, and 2005 collections, comparing the residual Mean Average Precision (MAP) measures obtained using B&D term weights and those obtained by a baseline using BM25 weights. Given either 10 or 20 relevance judgments of the top retrieved documents, using the new term weights yields improvement over the baseline for all collections tested. The MAP obtained with the new weights has relative improvement over the baseline by 3.3 to 15.2%, with statistical significance at the 95% confidence level across all four collections.
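
The score breakdown shown for entry 13 (doc=4120) can be reproduced by hand. The sketch below assumes the classic Lucene TF-IDF formulas, which are consistent with the numbers in the listing: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), fieldWeight = tf * idf * fieldNorm, and queryWeight = idf * queryNorm. The function names are illustrative, and queryNorm and fieldNorm are copied from the listing rather than derived.

```python
import math


def idf(doc_freq, max_docs):
    """Inverse document frequency in the classic Lucene form (assumed)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, field_norm, query_norm):
    """Score of one query term in one document: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm
    field_weight = tf * term_idf * field_norm
    return query_weight * field_weight


# Values copied from the explain output for doc=4120.
QUERY_NORM = 0.04694356
MAX_DOCS = 44218
FIELD_NORM = 0.0390625

score_term = term_score(16.0, 1130, MAX_DOCS, FIELD_NORM, QUERY_NORM)
score_frequency = term_score(4.0, 332, MAX_DOCS, FIELD_NORM, QUERY_NORM)

# coord(2/4): two of the four query clauses matched this document.
total = (score_term + score_frequency) * (2 / 4)
print(round(total, 4))
```

The small differences in the last digits compared with the listing come from the listing using single-precision floats.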
  14. Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval (2004) 0.14
    0.14199477 = product of:
      0.28398955 = sum of:
        0.15808989 = weight(_text_:term in 4420) [ClassicSimilarity], result of:
          0.15808989 = score(doc=4420,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.72173965 = fieldWeight in 4420, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4420)
        0.12589967 = weight(_text_:frequency in 4420) [ClassicSimilarity], result of:
          0.12589967 = score(doc=4420,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.45543438 = fieldWeight in 4420, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4420)
      0.5 = coord(2/4)
    
    Abstract
    The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing, in particular, that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
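
The weighting idea in the abstract of entry 14 (matches on less frequent terms count for more than matches on frequent terms) can be sketched as follows. The weight log(N / n) used here is a common textbook form and is an assumption, not necessarily the exact function from the paper. The document frequencies are those shown in the score breakdowns of this listing; the two example documents are invented.

```python
import math

# Document frequencies as reported in the score breakdowns of this listing.
MAX_DOCS = 44218
DOC_FREQ = {"based": 5906, "term": 1130, "frequency": 332}


def collection_frequency_weight(term):
    """log(N / n): the rarer the term, the larger the weight (assumed form)."""
    return math.log(MAX_DOCS / DOC_FREQ[term])


def match_score(query_terms, document_terms):
    """Sum the weights of the query terms that occur in the document."""
    return sum(
        collection_frequency_weight(t) for t in query_terms if t in document_terms
    )


weights = {t: collection_frequency_weight(t) for t in DOC_FREQ}

query = ["based", "term", "frequency"]
doc_a = {"based"}      # invented document matching only the common term
doc_b = {"frequency"}  # invented document matching only the rare term
score_a = match_score(query, doc_a)
score_b = match_score(query, doc_b)
print(score_a < score_b)
```

Both documents match exactly one query term, yet the match on the rarer term "frequency" scores higher.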
  15. Smith, M.P.; Pollitt, S.A.: A comparison of ranking formulae and their ranks (1995) 0.13
    0.13140477 = product of:
      0.26280954 = sum of:
        0.13690987 = weight(_text_:term in 5802) [ClassicSimilarity], result of:
          0.13690987 = score(doc=5802,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.62504494 = fieldWeight in 5802, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5802)
        0.12589967 = weight(_text_:frequency in 5802) [ClassicSimilarity], result of:
          0.12589967 = score(doc=5802,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.45543438 = fieldWeight in 5802, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5802)
      0.5 = coord(2/4)
    
    Abstract
    Reports a study to compare the rankings produced by several well-known probabilistic formulae. Values for the variables used in these formulae (collection frequency for a query term, number of relevant documents retrieved, and number of relevant documents indexed by the query term) were derived using a random number generator; the number of documents in the collection was fixed at 500,000. This produced ranked bands for each formula using document term characteristics rather than actual documents. These rankings were compared with one another using the Spearman rho rank correlation coefficient to determine how closely the algorithms rank documents. There is little difference between the rankings produced by the Expected Mutual Information Measure (EMIM) and the simpler F4.5 weighting scheme.
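
The comparison described in entry 15 rests on the Spearman rank correlation between the orderings produced by two formulae. A minimal sketch, assuming no tied scores so that the shortcut formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies; the two score lists are invented for illustration and are not data from the study.

```python
def ranks(scores):
    """Rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation between two score lists of equal length."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Invented scores assigned by two hypothetical formulae to the same four items.
formula_a = [0.9, 0.7, 0.4, 0.2]
formula_b = [0.8, 0.75, 0.1, 0.3]
rho = spearman_rho(formula_a, formula_b)
print(rho)
```

A value near 1 means the two formulae order the items almost identically; a value near -1 means they order them in reverse.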
  16. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.12
    0.1242152 = product of:
      0.2484304 = sum of:
        0.09581695 = weight(_text_:term in 4119) [ClassicSimilarity], result of:
          0.09581695 = score(doc=4119,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.4374403 = fieldWeight in 4119, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=4119)
        0.15261345 = weight(_text_:frequency in 4119) [ClassicSimilarity], result of:
          0.15261345 = score(doc=4119,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 4119, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=4119)
      0.5 = coord(2/4)
    
    Abstract
    In this work, we investigate the problem of using the block structure of Web pages to improve ranking results. Starting with basic intuitions provided by the concepts of term frequency (TF) and inverse document frequency (IDF), we propose nine block-weight functions to distinguish the impact of term occurrences inside page blocks, instead of inside whole pages. These are then used to compute a modified BM25 ranking function. Using four distinct Web collections, we ran extensive experiments to compare our block-weight ranking formulas with two other baselines: (a) a BM25 ranking applied to full pages, and (b) a BM25 ranking that takes into account best blocks. Our experiments suggest that our block-weighting ranking method is superior to all baselines across all collections we used, with average gains in precision of 5 to 20%.
  17. Keen, E.M.: Designing and testing an interactive ranked retrieval system for professional searchers (1994) 0.11
    0.11248532 = product of:
      0.22497064 = sum of:
        0.09779277 = weight(_text_:term in 1066) [ClassicSimilarity], result of:
          0.09779277 = score(doc=1066,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.44646066 = fieldWeight in 1066, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1066)
        0.12717786 = weight(_text_:frequency in 1066) [ClassicSimilarity], result of:
          0.12717786 = score(doc=1066,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 1066, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1066)
      0.5 = coord(2/4)
    
    Abstract
    Reports 3 explorations of ranked system design. 2 tests used a 'cystic fibrosis' test collection with 100 queries. Experiment 1 compared a Boolean with a ranked interactive system using a subject-qualified trained searcher, reporting recall and precision results. Experiment 2 compared 15 different ranked match algorithms in batch mode using 2 test collections, and included some new proximate pairs and term weighting approaches. Experiment 3 is a design plan for an interactive ranked prototype offering mid-search algorithm choices plus other manual search devices (such as obligatory and unwanted terms), as influenced by think-aloud comments from experiment 1. Concludes that, in Boolean versus ranked using inverse collection frequency, the searcher inspected more records on ranked than on Boolean and so achieved a higher recall but lower precision; however, the presentation order of the relevant records was, on average, very similar in both systems. Concludes also that: query reformulation was quite strongly practised in ranked searching but does not appear to have been effective; the term pairs proximate weighting methods in experiment 2 enhanced precision on both test collections when used with inverse collection frequency (ICF) weighting; and the design plan for an interactive prototype adds to a selection of match algorithms other devices, such as obligatory and unwanted term marking, evidence for this being found in the think-aloud comments.
  18. Drucker, H.; Shahrary, B.; Gibbon, D.C.: Support vector machines : relevance feedback and information retrieval (2002) 0.11
    0.11018313 = product of:
      0.22036625 = sum of:
        0.06775281 = weight(_text_:term in 2581) [ClassicSimilarity], result of:
          0.06775281 = score(doc=2581,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.309317 = fieldWeight in 2581, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=2581)
        0.15261345 = weight(_text_:frequency in 2581) [ClassicSimilarity], result of:
          0.15261345 = score(doc=2581,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 2581, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=2581)
      0.5 = coord(2/4)
    
    Abstract
    We compare support vector machines (SVMs) to Rocchio, Ide regular and Ide dec-hi algorithms in information retrieval (IR) of text documents using relevancy feedback. It is assumed a preliminary search finds a set of documents that the user marks as relevant or not and then feedback iterations commence. Particular attention is paid to IR searches where the number of relevant documents in the database is low and the preliminary set of documents used to start the search has few relevant documents. Experiments show that if inverse document frequency (IDF) weighting is not used because one is unwilling to pay the time penalty needed to obtain these features, then SVMs are better whether using term-frequency (TF) or binary weighting. SVM performance is marginally better than Ide dec-hi if TF-IDF weighting is used and there is a reasonable number of relevant documents found in the preliminary search. If the preliminary search is so poor that one has to search through many documents to find at least one relevant document, then SVM is preferred.
  19. Abu-Salem, H.; Al-Omari, M.; Evens, M.W.: Stemming methodologies over individual query words for an Arabic information retrieval system (1999) 0.10
    0.10351266 = product of:
      0.20702532 = sum of:
        0.07984746 = weight(_text_:term in 3672) [ClassicSimilarity], result of:
          0.07984746 = score(doc=3672,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.3645336 = fieldWeight in 3672, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3672)
        0.12717786 = weight(_text_:frequency in 3672) [ClassicSimilarity], result of:
          0.12717786 = score(doc=3672,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 3672, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3672)
      0.5 = coord(2/4)
    
    Abstract
    Stemming is one of the most important factors that affect the performance of information retrieval systems. This article investigates how to improve the performance of an Arabic information retrieval system by imposing the retrieval method over individual words of a query depending on the importance of the WORD, the STEM, or the ROOT of the query terms in the database. This method, called Mixed Stemming, computes term importance using a weighting scheme that uses the Term Frequency (TF) and the Inverse Document Frequency (IDF), called TFxIDF. An extended version of the Arabic IRS system is designed, implemented, and evaluated to reduce the number of irrelevant documents retrieved. The results of the experiment suggest that the proposed method outperforms the Word index method using the TFxIDF weighting scheme. It also outperforms the Stem index method using the Binary weighting scheme but does not outperform the Stem index method using the TFxIDF weighting scheme, and again it outperforms the Root index method using the Binary weighting scheme but does not outperform the Root index method using the TFxIDF weighting scheme.
  20. Hammache, A.; Boughanem, M.: Term position-based language model for information retrieval (2021) 0.09
    0.08573302 = product of:
      0.17146604 = sum of:
        0.011771114 = product of:
          0.047084454 = sum of:
            0.047084454 = weight(_text_:based in 216) [ClassicSimilarity], result of:
              0.047084454 = score(doc=216,freq=8.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.33289194 = fieldWeight in 216, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=216)
          0.25 = coord(1/4)
        0.15969492 = weight(_text_:term in 216) [ClassicSimilarity], result of:
          0.15969492 = score(doc=216,freq=16.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7290672 = fieldWeight in 216, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=216)
      0.5 = coord(2/4)
    
    Abstract
    Term position feature is widely and successfully used in IR and Web search engines, to enhance the retrieval effectiveness. This feature is essentially used for two purposes: to capture query terms proximity or to boost the weight of terms appearing in some parts of a document. In this paper, we are interested in this second category. We propose two novel query-independent techniques based on absolute term positions in a document, whose goal is to boost the weight of terms appearing in the beginning of a document. The first one considers only the earliest occurrence of a term in a document. The second one takes into account all term positions in a document. We formalize each of these two techniques as a document model based on term position, and then we incorporate it into a basic language model (LM). Two smoothing techniques, Dirichlet and Jelinek-Mercer, are considered in the basic LM. Experiments conducted on three TREC test collections show that our model, especially the version based on all term positions, achieves significant improvements over the baseline LMs, and it also often performs better than two state-of-the-art baseline models, the chronological term rank model and the Markov random field model.

Languages

  • e 182
  • d 5
  • chi 2
  • m 1

Types

  • a 178
  • m 6
  • el 4
  • s 3
  • p 2
  • r 2