Search (10098 results, page 1 of 505)

  1. Liu, R.-L.: Context-based term frequency assessment for text classification (2010) 0.47
    0.4729282 = sum of:
      0.0122329 = product of:
        0.0489316 = sum of:
          0.0489316 = weight(_text_:based in 3331) [ClassicSimilarity], result of:
            0.0489316 = score(doc=3331,freq=6.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.34595144 = fieldWeight in 3331, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.046875 = fieldNorm(doc=3331)
        0.25 = coord(1/4)
      0.1916339 = weight(_text_:term in 3331) [ClassicSimilarity], result of:
        0.1916339 = score(doc=3331,freq=16.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.8748806 = fieldWeight in 3331, product of:
            4.0 = tf(freq=16.0), with freq of:
              16.0 = termFreq=16.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.046875 = fieldNorm(doc=3331)
      0.18691254 = weight(_text_:frequency in 3331) [ClassicSimilarity], result of:
        0.18691254 = score(doc=3331,freq=6.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.6761447 = fieldWeight in 3331, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.046875 = fieldNorm(doc=3331)
      0.08214887 = product of:
        0.16429774 = sum of:
          0.16429774 = weight(_text_:assessment in 3331) [ClassicSimilarity], result of:
            0.16429774 = score(doc=3331,freq=6.0), product of:
              0.25917634 = queryWeight, product of:
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.04694356 = queryNorm
              0.63392264 = fieldWeight in 3331, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.046875 = fieldNorm(doc=3331)
        0.5 = coord(1/2)
    
    Abstract
    Automatic text classification (TC) is essential for the management of information. To properly classify a document d, it is essential to identify the semantics of each term t in d, and these semantics depend heavily on the context (neighboring terms) of t in d. Therefore, we present CTFA (Context-based Term Frequency Assessment), a technique that improves text classifiers by considering term contexts in test documents. The results of term context recognition are used to assess term frequencies, and hence CTFA can easily work with various kinds of text classifiers that base their TC decisions on term frequencies, without requiring any modification of the classifiers. Moreover, CTFA is efficient and requires neither huge amounts of memory nor domain-specific knowledge. Empirical results show that CTFA successfully enhances the performance of several kinds of text classifiers on different experimental data.
    Object
    Context-based Term Frequency Assessment
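    Note on the score breakdowns: the indented blocks under each hit are Lucene "explain" trees for ClassicSimilarity (TF-IDF) scoring, and every leaf can be recomputed from the numbers it shows (tf = sqrt(freq), idf = 1 + ln(maxDocs/(docFreq+1)), fieldWeight = tf * idf * fieldNorm, queryWeight = idf * queryNorm). The following Python sketch reproduces the weight(_text_:term) leaf of the first hit; the function and variable names are ours, not part of the system.

    import math

    def classic_similarity_leaf(freq, doc_freq, max_docs, query_norm, field_norm):
        """Recompute one leaf of a ClassicSimilarity explain tree."""
        tf = math.sqrt(freq)                             # 4.0 for freq=16.0
        idf = 1.0 + math.log(max_docs / (doc_freq + 1))  # 4.66603 for docFreq=1130, maxDocs=44218
        query_weight = idf * query_norm                  # 0.21904005
        field_weight = tf * idf * field_norm             # 0.8748806
        return query_weight * field_weight               # 0.1916339

    # weight(_text_:term in 3331) from result 1:
    print(classic_similarity_leaf(freq=16.0, doc_freq=1130, max_docs=44218,
                                  query_norm=0.04694356, field_norm=0.046875))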
  2. Losee, R.M.: Determining information retrieval and filtering performance without experimentation (1995) 0.31
    0.31449008 = sum of:
      0.00823978 = product of:
        0.03295912 = sum of:
          0.03295912 = weight(_text_:based in 3368) [ClassicSimilarity], result of:
            0.03295912 = score(doc=3368,freq=2.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.23302436 = fieldWeight in 3368, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3368)
        0.25 = coord(1/4)
      0.15808989 = weight(_text_:term in 3368) [ClassicSimilarity], result of:
        0.15808989 = score(doc=3368,freq=8.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.72173965 = fieldWeight in 3368, product of:
            2.828427 = tf(freq=8.0), with freq of:
              8.0 = termFreq=8.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.0546875 = fieldNorm(doc=3368)
      0.12589967 = weight(_text_:frequency in 3368) [ClassicSimilarity], result of:
        0.12589967 = score(doc=3368,freq=2.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.45543438 = fieldWeight in 3368, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.0546875 = fieldNorm(doc=3368)
      0.022260714 = product of:
        0.04452143 = sum of:
          0.04452143 = weight(_text_:22 in 3368) [ClassicSimilarity], result of:
            0.04452143 = score(doc=3368,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.2708308 = fieldWeight in 3368, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3368)
        0.5 = coord(1/2)
    
    Abstract
    The performance of an information retrieval or text and media filtering system may be determined through analytic methods as well as by traditional simulation or experimental methods. These analytic methods can provide precise statements about expected performance and can thus determine which of 2 similarly performing systems is superior. For both a single-query-term and a multiple-query-term retrieval model, a model for comparing the performance of different probabilistic retrieval methods is developed. This method may be used in computing the average search length for a query, given only knowledge of database parameter values. Describes predictive models for inverse document frequency, binary independence, and relevance-feedback-based retrieval and filtering. Simulations illustrate how the single-term model performs, and sample performance predictions are given for single-term and multiple-term problems.
    Date
    22. 2.1996 13:14:10
  3. Wong, S.K.M.; Yao, Y.Y.: ¬An information-theoretic measure of term specificity (1992) 0.30
    0.29912633 = product of:
      0.3988351 = sum of:
        0.011652809 = product of:
          0.046611235 = sum of:
            0.046611235 = weight(_text_:based in 4807) [ClassicSimilarity], result of:
              0.046611235 = score(doc=4807,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.3295462 = fieldWeight in 4807, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4807)
          0.25 = coord(1/4)
        0.20913327 = weight(_text_:term in 4807) [ClassicSimilarity], result of:
          0.20913327 = score(doc=4807,freq=14.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.9547718 = fieldWeight in 4807, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4807)
        0.17804901 = weight(_text_:frequency in 4807) [ClassicSimilarity], result of:
          0.17804901 = score(doc=4807,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.6440815 = fieldWeight in 4807, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4807)
      0.75 = coord(3/4)
    
    Abstract
    The inverse document frequency (IDF) and signal-noise ratio (S/N) approaches are term weighting schemes based on term specificity. However, the existing justifications for these methods are still somewhat inconclusive and sometimes even based on incompatible assumptions. Introduces an information-theoretic measure of term specificity. Shows that the IDF weighting scheme can be derived from the proposed approach by assuming that the frequency of occurrence of each index term is uniform within the set of documents containing the term. The information-theoretic interpretation of term specificity also establishes the relationship between the IDF and S/N methods.
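    As a brief illustration of the two schemes named in this abstract: the IDF values appearing throughout the explain trees above follow idf = 1 + ln(maxDocs/(docFreq+1)), and the signal value below follows the classical signal-noise formulation. The latter is only a sketch and may differ in detail from the formulation analyzed in the paper; all names are ours.

    import math

    def idf(doc_freq, max_docs):
        # e.g. idf(332, 44218) = 5.888745, as in the explain trees above
        return 1.0 + math.log(max_docs / (doc_freq + 1))

    def signal(term_freqs):
        """Classical signal value of an index term, computed from its
        per-document frequencies (textbook signal-noise formulation)."""
        total = sum(term_freqs)
        noise = sum((f / total) * math.log2(total / f) for f in term_freqs if f > 0)
        return math.log2(total) - noise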
  4. Efron, M.: Linear time series models for term weighting in information retrieval (2010) 0.29
    0.29163787 = product of:
      0.38885048 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 3688) [ClassicSimilarity], result of:
              0.028250674 = score(doc=3688,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 3688, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3688)
          0.25 = coord(1/4)
        0.1659598 = weight(_text_:term in 3688) [ClassicSimilarity], result of:
          0.1659598 = score(doc=3688,freq=12.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7576688 = fieldWeight in 3688, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=3688)
        0.215828 = weight(_text_:frequency in 3688) [ClassicSimilarity], result of:
          0.215828 = score(doc=3688,freq=8.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7807447 = fieldWeight in 3688, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=3688)
      0.75 = coord(3/4)
    
    Abstract
    Common measures of term importance in information retrieval (IR) rely on counts of term frequency; rare terms receive higher weight in document ranking than common terms do. However, realistic scenarios yield additional information about terms in a collection. Of interest in this article is the temporal behavior of terms as a collection changes over time. We propose capturing each term's collection frequency at discrete time intervals over the lifespan of a corpus and analyzing the resulting time series. We hypothesize that the collection frequency of a weakly discriminative term x at time t is predictable by a linear model of the term's prior observations. On the other hand, a linear time series model for a strong discriminator's collection frequency will yield a poor fit to the data. Operationalizing this hypothesis, we induce three time-based measures of term importance and test these against state-of-the-art term weighting models.
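    A minimal sketch of the hypothesis in this abstract: fit a linear autoregressive model to a term's collection-frequency time series and use the residual error as a discrimination signal, a poor fit suggesting a strong discriminator. The AR order, the least-squares fit, and the error measure are our assumptions, not taken from the article.

    import numpy as np

    def ar_fit_error(series, p=2):
        """Least-squares AR(p) fit of a term's collection-frequency series;
        returns the mean squared residual (low = weakly discriminative term)."""
        y = np.asarray(series, dtype=float)
        lags = np.column_stack([y[i:len(y) - p + i] for i in range(p)])
        design = np.column_stack([lags, np.ones(len(y) - p)])   # lagged values + intercept
        target = y[p:]
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        return float(np.mean(residuals ** 2))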
  5. Losada, D.E.; Barreiro, A.: Embedding term similarity and inverse document frequency into a logical model of information retrieval (2003) 0.27
    0.26751098 = product of:
      0.35668132 = sum of:
        0.12775593 = weight(_text_:term in 1422) [ClassicSimilarity], result of:
          0.12775593 = score(doc=1422,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.58325374 = fieldWeight in 1422, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=1422)
        0.20348458 = weight(_text_:frequency in 1422) [ClassicSimilarity], result of:
          0.20348458 = score(doc=1422,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7360931 = fieldWeight in 1422, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0625 = fieldNorm(doc=1422)
        0.025440816 = product of:
          0.05088163 = sum of:
            0.05088163 = weight(_text_:22 in 1422) [ClassicSimilarity], result of:
              0.05088163 = score(doc=1422,freq=2.0), product of:
                0.16438834 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04694356 = queryNorm
                0.30952093 = fieldWeight in 1422, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1422)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    We propose a novel approach to incorporate term similarity and inverse document frequency into a logical model of information retrieval. The ability of the logic to handle expressive representations, along with the use of such classical notions, is a promising characteristic for IR systems. The approach proposed here has been efficiently implemented, and experiments against test collections are presented.
    Date
    22. 3.2003 19:27:23
  6. Witschel, H.F.: Global term weights in distributed environments (2008) 0.25
    0.2514086 = sum of:
      0.0070626684 = product of:
        0.028250674 = sum of:
          0.028250674 = weight(_text_:based in 2096) [ClassicSimilarity], result of:
            0.028250674 = score(doc=2096,freq=2.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.19973516 = fieldWeight in 2096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.046875 = fieldNorm(doc=2096)
        0.25 = coord(1/4)
      0.117351316 = weight(_text_:term in 2096) [ClassicSimilarity], result of:
        0.117351316 = score(doc=2096,freq=6.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.5357528 = fieldWeight in 2096, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.046875 = fieldNorm(doc=2096)
      0.107914 = weight(_text_:frequency in 2096) [ClassicSimilarity], result of:
        0.107914 = score(doc=2096,freq=2.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.39037234 = fieldWeight in 2096, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.046875 = fieldNorm(doc=2096)
      0.019080611 = product of:
        0.038161222 = sum of:
          0.038161222 = weight(_text_:22 in 2096) [ClassicSimilarity], result of:
            0.038161222 = score(doc=2096,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.23214069 = fieldWeight in 2096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=2096)
        0.5 = coord(1/2)
    
    Abstract
    This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view on the collection is not available. In particular, the two options of either sampling documents or of using a reference corpus independent of the target retrieval collection are compared using standard IR test collections. In addition, the possibility of pruning term lists based on frequency is evaluated. The results show that very good retrieval performance can be reached when just the most frequent terms of a collection - an "extended stop word list" - are known and all terms which are not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus, but some "domain-specific stop words" need to be added. A good solution for achieving this is to mix estimates from small samples of the target retrieval collection with ones derived from a reference corpus.
    Date
    1. 8.2008 9:44:22
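    A sketch of the mixing idea described in the abstract of entry 6: combine relative document frequencies estimated from a small sample of the target collection with those from a reference corpus. The linear interpolation, the parameter lam, and all names below are our assumptions, not the paper's method.

    import math

    def mixed_idf(term, sample_df, sample_size, ref_df, ref_size, target_size, lam=0.5):
        """Estimate a global IDF when only a collection sample and a reference
        corpus are available (hypothetical linear interpolation of relative df)."""
        rel = lam * sample_df.get(term, 0) / sample_size \
            + (1 - lam) * ref_df.get(term, 0) / ref_size
        est_df = max(rel * target_size, 1.0)     # projected df in the target collection
        return math.log(target_size / est_df)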
  7. Zanibbi, R.; Yuan, B.: Keyword and image-based retrieval for mathematical expressions (2011) 0.25
    0.24943498 = sum of:
      0.009988121 = product of:
        0.039952483 = sum of:
          0.039952483 = weight(_text_:based in 3449) [ClassicSimilarity], result of:
            0.039952483 = score(doc=3449,freq=4.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.28246817 = fieldWeight in 3449, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.046875 = fieldNorm(doc=3449)
        0.25 = coord(1/4)
      0.06775281 = weight(_text_:term in 3449) [ClassicSimilarity], result of:
        0.06775281 = score(doc=3449,freq=2.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.309317 = fieldWeight in 3449, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.046875 = fieldNorm(doc=3449)
      0.15261345 = weight(_text_:frequency in 3449) [ClassicSimilarity], result of:
        0.15261345 = score(doc=3449,freq=4.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.55206984 = fieldWeight in 3449, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.046875 = fieldNorm(doc=3449)
      0.019080611 = product of:
        0.038161222 = sum of:
          0.038161222 = weight(_text_:22 in 3449) [ClassicSimilarity], result of:
            0.038161222 = score(doc=3449,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.23214069 = fieldWeight in 3449, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=3449)
        0.5 = coord(1/2)
    
    Abstract
    Two new methods for retrieving mathematical expressions using conventional keyword search and expression images are presented. An expression-level TF-IDF (term frequency-inverse document frequency) approach is used for keyword search, where queries and indexed expressions are represented by keywords taken from LaTeX strings. TF-IDF is computed at the level of individual expressions rather than documents to increase the precision of matching. The second retrieval technique is a form of Content-Based Image Retrieval (CBIR). Expressions are segmented into connected components, and then components in the query expression and each expression in the collection are matched using contour and density features, aspect ratios, and relative positions. In an experiment using ten randomly sampled queries from a corpus of over 22,000 expressions, precision-at-k (k = 20) for the keyword-based approach was higher (keyword: µ = 84.0, s = 19.0; image-based: µ = 32.0, s = 30.7), but for a few of the queries better results were obtained using a combination of the two techniques.
    Date
    22. 2.2017 12:53:49
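    For the keyword branch described in the abstract of entry 7, TF-IDF is computed with individual expressions rather than whole documents as the indexing unit. A minimal sketch under that reading; the tokenization and the exact weighting formula are our assumptions.

    import math
    from collections import Counter

    def expression_tfidf(expressions):
        """expressions: list of keyword lists, one per indexed expression
        (e.g. tokens drawn from LaTeX strings). Returns one TF-IDF vector per expression."""
        n = len(expressions)
        df = Counter(tok for expr in expressions for tok in set(expr))
        return [{tok: freq * math.log(n / df[tok]) for tok, freq in Counter(expr).items()}
                for expr in expressions]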
  8. Nunes, S.; Ribeiro, C.; David, G.: Term weighting based on document revision history (2011) 0.24
    0.2428341 = product of:
      0.3237788 = sum of:
        0.008323434 = product of:
          0.033293735 = sum of:
            0.033293735 = weight(_text_:based in 4946) [ClassicSimilarity], result of:
              0.033293735 = score(doc=4946,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23539014 = fieldWeight in 4946, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4946)
          0.25 = coord(1/4)
        0.15969492 = weight(_text_:term in 4946) [ClassicSimilarity], result of:
          0.15969492 = score(doc=4946,freq=16.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7290672 = fieldWeight in 4946, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4946)
        0.15576044 = weight(_text_:frequency in 4946) [ClassicSimilarity], result of:
          0.15576044 = score(doc=4946,freq=6.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.5634539 = fieldWeight in 4946, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4946)
      0.75 = coord(3/4)
    
    Abstract
    In real-world information retrieval systems, the underlying document collection is rarely stable or definitive. This work is focused on the study of signals extracted from the content of documents at different points in time for the purpose of weighting individual terms in a document. The basic idea behind our proposals is that terms that have existed for a longer time in a document should have a greater weight. We propose 4 term weighting functions that use each document's history to estimate a current term score. To evaluate this thesis, we conduct 3 independent experiments using a collection of documents sampled from Wikipedia. In the first experiment, we use data from Wikipedia to judge each set of terms. In a second experiment, we use an external collection of tags from a popular social bookmarking service as a gold standard. In the third experiment, we crowdsource user judgments to collect feedback on term preference. Across all experiments, results consistently support our thesis. We show that temporally aware measures, specifically the proposed revision term frequency and revision term frequency span, outperform a term-weighting measure based on raw term frequency alone.
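    One plausible reading of the "revision term frequency" measure named in this abstract, given only as a sketch: accumulate a term's frequency over all revisions of a document, so terms that have survived many revisions accumulate more weight. The functions actually proposed in the article may differ; names are ours.

    from collections import Counter

    def revision_term_frequency(revisions):
        """revisions: list of token lists, one per revision of a single document
        (oldest first). Returns a cumulative frequency score per term."""
        rtf = Counter()
        for tokens in revisions:
            rtf.update(tokens)
        return rtf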
  9. Weinberg, B.H.; Cunningham, J.A.: Online search strategy and term frequency statistics (1983) 0.23
    0.23422241 = product of:
      0.46844482 = sum of:
        0.18067417 = weight(_text_:term in 6892) [ClassicSimilarity], result of:
          0.18067417 = score(doc=6892,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.8248453 = fieldWeight in 6892, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.125 = fieldNorm(doc=6892)
        0.28777066 = weight(_text_:frequency in 6892) [ClassicSimilarity], result of:
          0.28777066 = score(doc=6892,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            1.0409929 = fieldWeight in 6892, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.125 = fieldNorm(doc=6892)
      0.5 = coord(2/4)
    
  10. Chung, Y.M.; Lee, J.Y.: ¬A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.23
    0.23399408 = product of:
      0.3119921 = sum of:
        0.005885557 = product of:
          0.023542227 = sum of:
            0.023542227 = weight(_text_:based in 5769) [ClassicSimilarity], result of:
              0.023542227 = score(doc=5769,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.16644597 = fieldWeight in 5769, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5769)
          0.25 = coord(1/4)
        0.12624991 = weight(_text_:term in 5769) [ClassicSimilarity], result of:
          0.12624991 = score(doc=5769,freq=10.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5763782 = fieldWeight in 5769, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
        0.17985666 = weight(_text_:frequency in 5769) [ClassicSimilarity], result of:
          0.17985666 = score(doc=5769,freq=8.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.6506205 = fieldWeight in 5769, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
      0.75 = coord(3/4)
    
    Abstract
    Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationships and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of z-scores. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas the cosine and Jaccard coefficients, as well as the X² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the X² statistic is the least affected by the frequency of terms. Third, although the cosine and Jaccard coefficients tend to emphasize high-frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
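    The six association measures compared in this abstract can all be computed from a 2x2 co-occurrence table, where a counts the documents containing both terms, b and c those containing only one of them, and d those containing neither. The sketch below uses standard textbook formulations, which may differ in detail (e.g. normalization or log base) from those used in the study.

    import math

    def association_measures(a, b, c, d):
        """Standard association measures from a 2x2 term co-occurrence table."""
        n = a + b + c + d
        measures = {
            "mutual_information": math.log2(a * n / ((a + b) * (a + c))),
            "yules_y": (math.sqrt(a * d) - math.sqrt(b * c)) /
                       (math.sqrt(a * d) + math.sqrt(b * c)),
            "cosine": a / math.sqrt((a + b) * (a + c)),
            "jaccard": a / (a + b + c),
            "chi_square": n * (a * d - b * c) ** 2 /
                          ((a + b) * (c + d) * (a + c) * (b + d)),
        }
        # log-likelihood ratio: 2 * sum of O * ln(O/E) over the four cells
        expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                    (c + d) * (a + c) / n, (c + d) * (b + d) / n]
        observed = [a, b, c, d]
        measures["log_likelihood"] = 2 * sum(o * math.log(o / e)
                                             for o, e in zip(observed, expected) if o > 0)
        return measures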
  11. Wong, W.Y.P.; Lee, D.L.: Implementation of partial document ranking using inverted files (1993) 0.23
    0.23035437 = product of:
      0.30713916 = sum of:
        0.013317495 = product of:
          0.05326998 = sum of:
            0.05326998 = weight(_text_:based in 6539) [ClassicSimilarity], result of:
              0.05326998 = score(doc=6539,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.37662423 = fieldWeight in 6539, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0625 = fieldNorm(doc=6539)
          0.25 = coord(1/4)
        0.09033708 = weight(_text_:term in 6539) [ClassicSimilarity], result of:
          0.09033708 = score(doc=6539,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.41242266 = fieldWeight in 6539, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=6539)
        0.20348458 = weight(_text_:frequency in 6539) [ClassicSimilarity], result of:
          0.20348458 = score(doc=6539,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7360931 = fieldWeight in 6539, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0625 = fieldNorm(doc=6539)
      0.75 = coord(3/4)
    
    Abstract
    Examines the implementations of document ranking based on inverted files. Studies three heuristic methods for implementing the term frequency × inverse document frequency weighting strategy. The basic idea of the heuristic methods is to process the query terms in an order such that as many top documents as possible can be identified without processing all of the query terms. The heuristics were evaluated and compared, and the results show improved performance. Two methods for estimating the retrieval accuracy were studied. All experiments were based on four test collections made available with the SMART system.
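    The heuristic idea in the abstract of entry 11 can be sketched as term-at-a-time processing in decreasing order of term weight, stopping once the remaining terms can no longer lift any lower-ranked or unseen document into the top k. This is a generic sketch of early termination, not the paper's three specific methods; all names are ours.

    def partial_rank(query_terms, postings, weights, k=10):
        """query_terms: list of terms; postings[t]: dict doc -> score contribution;
        weights[t]: upper bound on t's contribution. Returns an approximate top-k."""
        scores = {}
        ordered = sorted(query_terms, key=lambda t: weights[t], reverse=True)
        for i, term in enumerate(ordered):
            for doc, contribution in postings[term].items():
                scores[doc] = scores.get(doc, 0.0) + contribution
            top = sorted(scores.values(), reverse=True)
            remaining = sum(weights[t] for t in ordered[i + 1:])
            # stop when no document outside the current top-k can still catch up
            if len(top) > k and top[k - 1] >= top[k] + remaining:
                break
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]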
  12. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.23
    0.23006546 = product of:
      0.30675396 = sum of:
        0.010194084 = product of:
          0.040776335 = sum of:
            0.040776335 = weight(_text_:based in 1283) [ClassicSimilarity], result of:
              0.040776335 = score(doc=1283,freq=6.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28829288 = fieldWeight in 1283, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1283)
          0.25 = coord(1/4)
        0.16938202 = weight(_text_:term in 1283) [ClassicSimilarity], result of:
          0.16938202 = score(doc=1283,freq=18.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7732925 = fieldWeight in 1283, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1283)
        0.12717786 = weight(_text_:frequency in 1283) [ClassicSimilarity], result of:
          0.12717786 = score(doc=1283,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 1283, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1283)
      0.75 = coord(3/4)
    
    Abstract
    While term independence is a widely held assumption in most of the established information retrieval approaches, it is clearly not true and various works in the past have investigated a relaxation of the assumption. One approach is to use n-grams in document representation instead of unigrams. However, the majority of early works on n-grams obtained only modest performance improvement. On the other hand, the use of information based on supporting terms or "contexts" of queries has been found to be promising. In particular, recent studies showed that using new context-dependent term weights improved the performance of relevance feedback (RF) retrieval compared with using traditional bag-of-words BM25 term weights. Calculation of the new term weights requires an estimation of the local probability of relevance of each query term occurrence. In previous studies, the estimation of this probability was based on unigrams that occur in the neighborhood of a query term. We explore an integration of the n-gram and context approaches by computing context-dependent term weights based on a mixture of unigrams and bigrams. Extensive experiments are performed using the title queries of the Text Retrieval Conference (TREC)-6, TREC-7, TREC-8, and TREC-2005 collections, for RF with relevance judgment of either the top 10 or top 20 documents of an initial retrieval. We identify some crucial elements needed in the use of bigrams in our methods, such as proper inverse document frequency (IDF) weighting of the bigrams and noise reduction by pruning bigrams with large document frequency values. We show that enhancing context-dependent term weights with bigrams is effective in further improving retrieval performance.
  13. Paijmans, H.: Gravity wells of meaning : detecting information rich passages in scientific texts (1997) 0.23
    0.22742893 = product of:
      0.30323857 = sum of:
        0.009416891 = product of:
          0.037667565 = sum of:
            0.037667565 = weight(_text_:based in 7444) [ClassicSimilarity], result of:
              0.037667565 = score(doc=7444,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.26631355 = fieldWeight in 7444, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0625 = fieldNorm(doc=7444)
          0.25 = coord(1/4)
        0.09033708 = weight(_text_:term in 7444) [ClassicSimilarity], result of:
          0.09033708 = score(doc=7444,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.41242266 = fieldWeight in 7444, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=7444)
        0.20348458 = weight(_text_:frequency in 7444) [ClassicSimilarity], result of:
          0.20348458 = score(doc=7444,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7360931 = fieldWeight in 7444, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0625 = fieldNorm(doc=7444)
      0.75 = coord(3/4)
    
    Abstract
    Presents research in which 4 term weighting schemes were used to detect information-rich passages in texts and the results compared. Demonstrates that word categories and frequency-derived weights have a close correlation, but that weighting according to the first-mention theory or the cue method shows no correlation with frequency-based weights.
  14. Aizawa, A.: ¬An information-theoretic perspective of tf-idf measures (2003) 0.23
    0.22742893 = product of:
      0.30323857 = sum of:
        0.009416891 = product of:
          0.037667565 = sum of:
            0.037667565 = weight(_text_:based in 4155) [ClassicSimilarity], result of:
              0.037667565 = score(doc=4155,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.26631355 = fieldWeight in 4155, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0625 = fieldNorm(doc=4155)
          0.25 = coord(1/4)
        0.09033708 = weight(_text_:term in 4155) [ClassicSimilarity], result of:
          0.09033708 = score(doc=4155,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.41242266 = fieldWeight in 4155, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=4155)
        0.20348458 = weight(_text_:frequency in 4155) [ClassicSimilarity], result of:
          0.20348458 = score(doc=4155,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7360931 = fieldWeight in 4155, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0625 = fieldNorm(doc=4155)
      0.75 = coord(3/4)
    
    Abstract
    This paper presents a mathematical definition of the "probability-weighted amount of information" (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency-inverse document frequency (tf-idf) measures that are commonly used in today's information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation.
  15. Kang, B.-Y.; Lee, S.-J.: Document indexing : a concept-based approach to term weight estimation (2005) 0.23
    0.22668329 = product of:
      0.3022444 = sum of:
        0.014125337 = product of:
          0.056501348 = sum of:
            0.056501348 = weight(_text_:based in 1038) [ClassicSimilarity], result of:
              0.056501348 = score(doc=1038,freq=8.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.39947033 = fieldWeight in 1038, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1038)
          0.25 = coord(1/4)
        0.13550562 = weight(_text_:term in 1038) [ClassicSimilarity], result of:
          0.13550562 = score(doc=1038,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.618634 = fieldWeight in 1038, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=1038)
        0.15261345 = weight(_text_:frequency in 1038) [ClassicSimilarity], result of:
          0.15261345 = score(doc=1038,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 1038, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=1038)
      0.75 = coord(3/4)
    
    Abstract
    Traditional index weighting approaches for information retrieval from texts depend on term frequency-based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.
  16. Schlieder, T.; Meuss, H.: Querying and ranking XML documents (2002) 0.22
    0.22358039 = product of:
      0.29810718 = sum of:
        0.009988121 = product of:
          0.039952483 = sum of:
            0.039952483 = weight(_text_:based in 459) [ClassicSimilarity], result of:
              0.039952483 = score(doc=459,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28246817 = fieldWeight in 459, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=459)
          0.25 = coord(1/4)
        0.13550562 = weight(_text_:term in 459) [ClassicSimilarity], result of:
          0.13550562 = score(doc=459,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.618634 = fieldWeight in 459, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=459)
        0.15261345 = weight(_text_:frequency in 459) [ClassicSimilarity], result of:
          0.15261345 = score(doc=459,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 459, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=459)
      0.75 = coord(3/4)
    
    Abstract
    XML represents both content and structure of documents. Taking advantage of the document structure promises to greatly improve the retrieval precision. In this article, we present a retrieval technique that adopts the similarity measure of the vector space model, incorporates the document structure, and supports structured queries. Our query model is based on tree matching as a simple and elegant means to formulate queries without knowing the exact structure of the data. Using this query model we propose a logical document concept by deciding on the document boundaries at query time. We combine structured queries and term-based ranking by extending the term concept to structural terms that include substructures of queries and documents. The notions of term frequency and inverse document frequency are adapted to logical documents and structural terms. We introduce an efficient technique to calculate all necessary term frequencies and inverse document frequencies at query time. By adjusting parameters of the retrieval process we are able to model two contrary approaches: the classical vector space model, and the original tree matching approach.
  17. Wolfram, D.; Zhang, J.: ¬An investigation of the influence of indexing exhaustivity and term distributions on a document space (2002) 0.22
    0.2203361 = product of:
      0.29378146 = sum of:
        0.011771114 = product of:
          0.047084454 = sum of:
            0.047084454 = weight(_text_:based in 5238) [ClassicSimilarity], result of:
              0.047084454 = score(doc=5238,freq=8.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.33289194 = fieldWeight in 5238, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5238)
          0.25 = coord(1/4)
        0.12624991 = weight(_text_:term in 5238) [ClassicSimilarity], result of:
          0.12624991 = score(doc=5238,freq=10.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5763782 = fieldWeight in 5238, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5238)
        0.15576044 = weight(_text_:frequency in 5238) [ClassicSimilarity], result of:
          0.15576044 = score(doc=5238,freq=6.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.5634539 = fieldWeight in 5238, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5238)
      0.75 = coord(3/4)
    
    Abstract
    Wolfram and Zhang are interested in the effect of different indexing exhaustivity, by which they mean the number of terms chosen, and of different index term distributions and different term weighting methods on the resulting document cluster organization. The Distance Angle Retrieval Environment, DARE, which provides a two-dimensional display of retrieved documents, was used to represent the document clusters based upon a document's distance from the searcher's main interest, and on the angle formed by the document, a point representing a minor interest, and the point representing the main interest. If the centroid and the origin of the document space are assigned as major and minor points, the average distance between documents and the centroid can be measured, providing an indication of cluster organization in the form of a size-normalized similarity measure. Using 500 records from NTIS and nine models created by intersecting low, observed, and high exhaustivity levels (based upon a negative binomial distribution) with shallow, observed, and steep term distributions (based upon a Zipf distribution), simulation runs were performed using inverse document frequency, inter-document term frequency, and inverse document frequency based upon both inter- and intra-document frequencies. Low exhaustivity and shallow distributions result in a more dense document space and less effective retrieval. High exhaustivity and steeper distributions result in a more diffuse space.
  18. Tsuji, K.; Kageura, K.: Automatic generation of Japanese-English bilingual thesauri based on bilingual corpora (2006) 0.20
    0.20397238 = product of:
      0.27196318 = sum of:
        0.014416611 = product of:
          0.057666443 = sum of:
            0.057666443 = weight(_text_:based in 5061) [ClassicSimilarity], result of:
              0.057666443 = score(doc=5061,freq=12.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.4077077 = fieldWeight in 5061, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5061)
          0.25 = coord(1/4)
        0.056460675 = weight(_text_:term in 5061) [ClassicSimilarity], result of:
          0.056460675 = score(doc=5061,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.25776416 = fieldWeight in 5061, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5061)
        0.20108588 = weight(_text_:frequency in 5061) [ClassicSimilarity], result of:
          0.20108588 = score(doc=5061,freq=10.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7274159 = fieldWeight in 5061, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5061)
      0.75 = coord(3/4)
    
    Abstract
    The authors propose a method for automatically generating Japanese-English bilingual thesauri based on bilingual corpora. The term bilingual thesaurus refers to a set of bilingual equivalent words and their synonyms. Most of the methods proposed so far for extracting bilingual equivalent word clusters from bilingual corpora depend heavily on word frequency and are not effective for dealing with low-frequency clusters. These low-frequency bilingual clusters are worth extracting because they contain many newly coined terms that are in demand but are not listed in existing bilingual thesauri. Assuming that single language-pair-independent methods such as frequency-based ones have reached their limitations and that a language-pair-dependent method used in combination with other methods shows promise, the authors propose the following approach: (a) Extract translation pairs based on transliteration patterns; (b) remove the pairs from among the candidate words; (c) extract translation pairs based on word frequency from the remaining candidate words; and (d) generate bilingual clusters based on the extracted pairs using a graph-theoretic method. The proposed method has been found to be significantly more effective than other methods.
  19. Baayen, R.H.; Lieber, H.: Word frequency distributions and lexical semantics (1997) 0.20
    0.20030972 = product of:
      0.40061945 = sum of:
        0.35609803 = weight(_text_:frequency in 3117) [ClassicSimilarity], result of:
          0.35609803 = score(doc=3117,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            1.288163 = fieldWeight in 3117, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.109375 = fieldNorm(doc=3117)
        0.04452143 = product of:
          0.08904286 = sum of:
            0.08904286 = weight(_text_:22 in 3117) [ClassicSimilarity], result of:
              0.08904286 = score(doc=3117,freq=2.0), product of:
                0.16438834 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04694356 = queryNorm
                0.5416616 = fieldWeight in 3117, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=3117)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Relation between meaning, lexical productivity and frequency of use
    Date
    28. 2.1999 10:48:22
  20. Sun, Q.; Shaw, D.; Davis, C.H.: ¬A model for estimating the occurrence of same-frequency words and the boundary between high- and low-frequency words in texts (1999) 0.19
    0.19296938 = product of:
      0.38593876 = sum of:
        0.00823978 = product of:
          0.03295912 = sum of:
            0.03295912 = weight(_text_:based in 3063) [ClassicSimilarity], result of:
              0.03295912 = score(doc=3063,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23302436 = fieldWeight in 3063, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=3063)
          0.25 = coord(1/4)
        0.377699 = weight(_text_:frequency in 3063) [ClassicSimilarity], result of:
          0.377699 = score(doc=3063,freq=18.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            1.3663031 = fieldWeight in 3063, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3063)
      0.5 = coord(2/4)
    
    Abstract
    A simpler model is proposed for estimating the frequency of any same-frequency words and identifying the boundary point between high-frequency words and low-frequency words in a text. The model, based on a 'maximum-ranking method', assigns ranks to the words and estimates word frequency by a formula. The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text. This straightforward model was used successfully with both English and Chinese texts
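    The boundary rule in this abstract is concrete enough to state directly: with V distinct words in a text, a word is treated as high-frequency when its frequency exceeds sqrt(V). A short sketch (names are ours):

    import math
    from collections import Counter

    def high_frequency_words(tokens):
        """Split a text's vocabulary at the sqrt(V) boundary described above."""
        counts = Counter(tokens)
        boundary = math.sqrt(len(counts))     # V = number of different words
        return {w for w, f in counts.items() if f > boundary}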
