Search (3364 results, page 1 of 169)

  • year_i:[2000 TO 2010}
  1. Losada, D.E.; Barreiro, A.: Embedding term similarity and inverse document frequency into a logical model of information retrieval (2003) 0.27
    0.26751098 = product of:
      0.35668132 = sum of:
        0.12775593 = weight(_text_:term in 1422) [ClassicSimilarity], result of:
          0.12775593 = score(doc=1422,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.58325374 = fieldWeight in 1422, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=1422)
        0.20348458 = weight(_text_:frequency in 1422) [ClassicSimilarity], result of:
          0.20348458 = score(doc=1422,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7360931 = fieldWeight in 1422, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0625 = fieldNorm(doc=1422)
        0.025440816 = product of:
          0.05088163 = sum of:
            0.05088163 = weight(_text_:22 in 1422) [ClassicSimilarity], result of:
              0.05088163 = score(doc=1422,freq=2.0), product of:
                0.16438834 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04694356 = queryNorm
                0.30952093 = fieldWeight in 1422, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1422)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    We propose a novel approach to incorporate term similarity and inverse document frequency into a logical model of information retrieval. The ability of the logic to handle expressive representations along with the use of such classical notions are promising characteristics for IR systems. The approach proposed here has been efficiently implemented and experiments against test collections are presented.
    Date
    22. 3.2003 19:27:23
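    The indented breakdown under each result is Lucene's ClassicSimilarity (TF-IDF) explanation of its score. As a reading aid, here is a minimal Python sketch that reproduces the "term" clause weight of result 1 from the numbers shown above; queryNorm and fieldNorm are taken directly from the explanation rather than recomputed.

      from math import sqrt, log

      # Reproduce the "term" clause weight for result 1 (doc 1422) above.
      # Constants are read off the explanation; formulas follow Lucene's
      # ClassicSimilarity: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)).
      freq, doc_freq, max_docs = 4.0, 1130, 44218
      query_norm, field_norm = 0.04694356, 0.0625

      tf = sqrt(freq)                             # 2.0
      idf = 1.0 + log(max_docs / (doc_freq + 1))  # ~4.66603
      query_weight = idf * query_norm             # ~0.21904
      field_weight = tf * idf * field_norm        # ~0.58325
      clause_score = query_weight * field_weight  # ~0.12776

      # The document's final score sums the matching clause scores and is
      # scaled by coord(3/4) = 0.75 because 3 of 4 query clauses matched.
      print(round(clause_score, 8))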
  2. Witschel, H.F.: Global term weights in distributed environments (2008) 0.25
    0.2514086 = sum of:
      0.0070626684 = product of:
        0.028250674 = sum of:
          0.028250674 = weight(_text_:based in 2096) [ClassicSimilarity], result of:
            0.028250674 = score(doc=2096,freq=2.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.19973516 = fieldWeight in 2096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.046875 = fieldNorm(doc=2096)
        0.25 = coord(1/4)
      0.117351316 = weight(_text_:term in 2096) [ClassicSimilarity], result of:
        0.117351316 = score(doc=2096,freq=6.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.5357528 = fieldWeight in 2096, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.046875 = fieldNorm(doc=2096)
      0.107914 = weight(_text_:frequency in 2096) [ClassicSimilarity], result of:
        0.107914 = score(doc=2096,freq=2.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.39037234 = fieldWeight in 2096, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.046875 = fieldNorm(doc=2096)
      0.019080611 = product of:
        0.038161222 = sum of:
          0.038161222 = weight(_text_:22 in 2096) [ClassicSimilarity], result of:
            0.038161222 = score(doc=2096,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.23214069 = fieldWeight in 2096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=2096)
        0.5 = coord(1/2)
    
    Abstract
    This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view on the collection is not available. In particular, the two options of either sampling documents or of using a reference corpus independent of the target retrieval collection are compared using standard IR test collections. In addition, the possibility of pruning term lists based on frequency is evaluated. The results show that very good retrieval performance can be reached when just the most frequent terms of a collection - an "extended stop word list" - are known and all terms which are not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus, but some "domain-specific stop words" need to be added. A good solution for achieving this is to mix estimates from small samples of the target retrieval collection with ones derived from a reference corpus.
    Date
    1. 8.2008 9:44:22
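    A minimal sketch of the estimation idea in the abstract above: mix an IDF estimate from a small sample of the target collection with one from an independent reference corpus. The counts and the mixing weight lam below are hypothetical; the paper's actual estimator is not reproduced here.

      from math import log

      def idf(df, n_docs):
          # Smoothed inverse document frequency.
          return log((n_docs + 1) / (df + 1))

      def mixed_idf(term, sample, reference, lam=0.5):
          """Mix an IDF estimate from a small sample of the target collection
          with one from an independent reference corpus (lam is hypothetical)."""
          idf_s = idf(sample["df"].get(term, 0), sample["n_docs"])
          idf_r = idf(reference["df"].get(term, 0), reference["n_docs"])
          return lam * idf_s + (1 - lam) * idf_r

      # Hypothetical counts: "retrieval" behaves like a domain-specific stop word
      # in the target collection but looks rare in the general reference corpus.
      sample = {"n_docs": 1_000, "df": {"retrieval": 400, "xylem": 1}}
      reference = {"n_docs": 1_000_000, "df": {"retrieval": 900, "xylem": 800}}
      for t in ("retrieval", "xylem"):
          print(t, round(mixed_idf(t, sample, reference), 3))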
  3. Chung, Y.M.; Lee, J.Y.: ¬A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.23
    0.23399408 = product of:
      0.3119921 = sum of:
        0.005885557 = product of:
          0.023542227 = sum of:
            0.023542227 = weight(_text_:based in 5769) [ClassicSimilarity], result of:
              0.023542227 = score(doc=5769,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.16644597 = fieldWeight in 5769, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5769)
          0.25 = coord(1/4)
        0.12624991 = weight(_text_:term in 5769) [ClassicSimilarity], result of:
          0.12624991 = score(doc=5769,freq=10.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5763782 = fieldWeight in 5769, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
        0.17985666 = weight(_text_:frequency in 5769) [ClassicSimilarity], result of:
          0.17985666 = score(doc=5769,freq=8.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.6506205 = fieldWeight in 5769, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
      0.75 = coord(3/4)
    
    Abstract
    Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of the z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas the cosine and Jaccard coefficients, as well as the X² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the X² statistic is the least affected by the frequency of terms. Third, although the cosine and Jaccard coefficients tend to emphasize high-frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
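    For orientation, the measures compared above can all be computed from a 2x2 term co-occurrence contingency table. A minimal sketch with the textbook formulas (the likelihood ratio is omitted for brevity); the exact variants and smoothing used in the study may differ.

      from math import sqrt, log

      def association_measures(a, b, c, d):
          """Association measures for a term pair from a 2x2 contingency table:
          a = docs with both terms, b/c = docs with only one, d = neither."""
          n = a + b + c + d
          return {
              "cosine": a / sqrt((a + b) * (a + c)),
              "jaccard": a / (a + b + c),
              "mutual_information": log(a * n / ((a + b) * (a + c))),
              "yules_y": (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c)),
              "chi_square": n * (a * d - b * c) ** 2
                            / ((a + b) * (c + d) * (a + c) * (b + d)),
          }

      print(association_measures(a=30, b=20, c=10, d=940))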
  4. Aizawa, A.: ¬An information-theoretic perspective of tf-idf measures (2003) 0.23
    0.22742893 = product of:
      0.30323857 = sum of:
        0.009416891 = product of:
          0.037667565 = sum of:
            0.037667565 = weight(_text_:based in 4155) [ClassicSimilarity], result of:
              0.037667565 = score(doc=4155,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.26631355 = fieldWeight in 4155, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0625 = fieldNorm(doc=4155)
          0.25 = coord(1/4)
        0.09033708 = weight(_text_:term in 4155) [ClassicSimilarity], result of:
          0.09033708 = score(doc=4155,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.41242266 = fieldWeight in 4155, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=4155)
        0.20348458 = weight(_text_:frequency in 4155) [ClassicSimilarity], result of:
          0.20348458 = score(doc=4155,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7360931 = fieldWeight in 4155, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0625 = fieldNorm(doc=4155)
      0.75 = coord(3/4)
    
    Abstract
    This paper presents a mathematical definition of the "probability-weighted amount of information" (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency - inverse document frequency measures that are commonly used in today's information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation.
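    The PWI itself is defined in the paper; as the point of reference it is compared against, here is a minimal sketch of the conventional tf-idf weight (the standard measure, not Aizawa's PWI).

      from math import log

      def tf_idf(term, doc_tokens, corpus):
          """Conventional tf-idf: term frequency in the document times the log
          inverse document frequency over the corpus (a list of token lists)."""
          tf = doc_tokens.count(term)
          df = sum(1 for d in corpus if term in d)
          idf = log(len(corpus) / df) if df else 0.0
          return tf * idf

      corpus = [["term", "frequency", "retrieval"],
                ["inverse", "document", "frequency"],
                ["probability", "information"]]
      print(tf_idf("frequency", corpus[0], corpus))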
  5. Kang, B.-Y.; Lee, S.-J.: Document indexing : a concept-based approach to term weight estimation (2005) 0.23
    0.22668329 = product of:
      0.3022444 = sum of:
        0.014125337 = product of:
          0.056501348 = sum of:
            0.056501348 = weight(_text_:based in 1038) [ClassicSimilarity], result of:
              0.056501348 = score(doc=1038,freq=8.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.39947033 = fieldWeight in 1038, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1038)
          0.25 = coord(1/4)
        0.13550562 = weight(_text_:term in 1038) [ClassicSimilarity], result of:
          0.13550562 = score(doc=1038,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.618634 = fieldWeight in 1038, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=1038)
        0.15261345 = weight(_text_:frequency in 1038) [ClassicSimilarity], result of:
          0.15261345 = score(doc=1038,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 1038, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=1038)
      0.75 = coord(3/4)
    
    Abstract
    Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.
  6. Schlieder, T.; Meuss, H.: Querying and ranking XML documents (2002) 0.22
    0.22358039 = product of:
      0.29810718 = sum of:
        0.009988121 = product of:
          0.039952483 = sum of:
            0.039952483 = weight(_text_:based in 459) [ClassicSimilarity], result of:
              0.039952483 = score(doc=459,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28246817 = fieldWeight in 459, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=459)
          0.25 = coord(1/4)
        0.13550562 = weight(_text_:term in 459) [ClassicSimilarity], result of:
          0.13550562 = score(doc=459,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.618634 = fieldWeight in 459, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=459)
        0.15261345 = weight(_text_:frequency in 459) [ClassicSimilarity], result of:
          0.15261345 = score(doc=459,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 459, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=459)
      0.75 = coord(3/4)
    
    Abstract
    XML represents both content and structure of documents. Taking advantage of the document structure promises to greatly improve the retrieval precision. In this article, we present a retrieval technique that adopts the similarity measure of the vector space model, incorporates the document structure, and supports structured queries. Our query model is based on tree matching as a simple and elegant means to formulate queries without knowing the exact structure of the data. Using this query model we propose a logical document concept by deciding on the document boundaries at query time. We combine structured queries and term-based ranking by extending the term concept to structural terms that include substructures of queries and documents. The notions of term frequency and inverse document frequency are adapted to logical documents and structural terms. We introduce an efficient technique to calculate all necessary term frequencies and inverse document frequencies at query time. By adjusting parameters of the retrieval process we are able to model two contrary approaches: the classical vector space model, and the original tree matching approach.
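    One way to picture the structural terms described above, offered only as an illustrative assumption rather than the paper's exact construction, is to qualify each token with the path of the XML element it occurs in, and then count tf and df over those path-qualified terms.

      from collections import Counter
      import xml.etree.ElementTree as ET

      def structural_terms(xml_text):
          """Count (element-path, token) pairs so that tf and df can be
          computed over structural terms instead of plain tokens."""
          root = ET.fromstring(xml_text)
          def walk(elem, path):
              p = f"{path}/{elem.tag}"
              for tok in (elem.text or "").split():
                  yield (p, tok.lower())
              for child in elem:
                  yield from walk(child, p)
          return Counter(walk(root, ""))

      doc = "<article><title>XML ranking</title><body>ranking of XML documents</body></article>"
      print(structural_terms(doc))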
  7. Wolfram, D.; Zhang, J.: ¬An investigation of the influence of indexing exhaustivity and term distributions on a document space (2002) 0.22
    0.2203361 = product of:
      0.29378146 = sum of:
        0.011771114 = product of:
          0.047084454 = sum of:
            0.047084454 = weight(_text_:based in 5238) [ClassicSimilarity], result of:
              0.047084454 = score(doc=5238,freq=8.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.33289194 = fieldWeight in 5238, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5238)
          0.25 = coord(1/4)
        0.12624991 = weight(_text_:term in 5238) [ClassicSimilarity], result of:
          0.12624991 = score(doc=5238,freq=10.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5763782 = fieldWeight in 5238, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5238)
        0.15576044 = weight(_text_:frequency in 5238) [ClassicSimilarity], result of:
          0.15576044 = score(doc=5238,freq=6.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.5634539 = fieldWeight in 5238, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5238)
      0.75 = coord(3/4)
    
    Abstract
    Wolfram and Zhang are interested in the effect of different indexing exhaustivity, by which they mean the number of terms chosen, and of different index term distributions and different term weighting methods on the resulting document cluster organization. The Distance Angle Retrieval Environment, DARE, which provides a two-dimensional display of retrieved documents, was used to represent the document clusters based upon a document's distance from the searcher's main interest, and on the angle formed by the document, a point representing a minor interest, and the point representing the main interest. If the centroid and the origin of the document space are assigned as major and minor points, the average distance between documents and the centroid can be measured, providing an indication of cluster organization in the form of a size-normalized similarity measure. Using 500 records from NTIS and nine models created by intersecting low, observed, and high exhaustivity levels (based upon a negative binomial distribution) with shallow, observed, and steep term distributions (based upon a Zipf distribution), simulation runs were performed using inverse document frequency, inter-document term frequency, and inverse document frequency based upon both inter- and intra-document frequencies. Low exhaustivity and shallow distributions result in a denser document space and less effective retrieval. High exhaustivity and steeper distributions result in a more diffuse space.
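    A minimal sketch of the DARE-style measurements mentioned above, assuming plain term vectors, the space centroid as the major reference point, and the origin as the minor one; placing the angle's vertex at the document is one possible reading of the description.

      import numpy as np

      def distance_and_angle(doc_vectors):
          """For each document vector, return its Euclidean distance from the
          centroid and the angle (degrees) at the document between the centroid
          (major point) and the origin (minor point); also return the average
          distance as a rough indicator of cluster organization."""
          X = np.asarray(doc_vectors, dtype=float)
          centroid = X.mean(axis=0)
          origin = np.zeros_like(centroid)
          per_doc = []
          for d in X:
              dist = np.linalg.norm(d - centroid)
              u, v = centroid - d, origin - d
              cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
              per_doc.append((dist, np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))))
          return per_doc, float(np.mean([d for d, _ in per_doc]))

      per_doc, avg_dist = distance_and_angle([[3, 0, 1], [2, 1, 0], [0, 2, 2], [1, 1, 1]])
      print(avg_dist)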
  8. Tsuji, K.; Kageura, K.: Automatic generation of Japanese-English bilingual thesauri based on bilingual corpora (2006) 0.20
    0.20397238 = product of:
      0.27196318 = sum of:
        0.014416611 = product of:
          0.057666443 = sum of:
            0.057666443 = weight(_text_:based in 5061) [ClassicSimilarity], result of:
              0.057666443 = score(doc=5061,freq=12.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.4077077 = fieldWeight in 5061, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5061)
          0.25 = coord(1/4)
        0.056460675 = weight(_text_:term in 5061) [ClassicSimilarity], result of:
          0.056460675 = score(doc=5061,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.25776416 = fieldWeight in 5061, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5061)
        0.20108588 = weight(_text_:frequency in 5061) [ClassicSimilarity], result of:
          0.20108588 = score(doc=5061,freq=10.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7274159 = fieldWeight in 5061, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5061)
      0.75 = coord(3/4)
    
    Abstract
    The authors propose a method for automatically generating Japanese-English bilingual thesauri based on bilingual corpora. The term bilingual thesaurus refers to a set of bilingual equivalent words and their synonyms. Most of the methods proposed so far for extracting bilingual equivalent word clusters from bilingual corpora depend heavily on word frequency and are not effective for dealing with low-frequency clusters. These low-frequency bilingual clusters are worth extracting because they contain many newly coined terms that are in demand but are not listed in existing bilingual thesauri. Assuming that single language-pair-independent methods such as frequency-based ones have reached their limitations and that a language-pair-dependent method used in combination with other methods shows promise, the authors propose the following approach: (a) Extract translation pairs based on transliteration patterns; (b) remove the pairs from among the candidate words; (c) extract translation pairs based on word frequency from the remaining candidate words; and (d) generate bilingual clusters based on the extracted pairs using a graph-theoretic method. The proposed method has been found to be significantly more effective than other methods.
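    Step (d) above builds bilingual clusters from the extracted pairs with a graph-theoretic method. A minimal sketch assuming the simplest such method, connected components of the graph whose edges are the extracted pairs; the Romanized example terms are hypothetical.

      from collections import defaultdict

      def bilingual_clusters(pairs):
          """Group translation pairs into clusters, taken here as the connected
          components of the undirected graph whose edges are the (ja, en) pairs."""
          graph = defaultdict(set)
          for ja, en in pairs:
              graph[("ja", ja)].add(("en", en))
              graph[("en", en)].add(("ja", ja))
          seen, clusters = set(), []
          for node in graph:
              if node in seen:
                  continue
              stack, component = [node], set()
              while stack:
                  cur = stack.pop()
                  if cur in component:
                      continue
                  component.add(cur)
                  stack.extend(graph[cur] - component)
              seen |= component
              clusters.append(sorted(component))
          return clusters

      pairs = [("keisanki", "computer"), ("konpyuta", "computer"), ("jouhou", "information")]
      print(bilingual_clusters(pairs))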
  9. Ruthven, I.; Lalmas, M.; Rijsbergen, K. van: Combining and selecting characteristics of information use (2002) 0.19
    0.18807887 = product of:
      0.25077182 = sum of:
        0.0066587473 = product of:
          0.02663499 = sum of:
            0.02663499 = weight(_text_:based in 5208) [ClassicSimilarity], result of:
              0.02663499 = score(doc=5208,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.18831211 = fieldWeight in 5208, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5208)
          0.25 = coord(1/4)
        0.11950473 = weight(_text_:term in 5208) [ClassicSimilarity], result of:
          0.11950473 = score(doc=5208,freq=14.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5455839 = fieldWeight in 5208, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.03125 = fieldNorm(doc=5208)
        0.12460835 = weight(_text_:frequency in 5208) [ClassicSimilarity], result of:
          0.12460835 = score(doc=5208,freq=6.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.45076314 = fieldWeight in 5208, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.03125 = fieldNorm(doc=5208)
      0.75 = coord(3/4)
    
    Abstract
    Ruthven, Lalmas, and van Rijsbergen use traditional term importance measures like inverse document frequency, noise, based upon in-document frequency, and term frequency, supplemented by theme value, which is calculated from differences of expected positions of words in a text from their actual positions, on the assumption that even distribution indicates term association with a main topic, and context, which is based on a query term's distance from the nearest other query term relative to the average expected distribution of all query terms in the document. They then define document characteristics like specificity, the sum of all idf values in a document over the total terms in the document, or document complexity, measured by the document's average idf value; and information to noise ratio, info-noise, tokens after stopping and stemming over tokens before these processes, measuring the ratio of useful and non-useful information in a document. Retrieval tests are then carried out using each characteristic, combinations of the characteristics, and relevance feedback to determine the correct combination of characteristics. A file ranks independently of query terms by both specificity and info-noise, but if the presence of a query term is required, unique rankings are generated. Tested on five standard collections, the traditional characteristics outperformed the new characteristics, which did, however, outperform random retrieval. All possible combinations of characteristics were also tested, both with and without a set of scaling weights applied. All characteristics can benefit by combination with another characteristic or set of characteristics, and performance as a single characteristic is a good indicator of performance in combination. Larger combinations tended to be more effective than smaller ones, and weighting increased precision measures of middle-ranking combinations but decreased the ranking of poorer combinations. The best combinations vary for each collection, and in some collections with the addition of weighting. Finally, with all documents ranked by the all-characteristics combination, they take the top 30 documents and calculate the characteristic scores for each term in both the relevant and the non-relevant sets. Then, taking for each query term the characteristics whose average was higher for relevant than for non-relevant documents, the documents are re-ranked. The relevance feedback method of selecting characteristics can select a good set of characteristics for query terms.
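    The document characteristics defined above translate directly into a few lines of code. A minimal sketch under one reading of the definitions (specificity averaged over all tokens, complexity over unique terms), with a toy stopword list and stemmer standing in for real preprocessing.

      from math import log

      STOPWORDS = {"the", "of", "a", "in", "and"}

      def idf(term, corpus):
          df = sum(1 for d in corpus if term in d)
          return log(len(corpus) / df) if df else 0.0

      def document_characteristics(tokens, corpus):
          """specificity: average idf over all tokens; complexity: average idf
          over unique terms; info_noise: tokens surviving stopping and (toy)
          stemming over tokens before these processes."""
          kept = [t.rstrip("s") for t in tokens if t not in STOPWORDS]
          unique = set(tokens)
          return {
              "specificity": sum(idf(t, corpus) for t in tokens) / len(tokens),
              "complexity": sum(idf(t, corpus) for t in unique) / len(unique),
              "info_noise": len(kept) / len(tokens),
          }

      corpus = [{"term", "weighting", "retrieval"}, {"retrieval", "evaluation"}, {"relevance", "feedback"}]
      print(document_characteristics(["the", "weighting", "of", "terms"], corpus))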
  10. Rölleke, T.; Tsikrika, T.; Kazai, G.: ¬A general matrix framework for modelling Information Retrieval (2006) 0.18
    0.18435149 = product of:
      0.24580199 = sum of:
        0.010194084 = product of:
          0.040776335 = sum of:
            0.040776335 = weight(_text_:based in 957) [ClassicSimilarity], result of:
              0.040776335 = score(doc=957,freq=6.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28829288 = fieldWeight in 957, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=957)
          0.25 = coord(1/4)
        0.07984746 = weight(_text_:term in 957) [ClassicSimilarity], result of:
          0.07984746 = score(doc=957,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.3645336 = fieldWeight in 957, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=957)
        0.15576044 = weight(_text_:frequency in 957) [ClassicSimilarity], result of:
          0.15576044 = score(doc=957,freq=6.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.5634539 = fieldWeight in 957, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=957)
      0.75 = coord(3/4)
    
    Abstract
    In this paper, we present a well-defined general matrix framework for modelling Information Retrieval (IR). In this framework, collections, documents and queries correspond to matrix spaces. Retrieval aspects, such as content, structure and semantics, are expressed by matrices defined in these spaces and by matrix operations applied on them. The dualities of these spaces are identified through the application of frequency-based operations on the proposed matrices and through the investigation of the meaning of their eigenvectors. This allows term weighting concepts used for content-based retrieval, such as term frequency and inverse document frequency, to translate directly to concepts for structure-based retrieval. In addition, concepts such as pagerank, authorities and hubs, determined by exploiting the structural relationships between linked documents, can be defined with respect to the semantic relationships between terms. Moreover, this mathematical framework can be used to express classical and alternative evaluation measures, involving, for instance, the structure of documents, and to further explain and relate IR models and theory. The high level of reusability and abstraction of the framework leads to a logical layer for IR that makes system design and construction significantly more efficient, and thus, better and increasingly personalised systems can be built at lower costs.
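    A minimal sketch of the duality described above, using a plain term-document count matrix in numpy: tf and idf then fall out of column and row operations on the same matrix. The framework in the paper is considerably more general.

      import numpy as np

      # Rows = terms, columns = documents; entries = raw term counts.
      terms = ["term", "frequency", "matrix"]
      C = np.array([[3, 0, 1],
                    [1, 2, 0],
                    [0, 1, 4]], dtype=float)

      tf = C / C.sum(axis=0, keepdims=True)   # within-document term frequencies
      df = (C > 0).sum(axis=1)                # document frequency per term
      idf = np.log(C.shape[1] / df)           # inverse document frequency
      tf_idf = tf * idf[:, None]              # combine by broadcasting

      for t, row in zip(terms, np.round(tf_idf, 3)):
          print(t, row)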
  11. Yang, L.; Ji, D.; Leong, M.: Document reranking by term distribution and maximal marginal relevance for chinese information retrieval (2007) 0.17
    0.17424598 = product of:
      0.23232798 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 907) [ClassicSimilarity], result of:
              0.028250674 = score(doc=907,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 907, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=907)
          0.25 = coord(1/4)
        0.117351316 = weight(_text_:term in 907) [ClassicSimilarity], result of:
          0.117351316 = score(doc=907,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5357528 = fieldWeight in 907, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=907)
        0.107914 = weight(_text_:frequency in 907) [ClassicSimilarity], result of:
          0.107914 = score(doc=907,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.39037234 = fieldWeight in 907, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=907)
      0.75 = coord(3/4)
    
    Abstract
    In this paper, we propose a document reranking method for Chinese information retrieval. The method is based on a term weighting scheme which integrates the local and global distribution of terms as well as document frequency, document positions and term length. The weighting scheme allows a larger portion of the retrieved documents to be randomly set as relevance feedback, and removes the concern that very few relevant documents appear among the top retrieved documents. It also helps to improve the performance of maximal marginal relevance (MMR) in document reranking. The method was evaluated by MAP (mean average precision), a recall-oriented measure. Significance tests showed that our method achieves significant improvement over standard baselines and outperforms relevant methods consistently.
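    The reranking step relies on maximal marginal relevance (MMR), a standard greedy trade-off between query similarity and redundancy among already-selected documents. A minimal sketch assuming precomputed similarities; the paper's own term weighting scheme is not reproduced, and lam is a hypothetical balance parameter.

      def mmr_rerank(doc_ids, query_sim, doc_sim, lam=0.7, k=5):
          """Greedy MMR: at each step pick the document that best trades off
          similarity to the query against redundancy with those already chosen."""
          selected, candidates = [], list(doc_ids)
          while candidates and len(selected) < k:
              def mmr(d):
                  redundancy = max((doc_sim[(d, s)] for s in selected), default=0.0)
                  return lam * query_sim[d] - (1 - lam) * redundancy
              best = max(candidates, key=mmr)
              selected.append(best)
              candidates.remove(best)
          return selected

      query_sim = {"d1": 0.9, "d2": 0.85, "d3": 0.4}
      doc_sim = {("d1", "d2"): 0.95, ("d2", "d1"): 0.95,
                 ("d1", "d3"): 0.1, ("d3", "d1"): 0.1,
                 ("d2", "d3"): 0.2, ("d3", "d2"): 0.2}
      print(mmr_rerank(["d1", "d2", "d3"], query_sim, doc_sim, k=2))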
  12. Wolfram, D.: Search characteristics in different types of Web-based IR environments : are they the same? (2008) 0.17
    0.16606814 = product of:
      0.22142418 = sum of:
        0.01647956 = product of:
          0.06591824 = sum of:
            0.06591824 = weight(_text_:based in 2093) [ClassicSimilarity], result of:
              0.06591824 = score(doc=2093,freq=8.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.46604872 = fieldWeight in 2093, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2093)
          0.25 = coord(1/4)
        0.079044946 = weight(_text_:term in 2093) [ClassicSimilarity], result of:
          0.079044946 = score(doc=2093,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.36086982 = fieldWeight in 2093, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2093)
        0.12589967 = weight(_text_:frequency in 2093) [ClassicSimilarity], result of:
          0.12589967 = score(doc=2093,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.45543438 = fieldWeight in 2093, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2093)
      0.75 = coord(3/4)
    
    Abstract
    Transaction logs from four different Web-based information retrieval environments (bibliographic databank, OPAC, search engine, specialized search system) were analyzed for empirical regularities in search characteristics to determine whether users engage in different behaviors in different Web-based search environments. Descriptive statistics and relative frequency distributions related to term usage, query formulation, and session duration were tabulated. The analysis revealed that there are differences in these characteristics. Users were more likely to engage in extensive searching using the OPAC and specialized search system. Surprisingly, the bibliographic databank search environment resulted in the most parsimonious searching, more similar to a general search engine. Although on the surface Web-based search facilities may appear similar, users do engage in different search behaviors.
  13. Robertson, S.: Understanding inverse document frequency : on theoretical arguments for IDF (2004) 0.16
    0.1598883 = product of:
      0.2131844 = sum of:
        0.00823978 = product of:
          0.03295912 = sum of:
            0.03295912 = weight(_text_:based in 4421) [ClassicSimilarity], result of:
              0.03295912 = score(doc=4421,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23302436 = fieldWeight in 4421, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4421)
          0.25 = coord(1/4)
        0.079044946 = weight(_text_:term in 4421) [ClassicSimilarity], result of:
          0.079044946 = score(doc=4421,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.36086982 = fieldWeight in 4421, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4421)
        0.12589967 = weight(_text_:frequency in 4421) [ClassicSimilarity], result of:
          0.12589967 = score(doc=4421,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.45543438 = fieldWeight in 4421, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4421)
      0.75 = coord(3/4)
    
    Abstract
    The term-weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
  14. Khoo, C.S.G.; Wan, K.-W.: ¬A simple relevancy-ranking strategy for an interface to Boolean OPACs (2004) 0.15
    0.14550374 = sum of:
      0.0058264043 = product of:
        0.023305617 = sum of:
          0.023305617 = weight(_text_:based in 2509) [ClassicSimilarity], result of:
            0.023305617 = score(doc=2509,freq=4.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.1647731 = fieldWeight in 2509, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.02734375 = fieldNorm(doc=2509)
        0.25 = coord(1/4)
      0.039522473 = weight(_text_:term in 2509) [ClassicSimilarity], result of:
        0.039522473 = score(doc=2509,freq=2.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.18043491 = fieldWeight in 2509, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.02734375 = fieldNorm(doc=2509)
      0.08902451 = weight(_text_:frequency in 2509) [ClassicSimilarity], result of:
        0.08902451 = score(doc=2509,freq=4.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.32204074 = fieldWeight in 2509, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.02734375 = fieldNorm(doc=2509)
      0.011130357 = product of:
        0.022260714 = sum of:
          0.022260714 = weight(_text_:22 in 2509) [ClassicSimilarity], result of:
            0.022260714 = score(doc=2509,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.1354154 = fieldWeight in 2509, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.02734375 = fieldNorm(doc=2509)
        0.5 = coord(1/2)
    
    Abstract
    A relevancy-ranking algorithm for a natural language interface to Boolean online public access catalogs (OPACs) was formulated and compared with that currently used in a knowledge-based search interface called the E-Referencer, being developed by the authors. The algorithm makes use of seven well-known ranking criteria: breadth of match, section weighting, proximity of query words, variant word forms (stemming), document frequency, term frequency and document length. The algorithm converts a natural language query into a series of increasingly broader Boolean search statements. In a small experiment with ten subjects in which the algorithm was simulated by hand, the algorithm obtained good results with a mean overall precision of 0.42 and mean average precision of 0.62, representing a 27 percent improvement in precision and 41 percent improvement in average precision compared to the E-Referencer. The usefulness of each step in the algorithm was analyzed and suggestions are made for improving the algorithm.
    Content
    "Most Web search engines accept natural language queries, perform some kind of fuzzy matching and produce ranked output, displaying first the documents that are most likely to be relevant. On the other hand, most library online public access catalogs (OPACs) an the Web are still Boolean retrieval systems that perform exact matching, and require users to express their search requests precisely in a Boolean search language and to refine their search statements to improve the search results. It is well-documented that users have difficulty searching Boolean OPACs effectively (e.g. Borgman, 1996; Ensor, 1992; Wallace, 1993). One approach to making OPACs easier to use is to develop a natural language search interface that acts as a middleware between the user's Web browser and the OPAC system. The search interface can accept a natural language query from the user and reformulate it as a series of Boolean search statements that are then submitted to the OPAC. The records retrieved by the OPAC are ranked by the search interface before forwarding them to the user's Web browser. The user, then, does not need to interact directly with the Boolean OPAC but with the natural language search interface or search intermediary. The search interface interacts with the OPAC system an the user's behalf. The advantage of this approach is that no modification to the OPAC or library system is required. Furthermore, the search interface can access multiple OPACs, acting as a meta search engine, and integrate search results from various OPACs before sending them to the user. The search interface needs to incorporate a method for converting the user's natural language query into a series of Boolean search statements, and for ranking the OPAC records retrieved. The purpose of this study was to develop a relevancyranking algorithm for a search interface to Boolean OPAC systems. This is part of an on-going effort to develop a knowledge-based search interface to OPACs called the E-Referencer (Khoo et al., 1998, 1999; Poo et al., 2000). E-Referencer v. 2 that has been implemented applies a repertoire of initial search strategies and reformulation strategies to retrieve records from OPACs using the Z39.50 protocol, and also assists users in mapping query keywords to the Library of Congress subject headings."
    Source
    Electronic library. 22(2004) no.2, S.112-120
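    A minimal sketch combining a few of the seven criteria named in the abstract (term frequency, document frequency, proximity of query words, document length). The 0.6/0.4 mix and the proximity definition are hypothetical and do not reproduce the E-Referencer's actual algorithm.

      from math import log

      def simple_relevancy(query_terms, doc_tokens, doc_freq, n_docs):
          """Toy relevancy score mixing length-normalised tf-idf with the
          inverse span covered by the matched query terms (proximity)."""
          positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t]
                       for t in query_terms}
          matched = [t for t in query_terms if positions[t]]
          if not matched:
              return 0.0
          tfidf = sum(len(positions[t]) * log(n_docs / doc_freq.get(t, 1))
                      for t in matched) / len(doc_tokens)
          span = (max(max(positions[t]) for t in matched)
                  - min(min(positions[t]) for t in matched) + 1)
          proximity = len(matched) / span
          return 0.6 * tfidf + 0.4 * proximity

      doc = "boolean opac search interfaces for boolean catalogs".split()
      print(simple_relevancy(["boolean", "opac"], doc, {"boolean": 20, "opac": 5}, 1000))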
  15. Price, L.; Thelwall, M.: ¬The clustering power of low frequency words in academic webs (2005) 0.14
    0.14488001 = product of:
      0.28976002 = sum of:
        0.00823978 = product of:
          0.03295912 = sum of:
            0.03295912 = weight(_text_:based in 3561) [ClassicSimilarity], result of:
              0.03295912 = score(doc=3561,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23302436 = fieldWeight in 3561, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=3561)
          0.25 = coord(1/4)
        0.28152025 = weight(_text_:frequency in 3561) [ClassicSimilarity], result of:
          0.28152025 = score(doc=3561,freq=10.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            1.0183823 = fieldWeight in 3561, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3561)
      0.5 = coord(2/4)
    
    Abstract
    The value of low frequency words for subject-based academic Web site clustering is assessed. A new technique is introduced to compare the relative clustering power of different vocabularies. The technique is designed for word frequency tests in large document clustering exercises. Results for the Australian and New Zealand academic Web spaces indicate that low frequency words are useful for clustering academic Web sites along subject lines; removing low frequency words results in sites becoming, on average, less dissimilar to sites from other subjects.
  16. Pu, H.-T.; Chuang, S.-L.; Yang, C.: Subject categorization of query terms for exploring Web users' search interests (2002) 0.14
    0.14214307 = product of:
      0.1895241 = sum of:
        0.005885557 = product of:
          0.023542227 = sum of:
            0.023542227 = weight(_text_:based in 587) [ClassicSimilarity], result of:
              0.023542227 = score(doc=587,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.16644597 = fieldWeight in 587, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=587)
          0.25 = coord(1/4)
        0.056460675 = weight(_text_:term in 587) [ClassicSimilarity], result of:
          0.056460675 = score(doc=587,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.25776416 = fieldWeight in 587, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=587)
        0.12717786 = weight(_text_:frequency in 587) [ClassicSimilarity], result of:
          0.12717786 = score(doc=587,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 587, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=587)
      0.75 = coord(3/4)
    
    Abstract
    Subject content analysis of Web query terms is essential to understand Web searching interests. Such analysis includes exploring search topics and observing changes in their frequency distributions with time. To provide a basis for in-depth analysis of users' search interests on a larger scale, this article presents a query categorization approach to automatically classifying Web query terms into broad subject categories. Because a query is short in length and simple in structure, its intended subject(s) of search is difficult to judge. Our approach, therefore, combines the search processes of real-world search engines to obtain highly ranked Web documents based on each unknown query term. These documents are used to extract cooccurring terms and to create a feature set. An effective ranking function has also been developed to find the most appropriate categories. Three search engine logs in Taiwan were collected and tested. They contained over 5 million queries from different periods of time. The achieved performance is quite encouraging compared with that of human categorization. The experimental results demonstrate that the approach is efficient in dealing with large numbers of queries and adaptable to the dynamic Web environment. Through good integration of human and machine efforts, the frequency distributions of subject categories in response to changes in users' search interests can be systematically observed in real time. The approach has also shown potential for use in various information retrieval applications, and provides a basis for further Web searching studies.
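    A minimal sketch of the categorization step described above, assuming the query's feature vector of co-occurring terms has already been extracted from highly ranked documents and that each subject category has a term profile; plain cosine similarity stands in for the paper's ranking function.

      from collections import Counter
      from math import sqrt

      def cosine(a, b):
          num = sum(a[t] * b[t] for t in set(a) & set(b))
          den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
          return num / den if den else 0.0

      def categorize(query_features, category_profiles, top=2):
          """Rank subject categories for a query by the similarity between its
          co-occurring-term feature vector and each category's term profile."""
          scored = [(cosine(query_features, profile), cat)
                    for cat, profile in category_profiles.items()]
          return sorted(scored, reverse=True)[:top]

      features = Counter({"mp3": 4, "download": 3, "player": 2})
      profiles = {"Computers & Internet": Counter({"download": 5, "software": 4, "player": 2}),
                  "Arts & Music": Counter({"music": 6, "mp3": 3, "concert": 2})}
      print(categorize(features, profiles))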
  17. White, H.D.: Combining bibliometrics, information retrieval, and relevance theory : part 2: some implications for information science (2007) 0.14
    0.14214307 = product of:
      0.1895241 = sum of:
        0.005885557 = product of:
          0.023542227 = sum of:
            0.023542227 = weight(_text_:based in 437) [ClassicSimilarity], result of:
              0.023542227 = score(doc=437,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.16644597 = fieldWeight in 437, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=437)
          0.25 = coord(1/4)
        0.056460675 = weight(_text_:term in 437) [ClassicSimilarity], result of:
          0.056460675 = score(doc=437,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.25776416 = fieldWeight in 437, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=437)
        0.12717786 = weight(_text_:frequency in 437) [ClassicSimilarity], result of:
          0.12717786 = score(doc=437,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 437, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=437)
      0.75 = coord(3/4)
    
    Abstract
    When bibliometric data are converted to term frequency (tf) and inverse document frequency (idf) values, plotted as pennant diagrams, and interpreted according to Sperber and Wilson's relevance theory (RT), the results evoke major variables of information science (IS). These include topicality, in the sense of intercohesion and intercoherence among texts; cognitive effects of texts in response to people's questions; people's levels of expertise as a precondition for cognitive effects; processing effort as textual or other messages are received; specificity of terms as it affects processing effort; relevance, defined in RT as the effects/effort ratio; and authority of texts and their authors. While such concerns figure automatically in dialogues between people, they become problematic when people create or use or judge literature-based information systems. The difficulty of achieving worthwhile cognitive effects and acceptable processing effort in human-system dialogues explains why relevance is the central concern of IS. Moreover, since relevant communication with both systems and unfamiliar people is uncertain, speakers tend to seek cognitive effects that cost them the least effort. Yet hearers need greater effort, often greater specificity, from speakers if their responses are to be highly relevant in their turn. This theme of mismatch manifests itself in vague reference questions, underdeveloped online searches, uncreative judging in retrieval evaluation trials, and perfunctory indexing. Another effect of least effort is a bias toward topical relevance over other kinds. RT can explain these outcomes as well as more adaptive ones. Pennant diagrams, applied here to a literature search and a Bradford-style journal analysis, can model them. Given RT and the right context, bibliometrics may predict psychometrics.
  18. Sparck Jones, K.: ¬A statistical interpretation of term specificity and its application in retrieval (2004) 0.14
    0.14199477 = product of:
      0.28398955 = sum of:
        0.15808989 = weight(_text_:term in 4420) [ClassicSimilarity], result of:
          0.15808989 = score(doc=4420,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.72173965 = fieldWeight in 4420, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4420)
        0.12589967 = weight(_text_:frequency in 4420) [ClassicSimilarity], result of:
          0.12589967 = score(doc=4420,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.45543438 = fieldWeight in 4420, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4420)
      0.5 = coord(2/4)
    
    Abstract
    The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing, in particular, that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
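    As a toy illustration of the weighting principle argued for in this abstract, the sketch below scores documents by summing collection-frequency weights of matched query terms, so that a match on a rare, specific term counts for more than a match on a frequent one. The log(N/n) weight form and the collection counts are illustrative assumptions, not the paper's exact experimental scheme.

      from math import log

      N = 1400                                        # documents in a toy collection (invented)
      doc_freq = {"retrieval": 900, "logic": 120, "pennant": 4}

      def specificity_weight(term):
          # rarer (more specific) terms receive larger weights
          return log(N / doc_freq[term])

      query = ["retrieval", "pennant"]
      docs = {"d1": {"retrieval"}, "d2": {"retrieval", "logic"}, "d3": {"pennant"}}
      scores = {d: sum(specificity_weight(t) for t in query if t in terms)
                for d, terms in docs.items()}
      print(sorted(scores.items(), key=lambda kv: -kv[1]))
      # d3's single match on the rare term outranks d1/d2's matches on frequent terms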
  19. Arsenault, C.: Aggregation consistency and frequency of Chinese words and characters (2006) 0.14
    0.14046668 = product of:
      0.28093335 = sum of:
        0.07984746 = weight(_text_:term in 609) [ClassicSimilarity], result of:
          0.07984746 = score(doc=609,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.3645336 = fieldWeight in 609, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=609)
        0.20108588 = weight(_text_:frequency in 609) [ClassicSimilarity], result of:
          0.20108588 = score(doc=609,freq=10.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7274159 = fieldWeight in 609, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=609)
      0.5 = coord(2/4)
    
    Abstract
    Purpose - Aims to measure syllable aggregation consistency of Romanized Chinese data in the title fields of bibliographic records. Also aims to verify whether the term frequency distributions satisfy conventional bibliometric laws.
    Design/methodology/approach - Uses Cooper's interindexer formula to evaluate aggregation consistency within and between two sets of Chinese bibliographic data. Compares the term frequency distributions of polysyllabic words and monosyllabic characters (for vernacular and Romanized data) with the Lotka and the generalised Zipf theoretical distributions. The fits are tested with the Kolmogorov-Smirnov test.
    Findings - Finds high internal aggregation consistency within each data set but some aggregation discrepancy between sets. Shows that word (polysyllabic) distributions satisfy Lotka's law but that character (monosyllabic) distributions do not abide by the law.
    Research limitations/implications - The findings are limited to only two sets of bibliographic data (for the aggregation consistency analysis) and to one set of data for the frequency distribution analysis, and only two bibliometric distributions are tested. Internal consistency within each database remains fairly high, so the main argument against syllable aggregation does not appear to hold true. The analysis revealed that Chinese words and characters behave differently in terms of frequency distribution but that there is no noticeable difference between vernacular and Romanized data. The distribution of Romanized characters exhibits the worst fit to either Lotka's or Zipf's law, which indicates that Romanized data in aggregated form appear to be a preferable option.
    Originality/value - Provides empirical data on consistency and distribution of Romanized Chinese titles in bibliographic records.
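    A minimal sketch of the kind of distribution-fit check described in this abstract: comparing an observed rank-frequency distribution of terms against a finite Zipf distribution using the Kolmogorov-Smirnov statistic. The counts and the exponent s below are invented for illustration; the paper's actual data, parameter estimation, and significance thresholds are not reproduced here.

      def ks_against_zipf(counts, s=1.0):
          counts = sorted(counts, reverse=True)          # rank-frequency form
          n_ranks, total = len(counts), sum(counts)
          # empirical CDF over ranks
          emp, running = [], 0
          for c in counts:
              running += c
              emp.append(running / total)
          # finite Zipf CDF: P(rank <= k) proportional to sum_{r<=k} r**(-s)
          h = [r ** -s for r in range(1, n_ranks + 1)]
          hsum = sum(h)
          theo, acc = [], 0.0
          for v in h:
              acc += v
              theo.append(acc / hsum)
          return max(abs(e - t) for e, t in zip(emp, theo))  # KS statistic D

      observed = [120, 60, 41, 30, 24, 20, 17, 15, 13, 12]   # toy term counts
      print(ks_against_zipf(observed, s=1.0))                # small D = close fit to Zipf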
  20. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.14
    0.13997644 = product of:
      0.18663526 = sum of:
        0.0066587473 = product of:
          0.02663499 = sum of:
            0.02663499 = weight(_text_:based in 5188) [ClassicSimilarity], result of:
              0.02663499 = score(doc=5188,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.18831211 = fieldWeight in 5188, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5188)
          0.25 = coord(1/4)
        0.07823421 = weight(_text_:term in 5188) [ClassicSimilarity], result of:
          0.07823421 = score(doc=5188,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.35716853 = fieldWeight in 5188, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.03125 = fieldNorm(doc=5188)
        0.10174229 = weight(_text_:frequency in 5188) [ClassicSimilarity], result of:
          0.10174229 = score(doc=5188,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.36804655 = fieldWeight in 5188, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.03125 = fieldNorm(doc=5188)
      0.75 = coord(3/4)
    
    Abstract
    Kim and Wilbur present three techniques for the algorithmic identification in text of content-bearing terms and phrases intended for human use as entry points or hyperlinks. Using a set of 1,075 terms from MEDLINE evaluated on a zero-to-four scale (stop word to definite content word), they evaluate the ranked lists of their three methods based on their placement of content words in the top ranks. Data consist of the natural language elements of 304,057 MEDLINE records from 1996 and 173,252 Wall Street Journal records from the TIPSTER collection. Phrases are extracted by breaking at punctuation marks and stop words, then normalized by lower-casing, replacement of non-alphanumerics with spaces, and reduction of multiple spaces. In the "strength of context" approach each document is a vector of binary values for each word or word pair. The words or word pairs are removed from all documents and the Robertson-Sparck Jones relevance weight for each term is computed; negative weights are replaced with zero, weights below a randomness threshold are ignored, and the remainder are summed for each document to yield a document score; each term is then assigned the average document score over the documents in which it occurred, and the average of these word scores is assigned to the original phrase. The "frequency clumping" approach defines a random phrase as one whose distribution among documents is Poisson in character. A p-value, the probability that a phrase's frequency of occurrence would be equal to or less than Poisson expectations, is computed, and a score is assigned which is the negative log of that value. In the "database comparison" approach, if a phrase occurring in a document allows prediction that the document is in MEDLINE rather than in the Wall Street Journal, it is considered to be content-bearing for MEDLINE; the score is computed by dividing the number of occurrences of the term in MEDLINE by its occurrences in the Journal and taking the product of all these values. The one hundred top- and bottom-ranked phrases that occurred in at least 500 documents were collected for each method; the union set had 476 phrases. A second selection was made of two-word phrases each occurring in only three documents, with a union of 599 phrases. A judge then ranked the two sets of terms as to subject specificity on a 0 to 4 scale. Precision was the average subject specificity of the first r ranks, recall the fraction of the subject-specific phrases in the first r ranks, and eleven-point average precision was used as a summary measure. The three methods all move content-bearing terms forward in the lists, as does the sum of the logs of the three methods' scores.
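    As a sketch of the "frequency clumping" idea described above, under one plausible reading (not necessarily the authors' exact formulation): a Poisson null model predicts how many distinct documents a phrase with a given total count should occur in; content-bearing phrases concentrate in fewer documents, and the score is the negative log of the lower-tail probability of the observed document frequency under a binomial approximation to that null.

      from math import exp, lgamma, log

      def log_binom_pmf(k, n, p):
          return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                  + k * log(p) + (n - k) * log(1 - p))

      def clumping_score(total_count, doc_freq, n_docs):
          # One plausible reading of "frequency clumping", for illustration only:
          # under a Poisson null, P(a document contains the phrase) = 1 - exp(-c/N).
          p = 1.0 - exp(-total_count / n_docs)
          lower_tail = sum(exp(log_binom_pmf(k, n_docs, p)) for k in range(doc_freq + 1))
          return -log(max(lower_tail, 1e-300))     # higher = more clumped = more content-bearing

      # Toy example: 400 occurrences packed into 50 of 10,000 documents clump far
      # more than 400 occurrences spread across 380 documents.
      print(clumping_score(400, 50, 10_000), clumping_score(400, 380, 10_000))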

Types

  • a 2873
  • m 312
  • el 208
  • s 112
  • x 36
  • b 31
  • r 11
  • i 10
  • n 7
  • p 2