Search (2220 results, page 1 of 111)

  • Filter: year_i:[2000 TO 2010}
  1. Li, D.; Kwong, C.-P.; Lee, D.L.: Unified linear subspace approach to semantic analysis (2009) 0.35
    0.3499609 = product of:
      0.4666145 = sum of:
        0.21809667 = weight(_text_:vector in 3321) [ClassicSimilarity], result of:
          0.21809667 = score(doc=3321,freq=8.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.711459 = fieldWeight in 3321, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3321)
        0.21484317 = weight(_text_:space in 3321) [ClassicSimilarity], result of:
          0.21484317 = score(doc=3321,freq=18.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.86483204 = fieldWeight in 3321, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3321)
        0.0336747 = product of:
          0.0673494 = sum of:
            0.0673494 = weight(_text_:model in 3321) [ClassicSimilarity], result of:
              0.0673494 = score(doc=3321,freq=6.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.36792353 = fieldWeight in 3321, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3321)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
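    Each score above can be re-derived from its explain tree: Lucene's ClassicSimilarity scores one term clause as queryWeight * fieldWeight, with queryWeight = idf * queryNorm and fieldWeight = sqrt(tf) * idf * fieldNorm, and coord(m/n) scales the clause sum by the fraction of query clauses that matched. A short Python check against the first result's numbers (all constants copied from the tree above):

      import math

      def clause_score(freq, idf, query_norm, field_norm):
          """One weight(_text_:...) clause under Lucene ClassicSimilarity."""
          tf = math.sqrt(freq)                    # e.g. 2.828427 for freq=8
          query_weight = idf * query_norm         # idf * queryNorm
          field_weight = tf * idf * field_norm    # tf * idf * fieldNorm
          return query_weight * field_weight

      QUERY_NORM = 0.047605187
      FIELD_NORM = 0.0390625                      # fieldNorm(doc=3321)

      clauses = [
          clause_score(8.0, 6.439392, QUERY_NORM, FIELD_NORM),        # vector -> 0.21809667
          clause_score(18.0, 5.2183776, QUERY_NORM, FIELD_NORM),      # space  -> 0.21484317
          clause_score(6.0, 3.845226, QUERY_NORM, FIELD_NORM) * 0.5,  # model, coord(1/2)
      ]
      print(round(sum(clauses) * 0.75, 7))        # coord(3/4) -> 0.3499609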
    
    Abstract
    The Basic Vector Space Model (BVSM) is well known in information retrieval. Unfortunately, its retrieval effectiveness is limited because it is based on literal term matching. The Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) are two prominent semantic retrieval methods, both of which assume there is some underlying latent semantic structure in a dataset that can be used to improve retrieval performance. However, while this structure may be derived from both the term space and the document space, GVSM exploits only the former and LSI the latter. In this article, the latent semantic structure of a dataset is examined from a dual perspective; namely, we consider the term space and the document space simultaneously. This new viewpoint has a natural connection to the notion of kernels. Specifically, a unified kernel function can be derived for a class of vector space models. The dual perspective provides a deeper understanding of the semantic space and makes transparent the geometrical meaning of the unified kernel function. New semantic analysis methods based on the unified kernel function are developed, which combine the advantages of LSI and GVSM. We also prove that the new methods are stable: even when the selected rank of the truncated Singular Value Decomposition (SVD) is far from the optimum, retrieval performance is not degraded significantly. Experiments performed on standard test collections show that our methods are promising.
    Object
    Generalized Vector Space Model
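    The methods above build on LSI's rank-k truncation of the term-document matrix, the step whose stability the abstract discusses. As background, a minimal numpy sketch of plain LSI on a toy matrix (not the paper's unified kernel, which is derived on top of this):

      import numpy as np

      # Toy term-document matrix A (terms x documents); real systems use tf-idf weights.
      A = np.array([[2., 0., 1.],
                    [0., 1., 1.],
                    [1., 1., 0.],
                    [0., 2., 1.]])

      U, s, Vt = np.linalg.svd(A, full_matrices=False)
      k = 2                                         # selected rank of the truncated SVD
      A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A

      # Documents live in the k-dimensional latent space as rows of (S_k V_k^T)^T.
      docs_latent = (np.diag(s[:k]) @ Vt[:k, :]).T
      print(np.round(docs_latent, 3))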
  2. Xie, Y.; Raghavan, V.V.: Language-modeling kernel based approach for information retrieval (2007) 0.33
    0.32542068 = product of:
      0.43389422 = sum of:
        0.261716 = weight(_text_:vector in 1326) [ClassicSimilarity], result of:
          0.261716 = score(doc=1326,freq=8.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.8537508 = fieldWeight in 1326, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=1326)
        0.14884771 = weight(_text_:space in 1326) [ClassicSimilarity], result of:
          0.14884771 = score(doc=1326,freq=6.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.59917325 = fieldWeight in 1326, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=1326)
        0.023330513 = product of:
          0.046661027 = sum of:
            0.046661027 = weight(_text_:model in 1326) [ClassicSimilarity], result of:
              0.046661027 = score(doc=1326,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.25490487 = fieldWeight in 1326, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1326)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    In this presentation, we propose a novel integrated information retrieval approach that provides a unified solution for two challenging problems in the field of information retrieval. The first problem is how to build an optimal vector space corresponding to users' different information needs when applying the vector space model. The second one is how to smoothly incorporate the advantages of machine learning techniques into the language modeling approach. To solve these problems, we designed the language-modeling kernel function, which has all the modeling powers provided by language modeling techniques. In addition, for each information need, this kernel function automatically determines an optimal vector space, for which a discriminative learning machine, such as the support vector machine, can be applied to find an optimal decision boundary between relevant and nonrelevant documents. Large-scale experiments on standard test-beds show that our approach makes significant improvements over other state-of-the-art information retrieval methods.
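    The abstract does not spell out the kernel's closed form; as a loose illustration of the general idea (document representations derived from smoothed language models, handed to a discriminative learner as a kernel), one might sketch the following. The Jelinek-Mercer smoothing and the linear kernel over log-probabilities are assumptions made for illustration, not the authors' construction:

      import numpy as np
      from sklearn.svm import SVC

      def lm_features(doc_counts, collection_probs, lam=0.8):
          """Jelinek-Mercer smoothed log-probabilities as a document feature vector.
          Assumes every vocabulary term occurs somewhere in the collection."""
          p_doc = doc_counts / doc_counts.sum()
          return np.log(lam * p_doc + (1 - lam) * collection_probs)

      # Toy data: term counts for 4 docs over a 5-term vocabulary, labeled
      # relevant (1) / nonrelevant (0) for one information need.
      counts = np.array([[3, 0, 1, 0, 0],
                         [2, 1, 0, 0, 1],
                         [0, 0, 1, 3, 2],
                         [0, 1, 0, 2, 3]], dtype=float)
      coll = counts.sum(axis=0) / counts.sum()     # collection language model
      X = np.vstack([lm_features(c, coll) for c in counts])
      K = X @ X.T                                  # kernel matrix over LM features

      svm = SVC(kernel="precomputed").fit(K, [1, 1, 0, 0])
      print(svm.predict(K))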
  3. Dominich, S.; Kiezer, T.: ¬A measure theoretic approach to information retrieval (2007) 0.32
    0.32321647 = product of:
      0.4309553 = sum of:
        0.23081185 = weight(_text_:vector in 445) [ClassicSimilarity], result of:
          0.23081185 = score(doc=445,freq=14.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.7529375 = fieldWeight in 445, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.03125 = fieldNorm(doc=445)
        0.16204487 = weight(_text_:space in 445) [ClassicSimilarity], result of:
          0.16204487 = score(doc=445,freq=16.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.6522972 = fieldWeight in 445, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.03125 = fieldNorm(doc=445)
        0.03809857 = product of:
          0.07619714 = sum of:
            0.07619714 = weight(_text_:model in 445) [ClassicSimilarity], result of:
              0.07619714 = score(doc=445,freq=12.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.41625792 = fieldWeight in 445, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.03125 = fieldNorm(doc=445)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    The vector space model of information retrieval is one of the classical and widely applied retrieval models. Paradoxically, it has been characterized by a discrepancy between its formal framework and implementable form. The underlying concepts of the vector space model are mathematical terms: linear space, vector, and inner product. However, in the vector space model, the mathematical meaning of these concepts is not preserved. They are used as mere computational constructs or metaphors. Thus, the vector space model actually does not follow formally from the mathematical concepts on which it has been claimed to rest. This problem has been recognized for more than two decades, but no proper solution has emerged so far. The present article proposes a solution to this problem. First, the concept of retrieval is defined based on the mathematical measure theory. Then, retrieval is particularized using fuzzy set theory. As a result, the retrieval function is conceived as the cardinality of the intersection of two fuzzy sets. This view makes it possible to build a connection to linear spaces. It is shown that the classical and the generalized vector space models, as well as the latent semantic indexing model, gain a correct formal background with which they are consistent. At the same time it becomes clear that the inner product is not a necessary ingredient of the vector space model, and hence of Information Retrieval (IR). The Principle of Object Invariance is introduced to handle this situation. Moreover, this view makes it possible to consistently formulate new retrieval methods: in linear space with general basis, entropy-based, and probability-based. It is also shown that Information Retrieval may be viewed as integral calculus, and thus it gains a very compact and elegant mathematical way of writing. Also, Information Retrieval may thus be conceived as an application of mathematical measure theory.
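    The retrieval function described here, the cardinality of the intersection of two fuzzy sets, has a very direct computational form: with membership functions over the term vocabulary, the fuzzy intersection takes the pointwise minimum and the sigma-cardinality sums it. A minimal sketch under that standard reading:

      import numpy as np

      def fuzzy_rsv(mu_doc, mu_query):
          """Retrieval status value as the cardinality of the doc-query fuzzy
          intersection: intersection = pointwise min, cardinality = sum."""
          return np.minimum(mu_doc, mu_query).sum()

      # Memberships in [0, 1], e.g. normalized term weights over a 5-term vocabulary.
      doc = np.array([0.9, 0.0, 0.4, 0.7, 0.1])
      query = np.array([1.0, 0.0, 0.5, 0.0, 0.0])
      print(fuzzy_rsv(doc, query))   # min(0.9, 1.0) + min(0.4, 0.5) = 1.3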
  4. Billhardt, H.; Borrajo, D.; Maojo, V.: ¬A context vector model for information retrieval (2002) 0.31
    0.31448868 = product of:
      0.41931823 = sum of:
        0.2438395 = weight(_text_:vector in 251) [ClassicSimilarity], result of:
          0.2438395 = score(doc=251,freq=10.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.79543537 = fieldWeight in 251, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=251)
        0.12403977 = weight(_text_:space in 251) [ClassicSimilarity], result of:
          0.12403977 = score(doc=251,freq=6.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.49931106 = fieldWeight in 251, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=251)
        0.051438946 = product of:
          0.10287789 = sum of:
            0.10287789 = weight(_text_:model in 251) [ClassicSimilarity], result of:
              0.10287789 = score(doc=251,freq=14.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.56201243 = fieldWeight in 251, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=251)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    In the vector space model for information retrieval, term vectors are pair-wise orthogonal, that is, terms are assumed to be independent. It is well known that this assumption is too restrictive. In this article, we present our work on an indexing and retrieval method that, based on the vector space model, incorporates term dependencies and thus obtains semantically richer representations of documents. First, we generate term context vectors based on the co-occurrence of terms in the same documents. These vectors are used to calculate context vectors for documents. We present different techniques for estimating the dependencies among terms. We also define term weights that can be employed in the model. Experimental results on four text collections (MED, CRANFIELD, CISI, and CACM) show that the incorporation of term dependencies in the retrieval process performs statistically significantly better than the classical vector space model with idf weights. We also show that the degree of semantic matching versus direct word matching that performs best varies across the four collections. We conclude that the model performs well for certain types of queries and, generally, for information tasks with high recall requirements. Therefore, we propose the use of the context vector model in combination with other, direct word-matching methods.
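    The two-step construction in the abstract (term context vectors from document co-occurrence, then document context vectors from those) can be sketched in a few lines. The binary occurrence weights and plain row normalization below are illustrative choices, not the dependency estimates and term weights the paper evaluates:

      import numpy as np

      # Toy term-document matrix (terms x docs), binary occurrence.
      A = np.array([[1, 1, 0],
                    [1, 0, 1],
                    [0, 1, 1],
                    [0, 0, 1]], dtype=float)

      C = A @ A.T                                    # term-term co-occurrence counts
      term_ctx = C / C.sum(axis=1, keepdims=True)    # term context vectors

      # A document's context vector: sum of its terms' context vectors, renormalized.
      doc_ctx = A.T @ term_ctx
      doc_ctx /= doc_ctx.sum(axis=1, keepdims=True)
      print(np.round(doc_ctx, 3))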
  5. Dominich, S.; Góth, J.; Kiezer, T.; Szlávik, Z.: ¬An entropy-based interpretation of retrieval status value-based retrieval, and its application to the computation of term and query discrimination value (2004) 0.29
    0.28576547 = product of:
      0.38102064 = sum of:
        0.21809667 = weight(_text_:vector in 2237) [ClassicSimilarity], result of:
          0.21809667 = score(doc=2237,freq=8.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.711459 = fieldWeight in 2237, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2237)
        0.12403977 = weight(_text_:space in 2237) [ClassicSimilarity], result of:
          0.12403977 = score(doc=2237,freq=6.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.49931106 = fieldWeight in 2237, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2237)
        0.03888419 = product of:
          0.07776838 = sum of:
            0.07776838 = weight(_text_:model in 2237) [ClassicSimilarity], result of:
              0.07776838 = score(doc=2237,freq=8.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.42484146 = fieldWeight in 2237, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2237)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    The concepts of Shannon information and entropy have been applied to a number of information retrieval tasks such as to formalize the probabilistic model, to design practical retrieval systems, to cluster documents, and to model texture in image retrieval. In this report, the concept of entropy is used for a different purpose. It is shown that any positive Retrieval Status Value (RSV)-based retrieval system may be conceived as a special probability space in which the amount of the associated Shannon information is being reduced; in this view, the retrieval system is referred to as Uncertainty Decreasing Operation (UDO). The concept of UDO is then proposed as a theoretical background for term and query discrimination power, and it is applied to the computation of term and query discrimination values in the vector space retrieval model. Experimental evidence is given as regards such computation; the results obtained compare well to those obtained using vector-based calculation of term discrimination values. The UDO-based computation, however, presents advantages over the vector-based calculation: it is faster, easier to assess and handle in practice, and its application is not restricted to the vector space model. Based on the ADI test collection, it is shown that the UDO-based Term Discrimination Value (TDV) weighting scheme yields better retrieval effectiveness than using the vector-based TDV weighting scheme. Also, experimental evidence is given to the intuition that the choice of an appropriate weighting scheme and similarity measure depends on collection properties, and thus the UDO approach may be used as a theoretical basis for this intuition.
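    The vector-based calculation of term discrimination values that the experiments compare against is the classical space-density computation: a term's discrimination value is the change in average document-centroid similarity when the term is removed from the index. A minimal sketch of that baseline (the UDO-based computation is the paper's contribution and is not reproduced here):

      import numpy as np

      def space_density(A):
          """Average cosine similarity of documents (columns of A) to their centroid."""
          centroid = A.mean(axis=1, keepdims=True)
          num = (A * centroid).sum(axis=0)
          den = np.linalg.norm(A, axis=0) * np.linalg.norm(centroid)
          return (num / den).mean()

      def tdv(A, k):
          """Discrimination value of term k: density without k minus density with k."""
          return space_density(np.delete(A, k, axis=0)) - space_density(A)

      A = np.array([[2., 0., 1.],    # rows: terms, columns: documents (tf weights)
                    [0., 1., 1.],
                    [1., 1., 1.]])
      print([round(tdv(A, k), 4) for k in range(A.shape[0])])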
  6. Song, D.; Bruza, P.D.: Towards context sensitive information inference (2003) 0.28
    0.2764349 = product of:
      0.36857986 = sum of:
        0.15421765 = weight(_text_:vector in 1428) [ClassicSimilarity], result of:
          0.15421765 = score(doc=1428,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.5030775 = fieldWeight in 1428, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1428)
        0.14322878 = weight(_text_:space in 1428) [ClassicSimilarity], result of:
          0.14322878 = score(doc=1428,freq=8.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.5765547 = fieldWeight in 1428, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1428)
        0.07113342 = sum of:
          0.03888419 = weight(_text_:model in 1428) [ClassicSimilarity], result of:
            0.03888419 = score(doc=1428,freq=2.0), product of:
              0.1830527 = queryWeight, product of:
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.047605187 = queryNorm
              0.21242073 = fieldWeight in 1428, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1428)
          0.032249227 = weight(_text_:22 in 1428) [ClassicSimilarity], result of:
            0.032249227 = score(doc=1428,freq=2.0), product of:
              0.16670525 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047605187 = queryNorm
              0.19345059 = fieldWeight in 1428, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1428)
      0.75 = coord(3/4)
    
    Abstract
    Humans can make hasty, but generally robust, judgements about what a text fragment is, or is not, about. Such judgements are termed information inference. This article furnishes an account of information inference from a psychologistic stance. By drawing on theories from nonclassical logic and applied cognition, an information inference mechanism is proposed that makes inferences via computations of information flow through an approximation of a conceptual space. Within a conceptual space, information is represented geometrically. In this article, geometric representations of words are realized as vectors in a high-dimensional semantic space, which is automatically constructed from a text corpus. Two approaches are presented for priming vector representations according to context. The first approach uses a concept combination heuristic to adjust the vector representation of a concept in the light of the representation of another concept. The second approach computes a prototypical concept on the basis of exemplar trace texts and moves it in the dimensional space according to the context. Information inference is evaluated by measuring the effectiveness of query models derived by information flow computations. Results show that information flow contributes significantly to query model effectiveness, particularly with respect to precision. Moreover, retrieval effectiveness compares favorably with two probabilistic query models, and another based on semantic association. More generally, this article can be seen as a contribution towards realizing operational systems that mimic text-based human reasoning.
    Date
    22. 3.2003 19:35:46
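    The high-dimensional semantic space mentioned in the abstract is constructed automatically from corpus co-occurrence; a HAL-style construction (sliding window, closer neighbours weighted more heavily) is one common way to realize it. The sketch below uses a symmetric simplification and makes no claim to match the authors' parameters:

      from collections import defaultdict

      def hal_vectors(tokens, window=4):
          """HAL-style co-occurrence vectors: neighbours within the window
          contribute weight (window - distance + 1), accumulated per word."""
          vecs = defaultdict(lambda: defaultdict(float))
          for i, w in enumerate(tokens):
              for d in range(1, window + 1):
                  if i + d < len(tokens):
                      weight = window - d + 1
                      vecs[w][tokens[i + d]] += weight       # symmetric simplification
                      vecs[tokens[i + d]][w] += weight
          return vecs

      corpus = "the quick brown fox jumps over the lazy dog".split()
      print(dict(hal_vectors(corpus)["fox"]))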
  7. Schlieder, T.; Meuss, H.: Querying and ranking XML documents (2002) 0.27
    0.26907256 = product of:
      0.35876343 = sum of:
        0.18506117 = weight(_text_:vector in 459) [ClassicSimilarity], result of:
          0.18506117 = score(doc=459,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.603693 = fieldWeight in 459, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=459)
        0.12153365 = weight(_text_:space in 459) [ClassicSimilarity], result of:
          0.12153365 = score(doc=459,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.48922288 = fieldWeight in 459, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=459)
        0.05216862 = product of:
          0.10433724 = sum of:
            0.10433724 = weight(_text_:model in 459) [ClassicSimilarity], result of:
              0.10433724 = score(doc=459,freq=10.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.5699847 = fieldWeight in 459, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=459)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    XML represents both content and structure of documents. Taking advantage of the document structure promises to greatly improve the retrieval precision. In this article, we present a retrieval technique that adopts the similarity measure of the vector space model, incorporates the document structure, and supports structured queries. Our query model is based on tree matching as a simple and elegant means to formulate queries without knowing the exact structure of the data. Using this query model we propose a logical document concept by deciding on the document boundaries at query time. We combine structured queries and term-based ranking by extending the term concept to structural terms that include substructures of queries and documents. The notions of term frequency and inverse document frequency are adapted to logical documents and structural terms. We introduce an efficient technique to calculate all necessary term frequencies and inverse document frequencies at query time. By adjusting parameters of the retrieval process we are able to model two contrary approaches: the classical vector space model, and the original tree matching approach.
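    The notion of structural terms (plain terms extended with the substructure they occur in) can be made concrete with a toy extractor that pairs each term with its element path; tf and idf statistics are then computed over these pairs exactly as over ordinary terms. Using full paths as the structural context is an illustrative simplification of the paper's tree-matching terms:

      import xml.etree.ElementTree as ET
      from collections import Counter

      def structural_terms(xml_text):
          """Yield (path, term) pairs: each word tagged with its element path."""
          def walk(el, path):
              p = f"{path}/{el.tag}"
              if el.text:
                  for w in el.text.split():
                      yield (p, w.lower())
              for child in el:
                  yield from walk(child, p)
          return walk(ET.fromstring(xml_text), "")

      doc = "<article><title>XML ranking</title><body>ranking XML documents</body></article>"
      tf = Counter(structural_terms(doc))
      print(tf)   # ('/article/title', 'ranking') and ('/article/body', 'ranking') differ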
  8. Efron, M.: Query expansion and dimensionality reduction : Notions of optimality in Rocchio relevance feedback and latent semantic indexing (2008) 0.26
    0.26025337 = product of:
      0.34700447 = sum of:
        0.18506117 = weight(_text_:vector in 2020) [ClassicSimilarity], result of:
          0.18506117 = score(doc=2020,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.603693 = fieldWeight in 2020, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=2020)
        0.12153365 = weight(_text_:space in 2020) [ClassicSimilarity], result of:
          0.12153365 = score(doc=2020,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.48922288 = fieldWeight in 2020, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=2020)
        0.04040964 = product of:
          0.08081928 = sum of:
            0.08081928 = weight(_text_:model in 2020) [ClassicSimilarity], result of:
              0.08081928 = score(doc=2020,freq=6.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.44150823 = fieldWeight in 2020, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2020)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method's basis in least-squares optimization. Noting that LSI and Rocchio relevance feedback both alter the vector space model in a way that is in some sense least-squares optimal, we ask: what is the relationship between LSI's and Rocchio's notions of optimality? What does this relationship imply for IR? Using an analytical approach, we argue that Rocchio relevance feedback is optimal if we understand retrieval as a simplified classification problem. On the other hand, LSI's motivation comes to the fore if we understand it as a biased regression technique, where projection onto a low-dimensional orthogonal subspace of the documents reduces model variance.
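    The Rocchio update analyzed here has the standard closed form q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant), moving the query toward relevant documents and away from nonrelevant ones. A minimal sketch with conventional default weights:

      import numpy as np

      def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
          """Standard Rocchio relevance feedback on vector space queries."""
          q_new = alpha * q
          if len(rel):
              q_new = q_new + beta * np.mean(rel, axis=0)
          if len(nonrel):
              q_new = q_new - gamma * np.mean(nonrel, axis=0)
          return np.clip(q_new, 0, None)   # negative term weights are usually dropped

      q = np.array([1.0, 0.0, 0.5, 0.0])
      rel = np.array([[0.8, 0.2, 0.9, 0.0]])
      nonrel = np.array([[0.0, 0.9, 0.0, 0.7]])
      print(np.round(rocchio(q, rel, nonrel), 3))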
  9. Dubin, D.: ¬The most influential paper Gerard Salton never wrote (2004) 0.25
    0.2533339 = product of:
      0.33777854 = sum of:
        0.18887727 = weight(_text_:vector in 26) [ClassicSimilarity], result of:
          0.18887727 = score(doc=26,freq=6.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.6161416 = fieldWeight in 26, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=26)
        0.101278044 = weight(_text_:space in 26) [ClassicSimilarity], result of:
          0.101278044 = score(doc=26,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.40768576 = fieldWeight in 26, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=26)
        0.04762321 = product of:
          0.09524642 = sum of:
            0.09524642 = weight(_text_:model in 26) [ClassicSimilarity], result of:
              0.09524642 = score(doc=26,freq=12.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.5203224 = fieldWeight in 26, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=26)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Gerard Salton is often credited with developing the vector space model (VSM) for information retrieval (IR). Citations to Salton give the impression that the VSM must have been articulated as an IR model sometime between 1970 and 1975. However, the VSM as it is understood today evolved over a longer time period than is usually acknowledged, and an articulation of the model and its assumptions did not appear in print until several years after those assumptions had been criticized and alternative models proposed. An often cited overview paper titled "A Vector Space Model for Information Retrieval" (alleged to have been published in 1975) does not exist, and citations to it represent a confusion of two 1975 articles, neither of which was an overview of the VSM as a model of information retrieval. Until the late 1970s, Salton did not present vector spaces as models of IR generally but rather as models of specific computations. Citations to the phantom paper reflect an apparently widely held misconception that the operational features and explanatory devices now associated with the VSM must have been introduced at the same time it was first proposed as an IR model.
  10. Kalczynski, P.J.; Chou, A.: Temporal Document Retrieval Model for business news archives (2005) 0.23
    0.2250543 = product of:
      0.3000724 = sum of:
        0.15266767 = weight(_text_:vector in 1030) [ClassicSimilarity], result of:
          0.15266767 = score(doc=1030,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4980213 = fieldWeight in 1030, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1030)
        0.100260146 = weight(_text_:space in 1030) [ClassicSimilarity], result of:
          0.100260146 = score(doc=1030,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.4035883 = fieldWeight in 1030, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1030)
        0.04714458 = product of:
          0.09428916 = sum of:
            0.09428916 = weight(_text_:model in 1030) [ClassicSimilarity], result of:
              0.09428916 = score(doc=1030,freq=6.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.51509297 = fieldWeight in 1030, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1030)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Temporal expressions occurring in business news, such as "last week" or "at the end of this month," carry important information about the time context of the news document and were proved to be useful for document retrieval. We found that about 10% of these expressions are difficult to project onto the calendar due to the uncertainty about their bounds. This paper introduces a novel approach to representing temporal expressions. A user study is conducted to measure the degree of uncertainty for selected temporal expressions and a method for representing uncertainty based on fuzzy numbers is proposed. The classical Vector Space Model is extended to the Temporal Document Retrieval Model (TDRM) that incorporates the proposed fuzzy representations of temporal expressions.
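    Representing an uncertain temporal expression as a fuzzy number typically means a trapezoidal membership function over days: full membership on the core interval, sloping membership over the uncertain bounds. A sketch under that standard representation; the core and support chosen below are invented for illustration, not the user-study parameters of the paper:

      def trapezoid(a, b, c, d):
          """Fuzzy number with support [a, d] and core [b, c] (e.g. days of a month)."""
          def mu(x):
              if x < a or x > d:
                  return 0.0
              if b <= x <= c:
                  return 1.0
              return (x - a) / (b - a) if x < b else (d - x) / (d - c)
          return mu

      # "At the end of this month": certainly days 25..28, possibly days 20..31.
      end_of_month = trapezoid(20, 25, 28, 31)
      print([round(end_of_month(day), 2) for day in (18, 22, 27, 30)])  # 0.0 0.4 1.0 0.33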
  11. Terada, A.; Tokunaga, T.; Tanaka, H.: Automatic expansion of abbreviations by using context and character information (2004) 0.22
    0.22074673 = product of:
      0.29432896 = sum of:
        0.18506117 = weight(_text_:vector in 2560) [ClassicSimilarity], result of:
          0.18506117 = score(doc=2560,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.603693 = fieldWeight in 2560, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=2560)
        0.08593727 = weight(_text_:space in 2560) [ClassicSimilarity], result of:
          0.08593727 = score(doc=2560,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 2560, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=2560)
        0.023330513 = product of:
          0.046661027 = sum of:
            0.046661027 = weight(_text_:model in 2560) [ClassicSimilarity], result of:
              0.046661027 = score(doc=2560,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.25490487 = fieldWeight in 2560, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2560)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Unknown words such as proper nouns, abbreviations, and acronyms are a major obstacle in text processing. Abbreviations, in particular, are difficult to read/process because they are often domain specific. In this paper, we propose a method for automatic expansion of abbreviations by using context and character information. In previous studies, dictionaries were used to search for abbreviation expansion candidates (candidate words for the original forms of abbreviations). We use a corpus with few abbreviations from the same field instead of a dictionary. We calculate the adequacy of abbreviation expansion candidates based on the similarity between the context of the target abbreviation and that of its expansion candidate. The similarity is calculated using a vector space model in which each vector element consists of words surrounding the target abbreviation and those of its expansion candidate. Experiments using approximately 10,000 documents in the field of aviation showed that the accuracy of the proposed method is 10% higher than that of previously developed methods.
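    The adequacy computation described above (similarity between the context of an abbreviation and the context of each expansion candidate) reduces to a cosine over bag-of-words context vectors. A minimal sketch; the window size and raw counts are simplifications:

      import math
      from collections import Counter

      def context_vector(texts, target, window=3):
          """Bag of words within +/- window of each occurrence of target."""
          ctx = Counter()
          for text in texts:
              toks = text.lower().split()
              for i, t in enumerate(toks):
                  if t == target:
                      ctx.update(toks[max(0, i - window):i] + toks[i + 1:i + 1 + window])
          return ctx

      def cosine(c1, c2):
          num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
          den = math.sqrt(sum(v * v for v in c1.values())) * \
                math.sqrt(sum(v * v for v in c2.values()))
          return num / den if den else 0.0

      abbr_docs = ["the ir system ranks documents by similarity"]
      full_docs = ["the information retrieval system ranks documents by relevance"]
      print(cosine(context_vector(abbr_docs, "ir"), context_vector(full_docs, "retrieval")))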
  12. López-Pujalte, C.; Guerrero-Bote, V.P.; Moya-Anegón, F. de: Genetic algorithms in relevance feedback : a second test and new contributions (2003) 0.22
    0.21856591 = product of:
      0.2914212 = sum of:
        0.15266767 = weight(_text_:vector in 1076) [ClassicSimilarity], result of:
          0.15266767 = score(doc=1076,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4980213 = fieldWeight in 1076, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1076)
        0.100260146 = weight(_text_:space in 1076) [ClassicSimilarity], result of:
          0.100260146 = score(doc=1076,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.4035883 = fieldWeight in 1076, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1076)
        0.038493384 = product of:
          0.07698677 = sum of:
            0.07698677 = weight(_text_:model in 1076) [ClassicSimilarity], result of:
              0.07698677 = score(doc=1076,freq=4.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.4205716 = fieldWeight in 1076, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1076)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    The present work is the continuation of an earlier study which reviewed the literature on relevance feedback genetic techniques that follow the vector space model (the model that is most commonly used in this type of application), and implemented them so that they could be compared with each other as well as with one of the best traditional methods of relevance feedback--the Ide dec-hi method. We here carry out the comparisons on more test collections (Cranfield, CISI, Medline, and NPL), using the residual collection method for their evaluation as is recommended in this type of technique. We also add some fitness functions of our own design.
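    The Ide dec-hi baseline mentioned above is a Rocchio variant with fixed weights: add all known relevant document vectors at full weight and subtract only the highest-ranked nonrelevant one. A minimal sketch:

      import numpy as np

      def ide_dec_hi(q, rel, top_nonrel):
          """Ide dec-hi: q' = q + sum(relevant docs) - highest-ranked nonrelevant doc."""
          return np.clip(q + np.sum(rel, axis=0) - top_nonrel, 0, None)

      q = np.array([1.0, 0.0, 0.5, 0.0])
      rel = np.array([[0.6, 0.1, 0.8, 0.0],
                      [0.7, 0.0, 0.6, 0.1]])
      top_nonrel = np.array([0.1, 0.9, 0.0, 0.8])
      print(np.round(ide_dec_hi(q, rel, top_nonrel), 2))   # [2.2 0.  1.9 0. ]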
  13. Benoît, G.: Properties-based retrieval and user decision states : user control and behavior modeling (2004) 0.21
    0.21403947 = product of:
      0.28538597 = sum of:
        0.130858 = weight(_text_:vector in 2262) [ClassicSimilarity], result of:
          0.130858 = score(doc=2262,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 2262, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=2262)
        0.12153365 = weight(_text_:space in 2262) [ClassicSimilarity], result of:
          0.12153365 = score(doc=2262,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.48922288 = fieldWeight in 2262, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=2262)
        0.03299433 = product of:
          0.06598866 = sum of:
            0.06598866 = weight(_text_:model in 2262) [ClassicSimilarity], result of:
              0.06598866 = score(doc=2262,freq=4.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.36048993 = fieldWeight in 2262, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2262)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    As retrieval set size in information retrieval (IR) becomes larger, users may need greater interactive opportunities to determine for themselves the potential relevance of the resources offered by a given collection. A parts-of-document approach, coupled with an interactive graphic interface and control panel, permits end users to tailor the information seeking (IS) session. Applying the model described by the author in a previous paper in this journal, this paper explores two issues: whether a group of information seekers in the same research domain will want to use this type of IR interaction, and whether such interaction is more successful than relevancy-ranked lists based on the general vector model. In addition, the paper proposes the use of gradient space as a means of capturing end users' cognitive states (decision-making points) during a parts-of-document-based IR session. It concludes that, for a group of biomedical researchers, a parts-of-document approach is preferred for certain IR situations and that gradient space provides designers of systems with empirical evidence suited for systems analysis.
  14. Dolamic, L.; Savoy, J.: Indexing and searching strategies for the Russian language (2009) 0.21
    0.21224323 = product of:
      0.28299096 = sum of:
        0.15421765 = weight(_text_:vector in 3301) [ClassicSimilarity], result of:
          0.15421765 = score(doc=3301,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.5030775 = fieldWeight in 3301, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3301)
        0.101278044 = weight(_text_:space in 3301) [ClassicSimilarity], result of:
          0.101278044 = score(doc=3301,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.40768576 = fieldWeight in 3301, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3301)
        0.027495272 = product of:
          0.054990545 = sum of:
            0.054990545 = weight(_text_:model in 3301) [ClassicSimilarity], result of:
              0.054990545 = score(doc=3301,freq=4.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.30040827 = fieldWeight in 3301, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3301)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strategies we apply various probabilistic information retrieval (IR) models, including the Okapi, the Divergence from Randomness (DFR), a statistical language model (LM), as well as two vector-space approaches, namely, the classical tf idf scheme and the dtu-dtn model. We find that the vector-space dtu-dtn and the DFR models tend to result in better retrieval effectiveness than the Okapi, LM, or tf idf models, while only the latter two IR approaches result in statistically significant performance differences. Ignoring stemming generally reduces the MAP by more than 50%, and these differences are always significant. When applying an n-gram approach, performance differences are usually lower than an approach involving stemming. Finally, our light stemmer tends to perform best, although performance differences between the light, aggressive, and Snowball stemmers are not statistically significant.
  15. Vallet, D.; Fernández, M.; Castells, P.: ¬An ontology-based information retrieval model (2005) 0.20
    0.19759221 = product of:
      0.26345628 = sum of:
        0.130858 = weight(_text_:vector in 4708) [ClassicSimilarity], result of:
          0.130858 = score(doc=4708,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 4708, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=4708)
        0.08593727 = weight(_text_:space in 4708) [ClassicSimilarity], result of:
          0.08593727 = score(doc=4708,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 4708, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=4708)
        0.046661027 = product of:
          0.09332205 = sum of:
            0.09332205 = weight(_text_:model in 4708) [ClassicSimilarity], result of:
              0.09332205 = score(doc=4708,freq=8.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.50980973 = fieldWeight in 4708, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4708)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Semantic search has been one of the motivations of the Semantic Web since it was envisioned. We propose a model for the exploitation of ontology-based knowledge bases (KBs) to improve search over large document repositories. Our approach includes an ontology-based scheme for the semi-automatic annotation of documents, and a retrieval system. The retrieval model is based on an adaptation of the classic vector-space model, including an annotation weighting algorithm, and a ranking algorithm. Semantic search is combined with keyword-based search to achieve tolerance to KB incompleteness. Our proposal is illustrated with sample experiments showing improvements with respect to keyword-based search, and providing ground for further research and discussion.
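    Combining semantic and keyword-based search for tolerance to KB incompleteness suggests a score interpolation with a keyword fallback; the linear blend below is an assumption made for illustration, not necessarily the paper's ranking algorithm:

      def combined_score(sem_score, kw_score, lam=0.5):
          """Blend ontology-based and keyword-based scores; fall back to keywords
          when the KB has no annotations for the document (sem_score is None)."""
          if sem_score is None:
              return kw_score
          return lam * sem_score + (1 - lam) * kw_score

      print(combined_score(0.8, 0.4))    # 0.6
      print(combined_score(None, 0.4))   # 0.4, the KB-incompleteness fallback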
  16. Abdou, S.; Savoy, J.: Searching in Medline : query expansion and manual indexing evaluation (2008) 0.19
    0.19290367 = product of:
      0.2572049 = sum of:
        0.130858 = weight(_text_:vector in 2062) [ClassicSimilarity], result of:
          0.130858 = score(doc=2062,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 2062, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=2062)
        0.08593727 = weight(_text_:space in 2062) [ClassicSimilarity], result of:
          0.08593727 = score(doc=2062,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 2062, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=2062)
        0.04040964 = product of:
          0.08081928 = sum of:
            0.08081928 = weight(_text_:model in 2062) [ClassicSimilarity], result of:
              0.08081928 = score(doc=2062,freq=6.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.44150823 = fieldWeight in 2062, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2062)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Based on a relatively large subset representing one third of the Medline collection, this paper evaluates ten different IR models, including recent developments in both probabilistic and language models. We show that the best performing IR model is a probabilistic model developed within the Divergence from Randomness framework [Amati, G., & van Rijsbergen, C.J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), 357-389], which results in a 170% enhancement in mean average precision when compared to the classical tf idf vector-space model. This paper also reports on our evaluations of the impact of manually assigned descriptors (MeSH, or Medical Subject Headings) on retrieval effectiveness, showing that by including these terms retrieval performance can improve from 2.4% to 13.5%, depending on the underlying IR model. Finally, we design a new general blind-query expansion approach showing improved retrieval performance compared to that obtained using the Rocchio approach.
  17. Freitas-Junior, H.R.; Ribeiro-Neto, B.A.; Freitas-Vale, R. de; Laender, A.H.F.; Lima, L.R.S. de: Categorization-driven cross-language retrieval of medical information (2006) 0.19
    0.1888471 = product of:
      0.25179613 = sum of:
        0.10904834 = weight(_text_:vector in 5282) [ClassicSimilarity], result of:
          0.10904834 = score(doc=5282,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.3557295 = fieldWeight in 5282, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5282)
        0.07161439 = weight(_text_:space in 5282) [ClassicSimilarity], result of:
          0.07161439 = score(doc=5282,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.28827736 = fieldWeight in 5282, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5282)
        0.07113342 = sum of:
          0.03888419 = weight(_text_:model in 5282) [ClassicSimilarity], result of:
            0.03888419 = score(doc=5282,freq=2.0), product of:
              0.1830527 = queryWeight, product of:
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.047605187 = queryNorm
              0.21242073 = fieldWeight in 5282, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5282)
          0.032249227 = weight(_text_:22 in 5282) [ClassicSimilarity], result of:
            0.032249227 = score(doc=5282,freq=2.0), product of:
              0.16670525 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047605187 = queryNorm
              0.19345059 = fieldWeight in 5282, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5282)
      0.75 = coord(3/4)
    
    Abstract
    The Web has become a large repository of documents (or pages) written in many different languages. In this context, traditional information retrieval (IR) techniques cannot be used whenever the user query and the documents being retrieved are in different languages. To address this problem, new cross-language information retrieval (CLIR) techniques have been proposed. In this work, we describe a method for cross-language retrieval of medical information. This method combines query terms and related medical concepts obtained automatically through a categorization procedure. The medical concepts are used to create a linguistic abstraction that allows retrieval of information in a language-independent way, minimizing linguistic problems such as polysemy. To evaluate our method, we carried out experiments using the OHSUMED test collection, whose documents are written in English, with queries expressed in Portuguese, Spanish, and French. The results indicate that our cross-language retrieval method is as effective as a standard vector space model algorithm operating on queries and documents in the same language. Further, our results are better than previous results in the literature.
    Date
    22. 7.2006 16:46:36
  18. Mather, L.A.: ¬A linear algebra measure of cluster quality (2000) 0.19
    0.1873422 = product of:
      0.2497896 = sum of:
        0.130858 = weight(_text_:vector in 4767) [ClassicSimilarity], result of:
          0.130858 = score(doc=4767,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 4767, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=4767)
        0.08593727 = weight(_text_:space in 4767) [ClassicSimilarity], result of:
          0.08593727 = score(doc=4767,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 4767, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=4767)
        0.03299433 = product of:
          0.06598866 = sum of:
            0.06598866 = weight(_text_:model in 4767) [ClassicSimilarity], result of:
              0.06598866 = score(doc=4767,freq=4.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.36048993 = fieldWeight in 4767, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4767)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    One of the most common models in information retrieval (IR), the vector space model, represents a document set as a term-document matrix where each row corresponds to a term and each column corresponds to a document. Because of the use of matrices in IR, it is possible to apply linear algebra to this IR model. This paper describes an application of linear algebra to text clustering, namely, a metric for measuring cluster quality. The metric is based on the theory that cluster quality is proportional to the number of terms that are disjoint across the clusters. The metric compares the singular values of the term-document matrix to the singular values of the matrices for each of the clusters to determine the amount of overlap of the terms across clusters. Because the metric can be difficult to interpret, a standardization of the metric is defined, which specifies the number of standard deviations a clustering lies from the expected clustering of that document set. Empirical evidence shows that the standardized cluster metric correlates with clustered retrieval performance when comparing clustering algorithms or multiple parameters for the same clustering algorithm.
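    One plausible reading of the comparison described above: collect the singular values of the whole term-document matrix and of each cluster's column submatrix, and compare their magnitudes; for perfectly term-disjoint clusters, the cluster singular values jointly reproduce those of the full matrix. The ratio below is an illustrative stand-in for the paper's exact metric and standardization:

      import numpy as np

      def cluster_quality(A, clusters):
          """Sum of singular values of per-cluster submatrices (columns = docs),
          relative to the singular values of the whole term-document matrix."""
          full = np.linalg.svd(A, compute_uv=False).sum()
          parts = sum(np.linalg.svd(A[:, cols], compute_uv=False).sum()
                      for cols in clusters)
          return parts / full

      A = np.array([[2., 1., 0., 0.],   # terms x documents, two term-disjoint topics
                    [1., 2., 0., 0.],
                    [0., 0., 1., 2.],
                    [0., 0., 2., 1.]])
      print(round(cluster_quality(A, [[0, 1], [2, 3]]), 3))   # 1.0 for disjoint clusters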
  19. Manning, C.D.; Raghavan, P.; Schütze, H.: Introduction to information retrieval (2008) 0.19
    0.18575847 = product of:
      0.24767795 = sum of:
        0.15110183 = weight(_text_:vector in 4041) [ClassicSimilarity], result of:
          0.15110183 = score(doc=4041,freq=6.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4929133 = fieldWeight in 4041, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.03125 = fieldNorm(doc=4041)
        0.081022434 = weight(_text_:space in 4041) [ClassicSimilarity], result of:
          0.081022434 = score(doc=4041,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.3261486 = fieldWeight in 4041, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.03125 = fieldNorm(doc=4041)
        0.015553676 = product of:
          0.031107351 = sum of:
            0.031107351 = weight(_text_:model in 4041) [ClassicSimilarity], result of:
              0.031107351 = score(doc=4041,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.16993658 = fieldWeight in 4041, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.03125 = fieldNorm(doc=4041)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Content
    Contents: Boolean retrieval - The term vocabulary & postings lists - Dictionaries and tolerant retrieval - Index construction - Index compression - Scoring, term weighting & the vector space model - Computing scores in a complete search system - Evaluation in information retrieval - Relevance feedback & query expansion - XML retrieval - Probabilistic information retrieval - Language models for information retrieval - Text classification & Naive Bayes - Vector space classification - Support vector machines & machine learning on documents - Flat clustering - Hierarchical clustering - Matrix decompositions & latent semantic indexing - Web search basics - Web crawling and indexes - Link analysis. See the digital version at: http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf.
  20. Tseng, Y.-H.: Automatic cataloguing and searching for retrospective data by use of OCR text (2001) 0.18
    0.18009433 = product of:
      0.24012578 = sum of:
        0.130858 = weight(_text_:vector in 5421) [ClassicSimilarity], result of:
          0.130858 = score(doc=5421,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 5421, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=5421)
        0.08593727 = weight(_text_:space in 5421) [ClassicSimilarity], result of:
          0.08593727 = score(doc=5421,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 5421, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=5421)
        0.023330513 = product of:
          0.046661027 = sum of:
            0.046661027 = weight(_text_:model in 5421) [ClassicSimilarity], result of:
              0.046661027 = score(doc=5421,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.25490487 = fieldWeight in 5421, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5421)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    This article describes our efforts in supporting information retrieval from OCR-degraded text. In particular, we report our approach to an automatic cataloguing and searching contest for books in multiple languages. In this contest, 500 books in English, German, French, and Italian published during the 1770s to 1970s are scanned into images and OCRed into digital text. The goal is to use only automatic methods to extract information for sophisticated searching. We adopted the vector space retrieval model, an n-gram indexing method, and a special weighting scheme to tackle this problem. Although the performance of this approach is slightly inferior to the best approach, which is mainly based on regular expression matching, one advantage of our approach is that it is less language dependent and less layout sensitive, and thus readily applicable to other languages and document collections. Problems of OCR text retrieval for some Asian languages are also discussed in this article, and solutions are suggested.
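    Character n-gram indexing, the language-independent device mentioned above, simply replaces word tokens with overlapping character slices, which sidesteps word segmentation and tolerates some OCR noise. A minimal sketch:

      from collections import Counter

      def char_ngrams(text, n=2):
          """Overlapping character n-grams of a string, spaces removed."""
          text = text.replace(" ", "")
          return [text[i:i + n] for i in range(len(text) - n + 1)]

      index = Counter(char_ngrams("information retrieval"))
      print(index.most_common(5))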

Types

  • a 1877
  • m 216
  • el 144
  • s 80
  • b 27
  • x 21
  • i 9
  • n 6
  • r 4
