Search (305 results, page 1 of 16)

  • theme_ss:"Retrievalalgorithmen"
  1. Witschel, H.F.: Global term weights in distributed environments (2008) 0.11
    0.11432087 = product of:
      0.19053477 = sum of:
        0.14993818 = weight(_text_:list in 2096) [ClassicSimilarity], result of:
          0.14993818 = score(doc=2096,freq=6.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.59518665 = fieldWeight in 2096, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.046875 = fieldNorm(doc=2096)
        0.020843314 = weight(_text_:of in 2096) [ClassicSimilarity], result of:
          0.020843314 = score(doc=2096,freq=14.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.2742677 = fieldWeight in 2096, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=2096)
        0.019753272 = product of:
          0.039506543 = sum of:
            0.039506543 = weight(_text_:22 in 2096) [ClassicSimilarity], result of:
              0.039506543 = score(doc=2096,freq=2.0), product of:
                0.17018363 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04859849 = queryNorm
                0.23214069 = fieldWeight in 2096, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2096)
          0.5 = coord(1/2)
      0.6 = coord(3/5)
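
The explain tree above can be re-derived by hand. The following is a minimal sketch, assuming the classic Lucene TF-IDF formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))), which agree with the idf values printed in the tree. The function names are illustrative and not part of any library; queryNorm, fieldNorm and the coord factors are copied from the listing rather than recomputed.

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the explain output
QUERY_NORM = 0.04859849   # queryNorm, copied from the explain output


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency as it appears in the explain tree."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, field_norm, query_norm=QUERY_NORM):
    """score = queryWeight * fieldWeight for a single query term."""
    term_idf = idf(doc_freq)
    query_weight = term_idf * query_norm                    # idf * queryNorm
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


# Hit 1 (doc=2096): terms "list", "of", and the nested clause containing "22".
list_score = term_score(freq=6.0, doc_freq=673, field_norm=0.046875)
of_score = term_score(freq=14.0, doc_freq=25162, field_norm=0.046875)
nested_22 = term_score(freq=2.0, doc_freq=3622, field_norm=0.046875) * 0.5  # coord(1/2)

# Outer coord(3/5): three of the five query clauses matched.
total = (list_score + of_score + nested_22) * 0.6
print(round(total, 6))
```

The same function applies to the other hits by substituting their freq, docFreq, fieldNorm and coord values.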
    
    Abstract
    This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view on the collection is not available. In particular, the two options of either sampling documents or of using a reference corpus independent of the target retrieval collection are compared using standard IR test collections. In addition, the possibility of pruning term lists based on frequency is evaluated. The results show that very good retrieval performance can be reached when just the most frequent terms of a collection - an "extended stop word list" - are known and all terms which are not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus, but some "domain-specific stop words" need to be added. A good solution for achieving this is to mix estimates from small samples of the target retrieval collection with ones derived from a reference corpus.
    Date
    1. 8.2008 9:44:22
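
The "extended stop word list" idea from the abstract of hit 1 can be illustrated with a small sketch. It is not the author's implementation: the function names, the whitespace tokenisation, the IDF variant, and the choice of the shared weight for unlisted terms (here the largest weight on the list) are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_extended_stop_list(sample_docs, k):
    """Estimate weights for the k most frequent terms of a document sample.

    Returns (weights, default_weight). Terms outside the list all share
    default_weight, i.e. they are "treated equally".
    """
    doc_freq = Counter()
    for doc in sample_docs:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    n = len(sample_docs)
    weights = {
        term: 1.0 + math.log(n / (df + 1))
        for term, df in doc_freq.most_common(k)
    }
    # Assumption: unlisted terms are rarer than listed ones, so they get
    # the highest weight found on the list.
    default_weight = max(weights.values()) if weights else 1.0
    return weights, default_weight


def term_weight(term, weights, default_weight):
    """Look up a term's weight, falling back to the shared default."""
    return weights.get(term.lower(), default_weight)


sample = ["the cat sat", "the dog sat", "the bird flew", "the fish"]
weights, default_weight = build_extended_stop_list(sample, k=2)
print(term_weight("the", weights, default_weight))
print(term_weight("zebra", weights, default_weight))
```

Mixing estimates from a sample with those from a reference corpus, as the abstract suggests, would amount to merging two such weight tables; how to combine them is left open here.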
  2. Faloutsos, C.: Signature files (1992) 0.10
    0.095972225 = product of:
      0.1599537 = sum of:
        0.115422465 = weight(_text_:list in 3499) [ClassicSimilarity], result of:
          0.115422465 = score(doc=3499,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.45817488 = fieldWeight in 3499, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0625 = fieldNorm(doc=3499)
        0.018193537 = weight(_text_:of in 3499) [ClassicSimilarity], result of:
          0.018193537 = score(doc=3499,freq=6.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.23940048 = fieldWeight in 3499, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0625 = fieldNorm(doc=3499)
        0.026337698 = product of:
          0.052675396 = sum of:
            0.052675396 = weight(_text_:22 in 3499) [ClassicSimilarity], result of:
              0.052675396 = score(doc=3499,freq=2.0), product of:
                0.17018363 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04859849 = queryNorm
                0.30952093 = fieldWeight in 3499, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=3499)
          0.5 = coord(1/2)
      0.6 = coord(3/5)
    
    Abstract
    Presents a survey and discussion of signature-based text retrieval methods. Describes the main idea behind the signature approach and its advantages over other text retrieval methods; provides a classification of the signature methods that have appeared in the literature; describes the main representatives of each class, together with their relative advantages and drawbacks; and gives a list of applications as well as commercial or university prototypes that use the signature approach.
    Date
    7. 5.1999 15:22:48
  3. Green, R.: Topical relevance relationships : 2: an exploratory study and preliminary typology (1995) 0.07
    0.073380135 = product of:
      0.12230022 = sum of:
        0.026128478 = weight(_text_:of in 3724) [ClassicSimilarity], result of:
          0.026128478 = score(doc=3724,freq=22.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.34381276 = fieldWeight in 3724, product of:
              4.690416 = tf(freq=22.0), with freq of:
                22.0 = termFreq=22.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=3724)
        0.058281917 = weight(_text_:subject in 3724) [ClassicSimilarity], result of:
          0.058281917 = score(doc=3724,freq=4.0), product of:
            0.17381717 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.04859849 = queryNorm
            0.33530587 = fieldWeight in 3724, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=3724)
        0.03788982 = product of:
          0.07577964 = sum of:
            0.07577964 = weight(_text_:headings in 3724) [ClassicSimilarity], result of:
              0.07577964 = score(doc=3724,freq=2.0), product of:
                0.23569997 = queryWeight, product of:
                  4.849944 = idf(docFreq=940, maxDocs=44218)
                  0.04859849 = queryNorm
                0.3215089 = fieldWeight in 3724, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.849944 = idf(docFreq=940, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3724)
          0.5 = coord(1/2)
      0.6 = coord(3/5)
    
    Abstract
    The assumption of topic matching between user needs and texts topically relevant to those needs is often erroneous. Reports an empirical investigation of the question 'what relationship types actually account for topical relevance'? In order to avoid the bias to topic matching search strategies, user needs are back generated from a randomly selected subset of the subject headings employed in a user oriented topical concordance. The corresponding relevant texts are those indicated in the concordance under the subject heading. Compares the topics of the user needs with the topics of the relevant texts to determine the relationships between them. Topical relevance relationships include a large variety of relationships, only some of which are matching relationships. Others are examples of paradigmatic or syntagmatic relationships. There appear to be no constraints on the kinds of relationships that can function as topical relevance relationships. They are distinguishable from other types of relationships only on functional grounds.
    Source
    Journal of the American Society for Information Science. 46(1995) no.9, S.654-662
  4. Efthimiadis, E.N.: User choices : a new yardstick for the evaluation of ranking algorithms for interactive query expansion (1995) 0.07
    0.06680522 = product of:
      0.11134203 = sum of:
        0.07213905 = weight(_text_:list in 5697) [ClassicSimilarity], result of:
          0.07213905 = score(doc=5697,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.2863593 = fieldWeight in 5697, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5697)
        0.022741921 = weight(_text_:of in 5697) [ClassicSimilarity], result of:
          0.022741921 = score(doc=5697,freq=24.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.2992506 = fieldWeight in 5697, product of:
              4.8989797 = tf(freq=24.0), with freq of:
                24.0 = termFreq=24.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5697)
        0.016461061 = product of:
          0.032922123 = sum of:
            0.032922123 = weight(_text_:22 in 5697) [ClassicSimilarity], result of:
              0.032922123 = score(doc=5697,freq=2.0), product of:
                0.17018363 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04859849 = queryNorm
                0.19345059 = fieldWeight in 5697, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5697)
          0.5 = coord(1/2)
      0.6 = coord(3/5)
    
    Abstract
    The performance of 8 ranking algorithms was evaluated with respect to their effectiveness in ranking terms for query expansion. The evaluation was conducted within an investigation of interactive query expansion and relevance feedback in a real operational environment. Focuses on the identification of algorithms that most effectively take cognizance of user preferences. User choices (i.e. the terms selected by the searchers for the query expansion search) provided the yardstick for the evaluation of the 8 ranking algorithms. This methodology introduces a user oriented approach in evaluating ranking algorithms for query expansion in contrast to the standard, system oriented approaches. Similarities in the performance of the 8 algorithms and the ways these algorithms rank terms were the main focus of this evaluation. The findings demonstrate that the r-lohi, wpq, enim, and porter algorithms have similar performance in bringing good terms to the top of a ranked list of terms for query expansion. However, further evaluation of the algorithms in different (e.g. full text) environments is needed before these results can be generalized beyond the context of the present study.
    Date
    22. 2.1996 13:14:10
  5. Khoo, C.S.G.; Wan, K.-W.: A simple relevancy-ranking strategy for an interface to Boolean OPACs (2004) 0.07
    0.065453224 = product of:
      0.109088704 = sum of:
        0.017798368 = weight(_text_:of in 2509) [ClassicSimilarity], result of:
          0.017798368 = score(doc=2509,freq=30.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.23420064 = fieldWeight in 2509, product of:
              5.477226 = tf(freq=30.0), with freq of:
                30.0 = termFreq=30.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.02734375 = fieldNorm(doc=2509)
        0.024040066 = weight(_text_:subject in 2509) [ClassicSimilarity], result of:
          0.024040066 = score(doc=2509,freq=2.0), product of:
            0.17381717 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.04859849 = queryNorm
            0.13830662 = fieldWeight in 2509, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.02734375 = fieldNorm(doc=2509)
        0.067250274 = sum of:
          0.04420479 = weight(_text_:headings in 2509) [ClassicSimilarity], result of:
            0.04420479 = score(doc=2509,freq=2.0), product of:
              0.23569997 = queryWeight, product of:
                4.849944 = idf(docFreq=940, maxDocs=44218)
                0.04859849 = queryNorm
              0.18754686 = fieldWeight in 2509, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.849944 = idf(docFreq=940, maxDocs=44218)
                0.02734375 = fieldNorm(doc=2509)
          0.023045486 = weight(_text_:22 in 2509) [ClassicSimilarity], result of:
            0.023045486 = score(doc=2509,freq=2.0), product of:
              0.17018363 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04859849 = queryNorm
              0.1354154 = fieldWeight in 2509, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.02734375 = fieldNorm(doc=2509)
      0.6 = coord(3/5)
    
    Abstract
    A relevancy-ranking algorithm for a natural language interface to Boolean online public access catalogs (OPACs) was formulated and compared with that currently used in a knowledge-based search interface called the E-Referencer, being developed by the authors. The algorithm makes use of seven well-known ranking criteria: breadth of match, section weighting, proximity of query words, variant word forms (stemming), document frequency, term frequency and document length. The algorithm converts a natural language query into a series of increasingly broader Boolean search statements. In a small experiment with ten subjects in which the algorithm was simulated by hand, the algorithm obtained good results with a mean overall precision of 0.42 and mean average precision of 0.62, representing a 27 percent improvement in precision and 41 percent improvement in average precision compared to the E-Referencer. The usefulness of each step in the algorithm was analyzed and suggestions are made for improving the algorithm.
    Content
    "Most Web search engines accept natural language queries, perform some kind of fuzzy matching and produce ranked output, displaying first the documents that are most likely to be relevant. On the other hand, most library online public access catalogs (OPACs) on the Web are still Boolean retrieval systems that perform exact matching, and require users to express their search requests precisely in a Boolean search language and to refine their search statements to improve the search results. It is well-documented that users have difficulty searching Boolean OPACs effectively (e.g. Borgman, 1996; Ensor, 1992; Wallace, 1993). One approach to making OPACs easier to use is to develop a natural language search interface that acts as a middleware between the user's Web browser and the OPAC system. The search interface can accept a natural language query from the user and reformulate it as a series of Boolean search statements that are then submitted to the OPAC. The records retrieved by the OPAC are ranked by the search interface before forwarding them to the user's Web browser. The user, then, does not need to interact directly with the Boolean OPAC but with the natural language search interface or search intermediary. The search interface interacts with the OPAC system on the user's behalf. The advantage of this approach is that no modification to the OPAC or library system is required. Furthermore, the search interface can access multiple OPACs, acting as a meta search engine, and integrate search results from various OPACs before sending them to the user. The search interface needs to incorporate a method for converting the user's natural language query into a series of Boolean search statements, and for ranking the OPAC records retrieved. The purpose of this study was to develop a relevancy-ranking algorithm for a search interface to Boolean OPAC systems.
    This is part of an on-going effort to develop a knowledge-based search interface to OPACs called the E-Referencer (Khoo et al., 1998, 1999; Poo et al., 2000). E-Referencer v. 2 that has been implemented applies a repertoire of initial search strategies and reformulation strategies to retrieve records from OPACs using the Z39.50 protocol, and also assists users in mapping query keywords to the Library of Congress subject headings."
    Source
    Electronic library. 22(2004) no.2, S.112-120
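
The abstract of hit 5 mentions converting a natural language query into a series of increasingly broader Boolean search statements. The listing does not give the paper's actual procedure, so the following is only one plausible sketch of the general idea: start with the conjunction of all query terms and relax it by one term per step. The function name, the stop word set and the statement syntax are hypothetical.

```python
from itertools import combinations

# Hypothetical stop word set, for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "for", "in", "on", "to"})


def broadening_statements(query, stop_words=STOP_WORDS):
    """Return Boolean statements ordered from narrowest to broadest.

    Step 1 requires all terms (AND); each later step requires one term
    fewer, until any single term suffices (plain OR).
    """
    terms = [t for t in query.lower().split() if t not in stop_words]
    statements = []
    for size in range(len(terms), 0, -1):
        clauses = ["(" + " AND ".join(combo) + ")"
                   for combo in combinations(terms, size)]
        statements.append(" OR ".join(clauses))
    return statements


for statement in broadening_statements("ranking of library catalogs"):
    print(statement)
```

Records retrieved by an earlier, narrower statement would naturally rank above those found only by a later one, which corresponds to the "breadth of match" criterion named in the abstract.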
  6. Xu, Y.; Wang, D.: Order effect in relevance judgment : mediation and causality (2008) 0.06
    0.058934618 = product of:
      0.14733654 = sum of:
        0.12242402 = weight(_text_:list in 1877) [ClassicSimilarity], result of:
          0.12242402 = score(doc=1877,freq=4.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.48596787 = fieldWeight in 1877, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.046875 = fieldNorm(doc=1877)
        0.024912525 = weight(_text_:of in 1877) [ClassicSimilarity], result of:
          0.024912525 = score(doc=1877,freq=20.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.32781258 = fieldWeight in 1877, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=1877)
      0.4 = coord(2/5)
    
    Abstract
    The order effect of relevance judgment refers to the different relevance perceptions of a document when it appears in different positions in a list. Although the order effect of relevance judgment has significant theoretical and practical implications, the extant literature is inconclusive regarding its existence and forming mechanisms. This study proposes a set of order effect forming mechanisms, including the learning effect, the subneed scheduling effect, and the cursoriness effect based on the conceptualization of dynamic relevance and the psychology of cognitive elaboration. Our empirical study indicates that in an interactive information retrieval setting, when a document list is reasonably long, order effects demonstrate a curvilinear pattern that conforms to the combined effect of the three mechanisms. Moreover, the curvilinear pattern of order effect could differ for documents of different relevance levels.
    Source
    Journal of the American Society for Information Science and Technology. 59(2008) no.8, S.1264-1275
  7. Harman, D.: Ranking algorithms (1992) 0.06
    0.056460805 = product of:
      0.14115201 = sum of:
        0.115422465 = weight(_text_:list in 3511) [ClassicSimilarity], result of:
          0.115422465 = score(doc=3511,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.45817488 = fieldWeight in 3511, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0625 = fieldNorm(doc=3511)
        0.025729544 = weight(_text_:of in 3511) [ClassicSimilarity], result of:
          0.025729544 = score(doc=3511,freq=12.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.33856338 = fieldWeight in 3511, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0625 = fieldNorm(doc=3511)
      0.4 = coord(2/5)
    
    Abstract
    Presents both a summary of past research done in the development of ranking algorithms and detailed instructions on implementing a ranking type of retrieval system. This type of retrieval system takes as input a natural language query without Boolean syntax and produces a list of records that 'answer' the query, with the records ranked in order of likely relevance. Ranking retrieval systems are particularly appropriate for end-users
  8. Vechtomova, O.; Karamuftuoglu, M.: Elicitation and use of relevance feedback information (2006) 0.05
    0.05202371 = product of:
      0.13005927 = sum of:
        0.100994654 = weight(_text_:list in 966) [ClassicSimilarity], result of:
          0.100994654 = score(doc=966,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.40090302 = fieldWeight in 966, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0546875 = fieldNorm(doc=966)
        0.029064612 = weight(_text_:of in 966) [ClassicSimilarity], result of:
          0.029064612 = score(doc=966,freq=20.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.38244802 = fieldWeight in 966, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0546875 = fieldNorm(doc=966)
      0.4 = coord(2/5)
    
    Abstract
    The paper presents two approaches to interactively refining user search formulations and their evaluation in the new High Accuracy Retrieval from Documents (HARD) track of TREC-12. The first method consists of asking the user to select a number of sentences that represent documents. The second method consists of showing to the user a list of noun phrases extracted from the initial document set. Both methods then expand the query based on the user feedback. The TREC results show that one of the methods is an effective means of interactive query expansion and yields significant performance improvements. The paper presents a comparison of the methods and detailed analysis of the evaluation results.
  9. Maron, M.E.; Kuhns, I.L.: On relevance, probabilistic indexing and information retrieval (1960) 0.05
    0.051312048 = product of:
      0.12828012 = sum of:
        0.10202001 = weight(_text_:list in 1928) [ClassicSimilarity], result of:
          0.10202001 = score(doc=1928,freq=4.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.4049732 = fieldWeight in 1928, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1928)
        0.026260108 = weight(_text_:of in 1928) [ClassicSimilarity], result of:
          0.026260108 = score(doc=1928,freq=32.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.34554482 = fieldWeight in 1928, product of:
              5.656854 = tf(freq=32.0), with freq of:
                32.0 = termFreq=32.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1928)
      0.4 = coord(2/5)
    
    Abstract
    Reports on a novel technique for literature indexing and searching in a mechanized library system. The notion of relevance is taken as the key concept in the theory of information retrieval and a comparative concept of relevance is explicated in terms of the theory of probability. The resulting technique called 'Probabilistic indexing' allows a computing machine, given a request for information, to make a statistical inference and derive a number (called the 'relevance number') for each document, which is a measure of the probability that the document will satisfy the given request. The result of a search is an ordered list of those documents which satisfy the request ranked according to their probable relevance. The paper goes on to show that whereas in a conventional library system the cross-referencing ('see' and 'see also') is based solely on the 'semantic closeness' between index terms, statistical measures of closeness between index terms can be defined and computed. Thus, given an arbitrary request consisting of one (or many) index term(s), a machine can elaborate on it to increase the probability of selecting relevant documents that would not otherwise have been selected. Finally, the paper suggests an interpretation of the whole library problem as one where the request is considered as a clue on the basis of which the library system makes a concatenated statistical inference in order to provide as an output an ordered list of those documents which most probably satisfy the information needs of the user.
    Source
    Journal of the Association for Computing Machinery. 7(1960) no.3, S.216-244
  10. García Cumbreras, M.A.; Perea-Ortega, J.M.; García Vega, M.; Ureña López, L.A.: Information retrieval with geographical references : relevant documents filtering vs. query expansion (2009) 0.05
    0.050124742 = product of:
      0.12531185 = sum of:
        0.100994654 = weight(_text_:list in 4222) [ClassicSimilarity], result of:
          0.100994654 = score(doc=4222,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.40090302 = fieldWeight in 4222, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4222)
        0.024317201 = weight(_text_:of in 4222) [ClassicSimilarity], result of:
          0.024317201 = score(doc=4222,freq=14.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.31997898 = fieldWeight in 4222, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4222)
      0.4 = coord(2/5)
    
    Abstract
    This is a thorough analysis of two techniques applied to Geographic Information Retrieval (GIR). Previous studies have researched the application of query expansion to improve the selection process of information retrieval systems. This paper emphasizes the effectiveness of the filtering of relevant documents applied to a GIR system, instead of query expansion. Based on the available CLEF (Cross Language Evaluation Forum) framework, several experiments have been run, some based on query expansion and some on the filtering of relevant documents. The results show that filtering works better in a GIR environment, because relevant documents are not reordered in the final list.
  11. Crestani, F.; Dominich, S.; Lalmas, M.; Rijsbergen, C.J.K. van: Mathematical, logical, and formal methods in information retrieval : an introduction to the special issue (2003) 0.05
    0.04994835 = product of:
      0.083247244 = sum of:
        0.022282438 = weight(_text_:of in 1451) [ClassicSimilarity], result of:
          0.022282438 = score(doc=1451,freq=16.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.2932045 = fieldWeight in 1451, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=1451)
        0.041211538 = weight(_text_:subject in 1451) [ClassicSimilarity], result of:
          0.041211538 = score(doc=1451,freq=2.0), product of:
            0.17381717 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.04859849 = queryNorm
            0.23709705 = fieldWeight in 1451, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.046875 = fieldNorm(doc=1451)
        0.019753272 = product of:
          0.039506543 = sum of:
            0.039506543 = weight(_text_:22 in 1451) [ClassicSimilarity], result of:
              0.039506543 = score(doc=1451,freq=2.0), product of:
                0.17018363 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04859849 = queryNorm
                0.23214069 = fieldWeight in 1451, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1451)
          0.5 = coord(1/2)
      0.6 = coord(3/5)
    
    Abstract
    Research on the use of mathematical, logical, and formal methods has been central to Information Retrieval research for a long time. Research in this area is important not only because it helps enhance retrieval effectiveness, but also because it helps clarify the underlying concepts of Information Retrieval. In this article we outline some of the major aspects of the subject, and summarize the papers of this special issue with respect to how they relate to these aspects. We conclude by highlighting some directions of future research, which are needed to better understand the formal characteristics of Information Retrieval.
    Date
    22. 3.2003 19:27:36
    Source
    Journal of the American Society for Information Science and Technology. 54(2003) no.4, S.281-284
  12. Robertson, S.E.; Sparck Jones, K.: Simple, proven approaches to text retrieval (1997) 0.05
    0.04724039 = product of:
      0.11810098 = sum of:
        0.10202001 = weight(_text_:list in 4532) [ClassicSimilarity], result of:
          0.10202001 = score(doc=4532,freq=4.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.4049732 = fieldWeight in 4532, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4532)
        0.016080966 = weight(_text_:of in 4532) [ClassicSimilarity], result of:
          0.016080966 = score(doc=4532,freq=12.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.21160212 = fieldWeight in 4532, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4532)
      0.4 = coord(2/5)
    
    Abstract
    This technical note describes straightforward techniques for document indexing and retrieval that have been solidly established through extensive testing and are easy to apply. They are useful for many different types of text material, are viable for very large files, and have the advantage that they do not require special skills or training for searching, but are easy for end users. The document and text retrieval methods described here have a sound theoretical basis, are well established by extensive testing, and the ideas involved are now implemented in some commercial retrieval systems. Testing in the last few years has, in particular, shown that the methods presented here work very well with full texts, not only titles and abstracts, and with large files of texts containing three quarters of a million documents. These tests, the TREC Tests (see Harman 1993 - 1997; IP&M 1995), have been rigorous comparative evaluations involving many different approaches to information retrieval. These techniques depend on the use of simple terms for indexing both request and document texts; on term weighting exploiting statistical information about term occurrences; on scoring for request-document matching, using these weights, to obtain a ranked search output; and on relevance feedback to modify request weights or term sets in iterative searching. The normal implementation is via an inverted file organisation using a term list with linked document identifiers, plus counting data, and pointers to the actual texts. The user's request can be a word list, phrases, sentences or extended text.
    Issue
    May, 1997, Update of 1994 and 1996 versions.
    Series
    Technical Report TR356, University of Cambridge, Computer Laboratory
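The score breakdown printed for entry 12 can be checked by hand. The following is a minimal sketch, assuming the listing follows the usual Lucene ClassicSimilarity (TF-IDF) formulas: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm. The function names are hypothetical; queryNorm, fieldNorm and the frequencies are copied from the listing rather than derived.

```python
import math

def idf(doc_freq, max_docs):
    # Inverse document frequency as reported in the explain tree.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    # score = queryWeight * fieldWeight for a single query term.
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

# Values copied from the listing for doc 4532 (entry 12).
QUERY_NORM = 0.04859849
MAX_DOCS = 44218
FIELD_NORM = 0.0390625

list_score = term_score(4.0, 673, MAX_DOCS, QUERY_NORM, FIELD_NORM)
of_score = term_score(12.0, 25162, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# coord(2/5): two of the five query clauses matched this document.
total = (list_score + of_score) * (2 / 5)
print(list_score, of_score, total)
```

Up to floating-point rounding, the three printed values should agree with the 0.10202001, 0.016080966 and 0.04724039 shown in the listing.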
  13. Austin, D.: How Google finds your needle in the Web's haystack : as we'll see, the trick is to ask the web itself to rank the importance of pages... (2006) 0.04
    0.044358637 = product of:
      0.11089659 = sum of:
        0.087463945 = weight(_text_:list in 93) [ClassicSimilarity], result of:
          0.087463945 = score(doc=93,freq=6.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.34719223 = fieldWeight in 93, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.02734375 = fieldNorm(doc=93)
        0.023432638 = weight(_text_:of in 93) [ClassicSimilarity], result of:
          0.023432638 = score(doc=93,freq=52.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.30833945 = fieldWeight in 93, product of:
              7.2111025 = tf(freq=52.0), with freq of:
                52.0 = termFreq=52.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.02734375 = fieldNorm(doc=93)
      0.4 = coord(2/5)
    
    Abstract
    Imagine a library containing 25 billion documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone. You may feel sure that one of the documents contained in the collection has a piece of information that is vitally important to you, and, being impatient like most of us, you'd like to find it in a matter of seconds. How would you go about doing it? Posed in this way, the problem seems impossible. Yet this description is not too different from the World Wide Web, a huge, highly-disorganized collection of documents in many different formats. Of course, we're all familiar with search engines (perhaps you found this article using one) so we know that there is a solution. This article will describe Google's PageRank algorithm and how it returns pages from the web's collection of 25 billion documents that match search criteria so well that "google" has become a widely used verb. Most search engines, including Google, continually run an army of computer programs that retrieve pages from the web, index the words in each document, and store this information in an efficient format. Each time a user asks for a web search using a search phrase, such as "search engine," the search engine determines all the pages on the web that contain the words in the search phrase. (Perhaps additional information such as the distance between the words "search" and "engine" will be noted as well.) Here is the problem: Google now claims to index 25 billion pages. Roughly 95% of the text in web pages is composed from a mere 10,000 words. This means that, for most searches, there will be a huge number of pages containing the words in the search phrase. What is needed is a means of ranking the importance of the pages that fit the search criteria so that the pages can be sorted with the most important pages at the top of the list. One way to determine the importance of pages is to use a human-generated ranking.
    For instance, you may have seen pages that consist mainly of a large number of links to other resources in a particular area of interest. Assuming the person maintaining this page is reliable, the pages referenced are likely to be useful. Of course, the list may quickly fall out of date, and the person maintaining the list may miss some important pages, either unintentionally or as a result of an unstated bias. Google's PageRank algorithm assesses the importance of web pages without human evaluation of the content. In fact, Google feels that the value of its service is largely in its ability to provide unbiased results to search queries; Google claims, "the heart of our software is PageRank." As we'll see, the trick is to ask the web itself to rank the importance of pages.
  14. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.04
    0.044080377 = product of:
      0.11020094 = sum of:
        0.08656685 = weight(_text_:list in 2648) [ClassicSimilarity], result of:
          0.08656685 = score(doc=2648,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.34363115 = fieldWeight in 2648, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.046875 = fieldNorm(doc=2648)
        0.023634095 = weight(_text_:of in 2648) [ClassicSimilarity], result of:
          0.023634095 = score(doc=2648,freq=18.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.3109903 = fieldWeight in 2648, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=2648)
      0.4 = coord(2/5)
    
    Abstract
    An inverted index stores, for each term that appears in a collection of documents, a list of document numbers containing that term. Such an index is indispensable when Boolean or informal ranked queries are to be answered. Construction of the index is, however, a non-trivial task. Simple methods using in-memory data structures cannot be used for large collections because they require too much random access storage, and traditional disc-based methods require large amounts of temporary file space. Describes a new indexing algorithm designed to create large compressed inverted indexes in situ. It makes use of simple compression codes for the positive integers and an in-place external multi-way merge sort. The new technique has been used to invert a 2-gigabyte text collection in under 4 hours, using less than 40 megabytes of temporary disc space, and less than 20 megabytes of main memory.
    Source
    Journal of the American Society for Information Science. 46(1995) no.7, S.537-550
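The data structure described in this abstract can be illustrated with a small in-memory sketch. It is not the in situ algorithm of the article: it only shows an inverted index as a term-to-document-number mapping and one common way to turn posting lists into small positive integers (gaps between consecutive document numbers) that a simple code can compress. The abstract does not name the codes used; Elias gamma is an assumption here, as are all function names.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def to_gaps(postings):
    """Replace each document number by its distance to the previous one.

    Assumes document numbers are positive and sorted, so every gap is >= 1.
    """
    return [cur - prev for prev, cur in zip([0] + postings[:-1], postings)]

def elias_gamma(n):
    """Elias gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary

docs = {
    1: "compressed inverted files",
    3: "inverted index",
    7: "index compression",
}
index = build_inverted_index(docs)
encoded = {term: "".join(elias_gamma(g) for g in to_gaps(ids))
           for term, ids in index.items()}
```

Small gaps get short codes, which is why frequent terms with dense posting lists compress well under this kind of scheme.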
  15. Käki, M.: fKWIC: frequency-based Keyword-in-Context Index for filtering Web search results (2006) 0.04
    0.039083228 = product of:
      0.09770807 = sum of:
        0.08656685 = weight(_text_:list in 6112) [ClassicSimilarity], result of:
          0.08656685 = score(doc=6112,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.34363115 = fieldWeight in 6112, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.046875 = fieldNorm(doc=6112)
        0.011141219 = weight(_text_:of in 6112) [ClassicSimilarity], result of:
          0.011141219 = score(doc=6112,freq=4.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.14660224 = fieldWeight in 6112, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=6112)
      0.4 = coord(2/5)
    
    Abstract
    Enormous Web search engine databases combined with short search queries result in large result sets that are often difficult to access. Result ranking works fairly well, but users need help when it fails. For these situations, we propose a filtering interface that is inspired by keyword-in-context (KWIC) indices. The user interface lists the most frequent keyword contexts (fKWIC). When a context is selected, the corresponding results are displayed in the result list, allowing users to concentrate on the specific context. We compared the keyword context index user interface to the rank order result listing in an experiment with 36 participants. The results show that the proposed user interface was 29% faster in finding relevant results, and the precision of the selected results was 19% higher. In addition, participants showed positive attitudes toward the system.
    Source
    Journal of the American Society for Information Science and Technology. 57(2006) no.12, S.1606-1615
  16. Berry, M.W.; Browne, M.: Understanding search engines : mathematical modeling and text retrieval (2005) 0.04
    0.037343957 = product of:
      0.09335989 = sum of:
        0.081616014 = weight(_text_:list in 7) [ClassicSimilarity], result of:
          0.081616014 = score(doc=7,freq=4.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.32397857 = fieldWeight in 7, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.03125 = fieldNorm(doc=7)
        0.011743877 = weight(_text_:of in 7) [ClassicSimilarity], result of:
          0.011743877 = score(doc=7,freq=10.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.15453234 = fieldWeight in 7, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.03125 = fieldNorm(doc=7)
      0.4 = coord(2/5)
    
    Abstract
    The second edition of Understanding Search Engines: Mathematical Modeling and Text Retrieval follows the basic premise of the first edition by discussing many of the key design issues for building search engines and emphasizing the important role that applied mathematics can play in improving information retrieval. The authors discuss important data structures, algorithms, and software as well as user-centered issues such as interfaces, manual indexing, and document preparation. Significant changes bring the text up to date on current information retrieval methods: for example the addition of a new chapter on link-structure algorithms used in search engines such as Google. The chapter on user interface has been rewritten to specifically focus on search engine usability. In addition the authors have added new recommendations for further reading and expanded the bibliography, and have updated and streamlined the index to make it more reader friendly.
    Content
    Contents: Introduction Document File Preparation - Manual Indexing - Information Extraction - Vector Space Modeling - Matrix Decompositions - Query Representations - Ranking and Relevance Feedback - Searching by Link Structure - User Interface - Book Format Document File Preparation Document Purification and Analysis - Text Formatting - Validation - Manual Indexing - Automatic Indexing - Item Normalization - Inverted File Structures - Document File - Dictionary List - Inversion List - Other File Structures Vector Space Models Construction - Term-by-Document Matrices - Simple Query Matching - Design Issues - Term Weighting - Sparse Matrix Storage - Low-Rank Approximations Matrix Decompositions QR Factorization - Singular Value Decomposition - Low-Rank Approximations - Query Matching - Software - Semidiscrete Decomposition - Updating Techniques Query Management Query Binding - Types of Queries - Boolean Queries - Natural Language Queries - Thesaurus Queries - Fuzzy Queries - Term Searches - Probabilistic Queries Ranking and Relevance Feedback Performance Evaluation - Precision - Recall - Average Precision - Genetic Algorithms - Relevance Feedback Searching by Link Structure HITS Method - HITS Implementation - HITS Summary - PageRank Method - PageRank Adjustments - PageRank Implementation - PageRank Summary User Interface Considerations General Guidelines - Search Engine Interfaces - Form Fill-in - Display Considerations - Progress Indication - No Penalties for Error - Results - Test and Retest - Final Considerations Further Reading
  17. Lempel, R.; Moran, S.: SALSA: the stochastic approach for link-structure analysis (2001) 0.04
    0.037159797 = product of:
      0.09289949 = sum of:
        0.07213905 = weight(_text_:list in 10) [ClassicSimilarity], result of:
          0.07213905 = score(doc=10,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.2863593 = fieldWeight in 10, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0390625 = fieldNorm(doc=10)
        0.020760437 = weight(_text_:of in 10) [ClassicSimilarity], result of:
          0.020760437 = score(doc=10,freq=20.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.27317715 = fieldWeight in 10, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=10)
      0.4 = coord(2/5)
    
    Abstract
    Today, when searching for information on the WWW, one usually performs a query through a term-based search engine. These engines return, as the query's result, a list of Web pages whose contents match the query. For broad-topic queries, such searches often result in a huge set of retrieved documents, many of which are irrelevant to the user. However, much information is contained in the link-structure of the WWW. Information such as which pages are linked to others can be used to augment search algorithms. In this context, Jon Kleinberg introduced the notion of two distinct types of Web pages: hubs and authorities. Kleinberg argued that hubs and authorities exhibit a mutually reinforcing relationship: a good hub will point to many authorities, and a good authority will be pointed at by many hubs. In light of this, he devised an algorithm aimed at finding authoritative pages. We present SALSA, a new stochastic approach for link-structure analysis, which examines random walks on graphs derived from the link-structure. We show that both SALSA and Kleinberg's Mutual Reinforcement approach employ the same meta-algorithm. We then prove that SALSA is equivalent to a weighted in-degree analysis of the link-structure of WWW subgraphs, making it computationally more efficient than the Mutual Reinforcement approach. We compare the results of applying SALSA to the results derived through Kleinberg's approach. These comparisons reveal a topological phenomenon called the TKC effect which, in certain cases, prevents the Mutual Reinforcement approach from identifying meaningful authorities.
  18. He, J.; Meij, E.; Rijke, M. de: Result diversification based on query-specific cluster ranking (2011) 0.04
    0.037159797 = product of:
      0.09289949 = sum of:
        0.07213905 = weight(_text_:list in 4355) [ClassicSimilarity], result of:
          0.07213905 = score(doc=4355,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.2863593 = fieldWeight in 4355, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4355)
        0.020760437 = weight(_text_:of in 4355) [ClassicSimilarity], result of:
          0.020760437 = score(doc=4355,freq=20.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.27317715 = fieldWeight in 4355, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4355)
      0.4 = coord(2/5)
    
    Abstract
    Result diversification is a retrieval strategy for dealing with ambiguous or multi-faceted queries by providing documents that cover as many facets of the query as possible. We propose a result diversification framework based on query-specific clustering and cluster ranking, in which diversification is restricted to documents belonging to clusters that potentially contain a high percentage of relevant documents. Empirical results show that the proposed framework improves the performance of several existing diversification methods. The framework also gives rise to a simple yet effective cluster-based approach to result diversification that selects documents from different clusters to be included in a ranked list in a round robin fashion. We describe a set of experiments aimed at thoroughly analyzing the behavior of the two main components of the proposed diversification framework, ranking and selecting clusters for diversification. Both components have a crucial impact on the overall performance of our framework, but ranking clusters plays a more important role than selecting clusters. We also examine properties that clusters should have in order for our diversification framework to be effective. Most relevant documents should be contained in a small number of high-quality clusters, while there should be no dominantly large clusters. Also, documents from these high-quality clusters should have a diverse content. These properties are strongly correlated with the overall performance of the proposed diversification framework.
    Source
    Journal of the American Society for Information Science and Technology. 62(2011) no.3, S.550-571
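The round-robin selection mentioned in this abstract can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the clusters have already been ranked and that documents inside each cluster are already ordered by relevance. The function name and the cut-off parameter `k` are hypothetical.

```python
def round_robin(clusters, k):
    """Interleave documents from ranked clusters into one ranked list.

    clusters: list of lists, best cluster first, best document first.
    k: maximum length of the returned list.
    """
    ranked = []
    depth = 0
    # Take the depth-th document of every cluster in turn until k documents
    # are collected or every cluster is exhausted.
    while len(ranked) < k and any(depth < len(c) for c in clusters):
        for cluster in clusters:
            if depth < len(cluster) and len(ranked) < k:
                ranked.append(cluster[depth])
        depth += 1
    return ranked

clusters = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
top5 = round_robin(clusters, 5)
```

With these inputs the top of the list alternates between clusters before any cluster contributes a second document, which is the diversification effect the abstract describes.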
  19. White, R.W.; Jose, J.M.; Ruthven, I.: Using top-ranking sentences to facilitate effective information access (2005) 0.04
    0.036733653 = product of:
      0.09183413 = sum of:
        0.07213905 = weight(_text_:list in 3881) [ClassicSimilarity], result of:
          0.07213905 = score(doc=3881,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.2863593 = fieldWeight in 3881, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3881)
        0.019695079 = weight(_text_:of in 3881) [ClassicSimilarity], result of:
          0.019695079 = score(doc=3881,freq=18.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.25915858 = fieldWeight in 3881, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3881)
      0.4 = coord(2/5)
    
    Abstract
    Web searchers typically fail to view search results beyond the first page or to fully examine those results presented to them. In this article we describe an approach that encourages a deeper examination of the contents of the document set retrieved in response to a searcher's query. The approach shifts the focus of perusal and interaction away from potentially uninformative document surrogates (such as titles, sentence fragments, and URLs) to actual document content, and uses this content to drive the information seeking process. Current search interfaces assume searchers examine results document-by-document. In contrast our approach extracts, ranks, and presents the contents of the top-ranked document set. We use query-relevant top-ranking sentences extracted from the top documents at retrieval time as fine-grained representations of top-ranked document content and, when combined in a ranked list, an overview of these documents. The interaction of the searcher provides implicit evidence that is used to reorder the sentences where appropriate. We evaluate our approach in three separate user studies, each applying these sentences in a different way. The findings of these studies show that top-ranking sentences can facilitate effective information access.
    Source
    Journal of the American Society for Information Science and Technology. 56(2005) no.10, S.1113-1125
  20. Jiang, X.; Sun, X.; Yang, Z.; Zhuge, H.; Lapshinova-Koltunski, E.; Yao, J.: Exploiting heterogeneous scientific literature networks to combat ranking bias : evidence from the computational linguistics area (2016) 0.04
    0.036733653 = product of:
      0.09183413 = sum of:
        0.07213905 = weight(_text_:list in 3017) [ClassicSimilarity], result of:
          0.07213905 = score(doc=3017,freq=2.0), product of:
            0.25191793 = queryWeight, product of:
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.04859849 = queryNorm
            0.2863593 = fieldWeight in 3017, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.183657 = idf(docFreq=673, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3017)
        0.019695079 = weight(_text_:of in 3017) [ClassicSimilarity], result of:
          0.019695079 = score(doc=3017,freq=18.0), product of:
            0.07599624 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04859849 = queryNorm
            0.25915858 = fieldWeight in 3017, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3017)
      0.4 = coord(2/5)
    
    Abstract
    It is important to help researchers find valuable papers from a large literature collection. To this end, many graph-based ranking algorithms have been proposed. However, most of these algorithms suffer from the problem of ranking bias. Ranking bias hurts the usefulness of a ranking algorithm because it returns a ranking list with an undesirable time distribution. This paper is a focused study on how to alleviate ranking bias by leveraging the heterogeneous network structure of the literature collection. We propose a new graph-based ranking algorithm, MutualRank, that integrates mutual reinforcement relationships among networks of papers, researchers, and venues to achieve a more synthetic, accurate, and less-biased ranking than previous methods. MutualRank provides a unified model that involves both intra- and inter-network information for ranking papers, researchers, and venues simultaneously. We use the ACL Anthology Network as the benchmark data set and construct the gold standard from computer linguistics course websites of well-known universities and two well-known textbooks. The experimental results show that MutualRank greatly outperforms the state-of-the-art competitors, including PageRank, HITS, CoRank, Future Rank, and P-Rank, in ranking papers in both improving ranking effectiveness and alleviating ranking bias. Rankings of researchers and venues by MutualRank are also quite reasonable.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.7, S.1679-1702

Languages

  • e 293
  • d 9
  • chi 2

Types

  • a 283
  • m 10
  • el 8
  • s 4
  • r 3
  • p 2
  • x 1