Search (20 results, page 1 of 1)

  • theme_ss:"Retrievalstudien"
  • year_i:[2010 TO 2020}
  1. Ravana, S.D.; Taheri, M.S.; Rajagopal, P.: Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems (2015) 0.05
    0.053692453 = product of:
      0.107384905 = sum of:
        0.107384905 = sum of:
          0.0720338 = weight(_text_:systems in 2587) [ClassicSimilarity], result of:
            0.0720338 = score(doc=2587,freq=14.0), product of:
              0.16037072 = queryWeight, product of:
                3.0731742 = idf(docFreq=5561, maxDocs=44218)
                0.052184064 = queryNorm
              0.4491705 = fieldWeight in 2587, product of:
                3.7416575 = tf(freq=14.0), with freq of:
                  14.0 = termFreq=14.0
                3.0731742 = idf(docFreq=5561, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2587)
          0.0353511 = weight(_text_:22 in 2587) [ClassicSimilarity], result of:
            0.0353511 = score(doc=2587,freq=2.0), product of:
              0.1827397 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.052184064 = queryNorm
              0.19345059 = fieldWeight in 2587, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2587)
      0.5 = coord(1/2)
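    The nested values above are Lucene's ClassicSimilarity explain output, and they follow a fixed TF-IDF recipe: each matching term contributes queryWeight (idf × queryNorm) times fieldWeight (sqrt(tf) × idf × fieldNorm), the matching clauses are summed, and the sum is multiplied by the coord factor. The short Python sketch below recomputes the score of result 1 from the numbers shown; the function name and structure are illustrative, not a Lucene API.

    ```python
    import math

    def classic_similarity_clause(freq, idf, query_norm, field_norm):
        """Recompute one term clause of a Lucene ClassicSimilarity explain tree."""
        tf = math.sqrt(freq)                  # e.g. 3.7416575 for freq=14
        query_weight = idf * query_norm       # idf * queryNorm, e.g. 0.16037072
        field_weight = tf * idf * field_norm  # tf * idf * fieldNorm, e.g. 0.4491705
        return query_weight * field_weight

    # Values copied from the explain tree of result 1 (doc 2587).
    systems_clause = classic_similarity_clause(14.0, 3.0731742, 0.052184064, 0.0390625)
    twentytwo_clause = classic_similarity_clause(2.0, 3.5018296, 0.052184064, 0.0390625)

    # Sum the matching clauses, then apply the coord(1/2) factor shown above.
    total = (systems_clause + twentytwo_clause) * 0.5
    print(round(total, 6))  # ~0.053692, matching the displayed score of 0.05
    ```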
    
    Abstract
    Purpose The purpose of this paper is to propose a method that yields more accurate results when comparing the performance of paired information retrieval (IR) systems, relative to the current method, which is based on the mean effectiveness scores of the systems across a set of identified topics/queries. Design/methodology/approach In the proposed approach, document-level scores rather than the classic set of topic scores are used as the evaluation unit. These document scores are defined document weights, which take the place of the systems' mean average precision (MAP) scores as the statistic of a significance test. The experiments were conducted using the TREC 9 Web track collection. Findings The p-values generated by two types of significance tests, the Student's t-test and the Mann-Whitney test, show that when document-level scores are used as the evaluation unit, the difference between IR systems is more significant than when topic scores are used. Originality/value Utilizing a suitable test collection is a primary prerequisite for the comparative evaluation of IR systems. However, in addition to reusable test collections, accurate statistical testing is a necessity for these evaluations. The findings of this study will assist IR researchers in evaluating their retrieval systems and algorithms more accurately.
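    A hedged sketch of the kind of comparison the abstract describes: the same two systems are tested once with per-topic means and once with per-document scores as the unit of the significance test. The synthetic score arrays, the 50-topic grouping, and the use of scipy's paired t-test and Mann-Whitney U are illustrative assumptions, not the authors' exact procedure or data.

    ```python
    import numpy as np
    from scipy.stats import ttest_rel, mannwhitneyu

    rng = np.random.default_rng(0)

    # Hypothetical per-document weights for two IR systems over the same judged documents;
    # in the paper such document weights replace per-topic MAP as the significance statistic.
    docs_a = rng.beta(2, 5, size=500)
    docs_b = np.clip(docs_a + rng.normal(0.02, 0.05, size=500), 0, 1)

    # Topic-level view: collapse the documents into 50 topics and compare topic means.
    topics_a = docs_a.reshape(50, 10).mean(axis=1)
    topics_b = docs_b.reshape(50, 10).mean(axis=1)

    print("topic-level paired t-test p =", ttest_rel(topics_a, topics_b).pvalue)
    print("doc-level   paired t-test p =", ttest_rel(docs_a, docs_b).pvalue)
    print("doc-level Mann-Whitney U  p =",
          mannwhitneyu(docs_a, docs_b, alternative="two-sided").pvalue)
    ```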
    Date
    20. 1.2015 18:30:22
  2. Pal, S.; Mitra, M.; Kamps, J.: Evaluation effort, reliability and reusability in XML retrieval (2011) 0.04
    0.044901766 = product of:
      0.08980353 = sum of:
        0.08980353 = sum of:
          0.054452434 = weight(_text_:systems in 4197) [ClassicSimilarity], result of:
            0.054452434 = score(doc=4197,freq=8.0), product of:
              0.16037072 = queryWeight, product of:
                3.0731742 = idf(docFreq=5561, maxDocs=44218)
                0.052184064 = queryNorm
              0.339541 = fieldWeight in 4197, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                3.0731742 = idf(docFreq=5561, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4197)
          0.0353511 = weight(_text_:22 in 4197) [ClassicSimilarity], result of:
            0.0353511 = score(doc=4197,freq=2.0), product of:
              0.1827397 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.052184064 = queryNorm
              0.19345059 = fieldWeight in 4197, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4197)
      0.5 = coord(1/2)
    
    Abstract
    The Initiative for the Evaluation of XML retrieval (INEX) provides a TREC-like platform for evaluating content-oriented XML retrieval systems. Since 2007, INEX has been using a set of precision-recall based metrics for its ad hoc tasks. The authors investigate the reliability and robustness of these focused retrieval measures, and of the INEX pooling method. They explore four specific questions: How reliable are the metrics when assessments are incomplete, or when query sets are small? What is the minimum pool/query-set size that can be used to reliably evaluate systems? Can the INEX collections be used to fairly evaluate "new" systems that did not participate in the pooling process? And, for a fixed amount of assessment effort, would this effort be better spent in thoroughly judging a few queries, or in judging many queries relatively superficially? The authors' findings validate properties of precision-recall-based metrics observed in document retrieval settings. Early precision measures are found to be more error-prone and less stable under incomplete judgments and small topic-set sizes. They also find that system rankings remain largely unaffected even when assessment effort is substantially (but systematically) reduced, and confirm that the INEX collections remain usable when evaluating nonparticipating systems. Finally, they observe that for a fixed amount of effort, judging shallow pools for many queries is better than judging deep pools for a smaller set of queries. However, when judging only a random sample of a pool, it is better to completely judge fewer topics than to partially judge many topics. This result confirms the effectiveness of pooling methods.
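    Several of the questions above hinge on pool depth and query-set size. As a point of reference, a minimal sketch of depth-k pooling is shown below; the run format, the depth parameter and the toy data are assumptions for illustration, not the INEX implementation.

    ```python
    from typing import Dict, List, Set

    def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
        """Union of the top-`depth` documents from each participating run for one topic.

        Only the pooled document ids are subsequently judged by human assessors.
        """
        pool: Set[str] = set()
        for ranking in runs.values():
            pool.update(ranking[:depth])
        return pool

    # Toy example: three systems, pool depth 2 -> three distinct documents to judge.
    runs = {
        "sysA": ["d1", "d2", "d3"],
        "sysB": ["d2", "d4", "d5"],
        "sysC": ["d1", "d4", "d6"],
    }
    print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd2', 'd4']
    ```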
    Date
    22. 1.2011 14:20:56
  3. Rajagopal, P.; Ravana, S.D.; Koh, Y.S.; Balakrishnan, V.: Evaluating the effectiveness of information retrieval systems using effort-based relevance judgment (2019) 0.04
    0.044901766 = product of:
      0.08980353 = sum of:
        0.08980353 = sum of:
          0.054452434 = weight(_text_:systems in 5287) [ClassicSimilarity], result of:
            0.054452434 = score(doc=5287,freq=8.0), product of:
              0.16037072 = queryWeight, product of:
                3.0731742 = idf(docFreq=5561, maxDocs=44218)
                0.052184064 = queryNorm
              0.339541 = fieldWeight in 5287, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                3.0731742 = idf(docFreq=5561, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5287)
          0.0353511 = weight(_text_:22 in 5287) [ClassicSimilarity], result of:
            0.0353511 = score(doc=5287,freq=2.0), product of:
              0.1827397 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.052184064 = queryNorm
              0.19345059 = fieldWeight in 5287, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5287)
      0.5 = coord(1/2)
    
    Abstract
    Purpose Effort, in addition to relevance, is a major factor in the satisfaction and utility of a document to the actual user. The purpose of this paper is to propose a method for generating relevance judgments that incorporate effort without involving human judges. The study then determines the variation in system rankings caused by low-effort relevance judgments when evaluating retrieval systems at different depths of evaluation. Design/methodology/approach Effort-based relevance judgments are generated using a proposed boxplot approach applied to simple document features, HTML features and readability features. The boxplot approach is a simple yet repeatable way of classifying documents' effort while ensuring that outlier scores do not skew the grading of the entire set of documents. Findings Evaluating retrieval systems with low-effort relevance judgments has a stronger influence at shallow depths of evaluation than at deeper depths. It is shown that the difference in system rankings is due to low-effort documents and not to the number of relevant documents. Originality/value It is therefore crucial to evaluate retrieval systems at shallow depth using low-effort relevance judgments.
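    A rough sketch of the boxplot idea described above: documents are graded into low- and high-effort classes from a single feature score, with the interquartile range used to clamp outliers so that they do not skew the grading. The single feature, the two-way split at the median and the whisker rule are simplifying assumptions, not the paper's exact procedure.

    ```python
    import numpy as np

    def boxplot_effort_labels(feature_scores):
        """Label documents 'low' or 'high' effort from boxplot statistics of one feature."""
        scores = np.asarray(feature_scores, dtype=float)
        q1, median, q3 = np.percentile(scores, [25, 50, 75])
        iqr = q3 - q1
        # Clamp values beyond the whiskers so outliers cannot skew the grading.
        clamped = np.clip(scores, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        return ["low" if s <= median else "high" for s in clamped]

    # Toy readability-like scores; the extreme 95.0 is treated as an outlier.
    print(boxplot_effort_labels([12.3, 8.1, 15.6, 9.9, 95.0, 11.2]))
    ```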
    Date
    20. 1.2015 18:30:22
  4. Angelini, M.; Fazzini, V.; Ferro, N.; Santucci, G.; Silvello, G.: CLAIRE: A combinatorial visual analytics system for information retrieval evaluation (2018) 0.02
    0.01800845 = product of:
      0.0360169 = sum of:
        0.0360169 = product of:
          0.0720338 = sum of:
            0.0720338 = weight(_text_:systems in 5049) [ClassicSimilarity], result of:
              0.0720338 = score(doc=5049,freq=14.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.4491705 = fieldWeight in 5049, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5049)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Information Retrieval (IR) develops complex systems, composed of several components, which aim to return and optimally rank the most relevant documents in response to user queries. In this context, experimental evaluation plays a central role, since it allows for measuring IR system effectiveness, increasing the understanding of how systems function, and better directing the efforts for improving them. Current evaluation methodologies are limited by two major factors: (i) IR systems are evaluated as "black boxes", since it is not possible to decompose the contributions of the different components, e.g., stop lists, stemmers, and IR models; (ii) given that it is not possible to predict the effectiveness of an IR system, both academia and industry need to explore huge numbers of systems, originating from large combinatorial compositions of their components, to understand how they perform and how these components interact. We propose a Combinatorial visuaL Analytics system for Information Retrieval Evaluation (CLAIRE) which allows for exploring and making sense of the performance of a large number of IR systems, in order to quickly and intuitively grasp which system configurations are preferred, what the contributions of the different components are, and how these components interact. The CLAIRE system is then validated against use cases based on several test collections, using a wide set of systems generated by a combinatorial composition of several off-the-shelf components that represent the common denominator almost always present in English IR systems. In particular, we validate the findings enabled by CLAIRE against consolidated deep statistical analyses and we show that the CLAIRE system allows the generation of new insights which were not detectable with traditional approaches.
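    The combinatorial compositions mentioned above can be pictured in a few lines: every choice of stop list, stemmer and IR model yields one system configuration to evaluate. The component lists below are generic examples, not the actual grid used for CLAIRE.

    ```python
    from itertools import product

    # Hypothetical component choices; a real grid would be much larger.
    stop_lists = ["none", "indri", "smart"]
    stemmers = ["none", "porter", "krovetz"]
    models = ["BM25", "LM-Dirichlet", "TF-IDF"]

    configurations = list(product(stop_lists, stemmers, models))
    print(len(configurations), "system configurations to evaluate")  # 27

    for stop_list, stemmer, model in configurations[:3]:
        print(f"run: stoplist={stop_list} stemmer={stemmer} model={model}")
    ```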
  5. Behnert, C.; Lewandowski, D.: ¬A framework for designing retrieval effectiveness studies of library information systems using human relevance assessments (2017) 0.01
    0.013613109 = product of:
      0.027226217 = sum of:
        0.027226217 = product of:
          0.054452434 = sum of:
            0.054452434 = weight(_text_:systems in 3700) [ClassicSimilarity], result of:
              0.054452434 = score(doc=3700,freq=8.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.339541 = fieldWeight in 3700, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3700)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Purpose This paper demonstrates how to apply traditional information retrieval evaluation methods, based on standards from the Text REtrieval Conference (TREC) and web search evaluation, to all types of modern library information systems, including online public access catalogs, discovery systems, and digital libraries that provide web search features to gather information from heterogeneous sources. Design/methodology/approach We apply conventional procedures from information retrieval evaluation to the library information system context, considering the specific characteristics of modern library materials. Findings We introduce a framework consisting of five parts: (1) search queries, (2) search results, (3) assessors, (4) testing, and (5) data analysis. We show how to deal with comparability problems resulting from diverse document types, e.g., electronic articles vs. printed monographs, and what issues need to be considered for retrieval tests in the library context. Practical implications The framework can be used as a guideline for conducting retrieval effectiveness studies in the library context. Originality/value Although a considerable amount of research has been done on information retrieval evaluation, and standards for conducting retrieval effectiveness studies do exist, to our knowledge this is the first attempt to provide a systematic framework for evaluating the retrieval effectiveness of twenty-first-century library information systems. We demonstrate which issues must be considered and what decisions must be made by researchers prior to a retrieval test.
  6. Naderi, H.; Rumpler, B.: PERCIRS: a system to combine personalized and collaborative information retrieval (2010) 0.01
    0.012175934 = product of:
      0.024351869 = sum of:
        0.024351869 = product of:
          0.048703738 = sum of:
            0.048703738 = weight(_text_:systems in 3960) [ClassicSimilarity], result of:
              0.048703738 = score(doc=3960,freq=10.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.3036947 = fieldWeight in 3960, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3960)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Purpose - This paper aims to discuss and test the claim that utilizing personalization techniques can be valuable for improving the efficiency of collaborative information retrieval (CIR) systems. Design/methodology/approach - A new personalized CIR system, called PERCIRS, is presented based on user profile similarity calculation (UPSC) formulas. To this end, the paper proposes several UPSC formulas as well as two techniques to evaluate them. As the proposed CIR system is personalized, it could not be evaluated by Cranfield-like evaluation techniques (e.g. TREC). Hence, this paper proposes a new user-centric mechanism that enables PERCIRS to be evaluated. This mechanism is generic and can be used to evaluate any other personalized IR system. Findings - The results show that among the UPSC formulas proposed in this paper, the (query-document)-graph based formula is the most effective. After integrating this formula into PERCIRS and comparing it with nine other IR systems, it is concluded that the results of the system are better than those of the other IR systems. In addition, the paper shows that the complexity of the system is lower than that of the other CIR systems. Research limitations/implications - The system asks users to explicitly rank the returned documents, although explicit ranking is still not widespread. However, the authors believe that users should actively participate in the IR process in order to properly satisfy their information needs. Originality/value - The value of this paper lies in combining collaborative and personalized IR, as well as in introducing a mechanism that enables the personalized IR system to be evaluated. The proposed evaluation mechanism is very valuable for developers of personalized IR systems. The paper also introduces some significant user profile similarity calculation formulas, and two techniques to evaluate them. These formulas can also be used to find the user's community in social networks.
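    As a hedged illustration of what a user profile similarity calculation can look like, the sketch below compares two users by the cosine similarity of their explicit document ratings. This is a generic formula chosen for illustration; it is not one of the specific UPSC formulas proposed for PERCIRS, and the rating data are invented.

    ```python
    import math

    def profile_similarity(ratings_u, ratings_v):
        """Cosine similarity over the explicit document ratings of two users."""
        shared = set(ratings_u) & set(ratings_v)
        if not shared:
            return 0.0
        dot = sum(ratings_u[d] * ratings_v[d] for d in shared)
        norm_u = math.sqrt(sum(r * r for r in ratings_u.values()))
        norm_v = math.sqrt(sum(r * r for r in ratings_v.values()))
        return dot / (norm_u * norm_v)

    alice = {"doc1": 5, "doc2": 3, "doc4": 1}
    bob = {"doc1": 4, "doc2": 2, "doc3": 5}
    print(round(profile_similarity(alice, bob), 3))
    ```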
  7. Reichert, S.; Mayr, P.: Untersuchung von Relevanzeigenschaften in einem kontrollierten Eyetracking-Experiment (2012) 0.01
    0.010605331 = product of:
      0.021210661 = sum of:
        0.021210661 = product of:
          0.042421322 = sum of:
            0.042421322 = weight(_text_:22 in 328) [ClassicSimilarity], result of:
              0.042421322 = score(doc=328,freq=2.0), product of:
                0.1827397 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.052184064 = queryNorm
                0.23214069 = fieldWeight in 328, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=328)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    22. 7.2012 19:25:54
  8. Kelly, D.; Sugimoto, C.R.: ¬A systematic review of interactive information retrieval evaluation studies, 1967-2006 (2013) 0.01
    0.009625921 = product of:
      0.019251842 = sum of:
        0.019251842 = product of:
          0.038503684 = sum of:
            0.038503684 = weight(_text_:systems in 684) [ClassicSimilarity], result of:
              0.038503684 = score(doc=684,freq=4.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.24009174 = fieldWeight in 684, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=684)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    With the increasing number and diversity of search tools available, interest in the evaluation of search systems, particularly from a user perspective, has grown among researchers. More researchers are designing and evaluating interactive information retrieval (IIR) systems and beginning to innovate in evaluation methods. Maturation of a research specialty relies on the ability to replicate research, provide standards for measurement and analysis, and understand past endeavors. This article presents a historical overview of 40 years of IIR evaluation studies using the method of systematic review. A total of 2,791 journal and conference units were manually examined and 127 articles were selected for analysis in this study, based on predefined inclusion and exclusion criteria. These articles were systematically coded using features such as author, publication date, sources and references, and properties of the research method used in the articles, such as number of subjects, tasks, corpora, and measures. Results include data describing the growth of IIR studies over time, the most frequently occurring and cited authors and sources, and the most common types of corpora and measures used. An additional product of this research is a bibliography of IIR evaluation research that can be used by students, teachers, and those new to the area. To the authors' knowledge, this is the first historical, systematic characterization of the IIR evaluation literature, including the documentation of methods and measures used by researchers in this specialty.
  9. Losada, D.E.; Parapar, J.; Barreiro, A.: Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems (2017) 0.01
    0.009625921 = product of:
      0.019251842 = sum of:
        0.019251842 = product of:
          0.038503684 = sum of:
            0.038503684 = weight(_text_:systems in 5098) [ClassicSimilarity], result of:
              0.038503684 = score(doc=5098,freq=4.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.24009174 = fieldWeight in 5098, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5098)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Evaluating Information Retrieval systems is crucial to making progress in search technologies. Evaluation is often based on assembling reference collections consisting of documents, queries and relevance judgments done by humans. In large-scale environments, exhaustively judging relevance becomes infeasible. Instead, only a pool of documents is judged for relevance. By selectively choosing documents from the pool we can optimize the number of judgments required to identify a given number of relevant documents. We argue that this iterative selection process can be naturally modeled as a reinforcement learning problem and propose innovative and formal adjudication methods based on multi-armed bandits. Casting document judging as a multi-armed bandit problem is not only theoretically appealing, but also leads to highly effective adjudication methods. Under this bandit allocation framework, we consider stationary and non-stationary models and propose seven new document adjudication methods (five stationary methods and two non-stationary variants). Our paper also reports a series of experiments performed to thoroughly compare our new methods against current adjudication methods. This comparative study includes existing methods designed for pooling-based evaluation and existing methods designed for metasearch. Our experiments show that our theoretically grounded adjudication methods can substantially minimize the assessment effort.
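    A minimal epsilon-greedy sketch of the bandit framing described above: each run is an arm, pulling an arm means judging that run's next unjudged document, and arms whose documents have recently turned out relevant are preferred. The reward definition, the epsilon value and the toy data are assumptions; the paper's stationary and non-stationary methods are considerably more elaborate.

    ```python
    import random

    def adjudicate(runs, qrels, budget, epsilon=0.1, seed=0):
        """Choose `budget` documents to judge by treating each run as a bandit arm."""
        rng = random.Random(seed)
        pulls = {name: 1 for name in runs}       # start at 1 to avoid division by zero
        rewards = {name: 0.0 for name in runs}
        cursor = {name: 0 for name in runs}
        judged, found_relevant = set(), []

        for _ in range(budget):
            if rng.random() < epsilon:           # explore a random run
                arm = rng.choice(list(runs))
            else:                                # exploit the best empirical mean reward
                arm = max(runs, key=lambda a: rewards[a] / pulls[a])
            # Advance this run's cursor to its next document not yet judged.
            while cursor[arm] < len(runs[arm]) and runs[arm][cursor[arm]] in judged:
                cursor[arm] += 1
            if cursor[arm] >= len(runs[arm]):
                continue
            doc = runs[arm][cursor[arm]]
            judged.add(doc)
            relevant = qrels.get(doc, 0)         # stands in for a human judgment
            pulls[arm] += 1
            rewards[arm] += relevant
            if relevant:
                found_relevant.append(doc)
        return found_relevant

    runs = {"sysA": ["d1", "d2", "d3", "d7"], "sysB": ["d2", "d5", "d6", "d8"]}
    qrels = {"d1": 1, "d5": 1, "d6": 1}
    print(adjudicate(runs, qrels, budget=5))
    ```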
  10. Leiva-Mederos, A.; Senso, J.A.; Hidalgo-Delgado, Y.; Hipola, P.: Working framework of semantic interoperability for CRIS with heterogeneous data sources (2017) 0.01
    0.0094314385 = product of:
      0.018862877 = sum of:
        0.018862877 = product of:
          0.037725754 = sum of:
            0.037725754 = weight(_text_:systems in 3706) [ClassicSimilarity], result of:
              0.037725754 = score(doc=3706,freq=6.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.2352409 = fieldWeight in 3706, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3706)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Purpose Information from Current Research Information Systems (CRIS) is stored in different formats, on platforms that are not compatible, or even in independent networks. It would be helpful to have a well-defined methodology that allows such data to be managed and processed from a single site, so as to take advantage of the capacity to link dispersed data found in different systems, platforms, sources and/or formats. Based on the functionalities and materials of the VLIR project, the purpose of this paper is to present a model that provides for interoperability by means of semantic alignment techniques and metadata crosswalks, and facilitates the fusion of information stored in diverse sources. Design/methodology/approach After reviewing the state of the art regarding the diverse mechanisms for achieving semantic interoperability, the paper analyzes the following: the specific coverage of the data sets (type of data, thematic coverage and geographic coverage); the technical specifications needed to retrieve and analyze a distribution of the data set (format, protocol, etc.); the conditions of re-utilization (copyright and licenses); and the "dimensions" included in the data set as well as the semantics of these dimensions (the syntax and the taxonomies of reference). The semantic interoperability framework presented here implements semantic alignment and metadata crosswalks to convert information from three different systems (ABCD, Moodle and DSpace) and integrate all the databases into a single RDF file. Findings The paper also includes an evaluation based on comparing, by means of recall and precision calculations, the proposed model with identical queries made via the Open Archives Initiative and SQL, in order to estimate its efficiency. The results are satisfactory, since semantic interoperability facilitates the exact retrieval of information. Originality/value The proposed model enhances management of the syntactic and semantic interoperability of the CRIS system designed. In a real setting of use, it achieves very positive results.
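    As a hedged sketch of what a metadata crosswalk into a shared schema can look like, the snippet below maps records from three source systems onto common target fields. The field names for ABCD, Moodle and DSpace are hypothetical stand-ins chosen for illustration, not the schemas actually used in the project.

    ```python
    # Hypothetical source field names; real ABCD, Moodle and DSpace schemas differ.
    CROSSWALK = {
        "abcd": {"titulo": "title", "autor": "creator", "fecha": "date"},
        "moodle": {"fullname": "title", "teacher": "creator", "startdate": "date"},
        "dspace": {"dc.title": "title", "dc.contributor.author": "creator",
                   "dc.date.issued": "date"},
    }

    def to_common_schema(record: dict, source: str) -> dict:
        """Map one source record onto the shared target schema via the crosswalk."""
        mapping = CROSSWALK[source]
        return {target: record[field] for field, target in mapping.items() if field in record}

    dspace_record = {"dc.title": "Semantic interoperability for CRIS",
                     "dc.contributor.author": "Leiva-Mederos, A.",
                     "dc.date.issued": "2017"}
    print(to_common_schema(dspace_record, "dspace"))
    ```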
  11. Chu, H.: Factors affecting relevance judgment : a report from TREC Legal track (2011) 0.01
    0.008837775 = product of:
      0.01767555 = sum of:
        0.01767555 = product of:
          0.0353511 = sum of:
            0.0353511 = weight(_text_:22 in 4540) [ClassicSimilarity], result of:
              0.0353511 = score(doc=4540,freq=2.0), product of:
                0.1827397 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.052184064 = queryNorm
                0.19345059 = fieldWeight in 4540, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4540)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    12. 7.2011 18:29:22
  12. Wildemuth, B.; Freund, L.; Toms, E.G.: Untangling search task complexity and difficulty in the context of interactive information retrieval studies (2014) 0.01
    0.008837775 = product of:
      0.01767555 = sum of:
        0.01767555 = product of:
          0.0353511 = sum of:
            0.0353511 = weight(_text_:22 in 1786) [ClassicSimilarity], result of:
              0.0353511 = score(doc=1786,freq=2.0), product of:
                0.1827397 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.052184064 = queryNorm
                0.19345059 = fieldWeight in 1786, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1786)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    6. 4.2015 19:31:22
  13. Colace, F.; Santo, M. de; Greco, L.; Napoletano, P.: Improving relevance feedback-based query expansion by the use of a weighted word pairs approach (2015) 0.01
    0.008167865 = product of:
      0.01633573 = sum of:
        0.01633573 = product of:
          0.03267146 = sum of:
            0.03267146 = weight(_text_:systems in 2263) [ClassicSimilarity], result of:
              0.03267146 = score(doc=2263,freq=2.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.2037246 = fieldWeight in 2263, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2263)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    In this article, the use of a new term extraction method for query expansion (QE) in text retrieval is investigated. The new method expands the initial query with a structured representation made of weighted word pairs (WWP) extracted from a set of training documents (relevance feedback). Standard text retrieval systems can handle a WWP structure through custom Boolean weighted models. We experimented with both the explicit and pseudo-relevance feedback schemas and compared the proposed term extraction method with others in the literature, such as KLD and RM3. Evaluations have been conducted on a number of test collections (Text REtrieval Conference [TREC]-6, -7, -8, -9, and -10). Results demonstrated that the QE method based on this new structure outperforms the baseline.
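    A hedged sketch of the weighted word pairs idea: mine co-occurring term pairs from the feedback documents, weight them by relative frequency, and fold the top pairs back into a weighted Boolean query. The pair weighting and the query syntax below are simplified assumptions, not the authors' exact WWP model.

    ```python
    from collections import Counter
    from itertools import combinations

    def weighted_word_pairs(feedback_docs, top_k=3):
        """Extract the most frequent within-document word pairs from feedback documents."""
        pair_counts = Counter()
        for doc in feedback_docs:
            terms = sorted(set(doc.lower().split()))
            pair_counts.update(combinations(terms, 2))
        total = sum(pair_counts.values()) or 1
        return [(pair, count / total) for pair, count in pair_counts.most_common(top_k)]

    def expand_query(query, pairs):
        """Fold the weighted pairs into a simple weighted Boolean query string."""
        clauses = [query] + [f'("{a} {b}")^{weight:.2f}' for (a, b), weight in pairs]
        return " OR ".join(clauses)

    feedback = ["retrieval evaluation with test collections",
                "test collections for retrieval evaluation"]
    print(expand_query("retrieval evaluation", weighted_word_pairs(feedback)))
    ```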
  14. Li, J.; Zhang, P.; Song, D.; Wu, Y.: Understanding an enriched multidimensional user relevance model by analyzing query logs (2017) 0.01
    0.008167865 = product of:
      0.01633573 = sum of:
        0.01633573 = product of:
          0.03267146 = sum of:
            0.03267146 = weight(_text_:systems in 3961) [ClassicSimilarity], result of:
              0.03267146 = score(doc=3961,freq=2.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.2037246 = fieldWeight in 3961, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3961)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Modeling multidimensional relevance in information retrieval (IR) has attracted much attention in recent years. However, most existing studies are conducted through relatively small-scale user studies, which may not reflect a real-world, natural search scenario. In this article, we propose to study the multidimensional user relevance model (MURM) on large-scale query logs, which record users' various search behaviors (e.g., query reformulations, clicks and dwell time) in natural search settings. We extend an existing MURM (comprising five dimensions: topicality, novelty, reliability, understandability, and scope) with two additional dimensions, namely interest and habit. The two new dimensions represent personalized relevance judgments on retrieved documents. Further, for each dimension in the enriched MURM, a set of computable features is formulated. By conducting extensive document ranking experiments on Bing's query logs and TREC Session track data, we systematically investigate the impact of each dimension on retrieval performance and gain a series of insightful findings that may benefit the design of future IR systems.
  15. Kutlu, M.; Elsayed, T.; Lease, M.: Intelligent topic selection for low-cost information retrieval evaluation : a new perspective on deep vs. shallow judging (2018) 0.01
    0.007700737 = product of:
      0.015401474 = sum of:
        0.015401474 = product of:
          0.030802948 = sum of:
            0.030802948 = weight(_text_:systems in 5092) [ClassicSimilarity], result of:
              0.030802948 = score(doc=5092,freq=4.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.19207339 = fieldWeight in 5092, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5092)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    While test collections provide the cornerstone for Cranfield-based evaluation of information retrieval (IR) systems, it has become practically infeasible to rely on traditional pooling techniques to construct test collections at the scale of today's massive document collections (e.g., ClueWeb12's 700M+ webpages). This has motivated a flurry of studies proposing more cost-effective yet reliable IR evaluation methods. In this paper, we propose a new intelligent topic selection method which reduces the number of search topics (and thereby costly human relevance judgments) needed for reliable IR evaluation. To rigorously assess our method, we integrate previously disparate lines of research on intelligent topic selection and deep vs. shallow judging (i.e., whether it is more cost-effective to collect many relevance judgments for a few topics or a few judgments for many topics). While prior work on intelligent topic selection has never been evaluated against shallow judging baselines, prior work on deep vs. shallow judging has largely argued for shallow judging, but assuming random topic selection. We argue that for evaluating any topic selection method, ultimately one must ask whether it is actually useful to select topics, or whether one should simply perform shallow judging over many topics. In seeking a rigorous answer to this over-arching question, we conduct a comprehensive investigation over a set of relevant factors never previously studied together: 1) the method of topic selection; 2) the effect of topic familiarity on human judging speed; and 3) how different topic generation processes (requiring varying human effort) impact (i) budget utilization and (ii) the resultant quality of judgments. Experiments on the NIST TREC Robust 2003 and Robust 2004 test collections show that not only can we reliably evaluate IR systems with fewer topics, but also that: 1) when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in terms of evaluation reliability; and 2) topic familiarity and topic generation costs greatly impact the evaluation cost vs. reliability trade-off. Our findings challenge conventional wisdom by showing that deep judging is often preferable to shallow judging when topics are selected intelligently.
  16. Lu, K.; Kipp, M.E.I.: Understanding the retrieval effectiveness of collaborative tags and author keywords in different retrieval environments : an experimental study on medical collections (2014) 0.01
    0.0068065543 = product of:
      0.013613109 = sum of:
        0.013613109 = product of:
          0.027226217 = sum of:
            0.027226217 = weight(_text_:systems in 1215) [ClassicSimilarity], result of:
              0.027226217 = score(doc=1215,freq=2.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.1697705 = fieldWeight in 1215, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1215)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This study investigates the retrieval effectiveness of collaborative tags and author keywords in different environments through controlled experiments. Three test collections were built. The first collection tests the impact of tags on retrieval performance when only the title and abstract are available (the abstract environment). The second tests the impact of tags when the full text is available (the full-text environment). The third compares the retrieval effectiveness of tags and author keywords in the abstract environment. In addition, both single-word queries and phrase queries are tested to understand the impact of different query types. Our findings suggest that including tags and author keywords in indexes can enhance recall but may improve or worsen average precision depending on retrieval environments and query types. Indexing tags and author keywords for searching using phrase queries in the abstract environment showed improved average precision, whereas indexing tags for searching using single-word queries in the full-text environment led to a significant drop in average precision. The comparison between tags and author keywords in the abstract environment indicates that they have comparable impact on average precision, but author keywords are more advantageous in enhancing recall. The findings from this study provide useful implications for designing retrieval systems that incorporate tags and author keywords.
  17. Ruthven, I.: Relevance behaviour in TREC (2014) 0.01
    0.0068065543 = product of:
      0.013613109 = sum of:
        0.013613109 = product of:
          0.027226217 = sum of:
            0.027226217 = weight(_text_:systems in 1785) [ClassicSimilarity], result of:
              0.027226217 = score(doc=1785,freq=2.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.1697705 = fieldWeight in 1785, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1785)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Purpose - The purpose of this paper is to examine how various types of TREC data can be used to better understand relevance and serve as a test-bed for exploring relevance. The author proposes that many interesting studies can be performed on the TREC data collections that are not directly related to evaluating systems but rather to learning more about human judgements of information and relevance, and that these studies can provide useful research questions for other types of investigation. Design/methodology/approach - Through several case studies, the author shows how existing data from TREC can be used to learn more about the factors that may affect relevance judgements and interactive search decisions, and to answer new research questions for exploring relevance. Findings - The paper uncovers factors, such as familiarity, interest and strictness of relevance criteria, that affect the nature of relevance assessments within TREC, contrasting these against findings from user studies of relevance. Research limitations/implications - The research only considers certain uses of TREC data and assessments given by professional relevance assessors, but it motivates further exploration of the TREC data so that the research community can further exploit the effort involved in the construction of TREC test collections. Originality/value - The paper presents an original viewpoint on relevance investigations and on TREC itself by motivating TREC as a source of inspiration for understanding relevance rather than purely as a source of evaluation material.
  18. Tamine, L.; Chouquet, C.; Palmer, T.: Analysis of biomedical and health queries : lessons learned from TREC and CLEF evaluation benchmarks (2015) 0.01
    0.0068065543 = product of:
      0.013613109 = sum of:
        0.013613109 = product of:
          0.027226217 = sum of:
            0.027226217 = weight(_text_:systems in 2341) [ClassicSimilarity], result of:
              0.027226217 = score(doc=2341,freq=2.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.1697705 = fieldWeight in 2341, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2341)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    A large body of research work has examined, from both the query side and the user behavior side, the characteristics of medical- and health-related searches. One of the core issues in medical information retrieval (IR) is the diversity of tasks, which leads to a diversity of categories of information needs and queries. From the evaluation perspective, another related and challenging issue is the limited availability of appropriate test collections allowing the experimental validation of medically task-oriented IR techniques and systems. In this paper, we explore the peculiarities of TREC and CLEF medically oriented tasks and queries through an analysis of the differences and similarities between queries across tasks, with respect to length, specificity and clarity features, and then study their effect on retrieval performance. We show that, even for expert-oriented queries, the level of language specificity varies significantly across tasks, as does search difficulty. Additional findings highlight that query clarity factors are task dependent and that query term specificity based on domain-specific terminology resources is not significantly linked to term rareness in the document collection. The lessons learned from our study could serve as starting points for the design of future task-based medical information retrieval frameworks.
  19. Losada, D.E.; Parapar, J.; Barreiro, A.: When to stop making relevance judgments? : a study of stopping methods for building information retrieval test collections (2019) 0.01
    0.0068065543 = product of:
      0.013613109 = sum of:
        0.013613109 = product of:
          0.027226217 = sum of:
            0.027226217 = weight(_text_:systems in 4674) [ClassicSimilarity], result of:
              0.027226217 = score(doc=4674,freq=2.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.1697705 = fieldWeight in 4674, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4674)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    In information retrieval evaluation, pooling is a well-known technique for extracting a sample of documents to be assessed for relevance. Given the pooled documents, a number of studies have proposed different prioritization methods to adjudicate documents for judgment. These methods follow different strategies to reduce the assessment effort. However, there is no clear guidance on how many relevance judgments are required for creating a reliable test collection. In this article we investigate and further develop methods to determine when to stop making relevance judgments. We propose a highly diversified set of stopping methods and provide a comprehensive analysis of the usefulness of the resulting test collections. Some of the stopping methods introduced here combine innovative estimates of recall with time series models used in financial trading. Experimental results on several representative collections show that some stopping methods can reduce up to 95% of the assessment effort and still produce a robust test collection. We demonstrate that the reduced set of judgments can be reliably employed to compare search systems using disparate effectiveness metrics such as Average Precision, NDCG, P@100, and Rank Biased Precision. With all these measures, the correlations found between full-pool rankings and reduced-pool rankings are very high.
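    One of the simplest stopping rules compatible with the framing above stops judging once a window of consecutive assessments yields no new relevant documents. The sketch below implements that rule under stated assumptions; the paper's actual methods, which combine recall estimates with time series models, are more sophisticated.

    ```python
    def judge_until_stable(ranked_pool, qrels, patience=50):
        """Judge documents in pool order; stop after `patience` consecutive non-relevant ones."""
        judged, relevant_found, dry_run = [], [], 0
        for doc in ranked_pool:
            judged.append(doc)
            if qrels.get(doc, 0):        # stands in for the human assessor's decision
                relevant_found.append(doc)
                dry_run = 0
            else:
                dry_run += 1
            if dry_run >= patience:      # no new relevant documents for a while: stop
                break
        return judged, relevant_found

    # Toy pool: relevant documents become sparse toward the tail, triggering the stop.
    pool = [f"d{i}" for i in range(200)]
    qrels = {f"d{i}": 1 for i in range(0, 60, 3)}
    judged, relevant = judge_until_stable(pool, qrels, patience=20)
    print(f"judged {len(judged)} of {len(pool)} documents, found {len(relevant)} relevant")
    ```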
  20. Dzeyk, W.: Effektiv und nutzerfreundlich : Einsatz von semantischen Technologien und Usability-Methoden zur Verbesserung der medizinischen Literatursuche (2010) 0.01
    0.0067381454 = product of:
      0.013476291 = sum of:
        0.013476291 = product of:
          0.026952581 = sum of:
            0.026952581 = weight(_text_:systems in 4416) [ClassicSimilarity], result of:
              0.026952581 = score(doc=4416,freq=4.0), product of:
                0.16037072 = queryWeight, product of:
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.052184064 = queryNorm
                0.16806422 = fieldWeight in 4416, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0731742 = idf(docFreq=5561, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=4416)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This paper presents the results of the MorphoSaurus project of the German National Library of Medicine (ZB MED). The goal of the research project was to substantially improve the information retrieval of the medical search engine MEDPILOT by means of computational-linguistic approaches and to optimize the usability of the search engine interface. The project was carried out at ZB MED in Cologne in cooperation with Averbis GmbH of Freiburg between June 2007 and December 2008, and was made possible by funding from the Pakt für Forschung und Innovation. While Averbis contributed the MorphoSaurus technology for processing problematic linguistic aspects of search queries and implemented key ZB MED databases in a test system with modern search engine technology, a ZB MED team evaluated the potential of this technology. In addition to a comparison of the performance of the existing MEDPILOT search and the new search architecture, a benchmarking against competing search engines such as PubMed, Scirus, Google, Google Scholar and GoPubMed was carried out. For the evaluation, several test collections were created whose items and search phrases were derived from a content analysis of real search queries submitted to the MEDPILOT system. An examination of the relevance of the test search engine's hits, as a key criterion for search quality, showed the following result: the MorphoSaurus technology enables largely language-independent processing of foreign-language medical content. Moreover, the new technique shows its strengths particularly where lay and expert language must be processed on an equal footing and where compounds, synonyms and grammatical variants have to be analyzed. In addition, modules for detecting spelling errors and for resolving acronyms and medical abbreviations have been implemented, which promise a further increase in the system's performance. A comparison based on MEDLINE data showed that the Averbis test search environment was clearly superior to the search engines MEDPILOT, PubMed, GoPubMed and Scirus: hit relevance was higher, more hits were found overall, and the number of zero-hit messages was the lowest compared with the other search engines.