Search (47 results, page 1 of 3)

  • theme_ss:"Retrievalstudien"
  1. Pal, S.; Mitra, M.; Kamps, J.: Evaluation effort, reliability and reusability in XML retrieval (2011) 0.08
    0.07761824 = product of:
      0.15523648 = sum of:
        0.15523648 = sum of:
          0.12085646 = weight(_text_:assessment in 4197) [ClassicSimilarity], result of:
            0.12085646 = score(doc=4197,freq=4.0), product of:
              0.2801951 = queryWeight, product of:
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.050750602 = queryNorm
              0.43132967 = fieldWeight in 4197, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4197)
          0.03438003 = weight(_text_:22 in 4197) [ClassicSimilarity], result of:
            0.03438003 = score(doc=4197,freq=2.0), product of:
              0.17771997 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.050750602 = queryNorm
              0.19345059 = fieldWeight in 4197, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4197)
      0.5 = coord(1/2)
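    The explain tree above is Lucene's ClassicSimilarity at work: each matching term contributes queryWeight * fieldWeight, where queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm, tf(freq) = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), and coord scales the sum by the fraction of query clauses that matched. A minimal Python sketch reproducing the numbers for doc 4197 from the constants in the tree (helper names are ours, for illustration):

```python
import math

def idf(doc_freq: int, max_docs: int) -> float:
    """ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float) -> float:
    """One term's contribution: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm        # idf * queryNorm
    tf = math.sqrt(freq)                        # tf(freq) = sqrt(freq)
    field_weight = tf * term_idf * field_norm   # tf * idf * fieldNorm
    return query_weight * field_weight

QUERY_NORM, FIELD_NORM, MAX_DOCS = 0.050750602, 0.0390625, 44218

w_assessment = term_score(4.0, 480, MAX_DOCS, QUERY_NORM, FIELD_NORM)
w_22 = term_score(2.0, 3622, MAX_DOCS, QUERY_NORM, FIELD_NORM)
total = 0.5 * (w_assessment + w_22)  # coord(1/2): 1 of 2 clauses matched

print(round(w_assessment, 8), round(w_22, 8), round(total, 8))
# ~0.12085646  ~0.03438003  ~0.07761824
```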
    
    Abstract
    The Initiative for the Evaluation of XML retrieval (INEX) provides a TREC-like platform for evaluating content-oriented XML retrieval systems. Since 2007, INEX has been using a set of precision-recall-based metrics for its ad hoc tasks. The authors investigate the reliability and robustness of these focused retrieval measures, and of the INEX pooling method. They explore four specific questions: How reliable are the metrics when assessments are incomplete, or when query sets are small? What is the minimum pool/query-set size that can be used to reliably evaluate systems? Can the INEX collections be used to fairly evaluate "new" systems that did not participate in the pooling process? And, for a fixed amount of assessment effort, would this effort be better spent in thoroughly judging a few queries, or in judging many queries relatively superficially? The authors' findings validate properties of precision-recall-based metrics observed in document retrieval settings. Early precision measures are found to be more error-prone and less stable under incomplete judgments and small topic-set sizes. They also find that system rankings remain largely unaffected even when assessment effort is substantially (but systematically) reduced, and confirm that the INEX collections remain usable when evaluating nonparticipating systems. Finally, they observe that for a fixed amount of effort, judging shallow pools for many queries is better than judging deep pools for a smaller set of queries. However, when judging only a random sample of a pool, it is better to completely judge fewer topics than to partially judge many topics. This result confirms the effectiveness of pooling methods.
    Date
    22. 1.2011 14:20:56
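    Reliability questions like those above are commonly answered by correlating the system ranking obtained under full judgments with the ranking obtained under reduced judgments, for example with Kendall's tau. A minimal sketch with invented effectiveness scores (not data from the paper):

```python
from scipy.stats import kendalltau

# Hypothetical mean effectiveness of five systems under a full judgment
# pool and under a reduced pool (toy numbers for illustration).
full_pool    = {"sysA": 0.41, "sysB": 0.38, "sysC": 0.33, "sysD": 0.29, "sysE": 0.22}
reduced_pool = {"sysA": 0.39, "sysB": 0.37, "sysC": 0.30, "sysD": 0.31, "sysE": 0.20}

systems = sorted(full_pool)
tau, p = kendalltau([full_pool[s] for s in systems],
                    [reduced_pool[s] for s in systems])
print(f"Kendall tau = {tau:.2f} (p = {p:.3f})")
# A tau near 1.0 means the reduced pool preserves the system ranking.
```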
  2. Hansen, P.; Karlgren, J.: Effects of foreign language and task scenario on relevance assessment (2005) 0.05
    0.05233238 = product of:
      0.10466476 = sum of:
        0.10466476 = product of:
          0.20932952 = sum of:
            0.20932952 = weight(_text_:assessment in 4393) [ClassicSimilarity], result of:
              0.20932952 = score(doc=4393,freq=12.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.7470849 = fieldWeight in 4393, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4393)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Purpose - This paper aims to investigate how readers assess the relevance of retrieved documents in a foreign language they know well, compared with their native language, and whether work-task scenario descriptions have an effect on the assessment process. Design/methodology/approach - Queries, test collections, and relevance assessments were used from the 2002 Interactive CLEF. Swedish first-language speakers, fluent in English, were given simulated information-seeking scenarios and presented with retrieval results in both languages. Twenty-eight subjects in four groups were asked to rate the retrieved text documents by relevance. A two-level work-task scenario description framework was developed and applied to facilitate the study of context effects on the assessment process. Findings - Relevance assessment takes longer in a foreign language than in the user's first language. The quality of assessments, by comparison with pre-assessed results, is inferior to that of assessments made in the users' first language. Work-task scenario descriptions had an effect on the assessment process, both as measured by access time and as self-reported by subjects. However, no effects on the results as measured by traditional relevance ranking were detectable. This may be an argument for extending the traditional IR experimental topical relevance measures to cater for context effects. Originality/value - An extended two-level work-task scenario description framework was developed and applied. Contextual aspects had an effect on the relevance assessment process. English texts took longer to assess than Swedish ones and were assessed less well, especially for the most difficult queries. The IR research field needs to close this gap and to design information access systems with users' language competence in mind.
  3. Huffman, G.D.; Vital, D.A.; Bivins, R.G.: Generating indices with lexical association methods : term uniqueness (1990) 0.03
    0.030214114 = product of:
      0.06042823 = sum of:
        0.06042823 = product of:
          0.12085646 = sum of:
            0.12085646 = weight(_text_:assessment in 4152) [ClassicSimilarity], result of:
              0.12085646 = score(doc=4152,freq=4.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.43132967 = fieldWeight in 4152, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4152)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    A software system has been developed which orders citations retrieved from an online database in terms of relevancy. The system resulted from an effort, generated by NASA's Technology Utilization Program, to create new advanced software tools that largely automate the process of determining the relevancy of database citations retrieved to support large technology transfer studies. The ranking is based on the generation of an enriched vocabulary using lexical association methods, a user assessment of the vocabulary, and a combination of the user assessment and the lexical metric. One of the key elements in relevancy ranking is the enriched vocabulary - the terms must be both unique and descriptive. This paper examines term uniqueness. Six lexical association methods were employed to generate characteristic word indices. A limited subset of the terms - the highest 20, 40, 60, and 75% of the uniqueness words - was compared and uniqueness factors were developed. Computational times were also measured. It was found that methods based on occurrences and signal produced virtually the same terms. The limited subsets of terms produced by the exact and centroid discrimination values were also nearly identical. Unique term sets were produced by the occurrence, variance, and discrimination value (centroid) methods. An end-user evaluation showed that the generated terms were largely distinct and had values of word precision consistent with the values of search precision.
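    The comparison of the top-ranked percentage slices of terms across methods amounts to measuring the overlap between top-weighted term sets. A hedged sketch of such a comparison with invented terms and weights (the helper names are ours, not from the paper):

```python
def top_fraction(weights: dict[str, float], fraction: float) -> set[str]:
    """Return the top `fraction` of terms by descending weight."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:max(1, round(len(ranked) * fraction))])

def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap; 1.0 means identical term sets."""
    return len(a & b) / len(a | b)

# Toy weights from two hypothetical association methods.
occurrence = {"retrieval": 9.1, "xml": 7.4, "pooling": 5.2, "metric": 3.3, "user": 1.8}
signal     = {"xml": 8.7, "retrieval": 7.9, "query": 5.5, "pooling": 3.0, "user": 1.5}

for frac in (0.2, 0.4, 0.6):
    print(f"top {frac:.0%}: overlap = "
          f"{overlap(top_fraction(occurrence, frac), top_fraction(signal, frac)):.2f}")
```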
  4. Schaer, P.; Mayr, P.; Sünkler, S.; Lewandowski, D.: How relevant is the long tail? : a relevance assessment study on million short (2016) 0.03
    0.030214114 = product of:
      0.06042823 = sum of:
        0.06042823 = product of:
          0.12085646 = sum of:
            0.12085646 = weight(_text_:assessment in 3144) [ClassicSimilarity], result of:
              0.12085646 = score(doc=3144,freq=4.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.43132967 = fieldWeight in 3144, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3144)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Users of web search engines are known to mostly focus on the top-ranked results of the search engine result page. While many studies support this well-known information-seeking pattern, only a few studies concentrate on the question of what users are missing by neglecting lower-ranked results. To learn more about the relevance distributions in the so-called long tail, we conducted a relevance assessment study with the Million Short long-tail web search engine. While we see a clear difference in content between the head and the tail of the search engine result list, we see no statistically significant differences in the binary relevance judgments and only weak significant differences when using graded relevance. The tail contains different but still valuable results. We argue that the long tail can be a rich source for the diversification of web search engine result lists, but it needs more evaluation to clearly describe the differences.
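    The head-versus-tail contrast the abstract reports can be illustrated with standard tests: a chi-square test on binary relevance counts and a Mann-Whitney U test on graded judgments. The data below are invented for illustration; the study's actual test choices are not given here.

```python
from scipy.stats import chi2_contingency, mannwhitneyu

# Toy binary counts: [relevant, not relevant] for head vs. tail results.
binary_head, binary_tail = [62, 38], [55, 45]
chi2, p_binary, _, _ = chi2_contingency([binary_head, binary_tail])

# Toy graded judgments (0-3) for the same two result regions.
graded_head = [3, 2, 2, 3, 1, 2, 3, 2, 1, 2]
graded_tail = [2, 1, 2, 1, 1, 2, 0, 2, 1, 1]
_, p_graded = mannwhitneyu(graded_head, graded_tail)

print(f"binary: p = {p_binary:.3f}; graded: p = {p_graded:.3f}")
```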
  5. Losada, D.E.; Parapar, J.; Barreiro, A.: When to stop making relevance judgments? : a study of stopping methods for building information retrieval test collections (2019) 0.03
    0.030214114 = product of:
      0.06042823 = sum of:
        0.06042823 = product of:
          0.12085646 = sum of:
            0.12085646 = weight(_text_:assessment in 4674) [ClassicSimilarity], result of:
              0.12085646 = score(doc=4674,freq=4.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.43132967 = fieldWeight in 4674, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4674)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    In information retrieval evaluation, pooling is a well-known technique to extract a sample of documents to be assessed for relevance. Given the pooled documents, a number of studies have proposed different prioritization methods to adjudicate documents for judgment. These methods follow different strategies to reduce the assessment effort. However, there is no clear guidance on how many relevance judgments are required for creating a reliable test collection. In this article we investigate and further develop methods to determine when to stop making relevance judgments. We propose a highly diversified set of stopping methods and provide a comprehensive analysis of the usefulness of the resulting test collections. Some of the stopping methods introduced here combine innovative estimates of recall with time series models used in financial trading. Experimental results on several representative collections show that some stopping methods can reduce up to 95% of the assessment effort and still produce a robust test collection. We demonstrate that the reduced set of judgments can be reliably employed to compare search systems using disparate effectiveness metrics such as Average Precision, NDCG, P@100, and Rank Biased Precision. With all these measures, the correlations found between full pool rankings and reduced pool rankings are very high.
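    The paper's stopping methods combine recall estimates with time-series models and are not reproduced here. As a toy illustration of the underlying idea - stop judging once further effort is unlikely to surface new relevant documents - here is a simple patience-based rule (our own sketch, not one of the paper's methods):

```python
def judge_until_stopped(ranked_pool, is_relevant, patience: int = 50) -> dict:
    """Judge pooled documents in priority order; stop after `patience`
    consecutive non-relevant judgments (a toy stopping rule)."""
    judgments, dry_run = {}, 0
    for doc in ranked_pool:
        judgments[doc] = is_relevant(doc)
        dry_run = 0 if judgments[doc] else dry_run + 1
        if dry_run >= patience:
            break
    return judgments

# Usage: relevant documents thin out further down the pool.
pool = [f"doc{i}" for i in range(2000)]
qrels = judge_until_stopped(pool, is_relevant=lambda d: hash(d) % 25 == 0)
print(f"judged {len(qrels)} of {len(pool)} pooled documents")
```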
  6. Janes, J.W.; McKinney, R.: Relevance judgements of actual users and secondary judges : a comparative study (1992) 0.03
    0.029910447 = product of:
      0.059820894 = sum of:
        0.059820894 = product of:
          0.11964179 = sum of:
            0.11964179 = weight(_text_:assessment in 4276) [ClassicSimilarity], result of:
              0.11964179 = score(doc=4276,freq=2.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.4269946 = fieldWeight in 4276, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4276)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Examines judgements of relevance of document representations to query statements made by people other than the originators of the queries. A small group of graduate students in the School of Information and Library Studies and undergraduates of Michigan Univ. judged sets of documents that had been retrieved for, and judged by, real users in a previous study. The assessments of relevance by the secondary judges were analysed on their own and in comparison with the users' assessments. The judges performed reasonably well, but some important differences were identified. Secondary judges use the various fields of document records in different ways than users do and have a higher threshold of relevance.
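    Agreement between primary users and secondary judges is typically quantified with a chance-corrected statistic such as Cohen's kappa. A self-contained sketch with invented binary judgments (the study's own analysis may differ):

```python
def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Chance-corrected agreement between two binary judges."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n  # P(says relevant)
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)     # agreement by chance
    return (observed - expected) / (1 - expected)

# Toy judgments (1 = relevant) from a user and a secondary judge.
user      = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
secondary = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print(f"kappa = {cohen_kappa(user, secondary):.2f}")  # ~0.62
```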
  7. Hersh, W.; Pentecost, J.; Hickam, D.: A task-oriented approach to information retrieval evaluation : overview and design for empirical testing (1996) 0.03
    0.025637524 = product of:
      0.05127505 = sum of:
        0.05127505 = product of:
          0.1025501 = sum of:
            0.1025501 = weight(_text_:assessment in 3001) [ClassicSimilarity], result of:
              0.1025501 = score(doc=3001,freq=2.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.36599535 = fieldWeight in 3001, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3001)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    As retrieval systems become more oriented towards end-users, there is an increasing need for improved methods to evaluate their effectiveness. We performed a task-oriented assessment of 2 MEDLINE searching systems, one which promotes traditional Boolean searching on human-indexed thesaurus terms and the other natural language searching on words in the title, abstract, and indexing terms. Medical students were randomized to one of the 2 systems and given clinical questions to answer. The students were able to use each system successfully, with no significant differences in questions correctly answered, time taken, relevant articles retrieved, or user satisfaction between the systems. This approach to evaluation was successful in measuring the effectiveness of system use and demonstrates that both types of systems can be used equally well with minimal training
  8. Hersh, W.R.; Pentecost, J.; Hickam, D.H.: A task-oriented approach to retrieval system evaluation (1995) 0.03
    0.025637524 = product of:
      0.05127505 = sum of:
        0.05127505 = product of:
          0.1025501 = sum of:
            0.1025501 = weight(_text_:assessment in 3867) [ClassicSimilarity], result of:
              0.1025501 = score(doc=3867,freq=2.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.36599535 = fieldWeight in 3867, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3867)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    There is a need for improved methods to evaluate the effectiveness of end-user information retrieval systems. Performs a task-oriented assessment of 2 MEDLINE searching systems, one which promotes Boolean searching on human-indexed thesaurus terms and the other natural language searching on words in the title, abstract, and indexing terms. Each was used by medical students to answer clinical questions. Students were able to use each system successfully, with no significant differences in questions correctly answered, time taken, relevant articles retrieved, or user satisfaction between the systems. This approach to evaluation was successful in measuring the effectiveness of system use and demonstrates that both types of systems can be used equally well with minimal training
  9. Fuhr, N.; Niewelt, B.: Ein Retrievaltest mit automatisch indexierten Dokumenten (1984) 0.02
    0.024066022 = product of:
      0.048132043 = sum of:
        0.048132043 = product of:
          0.09626409 = sum of:
            0.09626409 = weight(_text_:22 in 262) [ClassicSimilarity], result of:
              0.09626409 = score(doc=262,freq=2.0), product of:
                0.17771997 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050750602 = queryNorm
                0.5416616 = fieldWeight in 262, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=262)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    20.10.2000 12:22:23
  10. Tomaiuolo, N.G.; Parker, J.: Maximizing relevant retrieval : keyword and natural language searching (1998) 0.02
    0.024066022 = product of:
      0.048132043 = sum of:
        0.048132043 = product of:
          0.09626409 = sum of:
            0.09626409 = weight(_text_:22 in 6418) [ClassicSimilarity], result of:
              0.09626409 = score(doc=6418,freq=2.0), product of:
                0.17771997 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050750602 = queryNorm
                0.5416616 = fieldWeight in 6418, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=6418)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Source
    Online. 22(1998) no.6, S.57-58
  11. Voorhees, E.M.; Harman, D.: Overview of the Sixth Text REtrieval Conference (TREC-6) (2000) 0.02
    0.024066022 = product of:
      0.048132043 = sum of:
        0.048132043 = product of:
          0.09626409 = sum of:
            0.09626409 = weight(_text_:22 in 6438) [ClassicSimilarity], result of:
              0.09626409 = score(doc=6438,freq=2.0), product of:
                0.17771997 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050750602 = queryNorm
                0.5416616 = fieldWeight in 6438, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=6438)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    11. 8.2001 16:22:19
  12. Dalrymple, P.W.: Retrieval by reformulation in two library catalogs : toward a cognitive model of searching behavior (1990) 0.02
    0.024066022 = product of:
      0.048132043 = sum of:
        0.048132043 = product of:
          0.09626409 = sum of:
            0.09626409 = weight(_text_:22 in 5089) [ClassicSimilarity], result of:
              0.09626409 = score(doc=5089,freq=2.0), product of:
                0.17771997 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050750602 = queryNorm
                0.5416616 = fieldWeight in 5089, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=5089)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    22. 7.2006 18:43:54
  13. Vakkari, P.; Huuskonen, S.: Search effort degrades search output but improves task outcome (2012) 0.02
    0.021364605 = product of:
      0.04272921 = sum of:
        0.04272921 = product of:
          0.08545842 = sum of:
            0.08545842 = weight(_text_:assessment in 46) [ClassicSimilarity], result of:
              0.08545842 = score(doc=46,freq=2.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.30499613 = fieldWeight in 46, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=46)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    We analyzed how effort in searching is associated with search output and task outcome. In a field study, we examined how students' search effort for an assigned learning task was associated with precision and relative recall, and how these were associated with the quality of the learning outcome. The study subjects were 41 medical students writing essays for a class in medicine. Searching in Medline was part of their assignment. The data comprised students' search logs in Medline, their assessments of the usefulness of references retrieved, a questionnaire concerning the search process, and evaluation scores of the essays given by the teachers. Pearson correlations were calculated to answer the research questions. Finally, a path model for predicting task outcome was built. We found that effort in the search process degraded precision but improved task outcome. There were two major mechanisms reducing precision while enhancing task outcome. Effort in expanding Medical Subject Heading (MeSH) terms within search sessions and effort in assessing and exploring documents in the result list between the sessions degraded precision, but led to better task outcome. Thus, human effort compensated for bad retrieval results on the way to a good task outcome. The findings suggest that traditional effectiveness measures in information retrieval should be complemented with evaluation measures for the search process and its outcome.
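    Precision and relative recall, as used in this field study, can be stated compactly: precision is the share of retrieved documents the searcher judged useful, and relative recall relates one searcher's useful documents to all useful documents found by any searcher for the same task. A small sketch with hypothetical sets:

```python
def precision(retrieved: set, useful: set) -> float:
    """Share of retrieved documents judged useful."""
    return len(retrieved & useful) / len(retrieved)

def relative_recall(retrieved: set, useful: set, pooled_useful: set) -> float:
    """Useful documents found by this searcher relative to all useful
    documents found by any searcher for the same task."""
    return len(retrieved & useful) / len(pooled_useful)

# Hypothetical session: 40 documents retrieved, 12 judged useful; across
# all searchers, 30 distinct documents were judged useful for the task.
retrieved = {f"d{i}" for i in range(40)}
useful = {f"d{i}" for i in range(12)}
pooled = {f"d{i}" for i in range(30)}
print(precision(retrieved, useful), relative_recall(retrieved, useful, pooled))
# 0.3 0.4
```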
  14. Ruthven, I.: Relevance behaviour in TREC (2014) 0.02
    0.021364605 = product of:
      0.04272921 = sum of:
        0.04272921 = product of:
          0.08545842 = sum of:
            0.08545842 = weight(_text_:assessment in 1785) [ClassicSimilarity], result of:
              0.08545842 = score(doc=1785,freq=2.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.30499613 = fieldWeight in 1785, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1785)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Purpose - The purpose of this paper is to examine how various types of TREC data can be used to better understand relevance and serve as a test-bed for exploring relevance. The author proposes that many interesting studies can be performed on the TREC data collections that are related not to evaluating systems but to learning more about human judgements of information and relevance, and that these studies can provide useful research questions for other types of investigation. Design/methodology/approach - Through several case studies the author shows how existing data from TREC can be used to learn more about the factors that may affect relevance judgements and interactive search decisions, and to answer new research questions for exploring relevance. Findings - The paper uncovers factors, such as familiarity, interest and strictness of relevance criteria, that affect the nature of relevance assessments within TREC, contrasting these against findings from user studies of relevance. Research limitations/implications - The research only considers certain uses of TREC data and assessments given by professional relevance assessors, but motivates further exploration of the TREC data so that the research community can further exploit the effort involved in the construction of TREC test collections. Originality/value - The paper presents an original viewpoint on relevance investigations and TREC itself by motivating TREC as a source of inspiration on understanding relevance rather than purely as a source of evaluation material.
  15. Losada, D.E.; Parapar, J.; Barreiro, A.: Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems (2017) 0.02
    0.021364605 = product of:
      0.04272921 = sum of:
        0.04272921 = product of:
          0.08545842 = sum of:
            0.08545842 = weight(_text_:assessment in 5098) [ClassicSimilarity], result of:
              0.08545842 = score(doc=5098,freq=2.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.30499613 = fieldWeight in 5098, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5098)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Evaluating Information Retrieval systems is crucial to making progress in search technologies. Evaluation is often based on assembling reference collections consisting of documents, queries and relevance judgments done by humans. In large-scale environments, exhaustively judging relevance becomes infeasible. Instead, only a pool of documents is judged for relevance. By selectively choosing documents from the pool we can optimize the number of judgments required to identify a given number of relevant documents. We argue that this iterative selection process can be naturally modeled as a reinforcement learning problem and propose innovative and formal adjudication methods based on multi-armed bandits. Casting document judging as a multi-armed bandit problem is not only theoretically appealing, but also leads to highly effective adjudication methods. Under this bandit allocation framework, we consider stationary and non-stationary models and propose seven new document adjudication methods (five stationary methods and two non-stationary variants). Our paper also reports a series of experiments performed to thoroughly compare our new methods against current adjudication methods. This comparative study includes existing methods designed for pooling-based evaluation and existing methods designed for metasearch. Our experiments show that our theoretically grounded adjudication methods can substantially minimize the assessment effort.
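    The bandit framing can be illustrated with a deliberately simple epsilon-greedy baseline: each contributing run is an arm, and judging one of its pooled documents as relevant counts as a reward. This is our own illustration of the framing, not one of the paper's seven adjudication methods.

```python
import random

def adjudicate(run_queues: dict, is_relevant, budget: int,
               epsilon: float = 0.1) -> dict:
    """Epsilon-greedily pick which run's next pooled document to judge,
    favouring runs whose documents have proven relevant so far."""
    stats = {run: [0, 0] for run in run_queues}  # run -> [relevant, judged]
    judgments = {}
    for _ in range(budget):
        active = [r for r in run_queues if run_queues[r]]
        if not active:
            break
        if random.random() < epsilon:
            run = random.choice(active)          # explore
        else:                                    # exploit best empirical mean
            run = max(active, key=lambda r: stats[r][0] / stats[r][1]
                      if stats[r][1] else 1.0)   # unjudged runs look optimistic
        doc = run_queues[run].pop(0)
        if doc not in judgments:
            judgments[doc] = is_relevant(doc)
            stats[run][0] += judgments[doc]
            stats[run][1] += 1
    return judgments

# Usage with two toy runs; every document ending in "0" is relevant.
pools = {"runA": [f"a{i}" for i in range(100)],
         "runB": [f"b{i}" for i in range(100)]}
qrels = adjudicate(pools, is_relevant=lambda d: d.endswith("0"), budget=60)
print(f"{sum(qrels.values())} relevant found in {len(qrels)} judgments")
```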
  16. Saracevic, T.: Effects of inconsistent relevance judgments on information retrieval test results : a historical perspective (2008) 0.02
    0.021364605 = product of:
      0.04272921 = sum of:
        0.04272921 = product of:
          0.08545842 = sum of:
            0.08545842 = weight(_text_:assessment in 5585) [ClassicSimilarity], result of:
              0.08545842 = score(doc=5585,freq=2.0), product of:
                0.2801951 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.050750602 = queryNorm
                0.30499613 = fieldWeight in 5585, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5585)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The main objective of information retrieval (IR) systems is to retrieve information or information objects relevant to user requests and possible needs. In IR tests, retrieval effectiveness is established by comparing IR systems' retrievals (systems relevance) with users' or user surrogates' assessments (user relevance), where user relevance is treated as the gold standard for performance evaluation. Relevance is a human notion, and establishing relevance by humans is fraught with a number of problems, inconsistency in judgment being one of them. The aim of this critical review is to explore the relationship between relevance on the one hand and testing of IR systems and procedures on the other. Critics of IR tests raised the issue of the validity of the IR tests because they were based on relevance judgments that are inconsistent. This review traces and synthesizes experimental studies dealing with (1) inconsistency of relevance judgments by people, (2) effects of such inconsistency on the results of IR tests, and (3) reasons for retrieval failures. A historical context for these studies and for IR testing is provided, including an assessment of Lancaster's (1969) evaluation of MEDLARS and its unique place in the history of IR evaluation.
  17. Allan, J.; Callan, J.P.; Croft, W.B.; Ballesteros, L.; Broglio, J.; Xu, J.; Shu, H.: INQUERY at TREC-5 (1997) 0.02
    0.017190015 = product of:
      0.03438003 = sum of:
        0.03438003 = product of:
          0.06876006 = sum of:
            0.06876006 = weight(_text_:22 in 3103) [ClassicSimilarity], result of:
              0.06876006 = score(doc=3103,freq=2.0), product of:
                0.17771997 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050750602 = queryNorm
                0.38690117 = fieldWeight in 3103, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.078125 = fieldNorm(doc=3103)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    27. 2.1999 20:55:22
  18. Ng, K.B.; Loewenstern, D.; Basu, C.; Hirsh, H.; Kantor, P.B.: Data fusion of machine-learning methods for the TREC5 routing task (and other work) (1997) 0.02
    0.017190015 = product of:
      0.03438003 = sum of:
        0.03438003 = product of:
          0.06876006 = sum of:
            0.06876006 = weight(_text_:22 in 3107) [ClassicSimilarity], result of:
              0.06876006 = score(doc=3107,freq=2.0), product of:
                0.17771997 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050750602 = queryNorm
                0.38690117 = fieldWeight in 3107, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.078125 = fieldNorm(doc=3107)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    27. 2.1999 20:59:22
  19. Saracevic, T.: On a method for studying the structure and nature of requests in information retrieval (1983) 0.02
    0.017190015 = product of:
      0.03438003 = sum of:
        0.03438003 = product of:
          0.06876006 = sum of:
            0.06876006 = weight(_text_:22 in 2417) [ClassicSimilarity], result of:
              0.06876006 = score(doc=2417,freq=2.0), product of:
                0.17771997 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050750602 = queryNorm
                0.38690117 = fieldWeight in 2417, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.078125 = fieldNorm(doc=2417)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Pages
    S.22-25
  20. Rijsbergen, C.J. van: A test for the separation of relevant and non-relevant documents in experimental retrieval collections (1973) 0.01
    0.0137520125 = product of:
      0.027504025 = sum of:
        0.027504025 = product of:
          0.05500805 = sum of:
            0.05500805 = weight(_text_:22 in 5002) [ClassicSimilarity], result of:
              0.05500805 = score(doc=5002,freq=2.0), product of:
                0.17771997 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050750602 = queryNorm
                0.30952093 = fieldWeight in 5002, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5002)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    19. 3.1996 11:22:12

Languages

  • e 42
  • d 3
  • f 1

Types

  • a 43
  • s 3
  • m 2
  • el 1