Search (100 results, page 1 of 5)

  • theme_ss:"Retrievalstudien"
  1. Ravana, S.D.; Taheri, M.S.; Rajagopal, P.: Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems (2015) 0.02
    0.021423629 = product of:
      0.064270884 = sum of:
        0.055643205 = weight(_text_:propose in 2587) [ClassicSimilarity], result of:
          0.055643205 = score(doc=2587,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 2587, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2587)
        0.008627683 = product of:
          0.025883049 = sum of:
            0.025883049 = weight(_text_:22 in 2587) [ClassicSimilarity], result of:
              0.025883049 = score(doc=2587,freq=2.0), product of:
                0.13379669 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.038207654 = queryNorm
                0.19345059 = fieldWeight in 2587, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2587)
          0.33333334 = coord(1/3)
      0.33333334 = coord(2/6)
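The score breakdown above (and in the records that follow) is Lucene's ClassicSimilarity explain output: each matching term contributes queryWeight * fieldWeight, where queryWeight = idf * queryNorm and fieldWeight = tf * idf * fieldNorm, and coord() scales the sum by the fraction of query clauses matched. A minimal plain-Python sketch that reproduces the figures shown for this record (the variable names are ours, not part of the explain output):

```python
from math import sqrt

def term_score(freq, idf, query_norm, field_norm):
    """One weight(_text_:term) leaf: queryWeight * fieldWeight."""
    tf = sqrt(freq)                       # 1.4142135 for freq=2.0
    query_weight = idf * query_norm       # idf * queryNorm
    field_weight = tf * idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight

QUERY_NORM = 0.038207654

# weight(_text_:propose in 2587): idf=5.1344433, fieldNorm=0.0390625
propose = term_score(2.0, 5.1344433, QUERY_NORM, 0.0390625)            # ~0.0556432

# weight(_text_:22 in 2587): idf=3.5018296, same fieldNorm, wrapped in coord(1/3)
term_22 = term_score(2.0, 3.5018296, QUERY_NORM, 0.0390625) * (1 / 3)  # ~0.0086277

# outer coord(2/6): two of six query clauses matched in this document
print((propose + term_22) * (2 / 6))  # ~0.021423629, the score shown for result 1
```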
    
    Abstract
    Purpose - The purpose of this paper is to propose a method for obtaining more accurate results when comparing the performance of paired information retrieval (IR) systems, relative to the current method, which is based on the systems' mean effectiveness scores across a set of identified topics/queries. Design/methodology/approach - In the proposed approach, instead of the classic method of using a set of topic scores, document-level scores are taken as the evaluation unit. These document scores are the defined document weights, which take over the role of the systems' mean average precision (MAP) scores as the significance test's statistic. The experiments were conducted using the TREC 9 Web track collection. Findings - The p-values generated by two significance tests, Student's t-test and the Mann-Whitney test, show that using document-level scores as the evaluation unit makes the difference between IR systems more significant than using topic scores. Originality/value - A suitable test collection is a primary prerequisite for the comparative evaluation of IR systems. In addition to reusable test collections, however, accurate statistical testing is a necessity for these evaluations. The findings of this study will help IR researchers evaluate their retrieval systems and algorithms more accurately.
    Date
    20. 1.2015 18:30:22
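The comparison described in the abstract of result 1 comes down to running a paired significance test over document-level scores instead of per-topic MAP. A minimal SciPy sketch of that final step, with synthetic score arrays standing in for the paper's document weights (which are defined in the paper and not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical document-level scores ("document weights") for two IR systems
# over the same pooled documents; these replace the per-topic MAP scores.
n_docs = 500
system_a = rng.beta(2, 5, size=n_docs)
system_b = np.clip(system_a + rng.normal(0.02, 0.05, size=n_docs), 0, 1)

# Paired Student's t-test on the document-level scores
t_stat, t_p = stats.ttest_rel(system_a, system_b)

# Mann-Whitney U test (rank-based), the second test named in the abstract
u_stat, u_p = stats.mannwhitneyu(system_a, system_b, alternative="two-sided")

print(f"t-test p={t_p:.4f}, Mann-Whitney p={u_p:.4f}")
```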
  2. Rajagopal, P.; Ravana, S.D.; Koh, Y.S.; Balakrishnan, V.: Evaluating the effectiveness of information retrieval systems using effort-based relevance judgment (2019) 0.02
    0.021423629 = product of:
      0.064270884 = sum of:
        0.055643205 = weight(_text_:propose in 5287) [ClassicSimilarity], result of:
          0.055643205 = score(doc=5287,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 5287, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5287)
        0.008627683 = product of:
          0.025883049 = sum of:
            0.025883049 = weight(_text_:22 in 5287) [ClassicSimilarity], result of:
              0.025883049 = score(doc=5287,freq=2.0), product of:
                0.13379669 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.038207654 = queryNorm
                0.19345059 = fieldWeight in 5287, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5287)
          0.33333334 = coord(1/3)
      0.33333334 = coord(2/6)
    
    Abstract
    Purpose - Effort, in addition to relevance, is a major factor in a document's satisfaction and utility for the actual user. The purpose of this paper is to propose a method for generating relevance judgments that incorporate effort without involving human judges. The study then determines the variation in system rankings caused by low-effort relevance judgments when evaluating retrieval systems at different depths of evaluation. Design/methodology/approach - Effort-based relevance judgments are generated using a proposed boxplot approach applied to simple document features, HTML features and readability features. The boxplot approach is a simple yet repeatable way of classifying documents' effort while ensuring that outlier scores do not skew the grading of the entire set of documents. Findings - Evaluating retrieval systems with low-effort relevance judgments has a stronger influence at shallow evaluation depths than at deeper ones. It is shown that the difference in system rankings is due to low-effort documents and not to the number of relevant documents. Originality/value - It is therefore crucial to evaluate retrieval systems at shallow depths using low-effort relevance judgments.
    Date
    20. 1.2015 18:30:22
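The boxplot approach in result 2 grades documents by effort from simple features while keeping outliers from distorting the class boundaries. A rough sketch under our own assumptions (a single synthetic "effort" feature and quartile cut-offs); the paper's actual feature set and grading scheme differ:

```python
import numpy as np

def classify_effort(effort_scores):
    """Label each document low/medium/high effort using boxplot statistics."""
    scores = np.asarray(effort_scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    # 1.5 * IQR whisker rule: cap outliers so they cannot shift the grading
    clipped = np.clip(scores, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return np.where(clipped <= q1, "low",
           np.where(clipped >= q3, "high", "medium"))

# Made-up effort proxy, e.g. document length scaled by a readability score
effort = np.random.default_rng(1).lognormal(mean=0.0, sigma=0.7, size=20)
print(list(classify_effort(effort)))
```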
  3. Mandl, T.: Evaluierung im Information Retrieval : die Hildesheimer Antwort auf aktuelle Herausforderungen der globalisierten Informationsgesellschaft (2010) 0.02
    0.018839221 = product of:
      0.11303533 = sum of:
        0.11303533 = weight(_text_:forschung in 4011) [ClassicSimilarity], result of:
          0.11303533 = score(doc=4011,freq=4.0), product of:
            0.1858777 = queryWeight, product of:
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.038207654 = queryNorm
            0.6081167 = fieldWeight in 4011, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.0625 = fieldNorm(doc=4011)
      0.16666667 = coord(1/6)
    
    Abstract
    Research on the evaluation of information retrieval systems has taken new directions in recent years and produced interesting results. Whereas earlier work focused primarily on the superiority of individual methods in heterogeneous application scenarios, the validity of the evaluation methodology itself is increasingly moving into the centre of attention. This article summarizes current research on innovative evaluation measures and on the reliability of the so-called Cranfield paradigm.
  4. Biebricher, P.; Fuhr, N.; Niewelt, B.: ¬Der AIR-Retrievaltest (1986) 0.02
    0.016651679 = product of:
      0.099910066 = sum of:
        0.099910066 = weight(_text_:forschung in 4040) [ClassicSimilarity], result of:
          0.099910066 = score(doc=4040,freq=2.0), product of:
            0.1858777 = queryWeight, product of:
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.038207654 = queryNorm
            0.5375043 = fieldWeight in 4040, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.078125 = fieldNorm(doc=4040)
      0.16666667 = coord(1/6)
    
    Source
    Automatische Indexierung zwischen Forschung und Anwendung, Hrsg.: G. Lustig
  5. Grummann, M.: Sind Verfahren zur maschinellen Indexierung für Literaturbestände Öffentlicher Bibliotheken geeignet? : Retrievaltests von indexierten ekz-Daten mit der Software IDX (2000) 0.01
    0.013321342 = product of:
      0.07992805 = sum of:
        0.07992805 = weight(_text_:forschung in 1879) [ClassicSimilarity], result of:
          0.07992805 = score(doc=1879,freq=2.0), product of:
            0.1858777 = queryWeight, product of:
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.038207654 = queryNorm
            0.43000343 = fieldWeight in 1879, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.0625 = fieldNorm(doc=1879)
      0.16666667 = coord(1/6)
    
    Source
    Bibliothek: Forschung und Praxis. 24(2000) H.3, S.297-318
  6. Mandl, T.: Neue Entwicklungen bei den Evaluierungsinitiativen im Information Retrieval (2006) 0.01
    0.013321342 = product of:
      0.07992805 = sum of:
        0.07992805 = weight(_text_:forschung in 5975) [ClassicSimilarity], result of:
          0.07992805 = score(doc=5975,freq=2.0), product of:
            0.1858777 = queryWeight, product of:
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.038207654 = queryNorm
            0.43000343 = fieldWeight in 5975, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.0625 = fieldNorm(doc=5975)
      0.16666667 = coord(1/6)
    
    Abstract
    In information retrieval, evaluation initiatives make a substantial contribution to empirically grounded research. With extensive collections and tasks they support standardization and thus system development. Growing demands on corpora and application scenarios have led to a strong diversification among the evaluation initiatives. This article gives an overview of the current state of the most important evaluation initiatives and of new trends.
  7. Schirrmeister, N.-P.; Keil, S.: Aufbau einer Infrastruktur für Information Retrieval-Evaluationen (2012) 0.01
    0.013321342 = product of:
      0.07992805 = sum of:
        0.07992805 = weight(_text_:forschung in 3097) [ClassicSimilarity], result of:
          0.07992805 = score(doc=3097,freq=2.0), product of:
            0.1858777 = queryWeight, product of:
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.038207654 = queryNorm
            0.43000343 = fieldWeight in 3097, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.8649335 = idf(docFreq=926, maxDocs=44218)
              0.0625 = fieldNorm(doc=3097)
      0.16666667 = coord(1/6)
    
    Abstract
    The project "Aufbau einer Infrastruktur für Information Retrieval-Evaluationen" (AIIRE) provides a software infrastructure for supporting information retrieval evaluations (IR evaluations). The infrastructure is based on a toolkit developed at GESIS within the DFG project IRM. The goal is to offer a system that can be used for IR evaluations in research and teaching at the Fachbereich Media. This paper describes some aspects of the AIIRE project; its goal is to build a software infrastructure which supports the evaluation of information retrieval algorithms.
  8. Losada, D.E.; Parapar, J.; Barreiro, A.: Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems (2017) 0.01
    0.01311523 = product of:
      0.07869138 = sum of:
        0.07869138 = weight(_text_:propose in 5098) [ClassicSimilarity], result of:
          0.07869138 = score(doc=5098,freq=4.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.40112838 = fieldWeight in 5098, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5098)
      0.16666667 = coord(1/6)
    
    Abstract
    Evaluating Information Retrieval systems is crucial to making progress in search technologies. Evaluation is often based on assembling reference collections consisting of documents, queries and relevance judgments done by humans. In large-scale environments, exhaustively judging relevance becomes infeasible. Instead, only a pool of documents is judged for relevance. By selectively choosing documents from the pool we can optimize the number of judgments required to identify a given number of relevant documents. We argue that this iterative selection process can be naturally modeled as a reinforcement learning problem and propose innovative and formal adjudication methods based on multi-armed bandits. Casting document judging as a multi-armed bandit problem is not only theoretically appealing, but also leads to highly effective adjudication methods. Under this bandit allocation framework, we consider stationary and non-stationary models and propose seven new document adjudication methods (five stationary methods and two non-stationary variants). Our paper also reports a series of experiments performed to thoroughly compare our new methods against current adjudication methods. This comparative study includes existing methods designed for pooling-based evaluation and existing methods designed for metasearch. Our experiments show that our theoretically grounded adjudication methods can substantially minimize the assessment effort.
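As a rough illustration of the bandit framing in result 8: each contributing run is an arm, pulling an arm means judging that run's next unjudged pooled document, and the reward is whether that document turns out to be relevant. The sketch below uses plain Thompson sampling on toy data; it is not one of the paper's seven adjudication methods.

```python
import random

def adjudicate(pools, qrels, budget, seed=0):
    """Allocate relevance judgments across runs with Thompson sampling.

    pools: {run_id: ranked list of doc_ids}; qrels: {doc_id: 0/1}, consulted
    only when a document is actually judged. Returns the judged documents.
    """
    rng = random.Random(seed)
    alpha = {run: 1 for run in pools}  # Beta prior: observed relevant + 1
    beta = {run: 1 for run in pools}   # Beta prior: observed non-relevant + 1
    judged = {}
    for _ in range(budget):
        candidates = [r for r in pools if any(d not in judged for d in pools[r])]
        if not candidates:
            break
        # sample a plausible relevance rate per run, judge from the best one
        run = max(candidates, key=lambda r: rng.betavariate(alpha[r], beta[r]))
        doc = next(d for d in pools[run] if d not in judged)
        judged[doc] = rel = qrels.get(doc, 0)
        alpha[run] += rel
        beta[run] += 1 - rel
    return judged

pools = {"runA": ["d1", "d2", "d3", "d4"], "runB": ["d3", "d5", "d6", "d1"]}
qrels = {"d1": 1, "d3": 1, "d5": 1}
print(adjudicate(pools, qrels, budget=5))
```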
  9. Bodoff, D.; Kambil, A.: Partial coordination : II. A preliminary evaluation and failure analysis (1998) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 2323) [ClassicSimilarity], result of:
          0.06677184 = score(doc=2323,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 2323, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=2323)
      0.16666667 = coord(1/6)
    
    Abstract
    Partial coordination is a new method for cataloging documents for subject access. It is especially designed to enhance the precision of document searches in online environments. This article reports a preliminary evaluation of partial coordination that shows promising results compared with full-text retrieval. We also report the difficulties in empirically evaluating the effectiveness of automatic full-text retrieval in contrast to mixed methods such as partial coordination, which combine human cataloging with computerized retrieval. Based on our study, we propose that research in this area will benefit substantially from a common framework for failure analysis and a common data set. This will allow information retrieval researchers adapting 'library style' cataloging to large electronic document collections, as well as those developing automated or mixed methods, to directly compare their proposals for indexing and retrieval. The article concludes by suggesting guidelines for constructing such a testbed.
  10. Baillie, M.; Azzopardi, L.; Ruthven, I.: Evaluating epistemic uncertainty under incomplete assessments (2008) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 2065) [ClassicSimilarity], result of:
          0.06677184 = score(doc=2065,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 2065, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=2065)
      0.16666667 = coord(1/6)
    
    Abstract
    The thesis of this study is to propose an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. This new methodology aims to identify potential uncertainty during system comparison that may result from incompleteness. Adopting this methodology is advantageous, because detecting epistemic uncertainty - the amount of knowledge (or ignorance) we have about the estimate of a system's performance - during the evaluation process can guide and direct researchers when evaluating new systems over existing and future test collections. Across a series of experiments we demonstrate how this methodology can lead towards a finer-grained analysis of systems. In particular, we show through experimentation how the current practice in Information Retrieval evaluation of using a measurement depth larger than the pooling depth increases uncertainty during system comparison.
  11. Li, J.; Zhang, P.; Song, D.; Wu, Y.: Understanding an enriched multidimensional user relevance model by analyzing query logs (2017) 0.01
    0.011128641 = product of:
      0.06677184 = sum of:
        0.06677184 = weight(_text_:propose in 3961) [ClassicSimilarity], result of:
          0.06677184 = score(doc=3961,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.3403687 = fieldWeight in 3961, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.046875 = fieldNorm(doc=3961)
      0.16666667 = coord(1/6)
    
    Abstract
    Modeling multidimensional relevance in information retrieval (IR) has attracted much attention in recent years. However, most existing studies are conducted through relatively small-scale user studies, which may not reflect a real-world and natural search scenario. In this article, we propose to study the multidimensional user relevance model (MURM) on large scale query logs, which record users' various search behaviors (e.g., query reformulations, clicks and dwelling time, etc.) in natural search settings. We advance an existing MURM model (including five dimensions: topicality, novelty, reliability, understandability, and scope) by providing two additional dimensions, that is, interest and habit. The two new dimensions represent personalized relevance judgment on retrieved documents. Further, for each dimension in the enriched MURM model, a set of computable features are formulated. By conducting extensive document ranking experiments on Bing's query logs and TREC session Track data, we systematically investigated the impact of each dimension on retrieval performance and gained a series of insightful findings which may bring benefits for the design of future IR systems.
  12. Sun, Y.; Kantor, P.B.: Cross-evaluation : a new model for information system evaluation (2006) 0.01
    0.009273868 = product of:
      0.055643205 = sum of:
        0.055643205 = weight(_text_:propose in 5048) [ClassicSimilarity], result of:
          0.055643205 = score(doc=5048,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 5048, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5048)
      0.16666667 = coord(1/6)
    
    Abstract
    In this article, we introduce a new information system evaluation method and report on its application to a collaborative information seeking system, AntWorld. The key innovation of the new method is to use precisely the same group of users who work with the system as judges, an approach we call Cross-Evaluation. In the new method, we also propose to assess the system at the level of task completion. The obvious potential limitation of this method is that individuals may be inclined to think more highly of the materials that they themselves have found, and are almost certain to think more highly of their own work product than they do of the products built by others. The keys to neutralizing this problem are careful design and a corresponding analytical model based on analysis of variance. We model several measures of task completion with a linear model of five effects, describing the users who interact with the system, the system used to finish the task, the task itself, the behavior of individuals as judges, and the self-judgment bias. Our analytical method successfully isolates the effect of each variable. This approach provides a successful model to make concrete the "three realities" paradigm, which calls for "real tasks," "real users," and "real systems."
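The five-effect linear model in result 12 maps directly onto an ordinary ANOVA. A minimal statsmodels sketch on synthetic data; the column names and effect sizes are our assumptions, not values from the study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format data: one task-completion score per
# (user, system, task, judge) combination.
rng = np.random.default_rng(2)
rows = []
for user in range(6):
    for system in ("AntWorld", "baseline"):
        for task in range(4):
            for judge in range(6):
                self_judged = int(judge == user)
                score = (3.0 + 0.4 * (system == "AntWorld")
                         + 0.3 * self_judged + rng.normal(0, 0.5))
                rows.append(dict(user=user, system=system, task=task,
                                 judge=judge, self_judged=self_judged,
                                 score=score))
df = pd.DataFrame(rows)

# The five effects named in the abstract: user, system, task, judge, self-judgment bias
model = smf.ols("score ~ C(user) + C(system) + C(task) + C(judge) + C(self_judged)",
                data=df).fit()
print(anova_lm(model, typ=2))  # isolates the contribution of each effect
```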
  13. Toepfer, M.; Seifert, C.: Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints 0.01
    0.009273868 = product of:
      0.055643205 = sum of:
        0.055643205 = weight(_text_:propose in 4309) [ClassicSimilarity], result of:
          0.055643205 = score(doc=4309,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 4309, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4309)
      0.16666667 = coord(1/6)
    
    Abstract
    Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document-level. Therefore, we propose a novel approach that detects documents rather than concepts where quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of the previously unseen data where considerable gains in document-level recall can be achieved, while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.
  14. Losada, D.E.; Parapar, J.; Barreiro, A.: When to stop making relevance judgments? : a study of stopping methods for building information retrieval test collections (2019) 0.01
    0.009273868 = product of:
      0.055643205 = sum of:
        0.055643205 = weight(_text_:propose in 4674) [ClassicSimilarity], result of:
          0.055643205 = score(doc=4674,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 4674, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4674)
      0.16666667 = coord(1/6)
    
    Abstract
    In information retrieval evaluation, pooling is a well-known technique for extracting a sample of documents to be assessed for relevance. Given the pooled documents, a number of studies have proposed different prioritization methods to adjudicate documents for judgment. These methods follow different strategies to reduce the assessment effort. However, there is no clear guidance on how many relevance judgments are required for creating a reliable test collection. In this article we investigate and further develop methods to determine when to stop making relevance judgments. We propose a highly diversified set of stopping methods and provide a comprehensive analysis of the usefulness of the resulting test collections. Some of the stopping methods introduced here combine innovative estimates of recall with time series models used in financial trading. Experimental results on several representative collections show that some stopping methods can reduce the assessment effort by up to 95% and still produce a robust test collection. We demonstrate that the reduced set of judgments can be reliably employed to compare search systems using disparate effectiveness metrics such as Average Precision, NDCG, P@100, and Rank-Biased Precision. With all these measures, the correlations found between full-pool rankings and reduced-pool rankings are very high.
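A toy illustration of a stopping method in the spirit of result 14: judge pooled documents in priority order and stop once an estimated recall passes a target. The recall estimator below (extrapolating the recent marginal relevance rate over the unjudged remainder) is our own placeholder, not one of the paper's methods:

```python
def judge_until_stop(ranked_pool, qrels, target_recall=0.9, window=50):
    """Stop judging when the estimated recall of found relevant docs hits the target."""
    found = judged = 0
    recent = []
    for doc in ranked_pool:
        rel = qrels.get(doc, 0)
        judged += 1
        found += rel
        recent = (recent + [rel])[-window:]
        remaining = len(ranked_pool) - judged
        # crude estimate of total relevant docs: found + marginal rate * remaining
        est_total = found + (sum(recent) / len(recent)) * remaining
        if est_total > 0 and found / est_total >= target_recall:
            break
    return judged, found

# Toy pool in which relevant documents thin out towards the tail
pool = [f"d{i}" for i in range(200)]
qrels = {f"d{i}": 1 for i in range(0, 90, 3)}
print(judge_until_stop(pool, qrels))
```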
  15. Angelini, M.; Fazzini, V.; Ferro, N.; Santucci, G.; Silvello, G.: CLAIRE: A combinatorial visual analytics system for information retrieval evaluation (2018) 0.01
    0.009273868 = product of:
      0.055643205 = sum of:
        0.055643205 = weight(_text_:propose in 5049) [ClassicSimilarity], result of:
          0.055643205 = score(doc=5049,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 5049, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5049)
      0.16666667 = coord(1/6)
    
    Abstract
    Information Retrieval (IR) develops complex systems, constituted of several components, which aim at returning and optimally ranking the most relevant documents in response to user queries. In this context, experimental evaluation plays a central role, since it allows for measuring the effectiveness of IR systems, increasing the understanding of their functioning, and better directing the efforts for improving them. Current evaluation methodologies are limited by two major factors: (i) IR systems are evaluated as "black boxes", since it is not possible to decompose the contributions of the different components, e.g., stop lists, stemmers, and IR models; (ii) given that it is not possible to predict the effectiveness of an IR system, both academia and industry need to explore huge numbers of systems, originated by large combinatorial compositions of their components, to understand how they perform and how these components interact together. We propose a Combinatorial visuaL Analytics system for Information Retrieval Evaluation (CLAIRE), which allows for exploring and making sense of the performance of a large number of IR systems, in order to quickly and intuitively grasp which system configurations are preferred, what the contributions of the different components are, and how these components interact together. The CLAIRE system is then validated against use cases based on several test collections using a wide set of systems, generated by a combinatorial composition of several off-the-shelf components, representing the most common denominator almost always present in English IR systems. In particular, we validate the findings enabled by CLAIRE with respect to consolidated deep statistical analyses, and we show that the CLAIRE system allows the generation of new insights which were not detectable with traditional approaches.
  16. Parapar, J.; Losada, D.E.; Presedo-Quindimil, M.A.; Barreiro, A.: Using score distributions to compare statistical significance tests for information retrieval evaluation (2020) 0.01
    0.009273868 = product of:
      0.055643205 = sum of:
        0.055643205 = weight(_text_:propose in 5506) [ClassicSimilarity], result of:
          0.055643205 = score(doc=5506,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 5506, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5506)
      0.16666667 = coord(1/6)
    
    Abstract
    Statistical significance tests can provide evidence that the observed difference in performance between two methods is not due to chance. In information retrieval (IR), some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using score distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falseness of the null hypothesis. This new method for studying the power of significance tests in IR evaluation is formal and innovative. Following this type of analysis, we found that both the sign test and the Wilcoxon signed test have more power than the permutation test and the t-test. The sign test and the Wilcoxon signed test also behave well in terms of type I errors. The bootstrap test shows few type I errors, but it has less power than the other methods tested.
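A condensed sketch of the simulation idea in result 16: draw per-topic scores for two systems from score distributions with a known true difference, apply several significance tests, and count rejections. With no true difference the rejection rate estimates type I error; with a real difference it estimates power. The distributions and the shift are placeholders, and the permutation test is omitted for brevity:

```python
import numpy as np
from scipy import stats

def rejection_rates(true_shift=0.0, topics=50, trials=1000, alpha=0.05, seed=3):
    """Fraction of simulated topic-score samples each test declares significant."""
    rng = np.random.default_rng(seed)
    rejected = {"t-test": 0, "wilcoxon": 0, "sign": 0}
    for _ in range(trials):
        a = rng.beta(2, 5, size=topics)                  # e.g. simulated AP per topic
        b = np.clip(a + true_shift + rng.normal(0, 0.05, size=topics), 0, 1)
        d = b - a
        rejected["t-test"] += stats.ttest_rel(a, b).pvalue < alpha
        rejected["wilcoxon"] += stats.wilcoxon(a, b).pvalue < alpha
        # sign test: binomial test on the number of positive differences
        k, n = int((d > 0).sum()), int((d != 0).sum())
        rejected["sign"] += stats.binomtest(k, n, 0.5).pvalue < alpha
    return {name: count / trials for name, count in rejected.items()}

print(rejection_rates(true_shift=0.0))   # approximate type I error rates
print(rejection_rates(true_shift=0.02))  # approximate power under a small real shift
```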
  17. Gao, R.; Ge, Y.; Sha, C.: FAIR: Fairness-aware information retrieval evaluation (2022) 0.01
    0.009273868 = product of:
      0.055643205 = sum of:
        0.055643205 = weight(_text_:propose in 669) [ClassicSimilarity], result of:
          0.055643205 = score(doc=669,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.2836406 = fieldWeight in 669, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.0390625 = fieldNorm(doc=669)
      0.16666667 = coord(1/6)
    
    Abstract
    With the emerging need to create fairness-aware solutions for search and recommendation systems, evaluating such solutions poses a daunting challenge. While many of the traditional information retrieval (IR) metrics can capture relevance, diversity, and novelty as aspects of utility with respect to users, they are not suitable for inferring whether the presented results are fair from the perspective of responsible information exposure. On the other hand, existing fairness metrics do not account for user utility or do not measure it adequately. To address this problem, we propose a new metric called FAIR. By unifying standard IR metrics and fairness measures into an integrated metric, this metric offers a new perspective for evaluating fairness-aware ranking results. Based on this metric, we developed an effective ranking algorithm that jointly optimized user utility and fairness. The experimental results showed that our FAIR metric could highlight results with good user utility and fair information exposure. We showed how FAIR relates to a set of existing utility and fairness metrics and demonstrated the effectiveness of our FAIR-based algorithm. We believe our work opens up a new direction of pursuing a metric for evaluating and implementing FAIR systems.
  18. Rijsbergen, C.J. van: ¬A test for the separation of relevant and non-relevant documents in experimental retrieval collections (1973) 0.01
    0.00924463 = product of:
      0.05546778 = sum of:
        0.05546778 = product of:
          0.08320167 = sum of:
            0.041788794 = weight(_text_:29 in 5002) [ClassicSimilarity], result of:
              0.041788794 = score(doc=5002,freq=2.0), product of:
                0.13440257 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.038207654 = queryNorm
                0.31092256 = fieldWeight in 5002, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5002)
            0.041412875 = weight(_text_:22 in 5002) [ClassicSimilarity], result of:
              0.041412875 = score(doc=5002,freq=2.0), product of:
                0.13379669 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.038207654 = queryNorm
                0.30952093 = fieldWeight in 5002, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5002)
          0.6666667 = coord(2/3)
      0.16666667 = coord(1/6)
    
    Date
    19. 3.1996 11:22:12
    Source
    Journal of documentation. 29(1973) no.3, S.251-257
  19. Mansourian, Y.; Ford, N.: Search persistence and failure on the web : a "bounded rationality" and "satisficing" analysis (2007) 0.01
    0.007419094 = product of:
      0.044514563 = sum of:
        0.044514563 = weight(_text_:propose in 841) [ClassicSimilarity], result of:
          0.044514563 = score(doc=841,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.22691247 = fieldWeight in 841, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.03125 = fieldNorm(doc=841)
      0.16666667 = coord(1/6)
    
    Abstract
    Purpose - Our current knowledge of how searchers perceive, and react to, the possibility of missing potentially important information whilst searching the web is limited. The study reported here seeks to investigate such perceptions and reactions, and to explore the extent to which Simon's "bounded rationality" theory is useful in illuminating these issues. Design/methodology/approach - In total, 37 academic staff, research staff and research students in three university departments were interviewed about their web searching. The open-ended, semi-structured interviews were analysed inductively. The emergence of the concept of "good enough" searching prompted a further analysis to explore the extent to which the data could be interpreted in terms of Simon's concepts of "bounded rationality" and "satisficing". Findings - The results indicate that the risk of missing potentially important information was a matter of concern to the interviewees. Their estimations of the likely extent and importance of missed information affected decisions by individuals as to when to stop searching - decisions based on very different criteria, which map well onto Simon's concepts. On the basis of the interview data, the authors propose tentative categorizations of perceptions of the risk of missing information, including "inconsequential", "tolerable", "damaging" and "disastrous", and of search strategies, including "perfunctory", "minimalist", "nervous" and "extensive". It is concluded that there is at least a prima facie case for bounded rationality and satisficing to be considered as potentially useful concepts in our quest to better understand aspects of human information behaviour. Research limitations/implications - Although the findings are based on a relatively small sample and an exploratory qualitative analysis, it is argued that the study raises a number of interesting questions and has implications for both the development of theory and practice in the areas of web searching and information literacy. Originality/value - The paper focuses on an aspect of web searching which has not to date been well explored. Whilst research has done much to illuminate searchers' perceptions of what they find on the web, we know relatively little about their perceptions of, and reactions to, information that they fail to find. The study reported here provides some tentative models, based on empirical evidence, of these phenomena.
  20. Kutlu, M.; Elsayed, T.; Lease, M.: Intelligent topic selection for low-cost information retrieval evaluation : a new perspective on deep vs. shallow judging (2018) 0.01
    0.007419094 = product of:
      0.044514563 = sum of:
        0.044514563 = weight(_text_:propose in 5092) [ClassicSimilarity], result of:
          0.044514563 = score(doc=5092,freq=2.0), product of:
            0.19617504 = queryWeight, product of:
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.038207654 = queryNorm
            0.22691247 = fieldWeight in 5092, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.1344433 = idf(docFreq=707, maxDocs=44218)
              0.03125 = fieldNorm(doc=5092)
      0.16666667 = coord(1/6)
    
    Abstract
    While test collections provide the cornerstone for Cranfield-based evaluation of information retrieval (IR) systems, it has become practically infeasible to rely on traditional pooling techniques to construct test collections at the scale of today's massive document collections (e.g., ClueWeb12's 700M+ webpages). This has motivated a flurry of studies proposing more cost-effective yet reliable IR evaluation methods. In this paper, we propose a new intelligent topic selection method which reduces the number of search topics (and thereby costly human relevance judgments) needed for reliable IR evaluation. To rigorously assess our method, we integrate previously disparate lines of research on intelligent topic selection and deep vs. shallow judging (i.e., whether it is more cost-effective to collect many relevance judgments for a few topics or a few judgments for many topics). While prior work on intelligent topic selection has never been evaluated against shallow judging baselines, prior work on deep vs. shallow judging has largely argued for shallow judging, but has assumed random topic selection. We argue that for evaluating any topic selection method, ultimately one must ask whether it is actually useful to select topics, or whether one should simply perform shallow judging over many topics. In seeking a rigorous answer to this overarching question, we conduct a comprehensive investigation over a set of relevant factors never previously studied together: 1) the method of topic selection; 2) the effect of topic familiarity on human judging speed; and 3) how different topic generation processes (requiring varying human effort) impact (i) budget utilization and (ii) the resultant quality of judgments. Experiments on the NIST TREC Robust 2003 and Robust 2004 test collections show that not only can we reliably evaluate IR systems with fewer topics, but also that: 1) when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in evaluation reliability; and 2) topic familiarity and topic generation costs greatly impact the evaluation cost vs. reliability trade-off. Our findings challenge conventional wisdom in showing that deep judging is often preferable to shallow judging when topics are selected intelligently.

Languages

  • e 84
  • d 11
  • f 1
  • fi 1
  • m 1
  • nl 1

Types

  • a 90
  • s 7
  • m 5
  • el 3
  • r 2