Search (1 result, page 1 of 1)

  • author_ss:"Sakai, T."
  • theme_ss:"Retrievalalgorithmen"
  1. Sakai, T.: On the reliability of information retrieval metrics based on graded relevance (2007)
    
    Abstract
    This paper compares 14 information retrieval metrics based on graded relevance, together with 10 traditional metrics based on binary relevance, in terms of stability, sensitivity and resemblance of system rankings. More specifically, we compare these metrics using the Buckley/Voorhees stability method, the Voorhees/Buckley swap method and Kendall's rank correlation, with three data sets comprising test collections and submitted runs from NTCIR. Our experiments show that Normalised Discounted Cumulative Gain at document cut-off l, and its averaged variant, are the best among the rank-based graded-relevance metrics, provided that l is large. On the other hand, if one requires a recall-based graded-relevance metric that is highly correlated with Average Precision, then Q-measure is the best choice. Moreover, these best graded-relevance metrics are at least as stable and sensitive as Average Precision, and they are fairly robust to the choice of gain values.
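
    For readers who want to see the two headline metrics concretely, the Python sketch below computes nDCG at document cut-off l and Q-measure for a single topic from graded gain values. It is a minimal illustration under stated assumptions, not the paper's exact formulation: it uses the common log2 discount for nDCG and fixes the Q-measure blending parameter beta at 1, and all function names and the example gain values are invented for this sketch.

    import math

    def dcg_at_l(gains, l):
        # Discounted cumulative gain at document cut-off l.
        # `gains` lists the graded gain of each ranked document, in rank order.
        # Assumes the common log2 rank discount; the paper's variant may differ.
        return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:l], start=1))

    def ndcg_at_l(gains, judged_gains, l):
        # Normalised DCG: DCG of the run divided by the DCG of an ideal
        # ranking (all judged documents sorted by decreasing gain).
        ideal = dcg_at_l(sorted(judged_gains, reverse=True), l)
        return dcg_at_l(gains, l) / ideal if ideal > 0 else 0.0

    def q_measure(gains, judged_gains):
        # Q-measure with beta fixed at 1 (an assumption). At each rank r that
        # holds a relevant document, add the blended ratio
        #   BR(r) = (cg(r) + count(r)) / (cg*(r) + r),
        # where cg(r) is the cumulative gain of the run at rank r, cg*(r) is
        # the cumulative gain of the ideal ranking, and count(r) is the number
        # of relevant documents retrieved so far; finally divide by R, the
        # total number of relevant documents for the topic.
        ideal = sorted(judged_gains, reverse=True)
        R = sum(1 for g in ideal if g > 0)
        cg = cg_star = count = 0
        total = 0.0
        for r, g in enumerate(gains, start=1):
            cg += g
            cg_star += ideal[r - 1] if r <= len(ideal) else 0
            if g > 0:
                count += 1
                total += (cg + count) / (cg_star + r)
        return total / R if R else 0.0

    # Hypothetical graded gains (3 = highly relevant ... 0 = non-relevant).
    run = [3, 0, 2, 1, 0]        # gains in the order the system ranked them
    judged = [3, 2, 1, 1, 0, 0]  # gains of every judged document for the topic
    print(ndcg_at_l(run, judged, l=5))
    print(q_measure(run, judged))

    Note that nDCG at cut-off l only looks at the top l documents, which is why the abstract stresses that l must be large for the metric to remain stable, whereas Q-measure, being recall-based, accumulates its blended ratio at every relevant document in the run.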