Search (1 results, page 1 of 1)

  • × author_ss:"Monz, C."
  • × author_ss:"Schwartz, R."
  • × theme_ss:"Automatisches Abstracting"
  • × type_ss:"a"
  1. Hobson, S.P.; Dorr, B.J.; Monz, C.; Schwartz, R.: Task-based evaluation of text summarization using Relevance Prediction (2007) 0.00
    0.0015457221 = product of:
      0.009274333 = sum of:
        0.009274333 = weight(_text_:in in 938) [ClassicSimilarity], result of:
          0.009274333 = score(doc=938,freq=6.0), product of:
            0.059380736 = queryWeight, product of:
              1.3602545 = idf(docFreq=30841, maxDocs=44218)
              0.043654136 = queryNorm
            0.1561842 = fieldWeight in 938, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.3602545 = idf(docFreq=30841, maxDocs=44218)
              0.046875 = fieldNorm(doc=938)
      0.16666667 = coord(1/6)
    
    Abstract
    This article introduces a new task-based evaluation measure called Relevance Prediction that is a more intuitive measure of an individual's performance on a real-world task than interannotator agreement. Relevance Prediction parallels what a user does in the real world task of browsing a set of documents using standard search tools, i.e., the user judges relevance based on a short summary and then that same user - not an independent user - decides whether to open (and judge) the corresponding document. This measure is shown to be a more reliable measure of task performance than LDC Agreement, a current gold-standard based measure used in the summarization evaluation community. Our goal is to provide a stable framework within which developers of new automatic measures may make stronger statistical statements about the effectiveness of their measures in predicting summary usefulness. We demonstrate - as a proof-of-concept methodology for automatic metric developers - that a current automatic evaluation measure has a better correlation with Relevance Prediction than with LDC Agreement and that the significance level for detected differences is higher for the former than for the latter.