Document (#35916)

Author
Westerman, S.J.
Cribbin, T.
Collins, J.
Title
Human assessments of document similarity
Source
Journal of the American Society for Information Science and Technology. 61(2010) no.8, pp. 1535-1542
Year
2010
Abstract
Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views on human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for future development of document visualization systems.
Theme
Indexing studies
Object
n-grams

Similar documents (author)

  1. Collins, B.R.: Beyond cruising : reviewing (1996) 5.14
    5.1444697 = sum of:
      5.1444697 = weight(author_txt:collins in 4739) [ClassicSimilarity], result of:
        5.1444697 = fieldWeight in 4739, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.231152 = idf(docFreq=31, maxDocs=44218)
          0.625 = fieldNorm(doc=4739)
    
  2. Collins, H.M.: ¬A review of Hubert Dreyfus' What computers still can't do (1996) 5.14
    5.1444697 = sum of:
      5.1444697 = weight(author_txt:collins in 6773) [ClassicSimilarity], result of:
        5.1444697 = fieldWeight in 6773, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.231152 = idf(docFreq=31, maxDocs=44218)
          0.625 = fieldNorm(doc=6773)
    
  3. Collins, B.R.: Webwatch (1996) 5.14
    5.1444697 = sum of:
      5.1444697 = weight(author_txt:collins in 6956) [ClassicSimilarity], result of:
        5.1444697 = fieldWeight in 6956, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.231152 = idf(docFreq=31, maxDocs=44218)
          0.625 = fieldNorm(doc=6956)
    
  4. Collins, M.: Leveling the information playing field : Illinois public libraries (1996) 5.14
    5.1444697 = sum of:
      5.1444697 = weight(author_txt:collins in 7318) [ClassicSimilarity], result of:
        5.1444697 = fieldWeight in 7318, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.231152 = idf(docFreq=31, maxDocs=44218)
          0.625 = fieldNorm(doc=7318)
    
  5. Collins, B.R.: Webwatch (1997) 5.14
    5.1444697 = sum of:
      5.1444697 = weight(author_txt:collins in 172) [ClassicSimilarity], result of:
        5.1444697 = fieldWeight in 172, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.231152 = idf(docFreq=31, maxDocs=44218)
          0.625 = fieldNorm(doc=172)
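
The author matches above all share the score 5.1444697 because each is a single-term fieldWeight with identical inputs. The following is a minimal sketch of that arithmetic, assuming the standard Lucene ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are hypothetical and are not part of any library. The fieldNorm value is copied from the listing rather than derived.

```python
import math

def classic_tf(freq: float) -> float:
    # ClassicSimilarity term frequency factor: square root of the raw count.
    return math.sqrt(freq)

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # ClassicSimilarity inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as in the explain trees above.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Values taken from the "author_txt:collins" entries in the listing.
idf = classic_idf(doc_freq=31, max_docs=44218)
score = field_weight(freq=1.0, doc_freq=31, max_docs=44218, field_norm=0.625)

print(f"idf   = {idf:.6f}")    # listing reports 8.231152
print(f"score = {score:.6f}")  # listing reports 5.1444697
```

Lucene computes in 32-bit floats, so the last digits may differ slightly from the values printed here.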
    

Similar documents (content)

  1. Ekmekcioglu, F.C.; Lynch, M.F.; Willett, P.: Development and evaluation of conflation techniques for the implementation of a document retrieval system for Turkish text databases (1995) 0.17
    0.16772372 = sum of:
      0.16772372 = product of:
        0.69884884 = sum of:
          0.07341664 = weight(abstract_txt:string in 5797) [ClassicSimilarity], result of:
            0.07341664 = score(doc=5797,freq=1.0), product of:
              0.10914287 = queryWeight, product of:
                1.0720397 = boost
                7.1750984 = idf(docFreq=91, maxDocs=44218)
                0.014189159 = queryNorm
              0.6726655 = fieldWeight in 5797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.1750984 = idf(docFreq=91, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.03717461 = weight(abstract_txt:text in 5797) [ClassicSimilarity], result of:
            0.03717461 = score(doc=5797,freq=2.0), product of:
              0.06933672 = queryWeight, product of:
                1.2083975 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014189159 = queryNorm
              0.53614604 = fieldWeight in 5797, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.010576145 = weight(abstract_txt:that in 5797) [ClassicSimilarity], result of:
            0.010576145 = score(doc=5797,freq=1.0), product of:
              0.047610637 = queryWeight, product of:
                1.4161041 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.014189159 = queryNorm
              0.22213829 = fieldWeight in 5797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.06288321 = weight(abstract_txt:document in 5797) [ClassicSimilarity], result of:
            0.06288321 = score(doc=5797,freq=1.0), product of:
              0.15625797 = queryWeight, product of:
                2.5654542 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014189159 = queryNorm
              0.40243202 = fieldWeight in 5797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.11749278 = weight(abstract_txt:similarity in 5797) [ClassicSimilarity], result of:
            0.11749278 = score(doc=5797,freq=1.0), product of:
              0.21536754 = queryWeight, product of:
                2.6083384 = boost
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.014189159 = queryNorm
              0.54554546 = fieldWeight in 5797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
          0.39730546 = weight(abstract_txt:gram in 5797) [ClassicSimilarity], result of:
            0.39730546 = score(doc=5797,freq=1.0), product of:
              0.5340338 = queryWeight, product of:
                4.7427206 = boost
                7.935687 = idf(docFreq=42, maxDocs=44218)
                0.014189159 = queryNorm
              0.74397063 = fieldWeight in 5797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.935687 = idf(docFreq=42, maxDocs=44218)
                0.09375 = fieldNorm(doc=5797)
        0.24 = coord(6/25)
    
  2. Losee, R.M.: Upper bounds for retrieval performance and their use measuring performance and generating optimal queries : can it get any better than this? (1994) 0.13
    0.1331322 = sum of:
      0.1331322 = product of:
        0.47547218 = sum of:
          0.086008474 = weight(abstract_txt:optimal in 7418) [ClassicSimilarity], result of:
            0.086008474 = score(doc=7418,freq=3.0), product of:
              0.094967194 = queryWeight, product of:
                6.6929407 = idf(docFreq=148, maxDocs=44218)
                0.014189159 = queryNorm
              0.9056651 = fieldWeight in 7418, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.6929407 = idf(docFreq=148, maxDocs=44218)
                0.078125 = fieldNorm(doc=7418)
          0.021905348 = weight(abstract_txt:text in 7418) [ClassicSimilarity], result of:
            0.021905348 = score(doc=7418,freq=1.0), product of:
              0.06933672 = queryWeight, product of:
                1.2083975 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014189159 = queryNorm
              0.3159271 = fieldWeight in 7418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=7418)
          0.10239295 = weight(abstract_txt:optimum in 7418) [ClassicSimilarity], result of:
            0.10239295 = score(doc=7418,freq=1.0), product of:
              0.15385085 = queryWeight, product of:
                1.2728087 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.014189159 = queryNorm
              0.66553384 = fieldWeight in 7418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.078125 = fieldNorm(doc=7418)
          0.01762691 = weight(abstract_txt:that in 7418) [ClassicSimilarity], result of:
            0.01762691 = score(doc=7418,freq=4.0), product of:
              0.047610637 = queryWeight, product of:
                1.4161041 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.014189159 = queryNorm
              0.3702305 = fieldWeight in 7418, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=7418)
          0.09256801 = weight(abstract_txt:length in 7418) [ClassicSimilarity], result of:
            0.09256801 = score(doc=7418,freq=1.0), product of:
              0.18123294 = queryWeight, product of:
                1.95365 = boost
                6.537832 = idf(docFreq=173, maxDocs=44218)
                0.014189159 = queryNorm
              0.5107681 = fieldWeight in 7418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.537832 = idf(docFreq=173, maxDocs=44218)
                0.078125 = fieldNorm(doc=7418)
          0.052402675 = weight(abstract_txt:document in 7418) [ClassicSimilarity], result of:
            0.052402675 = score(doc=7418,freq=1.0), product of:
              0.15625797 = queryWeight, product of:
                2.5654542 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014189159 = queryNorm
              0.33536002 = fieldWeight in 7418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=7418)
          0.10256783 = weight(abstract_txt:average in 7418) [ClassicSimilarity], result of:
            0.10256783 = score(doc=7418,freq=1.0), product of:
              0.2221439 = queryWeight, product of:
                2.6490552 = boost
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.014189159 = queryNorm
              0.46171796 = fieldWeight in 7418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.078125 = fieldNorm(doc=7418)
        0.28 = coord(7/25)
    
  3. Ravana, S.D.; Rajagopal, P.; Balakrishnan, V.: Ranking retrieval systems using pseudo relevance judgments (2015) 0.12
    0.11528399 = sum of:
      0.11528399 = product of:
        0.48034996 = sum of:
          0.017524278 = weight(abstract_txt:text in 2591) [ClassicSimilarity], result of:
            0.017524278 = score(doc=2591,freq=1.0), product of:
              0.06933672 = queryWeight, product of:
                1.2083975 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014189159 = queryNorm
              0.25274166 = fieldWeight in 2591, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=2591)
          0.01221228 = weight(abstract_txt:that in 2591) [ClassicSimilarity], result of:
            0.01221228 = score(doc=2591,freq=3.0), product of:
              0.047610637 = queryWeight, product of:
                1.4161041 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.014189159 = queryNorm
              0.2565032 = fieldWeight in 2591, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=2591)
          0.016513644 = weight(abstract_txt:between in 2591) [ClassicSimilarity], result of:
            0.016513644 = score(doc=2591,freq=1.0), product of:
              0.07628906 = queryWeight, product of:
                1.5524046 = boost
                3.4633842 = idf(docFreq=3764, maxDocs=44218)
                0.014189159 = queryNorm
              0.21646151 = fieldWeight in 2591, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4633842 = idf(docFreq=3764, maxDocs=44218)
                0.0625 = fieldNorm(doc=2591)
          0.08384428 = weight(abstract_txt:document in 2591) [ClassicSimilarity], result of:
            0.08384428 = score(doc=2591,freq=4.0), product of:
              0.15625797 = queryWeight, product of:
                2.5654542 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014189159 = queryNorm
              0.53657603 = fieldWeight in 2591, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=2591)
          0.082054265 = weight(abstract_txt:average in 2591) [ClassicSimilarity], result of:
            0.082054265 = score(doc=2591,freq=1.0), product of:
              0.2221439 = queryWeight, product of:
                2.6490552 = boost
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.014189159 = queryNorm
              0.36937436 = fieldWeight in 2591, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.0625 = fieldNorm(doc=2591)
          0.2682012 = weight(abstract_txt:human in 2591) [ClassicSimilarity], result of:
            0.2682012 = score(doc=2591,freq=6.0), product of:
              0.37337616 = queryWeight, product of:
                5.608303 = boost
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.014189159 = queryNorm
              0.7183137 = fieldWeight in 2591, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.0625 = fieldNorm(doc=2591)
        0.24 = coord(6/25)
    
  4. Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: ¬A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 0.11
    0.113730706 = sum of:
      0.113730706 = product of:
        0.4061811 = sum of:
          0.03097884 = weight(abstract_txt:text in 1998) [ClassicSimilarity], result of:
            0.03097884 = score(doc=1998,freq=2.0), product of:
              0.06933672 = queryWeight, product of:
                1.2083975 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014189159 = queryNorm
              0.44678837 = fieldWeight in 1998, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=1998)
          0.025752578 = weight(abstract_txt:studies in 1998) [ClassicSimilarity], result of:
            0.025752578 = score(doc=1998,freq=1.0), product of:
              0.07723432 = queryWeight, product of:
                1.2753617 = boost
                4.26796 = idf(docFreq=1683, maxDocs=44218)
                0.014189159 = queryNorm
              0.33343437 = fieldWeight in 1998, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.26796 = idf(docFreq=1683, maxDocs=44218)
                0.078125 = fieldNorm(doc=1998)
          0.015265351 = weight(abstract_txt:that in 1998) [ClassicSimilarity], result of:
            0.015265351 = score(doc=1998,freq=3.0), product of:
              0.047610637 = queryWeight, product of:
                1.4161041 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.014189159 = queryNorm
              0.320629 = fieldWeight in 1998, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=1998)
          0.020642057 = weight(abstract_txt:between in 1998) [ClassicSimilarity], result of:
            0.020642057 = score(doc=1998,freq=1.0), product of:
              0.07628906 = queryWeight, product of:
                1.5524046 = boost
                3.4633842 = idf(docFreq=3764, maxDocs=44218)
                0.014189159 = queryNorm
              0.2705769 = fieldWeight in 1998, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4633842 = idf(docFreq=3764, maxDocs=44218)
                0.078125 = fieldNorm(doc=1998)
          0.07410858 = weight(abstract_txt:document in 1998) [ClassicSimilarity], result of:
            0.07410858 = score(doc=1998,freq=2.0), product of:
              0.15625797 = queryWeight, product of:
                2.5654542 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014189159 = queryNorm
              0.4742707 = fieldWeight in 1998, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=1998)
          0.10256783 = weight(abstract_txt:average in 1998) [ClassicSimilarity], result of:
            0.10256783 = score(doc=1998,freq=1.0), product of:
              0.2221439 = queryWeight, product of:
                2.6490552 = boost
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.014189159 = queryNorm
              0.46171796 = fieldWeight in 1998, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.078125 = fieldNorm(doc=1998)
          0.13686585 = weight(abstract_txt:human in 1998) [ClassicSimilarity], result of:
            0.13686585 = score(doc=1998,freq=1.0), product of:
              0.37337616 = queryWeight, product of:
                5.608303 = boost
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.014189159 = queryNorm
              0.3665629 = fieldWeight in 1998, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.078125 = fieldNorm(doc=1998)
        0.28 = coord(7/25)
    
  5. Blustein, J.; Webber, R.E.; Tague-Sutcliffe, J.: Methods for evaluating the quality of hypertext links (1997) 0.11
    0.10978818 = sum of:
      0.10978818 = product of:
        0.5489409 = sum of:
          0.080304846 = weight(abstract_txt:correlations in 152) [ClassicSimilarity], result of:
            0.080304846 = score(doc=152,freq=1.0), product of:
              0.115867116 = queryWeight, product of:
                1.1045702 = boost
                7.3928223 = idf(docFreq=73, maxDocs=44218)
                0.014189159 = queryNorm
              0.6930771 = fieldWeight in 152, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.3928223 = idf(docFreq=73, maxDocs=44218)
                0.09375 = fieldNorm(doc=152)
          0.1134667 = weight(abstract_txt:approximation in 152) [ClassicSimilarity], result of:
            0.1134667 = score(doc=152,freq=1.0), product of:
              0.14589642 = queryWeight, product of:
                1.2394686 = boost
                8.29569 = idf(docFreq=29, maxDocs=44218)
                0.014189159 = queryNorm
              0.7777209 = fieldWeight in 152, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.29569 = idf(docFreq=29, maxDocs=44218)
                0.09375 = fieldNorm(doc=152)
          0.024770467 = weight(abstract_txt:between in 152) [ClassicSimilarity], result of:
            0.024770467 = score(doc=152,freq=1.0), product of:
              0.07628906 = queryWeight, product of:
                1.5524046 = boost
                3.4633842 = idf(docFreq=3764, maxDocs=44218)
                0.014189159 = queryNorm
              0.32469225 = fieldWeight in 152, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4633842 = idf(docFreq=3764, maxDocs=44218)
                0.09375 = fieldNorm(doc=152)
          0.16615988 = weight(abstract_txt:similarity in 152) [ClassicSimilarity], result of:
            0.16615988 = score(doc=152,freq=2.0), product of:
              0.21536754 = queryWeight, product of:
                2.6083384 = boost
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.014189159 = queryNorm
              0.77151775 = fieldWeight in 152, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.09375 = fieldNorm(doc=152)
          0.16423902 = weight(abstract_txt:human in 152) [ClassicSimilarity], result of:
            0.16423902 = score(doc=152,freq=1.0), product of:
              0.37337616 = queryWeight, product of:
                5.608303 = boost
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.014189159 = queryNorm
              0.43987548 = fieldWeight in 152, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.09375 = fieldNorm(doc=152)
        0.2 = coord(5/25)
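
The content scores above combine several term scores and then scale the sum by a coordination factor. As a rough sketch, the last entry (doc=152) can be recomputed as follows, assuming each term score is queryWeight * fieldWeight with queryWeight = boost * idf * queryNorm and fieldWeight = sqrt(freq) * idf * fieldNorm. The boosts, queryNorm, fieldNorm and coord values are copied from the listing, not derived. The helper names are hypothetical.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.014189159  # copied from the listing
FIELD_NORM = 0.09375      # fieldNorm(doc=152), copied from the listing

def classic_idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    # ClassicSimilarity inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(boost: float, doc_freq: int, freq: float) -> float:
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM           # query side
    field_weight = math.sqrt(freq) * idf * FIELD_NORM  # document side
    return query_weight * field_weight

# (term, boost, docFreq, freq) for the five query terms matched in doc=152.
matched_terms = [
    ("correlations",  1.1045702,   73, 1.0),
    ("approximation", 1.2394686,   29, 1.0),
    ("between",       1.5524046, 3764, 1.0),
    ("similarity",    2.6083384,  356, 2.0),
    ("human",         5.608303,  1101, 1.0),
]

term_sum = sum(term_score(b, df, f) for _, b, df, f in matched_terms)
coord = len(matched_terms) / 25  # 5 of the 25 query terms matched
total = term_sum * coord

print(f"sum of term scores = {term_sum:.6f}")  # listing reports 0.5489409
print(f"final score        = {total:.6f}")     # listing reports 0.10978818
```

The coord factor explains why a document matching few query terms can rank below one with weaker individual term matches: here the sum of about 0.55 is scaled down to about 0.11 because only 5 of the 25 query terms matched.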