Document (#30161)

Author
Hoad, T.C.
Zobel, J.
Title
Methods for identifying versioned and plagiarized documents
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.3, pp.203-215
Year
2003
Abstract
Hoad and Zobel use the term co-derivatives for documents that originate from the same source, whether as versions or as plagiarisms. Co-derivatives are normally identified in one of two ways: by fingerprinting, a technique that uses hashing to generate surrogates, in the form of integer strings derived from substrings of the text, for comparison purposes; or by ranking with a similarity measure, as in information retrieval. For use in ranking, Hoad and Zobel derive several variants of what they term an identity measure, which rewards documents with similar numbers of occurrences of words and penalizes those with dissimilar numbers. They then review fingerprinting strategies and characterize them by four properties: granularity, i.e. the size of the substrings used; the character of the hashing function; resolution, i.e. the size of the document fingerprint; and the substring selection strategy. Their experiments use two measures: highest false match (HFM), the highest percentage score given to an incorrect result, and separation, the difference between the lowest correct result and HFM. The measures were applied to two collections, one of 3,300 documents and the other of 80,000 with 53 query documents. The new identity measure outperforms the alternatives. Only one fingerprinting strategy, the anchor strategy, identified all of the documents that human judges had identified as similar. The key parameter in fingerprinting appears to be granularity, with three to five words producing the best results.
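
The fingerprinting parameters and the two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the MD5-based hash, the "keep the smallest hash values" selection strategy, the function names and the toy collection are all assumptions made for illustration, and the anchor strategy is not reproduced here.

```python
import hashlib
import re


def chunks(text, granularity):
    """Return every run of `granularity` consecutive words in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + granularity])
            for i in range(len(words) - granularity + 1)]


def fingerprint(text, granularity=3, resolution=None):
    """Hash each chunk to an integer and keep at most `resolution` hashes."""
    hashes = {int(hashlib.md5(c.encode("utf-8")).hexdigest()[:8], 16)
              for c in chunks(text, granularity)}
    if resolution is None:
        return hashes  # keep every hash
    # Assumed selection strategy: keep the numerically smallest hashes.
    return set(sorted(hashes)[:resolution])


def match_percentage(query_fp, doc_fp):
    """Percentage of the query's hashes that also occur in the document."""
    if not query_fp:
        return 0.0
    return 100.0 * len(query_fp & doc_fp) / len(query_fp)


def highest_false_match(scores, correct):
    """HFM: highest score given to a document that is not a correct result."""
    wrong = [s for doc, s in scores.items() if doc not in correct]
    return max(wrong) if wrong else 0.0


def separation(scores, correct):
    """Lowest score of a correct result minus HFM.

    Assumes every correct document has an entry in `scores`.
    """
    lowest_correct = min(scores[doc] for doc in correct)
    return lowest_correct - highest_false_match(scores, correct)


# Invented toy data: "v1" is a revision of the query, "x1" is unrelated.
query = "the quick brown fox jumps over the lazy dog"
collection = {
    "v1": "the quick brown fox leaps over the lazy dog",
    "x1": "hashing maps keys of any length to fixed size integers",
}
query_fp = fingerprint(query)
scores = {name: match_percentage(query_fp, fingerprint(text))
          for name, text in collection.items()}
hfm = highest_false_match(scores, correct={"v1"})
sep = separation(scores, correct={"v1"})
```

A larger separation means the correct results stand further above the best-scoring incorrect one; rerunning the sketch with a different `granularity` shows how the chunk size changes both figures.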

Similar documents (author)

  1. Kaszkiel, M.; Zobel, J.: Effective ranking with arbitrary passages (2001) 4.69
    4.68613 = sum of:
      4.68613 = weight(author_txt:zobel in 765) [ClassicSimilarity], result of:
        4.68613 = fieldWeight in 765, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.37226 = idf(docFreq=9, maxDocs=43254)
          0.5 = fieldNorm(doc=765)
    
  2. Heinz, S.; Zobel, J.: Efficient single-pass index construction for text databases (2003) 4.69
    4.68613 = sum of:
      4.68613 = weight(author_txt:zobel in 3679) [ClassicSimilarity], result of:
        4.68613 = fieldWeight in 3679, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.37226 = idf(docFreq=9, maxDocs=43254)
          0.5 = fieldNorm(doc=3679)
    
  3. Uitdenbogerd, A.L.; Zobel, J.: An architecture for effective music information retrieval (2004) 4.69
    4.68613 = sum of:
      4.68613 = weight(author_txt:zobel in 5056) [ClassicSimilarity], result of:
        4.68613 = fieldWeight in 5056, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.37226 = idf(docFreq=9, maxDocs=43254)
          0.5 = fieldNorm(doc=5056)
    
  4. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.69
    4.68613 = sum of:
      4.68613 = weight(author_txt:zobel in 2010) [ClassicSimilarity], result of:
        4.68613 = fieldWeight in 2010, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.37226 = idf(docFreq=9, maxDocs=43254)
          0.5 = fieldNorm(doc=2010)
    
  5. Hawking, D.; Zobel, J.: Does topic metadata help with Web search? (2007) 4.69
    4.68613 = sum of:
      4.68613 = weight(author_txt:zobel in 2205) [ClassicSimilarity], result of:
        4.68613 = fieldWeight in 2205, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.37226 = idf(docFreq=9, maxDocs=43254)
          0.5 = fieldNorm(doc=2205)
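
The author-match weights above can be recomputed from their parts. The Python sketch below assumes that tf is the square root of the term frequency and that idf is 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the figures in the listing; fieldNorm is taken as given from the index rather than derived, and the function names are illustrative only.

```python
import math


def classic_idf(doc_freq, max_docs):
    """idf as it appears in the listing: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq):
    """tf is the square root of the raw term frequency."""
    return math.sqrt(freq)


def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm (fieldNorm read from the index)."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the author_txt:zobel entries above.
weight = field_weight(freq=1.0, doc_freq=9, max_docs=43254, field_norm=0.5)
print(round(weight, 5))
```

All five author matches share the same termFreq, docFreq and fieldNorm, which is why they receive identical scores.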
    

Similar documents (content)

  1. Wartik, S.; Fox, E.; Heath, L.; Chen, Q.-F.: Hashing algorithms (1992) 0.12
    0.11696816 = sum of:
      0.11696816 = product of:
        0.9747347 = sum of:
          0.017130986 = weight(abstract_txt:with in 5511) [ClassicSimilarity], result of:
            0.017130986 = score(doc=5511,freq=1.0), product of:
              0.054586545 = queryWeight, product of:
                1.1931702 = boost
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.018222017 = queryNorm
              0.31383166 = fieldWeight in 5511, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.125 = fieldNorm(doc=5511)
          0.09432968 = weight(abstract_txt:technique in 5511) [ClassicSimilarity], result of:
            0.09432968 = score(doc=5511,freq=1.0), product of:
              0.13509925 = queryWeight, product of:
                1.327306 = boost
                5.5858 = idf(docFreq=440, maxDocs=43254)
                0.018222017 = queryNorm
              0.698225 = fieldWeight in 5511, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5858 = idf(docFreq=440, maxDocs=43254)
                0.125 = fieldNorm(doc=5511)
          0.86327404 = weight(abstract_txt:hashing in 5511) [ClassicSimilarity], result of:
            0.86327404 = score(doc=5511,freq=3.0), product of:
              0.4098385 = queryWeight, product of:
                2.3118038 = boost
                9.728935 = idf(docFreq=6, maxDocs=43254)
                0.018222017 = queryNorm
              2.1063762 = fieldWeight in 5511, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.728935 = idf(docFreq=6, maxDocs=43254)
                0.125 = fieldNorm(doc=5511)
        0.12 = coord(3/25)
    
  2. Lihui, C.; Lian, C.W.: Using Web structure and summarisation techniques for Web content mining (2005) 0.10
    0.09863173 = sum of:
      0.09863173 = product of:
        0.35225618 = sum of:
          0.012113436 = weight(abstract_txt:with in 3047) [ClassicSimilarity], result of:
            0.012113436 = score(doc=3047,freq=2.0), product of:
              0.054586545 = queryWeight, product of:
                1.1931702 = boost
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.018222017 = queryNorm
              0.22191249 = fieldWeight in 3047, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.0625 = fieldNorm(doc=3047)
          0.03625088 = weight(abstract_txt:result in 3047) [ClassicSimilarity], result of:
            0.03625088 = score(doc=3047,freq=1.0), product of:
              0.11335824 = queryWeight, product of:
                1.2158252 = boost
                5.1166472 = idf(docFreq=704, maxDocs=43254)
                0.018222017 = queryNorm
              0.31979045 = fieldWeight in 3047, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1166472 = idf(docFreq=704, maxDocs=43254)
                0.0625 = fieldNorm(doc=3047)
          0.03887166 = weight(abstract_txt:similar in 3047) [ClassicSimilarity], result of:
            0.03887166 = score(doc=3047,freq=1.0), product of:
              0.11875797 = queryWeight, product of:
                1.2444458 = boost
                5.2370934 = idf(docFreq=624, maxDocs=43254)
                0.018222017 = queryNorm
              0.32731834 = fieldWeight in 3047, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2370934 = idf(docFreq=624, maxDocs=43254)
                0.0625 = fieldNorm(doc=3047)
          0.08169189 = weight(abstract_txt:technique in 3047) [ClassicSimilarity], result of:
            0.08169189 = score(doc=3047,freq=3.0), product of:
              0.13509925 = queryWeight, product of:
                1.327306 = boost
                5.5858 = idf(docFreq=440, maxDocs=43254)
                0.018222017 = queryNorm
              0.6046806 = fieldWeight in 3047, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.5858 = idf(docFreq=440, maxDocs=43254)
                0.0625 = fieldNorm(doc=3047)
          0.0475713 = weight(abstract_txt:ranking in 3047) [ClassicSimilarity], result of:
            0.0475713 = score(doc=3047,freq=1.0), product of:
              0.13587432 = queryWeight, product of:
                1.331108 = boost
                5.6018004 = idf(docFreq=433, maxDocs=43254)
                0.018222017 = queryNorm
              0.35011253 = fieldWeight in 3047, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6018004 = idf(docFreq=433, maxDocs=43254)
                0.0625 = fieldNorm(doc=3047)
          0.055676203 = weight(abstract_txt:size in 3047) [ClassicSimilarity], result of:
            0.055676203 = score(doc=3047,freq=1.0), product of:
              0.15089926 = queryWeight, product of:
                1.4027754 = boost
                5.9034038 = idf(docFreq=320, maxDocs=43254)
                0.018222017 = queryNorm
              0.36896273 = fieldWeight in 3047, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9034038 = idf(docFreq=320, maxDocs=43254)
                0.0625 = fieldNorm(doc=3047)
          0.0800808 = weight(abstract_txt:documents in 3047) [ClassicSimilarity], result of:
            0.0800808 = score(doc=3047,freq=2.0), product of:
              0.22010168 = queryWeight, product of:
                2.9343839 = boost
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.018222017 = queryNorm
              0.36383545 = fieldWeight in 3047, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.0625 = fieldNorm(doc=3047)
        0.28 = coord(7/25)
    
  3. Ku, L.-W.; Chen, H.-H.: Mining opinions from the Web : beyond relevance retrieval (2007) 0.09
    0.08768705 = sum of:
      0.08768705 = product of:
        0.43843526 = sum of:
          0.012113436 = weight(abstract_txt:with in 2606) [ClassicSimilarity], result of:
            0.012113436 = score(doc=2606,freq=2.0), product of:
              0.054586545 = queryWeight, product of:
                1.1931702 = boost
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.018222017 = queryNorm
              0.22191249 = fieldWeight in 2606, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.0625 = fieldNorm(doc=2606)
          0.08307034 = weight(abstract_txt:words in 2606) [ClassicSimilarity], result of:
            0.08307034 = score(doc=2606,freq=4.0), product of:
              0.12412274 = queryWeight, product of:
                1.2722436 = boost
                5.354077 = idf(docFreq=555, maxDocs=43254)
                0.018222017 = queryNorm
              0.6692596 = fieldWeight in 2606, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.354077 = idf(docFreq=555, maxDocs=43254)
                0.0625 = fieldNorm(doc=2606)
          0.12468714 = weight(abstract_txt:granularity in 2606) [ClassicSimilarity], result of:
            0.12468714 = score(doc=2606,freq=1.0), product of:
              0.25829846 = queryWeight, product of:
                1.8352934 = boost
                7.7236013 = idf(docFreq=51, maxDocs=43254)
                0.018222017 = queryNorm
              0.48272508 = fieldWeight in 2606, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7236013 = idf(docFreq=51, maxDocs=43254)
                0.0625 = fieldNorm(doc=2606)
          0.09194551 = weight(abstract_txt:measure in 2606) [ClassicSimilarity], result of:
            0.09194551 = score(doc=2606,freq=2.0), product of:
              0.19154969 = queryWeight, product of:
                1.9356685 = boost
                5.430678 = idf(docFreq=514, maxDocs=43254)
                0.018222017 = queryNorm
              0.48000863 = fieldWeight in 2606, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.430678 = idf(docFreq=514, maxDocs=43254)
                0.0625 = fieldNorm(doc=2606)
          0.12661885 = weight(abstract_txt:documents in 2606) [ClassicSimilarity], result of:
            0.12661885 = score(doc=2606,freq=5.0), product of:
              0.22010168 = queryWeight, product of:
                2.9343839 = boost
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.018222017 = queryNorm
              0.57527435 = fieldWeight in 2606, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.0625 = fieldNorm(doc=2606)
        0.2 = coord(5/25)
    
  4. Fricke, M.: Measuring recall (1998) 0.09
    0.08562654 = sum of:
      0.08562654 = product of:
        0.4281327 = sum of:
          0.014989614 = weight(abstract_txt:with in 5803) [ClassicSimilarity], result of:
            0.014989614 = score(doc=5803,freq=1.0), product of:
              0.054586545 = queryWeight, product of:
                1.1931702 = boost
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.018222017 = queryNorm
              0.2746027 = fieldWeight in 5803, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.109375 = fieldNorm(doc=5803)
          0.08253846 = weight(abstract_txt:technique in 5803) [ClassicSimilarity], result of:
            0.08253846 = score(doc=5803,freq=1.0), product of:
              0.13509925 = queryWeight, product of:
                1.327306 = boost
                5.5858 = idf(docFreq=440, maxDocs=43254)
                0.018222017 = queryNorm
              0.6109469 = fieldWeight in 5803, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5858 = idf(docFreq=440, maxDocs=43254)
                0.109375 = fieldNorm(doc=5803)
          0.11773296 = weight(abstract_txt:ranking in 5803) [ClassicSimilarity], result of:
            0.11773296 = score(doc=5803,freq=2.0), product of:
              0.13587432 = queryWeight, product of:
                1.331108 = boost
                5.6018004 = idf(docFreq=433, maxDocs=43254)
                0.018222017 = queryNorm
              0.8664843 = fieldWeight in 5803, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.6018004 = idf(docFreq=433, maxDocs=43254)
                0.109375 = fieldNorm(doc=5803)
          0.11377676 = weight(abstract_txt:measure in 5803) [ClassicSimilarity], result of:
            0.11377676 = score(doc=5803,freq=1.0), product of:
              0.19154969 = queryWeight, product of:
                1.9356685 = boost
                5.430678 = idf(docFreq=514, maxDocs=43254)
                0.018222017 = queryNorm
              0.5939804 = fieldWeight in 5803, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.430678 = idf(docFreq=514, maxDocs=43254)
                0.109375 = fieldNorm(doc=5803)
          0.09909493 = weight(abstract_txt:documents in 5803) [ClassicSimilarity], result of:
            0.09909493 = score(doc=5803,freq=1.0), product of:
              0.22010168 = queryWeight, product of:
                2.9343839 = boost
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.018222017 = queryNorm
              0.4502234 = fieldWeight in 5803, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.109375 = fieldNorm(doc=5803)
        0.2 = coord(5/25)
    
  5. Savoy, J.: Text representation strategies : an example with the State of the union addresses (2016) 0.08
    0.08426356 = sum of:
      0.08426356 = product of:
        0.35109818 = sum of:
          0.06058719 = weight(abstract_txt:term in 4507) [ClassicSimilarity], result of:
            0.06058719 = score(doc=4507,freq=4.0), product of:
              0.10057142 = queryWeight, product of:
                1.1452014 = boost
                4.819436 = idf(docFreq=948, maxDocs=43254)
                0.018222017 = queryNorm
              0.6024295 = fieldWeight in 4507, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.819436 = idf(docFreq=948, maxDocs=43254)
                0.0625 = fieldNorm(doc=4507)
          0.012113436 = weight(abstract_txt:with in 4507) [ClassicSimilarity], result of:
            0.012113436 = score(doc=4507,freq=2.0), product of:
              0.054586545 = queryWeight, product of:
                1.1931702 = boost
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.018222017 = queryNorm
              0.22191249 = fieldWeight in 4507, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.5106533 = idf(docFreq=9548, maxDocs=43254)
                0.0625 = fieldNorm(doc=4507)
          0.04153517 = weight(abstract_txt:words in 4507) [ClassicSimilarity], result of:
            0.04153517 = score(doc=4507,freq=1.0), product of:
              0.12412274 = queryWeight, product of:
                1.2722436 = boost
                5.354077 = idf(docFreq=555, maxDocs=43254)
                0.018222017 = queryNorm
              0.3346298 = fieldWeight in 4507, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.354077 = idf(docFreq=555, maxDocs=43254)
                0.0625 = fieldNorm(doc=4507)
          0.04716484 = weight(abstract_txt:technique in 4507) [ClassicSimilarity], result of:
            0.04716484 = score(doc=4507,freq=1.0), product of:
              0.13509925 = queryWeight, product of:
                1.327306 = boost
                5.5858 = idf(docFreq=440, maxDocs=43254)
                0.018222017 = queryNorm
              0.3491125 = fieldWeight in 4507, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5858 = idf(docFreq=440, maxDocs=43254)
                0.0625 = fieldNorm(doc=4507)
          0.11260979 = weight(abstract_txt:measure in 4507) [ClassicSimilarity], result of:
            0.11260979 = score(doc=4507,freq=3.0), product of:
              0.19154969 = queryWeight, product of:
                1.9356685 = boost
                5.430678 = idf(docFreq=514, maxDocs=43254)
                0.018222017 = queryNorm
              0.5878881 = fieldWeight in 4507, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.430678 = idf(docFreq=514, maxDocs=43254)
                0.0625 = fieldNorm(doc=4507)
          0.077087745 = weight(abstract_txt:strategy in 4507) [ClassicSimilarity], result of:
            0.077087745 = score(doc=4507,freq=1.0), product of:
              0.21458268 = queryWeight, product of:
                2.0487435 = boost
                5.747919 = idf(docFreq=374, maxDocs=43254)
                0.018222017 = queryNorm
              0.35924494 = fieldWeight in 4507, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.747919 = idf(docFreq=374, maxDocs=43254)
                0.0625 = fieldNorm(doc=4507)
        0.24 = coord(6/25)
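
The content scores combine the same parts with a query-side weight and a coordination factor. As a hedged sketch, the Python below recomputes the first content match (Wartik et al., doc 5511) from the values printed in its explain tree; the function names are illustrative, and the boost, idf, queryNorm and fieldNorm figures are copied from the listing rather than derived.

```python
import math


def term_score(boost, idf, query_norm, freq, field_norm):
    """score = queryWeight * fieldWeight for one matching query term."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


def document_score(term_scores, matched, total_terms):
    """Sum the per-term scores and scale by coord = matched / total_terms."""
    return sum(term_scores) * (matched / total_terms)


QUERY_NORM = 0.018222017
# (boost, idf, freq) for "with", "technique" and "hashing" in doc 5511.
terms = [
    (1.1931702, 2.5106533, 1.0),
    (1.327306, 5.5858, 1.0),
    (2.3118038, 9.728935, 3.0),
]
scores = [term_score(b, idf, QUERY_NORM, f, field_norm=0.125)
          for b, idf, f in terms]
total = document_score(scores, matched=len(scores), total_terms=25)
```

Because coord is the fraction of the 25 query terms that a document matches, a document matching only three terms, as here, keeps just 12% of its summed term scores.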