Document (#30160)

Author
Hoad, T.C.
Zobel, J.
Title
Methods for identifying versioned and plagiarized documents
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.3, pp.203-215
Year
2003
Abstract
Hoad and Zobel use the term co-derivatives for documents that originate from the same source, whether as versions or as plagiarisms. Co-derivatives are normally identified in one of two ways: by fingerprinting, a technique that uses hashing to generate surrogates in the form of integer strings derived from substrings of the text, which are then compared; or by ranking with a similarity measure, as in information retrieval. For use in ranking, Hoad and Zobel derive several variants of what they term an identity measure, which rewards documents with similar numbers of occurrences of words and penalizes those with dissimilar numbers. They then review fingerprinting strategies and characterize them by four properties: granularity (the size of the substrings used), the character of the hashing function, resolution (the size of the document fingerprint), and the substring selection strategy. The experiments used two measures: highest false match (HFM), the highest percentage score given to an incorrect result, and separation, the difference between the lowest correct result and HFM. These were applied to two collections, one of 3,300 documents and the other of 80,000 with 53 query documents. The new identity measure outperforms the alternatives. Only one fingerprinting strategy, the anchor strategy, identified all the documents that humans had identified as similar. The key parameter in fingerprinting appears to be granularity, with three to five words producing the best results.
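
To make the terms in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes word n-grams as the substrings (granularity in words), CRC32 as the hashing function, and "keep the smallest hash values" as the selection strategy; all three are illustrative choices, and the anchor strategy named in the abstract is not reproduced here. The function and parameter names are hypothetical. The second function computes HFM and separation as the abstract defines them, from a made-up table of percentage scores.

```python
import re
import zlib


def fingerprint(text, granularity=3, resolution=8):
    """Return up to `resolution` integer hashes of word n-grams.

    granularity: substring size in words (assumed unit).
    resolution:  maximum number of hashes kept per document.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    chunks = (
        " ".join(words[i:i + granularity])
        for i in range(len(words) - granularity + 1)
    )
    # CRC32 is an assumed stand-in for the hashing function.
    hashes = {zlib.crc32(chunk.encode("utf-8")) for chunk in chunks}
    # Assumed selection strategy: keep the smallest hash values.
    return set(sorted(hashes)[:resolution])


def hfm_and_separation(scores, correct):
    """Compute (HFM, separation) from percentage scores per document.

    HFM is the highest score given to an incorrect result;
    separation is the lowest correct score minus HFM.
    """
    wrong = [s for doc, s in scores.items() if doc not in correct]
    right = [s for doc, s in scores.items() if doc in correct]
    hfm = max(wrong)
    return hfm, min(right) - hfm


# Hypothetical scores for one query: v1 and v2 are true co-derivatives.
scores = {"v1": 92.0, "v2": 75.0, "x": 40.0, "y": 12.0}
print(hfm_and_separation(scores, correct={"v1", "v2"}))  # (40.0, 35.0)
```

A positive separation means every correct document scores above every incorrect one, so a single threshold can divide the two groups.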

Similar documents (author)

  1. Kaszkiel, M.; Zobel, J.: Effective ranking with arbitrary passages (2001) 4.70
    4.697151 = sum of:
      4.697151 = weight(author_txt:zobel in 5764) [ClassicSimilarity], result of:
        4.697151 = fieldWeight in 5764, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.5 = fieldNorm(doc=5764)
    
  2. Heinz, S.; Zobel, J.: Efficient single-pass index construction for text databases (2003) 4.70
    4.697151 = sum of:
      4.697151 = weight(author_txt:zobel in 1678) [ClassicSimilarity], result of:
        4.697151 = fieldWeight in 1678, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.5 = fieldNorm(doc=1678)
    
  3. Uitdenbogerd, A.L.; Zobel, J.: An architecture for effective music information retrieval (2004) 4.70
    4.697151 = sum of:
      4.697151 = weight(author_txt:zobel in 3055) [ClassicSimilarity], result of:
        4.697151 = fieldWeight in 3055, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.5 = fieldNorm(doc=3055)
    
  4. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.70
    4.697151 = sum of:
      4.697151 = weight(author_txt:zobel in 9) [ClassicSimilarity], result of:
        4.697151 = fieldWeight in 9, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.5 = fieldNorm(doc=9)
    
  5. Hawking, D.; Zobel, J.: Does topic metadata help with Web search? (2007) 4.70
    4.697151 = sum of:
      4.697151 = weight(author_txt:zobel in 204) [ClassicSimilarity], result of:
        4.697151 = fieldWeight in 204, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.5 = fieldNorm(doc=204)
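
The author scores above can be checked by hand. The sketch below is an illustration in Python, not the search engine's own code. It assumes the figures follow tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the numbers shown; the function names are hypothetical.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed formula; it reproduces idf(docFreq=9, maxDocs=44218) above.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq):
    # Assumed formula: square root of the raw term frequency.
    return math.sqrt(freq)


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight is the product of tf, idf and fieldNorm.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values taken from any of the five author matches listed above.
score = field_weight(freq=1.0, doc_freq=9, max_docs=44218, field_norm=0.5)
print(round(score, 4))  # approximately 4.6972
```

All five author matches share the same inputs (one occurrence of "zobel", docFreq 9, fieldNorm 0.5), which is why they all receive the same score of 4.70.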
    

Similar documents (content)

  1. Wartik, S.; Fox, E.; Heath, L.; Chen, Q.-F.: Hashing algorithms (1992) 0.12
    0.1177582 = sum of:
      0.1177582 = product of:
        0.98131835 = sum of:
          0.01692083 = weight(abstract_txt:with in 3510) [ClassicSimilarity], result of:
            0.01692083 = score(doc=3510,freq=1.0), product of:
              0.05415243 = queryWeight, product of:
                1.1902004 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.018201372 = queryNorm
              0.31246668 = fieldWeight in 3510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.125 = fieldNorm(doc=3510)
          0.09460506 = weight(abstract_txt:technique in 3510) [ClassicSimilarity], result of:
            0.09460506 = score(doc=3510,freq=1.0), product of:
              0.13539514 = queryWeight, product of:
                1.3307537 = boost
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.018201372 = queryNorm
              0.69873303 = fieldWeight in 3510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.125 = fieldNorm(doc=3510)
          0.86979246 = weight(abstract_txt:hashing in 3510) [ClassicSimilarity], result of:
            0.86979246 = score(doc=3510,freq=3.0), product of:
              0.41199964 = queryWeight, product of:
                2.321371 = boost
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.018201372 = queryNorm
              2.1111486 = fieldWeight in 3510, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.125 = fieldNorm(doc=3510)
        0.12 = coord(3/25)
    
  2. Lihui, C.; Lian, C.W.: Using Web structure and summarisation techniques for Web content mining (2005) 0.10
    0.09845388 = sum of:
      0.09845388 = product of:
        0.351621 = sum of:
          0.011964833 = weight(abstract_txt:with in 1046) [ClassicSimilarity], result of:
            0.011964833 = score(doc=1046,freq=2.0), product of:
              0.05415243 = queryWeight, product of:
                1.1902004 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.018201372 = queryNorm
              0.22094731 = fieldWeight in 1046, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=1046)
          0.035976827 = weight(abstract_txt:result in 1046) [ClassicSimilarity], result of:
            0.035976827 = score(doc=1046,freq=1.0), product of:
              0.112813756 = queryWeight, product of:
                1.2147228 = boost
                5.1024737 = idf(docFreq=730, maxDocs=44218)
                0.018201372 = queryNorm
              0.3189046 = fieldWeight in 1046, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1024737 = idf(docFreq=730, maxDocs=44218)
                0.0625 = fieldNorm(doc=1046)
          0.03831557 = weight(abstract_txt:similar in 1046) [ClassicSimilarity], result of:
            0.03831557 = score(doc=1046,freq=1.0), product of:
              0.11765139 = queryWeight, product of:
                1.240494 = boost
                5.2107263 = idf(docFreq=655, maxDocs=44218)
                0.018201372 = queryNorm
              0.3256704 = fieldWeight in 1046, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2107263 = idf(docFreq=655, maxDocs=44218)
                0.0625 = fieldNorm(doc=1046)
          0.08193038 = weight(abstract_txt:technique in 1046) [ClassicSimilarity], result of:
            0.08193038 = score(doc=1046,freq=3.0), product of:
              0.13539514 = queryWeight, product of:
                1.3307537 = boost
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.018201372 = queryNorm
              0.60512054 = fieldWeight in 1046, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.0625 = fieldNorm(doc=1046)
          0.04753007 = weight(abstract_txt:ranking in 1046) [ClassicSimilarity], result of:
            0.04753007 = score(doc=1046,freq=1.0), product of:
              0.13582899 = queryWeight, product of:
                1.3328841 = boost
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.018201372 = queryNorm
              0.34992582 = fieldWeight in 1046, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.0625 = fieldNorm(doc=1046)
          0.055472896 = weight(abstract_txt:size in 1046) [ClassicSimilarity], result of:
            0.055472896 = score(doc=1046,freq=1.0), product of:
              0.15056847 = queryWeight, product of:
                1.4033408 = boost
                5.8947687 = idf(docFreq=330, maxDocs=44218)
                0.018201372 = queryNorm
              0.36842304 = fieldWeight in 1046, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8947687 = idf(docFreq=330, maxDocs=44218)
                0.0625 = fieldNorm(doc=1046)
          0.08043041 = weight(abstract_txt:documents in 1046) [ClassicSimilarity], result of:
            0.08043041 = score(doc=1046,freq=2.0), product of:
              0.22079578 = queryWeight, product of:
                2.9434195 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.018201372 = queryNorm
              0.36427513 = fieldWeight in 1046, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=1046)
        0.28 = coord(7/25)
    
  3. Ku, L.-W.; Chen, H.-H.: Mining opinions from the Web : beyond relevance retrieval (2007) 0.09
    0.08736806 = sum of:
      0.08736806 = product of:
        0.4368403 = sum of:
          0.011964833 = weight(abstract_txt:with in 605) [ClassicSimilarity], result of:
            0.011964833 = score(doc=605,freq=2.0), product of:
              0.05415243 = queryWeight, product of:
                1.1902004 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.018201372 = queryNorm
              0.22094731 = fieldWeight in 605, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=605)
          0.083081424 = weight(abstract_txt:words in 605) [ClassicSimilarity], result of:
            0.083081424 = score(doc=605,freq=4.0), product of:
              0.12416412 = queryWeight, product of:
                1.274366 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.018201372 = queryNorm
              0.66912585 = fieldWeight in 605, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.0625 = fieldNorm(doc=605)
          0.12227211 = weight(abstract_txt:granularity in 605) [ClassicSimilarity], result of:
            0.12227211 = score(doc=605,freq=1.0), product of:
              0.25501463 = queryWeight, product of:
                1.8263277 = boost
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.018201372 = queryNorm
              0.47947097 = fieldWeight in 605, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.0625 = fieldNorm(doc=605)
          0.09235025 = weight(abstract_txt:measure in 605) [ClassicSimilarity], result of:
            0.09235025 = score(doc=605,freq=2.0), product of:
              0.19215837 = queryWeight, product of:
                1.9416522 = boost
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.018201372 = queryNorm
              0.4805945 = fieldWeight in 605, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.0625 = fieldNorm(doc=605)
          0.12717165 = weight(abstract_txt:documents in 605) [ClassicSimilarity], result of:
            0.12717165 = score(doc=605,freq=5.0), product of:
              0.22079578 = queryWeight, product of:
                2.9434195 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.018201372 = queryNorm
              0.5759696 = fieldWeight in 605, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=605)
        0.2 = coord(5/25)
    
  4. Fricke, M.: Measuring recall (1998) 0.09
    0.08580425 = sum of:
      0.08580425 = product of:
        0.42902124 = sum of:
          0.014805727 = weight(abstract_txt:with in 3802) [ClassicSimilarity], result of:
            0.014805727 = score(doc=3802,freq=1.0), product of:
              0.05415243 = queryWeight, product of:
                1.1902004 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.018201372 = queryNorm
              0.27340835 = fieldWeight in 3802, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.109375 = fieldNorm(doc=3802)
          0.08277943 = weight(abstract_txt:technique in 3802) [ClassicSimilarity], result of:
            0.08277943 = score(doc=3802,freq=1.0), product of:
              0.13539514 = queryWeight, product of:
                1.3307537 = boost
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.018201372 = queryNorm
              0.6113914 = fieldWeight in 3802, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.109375 = fieldNorm(doc=3802)
          0.11763092 = weight(abstract_txt:ranking in 3802) [ClassicSimilarity], result of:
            0.11763092 = score(doc=3802,freq=2.0), product of:
              0.13582899 = queryWeight, product of:
                1.3328841 = boost
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.018201372 = queryNorm
              0.8660222 = fieldWeight in 3802, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.109375 = fieldNorm(doc=3802)
          0.11427761 = weight(abstract_txt:measure in 3802) [ClassicSimilarity], result of:
            0.11427761 = score(doc=3802,freq=1.0), product of:
              0.19215837 = queryWeight, product of:
                1.9416522 = boost
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.018201372 = queryNorm
              0.59470534 = fieldWeight in 3802, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.109375 = fieldNorm(doc=3802)
          0.09952755 = weight(abstract_txt:documents in 3802) [ClassicSimilarity], result of:
            0.09952755 = score(doc=3802,freq=1.0), product of:
              0.22079578 = queryWeight, product of:
                2.9434195 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.018201372 = queryNorm
              0.45076746 = fieldWeight in 3802, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.109375 = fieldNorm(doc=3802)
        0.2 = coord(5/25)
    
  5. Savoy, J.: Text representation strategies : an example with the State of the union addresses (2016) 0.08
    0.08410013 = sum of:
      0.08410013 = product of:
        0.3504172 = sum of:
          0.05994613 = weight(abstract_txt:term in 3042) [ClassicSimilarity], result of:
            0.05994613 = score(doc=3042,freq=4.0), product of:
              0.09988515 = queryWeight, product of:
                1.143001 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.018201372 = queryNorm
              0.6001506 = fieldWeight in 3042, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.0625 = fieldNorm(doc=3042)
          0.011964833 = weight(abstract_txt:with in 3042) [ClassicSimilarity], result of:
            0.011964833 = score(doc=3042,freq=2.0), product of:
              0.05415243 = queryWeight, product of:
                1.1902004 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.018201372 = queryNorm
              0.22094731 = fieldWeight in 3042, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=3042)
          0.041540712 = weight(abstract_txt:words in 3042) [ClassicSimilarity], result of:
            0.041540712 = score(doc=3042,freq=1.0), product of:
              0.12416412 = queryWeight, product of:
                1.274366 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.018201372 = queryNorm
              0.33456293 = fieldWeight in 3042, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.0625 = fieldNorm(doc=3042)
          0.04730253 = weight(abstract_txt:technique in 3042) [ClassicSimilarity], result of:
            0.04730253 = score(doc=3042,freq=1.0), product of:
              0.13539514 = queryWeight, product of:
                1.3307537 = boost
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.018201372 = queryNorm
              0.34936652 = fieldWeight in 3042, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5898643 = idf(docFreq=448, maxDocs=44218)
                0.0625 = fieldNorm(doc=3042)
          0.1131055 = weight(abstract_txt:measure in 3042) [ClassicSimilarity], result of:
            0.1131055 = score(doc=3042,freq=3.0), product of:
              0.19215837 = queryWeight, product of:
                1.9416522 = boost
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.018201372 = queryNorm
              0.58860564 = fieldWeight in 3042, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.0625 = fieldNorm(doc=3042)
          0.07655747 = weight(abstract_txt:strategy in 3042) [ClassicSimilarity], result of:
            0.07655747 = score(doc=3042,freq=1.0), product of:
              0.21364972 = queryWeight, product of:
                2.047354 = boost
                5.733308 = idf(docFreq=388, maxDocs=44218)
                0.018201372 = queryNorm
              0.35833174 = fieldWeight in 3042, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.733308 = idf(docFreq=388, maxDocs=44218)
                0.0625 = fieldNorm(doc=3042)
        0.24 = coord(6/25)
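
The content scores combine several matching terms. As a worked check, the Python sketch below recomputes the score of the first content match (Wartik et al., doc 3510) from the values in its tree. It assumes that each term contributes queryWeight times fieldWeight, that the contributions are summed, and that the sum is scaled by coord (matching terms divided by the 25 query terms); the boost and queryNorm values are copied from the listing, and the function names are hypothetical.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.018201372   # copied from the listing
QUERY_TERMS = 25           # denominator of coord(3/25)


def idf(doc_freq):
    # Assumed formula, consistent with the idf values shown above.
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1))


def term_score(boost, doc_freq, freq, field_norm):
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


# (boost, docFreq, freq, fieldNorm) for the terms matched in doc 3510.
matches = [
    (1.1902004, 9868, 1.0, 0.125),  # with
    (1.3307537, 448, 1.0, 0.125),   # technique
    (2.321371, 6, 3.0, 0.125),      # hashing
]

raw = sum(term_score(*m) for m in matches)
coord = len(matches) / QUERY_TERMS
total = raw * coord
print(round(total, 4))  # approximately 0.1178
```

Because coord scales the sum by the fraction of query terms matched, a document that matches only three of 25 terms can still rank first if one of them, here "hashing", is rare and repeated.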