Document (#40124)

Author
Pertile, S. de L.
Moreira, V.P.
Title
Comparing and combining content- and citation-based approaches for plagiarism detection
Source
Journal of the Association for Information Science and Technology. 67(2016) no.10, p.2511-2526
Year
2016
Abstract
The vast number of scientific publications available online makes it easier for students and researchers to reuse text from other authors and makes it harder to check the originality of a given text. Reusing text without crediting the original authors is considered plagiarism. A number of studies have reported the prevalence of plagiarism in academia. As a consequence, numerous institutions and researchers are dedicated to devising systems to automate the process of checking for plagiarism. This work focuses on the problem of detecting text reuse in scientific papers. The contributions of this paper are twofold: (a) we survey the existing approaches for plagiarism detection based on content, based on content and structure, and based on citations and references; and (b) we compare content- and citation-based approaches with the goal of evaluating whether they are complementary and whether their combination can improve the quality of the detection. We carried out experiments with real data sets of scientific papers and concluded that a combination of the methods can be beneficial.
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23593/full.

Similar documents (author)

  1. Moreira, F. Mosso => Mosso Moreira, F.: 4.80
    4.8036394 = sum of:
      4.8036394 = weight(author_txt:moreira in 4730) [ClassicSimilarity], result of:
        4.8036394 = fieldWeight in 4730, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          9.05783 = idf(docFreq=13, maxDocs=44218)
          0.375 = fieldNorm(doc=4730)
    
  2. Flores, F.N.; Moreira, V.P.: Assessing the impact of stemming accuracy on information retrieval : a multilingual perspective (2016) 4.53
    4.528915 = sum of:
      4.528915 = weight(author_txt:moreira in 3187) [ClassicSimilarity], result of:
        4.528915 = fieldWeight in 3187, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.05783 = idf(docFreq=13, maxDocs=44218)
          0.5 = fieldNorm(doc=3187)
    
  3. Santos Macula, B.C. Moreira dos => Moreira dos Santos Macula, B.C.: 4.00
    4.003033 = sum of:
      4.003033 = weight(author_txt:moreira in 1120) [ClassicSimilarity], result of:
        4.003033 = fieldWeight in 1120, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          9.05783 = idf(docFreq=13, maxDocs=44218)
          0.3125 = fieldNorm(doc=1120)
    
  4. Orengo, V.M. -> Moreira Orengo, V.: 3.96
    3.9628005 = sum of:
      3.9628005 = weight(author_txt:moreira in 411) [ClassicSimilarity], result of:
        3.9628005 = fieldWeight in 411, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.05783 = idf(docFreq=13, maxDocs=44218)
          0.4375 = fieldNorm(doc=411)
    
  5. Moreira Orengo, V.; Huyck, C.: Relevance feedback and cross-language information retrieval (2006) 3.96
    3.9628005 = sum of:
      3.9628005 = weight(author_txt:moreira in 970) [ClassicSimilarity], result of:
        3.9628005 = fieldWeight in 970, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.05783 = idf(docFreq=13, maxDocs=44218)
          0.4375 = fieldNorm(doc=970)
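
The author scores above each reduce to a single fieldWeight, the product of tf, idf and fieldNorm. The following is a minimal sketch of that arithmetic, not the catalog's actual code. It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), the formulas usually documented for Lucene's classic TF-IDF similarity. The function names are hypothetical, and fieldNorm is copied from the listing rather than derived.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term count (assumed formula)."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency (assumed variant that matches the listing)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explain tree."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Entry 1 of the author list: termFreq=2.0, docFreq=13, maxDocs=44218, fieldNorm=0.375
entry1 = field_weight(2.0, 13, 44218, 0.375)
# Entry 2 of the author list: termFreq=1.0, same idf, fieldNorm=0.5
entry2 = field_weight(1.0, 13, 44218, 0.5)

print(f"{entry1:.4f} {entry2:.4f}")
```

The results should agree with the listed 4.8036394 and 4.528915 to within single-precision rounding. Entry 1 ranks first despite its lower fieldNorm because the term occurs twice in its author field.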
    

Similar documents (content)

  1. Gipp, B.; Meuschke, N.; Breitinger, C.: Citation-based plagiarism detection : practicability on a large-scale scientific corpus (2014) 0.50
    0.49991742 = sum of:
      0.49991742 = product of:
        1.7854193 = sum of:
          0.054268446 = weight(abstract_txt:detecting in 3332) [ClassicSimilarity], result of:
            0.054268446 = score(doc=3332,freq=1.0), product of:
              0.11291879 = queryWeight, product of:
                1.0323777 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.014224153 = queryNorm
              0.48059714 = fieldWeight in 3332, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.0625 = fieldNorm(doc=3332)
          0.056055594 = weight(abstract_txt:citation in 3332) [ClassicSimilarity], result of:
            0.056055594 = score(doc=3332,freq=4.0), product of:
              0.0915807 = queryWeight, product of:
                1.3148388 = boost
                4.896717 = idf(docFreq=897, maxDocs=44218)
                0.014224153 = queryNorm
              0.61208963 = fieldWeight in 3332, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.896717 = idf(docFreq=897, maxDocs=44218)
                0.0625 = fieldNorm(doc=3332)
          0.060701374 = weight(abstract_txt:approaches in 3332) [ClassicSimilarity], result of:
            0.060701374 = score(doc=3332,freq=3.0), product of:
              0.12167471 = queryWeight, product of:
                1.8561639 = boost
                4.6084785 = idf(docFreq=1197, maxDocs=44218)
                0.014224153 = queryNorm
              0.4988824 = fieldWeight in 3332, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6084785 = idf(docFreq=1197, maxDocs=44218)
                0.0625 = fieldNorm(doc=3332)
          0.05115502 = weight(abstract_txt:based in 3332) [ClassicSimilarity], result of:
            0.05115502 = score(doc=3332,freq=7.0), product of:
              0.09704 = queryWeight, product of:
                2.1400106 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.014224153 = queryNorm
              0.52715397 = fieldWeight in 3332, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=3332)
          0.044648755 = weight(abstract_txt:text in 3332) [ClassicSimilarity], result of:
            0.044648755 = score(doc=3332,freq=2.0), product of:
              0.12491584 = queryWeight, product of:
                2.1716723 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014224153 = queryNorm
              0.3574307 = fieldWeight in 3332, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=3332)
          0.2958123 = weight(abstract_txt:detection in 3332) [ClassicSimilarity], result of:
            0.2958123 = score(doc=3332,freq=7.0), product of:
              0.26368567 = queryWeight, product of:
                2.7324953 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.014224153 = queryNorm
              1.1218369 = fieldWeight in 3332, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.0625 = fieldNorm(doc=3332)
          1.2227778 = weight(abstract_txt:plagiarism in 3332) [ClassicSimilarity], result of:
            1.2227778 = score(doc=3332,freq=9.0), product of:
              0.7405291 = queryWeight, product of:
                5.9116893 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.014224153 = queryNorm
              1.6512218 = fieldWeight in 3332, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.0625 = fieldNorm(doc=3332)
        0.28 = coord(7/25)
    
  2. Vani, K.; Gupta, D.: Integrating syntax-semantic-based text analysis with structural and citation information for scientific plagiarism detection (2018) 0.44
    0.4358856 = sum of:
      0.4358856 = product of:
        1.5567343 = sum of:
          0.06267206 = weight(abstract_txt:citation in 4543) [ClassicSimilarity], result of:
            0.06267206 = score(doc=4543,freq=5.0), product of:
              0.0915807 = queryWeight, product of:
                1.3148388 = boost
                4.896717 = idf(docFreq=897, maxDocs=44218)
                0.014224153 = queryNorm
              0.684337 = fieldWeight in 4543, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.896717 = idf(docFreq=897, maxDocs=44218)
                0.0625 = fieldNorm(doc=4543)
          0.035045955 = weight(abstract_txt:approaches in 4543) [ClassicSimilarity], result of:
            0.035045955 = score(doc=4543,freq=1.0), product of:
              0.12167471 = queryWeight, product of:
                1.8561639 = boost
                4.6084785 = idf(docFreq=1197, maxDocs=44218)
                0.014224153 = queryNorm
              0.2880299 = fieldWeight in 4543, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6084785 = idf(docFreq=1197, maxDocs=44218)
                0.0625 = fieldNorm(doc=4543)
          0.05063795 = weight(abstract_txt:scientific in 4543) [ClassicSimilarity], result of:
            0.05063795 = score(doc=4543,freq=2.0), product of:
              0.12342859 = queryWeight, product of:
                1.8694938 = boost
                4.6415744 = idf(docFreq=1158, maxDocs=44218)
                0.014224153 = queryNorm
              0.4102611 = fieldWeight in 4543, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6415744 = idf(docFreq=1158, maxDocs=44218)
                0.0625 = fieldNorm(doc=4543)
          0.043233886 = weight(abstract_txt:based in 4543) [ClassicSimilarity], result of:
            0.043233886 = score(doc=4543,freq=5.0), product of:
              0.09704 = queryWeight, product of:
                2.1400106 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.014224153 = queryNorm
              0.44552645 = fieldWeight in 4543, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=4543)
          0.06314287 = weight(abstract_txt:text in 4543) [ClassicSimilarity], result of:
            0.06314287 = score(doc=4543,freq=4.0), product of:
              0.12491584 = queryWeight, product of:
                2.1716723 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014224153 = queryNorm
              0.5054833 = fieldWeight in 4543, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=4543)
          0.2236131 = weight(abstract_txt:detection in 4543) [ClassicSimilarity], result of:
            0.2236131 = score(doc=4543,freq=4.0), product of:
              0.26368567 = queryWeight, product of:
                2.7324953 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.014224153 = queryNorm
              0.848029 = fieldWeight in 4543, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.0625 = fieldNorm(doc=4543)
          1.0783886 = weight(abstract_txt:plagiarism in 4543) [ClassicSimilarity], result of:
            1.0783886 = score(doc=4543,freq=7.0), product of:
              0.7405291 = queryWeight, product of:
                5.9116893 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.014224153 = queryNorm
              1.4562407 = fieldWeight in 4543, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.0625 = fieldNorm(doc=4543)
        0.28 = coord(7/25)
    
  3. Alzahrani, S.; Palade, V.; Salim, N.; Abraham, A.: Using structural information and citation evidence to detect significant plagiarism cases in scientific publications (2012) 0.34
    0.3388605 = sum of:
      0.3388605 = product of:
        1.4119189 = sum of:
          0.029671239 = weight(abstract_txt:authors in 4982) [ClassicSimilarity], result of:
            0.029671239 = score(doc=4982,freq=2.0), product of:
              0.0825315 = queryWeight, product of:
                1.2481891 = boost
                4.648501 = idf(docFreq=1150, maxDocs=44218)
                0.014224153 = queryNorm
              0.35951412 = fieldWeight in 4982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.648501 = idf(docFreq=1150, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.042477373 = weight(abstract_txt:citation in 4982) [ClassicSimilarity], result of:
            0.042477373 = score(doc=4982,freq=3.0), product of:
              0.0915807 = queryWeight, product of:
                1.3148388 = boost
                4.896717 = idf(docFreq=897, maxDocs=44218)
                0.014224153 = queryNorm
              0.4638245 = fieldWeight in 4982, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.896717 = idf(docFreq=897, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.044308204 = weight(abstract_txt:scientific in 4982) [ClassicSimilarity], result of:
            0.044308204 = score(doc=4982,freq=2.0), product of:
              0.12342859 = queryWeight, product of:
                1.8694938 = boost
                4.6415744 = idf(docFreq=1158, maxDocs=44218)
                0.014224153 = queryNorm
              0.35897845 = fieldWeight in 4982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6415744 = idf(docFreq=1158, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.02930272 = weight(abstract_txt:based in 4982) [ClassicSimilarity], result of:
            0.02930272 = score(doc=4982,freq=3.0), product of:
              0.09704 = queryWeight, product of:
                2.1400106 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.014224153 = queryNorm
              0.3019654 = fieldWeight in 4982, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.13835354 = weight(abstract_txt:detection in 4982) [ClassicSimilarity], result of:
            0.13835354 = score(doc=4982,freq=2.0), product of:
              0.26368567 = queryWeight, product of:
                2.7324953 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.014224153 = queryNorm
              0.52469116 = fieldWeight in 4982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          1.1278058 = weight(abstract_txt:plagiarism in 4982) [ClassicSimilarity], result of:
            1.1278058 = score(doc=4982,freq=10.0), product of:
              0.7405291 = queryWeight, product of:
                5.9116893 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.014224153 = queryNorm
              1.522973 = fieldWeight in 4982, product of:
                3.1622777 = tf(freq=10.0), with freq of:
                  10.0 = termFreq=10.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
        0.24 = coord(6/25)
    
  4. Stamatatos, E.: Plagiarism detection using stopword n-grams (2011) 0.20
    0.19917475 = sum of:
      0.19917475 = product of:
        0.9958737 = sum of:
          0.067835554 = weight(abstract_txt:detecting in 4955) [ClassicSimilarity], result of:
            0.067835554 = score(doc=4955,freq=1.0), product of:
              0.11291879 = queryWeight, product of:
                1.0323777 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.014224153 = queryNorm
              0.6007464 = fieldWeight in 4955, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.078125 = fieldNorm(doc=4955)
          0.024168476 = weight(abstract_txt:based in 4955) [ClassicSimilarity], result of:
            0.024168476 = score(doc=4955,freq=1.0), product of:
              0.09704 = queryWeight, product of:
                2.1400106 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.014224153 = queryNorm
              0.24905685 = fieldWeight in 4955, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.078125 = fieldNorm(doc=4955)
          0.043582767 = weight(abstract_txt:content in 4955) [ClassicSimilarity], result of:
            0.043582767 = score(doc=4955,freq=1.0), product of:
              0.13346206 = queryWeight, product of:
                2.2447317 = boost
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.014224153 = queryNorm
              0.3265555 = fieldWeight in 4955, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.078125 = fieldNorm(doc=4955)
          0.1397582 = weight(abstract_txt:detection in 4955) [ClassicSimilarity], result of:
            0.1397582 = score(doc=4955,freq=1.0), product of:
              0.26368567 = queryWeight, product of:
                2.7324953 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.014224153 = queryNorm
              0.53001815 = fieldWeight in 4955, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.078125 = fieldNorm(doc=4955)
          0.72052866 = weight(abstract_txt:plagiarism in 4955) [ClassicSimilarity], result of:
            0.72052866 = score(doc=4955,freq=2.0), product of:
              0.7405291 = queryWeight, product of:
                5.9116893 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.014224153 = queryNorm
              0.97299165 = fieldWeight in 4955, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.078125 = fieldNorm(doc=4955)
        0.2 = coord(5/25)
    
  5. Agarwal, B.; Ramampiaro, H.; Langseth, H.; Ruocco, M.: A deep network model for paraphrase detection in short text messages (2018) 0.19
    0.18917023 = sum of:
      0.18917023 = product of:
        0.7882093 = sum of:
          0.054268446 = weight(abstract_txt:detecting in 5043) [ClassicSimilarity], result of:
            0.054268446 = score(doc=5043,freq=1.0), product of:
              0.11291879 = queryWeight, product of:
                1.0323777 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.014224153 = queryNorm
              0.48059714 = fieldWeight in 5043, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.0625 = fieldNorm(doc=5043)
          0.060701374 = weight(abstract_txt:approaches in 5043) [ClassicSimilarity], result of:
            0.060701374 = score(doc=5043,freq=3.0), product of:
              0.12167471 = queryWeight, product of:
                1.8561639 = boost
                4.6084785 = idf(docFreq=1197, maxDocs=44218)
                0.014224153 = queryNorm
              0.4988824 = fieldWeight in 5043, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6084785 = idf(docFreq=1197, maxDocs=44218)
                0.0625 = fieldNorm(doc=5043)
          0.027343508 = weight(abstract_txt:based in 5043) [ClassicSimilarity], result of:
            0.027343508 = score(doc=5043,freq=2.0), product of:
              0.09704 = queryWeight, product of:
                2.1400106 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.014224153 = queryNorm
              0.28177565 = fieldWeight in 5043, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=5043)
          0.044648755 = weight(abstract_txt:text in 5043) [ClassicSimilarity], result of:
            0.044648755 = score(doc=5043,freq=2.0), product of:
              0.12491584 = queryWeight, product of:
                2.1716723 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014224153 = queryNorm
              0.3574307 = fieldWeight in 5043, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=5043)
          0.19365461 = weight(abstract_txt:detection in 5043) [ClassicSimilarity], result of:
            0.19365461 = score(doc=5043,freq=3.0), product of:
              0.26368567 = queryWeight, product of:
                2.7324953 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.014224153 = queryNorm
              0.73441464 = fieldWeight in 5043, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.0625 = fieldNorm(doc=5043)
          0.4075926 = weight(abstract_txt:plagiarism in 5043) [ClassicSimilarity], result of:
            0.4075926 = score(doc=5043,freq=1.0), product of:
              0.7405291 = queryWeight, product of:
                5.9116893 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.014224153 = queryNorm
              0.55040723 = fieldWeight in 5043, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.0625 = fieldNorm(doc=5043)
        0.24 = coord(6/25)
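
The content scores above combine several term scores. Each term contributes queryWeight * fieldWeight, where queryWeight = boost * idf * queryNorm, and the sum is scaled by coord, the fraction of query terms matched. The sketch below recomputes entry 1 (doc 3332) from the values in its explain tree. It assumes the same tf and idf formulas as the classic TF-IDF similarity; the function and variable names are hypothetical.

```python
import math

MAX_DOCS = 44218          # maxDocs from the listing
QUERY_NORM = 0.014224153  # queryNorm from the listing


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Assumed classic idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost: float, doc_freq: int, freq: float, field_norm: float) -> float:
    """score = queryWeight * fieldWeight for one matching term."""
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


# (term, boost, docFreq, termFreq) copied from entry 1 of the content list
TERMS_DOC_3332 = [
    ("detecting", 1.0323777, 54, 1.0),
    ("citation", 1.3148388, 897, 4.0),
    ("approaches", 1.8561639, 1197, 3.0),
    ("based", 2.1400106, 4958, 7.0),
    ("text", 2.1716723, 2106, 2.0),
    ("detection", 2.7324953, 135, 7.0),
    ("plagiarism", 5.9116893, 17, 9.0),
]


def document_score(terms, field_norm: float, matched: int, total_clauses: int) -> float:
    """Sum the per-term scores, then scale by coord = matched / total_clauses."""
    coord = matched / total_clauses
    return coord * sum(term_score(b, df, f, field_norm) for _, b, df, f in terms)


score = document_score(TERMS_DOC_3332, field_norm=0.0625, matched=7, total_clauses=25)
print(f"{score:.4f}")
```

The result should be close to the listed 0.49991742, with small differences from single-precision rounding. The term "plagiarism" dominates the sum because it is rare (docFreq=17) and carries the largest boost.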