Document (#34542)

Author
Sitas, A.
Kapidakis, S.
Title
Duplicate detection algorithms of bibliographic descriptions
Source
Library hi tech. 26(2008) no.2, pp.287-301
Year
2008
Abstract
Purpose - This paper focuses on the duplicate record detection algorithms used in bibliographic databases. Design/methodology/approach - Individual algorithms, their application to duplicate detection, and their results are described on the basis of the available literature (published articles), information found on various library web sites, and follow-up e-mail communications. Findings - Algorithms are categorized by whether they are applied as a single step or as two consecutive steps. The results of deletion, merging, and temporary and virtual consolidation of duplicate records are studied. Originality/value - The paper presents an overview of duplicate detection algorithms and the current state of their application in different library systems.
Theme
Formalerschließung

Similar documents (content)

  1. Hustand, S.: Problems of duplicate records (1986) 0.28
    0.28330618 = sum of:
      0.28330618 = product of:
        1.1804425 = sum of:
          0.07520472 = weight(abstract_txt:merging in 266) [ClassicSimilarity], result of:
            0.07520472 = score(doc=266,freq=1.0), product of:
              0.1274206 = queryWeight, product of:
                1.2755014 = boost
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.013223405 = queryNorm
              0.5902085 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.078125 = fieldNorm(doc=266)
          0.1323784 = weight(abstract_txt:duplication in 266) [ClassicSimilarity], result of:
            0.1323784 = score(doc=266,freq=2.0), product of:
              0.14743854 = queryWeight, product of:
                1.3720396 = boost
                8.126454 = idf(docFreq=34, maxDocs=43556)
                0.013223405 = queryNorm
              0.8978548 = fieldWeight in 266, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.126454 = idf(docFreq=34, maxDocs=43556)
                0.078125 = fieldNorm(doc=266)
          0.026413305 = weight(abstract_txt:bibliographic in 266) [ClassicSimilarity], result of:
            0.026413305 = score(doc=266,freq=1.0), product of:
              0.07991619 = queryWeight, product of:
                1.4285436 = boost
                4.2305613 = idf(docFreq=1721, maxDocs=43556)
                0.013223405 = queryNorm
              0.33051258 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2305613 = idf(docFreq=1721, maxDocs=43556)
                0.078125 = fieldNorm(doc=266)
          0.16462016 = weight(abstract_txt:algorithms in 266) [ClassicSimilarity], result of:
            0.16462016 = score(doc=266,freq=1.0), product of:
              0.36732876 = queryWeight, product of:
                4.842544 = boost
                5.736382 = idf(docFreq=381, maxDocs=43556)
                0.013223405 = queryNorm
              0.44815484 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.736382 = idf(docFreq=381, maxDocs=43556)
                0.078125 = fieldNorm(doc=266)
          0.5086407 = weight(abstract_txt:duplicate in 266) [ClassicSimilarity], result of:
            0.5086407 = score(doc=266,freq=2.0), product of:
              0.5741522 = queryWeight, product of:
                5.4150767 = boost
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.013223405 = queryNorm
              0.8858988 = fieldWeight in 266, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.078125 = fieldNorm(doc=266)
          0.27318513 = weight(abstract_txt:detection in 266) [ClassicSimilarity], result of:
            0.27318513 = score(doc=266,freq=1.0), product of:
              0.514878 = queryWeight, product of:
                5.7332153 = boost
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.013223405 = queryNorm
              0.5305823 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.078125 = fieldNorm(doc=266)
        0.24 = coord(6/25)
    
  2. Meir, D.D.; Lazinger, S.S.: Measuring the performance of a merging algorithm : mismatches, missed-matches, and overlap in Israel's union list (1998) 0.25
    0.25041738 = sum of:
      0.25041738 = product of:
        1.0434058 = sum of:
          0.021077495 = weight(abstract_txt:results in 4380) [ClassicSimilarity], result of:
            0.021077495 = score(doc=4380,freq=2.0), product of:
              0.0545702 = queryWeight, product of:
                1.1804671 = boost
                3.4958951 = idf(docFreq=3589, maxDocs=43556)
                0.013223405 = queryNorm
              0.3862455 = fieldWeight in 4380, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4958951 = idf(docFreq=3589, maxDocs=43556)
                0.078125 = fieldNorm(doc=4380)
          0.1302584 = weight(abstract_txt:merging in 4380) [ClassicSimilarity], result of:
            0.1302584 = score(doc=4380,freq=3.0), product of:
              0.1274206 = queryWeight, product of:
                1.2755014 = boost
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.013223405 = queryNorm
              1.022271 = fieldWeight in 4380, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.078125 = fieldNorm(doc=4380)
          0.026413305 = weight(abstract_txt:bibliographic in 4380) [ClassicSimilarity], result of:
            0.026413305 = score(doc=4380,freq=1.0), product of:
              0.07991619 = queryWeight, product of:
                1.4285436 = boost
                4.2305613 = idf(docFreq=1721, maxDocs=43556)
                0.013223405 = queryNorm
              0.33051258 = fieldWeight in 4380, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2305613 = idf(docFreq=1721, maxDocs=43556)
                0.078125 = fieldNorm(doc=4380)
          0.23280805 = weight(abstract_txt:algorithms in 4380) [ClassicSimilarity], result of:
            0.23280805 = score(doc=4380,freq=2.0), product of:
              0.36732876 = queryWeight, product of:
                4.842544 = boost
                5.736382 = idf(docFreq=381, maxDocs=43556)
                0.013223405 = queryNorm
              0.6337866 = fieldWeight in 4380, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.736382 = idf(docFreq=381, maxDocs=43556)
                0.078125 = fieldNorm(doc=4380)
          0.35966334 = weight(abstract_txt:duplicate in 4380) [ClassicSimilarity], result of:
            0.35966334 = score(doc=4380,freq=1.0), product of:
              0.5741522 = queryWeight, product of:
                5.4150767 = boost
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.013223405 = queryNorm
              0.6264251 = fieldWeight in 4380, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.078125 = fieldNorm(doc=4380)
          0.27318513 = weight(abstract_txt:detection in 4380) [ClassicSimilarity], result of:
            0.27318513 = score(doc=4380,freq=1.0), product of:
              0.514878 = queryWeight, product of:
                5.7332153 = boost
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.013223405 = queryNorm
              0.5305823 = fieldWeight in 4380, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.078125 = fieldNorm(doc=4380)
        0.24 = coord(6/25)
    
  3. Cousins, S.A.: Duplicate detection and record consolidation in large bibliographic databases : the COPAC database experience (1998) 0.25
    0.25020343 = sum of:
      0.25020343 = product of:
        1.2510171 = sum of:
          0.011307981 = weight(abstract_txt:library in 3831) [ClassicSimilarity], result of:
            0.011307981 = score(doc=3831,freq=1.0), product of:
              0.045395166 = queryWeight, product of:
                1.0766658 = boost
                3.1884925 = idf(docFreq=4881, maxDocs=43556)
                0.013223405 = queryNorm
              0.24910098 = fieldWeight in 3831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1884925 = idf(docFreq=4881, maxDocs=43556)
                0.078125 = fieldNorm(doc=3831)
          0.09360567 = weight(abstract_txt:duplication in 3831) [ClassicSimilarity], result of:
            0.09360567 = score(doc=3831,freq=1.0), product of:
              0.14743854 = queryWeight, product of:
                1.3720396 = boost
                8.126454 = idf(docFreq=34, maxDocs=43556)
                0.013223405 = queryNorm
              0.63487923 = fieldWeight in 3831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.126454 = idf(docFreq=34, maxDocs=43556)
                0.078125 = fieldNorm(doc=3831)
          0.13680619 = weight(abstract_txt:consolidation in 3831) [ClassicSimilarity], result of:
            0.13680619 = score(doc=3831,freq=2.0), product of:
              0.15070815 = queryWeight, product of:
                1.3871694 = boost
                8.216067 = idf(docFreq=31, maxDocs=43556)
                0.013223405 = queryNorm
              0.90775573 = fieldWeight in 3831, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.216067 = idf(docFreq=31, maxDocs=43556)
                0.078125 = fieldNorm(doc=3831)
          0.62295514 = weight(abstract_txt:duplicate in 3831) [ClassicSimilarity], result of:
            0.62295514 = score(doc=3831,freq=3.0), product of:
              0.5741522 = queryWeight, product of:
                5.4150767 = boost
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.013223405 = queryNorm
              1.085 = fieldWeight in 3831, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.078125 = fieldNorm(doc=3831)
          0.38634214 = weight(abstract_txt:detection in 3831) [ClassicSimilarity], result of:
            0.38634214 = score(doc=3831,freq=2.0), product of:
              0.514878 = queryWeight, product of:
                5.7332153 = boost
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.013223405 = queryNorm
              0.7503567 = fieldWeight in 3831, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.078125 = fieldNorm(doc=3831)
        0.2 = coord(5/25)
    
  4. Sedhai, S.; Sun, A.: An analysis of 14 million tweets on hashtag-oriented spamming (2017) 0.14
    0.14139782 = sum of:
      0.14139782 = product of:
        0.70698905 = sum of:
          0.028992942 = weight(abstract_txt:descriptions in 681) [ClassicSimilarity], result of:
            0.028992942 = score(doc=681,freq=1.0), product of:
              0.07832092 = queryWeight, product of:
                5.922901 = idf(docFreq=316, maxDocs=43556)
                0.013223405 = queryNorm
              0.37018132 = fieldWeight in 681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.922901 = idf(docFreq=316, maxDocs=43556)
                0.0625 = fieldNorm(doc=681)
          0.016729942 = weight(abstract_txt:paper in 681) [ClassicSimilarity], result of:
            0.016729942 = score(doc=681,freq=2.0), product of:
              0.054284923 = queryWeight, product of:
                1.1773775 = boost
                3.486745 = idf(docFreq=3622, maxDocs=43556)
                0.013223405 = queryNorm
              0.30818763 = fieldWeight in 681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.486745 = idf(docFreq=3622, maxDocs=43556)
                0.0625 = fieldNorm(doc=681)
          0.03580546 = weight(abstract_txt:their in 681) [ClassicSimilarity], result of:
            0.03580546 = score(doc=681,freq=4.0), product of:
              0.09015346 = queryWeight, product of:
                2.1457658 = boost
                3.1772897 = idf(docFreq=4936, maxDocs=43556)
                0.013223405 = queryNorm
              0.39716122 = fieldWeight in 681, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.1772897 = idf(docFreq=4936, maxDocs=43556)
                0.0625 = fieldNorm(doc=681)
          0.4069126 = weight(abstract_txt:duplicate in 681) [ClassicSimilarity], result of:
            0.4069126 = score(doc=681,freq=2.0), product of:
              0.5741522 = queryWeight, product of:
                5.4150767 = boost
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.013223405 = queryNorm
              0.7087191 = fieldWeight in 681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.0625 = fieldNorm(doc=681)
          0.21854812 = weight(abstract_txt:detection in 681) [ClassicSimilarity], result of:
            0.21854812 = score(doc=681,freq=1.0), product of:
              0.514878 = queryWeight, product of:
                5.7332153 = boost
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.013223405 = queryNorm
              0.42446586 = fieldWeight in 681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.0625 = fieldNorm(doc=681)
        0.2 = coord(5/25)
    
  5. Weiss, P.J.: Getting the expert into the system : expert systems and cataloging (1995) 0.14
    0.13594621 = sum of:
      0.13594621 = product of:
        1.1328851 = sum of:
          0.120327555 = weight(abstract_txt:merging in 2463) [ClassicSimilarity], result of:
            0.120327555 = score(doc=2463,freq=1.0), product of:
              0.1274206 = queryWeight, product of:
                1.2755014 = boost
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.013223405 = queryNorm
              0.94433355 = fieldWeight in 2463, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.125 = fieldNorm(doc=2463)
          0.5754613 = weight(abstract_txt:duplicate in 2463) [ClassicSimilarity], result of:
            0.5754613 = score(doc=2463,freq=1.0), product of:
              0.5741522 = queryWeight, product of:
                5.4150767 = boost
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.013223405 = queryNorm
              1.0022801 = fieldWeight in 2463, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.018241 = idf(docFreq=38, maxDocs=43556)
                0.125 = fieldNorm(doc=2463)
          0.43709624 = weight(abstract_txt:detection in 2463) [ClassicSimilarity], result of:
            0.43709624 = score(doc=2463,freq=1.0), product of:
              0.514878 = queryWeight, product of:
                5.7332153 = boost
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.013223405 = queryNorm
              0.8489317 = fieldWeight in 2463, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.791454 = idf(docFreq=132, maxDocs=43556)
                0.125 = fieldNorm(doc=2463)
        0.12 = coord(3/25)
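
Note on reading the score breakdowns: the trees above follow the usual Lucene ClassicSimilarity (TF-IDF) pattern. Each matching term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is multiplied by the coord factor (matched query terms / total query terms). The sketch below recomputes the first entry (doc 266) from the values printed in its tree. The function and variable names are illustrative assumptions, not part of the catalogue software, and the idf formula is inferred from the printed numbers (1 + ln(maxDocs / (docFreq + 1))) rather than taken from the system's configuration.

```python
import math

# Values copied from the explain tree for doc 266.
MAX_DOCS = 43556
QUERY_NORM = 0.013223405
FIELD_NORM = 0.078125          # fieldNorm(doc=266)
COORD = 6 / 25                 # 6 of 25 query terms matched

# (term, termFreq, docFreq, boost) as printed in the listing.
TERMS = [
    ("merging",       1.0,   61, 1.2755014),
    ("duplication",   2.0,   34, 1.3720396),
    ("bibliographic", 1.0, 1721, 1.4285436),
    ("algorithms",    1.0,  381, 4.842544),
    ("duplicate",     2.0,   38, 5.4150767),
    ("detection",     1.0,  132, 5.7332153),
]


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency; formula inferred from the printed values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float) -> float:
    """Score of one term: queryWeight * fieldWeight."""
    weight = idf(doc_freq, MAX_DOCS)
    query_weight = boost * weight * QUERY_NORM          # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * weight * FIELD_NORM  # tf * idf * fieldNorm
    return query_weight * field_weight


def document_score() -> float:
    """Sum of the term scores, scaled by the coord factor."""
    return COORD * sum(term_score(f, df, b) for _, f, df, b in TERMS)


if __name__ == "__main__":
    for name, f, df, b in TERMS:
        print(f"{name:14s} {term_score(f, df, b):.6f}")
    print(f"total          {document_score():.6f}")
```

Run as a script, the total should land close to the 0.28330618 reported for entry 1. Small differences in the last digits are expected because the listing was computed in single precision. The same arithmetic applies to the other entries once their termFreq, fieldNorm, and coord values are substituted.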