Document (#34544)

Author
Sitas, A.
Kapidakis, S.
Title
Duplicate detection algorithms of bibliographic descriptions
Source
Library hi tech. 26(2008) no.2, pp.287-301
Year
2008
Abstract
Purpose - The purpose of this paper is to focus on duplicate record detection algorithms used for detection in bibliographic databases. Design/methodology/approach - Individual algorithms, their application process for duplicate detection and their results are described based on available literature (published articles), information found at various library web sites and follow-up e-mail communications. Findings - Algorithms are categorized according to their application as a process of a single step or two consecutive steps. The results of deletion, merging, and temporary and virtual consolidation of duplicate records are studied. Originality/value - The paper presents an overview of the duplication detection algorithms and an up-to-date state of their application in different library systems.
Theme
Formalerschließung (descriptive cataloguing)

Similar documents (content)

  1. Hustand, S.: Problems of duplicate records (1986) 0.28
    0.2834875 = sum of:
      0.2834875 = product of:
        1.1811979 = sum of:
          0.07476923 = weight(abstract_txt:merging in 266) [ClassicSimilarity], result of:
            0.07476923 = score(doc=266,freq=1.0), product of:
              0.12696281 = queryWeight, product of:
                1.2728093 = boost
                7.538004 = idf(docFreq=63, maxDocs=44218)
                0.013232955 = queryNorm
              0.5889065 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.538004 = idf(docFreq=63, maxDocs=44218)
                0.078125 = fieldNorm(doc=266)
          0.13322586 = weight(abstract_txt:duplication in 266) [ClassicSimilarity], result of:
            0.13322586 = score(doc=266,freq=2.0), product of:
              0.14810745 = queryWeight, product of:
                1.3747177 = boost
                8.14154 = idf(docFreq=34, maxDocs=44218)
                0.013232955 = queryNorm
              0.8995216 = fieldWeight in 266, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.14154 = idf(docFreq=34, maxDocs=44218)
                0.078125 = fieldNorm(doc=266)
          0.02641532 = weight(abstract_txt:bibliographic in 266) [ClassicSimilarity], result of:
            0.02641532 = score(doc=266,freq=1.0), product of:
              0.07994203 = queryWeight, product of:
                1.4283271 = boost
                4.229516 = idf(docFreq=1749, maxDocs=44218)
                0.013232955 = queryNorm
              0.33043092 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.229516 = idf(docFreq=1749, maxDocs=44218)
                0.078125 = fieldNorm(doc=266)
          0.16231506 = weight(abstract_txt:algorithms in 266) [ClassicSimilarity], result of:
            0.16231506 = score(doc=266,freq=1.0), product of:
              0.36399084 = queryWeight, product of:
                4.818982 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.013232955 = queryNorm
              0.4459317 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.078125 = fieldNorm(doc=266)
          0.51193506 = weight(abstract_txt:duplicate in 266) [ClassicSimilarity], result of:
            0.51193506 = score(doc=266,freq=2.0), product of:
              0.57678574 = queryWeight, product of:
                5.4257817 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.013232955 = queryNorm
              0.8875654 = fieldWeight in 266, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=266)
          0.27253732 = weight(abstract_txt:detection in 266) [ClassicSimilarity], result of:
            0.27253732 = score(doc=266,freq=1.0), product of:
              0.5142037 = queryWeight, product of:
                5.7276664 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.013232955 = queryNorm
              0.53001815 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.078125 = fieldNorm(doc=266)
        0.24 = coord(6/25)
    
  2. Cousins, S.A.: Duplicate detection and record consolidation in large bibliographic databases : the COPAC database experience (1998) 0.25
    0.2511185 = sum of:
      0.2511185 = product of:
        1.2555926 = sum of:
          0.011298351 = weight(abstract_txt:library in 2833) [ClassicSimilarity], result of:
            0.011298351 = score(doc=2833,freq=1.0), product of:
              0.045381755 = queryWeight, product of:
                1.0761696 = boost
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.013232955 = queryNorm
              0.2489624 = fieldWeight in 2833, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.078125 = fieldNorm(doc=2833)
          0.0942049 = weight(abstract_txt:duplication in 2833) [ClassicSimilarity], result of:
            0.0942049 = score(doc=2833,freq=1.0), product of:
              0.14810745 = queryWeight, product of:
                1.3747177 = boost
                8.14154 = idf(docFreq=34, maxDocs=44218)
                0.013232955 = queryNorm
              0.6360578 = fieldWeight in 2833, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.14154 = idf(docFreq=34, maxDocs=44218)
                0.078125 = fieldNorm(doc=2833)
          0.13767359 = weight(abstract_txt:consolidation in 2833) [ClassicSimilarity], result of:
            0.13767359 = score(doc=2833,freq=2.0), product of:
              0.15138575 = queryWeight, product of:
                1.3898488 = boost
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.013232955 = queryNorm
              0.90942234 = fieldWeight in 2833, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.078125 = fieldNorm(doc=2833)
          0.62698984 = weight(abstract_txt:duplicate in 2833) [ClassicSimilarity], result of:
            0.62698984 = score(doc=2833,freq=3.0), product of:
              0.57678574 = queryWeight, product of:
                5.4257817 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.013232955 = queryNorm
              1.0870411 = fieldWeight in 2833, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=2833)
          0.38542593 = weight(abstract_txt:detection in 2833) [ClassicSimilarity], result of:
            0.38542593 = score(doc=2833,freq=2.0), product of:
              0.5142037 = queryWeight, product of:
                5.7276664 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.013232955 = queryNorm
              0.7495588 = fieldWeight in 2833, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.078125 = fieldNorm(doc=2833)
        0.2 = coord(5/25)
    
  3. Meir, D.D.; Lazinger, S.S.: Measuring the performance of a merging algorithm : mismatches, missed-matches, and overlap in Israel's union list (1998) 0.25
    0.24980386 = sum of:
      0.24980386 = product of:
        1.0408494 = sum of:
          0.020851778 = weight(abstract_txt:results in 3382) [ClassicSimilarity], result of:
            0.020851778 = score(doc=3382,freq=2.0), product of:
              0.054194678 = queryWeight, product of:
                1.1760299 = boost
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.013232955 = queryNorm
              0.38475692 = fieldWeight in 3382, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.078125 = fieldNorm(doc=3382)
          0.12950411 = weight(abstract_txt:merging in 3382) [ClassicSimilarity], result of:
            0.12950411 = score(doc=3382,freq=3.0), product of:
              0.12696281 = queryWeight, product of:
                1.2728093 = boost
                7.538004 = idf(docFreq=63, maxDocs=44218)
                0.013232955 = queryNorm
              1.0200161 = fieldWeight in 3382, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.538004 = idf(docFreq=63, maxDocs=44218)
                0.078125 = fieldNorm(doc=3382)
          0.02641532 = weight(abstract_txt:bibliographic in 3382) [ClassicSimilarity], result of:
            0.02641532 = score(doc=3382,freq=1.0), product of:
              0.07994203 = queryWeight, product of:
                1.4283271 = boost
                4.229516 = idf(docFreq=1749, maxDocs=44218)
                0.013232955 = queryNorm
              0.33043092 = fieldWeight in 3382, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.229516 = idf(docFreq=1749, maxDocs=44218)
                0.078125 = fieldNorm(doc=3382)
          0.22954816 = weight(abstract_txt:algorithms in 3382) [ClassicSimilarity], result of:
            0.22954816 = score(doc=3382,freq=2.0), product of:
              0.36399084 = queryWeight, product of:
                4.818982 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.013232955 = queryNorm
              0.63064265 = fieldWeight in 3382, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.078125 = fieldNorm(doc=3382)
          0.36199278 = weight(abstract_txt:duplicate in 3382) [ClassicSimilarity], result of:
            0.36199278 = score(doc=3382,freq=1.0), product of:
              0.57678574 = queryWeight, product of:
                5.4257817 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.013232955 = queryNorm
              0.62760353 = fieldWeight in 3382, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=3382)
          0.27253732 = weight(abstract_txt:detection in 3382) [ClassicSimilarity], result of:
            0.27253732 = score(doc=3382,freq=1.0), product of:
              0.5142037 = queryWeight, product of:
                5.7276664 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.013232955 = queryNorm
              0.53001815 = fieldWeight in 3382, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.078125 = fieldNorm(doc=3382)
        0.24 = coord(6/25)
    
  4. Sedhai, S.; Sun, A.: An analysis of 14 Million tweets on hashtag-oriented spamming (2017) 0.14
    0.14165771 = sum of:
      0.14165771 = product of:
        0.70828855 = sum of:
          0.02900834 = weight(abstract_txt:descriptions in 3683) [ClassicSimilarity], result of:
            0.02900834 = score(doc=3683,freq=1.0), product of:
              0.07837 = queryWeight, product of:
                5.9223356 = idf(docFreq=321, maxDocs=44218)
                0.013232955 = queryNorm
              0.37014598 = fieldWeight in 3683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9223356 = idf(docFreq=321, maxDocs=44218)
                0.0625 = fieldNorm(doc=3683)
          0.016466135 = weight(abstract_txt:paper in 3683) [ClassicSimilarity], result of:
            0.016466135 = score(doc=3683,freq=2.0), product of:
              0.053727385 = queryWeight, product of:
                1.1709489 = boost
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.013232955 = queryNorm
              0.30647564 = fieldWeight in 3683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.0625 = fieldNorm(doc=3683)
          0.035236165 = weight(abstract_txt:their in 3683) [ClassicSimilarity], result of:
            0.035236165 = score(doc=3683,freq=4.0), product of:
              0.089219615 = queryWeight, product of:
                2.133955 = boost
                3.1594994 = idf(docFreq=5101, maxDocs=44218)
                0.013232955 = queryNorm
              0.39493743 = fieldWeight in 3683, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.1594994 = idf(docFreq=5101, maxDocs=44218)
                0.0625 = fieldNorm(doc=3683)
          0.40954804 = weight(abstract_txt:duplicate in 3683) [ClassicSimilarity], result of:
            0.40954804 = score(doc=3683,freq=2.0), product of:
              0.57678574 = queryWeight, product of:
                5.4257817 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.013232955 = queryNorm
              0.7100523 = fieldWeight in 3683, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.0625 = fieldNorm(doc=3683)
          0.21802984 = weight(abstract_txt:detection in 3683) [ClassicSimilarity], result of:
            0.21802984 = score(doc=3683,freq=1.0), product of:
              0.5142037 = queryWeight, product of:
                5.7276664 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.013232955 = queryNorm
              0.4240145 = fieldWeight in 3683, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.0625 = fieldNorm(doc=3683)
        0.2 = coord(5/25)
    
  5. Weiss, P.J.: Getting the expert into the system : expert systems and cataloging (1995) 0.14
    0.13618547 = sum of:
      0.13618547 = product of:
        1.1348789 = sum of:
          0.11963077 = weight(abstract_txt:merging in 2397) [ClassicSimilarity], result of:
            0.11963077 = score(doc=2397,freq=1.0), product of:
              0.12696281 = queryWeight, product of:
                1.2728093 = boost
                7.538004 = idf(docFreq=63, maxDocs=44218)
                0.013232955 = queryNorm
              0.9422505 = fieldWeight in 2397, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.538004 = idf(docFreq=63, maxDocs=44218)
                0.125 = fieldNorm(doc=2397)
          0.5791884 = weight(abstract_txt:duplicate in 2397) [ClassicSimilarity], result of:
            0.5791884 = score(doc=2397,freq=1.0), product of:
              0.57678574 = queryWeight, product of:
                5.4257817 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.013232955 = queryNorm
              1.0041656 = fieldWeight in 2397, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.125 = fieldNorm(doc=2397)
          0.43605968 = weight(abstract_txt:detection in 2397) [ClassicSimilarity], result of:
            0.43605968 = score(doc=2397,freq=1.0), product of:
              0.5142037 = queryWeight, product of:
                5.7276664 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.013232955 = queryNorm
              0.848029 = fieldWeight in 2397, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.125 = fieldNorm(doc=2397)
        0.12 = coord(3/25)
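
How the figures in these score listings fit together can be checked with a short calculation. The sketch below recomputes the first entry (doc 266) from the values printed in its listing, assuming the standard Lucene ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function and variable names are illustrative and not part of the catalog system; boost, queryNorm and fieldNorm are copied from the listing as given rather than derived. Because Lucene computes in single precision, results should agree with the listing only to within rounding.

```python
import math

def idf(doc_freq: int, max_docs: int) -> float:
    # ClassicSimilarity inverse document frequency
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def tf(freq: float) -> float:
    # ClassicSimilarity term frequency factor
    return math.sqrt(freq)

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in each "score(doc=...)" node
    i = idf(doc_freq, max_docs)
    query_weight = boost * i * query_norm
    field_weight = tf(freq) * i * field_norm
    return query_weight * field_weight

# Values copied from the listing for doc 266 (assumed exact as printed)
MAX_DOCS = 44218
QUERY_NORM = 0.013232955
FIELD_NORM = 0.078125

# (term, termFreq, docFreq, boost)
TERMS = [
    ("merging", 1.0, 63, 1.2728093),
    ("duplication", 2.0, 34, 1.3747177),
    ("bibliographic", 1.0, 1749, 1.4283271),
    ("algorithms", 1.0, 398, 4.818982),
    ("duplicate", 2.0, 38, 5.4257817),
    ("detection", 1.0, 135, 5.7276664),
]

scores = {
    term: term_score(freq, df, MAX_DOCS, boost, QUERY_NORM, FIELD_NORM)
    for term, freq, df, boost in TERMS
}

# coord(6/25): 6 of the 25 query terms matched this document
coord = 6 / 25
total = sum(scores.values()) * coord

for term, value in scores.items():
    print(f"{term:14s} {value:.8f}")
print(f"total          {total:.7f}  (listing reports 0.2834875)")
```

The same calculation applies to the other entries once their own termFreq, fieldNorm and coord values are substituted. A term listed without a boost line (such as "descriptions" in entry 4) corresponds to a boost of 1.0.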