Document (#11499)

Author
Kirriemuir, J.W.
Willett, P.
Title
Identification of duplicate and near-duplicate full-text records in database search-outputs using hierarchic cluster analysis
Source
Program. 29(1995) no.3, pp. 241-256
Year
1995
Abstract
Clustering the output of a multi-database online search enables users to obtain an overview of the information that has been retrieved without the need to inspect any documents that contain only redundant information. Describes a classification scheme that characterizes the degree of relationship between pairs of documents in database search outputs and then reports the application of a range of clustering methods and similarity coefficients to 20 such outputs. Results indicate that clustering is capable of grouping documents in the search output on the basis of their term similarities.

Similar documents (content)

  1. Tombros, A.; Villa, R.; Rijsbergen, C.J. Van: The effectiveness of query-specific hierarchic clustering in information retrieval (2002) 0.21
    0.2050702 = sum of:
      0.2050702 = product of:
        1.2816888 = sum of:
          0.772533 = weight(title_txt:hierarchic in 3587) [ClassicSimilarity], result of:
            0.772533 = score(doc=3587,freq=1.0), product of:
              0.25845116 = queryWeight, product of:
                1.6388725 = boost
                9.565078 = idf(docFreq=7, maxDocs=41962)
                0.01648712 = queryNorm
              2.9890869 = fieldWeight in 3587, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.565078 = idf(docFreq=7, maxDocs=41962)
                0.3125 = fieldNorm(doc=3587)
          0.024781829 = weight(abstract_txt:that in 3587) [ClassicSimilarity], result of:
            0.024781829 = score(doc=3587,freq=4.0), product of:
              0.06575004 = queryWeight, product of:
                1.6532326 = boost
                2.4122221 = idf(docFreq=10221, maxDocs=41962)
                0.01648712 = queryNorm
              0.3769097 = fieldWeight in 3587, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4122221 = idf(docFreq=10221, maxDocs=41962)
                0.078125 = fieldNorm(doc=3587)
          0.061125703 = weight(abstract_txt:search in 3587) [ClassicSimilarity], result of:
            0.061125703 = score(doc=3587,freq=2.0), product of:
              0.1512283 = queryWeight, product of:
                2.5072777 = boost
                3.6583548 = idf(docFreq=2939, maxDocs=41962)
                0.01648712 = queryNorm
              0.4041949 = fieldWeight in 3587, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6583548 = idf(docFreq=2939, maxDocs=41962)
                0.078125 = fieldNorm(doc=3587)
          0.4232483 = weight(abstract_txt:clustering in 3587) [ClassicSimilarity], result of:
            0.4232483 = score(doc=3587,freq=7.0), product of:
              0.32875952 = queryWeight, product of:
                3.2015164 = boost
                6.2284193 = idf(docFreq=224, maxDocs=41962)
                0.01648712 = queryNorm
              1.28741 = fieldWeight in 3587, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2284193 = idf(docFreq=224, maxDocs=41962)
                0.078125 = fieldNorm(doc=3587)
        0.16 = coord(4/25)
    
  2. Kishida, K.: High-speed rough clustering for very large document collections (2010) 0.15
    0.15087835 = sum of:
      0.15087835 = product of:
        0.62865984 = sum of:
          0.045340355 = weight(abstract_txt:obtain in 464) [ClassicSimilarity], result of:
            0.045340355 = score(doc=464,freq=1.0), product of:
              0.11413103 = queryWeight, product of:
                1.0890751 = boost
                6.3562527 = idf(docFreq=197, maxDocs=41962)
                0.01648712 = queryNorm
              0.3972658 = fieldWeight in 464, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3562527 = idf(docFreq=197, maxDocs=41962)
                0.0625 = fieldNorm(doc=464)
          0.05019849 = weight(abstract_txt:cluster in 464) [ClassicSimilarity], result of:
            0.05019849 = score(doc=464,freq=1.0), product of:
              0.122144595 = queryWeight, product of:
                1.1266605 = boost
                6.5756154 = idf(docFreq=158, maxDocs=41962)
                0.01648712 = queryNorm
              0.41097596 = fieldWeight in 464, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5756154 = idf(docFreq=158, maxDocs=41962)
                0.0625 = fieldNorm(doc=464)
          0.116347015 = weight(abstract_txt:grouping in 464) [ClassicSimilarity], result of:
            0.116347015 = score(doc=464,freq=2.0), product of:
              0.16978812 = queryWeight, product of:
                1.328341 = boost
                7.7526994 = idf(docFreq=48, maxDocs=41962)
                0.01648712 = queryNorm
              0.68524826 = fieldWeight in 464, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.7526994 = idf(docFreq=48, maxDocs=41962)
                0.0625 = fieldNorm(doc=464)
          0.014018719 = weight(abstract_txt:that in 464) [ClassicSimilarity], result of:
            0.014018719 = score(doc=464,freq=2.0), product of:
              0.06575004 = queryWeight, product of:
                1.6532326 = boost
                2.4122221 = idf(docFreq=10221, maxDocs=41962)
                0.01648712 = queryNorm
              0.21321233 = fieldWeight in 464, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4122221 = idf(docFreq=10221, maxDocs=41962)
                0.0625 = fieldNorm(doc=464)
          0.06415667 = weight(abstract_txt:documents in 464) [ClassicSimilarity], result of:
            0.06415667 = score(doc=464,freq=3.0), product of:
              0.14384949 = queryWeight, product of:
                2.1177306 = boost
                4.1199584 = idf(docFreq=1852, maxDocs=41962)
                0.01648712 = queryNorm
              0.44599858 = fieldWeight in 464, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1199584 = idf(docFreq=1852, maxDocs=41962)
                0.0625 = fieldNorm(doc=464)
          0.33859864 = weight(abstract_txt:clustering in 464) [ClassicSimilarity], result of:
            0.33859864 = score(doc=464,freq=7.0), product of:
              0.32875952 = queryWeight, product of:
                3.2015164 = boost
                6.2284193 = idf(docFreq=224, maxDocs=41962)
                0.01648712 = queryNorm
              1.029928 = fieldWeight in 464, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2284193 = idf(docFreq=224, maxDocs=41962)
                0.0625 = fieldNorm(doc=464)
        0.24 = coord(6/25)
    
  3. Na, S.-H.; Kang, I.-S.; Lee, J.-H.: Adaptive document clustering based on query-based similarity (2007) 0.14
    0.14074185 = sum of:
      0.14074185 = product of:
        0.5864244 = sum of:
          0.060795598 = weight(abstract_txt:similarity in 2921) [ClassicSimilarity], result of:
            0.060795598 = score(doc=2921,freq=3.0), product of:
              0.096225046 = queryWeight, product of:
                5.836377 = idf(docFreq=332, maxDocs=41962)
                0.01648712 = queryNorm
              0.6318064 = fieldWeight in 2921, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.836377 = idf(docFreq=332, maxDocs=41962)
                0.0625 = fieldNorm(doc=2921)
          0.07099138 = weight(abstract_txt:cluster in 2921) [ClassicSimilarity], result of:
            0.07099138 = score(doc=2921,freq=2.0), product of:
              0.122144595 = queryWeight, product of:
                1.1266605 = boost
                6.5756154 = idf(docFreq=158, maxDocs=41962)
                0.01648712 = queryNorm
              0.58120775 = fieldWeight in 2921, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5756154 = idf(docFreq=158, maxDocs=41962)
                0.0625 = fieldNorm(doc=2921)
          0.09009642 = weight(abstract_txt:similarities in 2921) [ClassicSimilarity], result of:
            0.09009642 = score(doc=2921,freq=3.0), product of:
              0.12507728 = queryWeight, product of:
                1.1401057 = boost
                6.654087 = idf(docFreq=146, maxDocs=41962)
                0.01648712 = queryNorm
              0.72032607 = fieldWeight in 2921, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.654087 = idf(docFreq=146, maxDocs=41962)
                0.0625 = fieldNorm(doc=2921)
          0.014018719 = weight(abstract_txt:that in 2921) [ClassicSimilarity], result of:
            0.014018719 = score(doc=2921,freq=2.0), product of:
              0.06575004 = queryWeight, product of:
                1.6532326 = boost
                2.4122221 = idf(docFreq=10221, maxDocs=41962)
                0.01648712 = queryNorm
              0.21321233 = fieldWeight in 2921, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4122221 = idf(docFreq=10221, maxDocs=41962)
                0.0625 = fieldNorm(doc=2921)
          0.03704087 = weight(abstract_txt:documents in 2921) [ClassicSimilarity], result of:
            0.03704087 = score(doc=2921,freq=1.0), product of:
              0.14384949 = queryWeight, product of:
                2.1177306 = boost
                4.1199584 = idf(docFreq=1852, maxDocs=41962)
                0.01648712 = queryNorm
              0.2574974 = fieldWeight in 2921, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1199584 = idf(docFreq=1852, maxDocs=41962)
                0.0625 = fieldNorm(doc=2921)
          0.31348145 = weight(abstract_txt:clustering in 2921) [ClassicSimilarity], result of:
            0.31348145 = score(doc=2921,freq=6.0), product of:
              0.32875952 = queryWeight, product of:
                3.2015164 = boost
                6.2284193 = idf(docFreq=224, maxDocs=41962)
                0.01648712 = queryNorm
              0.9535281 = fieldWeight in 2921, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                6.2284193 = idf(docFreq=224, maxDocs=41962)
                0.0625 = fieldNorm(doc=2921)
        0.24 = coord(6/25)
    
  4. Conrad, J.G.; Schriber, C.P.: Managing déjà vu : collection building for the identification of nonidentical duplicate documents (2006) 0.14
    0.13621442 = sum of:
      0.13621442 = product of:
        0.6810721 = sum of:
          0.044923156 = weight(abstract_txt:identification in 60) [ClassicSimilarity], result of:
            0.044923156 = score(doc=60,freq=1.0), product of:
              0.09775087 = queryWeight, product of:
                1.0078973 = boost
                5.882468 = idf(docFreq=317, maxDocs=41962)
                0.01648712 = queryNorm
              0.45956784 = fieldWeight in 60, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.882468 = idf(docFreq=317, maxDocs=41962)
                0.078125 = fieldNorm(doc=60)
          0.10618775 = weight(abstract_txt:near in 60) [ClassicSimilarity], result of:
            0.10618775 = score(doc=60,freq=2.0), product of:
              0.13767236 = queryWeight, product of:
                1.1961325 = boost
                6.9810805 = idf(docFreq=105, maxDocs=41962)
                0.01648712 = queryNorm
              0.7713077 = fieldWeight in 60, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.9810805 = idf(docFreq=105, maxDocs=41962)
                0.078125 = fieldNorm(doc=60)
          0.08019584 = weight(abstract_txt:documents in 60) [ClassicSimilarity], result of:
            0.08019584 = score(doc=60,freq=3.0), product of:
              0.14384949 = queryWeight, product of:
                2.1177306 = boost
                4.1199584 = idf(docFreq=1852, maxDocs=41962)
                0.01648712 = queryNorm
              0.5574982 = fieldWeight in 60, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1199584 = idf(docFreq=1852, maxDocs=41962)
                0.078125 = fieldNorm(doc=60)
          0.061125703 = weight(abstract_txt:search in 60) [ClassicSimilarity], result of:
            0.061125703 = score(doc=60,freq=2.0), product of:
              0.1512283 = queryWeight, product of:
                2.5072777 = boost
                3.6583548 = idf(docFreq=2939, maxDocs=41962)
                0.01648712 = queryNorm
              0.4041949 = fieldWeight in 60, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6583548 = idf(docFreq=2939, maxDocs=41962)
                0.078125 = fieldNorm(doc=60)
          0.3886397 = weight(abstract_txt:duplicate in 60) [ClassicSimilarity], result of:
            0.3886397 = score(doc=60,freq=3.0), product of:
              0.35986656 = queryWeight, product of:
                2.7349014 = boost
                7.980958 = idf(docFreq=38, maxDocs=41962)
                0.01648712 = queryNorm
              1.079955 = fieldWeight in 60, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.980958 = idf(docFreq=38, maxDocs=41962)
                0.078125 = fieldNorm(doc=60)
        0.2 = coord(5/25)
    
  5. Hu, G.; Zhou, S.; Guan, J.; Hu, X.: Towards effective document clustering : a constrained K-means based approach (2008) 0.12
    0.12362535 = sum of:
      0.12362535 = product of:
        0.61812675 = sum of:
          0.10648708 = weight(abstract_txt:cluster in 4114) [ClassicSimilarity], result of:
            0.10648708 = score(doc=4114,freq=2.0), product of:
              0.122144595 = queryWeight, product of:
                1.1266605 = boost
                6.5756154 = idf(docFreq=158, maxDocs=41962)
                0.01648712 = queryNorm
              0.8718116 = fieldWeight in 4114, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5756154 = idf(docFreq=158, maxDocs=41962)
                0.09375 = fieldNorm(doc=4114)
          0.08569777 = weight(abstract_txt:pairs in 4114) [ClassicSimilarity], result of:
            0.08569777 = score(doc=4114,freq=1.0), product of:
              0.13314739 = queryWeight, product of:
                1.1763113 = boost
                6.865396 = idf(docFreq=118, maxDocs=41962)
                0.01648712 = queryNorm
              0.64363086 = fieldWeight in 4114, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.865396 = idf(docFreq=118, maxDocs=41962)
                0.09375 = fieldNorm(doc=4114)
          0.014869098 = weight(abstract_txt:that in 4114) [ClassicSimilarity], result of:
            0.014869098 = score(doc=4114,freq=1.0), product of:
              0.06575004 = queryWeight, product of:
                1.6532326 = boost
                2.4122221 = idf(docFreq=10221, maxDocs=41962)
                0.01648712 = queryNorm
              0.22614583 = fieldWeight in 4114, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4122221 = idf(docFreq=10221, maxDocs=41962)
                0.09375 = fieldNorm(doc=4114)
          0.07857555 = weight(abstract_txt:documents in 4114) [ClassicSimilarity], result of:
            0.07857555 = score(doc=4114,freq=2.0), product of:
              0.14384949 = queryWeight, product of:
                2.1177306 = boost
                4.1199584 = idf(docFreq=1852, maxDocs=41962)
                0.01648712 = queryNorm
              0.5462345 = fieldWeight in 4114, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1199584 = idf(docFreq=1852, maxDocs=41962)
                0.09375 = fieldNorm(doc=4114)
          0.33249727 = weight(abstract_txt:clustering in 4114) [ClassicSimilarity], result of:
            0.33249727 = score(doc=4114,freq=3.0), product of:
              0.32875952 = queryWeight, product of:
                3.2015164 = boost
                6.2284193 = idf(docFreq=224, maxDocs=41962)
                0.01648712 = queryNorm
              1.0113692 = fieldWeight in 4114, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2284193 = idf(docFreq=224, maxDocs=41962)
                0.09375 = fieldNorm(doc=4114)
        0.2 = coord(5/25)
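
How the scores above combine
The explain trees follow the usual Lucene ClassicSimilarity pattern: each matching term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is scaled by coord (matching terms / query terms). The sketch below recomputes the first entry (doc 3587) from the freq, docFreq, boost, and fieldNorm values listed for it. The function names are illustrative only, and the idf formula 1 + ln(maxDocs / (docFreq + 1)) is assumed from ClassicSimilarity's documented default; it is consistent with the idf values shown in the listing.

```python
import math

MAX_DOCS = 41962         # maxDocs, as shown in every idf(...) line
QUERY_NORM = 0.01648712  # queryNorm, as shown in every queryWeight block


def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed ClassicSimilarity default: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def term_score(freq, doc_freq, boost, field_norm):
    # One "weight(field:term in doc)" node of the explain tree
    tf = math.sqrt(freq)                    # tf(freq) = sqrt(termFreq)
    w = idf(doc_freq)
    query_weight = boost * w * QUERY_NORM   # boost * idf * queryNorm
    field_weight = tf * w * field_norm      # tf * idf * fieldNorm
    return query_weight * field_weight


# Entry 1 (doc 3587): (freq, docFreq, boost, fieldNorm), copied from the listing
terms_3587 = [
    (1.0, 7, 1.6388725, 0.3125),        # title_txt:hierarchic
    (4.0, 10221, 1.6532326, 0.078125),  # abstract_txt:that
    (2.0, 2939, 2.5072777, 0.078125),   # abstract_txt:search
    (7.0, 224, 3.2015164, 0.078125),    # abstract_txt:clustering
]

term_sum = sum(term_score(*t) for t in terms_3587)
coord = 4 / 25                          # 4 of 25 query terms matched
score_3587 = term_sum * coord

print(round(term_sum, 4), round(score_3587, 4))
```

Up to floating-point rounding (Lucene computes in single precision), the result should agree with the 1.2816888 sum and the 0.2050702 final score listed for entry 1. The same function applies to the other entries by substituting their values; a term shown without a boost line, such as abstract_txt:similarity in entry 3, corresponds to a boost of 1.0.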