Document (#11499)

Author
Kirriemuir, J.W.
Willett, P.
Title
Identification of duplicate and near-duplicate full-text records in database search-outputs using hierarchic cluster analysis
Source
Program. 29(1995) no.3, pp. 241-256
Year
1995
Abstract
Clustering the output of a multi-database online search enables users to obtain an overview of the information that has been retrieved without the need to inspect any documents that contain only redundant information. Describes a classification scheme that characterizes the degree of relationship between pairs of documents in database search outputs and then reports the application of a range of clustering methods and similarity coefficients to 20 such outputs. Results indicate that clustering is capable of grouping documents in the search output on the basis of their term similarities.

Similar documents (content)

  1. Tombros, A.; Villa, R.; Rijsbergen, C.J. Van: The effectiveness of query-specific hierarchic clustering in information retrieval (2002) 0.21
    0.20865777 = sum of:
      0.20865777 = product of:
        1.3041111 = sum of:
          0.023732323 = weight(abstract_txt:that in 2586) [ClassicSimilarity], result of:
            0.023732323 = score(doc=2586,freq=4.0), product of:
              0.06410148 = queryWeight, product of:
                1.6287427 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.016609762 = queryNorm
              0.3702305 = fieldWeight in 2586, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=2586)
          0.7934746 = weight(title_txt:hierarchic in 2586) [ClassicSimilarity], result of:
            0.7934746 = score(doc=2586,freq=1.0), product of:
              0.26401174 = queryWeight, product of:
                1.6527231 = boost
                9.617446 = idf(docFreq=7, maxDocs=44218)
                0.016609762 = queryNorm
              3.005452 = fieldWeight in 2586, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.617446 = idf(docFreq=7, maxDocs=44218)
                0.3125 = fieldNorm(doc=2586)
          0.061747383 = weight(abstract_txt:search in 2586) [ClassicSimilarity], result of:
            0.061747383 = score(doc=2586,freq=2.0), product of:
              0.15277898 = queryWeight, product of:
                2.514492 = boost
                3.6580524 = idf(docFreq=3098, maxDocs=44218)
                0.016609762 = queryNorm
              0.4041615 = fieldWeight in 2586, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6580524 = idf(docFreq=3098, maxDocs=44218)
                0.078125 = fieldNorm(doc=2586)
          0.42515686 = weight(abstract_txt:clustering in 2586) [ClassicSimilarity], result of:
            0.42515686 = score(doc=2586,freq=7.0), product of:
              0.33088857 = queryWeight, product of:
                3.2047193 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.016609762 = queryNorm
              1.2848943 = fieldWeight in 2586, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.078125 = fieldNorm(doc=2586)
        0.16 = coord(4/25)
    
  2. Kishida, K.: High-speed rough clustering for very large document collections (2010) 0.15
    0.15092719 = sum of:
      0.15092719 = product of:
        0.6288633 = sum of:
          0.04477347 = weight(abstract_txt:obtain in 3463) [ClassicSimilarity], result of:
            0.04477347 = score(doc=3463,freq=1.0), product of:
              0.11356951 = queryWeight, product of:
                1.0839752 = boost
                6.3078156 = idf(docFreq=218, maxDocs=44218)
                0.016609762 = queryNorm
              0.39423847 = fieldWeight in 3463, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3078156 = idf(docFreq=218, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.050117206 = weight(abstract_txt:cluster in 3463) [ClassicSimilarity], result of:
            0.050117206 = score(doc=3463,freq=1.0), product of:
              0.12243506 = queryWeight, product of:
                1.1254894 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.016609762 = queryNorm
              0.40933704 = fieldWeight in 3463, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.115533404 = weight(abstract_txt:grouping in 3463) [ClassicSimilarity], result of:
            0.115533404 = score(doc=3463,freq=2.0), product of:
              0.16958065 = queryWeight, product of:
                1.3245752 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.016609762 = queryNorm
              0.68128884 = fieldWeight in 3463, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.013425029 = weight(abstract_txt:that in 3463) [ClassicSimilarity], result of:
            0.013425029 = score(doc=3463,freq=2.0), product of:
              0.06410148 = queryWeight, product of:
                1.6287427 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.016609762 = queryNorm
              0.20943399 = fieldWeight in 3463, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.06488871 = weight(abstract_txt:documents in 3463) [ClassicSimilarity], result of:
            0.06488871 = score(doc=3463,freq=3.0), product of:
              0.1454434 = queryWeight, product of:
                2.1246927 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016609762 = queryNorm
              0.44614407 = fieldWeight in 3463, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.34012547 = weight(abstract_txt:clustering in 3463) [ClassicSimilarity], result of:
            0.34012547 = score(doc=3463,freq=7.0), product of:
              0.33088857 = queryWeight, product of:
                3.2047193 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.016609762 = queryNorm
              1.0279155 = fieldWeight in 3463, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
        0.24 = coord(6/25)
    
  3. Na, S.-H.; Kang, I.-S.; Lee, J.-H.: Adaptive document clustering based on query-based similarity (2007) 0.14
    0.14041294 = sum of:
      0.14041294 = product of:
        0.5850539 = sum of:
          0.060886826 = weight(abstract_txt:similarity in 920) [ClassicSimilarity], result of:
            0.060886826 = score(doc=920,freq=3.0), product of:
              0.09665472 = queryWeight, product of:
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.016609762 = queryNorm
              0.6299416 = fieldWeight in 920, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.0625 = fieldNorm(doc=920)
          0.07087643 = weight(abstract_txt:cluster in 920) [ClassicSimilarity], result of:
            0.07087643 = score(doc=920,freq=2.0), product of:
              0.12243506 = queryWeight, product of:
                1.1254894 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.016609762 = queryNorm
              0.57888997 = fieldWeight in 920, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.0625 = fieldNorm(doc=920)
          0.08750707 = weight(abstract_txt:similarities in 920) [ClassicSimilarity], result of:
            0.08750707 = score(doc=920,freq=3.0), product of:
              0.12309382 = queryWeight, product of:
                1.1285131 = boost
                6.5669885 = idf(docFreq=168, maxDocs=44218)
                0.016609762 = queryNorm
              0.7108973 = fieldWeight in 920, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.5669885 = idf(docFreq=168, maxDocs=44218)
                0.0625 = fieldNorm(doc=920)
          0.013425029 = weight(abstract_txt:that in 920) [ClassicSimilarity], result of:
            0.013425029 = score(doc=920,freq=2.0), product of:
              0.06410148 = queryWeight, product of:
                1.6287427 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.016609762 = queryNorm
              0.20943399 = fieldWeight in 920, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=920)
          0.037463516 = weight(abstract_txt:documents in 920) [ClassicSimilarity], result of:
            0.037463516 = score(doc=920,freq=1.0), product of:
              0.1454434 = queryWeight, product of:
                2.1246927 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016609762 = queryNorm
              0.2575814 = fieldWeight in 920, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=920)
          0.31489503 = weight(abstract_txt:clustering in 920) [ClassicSimilarity], result of:
            0.31489503 = score(doc=920,freq=6.0), product of:
              0.33088857 = queryWeight, product of:
                3.2047193 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.016609762 = queryNorm
              0.95166487 = fieldWeight in 920, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0625 = fieldNorm(doc=920)
        0.24 = coord(6/25)
    
  4. Conrad, J.G.; Schriber, C.P.: Managing déjà vu : collection building for the identification of nonidentical duplicate documents (2006) 0.14
    0.13900423 = sum of:
      0.13900423 = product of:
        0.69502115 = sum of:
          0.044522237 = weight(abstract_txt:identification in 5059) [ClassicSimilarity], result of:
            0.044522237 = score(doc=5059,freq=1.0), product of:
              0.09750477 = queryWeight, product of:
                1.0043877 = boost
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.016609762 = queryNorm
              0.45661598 = fieldWeight in 5059, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.078125 = fieldNorm(doc=5059)
          0.10717024 = weight(abstract_txt:near in 5059) [ClassicSimilarity], result of:
            0.10717024 = score(doc=5059,freq=2.0), product of:
              0.13899976 = queryWeight, product of:
                1.1992106 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.016609762 = queryNorm
              0.7710102 = fieldWeight in 5059, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.078125 = fieldNorm(doc=5059)
          0.08111088 = weight(abstract_txt:documents in 5059) [ClassicSimilarity], result of:
            0.08111088 = score(doc=5059,freq=3.0), product of:
              0.1454434 = queryWeight, product of:
                2.1246927 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016609762 = queryNorm
              0.5576801 = fieldWeight in 5059, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.078125 = fieldNorm(doc=5059)
          0.061747383 = weight(abstract_txt:search in 5059) [ClassicSimilarity], result of:
            0.061747383 = score(doc=5059,freq=2.0), product of:
              0.15277898 = queryWeight, product of:
                2.514492 = boost
                3.6580524 = idf(docFreq=3098, maxDocs=44218)
                0.016609762 = queryNorm
              0.4041615 = fieldWeight in 5059, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6580524 = idf(docFreq=3098, maxDocs=44218)
                0.078125 = fieldNorm(doc=5059)
          0.4004704 = weight(abstract_txt:duplicate in 5059) [ClassicSimilarity], result of:
            0.4004704 = score(doc=5059,freq=3.0), product of:
              0.3684041 = queryWeight, product of:
                2.7609954 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.016609762 = queryNorm
              1.0870411 = fieldWeight in 5059, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.078125 = fieldNorm(doc=5059)
        0.2 = coord(5/25)
    
  5. Hu, G.; Zhou, S.; Guan, J.; Hu, X.: Towards effective document clustering : a constrained K-means based approach (2008) 0.12
    0.12362545 = sum of:
      0.12362545 = product of:
        0.6181272 = sum of:
          0.106314644 = weight(abstract_txt:cluster in 2113) [ClassicSimilarity], result of:
            0.106314644 = score(doc=2113,freq=2.0), product of:
              0.12243506 = queryWeight, product of:
                1.1254894 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.016609762 = queryNorm
              0.86833495 = fieldWeight in 2113, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.09375 = fieldNorm(doc=2113)
          0.0841045 = weight(abstract_txt:pairs in 2113) [ClassicSimilarity], result of:
            0.0841045 = score(doc=2113,freq=1.0), product of:
              0.13194712 = queryWeight, product of:
                1.1683916 = boost
                6.7990475 = idf(docFreq=133, maxDocs=44218)
                0.016609762 = queryNorm
              0.6374107 = fieldWeight in 2113, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7990475 = idf(docFreq=133, maxDocs=44218)
                0.09375 = fieldNorm(doc=2113)
          0.014239393 = weight(abstract_txt:that in 2113) [ClassicSimilarity], result of:
            0.014239393 = score(doc=2113,freq=1.0), product of:
              0.06410148 = queryWeight, product of:
                1.6287427 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.016609762 = queryNorm
              0.22213829 = fieldWeight in 2113, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.09375 = fieldNorm(doc=2113)
          0.07947212 = weight(abstract_txt:documents in 2113) [ClassicSimilarity], result of:
            0.07947212 = score(doc=2113,freq=2.0), product of:
              0.1454434 = queryWeight, product of:
                2.1246927 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.016609762 = queryNorm
              0.5464127 = fieldWeight in 2113, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.09375 = fieldNorm(doc=2113)
          0.3339966 = weight(abstract_txt:clustering in 2113) [ClassicSimilarity], result of:
            0.3339966 = score(doc=2113,freq=3.0), product of:
              0.33088857 = queryWeight, product of:
                3.2047193 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.016609762 = queryNorm
              1.009393 = fieldWeight in 2113, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.09375 = fieldNorm(doc=2113)
        0.2 = coord(5/25)
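
How the scores above are assembled can be checked by hand. The following is a minimal sketch, not the catalog's own code. It assumes the usual ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), both of which agree with the values listed above (for example tf(freq=4.0) = 2.0 and idf(docFreq=11241) = 2.3694751). The queryNorm, fieldNorm and boost values are copied from the listing rather than recomputed, and the function and variable names are hypothetical. A term listed without a boost line (such as abstract_txt:similarity in entry 3) is treated as boost 1.0.

```python
import math

MAX_DOCS = 44218          # maxDocs as shown in every idf line
QUERY_NORM = 0.016609762  # queryNorm as shown in the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, field_norm, boost=1.0):
    """Score of one matching term: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = tf * term_idf * field_norm
    return query_weight * field_weight


# Entry 1 (doc 2586): (freq, docFreq, fieldNorm, boost) per matching term
terms_2586 = [
    (4.0, 11241, 0.078125, 1.6287427),  # abstract_txt:that
    (1.0, 7, 0.3125, 1.6527231),        # title_txt:hierarchic
    (2.0, 3098, 0.078125, 2.514492),    # abstract_txt:search
    (7.0, 239, 0.078125, 3.2047193),    # abstract_txt:clustering
]

coord = 4 / 25  # 4 of the 25 query terms matched
total = coord * sum(term_score(*t) for t in terms_2586)

# Should agree with the listed 0.20865777 up to floating-point rounding
print(total)
```

The coord factor explains why entry 1 ranks first although only four query terms match: its single title match on the rare term "hierarchic" (docFreq=7, fieldNorm=0.3125) contributes more than all other terms combined.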