Document (#34287)

Author
Khoo, C.S.G.
Ou, S.
Title
Machine versus human clustering of concepts across documents
Source
Culture and identity in knowledge organization: Proceedings of the Tenth International ISKO Conference 5-8 August 2008, Montreal, Canada. Ed. by Clément Arsenault and Joseph T. Tennis
Imprint
Würzburg : Ergon Verlag
Year
2008
Pages
pp. 333-339
Series
Advances in knowledge organization; vol.11
Content
An automated method for clustering terms/concepts from a set of documents on the same topic was developed for the purpose of multidocument summarization. The clustering method combines lexical overlap between multiword terms, syntactic constraints, and semantic considerations based on a manually constructed taxonomy to generate hierarchically organized clusters of terms. This study evaluates the machine-generated clusters by calculating the proportion of overlap with two sets of human-generated clusters for 15 topics. The overlap between the machine-generated clusters and each individual set of human-generated clusters was found to be higher than that between the two sets of human-generated clusters. A qualitative analysis of the human clustering found that the clusters formed are either semantic-conceptual based or lexical based (similar to machine clustering). The semantic-conceptual based clusters tended to differ between human coders. This raises questions about whether machine-generated clustering can be evaluated by comparison with human clustering.
Footnote
See: http://www.ergon-verlag.de/isko_ko/tocs/0497f79b0c0b3ed06/0497f79b0c0b5550a/index.php.
Theme
Automatic classification

Similar documents (author)

  1. Khoo, C.S.G.; Poo, D.C.C.: An expert system approach to online catalog subject searching (1994) 6.10
    6.1002054 = sum of:
      6.1002054 = sum of:
        2.7357059 = weight(author_txt:khoo in 7303) [ClassicSimilarity], result of:
          2.7357059 = score(doc=7303,freq=1.0), product of:
            0.65686435 = queryWeight, product of:
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.07885913 = queryNorm
            4.164796 = fieldWeight in 7303, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.5 = fieldNorm(doc=7303)
        3.3644998 = weight(author_txt:c.s.g in 7303) [ClassicSimilarity], result of:
          3.3644998 = score(doc=7303,freq=1.0), product of:
            0.7540088 = queryWeight, product of:
              1.0713968 = boost
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.07885913 = queryNorm
            4.462149 = fieldWeight in 7303, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.5 = fieldNorm(doc=7303)
    
  2. Chaudhry, A.S.; Khoo, C.S.G.: A survey of the top-level categories in the structure of corporate Websites (2008) 6.10
    6.1002054 = sum of:
      6.1002054 = sum of:
        2.7357059 = weight(author_txt:khoo in 2259) [ClassicSimilarity], result of:
          2.7357059 = score(doc=2259,freq=1.0), product of:
            0.65686435 = queryWeight, product of:
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.07885913 = queryNorm
            4.164796 = fieldWeight in 2259, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.5 = fieldNorm(doc=2259)
        3.3644998 = weight(author_txt:c.s.g in 2259) [ClassicSimilarity], result of:
          3.3644998 = score(doc=2259,freq=1.0), product of:
            0.7540088 = queryWeight, product of:
              1.0713968 = boost
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.07885913 = queryNorm
            4.462149 = fieldWeight in 2259, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.5 = fieldNorm(doc=2259)
    
  3. Poo, D.C.C.; Khoo, C.S.G.: Online Catalog Subject Searching (2009) 6.10
    6.1002054 = sum of:
      6.1002054 = sum of:
        2.7357059 = weight(author_txt:khoo in 3851) [ClassicSimilarity], result of:
          2.7357059 = score(doc=3851,freq=1.0), product of:
            0.65686435 = queryWeight, product of:
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.07885913 = queryNorm
            4.164796 = fieldWeight in 3851, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.5 = fieldNorm(doc=3851)
        3.3644998 = weight(author_txt:c.s.g in 3851) [ClassicSimilarity], result of:
          3.3644998 = score(doc=3851,freq=1.0), product of:
            0.7540088 = queryWeight, product of:
              1.0713968 = boost
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.07885913 = queryNorm
            4.462149 = fieldWeight in 3851, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.5 = fieldNorm(doc=3851)
    
  4. Sun, G.; Khoo, C.S.G.: A framework to represent variables and values in social science research data sets to support data curation and reuse (2018) 6.10
    6.1002054 = sum of:
      6.1002054 = sum of:
        2.7357059 = weight(author_txt:khoo in 4744) [ClassicSimilarity], result of:
          2.7357059 = score(doc=4744,freq=1.0), product of:
            0.65686435 = queryWeight, product of:
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.07885913 = queryNorm
            4.164796 = fieldWeight in 4744, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.5 = fieldNorm(doc=4744)
        3.3644998 = weight(author_txt:c.s.g in 4744) [ClassicSimilarity], result of:
          3.3644998 = score(doc=4744,freq=1.0), product of:
            0.7540088 = queryWeight, product of:
              1.0713968 = boost
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.07885913 = queryNorm
            4.462149 = fieldWeight in 4744, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.5 = fieldNorm(doc=4744)
    
  5. Khoo, C.S.G.; Wan, K.-W.: A simple relevancy-ranking strategy for an interface to Boolean OPACs (2004) 5.34
    5.33768 = sum of:
      5.33768 = sum of:
        2.3937428 = weight(author_txt:khoo in 2509) [ClassicSimilarity], result of:
          2.3937428 = score(doc=2509,freq=1.0), product of:
            0.65686435 = queryWeight, product of:
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.07885913 = queryNorm
            3.6441965 = fieldWeight in 2509, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.329592 = idf(docFreq=28, maxDocs=44218)
              0.4375 = fieldNorm(doc=2509)
        2.9439373 = weight(author_txt:c.s.g in 2509) [ClassicSimilarity], result of:
          2.9439373 = score(doc=2509,freq=1.0), product of:
            0.7540088 = queryWeight, product of:
              1.0713968 = boost
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.07885913 = queryNorm
            3.9043806 = fieldWeight in 2509, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.924298 = idf(docFreq=15, maxDocs=44218)
              0.4375 = fieldNorm(doc=2509)
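
The score trees above all follow the same pattern: each term score is the product of a queryWeight (boost × idf × queryNorm) and a fieldWeight (tf × idf × fieldNorm), and the per-term scores are summed. The following is a minimal sketch that reproduces the figures for entry 1 (doc 7303) from the values printed in its tree. The function name `classic_term_score` is hypothetical and chosen only for illustration, and tf is assumed to be the square root of the term frequency, as in Lucene's ClassicSimilarity.

```python
import math


def classic_term_score(freq, idf, query_norm, field_norm, boost=1.0):
    """Recompute one term's score from the factors shown in an explain tree.

    Hypothetical helper for illustration; it is not part of any Lucene API.
    """
    tf = math.sqrt(freq)                      # tf(freq=1.0) = 1.0
    query_weight = boost * idf * query_norm   # "queryWeight, product of:"
    field_weight = tf * idf * field_norm      # "fieldWeight in <doc>, product of:"
    return query_weight * field_weight


# Values copied from the explain tree of entry 1 (doc 7303).
khoo = classic_term_score(freq=1.0, idf=8.329592,
                          query_norm=0.07885913, field_norm=0.5)
csg = classic_term_score(freq=1.0, idf=8.924298,
                         query_norm=0.07885913, field_norm=0.5,
                         boost=1.0713968)
total = khoo + csg

print(f"author_txt:khoo  -> {khoo:.4f}")
print(f"author_txt:c.s.g -> {csg:.4f}")
print(f"sum              -> {total:.4f}")
```

Entry 5 scores lower than entries 1 to 4 only because its fieldNorm is 0.4375 rather than 0.5; every other factor is identical.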
    

Similar documents (content)

  1. Huang, L.; Milne, D.; Frank, E.; Witten, I.H.: Learning a concept-based document similarity measure (2012) 0.40
    0.3986201 = sum of:
      0.3986201 = product of:
        0.69758517 = sum of:
          0.10021979 = weight(abstract_txt:documents in 372) [ClassicSimilarity], result of:
            0.10021979 = score(doc=372,freq=2.0), product of:
              0.22009693 = queryWeight, product of:
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0534047 = queryNorm
              0.4553439 = fieldWeight in 372, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.078125 = fieldNorm(doc=372)
          0.1045708 = weight(abstract_txt:human in 372) [ClassicSimilarity], result of:
            0.1045708 = score(doc=372,freq=1.0), product of:
              0.28527382 = queryWeight, product of:
                1.1384763 = boost
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.0534047 = queryNorm
              0.3665629 = fieldWeight in 372, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.078125 = fieldNorm(doc=372)
          0.14889227 = weight(abstract_txt:machine in 372) [ClassicSimilarity], result of:
            0.14889227 = score(doc=372,freq=1.0), product of:
              0.36105198 = queryWeight, product of:
                1.2807899 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0534047 = queryNorm
              0.41238457 = fieldWeight in 372, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.078125 = fieldNorm(doc=372)
          0.3439023 = weight(abstract_txt:clustering in 372) [ClassicSimilarity], result of:
            0.3439023 = score(doc=372,freq=2.0), product of:
              0.5007278 = queryWeight, product of:
                1.5083213 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0534047 = queryNorm
              0.6868049 = fieldWeight in 372, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.078125 = fieldNorm(doc=372)
        0.5714286 = coord(4/7)
    
  2. Zheng, H.-T.; Borchert, C.; Kim, H.-G.: Exploiting corpus-related ontologies for conceptualizing document corpora (2009) 0.38
    0.37989578 = sum of:
      0.37989578 = product of:
        0.6648176 = sum of:
          0.08017584 = weight(abstract_txt:documents in 3165) [ClassicSimilarity], result of:
            0.08017584 = score(doc=3165,freq=2.0), product of:
              0.22009693 = queryWeight, product of:
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0534047 = queryNorm
              0.36427513 = fieldWeight in 3165, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
          0.20227586 = weight(abstract_txt:concepts in 3165) [ClassicSimilarity], result of:
            0.20227586 = score(doc=3165,freq=7.0), product of:
              0.268653 = queryWeight, product of:
                1.1048132 = boost
                4.5532694 = idf(docFreq=1265, maxDocs=44218)
                0.0534047 = queryNorm
              0.7529261 = fieldWeight in 3165, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                4.5532694 = idf(docFreq=1265, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
          0.107244045 = weight(abstract_txt:across in 3165) [ClassicSimilarity], result of:
            0.107244045 = score(doc=3165,freq=1.0), product of:
              0.33664882 = queryWeight, product of:
                1.2367489 = boost
                5.097017 = idf(docFreq=734, maxDocs=44218)
                0.0534047 = queryNorm
              0.31856355 = fieldWeight in 3165, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.097017 = idf(docFreq=734, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
          0.2751218 = weight(abstract_txt:clustering in 3165) [ClassicSimilarity], result of:
            0.2751218 = score(doc=3165,freq=2.0), product of:
              0.5007278 = queryWeight, product of:
                1.5083213 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0534047 = queryNorm
              0.5494439 = fieldWeight in 3165, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
        0.5714286 = coord(4/7)
    
  3. Golub, K.: Automatic subject indexing of text (2019) 0.32
    0.31895578 = sum of:
      0.31895578 = product of:
        0.5581726 = sum of:
          0.05669288 = weight(abstract_txt:documents in 5268) [ClassicSimilarity], result of:
            0.05669288 = score(doc=5268,freq=1.0), product of:
              0.22009693 = queryWeight, product of:
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0534047 = queryNorm
              0.2575814 = fieldWeight in 5268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=5268)
          0.107244045 = weight(abstract_txt:across in 5268) [ClassicSimilarity], result of:
            0.107244045 = score(doc=5268,freq=1.0), product of:
              0.33664882 = queryWeight, product of:
                1.2367489 = boost
                5.097017 = idf(docFreq=734, maxDocs=44218)
                0.0534047 = queryNorm
              0.31856355 = fieldWeight in 5268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.097017 = idf(docFreq=734, maxDocs=44218)
                0.0625 = fieldNorm(doc=5268)
          0.11911381 = weight(abstract_txt:machine in 5268) [ClassicSimilarity], result of:
            0.11911381 = score(doc=5268,freq=1.0), product of:
              0.36105198 = queryWeight, product of:
                1.2807899 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0534047 = queryNorm
              0.32990766 = fieldWeight in 5268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0625 = fieldNorm(doc=5268)
          0.2751218 = weight(abstract_txt:clustering in 5268) [ClassicSimilarity], result of:
            0.2751218 = score(doc=5268,freq=2.0), product of:
              0.5007278 = queryWeight, product of:
                1.5083213 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0534047 = queryNorm
              0.5494439 = fieldWeight in 5268, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0625 = fieldNorm(doc=5268)
        0.5714286 = coord(4/7)
    
  4. Baker, T.: Languages for Dublin Core (1998) 0.31
    0.3111256 = sum of:
      0.3111256 = product of:
        0.43557584 = sum of:
          0.03543305 = weight(abstract_txt:documents in 1257) [ClassicSimilarity], result of:
            0.03543305 = score(doc=1257,freq=1.0), product of:
              0.22009693 = queryWeight, product of:
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0534047 = queryNorm
              0.16098839 = fieldWeight in 1257, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1257)
          0.09056097 = weight(abstract_txt:human in 1257) [ClassicSimilarity], result of:
            0.09056097 = score(doc=1257,freq=3.0), product of:
              0.28527382 = queryWeight, product of:
                1.1384763 = boost
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.0534047 = queryNorm
              0.3174528 = fieldWeight in 1257, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.692005 = idf(docFreq=1101, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1257)
          0.09479124 = weight(abstract_txt:across in 1257) [ClassicSimilarity], result of:
            0.09479124 = score(doc=1257,freq=2.0), product of:
              0.33664882 = queryWeight, product of:
                1.2367489 = boost
                5.097017 = idf(docFreq=734, maxDocs=44218)
                0.0534047 = queryNorm
              0.28157306 = fieldWeight in 1257, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.097017 = idf(docFreq=734, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1257)
          0.074446134 = weight(abstract_txt:machine in 1257) [ClassicSimilarity], result of:
            0.074446134 = score(doc=1257,freq=1.0), product of:
              0.36105198 = queryWeight, product of:
                1.2807899 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0534047 = queryNorm
              0.20619228 = fieldWeight in 1257, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1257)
          0.14034444 = weight(abstract_txt:versus in 1257) [ClassicSimilarity], result of:
            0.14034444 = score(doc=1257,freq=1.0), product of:
              0.5509833 = queryWeight, product of:
                1.582203 = boost
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.0534047 = queryNorm
              0.2547163 = fieldWeight in 1257, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1257)
        0.71428573 = coord(5/7)
    
  5. Losee, R.M.; Church Jr., L.: Are two document clusters better than one? : the cluster performance question for information retrieval (2005) 0.31
    0.3058615 = sum of:
      0.3058615 = product of:
        0.7136768 = sum of:
          0.08503932 = weight(abstract_txt:documents in 3270) [ClassicSimilarity], result of:
            0.08503932 = score(doc=3270,freq=1.0), product of:
              0.22009693 = queryWeight, product of:
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0534047 = queryNorm
              0.38637212 = fieldWeight in 3270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
          0.29181078 = weight(abstract_txt:clustering in 3270) [ClassicSimilarity], result of:
            0.29181078 = score(doc=3270,freq=1.0), product of:
              0.5007278 = queryWeight, product of:
                1.5083213 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0534047 = queryNorm
              0.5827733 = fieldWeight in 3270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
          0.33682668 = weight(abstract_txt:versus in 3270) [ClassicSimilarity], result of:
            0.33682668 = score(doc=3270,freq=1.0), product of:
              0.5509833 = queryWeight, product of:
                1.582203 = boost
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.0534047 = queryNorm
              0.6113192 = fieldWeight in 3270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
        0.42857143 = coord(3/7)
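
In the content-based trees the summed term scores are additionally multiplied by a coordination factor, coord(m/n), where m is the number of query terms matched in the abstract and n is the number of query terms (7 here). The sketch below recomputes the final score of entry 1 (doc 372) from the four term scores printed in its tree; the variable names are illustrative assumptions, not identifiers from the system that produced this output.

```python
def coord(matched_terms, total_terms):
    """Fraction of query terms that the document matched."""
    return matched_terms / total_terms


# Per-term scores copied from the explain tree of entry 1 (doc 372).
term_scores_372 = {
    "documents": 0.10021979,
    "human": 0.1045708,
    "machine": 0.14889227,
    "clustering": 0.3439023,
}

raw_sum = sum(term_scores_372.values())              # "sum of:" line
final_score = raw_sum * coord(len(term_scores_372), 7)  # "product of:" line

print(f"sum of term scores: {raw_sum:.6f}")
print(f"after coord(4/7):   {final_score:.6f}")
```

This is why entry 5 (Losee and Church) ranks last despite having the largest raw sum (0.7136768): it matches only 3 of the 7 query terms, so its sum is scaled by 0.42857143.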