Document (#34288)

Author
Khoo, C.S.G.
Ou, S.
Title
Machine versus human clustering of concepts across documents
Source
Culture and identity in knowledge organization: Proceedings of the Tenth International ISKO Conference 5-8 August 2008, Montreal, Canada. Ed. by Clément Arsenault and Joseph T. Tennis
Imprint
Würzburg : Ergon Verlag
Year
2008
Pages
pp. 333-339
Series
Advances in knowledge organization; vol.11
Content
An automated method for clustering terms/concepts from a set of documents on the same topic was developed for the purpose of multidocument summarization. The clustering method combines lexical overlap between multiword terms, syntactic constraints and semantic considerations based on a manually constructed taxonomy to generate hierarchically organized clusters of terms. This study evaluates the machine-generated clusters by calculating the proportion of overlap with two sets of human-generated clusters for 15 topics. It was found that the overlap between machine-generated clusters and individual human-generated clusters is higher than that between the two sets of human-generated clusters. A qualitative analysis of the human clustering found that the clusters formed are either semantic-conceptual based or lexical based (similar to machine clustering). The semantic-conceptual based clusters tended to differ between human coders. This raises the question of whether machine-generated clustering can be evaluated by comparison with human clustering.
Footnote
See: http://www.ergon-verlag.de/isko_ko/tocs/0497f79b0c0b3ed06/0497f79b0c0b5550a/index.php.
Theme
Automatic classification

Similar documents (author)

  1. Khoo, C.S.G.; Poo, D.C.C.: ¬An expert system approach to online catalog subject searching (1994) 6.12
    6.119851 = sum of:
      6.119851 = sum of:
        2.730059 = weight(author_txt:khoo in 303) [ClassicSimilarity], result of:
          2.730059 = score(doc=303,freq=1.0), product of:
            0.65448314 = queryWeight, product of:
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.07845035 = queryNorm
            4.1713204 = fieldWeight in 303, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.5 = fieldNorm(doc=303)
        3.389792 = weight(author_txt:c.s.g in 303) [ClassicSimilarity], result of:
          3.389792 = score(doc=303,freq=1.0), product of:
            0.75607663 = queryWeight, product of:
              1.0748149 = boost
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.07845035 = queryNorm
            4.4833975 = fieldWeight in 303, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.5 = fieldNorm(doc=303)
    
  2. Chaudhry, A.S.; Khoo, C.S.G.: ¬A survey of the top-level categories in the structure of corporate Websites (2008) 6.12
    6.119851 = sum of:
      6.119851 = sum of:
        2.730059 = weight(author_txt:khoo in 4260) [ClassicSimilarity], result of:
          2.730059 = score(doc=4260,freq=1.0), product of:
            0.65448314 = queryWeight, product of:
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.07845035 = queryNorm
            4.1713204 = fieldWeight in 4260, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.5 = fieldNorm(doc=4260)
        3.389792 = weight(author_txt:c.s.g in 4260) [ClassicSimilarity], result of:
          3.389792 = score(doc=4260,freq=1.0), product of:
            0.75607663 = queryWeight, product of:
              1.0748149 = boost
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.07845035 = queryNorm
            4.4833975 = fieldWeight in 4260, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.5 = fieldNorm(doc=4260)
    
  3. Poo, D.C.C.; Khoo, C.S.G.: Online Catalog Subject Searching (2009) 6.12
    6.119851 = sum of:
      6.119851 = sum of:
        2.730059 = weight(author_txt:khoo in 316) [ClassicSimilarity], result of:
          2.730059 = score(doc=316,freq=1.0), product of:
            0.65448314 = queryWeight, product of:
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.07845035 = queryNorm
            4.1713204 = fieldWeight in 316, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.5 = fieldNorm(doc=316)
        3.389792 = weight(author_txt:c.s.g in 316) [ClassicSimilarity], result of:
          3.389792 = score(doc=316,freq=1.0), product of:
            0.75607663 = queryWeight, product of:
              1.0748149 = boost
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.07845035 = queryNorm
            4.4833975 = fieldWeight in 316, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.5 = fieldNorm(doc=316)
    
  4. Sun, G.; Khoo, C.S.G.: ¬A framework to represent variables and values in social science research data sets to support data curation and reuse (2018) 6.12
    6.119851 = sum of:
      6.119851 = sum of:
        2.730059 = weight(author_txt:khoo in 745) [ClassicSimilarity], result of:
          2.730059 = score(doc=745,freq=1.0), product of:
            0.65448314 = queryWeight, product of:
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.07845035 = queryNorm
            4.1713204 = fieldWeight in 745, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.5 = fieldNorm(doc=745)
        3.389792 = weight(author_txt:c.s.g in 745) [ClassicSimilarity], result of:
          3.389792 = score(doc=745,freq=1.0), product of:
            0.75607663 = queryWeight, product of:
              1.0748149 = boost
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.07845035 = queryNorm
            4.4833975 = fieldWeight in 745, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.5 = fieldNorm(doc=745)
    
  5. Khoo, C.S.G.; Wan, K.-W.: ¬A simple relevancy-ranking strategy for an interface to Boolean OPACs (2004) 5.35
    5.35487 = sum of:
      5.35487 = sum of:
        2.3888016 = weight(author_txt:khoo in 4510) [ClassicSimilarity], result of:
          2.3888016 = score(doc=4510,freq=1.0), product of:
            0.65448314 = queryWeight, product of:
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.07845035 = queryNorm
            3.6499054 = fieldWeight in 4510, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.342641 = idf(docFreq=27, maxDocs=43254)
              0.4375 = fieldNorm(doc=4510)
        2.966068 = weight(author_txt:c.s.g in 4510) [ClassicSimilarity], result of:
          2.966068 = score(doc=4510,freq=1.0), product of:
            0.75607663 = queryWeight, product of:
              1.0748149 = boost
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.07845035 = queryNorm
            3.9229727 = fieldWeight in 4510, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.966795 = idf(docFreq=14, maxDocs=43254)
              0.4375 = fieldNorm(doc=4510)
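
The score trees above follow the classic Lucene TF-IDF pattern: each term weight is queryWeight (boost x idf x queryNorm) times fieldWeight (tf x idf x fieldNorm), and the document score is the sum over matching terms. The sketch below recomputes the first entry (doc=303) from the listed inputs. It is an illustration only: the function names are hypothetical, and the idf formula 1 + ln(maxDocs / (docFreq + 1)) is an assumption inferred from the listed idf values, not taken from this catalog's configuration.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed idf form; it reproduces the listed 8.342641 for docFreq=27.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_weight(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

MAX_DOCS = 43254          # maxDocs as listed in the trees
QUERY_NORM = 0.07845035   # queryNorm of the author query

# Entry 1 (doc=303): both author tokens match once, fieldNorm = 0.5.
khoo = term_weight(1.0, 27, MAX_DOCS, QUERY_NORM, 0.5)
csg = term_weight(1.0, 14, MAX_DOCS, QUERY_NORM, 0.5, boost=1.0748149)
total = khoo + csg
print(round(khoo, 4), round(csg, 4), round(total, 4))
```

Under these assumptions the three values should land close to the listed 2.730059, 3.389792 and 6.119851. Entry 5 scores lower only because its fieldNorm is 0.4375 instead of 0.5; every other input is identical.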
    

Similar documents (content)

  1. Huang, L.; Milne, D.; Frank, E.; Witten, I.H.: Learning a concept-based document similarity measure (2012) 0.40
    0.40069306 = sum of:
      0.40069306 = product of:
        0.7012128 = sum of:
          0.09924779 = weight(abstract_txt:documents in 1837) [ClassicSimilarity], result of:
            0.09924779 = score(doc=1837,freq=2.0), product of:
              0.21822566 = queryWeight, product of:
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.05301464 = queryNorm
              0.45479432 = fieldWeight in 1837, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.078125 = fieldNorm(doc=1837)
          0.10594633 = weight(abstract_txt:human in 1837) [ClassicSimilarity], result of:
            0.10594633 = score(doc=1837,freq=1.0), product of:
              0.2871833 = queryWeight, product of:
                1.1471671 = boost
                4.7221165 = idf(docFreq=1045, maxDocs=43254)
                0.05301464 = queryNorm
              0.36891535 = fieldWeight in 1837, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7221165 = idf(docFreq=1045, maxDocs=43254)
                0.078125 = fieldNorm(doc=1837)
          0.15153992 = weight(abstract_txt:machine in 1837) [ClassicSimilarity], result of:
            0.15153992 = score(doc=1837,freq=1.0), product of:
              0.3645748 = queryWeight, product of:
                1.2925293 = boost
                5.320475 = idf(docFreq=574, maxDocs=43254)
                0.05301464 = queryNorm
              0.4156621 = fieldWeight in 1837, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.320475 = idf(docFreq=574, maxDocs=43254)
                0.078125 = fieldNorm(doc=1837)
          0.3444788 = weight(abstract_txt:clustering in 1837) [ClassicSimilarity], result of:
            0.3444788 = score(doc=1837,freq=2.0), product of:
              0.5002651 = queryWeight, product of:
                1.5140743 = boost
                6.232427 = idf(docFreq=230, maxDocs=43254)
                0.05301464 = queryNorm
              0.68859243 = fieldWeight in 1837, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.232427 = idf(docFreq=230, maxDocs=43254)
                0.078125 = fieldNorm(doc=1837)
        0.5714286 = coord(4/7)
    
  2. Zheng, H.-T.; Borchert, C.; Kim, H.-G.: Exploiting corpus-related ontologies for conceptualizing document corpora (2009) 0.38
    0.38134107 = sum of:
      0.38134107 = product of:
        0.66734684 = sum of:
          0.07939823 = weight(abstract_txt:documents in 166) [ClassicSimilarity], result of:
            0.07939823 = score(doc=166,freq=2.0), product of:
              0.21822566 = queryWeight, product of:
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.05301464 = queryNorm
              0.36383545 = fieldWeight in 166, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
          0.20335896 = weight(abstract_txt:concepts in 166) [ClassicSimilarity], result of:
            0.20335896 = score(doc=166,freq=7.0), product of:
              0.26906145 = queryWeight, product of:
                1.110383 = boost
                4.570701 = idf(docFreq=1216, maxDocs=43254)
                0.05301464 = queryNorm
              0.75580865 = fieldWeight in 166, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                4.570701 = idf(docFreq=1216, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
          0.10900663 = weight(abstract_txt:across in 166) [ClassicSimilarity], result of:
            0.10900663 = score(doc=166,freq=1.0), product of:
              0.3396335 = queryWeight, product of:
                1.2475339 = boost
                5.135259 = idf(docFreq=691, maxDocs=43254)
                0.05301464 = queryNorm
              0.3209537 = fieldWeight in 166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.135259 = idf(docFreq=691, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
          0.27558303 = weight(abstract_txt:clustering in 166) [ClassicSimilarity], result of:
            0.27558303 = score(doc=166,freq=2.0), product of:
              0.5002651 = queryWeight, product of:
                1.5140743 = boost
                6.232427 = idf(docFreq=230, maxDocs=43254)
                0.05301464 = queryNorm
              0.55087394 = fieldWeight in 166, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.232427 = idf(docFreq=230, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
        0.5714286 = coord(4/7)
    
  3. Golub, K.: Automatic subject indexing of text (2019) 0.32
    0.32112265 = sum of:
      0.32112265 = product of:
        0.56196463 = sum of:
          0.056143027 = weight(abstract_txt:documents in 269) [ClassicSimilarity], result of:
            0.056143027 = score(doc=269,freq=1.0), product of:
              0.21822566 = queryWeight, product of:
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.05301464 = queryNorm
              0.25727051 = fieldWeight in 269, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.0625 = fieldNorm(doc=269)
          0.10900663 = weight(abstract_txt:across in 269) [ClassicSimilarity], result of:
            0.10900663 = score(doc=269,freq=1.0), product of:
              0.3396335 = queryWeight, product of:
                1.2475339 = boost
                5.135259 = idf(docFreq=691, maxDocs=43254)
                0.05301464 = queryNorm
              0.3209537 = fieldWeight in 269, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.135259 = idf(docFreq=691, maxDocs=43254)
                0.0625 = fieldNorm(doc=269)
          0.12123194 = weight(abstract_txt:machine in 269) [ClassicSimilarity], result of:
            0.12123194 = score(doc=269,freq=1.0), product of:
              0.3645748 = queryWeight, product of:
                1.2925293 = boost
                5.320475 = idf(docFreq=574, maxDocs=43254)
                0.05301464 = queryNorm
              0.3325297 = fieldWeight in 269, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.320475 = idf(docFreq=574, maxDocs=43254)
                0.0625 = fieldNorm(doc=269)
          0.27558303 = weight(abstract_txt:clustering in 269) [ClassicSimilarity], result of:
            0.27558303 = score(doc=269,freq=2.0), product of:
              0.5002651 = queryWeight, product of:
                1.5140743 = boost
                6.232427 = idf(docFreq=230, maxDocs=43254)
                0.05301464 = queryNorm
              0.55087394 = fieldWeight in 269, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.232427 = idf(docFreq=230, maxDocs=43254)
                0.0625 = fieldNorm(doc=269)
        0.5714286 = coord(4/7)
    
  4. Baker, T.: Languages for Dublin Core (1998) 0.31
    0.31295082 = sum of:
      0.31295082 = product of:
        0.43813115 = sum of:
          0.035089392 = weight(abstract_txt:documents in 3258) [ClassicSimilarity], result of:
            0.035089392 = score(doc=3258,freq=1.0), product of:
              0.21822566 = queryWeight, product of:
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.05301464 = queryNorm
              0.16079408 = fieldWeight in 3258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.0390625 = fieldNorm(doc=3258)
          0.091752216 = weight(abstract_txt:human in 3258) [ClassicSimilarity], result of:
            0.091752216 = score(doc=3258,freq=3.0), product of:
              0.2871833 = queryWeight, product of:
                1.1471671 = boost
                4.7221165 = idf(docFreq=1045, maxDocs=43254)
                0.05301464 = queryNorm
              0.31949008 = fieldWeight in 3258, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.7221165 = idf(docFreq=1045, maxDocs=43254)
                0.0390625 = fieldNorm(doc=3258)
          0.09634915 = weight(abstract_txt:across in 3258) [ClassicSimilarity], result of:
            0.09634915 = score(doc=3258,freq=2.0), product of:
              0.3396335 = queryWeight, product of:
                1.2475339 = boost
                5.135259 = idf(docFreq=691, maxDocs=43254)
                0.05301464 = queryNorm
              0.28368565 = fieldWeight in 3258, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.135259 = idf(docFreq=691, maxDocs=43254)
                0.0390625 = fieldNorm(doc=3258)
          0.07576996 = weight(abstract_txt:machine in 3258) [ClassicSimilarity], result of:
            0.07576996 = score(doc=3258,freq=1.0), product of:
              0.3645748 = queryWeight, product of:
                1.2925293 = boost
                5.320475 = idf(docFreq=574, maxDocs=43254)
                0.05301464 = queryNorm
              0.20783105 = fieldWeight in 3258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.320475 = idf(docFreq=574, maxDocs=43254)
                0.0390625 = fieldNorm(doc=3258)
          0.13917045 = weight(abstract_txt:versus in 3258) [ClassicSimilarity], result of:
            0.13917045 = score(doc=3258,freq=1.0), product of:
              0.5467892 = queryWeight, product of:
                1.582913 = boost
                6.5157895 = idf(docFreq=173, maxDocs=43254)
                0.05301464 = queryNorm
              0.25452304 = fieldWeight in 3258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5157895 = idf(docFreq=173, maxDocs=43254)
                0.0390625 = fieldNorm(doc=3258)
        0.71428573 = coord(5/7)
    
  5. Losee, R.M.; Church Jr., L.: Are two document clusters better than one? : the cluster performance question for information retrieval (2005) 0.30
    0.3045101 = sum of:
      0.3045101 = product of:
        0.71052355 = sum of:
          0.08421454 = weight(abstract_txt:documents in 5271) [ClassicSimilarity], result of:
            0.08421454 = score(doc=5271,freq=1.0), product of:
              0.21822566 = queryWeight, product of:
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.05301464 = queryNorm
              0.38590577 = fieldWeight in 5271, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1163282 = idf(docFreq=1916, maxDocs=43254)
                0.09375 = fieldNorm(doc=5271)
          0.29229993 = weight(abstract_txt:clustering in 5271) [ClassicSimilarity], result of:
            0.29229993 = score(doc=5271,freq=1.0), product of:
              0.5002651 = queryWeight, product of:
                1.5140743 = boost
                6.232427 = idf(docFreq=230, maxDocs=43254)
                0.05301464 = queryNorm
              0.58429 = fieldWeight in 5271, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.232427 = idf(docFreq=230, maxDocs=43254)
                0.09375 = fieldNorm(doc=5271)
          0.33400908 = weight(abstract_txt:versus in 5271) [ClassicSimilarity], result of:
            0.33400908 = score(doc=5271,freq=1.0), product of:
              0.5467892 = queryWeight, product of:
                1.582913 = boost
                6.5157895 = idf(docFreq=173, maxDocs=43254)
                0.05301464 = queryNorm
              0.6108553 = fieldWeight in 5271, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5157895 = idf(docFreq=173, maxDocs=43254)
                0.09375 = fieldNorm(doc=5271)
        0.42857143 = coord(3/7)
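
In the content-based trees, the summed term weights are additionally multiplied by a coordination factor, coord(matched/total), so a document matching 4 of the 7 query terms keeps 4/7 of its raw sum. The sketch below recomputes the first content entry (doc=1837) this way. The helper names are hypothetical, and the idf formula 1 + ln(maxDocs / (docFreq + 1)) is again an assumption inferred from the listed values.

```python
import math

MAX_DOCS = 43254          # maxDocs as listed in the trees
QUERY_NORM = 0.05301464   # queryNorm of the content query
QUERY_TERMS = 7           # denominator in coord(4/7)

def term_weight(freq, doc_freq, field_norm, boost=1.0):
    # (boost * idf * queryNorm) * (sqrt(freq) * idf * fieldNorm)
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    return (boost * idf * QUERY_NORM) * (math.sqrt(freq) * idf * field_norm)

def coord_score(matches, query_terms=QUERY_TERMS):
    # Raw sum of matching term weights, scaled by matched / total query terms.
    raw = sum(term_weight(**m) for m in matches)
    return raw * len(matches) / query_terms

# Inputs copied from the tree for doc=1837 (fieldNorm = 0.078125).
matches_1837 = [
    dict(freq=2.0, doc_freq=1916, field_norm=0.078125),                   # documents
    dict(freq=1.0, doc_freq=1045, field_norm=0.078125, boost=1.1471671),  # human
    dict(freq=1.0, doc_freq=574, field_norm=0.078125, boost=1.2925293),   # machine
    dict(freq=2.0, doc_freq=230, field_norm=0.078125, boost=1.5140743),   # clustering
]

print(round(coord_score(matches_1837), 4))
```

Under these assumptions the result should be close to the listed 0.40069306. The coord factor also explains why entry 4 (coord 5/7) ranks near entries with larger raw term weights but fewer matched terms.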