Document (#34288)

Author
Khoo, C.S.G.
Ou, S.
Title
Machine versus human clustering of concepts across documents
Source
Culture and identity in knowledge organization: Proceedings of the Tenth International ISKO Conference 5-8 August 2008, Montreal, Canada. Ed. by Clément Arsenault and Joseph T. Tennis
Imprint
Würzburg : Ergon Verlag
Year
2008
Pages
pp. 333-339
Series
Advances in knowledge organization; vol.11
Content
An automated method for clustering terms/concepts from a set of documents on the same topic was developed for the purpose of multidocument summarization. The clustering method combines lexical overlap between multiword terms, syntactic constraints, and semantic considerations based on a manually constructed taxonomy to generate hierarchically organized clusters of terms. This study evaluates the machine-generated clusters by calculating the proportion of overlap with two sets of human-generated clusters for 15 topics. It was found that the overlap between machine-generated clusters and individual human-generated clusters is higher than that between the two sets of human-generated clusters. A qualitative analysis of the human clustering found that the clusters formed are either semantic-conceptual based or lexical based (similar to machine clustering). The semantic-conceptual based clusters tended to differ between human coders. This raises the question of whether machine-generated clustering can be evaluated by comparison with human clustering.
Footnote
See: http://www.ergon-verlag.de/isko_ko/tocs/0497f79b0c0b3ed06/0497f79b0c0b5550a/index.php.
Theme
Automatic classification

Similar documents (author)

  1. Khoo, C.S.G.; Poo, D.C.C.: An expert system approach to online catalog subject searching (1994) 6.11
    6.1113977 = sum of:
      6.1113977 = sum of:
        2.7258348 = weight(author_txt:khoo in 7303) [ClassicSimilarity], result of:
          2.7258348 = score(doc=7303,freq=1.0), product of:
            0.6544083 = queryWeight, product of:
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.07855395 = queryNorm
            4.165343 = fieldWeight in 7303, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.5 = fieldNorm(doc=7303)
        3.385563 = weight(author_txt:c.s.g in 7303) [ClassicSimilarity], result of:
          3.385563 = score(doc=7303,freq=1.0), product of:
            0.7561414 = queryWeight, product of:
              1.0749224 = boost
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.07855395 = queryNorm
            4.4774203 = fieldWeight in 7303, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.5 = fieldNorm(doc=7303)
    
  2. Chaudhry, A.S.; Khoo, C.S.G.: A survey of the top-level categories in the structure of corporate Websites (2008) 6.11
    6.1113977 = sum of:
      6.1113977 = sum of:
        2.7258348 = weight(author_txt:khoo in 4260) [ClassicSimilarity], result of:
          2.7258348 = score(doc=4260,freq=1.0), product of:
            0.6544083 = queryWeight, product of:
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.07855395 = queryNorm
            4.165343 = fieldWeight in 4260, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.5 = fieldNorm(doc=4260)
        3.385563 = weight(author_txt:c.s.g in 4260) [ClassicSimilarity], result of:
          3.385563 = score(doc=4260,freq=1.0), product of:
            0.7561414 = queryWeight, product of:
              1.0749224 = boost
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.07855395 = queryNorm
            4.4774203 = fieldWeight in 4260, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.5 = fieldNorm(doc=4260)
    
  3. Poo, D.C.C.; Khoo, C.S.G.: Online Catalog Subject Searching (2009) 6.11
    6.1113977 = sum of:
      6.1113977 = sum of:
        2.7258348 = weight(author_txt:khoo in 852) [ClassicSimilarity], result of:
          2.7258348 = score(doc=852,freq=1.0), product of:
            0.6544083 = queryWeight, product of:
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.07855395 = queryNorm
            4.165343 = fieldWeight in 852, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.5 = fieldNorm(doc=852)
        3.385563 = weight(author_txt:c.s.g in 852) [ClassicSimilarity], result of:
          3.385563 = score(doc=852,freq=1.0), product of:
            0.7561414 = queryWeight, product of:
              1.0749224 = boost
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.07855395 = queryNorm
            4.4774203 = fieldWeight in 852, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.5 = fieldNorm(doc=852)
    
  4. Sun, G.; Khoo, C.S.G.: A framework to represent variables and values in social science research data sets to support data curation and reuse (2018) 6.11
    6.1113977 = sum of:
      6.1113977 = sum of:
        2.7258348 = weight(author_txt:khoo in 745) [ClassicSimilarity], result of:
          2.7258348 = score(doc=745,freq=1.0), product of:
            0.6544083 = queryWeight, product of:
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.07855395 = queryNorm
            4.165343 = fieldWeight in 745, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.5 = fieldNorm(doc=745)
        3.385563 = weight(author_txt:c.s.g in 745) [ClassicSimilarity], result of:
          3.385563 = score(doc=745,freq=1.0), product of:
            0.7561414 = queryWeight, product of:
              1.0749224 = boost
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.07855395 = queryNorm
            4.4774203 = fieldWeight in 745, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.5 = fieldNorm(doc=745)
    
  5. Khoo, C.S.G.; Wan, K.-W.: A simple relevancy-ranking strategy for an interface to Boolean OPACs (2004) 5.35
    5.347473 = sum of:
      5.347473 = sum of:
        2.3851056 = weight(author_txt:khoo in 3510) [ClassicSimilarity], result of:
          2.3851056 = score(doc=3510,freq=1.0), product of:
            0.6544083 = queryWeight, product of:
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.07855395 = queryNorm
            3.644675 = fieldWeight in 3510, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.330686 = idf(docFreq=27, maxDocs=42740)
              0.4375 = fieldNorm(doc=3510)
        2.9623675 = weight(author_txt:c.s.g in 3510) [ClassicSimilarity], result of:
          2.9623675 = score(doc=3510,freq=1.0), product of:
            0.7561414 = queryWeight, product of:
              1.0749224 = boost
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.07855395 = queryNorm
            3.9177427 = fieldWeight in 3510, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.954841 = idf(docFreq=14, maxDocs=42740)
              0.4375 = fieldNorm(doc=3510)
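
The author-match scores above can be reproduced by hand. The following is a minimal sketch, not the search engine's own code: it assumes the idf form 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq), which happen to reproduce the printed values. The function names `classic_idf` and `term_score` are illustrative only.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed idf form; it matches the printed 8.330686 and 8.954841.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    idf = classic_idf(doc_freq, max_docs)
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * idf * query_norm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


# Values copied from the explain tree of the first hit (doc=7303).
khoo = term_score(1.0, 27, 42740, 0.07855395, 0.5)
csg = term_score(1.0, 14, 42740, 0.07855395, 0.5, boost=1.0749224)
total = khoo + csg

print(round(khoo, 4), round(csg, 4), round(total, 4))
```

The fifth hit scores lower only because its fieldNorm is 0.4375 rather than 0.5; every other factor is identical.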
    

Similar documents (content)

  1. Huang, L.; Milne, D.; Frank, E.; Witten, I.H.: Learning a concept-based document similarity measure (2012) 0.40
    0.4001165 = sum of:
      0.4001165 = product of:
        0.70020384 = sum of:
          0.09900025 = weight(abstract_txt:documents in 2373) [ClassicSimilarity], result of:
            0.09900025 = score(doc=2373,freq=2.0), product of:
              0.21773107 = queryWeight, product of:
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.05290656 = queryNorm
              0.45469052 = fieldWeight in 2373, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.078125 = fieldNorm(doc=2373)
          0.10651262 = weight(abstract_txt:human in 2373) [ClassicSimilarity], result of:
            0.10651262 = score(doc=2373,freq=1.0), product of:
              0.28803167 = queryWeight, product of:
                1.1501644 = boost
                4.7333736 = idf(docFreq=1021, maxDocs=42740)
                0.05290656 = queryNorm
              0.36979482 = fieldWeight in 2373, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7333736 = idf(docFreq=1021, maxDocs=42740)
                0.078125 = fieldNorm(doc=2373)
          0.15281083 = weight(abstract_txt:machine in 2373) [ClassicSimilarity], result of:
            0.15281083 = score(doc=2373,freq=1.0), product of:
              0.3663889 = queryWeight, product of:
                1.2972119 = boost
                5.3385315 = idf(docFreq=557, maxDocs=42740)
                0.05290656 = queryNorm
              0.41707277 = fieldWeight in 2373, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3385315 = idf(docFreq=557, maxDocs=42740)
                0.078125 = fieldNorm(doc=2373)
          0.34188014 = weight(abstract_txt:clustering in 2373) [ClassicSimilarity], result of:
            0.34188014 = score(doc=2373,freq=2.0), product of:
              0.49744543 = queryWeight, product of:
                1.5115151 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.05290656 = queryNorm
              0.68727165 = fieldWeight in 2373, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.078125 = fieldNorm(doc=2373)
        0.5714286 = coord(4/7)
    
  2. Zheng, H.-T.; Borchert, C.; Kim, H.-G.: Exploiting corpus-related ontologies for conceptualizing document corpora (2009) 0.38
    0.38126832 = sum of:
      0.38126832 = product of:
        0.6672195 = sum of:
          0.07920021 = weight(abstract_txt:documents in 166) [ClassicSimilarity], result of:
            0.07920021 = score(doc=166,freq=2.0), product of:
              0.21773107 = queryWeight, product of:
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.05290656 = queryNorm
              0.36375242 = fieldWeight in 166, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.0625 = fieldNorm(doc=166)
          0.20382963 = weight(abstract_txt:concepts in 166) [ClassicSimilarity], result of:
            0.20382963 = score(doc=166,freq=7.0), product of:
              0.2693137 = queryWeight, product of:
                1.1121645 = boost
                4.576989 = idf(docFreq=1194, maxDocs=42740)
                0.05290656 = queryNorm
              0.7568484 = fieldWeight in 166, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                4.576989 = idf(docFreq=1194, maxDocs=42740)
                0.0625 = fieldNorm(doc=166)
          0.110685535 = weight(abstract_txt:across in 166) [ClassicSimilarity], result of:
            0.110685535 = score(doc=166,freq=1.0), product of:
              0.34290472 = queryWeight, product of:
                1.2549503 = boost
                5.1646085 = idf(docFreq=663, maxDocs=42740)
                0.05290656 = queryNorm
              0.32278803 = fieldWeight in 166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1646085 = idf(docFreq=663, maxDocs=42740)
                0.0625 = fieldNorm(doc=166)
          0.2735041 = weight(abstract_txt:clustering in 166) [ClassicSimilarity], result of:
            0.2735041 = score(doc=166,freq=2.0), product of:
              0.49744543 = queryWeight, product of:
                1.5115151 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.05290656 = queryNorm
              0.5498173 = fieldWeight in 166, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.0625 = fieldNorm(doc=166)
        0.5714286 = coord(4/7)
    
  3. Golub, K.: Automatic subject indexing of text (2019) 0.32
    0.32139507 = sum of:
      0.32139507 = product of:
        0.56244135 = sum of:
          0.056003 = weight(abstract_txt:documents in 1269) [ClassicSimilarity], result of:
            0.056003 = score(doc=1269,freq=1.0), product of:
              0.21773107 = queryWeight, product of:
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.05290656 = queryNorm
              0.2572118 = fieldWeight in 1269, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.0625 = fieldNorm(doc=1269)
          0.110685535 = weight(abstract_txt:across in 1269) [ClassicSimilarity], result of:
            0.110685535 = score(doc=1269,freq=1.0), product of:
              0.34290472 = queryWeight, product of:
                1.2549503 = boost
                5.1646085 = idf(docFreq=663, maxDocs=42740)
                0.05290656 = queryNorm
              0.32278803 = fieldWeight in 1269, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1646085 = idf(docFreq=663, maxDocs=42740)
                0.0625 = fieldNorm(doc=1269)
          0.122248664 = weight(abstract_txt:machine in 1269) [ClassicSimilarity], result of:
            0.122248664 = score(doc=1269,freq=1.0), product of:
              0.3663889 = queryWeight, product of:
                1.2972119 = boost
                5.3385315 = idf(docFreq=557, maxDocs=42740)
                0.05290656 = queryNorm
              0.33365822 = fieldWeight in 1269, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3385315 = idf(docFreq=557, maxDocs=42740)
                0.0625 = fieldNorm(doc=1269)
          0.2735041 = weight(abstract_txt:clustering in 1269) [ClassicSimilarity], result of:
            0.2735041 = score(doc=1269,freq=2.0), product of:
              0.49744543 = queryWeight, product of:
                1.5115151 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.05290656 = queryNorm
              0.5498173 = fieldWeight in 1269, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.0625 = fieldNorm(doc=1269)
        0.5714286 = coord(4/7)
    
  4. Baker, T.: Languages for Dublin Core (1998) 0.31
    0.31455448 = sum of:
      0.31455448 = product of:
        0.44037628 = sum of:
          0.035001878 = weight(abstract_txt:documents in 3258) [ClassicSimilarity], result of:
            0.035001878 = score(doc=3258,freq=1.0), product of:
              0.21773107 = queryWeight, product of:
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.05290656 = queryNorm
              0.16075738 = fieldWeight in 3258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3258)
          0.092242636 = weight(abstract_txt:human in 3258) [ClassicSimilarity], result of:
            0.092242636 = score(doc=3258,freq=3.0), product of:
              0.28803167 = queryWeight, product of:
                1.1501644 = boost
                4.7333736 = idf(docFreq=1021, maxDocs=42740)
                0.05290656 = queryNorm
              0.3202517 = fieldWeight in 3258, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.7333736 = idf(docFreq=1021, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3258)
          0.09783311 = weight(abstract_txt:across in 3258) [ClassicSimilarity], result of:
            0.09783311 = score(doc=3258,freq=2.0), product of:
              0.34290472 = queryWeight, product of:
                1.2549503 = boost
                5.1646085 = idf(docFreq=663, maxDocs=42740)
                0.05290656 = queryNorm
              0.285307 = fieldWeight in 3258, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.1646085 = idf(docFreq=663, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3258)
          0.07640541 = weight(abstract_txt:machine in 3258) [ClassicSimilarity], result of:
            0.07640541 = score(doc=3258,freq=1.0), product of:
              0.3663889 = queryWeight, product of:
                1.2972119 = boost
                5.3385315 = idf(docFreq=557, maxDocs=42740)
                0.05290656 = queryNorm
              0.20853639 = fieldWeight in 3258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3385315 = idf(docFreq=557, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3258)
          0.13889326 = weight(abstract_txt:versus in 3258) [ClassicSimilarity], result of:
            0.13889326 = score(doc=3258,freq=1.0), product of:
              0.54573315 = queryWeight, product of:
                1.5831788 = boost
                6.515396 = idf(docFreq=171, maxDocs=42740)
                0.05290656 = queryNorm
              0.25450766 = fieldWeight in 3258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.515396 = idf(docFreq=171, maxDocs=42740)
                0.0390625 = fieldNorm(doc=3258)
        0.71428573 = coord(5/7)
    
  5. Losee, R.M.; Church Jr., L.: Are two document clusters better than one? : the cluster performance question for information retrieval (2005) 0.30
    0.30318996 = sum of:
      0.30318996 = product of:
        0.70744324 = sum of:
          0.08400451 = weight(abstract_txt:documents in 4271) [ClassicSimilarity], result of:
            0.08400451 = score(doc=4271,freq=1.0), product of:
              0.21773107 = queryWeight, product of:
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.05290656 = queryNorm
              0.3858177 = fieldWeight in 4271, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.09375 = fieldNorm(doc=4271)
          0.2900949 = weight(abstract_txt:clustering in 4271) [ClassicSimilarity], result of:
            0.2900949 = score(doc=4271,freq=1.0), product of:
              0.49744543 = queryWeight, product of:
                1.5115151 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.05290656 = queryNorm
              0.58316934 = fieldWeight in 4271, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.09375 = fieldNorm(doc=4271)
          0.33334383 = weight(abstract_txt:versus in 4271) [ClassicSimilarity], result of:
            0.33334383 = score(doc=4271,freq=1.0), product of:
              0.54573315 = queryWeight, product of:
                1.5831788 = boost
                6.515396 = idf(docFreq=171, maxDocs=42740)
                0.05290656 = queryNorm
              0.6108184 = fieldWeight in 4271, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.515396 = idf(docFreq=171, maxDocs=42740)
                0.09375 = fieldNorm(doc=4271)
        0.42857143 = coord(3/7)
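
For the content matches, each total is the sum of the matching term weights multiplied by a coordination factor, coord(matched/7), where 7 is the number of query terms shown in the trees. A minimal sketch of that last step follows, using the printed term weights as inputs; `coord_score` is a hypothetical helper name, not part of any library.

```python
def coord_score(term_weights, query_terms):
    # Sum of per-term weights, scaled by the share of query terms that matched.
    matched = len(term_weights)
    return sum(term_weights) * matched / query_terms


# Term weights copied from the explain trees of hits 1 and 5.
top_hit = coord_score([0.09900025, 0.10651262, 0.15281083, 0.34188014], 7)
fifth_hit = coord_score([0.08400451, 0.2900949, 0.33334383], 7)

print(round(top_hit, 4), round(fifth_hit, 4))
```

This is why the fifth hit ranks last despite having the largest raw sum (0.70744324): it matches only 3 of the 7 query terms.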