Document (#39497)

Author
Mu, T.
Goulermas, J.Y.
Korkontzelos, I.
Ananiadou, S.
Title
Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities
Source
Journal of the Association for Information Science and Technology. 67(2016) no.1, pp. 106-133
Year
2016
Abstract
Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases by a ranking scheme that combines multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23374/abstract.
Theme
Automatic classification

Similar documents (content)

  1. Li, T.; Zhu, S.; Ogihara, M.: Hierarchical document classification using automatically generated hierarchy (2007) 0.21
    0.20820573 = sum of:
      0.20820573 = product of:
        0.7435919 = sum of:
          0.035689887 = weight(abstract_txt:topic in 4797) [ClassicSimilarity], result of:
            0.035689887 = score(doc=4797,freq=1.0), product of:
              0.09024252 = queryWeight, product of:
                1.2682085 = boost
                5.062254 = idf(docFreq=760, maxDocs=44218)
                0.014056481 = queryNorm
              0.3954886 = fieldWeight in 4797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.062254 = idf(docFreq=760, maxDocs=44218)
                0.078125 = fieldNorm(doc=4797)
          0.07608121 = weight(abstract_txt:clusters in 4797) [ClassicSimilarity], result of:
            0.07608121 = score(doc=4797,freq=1.0), product of:
              0.14947413 = queryWeight, product of:
                1.6321801 = boost
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.014056481 = queryNorm
              0.5089925 = fieldWeight in 4797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.078125 = fieldNorm(doc=4797)
          0.032319188 = weight(abstract_txt:using in 4797) [ClassicSimilarity], result of:
            0.032319188 = score(doc=4797,freq=2.0), product of:
              0.084467195 = queryWeight, product of:
                1.7351782 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.014056481 = queryNorm
              0.38262415 = fieldWeight in 4797, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.078125 = fieldNorm(doc=4797)
          0.06470833 = weight(abstract_txt:space in 4797) [ClassicSimilarity], result of:
            0.06470833 = score(doc=4797,freq=1.0), product of:
              0.15359779 = queryWeight, product of:
                2.0263906 = boost
                5.3924384 = idf(docFreq=546, maxDocs=44218)
                0.014056481 = queryNorm
              0.42128426 = fieldWeight in 4797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3924384 = idf(docFreq=546, maxDocs=44218)
                0.078125 = fieldNorm(doc=4797)
          0.043521833 = weight(abstract_txt:document in 4797) [ClassicSimilarity], result of:
            0.043521833 = score(doc=4797,freq=1.0), product of:
              0.12977645 = queryWeight, product of:
                2.15079 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014056481 = queryNorm
              0.33536002 = fieldWeight in 4797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=4797)
          0.42386755 = weight(abstract_txt:discriminant in 4797) [ClassicSimilarity], result of:
            0.42386755 = score(doc=4797,freq=2.0), product of:
              0.42679727 = queryWeight, product of:
                3.3778589 = boost
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.014056481 = queryNorm
              0.9931356 = fieldWeight in 4797, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.078125 = fieldNorm(doc=4797)
          0.06740389 = weight(abstract_txt:documents in 4797) [ClassicSimilarity], result of:
            0.06740389 = score(doc=4797,freq=1.0), product of:
              0.20934395 = queryWeight, product of:
                3.6136765 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.014056481 = queryNorm
              0.32197678 = fieldWeight in 4797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.078125 = fieldNorm(doc=4797)
        0.28 = coord(7/25)
    
  2. Kishida, K.: High-speed rough clustering for very large document collections (2010) 0.18
    0.18358065 = sum of:
      0.18358065 = product of:
        0.6556452 = sum of:
          0.10038757 = weight(abstract_txt:stage in 3463) [ClassicSimilarity], result of:
            0.10038757 = score(doc=3463,freq=4.0), product of:
              0.13144813 = queryWeight, product of:
                1.530602 = boost
                6.1096387 = idf(docFreq=266, maxDocs=44218)
                0.014056481 = queryNorm
              0.76370484 = fieldWeight in 3463, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.1096387 = idf(docFreq=266, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.060864966 = weight(abstract_txt:clusters in 3463) [ClassicSimilarity], result of:
            0.060864966 = score(doc=3463,freq=1.0), product of:
              0.14947413 = queryWeight, product of:
                1.6321801 = boost
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.014056481 = queryNorm
              0.407194 = fieldWeight in 3463, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.018282494 = weight(abstract_txt:using in 3463) [ClassicSimilarity], result of:
            0.018282494 = score(doc=3463,freq=1.0), product of:
              0.084467195 = queryWeight, product of:
                1.7351782 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.014056481 = queryNorm
              0.21644491 = fieldWeight in 3463, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.049239334 = weight(abstract_txt:document in 3463) [ClassicSimilarity], result of:
            0.049239334 = score(doc=3463,freq=2.0), product of:
              0.12977645 = queryWeight, product of:
                2.15079 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014056481 = queryNorm
              0.37941656 = fieldWeight in 3463, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.20981124 = weight(abstract_txt:clustering in 3463) [ClassicSimilarity], result of:
            0.20981124 = score(doc=3463,freq=7.0), product of:
              0.20411332 = queryWeight, product of:
                2.3359652 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.014056481 = queryNorm
              1.0279155 = fieldWeight in 3463, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.123662055 = weight(abstract_txt:cluster in 3463) [ClassicSimilarity], result of:
            0.123662055 = score(doc=3463,freq=1.0), product of:
              0.30210325 = queryWeight, product of:
                3.2815404 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.014056481 = queryNorm
              0.40933704 = fieldWeight in 3463, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
          0.093397565 = weight(abstract_txt:documents in 3463) [ClassicSimilarity], result of:
            0.093397565 = score(doc=3463,freq=3.0), product of:
              0.20934395 = queryWeight, product of:
                3.6136765 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.014056481 = queryNorm
              0.44614407 = fieldWeight in 3463, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=3463)
        0.28 = coord(7/25)
    
  3. Losee, R.M.; Church Jr., L.: Are two document clusters better than one? : the cluster performance question for information retrieval (2005) 0.16
    0.16374879 = sum of:
      0.16374879 = product of:
        0.6822866 = sum of:
          0.12911409 = weight(abstract_txt:clusters in 3270) [ClassicSimilarity], result of:
            0.12911409 = score(doc=3270,freq=2.0), product of:
              0.14947413 = queryWeight, product of:
                1.6321801 = boost
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.014056481 = queryNorm
              0.86378884 = fieldWeight in 3270, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
          0.03878303 = weight(abstract_txt:using in 3270) [ClassicSimilarity], result of:
            0.03878303 = score(doc=3270,freq=2.0), product of:
              0.084467195 = queryWeight, product of:
                1.7351782 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.014056481 = queryNorm
              0.459149 = fieldWeight in 3270, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
          0.052226197 = weight(abstract_txt:document in 3270) [ClassicSimilarity], result of:
            0.052226197 = score(doc=3270,freq=1.0), product of:
              0.12977645 = queryWeight, product of:
                2.15079 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014056481 = queryNorm
              0.40243202 = fieldWeight in 3270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
          0.1189518 = weight(abstract_txt:clustering in 3270) [ClassicSimilarity], result of:
            0.1189518 = score(doc=3270,freq=1.0), product of:
              0.20411332 = queryWeight, product of:
                2.3359652 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.014056481 = queryNorm
              0.5827733 = fieldWeight in 3270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
          0.2623268 = weight(abstract_txt:cluster in 3270) [ClassicSimilarity], result of:
            0.2623268 = score(doc=3270,freq=2.0), product of:
              0.30210325 = queryWeight, product of:
                3.2815404 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.014056481 = queryNorm
              0.86833495 = fieldWeight in 3270, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
          0.080884665 = weight(abstract_txt:documents in 3270) [ClassicSimilarity], result of:
            0.080884665 = score(doc=3270,freq=1.0), product of:
              0.20934395 = queryWeight, product of:
                3.6136765 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.014056481 = queryNorm
              0.38637212 = fieldWeight in 3270, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.09375 = fieldNorm(doc=3270)
        0.24 = coord(6/25)
    
  4. Mather, L.A.: A linear algebra measure of cluster quality (2000) 0.15
    0.15174466 = sum of:
      0.15174466 = product of:
        0.63226944 = sum of:
          0.024435999 = weight(abstract_txt:common in 4767) [ClassicSimilarity], result of:
            0.024435999 = score(doc=4767,freq=1.0), product of:
              0.08134693 = queryWeight, product of:
                1.2040808 = boost
                4.806278 = idf(docFreq=982, maxDocs=44218)
                0.014056481 = queryNorm
              0.3003924 = fieldWeight in 4767, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.806278 = idf(docFreq=982, maxDocs=44218)
                0.0625 = fieldNorm(doc=4767)
          0.105421215 = weight(abstract_txt:clusters in 4767) [ClassicSimilarity], result of:
            0.105421215 = score(doc=4767,freq=3.0), product of:
              0.14947413 = queryWeight, product of:
                1.6321801 = boost
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.014056481 = queryNorm
              0.70528066 = fieldWeight in 4767, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.0625 = fieldNorm(doc=4767)
          0.051766664 = weight(abstract_txt:space in 4767) [ClassicSimilarity], result of:
            0.051766664 = score(doc=4767,freq=1.0), product of:
              0.15359779 = queryWeight, product of:
                2.0263906 = boost
                5.3924384 = idf(docFreq=546, maxDocs=44218)
                0.014056481 = queryNorm
              0.3370274 = fieldWeight in 4767, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3924384 = idf(docFreq=546, maxDocs=44218)
                0.0625 = fieldNorm(doc=4767)
          0.07785422 = weight(abstract_txt:document in 4767) [ClassicSimilarity], result of:
            0.07785422 = score(doc=4767,freq=5.0), product of:
              0.12977645 = queryWeight, product of:
                2.15079 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014056481 = queryNorm
              0.59991026 = fieldWeight in 4767, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=4767)
          0.15860239 = weight(abstract_txt:clustering in 4767) [ClassicSimilarity], result of:
            0.15860239 = score(doc=4767,freq=4.0), product of:
              0.20411332 = queryWeight, product of:
                2.3359652 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.014056481 = queryNorm
              0.77703106 = fieldWeight in 4767, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0625 = fieldNorm(doc=4767)
          0.21418895 = weight(abstract_txt:cluster in 4767) [ClassicSimilarity], result of:
            0.21418895 = score(doc=4767,freq=3.0), product of:
              0.30210325 = queryWeight, product of:
                3.2815404 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.014056481 = queryNorm
              0.70899254 = fieldWeight in 4767, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.0625 = fieldNorm(doc=4767)
        0.24 = coord(6/25)
    
  5. Rasmussen, E.: Clustering algorithms (1992) 0.15
    0.14774333 = sum of:
      0.14774333 = product of:
        0.61559725 = sum of:
          0.076632805 = weight(abstract_txt:neighbor in 3513) [ClassicSimilarity], result of:
            0.076632805 = score(doc=3513,freq=1.0), product of:
              0.13833144 = queryWeight, product of:
                1.1102749 = boost
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.014056481 = queryNorm
              0.55397964 = fieldWeight in 3513, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.0625 = fieldNorm(doc=3513)
          0.08607606 = weight(abstract_txt:clusters in 3513) [ClassicSimilarity], result of:
            0.08607606 = score(doc=3513,freq=2.0), product of:
              0.14947413 = queryWeight, product of:
                1.6321801 = boost
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.014056481 = queryNorm
              0.57585925 = fieldWeight in 3513, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.0625 = fieldNorm(doc=3513)
          0.073209114 = weight(abstract_txt:space in 3513) [ClassicSimilarity], result of:
            0.073209114 = score(doc=3513,freq=2.0), product of:
              0.15359779 = queryWeight, product of:
                2.0263906 = boost
                5.3924384 = idf(docFreq=546, maxDocs=44218)
                0.014056481 = queryNorm
              0.47662872 = fieldWeight in 3513, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.3924384 = idf(docFreq=546, maxDocs=44218)
                0.0625 = fieldNorm(doc=3513)
          0.049239334 = weight(abstract_txt:document in 3513) [ClassicSimilarity], result of:
            0.049239334 = score(doc=3513,freq=2.0), product of:
              0.12977645 = queryWeight, product of:
                2.15079 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014056481 = queryNorm
              0.37941656 = fieldWeight in 3513, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=3513)
          0.27651677 = weight(abstract_txt:cluster in 3513) [ClassicSimilarity], result of:
            0.27651677 = score(doc=3513,freq=5.0), product of:
              0.30210325 = queryWeight, product of:
                3.2815404 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.014056481 = queryNorm
              0.9153055 = fieldWeight in 3513, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.0625 = fieldNorm(doc=3513)
          0.05392311 = weight(abstract_txt:documents in 3513) [ClassicSimilarity], result of:
            0.05392311 = score(doc=3513,freq=1.0), product of:
              0.20934395 = queryWeight, product of:
                3.6136765 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.014056481 = queryNorm
              0.2575814 = fieldWeight in 3513, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=3513)
        0.24 = coord(6/25)
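
The score explanations above all follow one arithmetic pattern, which can be reproduced as a check. The following is a minimal Python sketch, assuming the formulas that the listed values are consistent with: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), term score = queryWeight * fieldWeight, and document score = (sum of term scores) * coord. The function names are hypothetical and are not part of any search library's API; the boost, queryNorm and fieldNorm values are simply copied from the listing.

```python
import math


def idf(doc_freq, max_docs):
    # Inverse document frequency, assumed form: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # One "weight(abstract_txt:term in doc)" node from the listing.
    i = idf(doc_freq, max_docs)
    query_weight = boost * i * query_norm           # queryWeight line
    field_weight = math.sqrt(freq) * i * field_norm  # fieldWeight line, tf = sqrt(freq)
    return query_weight * field_weight


def document_score(term_scores, matched_terms, query_terms):
    # Sum of the per-term scores scaled by coord(matched/total).
    return sum(term_scores) * (matched_terms / query_terms)


if __name__ == "__main__":
    # Term "topic" in document 4797 (first similar document); listed as 0.035689887.
    topic = term_score(freq=1.0, doc_freq=760, max_docs=44218,
                       boost=1.2682085, query_norm=0.014056481,
                       field_norm=0.078125)
    print(round(topic, 6))

    # The seven per-term scores listed for document 4797, with coord(7/25);
    # the listing gives 0.20820573 for the total.
    listed = [0.035689887, 0.07608121, 0.032319188, 0.06470833,
              0.043521833, 0.42386755, 0.06740389]
    print(round(document_score(listed, 7, 25), 6))
```

Because the listed values are single-precision floats, results computed this way agree with the listing only to about six or seven significant digits.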