Document (#36799)

Author
Li, T.
Zhu, S.
Ogihara, M.
Title
Hierarchical document classification using automatically generated hierarchy
Source
Journal of intelligent information systems. 29(2007) no.2, p.211-230
Year
2007
Abstract
Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.
Theme
Automatic classification (Automatisches Klassifizieren)

Similar documents (content)

  1. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.28
    0.27979186 = sum of:
      0.27979186 = product of:
        0.8743496 = sum of:
          0.018964317 = weight(abstract_txt:paper in 2596) [ClassicSimilarity], result of:
            0.018964317 = score(doc=2596,freq=1.0), product of:
              0.05778424 = queryWeight, product of:
                1.0320014 = boost
                3.500713 = idf(docFreq=3493, maxDocs=42596)
                0.015994571 = queryNorm
              0.32819185 = fieldWeight in 2596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.500713 = idf(docFreq=3493, maxDocs=42596)
                0.09375 = fieldNorm(doc=2596)
          0.041502744 = weight(abstract_txt:text in 2596) [ClassicSimilarity], result of:
            0.041502744 = score(doc=2596,freq=2.0), product of:
              0.077308245 = queryWeight, product of:
                1.1936816 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.015994571 = queryNorm
              0.5368476 = fieldWeight in 2596, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.09375 = fieldNorm(doc=2596)
          0.120223194 = weight(abstract_txt:flat in 2596) [ClassicSimilarity], result of:
            0.120223194 = score(doc=2596,freq=1.0), product of:
              0.1570966 = queryWeight, product of:
                1.203217 = boost
                8.163008 = idf(docFreq=32, maxDocs=42596)
                0.015994571 = queryNorm
              0.765282 = fieldWeight in 2596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.163008 = idf(docFreq=32, maxDocs=42596)
                0.09375 = fieldNorm(doc=2596)
          0.03692248 = weight(abstract_txt:structure in 2596) [ClassicSimilarity], result of:
            0.03692248 = score(doc=2596,freq=1.0), product of:
              0.090097316 = queryWeight, product of:
                1.2886398 = boost
                4.371271 = idf(docFreq=1462, maxDocs=42596)
                0.015994571 = queryNorm
              0.40980667 = fieldWeight in 2596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.371271 = idf(docFreq=1462, maxDocs=42596)
                0.09375 = fieldNorm(doc=2596)
          0.06225375 = weight(abstract_txt:categories in 2596) [ClassicSimilarity], result of:
            0.06225375 = score(doc=2596,freq=1.0), product of:
              0.1276326 = queryWeight, product of:
                1.5337564 = boost
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.015994571 = queryNorm
              0.48775744 = fieldWeight in 2596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.09375 = fieldNorm(doc=2596)
          0.027938621 = weight(abstract_txt:using in 2596) [ClassicSimilarity], result of:
            0.027938621 = score(doc=2596,freq=1.0), product of:
              0.085641645 = queryWeight, product of:
                1.5387346 = boost
                3.4797552 = idf(docFreq=3567, maxDocs=42596)
                0.015994571 = queryNorm
              0.32622704 = fieldWeight in 2596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4797552 = idf(docFreq=3567, maxDocs=42596)
                0.09375 = fieldNorm(doc=2596)
          0.2748 = weight(abstract_txt:categorization in 2596) [ClassicSimilarity], result of:
            0.2748 = score(doc=2596,freq=2.0), product of:
              0.31204426 = queryWeight, product of:
                2.9371717 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.015994571 = queryNorm
              0.8806443 = fieldWeight in 2596, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.09375 = fieldNorm(doc=2596)
          0.29174447 = weight(abstract_txt:hierarchical in 2596) [ClassicSimilarity], result of:
            0.29174447 = score(doc=2596,freq=3.0), product of:
              0.31224054 = queryWeight, product of:
                3.39262 = boost
                5.7541537 = idf(docFreq=366, maxDocs=42596)
                0.015994571 = queryNorm
              0.93435806 = fieldWeight in 2596, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.7541537 = idf(docFreq=366, maxDocs=42596)
                0.09375 = fieldNorm(doc=2596)
        0.32 = coord(8/25)
    
  2. Li, T.; Zhu, S.; Ogihara, M.: Text categorization via generalized discriminant analysis (2008) 0.25
    0.24918208 = sum of:
      0.24918208 = product of:
        0.6921724 = sum of:
          0.012642878 = weight(abstract_txt:paper in 3299) [ClassicSimilarity], result of:
            0.012642878 = score(doc=3299,freq=1.0), product of:
              0.05778424 = queryWeight, product of:
                1.0320014 = boost
                3.500713 = idf(docFreq=3493, maxDocs=42596)
                0.015994571 = queryNorm
              0.21879457 = fieldWeight in 3299, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.500713 = idf(docFreq=3493, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
          0.014161767 = weight(abstract_txt:been in 3299) [ClassicSimilarity], result of:
            0.014161767 = score(doc=3299,freq=1.0), product of:
              0.062324256 = queryWeight, product of:
                1.0717763 = boost
                3.6356356 = idf(docFreq=3052, maxDocs=42596)
                0.015994571 = queryNorm
              0.22722723 = fieldWeight in 3299, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6356356 = idf(docFreq=3052, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
          0.043747734 = weight(abstract_txt:text in 3299) [ClassicSimilarity], result of:
            0.043747734 = score(doc=3299,freq=5.0), product of:
              0.077308245 = queryWeight, product of:
                1.1936816 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.015994571 = queryNorm
              0.56588703 = fieldWeight in 3299, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
          0.023083014 = weight(abstract_txt:document in 3299) [ClassicSimilarity], result of:
            0.023083014 = score(doc=3299,freq=1.0), product of:
              0.08631914 = queryWeight, product of:
                1.2613312 = boost
                4.2786365 = idf(docFreq=1604, maxDocs=42596)
                0.015994571 = queryNorm
              0.26741478 = fieldWeight in 3299, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2786365 = idf(docFreq=1604, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
          0.024614988 = weight(abstract_txt:structure in 3299) [ClassicSimilarity], result of:
            0.024614988 = score(doc=3299,freq=1.0), product of:
              0.090097316 = queryWeight, product of:
                1.2886398 = boost
                4.371271 = idf(docFreq=1462, maxDocs=42596)
                0.015994571 = queryNorm
              0.27320445 = fieldWeight in 3299, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.371271 = idf(docFreq=1462, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
          0.018625747 = weight(abstract_txt:using in 3299) [ClassicSimilarity], result of:
            0.018625747 = score(doc=3299,freq=1.0), product of:
              0.085641645 = queryWeight, product of:
                1.5387346 = boost
                3.4797552 = idf(docFreq=3567, maxDocs=42596)
                0.015994571 = queryNorm
              0.2174847 = fieldWeight in 3299, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4797552 = idf(docFreq=3567, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
          0.08483477 = weight(abstract_txt:classification in 3299) [ClassicSimilarity], result of:
            0.08483477 = score(doc=3299,freq=9.0), product of:
              0.11312996 = queryWeight, product of:
                1.7685202 = boost
                3.9994013 = idf(docFreq=2121, maxDocs=42596)
                0.015994571 = queryNorm
              0.74988776 = fieldWeight in 3299, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                3.9994013 = idf(docFreq=2121, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
          0.21137753 = weight(abstract_txt:discriminant in 3299) [ClassicSimilarity], result of:
            0.21137753 = score(doc=3299,freq=1.0), product of:
              0.37781975 = queryWeight, product of:
                2.6388695 = boost
                8.951466 = idf(docFreq=14, maxDocs=42596)
                0.015994571 = queryNorm
              0.5594666 = fieldWeight in 3299, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.951466 = idf(docFreq=14, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
          0.25908396 = weight(abstract_txt:categorization in 3299) [ClassicSimilarity], result of:
            0.25908396 = score(doc=3299,freq=4.0), product of:
              0.31204426 = queryWeight, product of:
                2.9371717 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.015994571 = queryNorm
              0.83027947 = fieldWeight in 3299, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.0625 = fieldNorm(doc=3299)
        0.36 = coord(9/25)
    
  3. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.19
    0.18591861 = sum of:
      0.18591861 = product of:
        0.663995 = sum of:
          0.024455726 = weight(abstract_txt:text in 4390) [ClassicSimilarity], result of:
            0.024455726 = score(doc=4390,freq=1.0), product of:
              0.077308245 = queryWeight, product of:
                1.1936816 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.015994571 = queryNorm
              0.31634048 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.028853768 = weight(abstract_txt:document in 4390) [ClassicSimilarity], result of:
            0.028853768 = score(doc=4390,freq=1.0), product of:
              0.08631914 = queryWeight, product of:
                1.2613312 = boost
                4.2786365 = idf(docFreq=1604, maxDocs=42596)
                0.015994571 = queryNorm
              0.33426848 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2786365 = idf(docFreq=1604, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.1266464 = weight(abstract_txt:witnessed in 4390) [ClassicSimilarity], result of:
            0.1266464 = score(doc=4390,freq=1.0), product of:
              0.183664 = queryWeight, product of:
                1.3009859 = boost
                8.826303 = idf(docFreq=16, maxDocs=42596)
                0.015994571 = queryNorm
              0.68955487 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.826303 = idf(docFreq=16, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.14632459 = weight(abstract_txt:booming in 4390) [ClassicSimilarity], result of:
            0.14632459 = score(doc=4390,freq=1.0), product of:
              0.20222755 = queryWeight, product of:
                1.3651512 = boost
                9.2616205 = idf(docFreq=10, maxDocs=42596)
                0.015994571 = queryNorm
              0.7235641 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.2616205 = idf(docFreq=10, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.073366754 = weight(abstract_txt:categories in 4390) [ClassicSimilarity], result of:
            0.073366754 = score(doc=4390,freq=2.0), product of:
              0.1276326 = queryWeight, product of:
                1.5337564 = boost
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.015994571 = queryNorm
              0.5748277 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.035347823 = weight(abstract_txt:classification in 4390) [ClassicSimilarity], result of:
            0.035347823 = score(doc=4390,freq=1.0), product of:
              0.11312996 = queryWeight, product of:
                1.7685202 = boost
                3.9994013 = idf(docFreq=2121, maxDocs=42596)
                0.015994571 = queryNorm
              0.31245324 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9994013 = idf(docFreq=2121, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.229 = weight(abstract_txt:categorization in 4390) [ClassicSimilarity], result of:
            0.229 = score(doc=4390,freq=2.0), product of:
              0.31204426 = queryWeight, product of:
                2.9371717 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.015994571 = queryNorm
              0.73387027 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
        0.28 = coord(7/25)
    
  4. Desale, S.K.; Kumbhar, R.: Research on automatic classification of documents in library environment : a literature review (2013) 0.16
    0.15830886 = sum of:
      0.15830886 = product of:
        0.5653888 = sum of:
          0.015803596 = weight(abstract_txt:paper in 2072) [ClassicSimilarity], result of:
            0.015803596 = score(doc=2072,freq=1.0), product of:
              0.05778424 = queryWeight, product of:
                1.0320014 = boost
                3.500713 = idf(docFreq=3493, maxDocs=42596)
                0.015994571 = queryNorm
              0.2734932 = fieldWeight in 2072, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.500713 = idf(docFreq=3493, maxDocs=42596)
                0.078125 = fieldNorm(doc=2072)
          0.024455726 = weight(abstract_txt:text in 2072) [ClassicSimilarity], result of:
            0.024455726 = score(doc=2072,freq=1.0), product of:
              0.077308245 = queryWeight, product of:
                1.1936816 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.015994571 = queryNorm
              0.31634048 = fieldWeight in 2072, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.078125 = fieldNorm(doc=2072)
          0.028853768 = weight(abstract_txt:document in 2072) [ClassicSimilarity], result of:
            0.028853768 = score(doc=2072,freq=1.0), product of:
              0.08631914 = queryWeight, product of:
                1.2613312 = boost
                4.2786365 = idf(docFreq=1604, maxDocs=42596)
                0.015994571 = queryNorm
              0.33426848 = fieldWeight in 2072, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2786365 = idf(docFreq=1604, maxDocs=42596)
                0.078125 = fieldNorm(doc=2072)
          0.040325925 = weight(abstract_txt:using in 2072) [ClassicSimilarity], result of:
            0.040325925 = score(doc=2072,freq=3.0), product of:
              0.085641645 = queryWeight, product of:
                1.5387346 = boost
                3.4797552 = idf(docFreq=3567, maxDocs=42596)
                0.015994571 = queryNorm
              0.47086817 = fieldWeight in 2072, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4797552 = idf(docFreq=3567, maxDocs=42596)
                0.078125 = fieldNorm(doc=2072)
          0.08658413 = weight(abstract_txt:classification in 2072) [ClassicSimilarity], result of:
            0.08658413 = score(doc=2072,freq=6.0), product of:
              0.11312996 = queryWeight, product of:
                1.7685202 = boost
                3.9994013 = idf(docFreq=2121, maxDocs=42596)
                0.015994571 = queryNorm
              0.765351 = fieldWeight in 2072, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.9994013 = idf(docFreq=2121, maxDocs=42596)
                0.078125 = fieldNorm(doc=2072)
          0.229 = weight(abstract_txt:categorization in 2072) [ClassicSimilarity], result of:
            0.229 = score(doc=2072,freq=2.0), product of:
              0.31204426 = queryWeight, product of:
                2.9371717 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.015994571 = queryNorm
              0.73387027 = fieldWeight in 2072, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.078125 = fieldNorm(doc=2072)
          0.14036563 = weight(abstract_txt:hierarchical in 2072) [ClassicSimilarity], result of:
            0.14036563 = score(doc=2072,freq=1.0), product of:
              0.31224054 = queryWeight, product of:
                3.39262 = boost
                5.7541537 = idf(docFreq=366, maxDocs=42596)
                0.015994571 = queryNorm
              0.44954327 = fieldWeight in 2072, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7541537 = idf(docFreq=366, maxDocs=42596)
                0.078125 = fieldNorm(doc=2072)
        0.28 = coord(7/25)
    
  5. Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.15
    0.14923576 = sum of:
      0.14923576 = product of:
        0.74617875 = sum of:
          0.024455726 = weight(abstract_txt:text in 274) [ClassicSimilarity], result of:
            0.024455726 = score(doc=274,freq=1.0), product of:
              0.077308245 = queryWeight, product of:
                1.1936816 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.015994571 = queryNorm
              0.31634048 = fieldWeight in 274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.078125 = fieldNorm(doc=274)
          0.079040125 = weight(abstract_txt:classification in 274) [ClassicSimilarity], result of:
            0.079040125 = score(doc=274,freq=5.0), product of:
              0.11312996 = queryWeight, product of:
                1.7685202 = boost
                3.9994013 = idf(docFreq=2121, maxDocs=42596)
                0.015994571 = queryNorm
              0.69866663 = fieldWeight in 274, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.9994013 = idf(docFreq=2121, maxDocs=42596)
                0.078125 = fieldNorm(doc=274)
          0.16192746 = weight(abstract_txt:categorization in 274) [ClassicSimilarity], result of:
            0.16192746 = score(doc=274,freq=1.0), product of:
              0.31204426 = queryWeight, product of:
                2.9371717 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.015994571 = queryNorm
              0.51892465 = fieldWeight in 274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.078125 = fieldNorm(doc=274)
          0.20002422 = weight(abstract_txt:hierarchies in 274) [ClassicSimilarity], result of:
            0.20002422 = score(doc=274,freq=1.0), product of:
              0.35924515 = queryWeight, product of:
                3.1514955 = boost
                7.126916 = idf(docFreq=92, maxDocs=42596)
                0.015994571 = queryNorm
              0.5567903 = fieldWeight in 274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.126916 = idf(docFreq=92, maxDocs=42596)
                0.078125 = fieldNorm(doc=274)
          0.28073126 = weight(abstract_txt:hierarchical in 274) [ClassicSimilarity], result of:
            0.28073126 = score(doc=274,freq=4.0), product of:
              0.31224054 = queryWeight, product of:
                3.39262 = boost
                5.7541537 = idf(docFreq=366, maxDocs=42596)
                0.015994571 = queryNorm
              0.89908653 = fieldWeight in 274, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.7541537 = idf(docFreq=366, maxDocs=42596)
                0.078125 = fieldNorm(doc=274)
        0.2 = coord(5/25)
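
The score breakdowns above all follow the same arithmetic, which can be reproduced by hand. The following is a minimal sketch, assuming the classic TF-IDF relations that the figures themselves suggest: tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final score equal to coord times the sum of the per-term products. The function names (`idf`, `term_score`, `document_score`) are hypothetical helpers and not part of any library; the input values are copied from entry 1 (doc 2596).

```python
import math

def idf(doc_freq, max_docs):
    # Assumed idf form; it matches the listed values, e.g. docFreq=32 gives about 8.163.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, boost, max_docs, query_norm, field_norm):
    # One "weight(abstract_txt:term in doc)" node: queryWeight * fieldWeight.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

def document_score(terms, matched, total_terms, max_docs, query_norm, field_norm):
    # Top-level node: coord(matched/total_terms) * sum of the term scores.
    coord = matched / total_terms
    return coord * sum(
        term_score(freq, doc_freq, boost, max_docs, query_norm, field_norm)
        for (_term, freq, doc_freq, boost) in terms
    )

# Values copied from entry 1 (doc 2596): (term, termFreq, docFreq, boost).
MAX_DOCS = 42596
QUERY_NORM = 0.015994571
FIELD_NORM_2596 = 0.09375
TERMS_2596 = [
    ("paper", 1.0, 3493, 1.0320014),
    ("text", 2.0, 2018, 1.1936816),
    ("flat", 1.0, 32, 1.203217),
    ("structure", 1.0, 1462, 1.2886398),
    ("categories", 1.0, 636, 1.5337564),
    ("using", 1.0, 3567, 1.5387346),
    ("categorization", 2.0, 150, 2.9371717),
    ("hierarchical", 3.0, 366, 3.39262),
]

paper_score = term_score(1.0, 3493, 1.0320014, MAX_DOCS, QUERY_NORM, FIELD_NORM_2596)
score_2596 = document_score(TERMS_2596, 8, 25, MAX_DOCS, QUERY_NORM, FIELD_NORM_2596)

if __name__ == "__main__":
    print(f"paper term score: {paper_score:.6f}")
    print(f"doc 2596 score:   {score_2596:.6f}")
```

Under these assumptions the recomputed values should agree with the listed 0.018964317 (term "paper") and 0.27979186 (entry 1) up to small rounding differences, since the listing appears to use single-precision arithmetic. The other entries can be checked the same way by swapping in their own termFreq, fieldNorm, and coord values.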