Document (#36799)

Author
Li, T.
Zhu, S.
Ogihara, M.
Title
Hierarchical document classification using automatically generated hierarchy
Source
Journal of intelligent information systems. 29(2007) no.2, pp.211-230
Year
2007
Abstract
Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.
Theme
Automatic classification (Automatisches Klassifizieren)

Similar documents (content)

  1. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.28
    0.27913663 = sum of:
      0.27913663 = product of:
        0.87230194 = sum of:
          0.018820811 = weight(abstract_txt:paper in 3596) [ClassicSimilarity], result of:
            0.018820811 = score(doc=3596,freq=1.0), product of:
              0.05752215 = queryWeight, product of:
                1.0276885 = boost
                3.4900522 = idf(docFreq=3585, maxDocs=43254)
                0.016037686 = queryNorm
              0.3271924 = fieldWeight in 3596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4900522 = idf(docFreq=3585, maxDocs=43254)
                0.09375 = fieldNorm(doc=3596)
          0.041585155 = weight(abstract_txt:text in 3596) [ClassicSimilarity], result of:
            0.041585155 = score(doc=3596,freq=2.0), product of:
              0.077450655 = queryWeight, product of:
                1.1924948 = boost
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.016037686 = queryNorm
              0.5369245 = fieldWeight in 3596, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.09375 = fieldNorm(doc=3596)
          0.11976861 = weight(abstract_txt:flat in 3596) [ClassicSimilarity], result of:
            0.11976861 = score(doc=3596,freq=1.0), product of:
              0.15678152 = queryWeight, product of:
                1.1997102 = boost
                8.148484 = idf(docFreq=33, maxDocs=43254)
                0.016037686 = queryNorm
              0.7639204 = fieldWeight in 3596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.148484 = idf(docFreq=33, maxDocs=43254)
                0.09375 = fieldNorm(doc=3596)
          0.036887895 = weight(abstract_txt:structure in 3596) [ClassicSimilarity], result of:
            0.036887895 = score(doc=3596,freq=1.0), product of:
              0.0900877 = queryWeight, product of:
                1.2861058 = boost
                4.367643 = idf(docFreq=1490, maxDocs=43254)
                0.016037686 = queryNorm
              0.4094665 = fieldWeight in 3596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.367643 = idf(docFreq=1490, maxDocs=43254)
                0.09375 = fieldNorm(doc=3596)
          0.062286176 = weight(abstract_txt:categories in 3596) [ClassicSimilarity], result of:
            0.062286176 = score(doc=3596,freq=1.0), product of:
              0.12774307 = queryWeight, product of:
                1.5314845 = boost
                5.2009544 = idf(docFreq=647, maxDocs=43254)
                0.016037686 = queryNorm
              0.48758948 = fieldWeight in 3596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2009544 = idf(docFreq=647, maxDocs=43254)
                0.09375 = fieldNorm(doc=3596)
          0.027823867 = weight(abstract_txt:using in 3596) [ClassicSimilarity], result of:
            0.027823867 = score(doc=3596,freq=1.0), product of:
              0.08545123 = queryWeight, product of:
                1.5340825 = boost
                3.4731848 = idf(docFreq=3646, maxDocs=43254)
                0.016037686 = queryNorm
              0.32561108 = fieldWeight in 3596, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4731848 = idf(docFreq=3646, maxDocs=43254)
                0.09375 = fieldNorm(doc=3596)
          0.2746878 = weight(abstract_txt:categorization in 3596) [ClassicSimilarity], result of:
            0.2746878 = score(doc=3596,freq=2.0), product of:
              0.31212094 = queryWeight, product of:
                2.9319127 = boost
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.016037686 = queryNorm
              0.8800684 = fieldWeight in 3596, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.09375 = fieldNorm(doc=3596)
          0.2904416 = weight(abstract_txt:hierarchical in 3596) [ClassicSimilarity], result of:
            0.2904416 = score(doc=3596,freq=3.0), product of:
              0.31147155 = queryWeight, product of:
                3.3819573 = boost
                5.7426 = idf(docFreq=376, maxDocs=43254)
                0.016037686 = queryNorm
              0.932482 = fieldWeight in 3596, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.7426 = idf(docFreq=376, maxDocs=43254)
                0.09375 = fieldNorm(doc=3596)
        0.32 = coord(8/25)
    
  2. Li, T.; Zhu, S.; Ogihara, M.: Text categorization via generalized discriminant analysis (2008) 0.25
    0.24966654 = sum of:
      0.24966654 = product of:
        0.69351816 = sum of:
          0.012547207 = weight(abstract_txt:paper in 4120) [ClassicSimilarity], result of:
            0.012547207 = score(doc=4120,freq=1.0), product of:
              0.05752215 = queryWeight, product of:
                1.0276885 = boost
                3.4900522 = idf(docFreq=3585, maxDocs=43254)
                0.016037686 = queryNorm
              0.21812826 = fieldWeight in 4120, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4900522 = idf(docFreq=3585, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
          0.014142981 = weight(abstract_txt:been in 4120) [ClassicSimilarity], result of:
            0.014142981 = score(doc=4120,freq=1.0), product of:
              0.062301386 = queryWeight, product of:
                1.0695295 = boost
                3.6321454 = idf(docFreq=3110, maxDocs=43254)
                0.016037686 = queryNorm
              0.22700909 = fieldWeight in 4120, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6321454 = idf(docFreq=3110, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
          0.0438346 = weight(abstract_txt:text in 4120) [ClassicSimilarity], result of:
            0.0438346 = score(doc=4120,freq=5.0), product of:
              0.077450655 = queryWeight, product of:
                1.1924948 = boost
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.016037686 = queryNorm
              0.5659681 = fieldWeight in 4120, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
          0.023206728 = weight(abstract_txt:document in 4120) [ClassicSimilarity], result of:
            0.023206728 = score(doc=4120,freq=1.0), product of:
              0.08667217 = queryWeight, product of:
                1.2614899 = boost
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.016037686 = queryNorm
              0.26775292 = fieldWeight in 4120, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
          0.02459193 = weight(abstract_txt:structure in 4120) [ClassicSimilarity], result of:
            0.02459193 = score(doc=4120,freq=1.0), product of:
              0.0900877 = queryWeight, product of:
                1.2861058 = boost
                4.367643 = idf(docFreq=1490, maxDocs=43254)
                0.016037686 = queryNorm
              0.27297768 = fieldWeight in 4120, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.367643 = idf(docFreq=1490, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
          0.018549245 = weight(abstract_txt:using in 4120) [ClassicSimilarity], result of:
            0.018549245 = score(doc=4120,freq=1.0), product of:
              0.08545123 = queryWeight, product of:
                1.5340825 = boost
                3.4731848 = idf(docFreq=3646, maxDocs=43254)
                0.016037686 = queryNorm
              0.21707405 = fieldWeight in 4120, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4731848 = idf(docFreq=3646, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
          0.084871545 = weight(abstract_txt:classification in 4120) [ClassicSimilarity], result of:
            0.084871545 = score(doc=4120,freq=9.0), product of:
              0.11322128 = queryWeight, product of:
                1.7658491 = boost
                3.9979079 = idf(docFreq=2157, maxDocs=43254)
                0.016037686 = queryNorm
              0.74960774 = fieldWeight in 4120, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                3.9979079 = idf(docFreq=2157, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
          0.21279578 = weight(abstract_txt:discriminant in 4120) [ClassicSimilarity], result of:
            0.21279578 = score(doc=4120,freq=1.0), product of:
              0.3797045 = queryWeight, product of:
                2.6403823 = boost
                8.966795 = idf(docFreq=14, maxDocs=43254)
                0.016037686 = queryNorm
              0.5604247 = fieldWeight in 4120, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.966795 = idf(docFreq=14, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
          0.25897816 = weight(abstract_txt:categorization in 4120) [ClassicSimilarity], result of:
            0.25897816 = score(doc=4120,freq=4.0), product of:
              0.31212094 = queryWeight, product of:
                2.9319127 = boost
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.016037686 = queryNorm
              0.82973653 = fieldWeight in 4120, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.0625 = fieldNorm(doc=4120)
        0.36 = coord(9/25)
    
  3. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.19
    0.18647265 = sum of:
      0.18647265 = product of:
        0.6659738 = sum of:
          0.024504285 = weight(abstract_txt:text in 5390) [ClassicSimilarity], result of:
            0.024504285 = score(doc=5390,freq=1.0), product of:
              0.077450655 = queryWeight, product of:
                1.1924948 = boost
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.016037686 = queryNorm
              0.31638578 = fieldWeight in 5390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.078125 = fieldNorm(doc=5390)
          0.029008407 = weight(abstract_txt:document in 5390) [ClassicSimilarity], result of:
            0.029008407 = score(doc=5390,freq=1.0), product of:
              0.08667217 = queryWeight, product of:
                1.2614899 = boost
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.016037686 = queryNorm
              0.33469114 = fieldWeight in 5390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.078125 = fieldNorm(doc=5390)
          0.1275054 = weight(abstract_txt:witnessed in 5390) [ClassicSimilarity], result of:
            0.1275054 = score(doc=5390,freq=1.0), product of:
              0.18458913 = queryWeight, product of:
                1.3017632 = boost
                8.841632 = idf(docFreq=16, maxDocs=43254)
                0.016037686 = queryNorm
              0.6907525 = fieldWeight in 5390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.841632 = idf(docFreq=16, maxDocs=43254)
                0.078125 = fieldNorm(doc=5390)
          0.14728108 = weight(abstract_txt:booming in 5390) [ClassicSimilarity], result of:
            0.14728108 = score(doc=5390,freq=1.0), product of:
              0.2032131 = queryWeight, product of:
                1.3658556 = boost
                9.27695 = idf(docFreq=10, maxDocs=43254)
                0.016037686 = queryNorm
              0.7247617 = fieldWeight in 5390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.27695 = idf(docFreq=10, maxDocs=43254)
                0.078125 = fieldNorm(doc=5390)
          0.07340496 = weight(abstract_txt:categories in 5390) [ClassicSimilarity], result of:
            0.07340496 = score(doc=5390,freq=2.0), product of:
              0.12774307 = queryWeight, product of:
                1.5314845 = boost
                5.2009544 = idf(docFreq=647, maxDocs=43254)
                0.016037686 = queryNorm
              0.5746297 = fieldWeight in 5390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2009544 = idf(docFreq=647, maxDocs=43254)
                0.078125 = fieldNorm(doc=5390)
          0.035363145 = weight(abstract_txt:classification in 5390) [ClassicSimilarity], result of:
            0.035363145 = score(doc=5390,freq=1.0), product of:
              0.11322128 = queryWeight, product of:
                1.7658491 = boost
                3.9979079 = idf(docFreq=2157, maxDocs=43254)
                0.016037686 = queryNorm
              0.31233656 = fieldWeight in 5390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9979079 = idf(docFreq=2157, maxDocs=43254)
                0.078125 = fieldNorm(doc=5390)
          0.2289065 = weight(abstract_txt:categorization in 5390) [ClassicSimilarity], result of:
            0.2289065 = score(doc=5390,freq=2.0), product of:
              0.31212094 = queryWeight, product of:
                2.9319127 = boost
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.016037686 = queryNorm
              0.7333904 = fieldWeight in 5390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.078125 = fieldNorm(doc=5390)
        0.28 = coord(7/25)
    
  4. Desale, S.K.; Kumbhar, R.: Research on automatic classification of documents in library environment : a literature review (2013) 0.16
    0.1580947 = sum of:
      0.1580947 = product of:
        0.56462395 = sum of:
          0.015684009 = weight(abstract_txt:paper in 2536) [ClassicSimilarity], result of:
            0.015684009 = score(doc=2536,freq=1.0), product of:
              0.05752215 = queryWeight, product of:
                1.0276885 = boost
                3.4900522 = idf(docFreq=3585, maxDocs=43254)
                0.016037686 = queryNorm
              0.27266032 = fieldWeight in 2536, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4900522 = idf(docFreq=3585, maxDocs=43254)
                0.078125 = fieldNorm(doc=2536)
          0.024504285 = weight(abstract_txt:text in 2536) [ClassicSimilarity], result of:
            0.024504285 = score(doc=2536,freq=1.0), product of:
              0.077450655 = queryWeight, product of:
                1.1924948 = boost
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.016037686 = queryNorm
              0.31638578 = fieldWeight in 2536, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.078125 = fieldNorm(doc=2536)
          0.029008407 = weight(abstract_txt:document in 2536) [ClassicSimilarity], result of:
            0.029008407 = score(doc=2536,freq=1.0), product of:
              0.08667217 = queryWeight, product of:
                1.2614899 = boost
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.016037686 = queryNorm
              0.33469114 = fieldWeight in 2536, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.078125 = fieldNorm(doc=2536)
          0.04016029 = weight(abstract_txt:using in 2536) [ClassicSimilarity], result of:
            0.04016029 = score(doc=2536,freq=3.0), product of:
              0.08545123 = queryWeight, product of:
                1.5340825 = boost
                3.4731848 = idf(docFreq=3646, maxDocs=43254)
                0.016037686 = queryNorm
              0.46997908 = fieldWeight in 2536, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4731848 = idf(docFreq=3646, maxDocs=43254)
                0.078125 = fieldNorm(doc=2536)
          0.08662166 = weight(abstract_txt:classification in 2536) [ClassicSimilarity], result of:
            0.08662166 = score(doc=2536,freq=6.0), product of:
              0.11322128 = queryWeight, product of:
                1.7658491 = boost
                3.9979079 = idf(docFreq=2157, maxDocs=43254)
                0.016037686 = queryNorm
              0.7650652 = fieldWeight in 2536, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.9979079 = idf(docFreq=2157, maxDocs=43254)
                0.078125 = fieldNorm(doc=2536)
          0.2289065 = weight(abstract_txt:categorization in 2536) [ClassicSimilarity], result of:
            0.2289065 = score(doc=2536,freq=2.0), product of:
              0.31212094 = queryWeight, product of:
                2.9319127 = boost
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.016037686 = queryNorm
              0.7333904 = fieldWeight in 2536, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.078125 = fieldNorm(doc=2536)
          0.13973878 = weight(abstract_txt:hierarchical in 2536) [ClassicSimilarity], result of:
            0.13973878 = score(doc=2536,freq=1.0), product of:
              0.31147155 = queryWeight, product of:
                3.3819573 = boost
                5.7426 = idf(docFreq=376, maxDocs=43254)
                0.016037686 = queryNorm
              0.4486406 = fieldWeight in 2536, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7426 = idf(docFreq=376, maxDocs=43254)
                0.078125 = fieldNorm(doc=2536)
        0.28 = coord(7/25)
    
  5. Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.15
    0.14930968 = sum of:
      0.14930968 = product of:
        0.7465484 = sum of:
          0.024504285 = weight(abstract_txt:text in 274) [ClassicSimilarity], result of:
            0.024504285 = score(doc=274,freq=1.0), product of:
              0.077450655 = queryWeight, product of:
                1.1924948 = boost
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.016037686 = queryNorm
              0.31638578 = fieldWeight in 274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049738 = idf(docFreq=2048, maxDocs=43254)
                0.078125 = fieldNorm(doc=274)
          0.0790744 = weight(abstract_txt:classification in 274) [ClassicSimilarity], result of:
            0.0790744 = score(doc=274,freq=5.0), product of:
              0.11322128 = queryWeight, product of:
                1.7658491 = boost
                3.9979079 = idf(docFreq=2157, maxDocs=43254)
                0.016037686 = queryNorm
              0.6984058 = fieldWeight in 274, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.9979079 = idf(docFreq=2157, maxDocs=43254)
                0.078125 = fieldNorm(doc=274)
          0.16186135 = weight(abstract_txt:categorization in 274) [ClassicSimilarity], result of:
            0.16186135 = score(doc=274,freq=1.0), product of:
              0.31212094 = queryWeight, product of:
                2.9319127 = boost
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.016037686 = queryNorm
              0.5185853 = fieldWeight in 274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6378922 = idf(docFreq=153, maxDocs=43254)
                0.078125 = fieldNorm(doc=274)
          0.20163079 = weight(abstract_txt:hierarchies in 274) [ClassicSimilarity], result of:
            0.20163079 = score(doc=274,freq=1.0), product of:
              0.3613533 = queryWeight, product of:
                3.1546817 = boost
                7.1422453 = idf(docFreq=92, maxDocs=43254)
                0.016037686 = queryNorm
              0.5579879 = fieldWeight in 274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.1422453 = idf(docFreq=92, maxDocs=43254)
                0.078125 = fieldNorm(doc=274)
          0.27947757 = weight(abstract_txt:hierarchical in 274) [ClassicSimilarity], result of:
            0.27947757 = score(doc=274,freq=4.0), product of:
              0.31147155 = queryWeight, product of:
                3.3819573 = boost
                5.7426 = idf(docFreq=376, maxDocs=43254)
                0.016037686 = queryNorm
              0.8972812 = fieldWeight in 274, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.7426 = idf(docFreq=376, maxDocs=43254)
                0.078125 = fieldNorm(doc=274)
        0.2 = coord(5/25)
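
The score trees above all follow the same arithmetic, and it may help to see it spelled out once. The sketch below recomputes the `paper` term weight and the final score for doc 3596 from the numbers in the listing. It assumes the standard ClassicSimilarity formulas (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1)), coord as matched terms over query terms). The function names `idf`, `term_score`, and `document_score` are hypothetical helpers chosen for illustration and are not part of any library.

```python
import math

def idf(doc_freq, max_docs):
    # Inverse document frequency as ClassicSimilarity defines it.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight, matching the two "product of" branches.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm           # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight

def document_score(term_scores, matched, total):
    # Final score = sum of the term weights, scaled by coord(matched/total).
    return sum(term_scores) * (matched / total)

# Values copied from the listing for doc 3596.
MAX_DOCS = 43254
QUERY_NORM = 0.016037686

paper = term_score(freq=1.0, doc_freq=3585, max_docs=MAX_DOCS,
                   boost=1.0276885, query_norm=QUERY_NORM, field_norm=0.09375)

# The eight per-term weights reported for doc 3596, in listing order.
weights_3596 = [0.018820811, 0.041585155, 0.11976861, 0.036887895,
                0.062286176, 0.027823867, 0.2746878, 0.2904416]
total_3596 = document_score(weights_3596, matched=8, total=25)

print(f"paper term weight: {paper:.9f}")
print(f"doc 3596 score:    {total_3596:.8f}")
```

The printed values should agree with the listing (0.018820811 and 0.27913663) up to rounding, since Lucene computes in single precision while Python uses double precision.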