Document (#36798)

Author
Li, T.
Zhu, S.
Ogihara, M.
Title
Hierarchical document classification using automatically generated hierarchy
Source
Journal of intelligent information systems. 29(2007) no.2, pp.211-230
Year
2007
Abstract
Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.
Theme
Automatic classification

Similar documents (content)

  1. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.28
    0.27644098 = sum of:
      0.27644098 = product of:
        0.8638781 = sum of:
          0.01851122 = weight(abstract_txt:paper in 1595) [ClassicSimilarity], result of:
            0.01851122 = score(doc=1595,freq=1.0), product of:
              0.05694595 = queryWeight, product of:
                1.0188369 = boost
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.016119711 = queryNorm
              0.3250665 = fieldWeight in 1595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.09375 = fieldNorm(doc=1595)
          0.04152768 = weight(abstract_txt:text in 1595) [ClassicSimilarity], result of:
            0.04152768 = score(doc=1595,freq=2.0), product of:
              0.077455916 = queryWeight, product of:
                1.18823 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.016119711 = queryNorm
              0.53614604 = fieldWeight in 1595, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=1595)
          0.11857845 = weight(abstract_txt:flat in 1595) [ClassicSimilarity], result of:
            0.11857845 = score(doc=1595,freq=1.0), product of:
              0.15589541 = queryWeight, product of:
                1.1919962 = boost
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.016119711 = queryNorm
              0.7606282 = fieldWeight in 1595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.09375 = fieldNorm(doc=1595)
          0.03675308 = weight(abstract_txt:structure in 1595) [ClassicSimilarity], result of:
            0.03675308 = score(doc=1595,freq=1.0), product of:
              0.08995707 = queryWeight, product of:
                1.2805333 = boost
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.016119711 = queryNorm
              0.40856242 = fieldWeight in 1595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.09375 = fieldNorm(doc=1595)
          0.06163863 = weight(abstract_txt:categories in 1595) [ClassicSimilarity], result of:
            0.06163863 = score(doc=1595,freq=1.0), product of:
              0.1269818 = queryWeight, product of:
                1.5214019 = boost
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.016119711 = queryNorm
              0.48541313 = fieldWeight in 1595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.09375 = fieldNorm(doc=1595)
          0.027664674 = weight(abstract_txt:using in 1595) [ClassicSimilarity], result of:
            0.027664674 = score(doc=1595,freq=1.0), product of:
              0.08520929 = queryWeight, product of:
                1.5263789 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.016119711 = queryNorm
              0.32466736 = fieldWeight in 1595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.09375 = fieldNorm(doc=1595)
          0.26970002 = weight(abstract_txt:categorization in 1595) [ClassicSimilarity], result of:
            0.26970002 = score(doc=1595,freq=2.0), product of:
              0.30863646 = queryWeight, product of:
                2.9049754 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.016119711 = queryNorm
              0.87384367 = fieldWeight in 1595, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.09375 = fieldNorm(doc=1595)
          0.28950435 = weight(abstract_txt:hierarchical in 1595) [ClassicSimilarity], result of:
            0.28950435 = score(doc=1595,freq=3.0), product of:
              0.31110892 = queryWeight, product of:
                3.3677857 = boost
                5.7307405 = idf(docFreq=389, maxDocs=44218)
                0.016119711 = queryNorm
              0.9305563 = fieldWeight in 1595, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.7307405 = idf(docFreq=389, maxDocs=44218)
                0.09375 = fieldNorm(doc=1595)
        0.32 = coord(8/25)
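
The tree above can be recomputed by hand. The following is a minimal sketch, assuming the classic TF-IDF formulas that the printed values are consistent with: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm. The function and variable names are illustrative and do not come from the catalog software; all numbers are copied from the explain output for doc 1595.

```python
import math

def idf(doc_freq, max_docs):
    # Inverse document frequency, in the form the explain values match.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in each "score(doc=...)" node.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

MAX_DOCS = 44218
QUERY_NORM = 0.016119711
FIELD_NORM_1595 = 0.09375

# (term, freq, docFreq, boost), copied from the explain tree for doc 1595.
terms_1595 = [
    ("paper", 1.0, 3749, 1.0188369),
    ("text", 2.0, 2106, 1.18823),
    ("flat", 1.0, 35, 1.1919962),
    ("structure", 1.0, 1538, 1.2805333),
    ("categories", 1.0, 677, 1.5214019),
    ("using", 1.0, 3765, 1.5263789),
    ("categorization", 2.0, 164, 2.9049754),
    ("hierarchical", 3.0, 389, 3.3677857),
]

raw_sum = sum(
    term_score(freq, doc_freq, MAX_DOCS, boost, QUERY_NORM, FIELD_NORM_1595)
    for _, freq, doc_freq, boost in terms_1595
)
# coord(8/25): 8 of the 25 query terms matched this document.
final_score = raw_sum * 8 / 25

print(f"raw sum = {raw_sum:.7f}, final = {final_score:.7f}")
```

The result should agree with the listed 0.8638781 and 0.27644098 up to small rounding differences, since the search engine computes in single precision while Python uses doubles.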
    
  2. Li, T.; Zhu, S.; Ogihara, M.: Text categorization via generalized discriminant analysis (2008) 0.25
    0.2485889 = sum of:
      0.2485889 = product of:
        0.6905247 = sum of:
          0.012340814 = weight(abstract_txt:paper in 2119) [ClassicSimilarity], result of:
            0.012340814 = score(doc=2119,freq=1.0), product of:
              0.05694595 = queryWeight, product of:
                1.0188369 = boost
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.016119711 = queryNorm
              0.216711 = fieldWeight in 2119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.014015064 = weight(abstract_txt:been in 2119) [ClassicSimilarity], result of:
            0.014015064 = score(doc=2119,freq=1.0), product of:
              0.061986487 = queryWeight, product of:
                1.0629718 = boost
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.016119711 = queryNorm
              0.22609869 = fieldWeight in 2119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.043774024 = weight(abstract_txt:text in 2119) [ClassicSimilarity], result of:
            0.043774024 = score(doc=2119,freq=5.0), product of:
              0.077455916 = queryWeight, product of:
                1.18823 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.016119711 = queryNorm
              0.5651476 = fieldWeight in 2119, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.02341557 = weight(abstract_txt:document in 2119) [ClassicSimilarity], result of:
            0.02341557 = score(doc=2119,freq=1.0), product of:
              0.08727773 = queryWeight, product of:
                1.261319 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.016119711 = queryNorm
              0.26828802 = fieldWeight in 2119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.024502054 = weight(abstract_txt:structure in 2119) [ClassicSimilarity], result of:
            0.024502054 = score(doc=2119,freq=1.0), product of:
              0.08995707 = queryWeight, product of:
                1.2805333 = boost
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.016119711 = queryNorm
              0.27237496 = fieldWeight in 2119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3579993 = idf(docFreq=1538, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.018443117 = weight(abstract_txt:using in 2119) [ClassicSimilarity], result of:
            0.018443117 = score(doc=2119,freq=1.0), product of:
              0.08520929 = queryWeight, product of:
                1.5263789 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.016119711 = queryNorm
              0.21644491 = fieldWeight in 2119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.08475194 = weight(abstract_txt:classification in 2119) [ClassicSimilarity], result of:
            0.08475194 = score(doc=2119,freq=9.0), product of:
              0.11322691 = queryWeight, product of:
                1.7595179 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.016119711 = queryNorm
              0.7485141 = fieldWeight in 2119, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.21500647 = weight(abstract_txt:discriminant in 2119) [ClassicSimilarity], result of:
            0.21500647 = score(doc=2119,freq=1.0), product of:
              0.3827084 = queryWeight, product of:
                2.641236 = boost
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.016119711 = queryNorm
              0.5618023 = fieldWeight in 2119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.988837 = idf(docFreq=14, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.25427562 = weight(abstract_txt:categorization in 2119) [ClassicSimilarity], result of:
            0.25427562 = score(doc=2119,freq=4.0), product of:
              0.30863646 = queryWeight, product of:
                2.9049754 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.016119711 = queryNorm
              0.82386774 = fieldWeight in 2119, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
        0.36 = coord(9/25)
    
  3. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.19
    0.18524422 = sum of:
      0.18524422 = product of:
        0.66158646 = sum of:
          0.024470422 = weight(abstract_txt:text in 3389) [ClassicSimilarity], result of:
            0.024470422 = score(doc=3389,freq=1.0), product of:
              0.077455916 = queryWeight, product of:
                1.18823 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.016119711 = queryNorm
              0.3159271 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.029269462 = weight(abstract_txt:document in 3389) [ClassicSimilarity], result of:
            0.029269462 = score(doc=3389,freq=1.0), product of:
              0.08727773 = queryWeight, product of:
                1.261319 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.016119711 = queryNorm
              0.33536002 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.1263669 = weight(abstract_txt:witnessed in 3389) [ClassicSimilarity], result of:
            0.1263669 = score(doc=3389,freq=1.0), product of:
              0.18367042 = queryWeight, product of:
                1.2938318 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.016119711 = queryNorm
              0.688009 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.14877455 = weight(abstract_txt:booming in 3389) [ClassicSimilarity], result of:
            0.14877455 = score(doc=3389,freq=1.0), product of:
              0.20478716 = queryWeight, product of:
                1.3661853 = boost
                9.298992 = idf(docFreq=10, maxDocs=44218)
                0.016119711 = queryNorm
              0.72648376 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.298992 = idf(docFreq=10, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.07264183 = weight(abstract_txt:categories in 3389) [ClassicSimilarity], result of:
            0.07264183 = score(doc=3389,freq=2.0), product of:
              0.1269818 = queryWeight, product of:
                1.5214019 = boost
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.016119711 = queryNorm
              0.5720649 = fieldWeight in 3389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.03531331 = weight(abstract_txt:classification in 3389) [ClassicSimilarity], result of:
            0.03531331 = score(doc=3389,freq=1.0), product of:
              0.11322691 = queryWeight, product of:
                1.7595179 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.016119711 = queryNorm
              0.3118809 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.22475001 = weight(abstract_txt:categorization in 3389) [ClassicSimilarity], result of:
            0.22475001 = score(doc=3389,freq=2.0), product of:
              0.30863646 = queryWeight, product of:
                2.9049754 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.016119711 = queryNorm
              0.72820306 = fieldWeight in 3389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
        0.28 = coord(7/25)
    
  4. Desale, S.K.; Kumbhar, R.: Research on automatic classification of documents in library environment : a literature review (2013) 0.16
    0.15669748 = sum of:
      0.15669748 = product of:
        0.55963385 = sum of:
          0.015426017 = weight(abstract_txt:paper in 1071) [ClassicSimilarity], result of:
            0.015426017 = score(doc=1071,freq=1.0), product of:
              0.05694595 = queryWeight, product of:
                1.0188369 = boost
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.016119711 = queryNorm
              0.27088875 = fieldWeight in 1071, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.467376 = idf(docFreq=3749, maxDocs=44218)
                0.078125 = fieldNorm(doc=1071)
          0.024470422 = weight(abstract_txt:text in 1071) [ClassicSimilarity], result of:
            0.024470422 = score(doc=1071,freq=1.0), product of:
              0.077455916 = queryWeight, product of:
                1.18823 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.016119711 = queryNorm
              0.3159271 = fieldWeight in 1071, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=1071)
          0.029269462 = weight(abstract_txt:document in 1071) [ClassicSimilarity], result of:
            0.029269462 = score(doc=1071,freq=1.0), product of:
              0.08727773 = queryWeight, product of:
                1.261319 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.016119711 = queryNorm
              0.33536002 = fieldWeight in 1071, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=1071)
          0.03993052 = weight(abstract_txt:using in 1071) [ClassicSimilarity], result of:
            0.03993052 = score(doc=1071,freq=3.0), product of:
              0.08520929 = queryWeight, product of:
                1.5263789 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.016119711 = queryNorm
              0.46861696 = fieldWeight in 1071, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.078125 = fieldNorm(doc=1071)
          0.086499594 = weight(abstract_txt:classification in 1071) [ClassicSimilarity], result of:
            0.086499594 = score(doc=1071,freq=6.0), product of:
              0.11322691 = queryWeight, product of:
                1.7595179 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.016119711 = queryNorm
              0.76394904 = fieldWeight in 1071, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.078125 = fieldNorm(doc=1071)
          0.22475001 = weight(abstract_txt:categorization in 1071) [ClassicSimilarity], result of:
            0.22475001 = score(doc=1071,freq=2.0), product of:
              0.30863646 = queryWeight, product of:
                2.9049754 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.016119711 = queryNorm
              0.72820306 = fieldWeight in 1071, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.078125 = fieldNorm(doc=1071)
          0.13928784 = weight(abstract_txt:hierarchical in 1071) [ClassicSimilarity], result of:
            0.13928784 = score(doc=1071,freq=1.0), product of:
              0.31110892 = queryWeight, product of:
                3.3677857 = boost
                5.7307405 = idf(docFreq=389, maxDocs=44218)
                0.016119711 = queryNorm
              0.4477141 = fieldWeight in 1071, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7307405 = idf(docFreq=389, maxDocs=44218)
                0.078125 = fieldNorm(doc=1071)
        0.28 = coord(7/25)
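
The term "paper" occurs once in documents 1595, 2119 and 1071, yet its contribution differs in each tree. The only factor that changes is fieldNorm, which in this similarity model shrinks as the abstract gets longer. A small sketch of that relationship follows; the constants are copied from the explain output, the helper name is an assumption, and the length estimate is only rough because the stored norm is a coarsely quantized value.

```python
import math

# Values for the term "paper", copied from the explain trees.
QUERY_WEIGHT_PAPER = 0.05694595
IDF_PAPER = 3.467376

# fieldNorm per document, as printed in each tree.
field_norms = {1595: 0.09375, 2119: 0.0625, 1071: 0.078125}

def paper_score(field_norm, freq=1.0):
    # score = queryWeight * (tf * idf * fieldNorm), with tf = sqrt(freq)
    return QUERY_WEIGHT_PAPER * math.sqrt(freq) * IDF_PAPER * field_norm

scores = {doc: paper_score(norm) for doc, norm in field_norms.items()}

for doc, norm in field_norms.items():
    # Assumption: norm is about 1 / sqrt(number of terms in the field),
    # so 1 / norm**2 gives a rough field length.
    approx_terms = 1.0 / norm ** 2
    print(f"doc {doc}: fieldNorm={norm}, score={scores[doc]:.8f}, "
          f"~{approx_terms:.0f} terms")
```

With everything else equal, the score is proportional to fieldNorm, which is why the shortest abstract (doc 1595) gets the largest contribution from the same single occurrence.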
    
  5. Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.15
    0.14794858 = sum of:
      0.14794858 = product of:
        0.7397429 = sum of:
          0.024470422 = weight(abstract_txt:text in 5273) [ClassicSimilarity], result of:
            0.024470422 = score(doc=5273,freq=1.0), product of:
              0.077455916 = queryWeight, product of:
                1.18823 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.016119711 = queryNorm
              0.3159271 = fieldWeight in 5273, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=5273)
          0.07896296 = weight(abstract_txt:classification in 5273) [ClassicSimilarity], result of:
            0.07896296 = score(doc=5273,freq=5.0), product of:
              0.11322691 = queryWeight, product of:
                1.7595179 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.016119711 = queryNorm
              0.69738686 = fieldWeight in 5273, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.078125 = fieldNorm(doc=5273)
          0.15892226 = weight(abstract_txt:categorization in 5273) [ClassicSimilarity], result of:
            0.15892226 = score(doc=5273,freq=1.0), product of:
              0.30863646 = queryWeight, product of:
                2.9049754 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.016119711 = queryNorm
              0.5149173 = fieldWeight in 5273, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.078125 = fieldNorm(doc=5273)
          0.19881156 = weight(abstract_txt:hierarchies in 5273) [ClassicSimilarity], result of:
            0.19881156 = score(doc=5273,freq=1.0), product of:
              0.35833162 = queryWeight, product of:
                3.1301231 = boost
                7.1017675 = idf(docFreq=98, maxDocs=44218)
                0.016119711 = queryNorm
              0.5548256 = fieldWeight in 5273, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.1017675 = idf(docFreq=98, maxDocs=44218)
                0.078125 = fieldNorm(doc=5273)
          0.2785757 = weight(abstract_txt:hierarchical in 5273) [ClassicSimilarity], result of:
            0.2785757 = score(doc=5273,freq=4.0), product of:
              0.31110892 = queryWeight, product of:
                3.3677857 = boost
                5.7307405 = idf(docFreq=389, maxDocs=44218)
                0.016119711 = queryNorm
              0.8954282 = fieldWeight in 5273, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.7307405 = idf(docFreq=389, maxDocs=44218)
                0.078125 = fieldNorm(doc=5273)
        0.2 = coord(5/25)
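
The coord factor has a visible effect on the order of this list. The sketch below, using only the raw sums and match counts printed in the five trees (the tuple layout and function name are illustrative assumptions), compares the ranking with and without it.

```python
QUERY_TERMS = 25  # denominator of every coord(n/25) in the listing

# (doc id, raw sum of term scores, number of matched query terms)
hits = [
    (1595, 0.8638781, 8),
    (2119, 0.6905247, 9),
    (3389, 0.66158646, 7),
    (1071, 0.55963385, 7),
    (5273, 0.7397429, 5),
]

def final_score(raw_sum, matched):
    # final = raw sum * coord, where coord = matched / total query terms
    return raw_sum * matched / QUERY_TERMS

by_final = [doc for doc, raw, m in
            sorted(hits, key=lambda h: final_score(h[1], h[2]), reverse=True)]
by_raw = [doc for doc, raw, m in
          sorted(hits, key=lambda h: h[1], reverse=True)]

print("ranked by final score:", by_final)
print("ranked by raw sum:    ", by_raw)
```

Document 5273 has the second-highest raw sum but matches only 5 of the 25 query terms, so the coord factor moves it to the last position in the displayed list.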