Document (#34121)

Author
Li, T.
Zhu, S.
Ogihara, M.
Title
Text categorization via generalized discriminant analysis
Source
Information processing and management. 44(2008) no.5, pp. 1684-1697
Year
2008
Abstract
Text categorization is an important research area and has received much attention owing to the growth of online information and of the Internet. Automated text categorization is generally cast as a multi-class classification problem. Much previous work has focused on binary document classification problems. Support vector machines (SVMs) excel in binary classification, but the elegant theory behind the large-margin hyperplane cannot be easily extended to multi-class text classification. In addition, training time and scaling are important concerns. On the other hand, other techniques that extend naturally to multi-class classification are generally not as accurate as SVMs. This paper presents a simple and efficient solution to multi-class text categorization. Classification problems are first formulated as optimization problems via discriminant analysis. Text categorization is then cast as the problem of finding coordinate transformations that reflect the inherent similarity in the data. While most previous approaches decompose a multi-class classification problem into multiple independent binary classification tasks, the proposed approach enables direct multi-class classification. Using generalized singular value decomposition (GSVD), a coordinate transformation is identified that reflects the inherent class structure indicated by the generalized singular values. Extensive experiments demonstrate the efficiency and effectiveness of the proposed approach.
Theme
Automatic classification

Similar documents (content)

  1. Stamatatos, E.: Author identification : using text sampling to handle the class imbalance problem (2008) 0.21
    0.21123376 = sum of:
      0.21123376 = product of:
        0.8801407 = sum of:
          0.026824545 = weight(abstract_txt:problem in 4064) [ClassicSimilarity], result of:
            0.026824545 = score(doc=4064,freq=1.0), product of:
              0.09627477 = queryWeight, product of:
                1.6315408 = boost
                4.457998 = idf(docFreq=1345, maxDocs=42740)
                0.013236547 = queryNorm
              0.27862486 = fieldWeight in 4064, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.457998 = idf(docFreq=1345, maxDocs=42740)
                0.0625 = fieldNorm(doc=4064)
          0.09853774 = weight(abstract_txt:text in 4064) [ClassicSimilarity], result of:
            0.09853774 = score(doc=4064,freq=6.0), product of:
              0.15892257 = queryWeight, product of:
                2.9644866 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.013236547 = queryNorm
              0.6200362 = fieldWeight in 4064, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=4064)
          0.14810383 = weight(abstract_txt:categorization in 4064) [ClassicSimilarity], result of:
            0.14810383 = score(doc=4064,freq=1.0), product of:
              0.3565754 = queryWeight, product of:
                4.053608 = boost
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.013236547 = queryNorm
              0.41535068 = fieldWeight in 4064, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.0625 = fieldNorm(doc=4064)
          0.22344911 = weight(abstract_txt:multi in 4064) [ClassicSimilarity], result of:
            0.22344911 = score(doc=4064,freq=3.0), product of:
              0.34560466 = queryWeight, product of:
                4.371661 = boost
                5.972531 = idf(docFreq=295, maxDocs=42740)
                0.013236547 = queryNorm
              0.6465454 = fieldWeight in 4064, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.972531 = idf(docFreq=295, maxDocs=42740)
                0.0625 = fieldNorm(doc=4064)
          0.05812978 = weight(abstract_txt:classification in 4064) [ClassicSimilarity], result of:
            0.05812978 = score(doc=4064,freq=1.0), product of:
              0.23252186 = queryWeight, product of:
                4.391716 = boost
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.013236547 = queryNorm
              0.24999705 = fieldWeight in 4064, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.0625 = fieldNorm(doc=4064)
          0.3250957 = weight(abstract_txt:class in 4064) [ClassicSimilarity], result of:
            0.3250957 = score(doc=4064,freq=5.0), product of:
              0.39400405 = queryWeight, product of:
                5.04174 = boost
                5.903989 = idf(docFreq=316, maxDocs=42740)
                0.013236547 = queryNorm
              0.8251075 = fieldWeight in 4064, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.903989 = idf(docFreq=316, maxDocs=42740)
                0.0625 = fieldNorm(doc=4064)
        0.24 = coord(6/25)
    
  2. Billal, B.; Fonseca, A.; Sadat, F.; Lounis, H.: Semi-supervised learning and social media text analysis towards multi-labeling categorization (2017) 0.15
    0.14810239 = sum of:
      0.14810239 = product of:
        0.6170933 = sum of:
          0.017653883 = weight(abstract_txt:proposed in 96) [ClassicSimilarity], result of:
            0.017653883 = score(doc=96,freq=1.0), product of:
              0.06955825 = queryWeight, product of:
                1.1323231 = boost
                4.640914 = idf(docFreq=1120, maxDocs=42740)
                0.013236547 = queryNorm
              0.25379997 = fieldWeight in 96, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.640914 = idf(docFreq=1120, maxDocs=42740)
                0.0546875 = fieldNorm(doc=96)
          0.023471477 = weight(abstract_txt:problem in 96) [ClassicSimilarity], result of:
            0.023471477 = score(doc=96,freq=1.0), product of:
              0.09627477 = queryWeight, product of:
                1.6315408 = boost
                4.457998 = idf(docFreq=1345, maxDocs=42740)
                0.013236547 = queryNorm
              0.24379675 = fieldWeight in 96, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.457998 = idf(docFreq=1345, maxDocs=42740)
                0.0546875 = fieldNorm(doc=96)
          0.07039876 = weight(abstract_txt:text in 96) [ClassicSimilarity], result of:
            0.07039876 = score(doc=96,freq=4.0), product of:
              0.15892257 = queryWeight, product of:
                2.9644866 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.013236547 = queryNorm
              0.44297522 = fieldWeight in 96, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0546875 = fieldNorm(doc=96)
          0.2257647 = weight(abstract_txt:multi in 96) [ClassicSimilarity], result of:
            0.2257647 = score(doc=96,freq=4.0), product of:
              0.34560466 = queryWeight, product of:
                4.371661 = boost
                5.972531 = idf(docFreq=295, maxDocs=42740)
                0.013236547 = queryNorm
              0.65324557 = fieldWeight in 96, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.972531 = idf(docFreq=295, maxDocs=42740)
                0.0546875 = fieldNorm(doc=96)
          0.15259068 = weight(abstract_txt:classification in 96) [ClassicSimilarity], result of:
            0.15259068 = score(doc=96,freq=9.0), product of:
              0.23252186 = queryWeight, product of:
                4.391716 = boost
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.013236547 = queryNorm
              0.65624225 = fieldWeight in 96, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.0546875 = fieldNorm(doc=96)
          0.12721382 = weight(abstract_txt:class in 96) [ClassicSimilarity], result of:
            0.12721382 = score(doc=96,freq=1.0), product of:
              0.39400405 = queryWeight, product of:
                5.04174 = boost
                5.903989 = idf(docFreq=316, maxDocs=42740)
                0.013236547 = queryNorm
              0.3228744 = fieldWeight in 96, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.903989 = idf(docFreq=316, maxDocs=42740)
                0.0546875 = fieldNorm(doc=96)
        0.24 = coord(6/25)
    
  3. Muneer, I.; Sharjeel, M.; Iqbal, M.; Adeel Nawab, R.M.; Rayson, P.: CLEU - A Cross-language English-Urdu corpus and benchmark for text reuse experiments (2019) 0.14
    0.13848752 = sum of:
      0.13848752 = product of:
        0.49459827 = sum of:
          0.028532982 = weight(abstract_txt:proposed in 1300) [ClassicSimilarity], result of:
            0.028532982 = score(doc=1300,freq=2.0), product of:
              0.06955825 = queryWeight, product of:
                1.1323231 = boost
                4.640914 = idf(docFreq=1120, maxDocs=42740)
                0.013236547 = queryNorm
              0.4102027 = fieldWeight in 1300, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.640914 = idf(docFreq=1120, maxDocs=42740)
                0.0625 = fieldNorm(doc=1300)
          0.025016578 = weight(abstract_txt:much in 1300) [ClassicSimilarity], result of:
            0.025016578 = score(doc=1300,freq=1.0), product of:
              0.08028094 = queryWeight, product of:
                1.2164725 = boost
                4.985807 = idf(docFreq=793, maxDocs=42740)
                0.013236547 = queryNorm
              0.31161293 = fieldWeight in 1300, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.985807 = idf(docFreq=793, maxDocs=42740)
                0.0625 = fieldNorm(doc=1300)
          0.026824545 = weight(abstract_txt:problem in 1300) [ClassicSimilarity], result of:
            0.026824545 = score(doc=1300,freq=1.0), product of:
              0.09627477 = queryWeight, product of:
                1.6315408 = boost
                4.457998 = idf(docFreq=1345, maxDocs=42740)
                0.013236547 = queryNorm
              0.27862486 = fieldWeight in 1300, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.457998 = idf(docFreq=1345, maxDocs=42740)
                0.0625 = fieldNorm(doc=1300)
          0.12065304 = weight(abstract_txt:binary in 1300) [ClassicSimilarity], result of:
            0.12065304 = score(doc=1300,freq=1.0), product of:
              0.2623311 = queryWeight, product of:
                2.6931875 = boost
                7.358825 = idf(docFreq=73, maxDocs=42740)
                0.013236547 = queryNorm
              0.45992658 = fieldWeight in 1300, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.358825 = idf(docFreq=73, maxDocs=42740)
                0.0625 = fieldNorm(doc=1300)
          0.10643292 = weight(abstract_txt:text in 1300) [ClassicSimilarity], result of:
            0.10643292 = score(doc=1300,freq=7.0), product of:
              0.15892257 = queryWeight, product of:
                2.9644866 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.013236547 = queryNorm
              0.6697156 = fieldWeight in 1300, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=1300)
          0.12900841 = weight(abstract_txt:multi in 1300) [ClassicSimilarity], result of:
            0.12900841 = score(doc=1300,freq=1.0), product of:
              0.34560466 = queryWeight, product of:
                4.371661 = boost
                5.972531 = idf(docFreq=295, maxDocs=42740)
                0.013236547 = queryNorm
              0.37328318 = fieldWeight in 1300, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.972531 = idf(docFreq=295, maxDocs=42740)
                0.0625 = fieldNorm(doc=1300)
          0.05812978 = weight(abstract_txt:classification in 1300) [ClassicSimilarity], result of:
            0.05812978 = score(doc=1300,freq=1.0), product of:
              0.23252186 = queryWeight, product of:
                4.391716 = boost
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.013236547 = queryNorm
              0.24999705 = fieldWeight in 1300, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.0625 = fieldNorm(doc=1300)
        0.28 = coord(7/25)
    
  4. Aphinyanaphongs, Y.; Fu, L.D.; Li, Z.; Peskin, E.R.; Efstathiadis, E.; Aliferis, C.F.; Statnikov, A.: A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization (2014) 0.12
    0.12189068 = sum of:
      0.12189068 = product of:
        0.6094534 = sum of:
          0.019288186 = weight(abstract_txt:important in 3497) [ClassicSimilarity], result of:
            0.019288186 = score(doc=3497,freq=1.0), product of:
              0.0581721 = queryWeight, product of:
                1.0355079 = boost
                4.2441096 = idf(docFreq=1666, maxDocs=42740)
                0.013236547 = queryNorm
              0.33157107 = fieldWeight in 3497, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2441096 = idf(docFreq=1666, maxDocs=42740)
                0.078125 = fieldNorm(doc=3497)
          0.03709067 = weight(abstract_txt:previous in 3497) [ClassicSimilarity], result of:
            0.03709067 = score(doc=3497,freq=1.0), product of:
              0.08995603 = queryWeight, product of:
                1.2876897 = boost
                5.277696 = idf(docFreq=592, maxDocs=42740)
                0.013236547 = queryNorm
              0.41232002 = fieldWeight in 3497, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.277696 = idf(docFreq=592, maxDocs=42740)
                0.078125 = fieldNorm(doc=3497)
          0.08709588 = weight(abstract_txt:text in 3497) [ClassicSimilarity], result of:
            0.08709588 = score(doc=3497,freq=3.0), product of:
              0.15892257 = queryWeight, product of:
                2.9644866 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.013236547 = queryNorm
              0.54803973 = fieldWeight in 3497, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.078125 = fieldNorm(doc=3497)
          0.3206542 = weight(abstract_txt:categorization in 3497) [ClassicSimilarity], result of:
            0.3206542 = score(doc=3497,freq=3.0), product of:
              0.3565754 = queryWeight, product of:
                4.053608 = boost
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.013236547 = queryNorm
              0.8992606 = fieldWeight in 3497, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.078125 = fieldNorm(doc=3497)
          0.14532444 = weight(abstract_txt:classification in 3497) [ClassicSimilarity], result of:
            0.14532444 = score(doc=3497,freq=4.0), product of:
              0.23252186 = queryWeight, product of:
                4.391716 = boost
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.013236547 = queryNorm
              0.6249926 = fieldWeight in 3497, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.078125 = fieldNorm(doc=3497)
        0.2 = coord(5/25)
    
  5. Li, T.; Zhu, S.; Ogihara, M.: Hierarchical document classification using automatically generated hierarchy (2007) 0.12
    0.12012029 = sum of:
      0.12012029 = product of:
        0.7507518 = sum of:
          0.25622422 = weight(abstract_txt:discriminant in 1798) [ClassicSimilarity], result of:
            0.25622422 = score(doc=1798,freq=2.0), product of:
              0.25897467 = queryWeight, product of:
                2.1848655 = boost
                8.954841 = idf(docFreq=14, maxDocs=42740)
                0.013236547 = queryNorm
              0.9893794 = fieldWeight in 1798, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.954841 = idf(docFreq=14, maxDocs=42740)
                0.078125 = fieldNorm(doc=1798)
          0.07111349 = weight(abstract_txt:text in 1798) [ClassicSimilarity], result of:
            0.07111349 = score(doc=1798,freq=2.0), product of:
              0.15892257 = queryWeight, product of:
                2.9644866 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.013236547 = queryNorm
              0.44747257 = fieldWeight in 1798, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.078125 = fieldNorm(doc=1798)
          0.3206542 = weight(abstract_txt:categorization in 1798) [ClassicSimilarity], result of:
            0.3206542 = score(doc=1798,freq=3.0), product of:
              0.3565754 = queryWeight, product of:
                4.053608 = boost
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.013236547 = queryNorm
              0.8992606 = fieldWeight in 1798, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.078125 = fieldNorm(doc=1798)
          0.102759905 = weight(abstract_txt:classification in 1798) [ClassicSimilarity], result of:
            0.102759905 = score(doc=1798,freq=2.0), product of:
              0.23252186 = queryWeight, product of:
                4.391716 = boost
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.013236547 = queryNorm
              0.44193652 = fieldWeight in 1798, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.078125 = fieldNorm(doc=1798)
        0.16 = coord(4/25)
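
The score breakdowns above can be recomputed by hand. The following is a minimal sketch, assuming the usual ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the values in the listing. The function and variable names (classic_idf, term_score, terms_4064, raw_sums) are illustrative and not part of the catalog or of Lucene. All inputs are copied from the listing; because Lucene works in single precision, the recomputed values should match only to about six digits.

```python
import math

MAX_DOCS = 42740          # maxDocs, as reported in the listing
QUERY_NORM = 0.013236547  # queryNorm, as reported in the listing
QUERY_TERMS = 25          # denominator of the coord(n/25) factor


def classic_idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM            # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


# (term, freq, docFreq, boost) for document 4064, copied from the listing.
terms_4064 = [
    ("problem",        1.0, 1345, 1.6315408),
    ("text",           6.0, 2023, 2.9644866),
    ("categorization", 1.0,  150, 4.053608),
    ("multi",          3.0,  295, 4.371661),
    ("classification", 1.0, 2127, 4.391716),
    ("class",          5.0,  316, 5.04174),
]

# Sum the per-term scores, then scale by the fraction of query terms matched.
raw_4064 = sum(term_score(f, df, b, 0.0625) for _, f, df, b in terms_4064)
score_4064 = raw_4064 * len(terms_4064) / QUERY_TERMS
print(f"doc 4064: raw sum {raw_4064:.7f}, final score {score_4064:.7f}")

# Per-document raw sums and matched-term counts, copied from the listing.
raw_sums = {
    4064: (0.8801407, 6),
    96:   (0.6170933, 6),
    1300: (0.49459827, 7),
    3497: (0.6094534, 5),
    1798: (0.7507518, 4),
}
final = {doc: raw * n / QUERY_TERMS for doc, (raw, n) in raw_sums.items()}
ranking = sorted(final, key=final.get, reverse=True)
print(ranking)
```

The second half illustrates the effect of the coord factor: document 1798 has the second-largest raw sum (0.7507518) but matches only 4 of the 25 query terms, so it ranks last among the five.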