Document (#38497)

Author
Aphinyanaphongs, Y.
Fu, L.D.
Li, Z.
Peskin, E.R.
Efstathiadis, E.
Aliferis, C.F.
Statnikov, A.
Title
A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization
Source
Journal of the Association for Information Science and Technology. 65(2014) no.10, pp.1964-1987
Year
2014
Abstract
An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for only a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including adding recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.
Theme
Automatisches Klassifizieren (automatic classification)

Similar documents (content)

  1. Seki, K.; Mostafa, J.: Gene ontology annotation as text categorization : an empirical study (2008) 0.29
    0.29066065 = sum of:
      0.29066065 = product of:
        0.9083145 = sum of:
          0.042293403 = weight(abstract_txt:performing in 2123) [ClassicSimilarity], result of:
            0.042293403 = score(doc=2123,freq=1.0), product of:
              0.10048859 = queryWeight, product of:
                1.0388913 = boost
                6.7340426 = idf(docFreq=142, maxDocs=44218)
                0.014363847 = queryNorm
              0.42087767 = fieldWeight in 2123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7340426 = idf(docFreq=142, maxDocs=44218)
                0.0625 = fieldNorm(doc=2123)
          0.027500222 = weight(abstract_txt:performance in 2123) [ClassicSimilarity], result of:
            0.027500222 = score(doc=2123,freq=1.0), product of:
              0.09502454 = queryWeight, product of:
                1.4287118 = boost
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.014363847 = queryNorm
              0.28940126 = fieldWeight in 2123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.0625 = fieldNorm(doc=2123)
          0.06345358 = weight(abstract_txt:text in 2123) [ClassicSimilarity], result of:
            0.06345358 = score(doc=2123,freq=3.0), product of:
              0.14495015 = queryWeight, product of:
                2.4954627 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014363847 = queryNorm
              0.4377614 = fieldWeight in 2123, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=2123)
          0.17269048 = weight(abstract_txt:supervised in 2123) [ClassicSimilarity], result of:
            0.17269048 = score(doc=2123,freq=1.0), product of:
              0.3702437 = queryWeight, product of:
                3.4539514 = boost
                7.462781 = idf(docFreq=68, maxDocs=44218)
                0.014363847 = queryNorm
              0.4664238 = fieldWeight in 2123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.462781 = idf(docFreq=68, maxDocs=44218)
                0.0625 = fieldNorm(doc=2123)
          0.22431713 = weight(abstract_txt:categorization in 2123) [ClassicSimilarity], result of:
            0.22431713 = score(doc=2123,freq=2.0), product of:
              0.3850525 = queryWeight, product of:
                4.067258 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.014363847 = queryNorm
              0.58256245 = fieldWeight in 2123, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.0625 = fieldNorm(doc=2123)
          0.10771045 = weight(abstract_txt:selection in 2123) [ClassicSimilarity], result of:
            0.10771045 = score(doc=2123,freq=1.0), product of:
              0.32045242 = queryWeight, product of:
                4.14838 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.014363847 = queryNorm
              0.33611995 = fieldWeight in 2123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.0625 = fieldNorm(doc=2123)
          0.0691301 = weight(abstract_txt:methods in 2123) [ClassicSimilarity], result of:
            0.0691301 = score(doc=2123,freq=1.0), product of:
              0.26673445 = queryWeight, product of:
                4.4781675 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.014363847 = queryNorm
              0.259172 = fieldWeight in 2123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.0625 = fieldNorm(doc=2123)
          0.20121907 = weight(abstract_txt:feature in 2123) [ClassicSimilarity], result of:
            0.20121907 = score(doc=2123,freq=2.0), product of:
              0.385799 = queryWeight, product of:
                4.5517383 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014363847 = queryNorm
              0.52156454 = fieldWeight in 2123, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=2123)
        0.32 = coord(8/25)
    
  2. Wang, H.; Hong, M.: Supervised Hebb rule based feature selection for text classification (2019) 0.26
    0.2567164 = sum of:
      0.2567164 = product of:
        0.91684425 = sum of:
          0.038891185 = weight(abstract_txt:performance in 5036) [ClassicSimilarity], result of:
            0.038891185 = score(doc=5036,freq=2.0), product of:
              0.09502454 = queryWeight, product of:
                1.4287118 = boost
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.014363847 = queryNorm
              0.40927517 = fieldWeight in 5036, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.63042 = idf(docFreq=1171, maxDocs=44218)
                0.0625 = fieldNorm(doc=5036)
          0.06345358 = weight(abstract_txt:text in 5036) [ClassicSimilarity], result of:
            0.06345358 = score(doc=5036,freq=3.0), product of:
              0.14495015 = queryWeight, product of:
                2.4954627 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014363847 = queryNorm
              0.4377614 = fieldWeight in 5036, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=5036)
          0.04405662 = weight(abstract_txt:classification in 5036) [ClassicSimilarity], result of:
            0.04405662 = score(doc=5036,freq=1.0), product of:
              0.17657632 = queryWeight, product of:
                3.079378 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.014363847 = queryNorm
              0.2495047 = fieldWeight in 5036, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=5036)
          0.17269048 = weight(abstract_txt:supervised in 5036) [ClassicSimilarity], result of:
            0.17269048 = score(doc=5036,freq=1.0), product of:
              0.3702437 = queryWeight, product of:
                3.4539514 = boost
                7.462781 = idf(docFreq=68, maxDocs=44218)
                0.014363847 = queryNorm
              0.4664238 = fieldWeight in 5036, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.462781 = idf(docFreq=68, maxDocs=44218)
                0.0625 = fieldNorm(doc=5036)
          0.2154209 = weight(abstract_txt:selection in 5036) [ClassicSimilarity], result of:
            0.2154209 = score(doc=5036,freq=4.0), product of:
              0.32045242 = queryWeight, product of:
                4.14838 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.014363847 = queryNorm
              0.6722399 = fieldWeight in 5036, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.0625 = fieldNorm(doc=5036)
          0.09776472 = weight(abstract_txt:methods in 5036) [ClassicSimilarity], result of:
            0.09776472 = score(doc=5036,freq=2.0), product of:
              0.26673445 = queryWeight, product of:
                4.4781675 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.014363847 = queryNorm
              0.36652455 = fieldWeight in 5036, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.0625 = fieldNorm(doc=5036)
          0.28456676 = weight(abstract_txt:feature in 5036) [ClassicSimilarity], result of:
            0.28456676 = score(doc=5036,freq=4.0), product of:
              0.385799 = queryWeight, product of:
                4.5517383 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014363847 = queryNorm
              0.73760366 = fieldWeight in 5036, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=5036)
        0.28 = coord(7/25)
    
  3. Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.22
    0.22212343 = sum of:
      0.22212343 = product of:
        0.9255143 = sum of:
          0.07666166 = weight(abstract_txt:benchmark in 2804) [ClassicSimilarity], result of:
            0.07666166 = score(doc=2804,freq=2.0), product of:
              0.11857064 = queryWeight, product of:
                1.1284968 = boost
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.014363847 = queryNorm
              0.64654845 = fieldWeight in 2804, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.089736916 = weight(abstract_txt:text in 2804) [ClassicSimilarity], result of:
            0.089736916 = score(doc=2804,freq=6.0), product of:
              0.14495015 = queryWeight, product of:
                2.4954627 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014363847 = queryNorm
              0.6190881 = fieldWeight in 2804, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.04405662 = weight(abstract_txt:classification in 2804) [ClassicSimilarity], result of:
            0.04405662 = score(doc=2804,freq=1.0), product of:
              0.17657632 = queryWeight, product of:
                3.079378 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.014363847 = queryNorm
              0.2495047 = fieldWeight in 2804, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.2408479 = weight(abstract_txt:selection in 2804) [ClassicSimilarity], result of:
            0.2408479 = score(doc=2804,freq=5.0), product of:
              0.32045242 = queryWeight, product of:
                4.14838 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.014363847 = queryNorm
              0.7515871 = fieldWeight in 2804, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.09776472 = weight(abstract_txt:methods in 2804) [ClassicSimilarity], result of:
            0.09776472 = score(doc=2804,freq=2.0), product of:
              0.26673445 = queryWeight, product of:
                4.4781675 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.014363847 = queryNorm
              0.36652455 = fieldWeight in 2804, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.37644643 = weight(abstract_txt:feature in 2804) [ClassicSimilarity], result of:
            0.37644643 = score(doc=2804,freq=7.0), product of:
              0.385799 = queryWeight, product of:
                4.5517383 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014363847 = queryNorm
              0.9757579 = fieldWeight in 2804, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
        0.24 = coord(6/25)
    
  4. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thorough evaluation of various methods (2000) 0.21
    0.20754212 = sum of:
      0.20754212 = product of:
        0.86475885 = sum of:
          0.10422133 = weight(abstract_txt:classifiers in 5480) [ClassicSimilarity], result of:
            0.10422133 = score(doc=5480,freq=2.0), product of:
              0.12539765 = queryWeight, product of:
                1.1605302 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.014363847 = queryNorm
              0.83112663 = fieldWeight in 5480, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.04579368 = weight(abstract_txt:text in 5480) [ClassicSimilarity], result of:
            0.04579368 = score(doc=5480,freq=1.0), product of:
              0.14495015 = queryWeight, product of:
                2.4954627 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014363847 = queryNorm
              0.3159271 = fieldWeight in 5480, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.123142004 = weight(abstract_txt:classification in 5480) [ClassicSimilarity], result of:
            0.123142004 = score(doc=5480,freq=5.0), product of:
              0.17657632 = queryWeight, product of:
                3.079378 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.014363847 = queryNorm
              0.69738686 = fieldWeight in 5480, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.19040696 = weight(abstract_txt:selection in 5480) [ClassicSimilarity], result of:
            0.19040696 = score(doc=5480,freq=2.0), product of:
              0.32045242 = queryWeight, product of:
                4.14838 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.014363847 = queryNorm
              0.5941817 = fieldWeight in 5480, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.14967106 = weight(abstract_txt:methods in 5480) [ClassicSimilarity], result of:
            0.14967106 = score(doc=5480,freq=3.0), product of:
              0.26673445 = queryWeight, product of:
                4.4781675 = boost
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.014363847 = queryNorm
              0.56112385 = fieldWeight in 5480, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.146752 = idf(docFreq=1900, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.25152382 = weight(abstract_txt:feature in 5480) [ClassicSimilarity], result of:
            0.25152382 = score(doc=5480,freq=2.0), product of:
              0.385799 = queryWeight, product of:
                4.5517383 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014363847 = queryNorm
              0.65195566 = fieldWeight in 5480, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
        0.24 = coord(6/25)
    
  5. Maghsoodi, N.; Homayounpour, M.M.: Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection (2011) 0.19
    0.19276421 = sum of:
      0.19276421 = product of:
        0.8031842 = sum of:
          0.040912393 = weight(abstract_txt:adding in 4775) [ClassicSimilarity], result of:
            0.040912393 = score(doc=4775,freq=1.0), product of:
              0.098289 = queryWeight, product of:
                1.0274583 = boost
                6.6599345 = idf(docFreq=153, maxDocs=44218)
                0.014363847 = queryNorm
              0.4162459 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6599345 = idf(docFreq=153, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.036634944 = weight(abstract_txt:text in 4775) [ClassicSimilarity], result of:
            0.036634944 = score(doc=4775,freq=1.0), product of:
              0.14495015 = queryWeight, product of:
                2.4954627 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014363847 = queryNorm
              0.25274166 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.062305465 = weight(abstract_txt:classification in 4775) [ClassicSimilarity], result of:
            0.062305465 = score(doc=4775,freq=2.0), product of:
              0.17657632 = queryWeight, product of:
                3.079378 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.014363847 = queryNorm
              0.3528529 = fieldWeight in 4775, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.15861617 = weight(abstract_txt:categorization in 4775) [ClassicSimilarity], result of:
            0.15861617 = score(doc=4775,freq=1.0), product of:
              0.3850525 = queryWeight, product of:
                4.067258 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.014363847 = queryNorm
              0.41193387 = fieldWeight in 4775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.18655996 = weight(abstract_txt:selection in 4775) [ClassicSimilarity], result of:
            0.18655996 = score(doc=4775,freq=3.0), product of:
              0.32045242 = queryWeight, product of:
                4.14838 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.014363847 = queryNorm
              0.5821768 = fieldWeight in 4775, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
          0.3181553 = weight(abstract_txt:feature in 4775) [ClassicSimilarity], result of:
            0.3181553 = score(doc=4775,freq=5.0), product of:
              0.385799 = queryWeight, product of:
                4.5517383 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.014363847 = queryNorm
              0.82466596 = fieldWeight in 4775, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=4775)
        0.24 = coord(6/25)
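
The score trees above follow the classic Lucene TF-IDF pattern: each term score is queryWeight (boost × idf × queryNorm) times fieldWeight (tf × idf × fieldNorm), with tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the term scores are summed and multiplied by the coord factor (matched terms / query terms). The sketch below re-derives the first listed term score and the total for doc 2123 from the numbers printed in its explanation. The function names are illustrative and not part of any library, and because the original values are single-precision floats the results should only be expected to agree to about five decimal places.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as printed in the explanations."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term frequency."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    """Score of one term = queryWeight * fieldWeight (hypothetical helper)."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


# Inputs copied from the explanation of "abstract_txt:performing" in doc 2123.
performing = term_score(
    freq=1.0,
    doc_freq=142,
    max_docs=44218,
    boost=1.0388913,
    query_norm=0.014363847,
    field_norm=0.0625,
)

# The eight per-term scores listed for doc 2123, in the order shown above.
term_scores_2123 = [
    0.042293403,  # performing
    0.027500222,  # performance
    0.06345358,   # text
    0.17269048,   # supervised
    0.22431713,   # categorization
    0.10771045,   # selection
    0.0691301,    # methods
    0.20121907,   # feature
]

# coord(8/25): 8 of the 25 query terms matched this document.
total_2123 = sum(term_scores_2123) * (8 / 25)

print(round(performing, 4))
print(round(total_2123, 4))
```

The same arithmetic applies to the other four hits; only the matched terms, their frequencies, the fieldNorm, and the coord fraction differ.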