Document (#30174)

Author
Giorgetti, D.
Sebastiani, F.
Title
Automating survey coding by multiclass text categorization techniques
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.14, pp. 1269-1277
Year
2003
Abstract
In this issue, Giorgetti and Sebastiani suggest that answers to open-ended questions in survey instruments can be coded automatically by creating classifiers that learn from training sets of manually coded answers. The only manual effort required is classifying a representative set of documents, not creating a dictionary of words that trigger an assignment. They use a naive Bayesian probabilistic learner from McCallum's RAINBOW package and the multi-class support vector machine learner from Hsu and Lin's BSVM package, both examples of text categorization techniques. Data from the 1996 General Social Survey by the U.S. National Opinion Research Center provided a set of answers to three questions (previously tested by Viechnicki using a dictionary approach), their associated manually assigned category codes, and a complete set of predefined category codes. The learners were run on three random disjoint subsets of the answer sets to create the classifiers, and a remaining set was used as a test set. RAINBOW outperforms the dictionary approach by 18% and BSVM by 17%, while the standard deviation of the results is reduced by 28% and 34%, respectively, relative to the dictionary approach.
Theme
Automatic classification
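
The abstract describes classifiers that learn from manually coded answers rather than from a hand-built trigger dictionary. The following is a minimal sketch of that idea using a multinomial naive Bayes learner with add-one smoothing. It is not the RAINBOW or BSVM implementation; the function names, the category codes, and the toy answers are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect class and word counts from (answer_text, code) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in examples:
        class_counts[code] += 1
        for word in text.lower().split():
            word_counts[code][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the code with the highest log-probability for the answer."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_code, best_score = None, -math.inf
    for code, n in class_counts.items():
        score = math.log(n / total)  # class prior
        denom = sum(word_counts[code].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[code][word] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Hypothetical manually coded answers (not data from the survey).
training = [
    ("i like helping people", "SERVICE"),
    ("helping people in need", "SERVICE"),
    ("good pay and benefits", "INCOME"),
    ("the pay is good", "INCOME"),
]
model = train_naive_bayes(training)
predicted = classify(model, "helping people")
print(predicted)
```

The only manual input is the list of coded answers; no word-to-code dictionary is written by hand.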

Similar documents (author)

  1. Sebastiani, F.: On the role of logic in information retrieval (1998) 5.99
    5.9875464 = sum of:
      5.9875464 = weight(author_txt:sebastiani in 2141) [ClassicSimilarity], result of:
        5.9875464 = fieldWeight in 2141, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.625 = fieldNorm(doc=2141)
    
  2. Sebastiani, F.: Machine learning in automated text categorization (2002) 5.99
    5.9875464 = sum of:
      5.9875464 = weight(author_txt:sebastiani in 4390) [ClassicSimilarity], result of:
        5.9875464 = fieldWeight in 4390, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.625 = fieldNorm(doc=4390)
    
  3. Sebastiani, F.: A tutorial on automated text categorisation (1999) 5.99
    5.9875464 = sum of:
      5.9875464 = weight(author_txt:sebastiani in 4391) [ClassicSimilarity], result of:
        5.9875464 = fieldWeight in 4391, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.625 = fieldNorm(doc=4391)
    
  4. Sebastiani, F.: Classification of text, automatic (2006) 5.99
    5.9875464 = sum of:
      5.9875464 = weight(author_txt:sebastiani in 4) [ClassicSimilarity], result of:
        5.9875464 = fieldWeight in 4, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.625 = fieldNorm(doc=4)
    
  5. Debole, F.; Sebastiani, F.: An analysis of the relative hardness of Reuters-21578 subsets (2005) 4.79
    4.790037 = sum of:
      4.790037 = weight(author_txt:sebastiani in 4457) [ClassicSimilarity], result of:
        4.790037 = fieldWeight in 4457, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.5 = fieldNorm(doc=4457)
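
The author scores above can be reproduced by hand. The sketch below assumes the classic TF-IDF formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the values shown in the listing; the function names are illustrative and are not part of any library API.

```python
import math


def classic_tf(freq):
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency, matching the idf values in the listing."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm, as in the explain output."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Hits 1 to 4 use fieldNorm 0.625; hit 5 uses fieldNorm 0.5.
score_single_author = field_weight(1.0, 7, 42596, 0.625)
score_two_authors = field_weight(1.0, 7, 42596, 0.5)
print(round(score_single_author, 4), round(score_two_authors, 4))
```

The lower fieldNorm for hit 5 reflects its longer author field (two authors), which is why it scores below the single-author records.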
    

Similar documents (content)

  1. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.12
    0.11848109 = sum of:
      0.11848109 = product of:
        0.59240544 = sum of:
          0.11652214 = weight(abstract_txt:naive in 2809) [ClassicSimilarity], result of:
            0.11652214 = score(doc=2809,freq=2.0), product of:
              0.15898006 = queryWeight, product of:
                8.29222 = idf(docFreq=28, maxDocs=42596)
                0.019172195 = queryNorm
              0.7329356 = fieldWeight in 2809, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.29222 = idf(docFreq=28, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.017736338 = weight(abstract_txt:from in 2809) [ClassicSimilarity], result of:
            0.017736338 = score(doc=2809,freq=2.0), product of:
              0.07194484 = queryWeight, product of:
                1.3454219 = boost
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.019172195 = queryNorm
              0.2465269 = fieldWeight in 2809, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.124230824 = weight(abstract_txt:category in 2809) [ClassicSimilarity], result of:
            0.124230824 = score(doc=2809,freq=3.0), product of:
              0.18261488 = queryWeight, product of:
                1.5156947 = boost
                6.2842374 = idf(docFreq=215, maxDocs=42596)
                0.019172195 = queryNorm
              0.6802886 = fieldWeight in 2809, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2842374 = idf(docFreq=215, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.054113135 = weight(abstract_txt:survey in 2809) [ClassicSimilarity], result of:
            0.054113135 = score(doc=2809,freq=1.0), product of:
              0.17324308 = queryWeight, product of:
                1.8080783 = boost
                4.997661 = idf(docFreq=781, maxDocs=42596)
                0.019172195 = queryNorm
              0.31235382 = fieldWeight in 2809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.997661 = idf(docFreq=781, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.27980298 = weight(abstract_txt:classifiers in 2809) [ClassicSimilarity], result of:
            0.27980298 = score(doc=2809,freq=5.0), product of:
              0.26464796 = queryWeight, product of:
                1.8246431 = boost
                7.5651712 = idf(docFreq=59, maxDocs=42596)
                0.019172195 = queryNorm
              1.0572648 = fieldWeight in 2809, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.5651712 = idf(docFreq=59, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
        0.2 = coord(5/25)
    
  2. Sebastiani, F.: Classification of text, automatic (2006) 0.11
    0.111612774 = sum of:
      0.111612774 = product of:
        0.55806386 = sum of:
          0.032583732 = weight(abstract_txt:from in 4) [ClassicSimilarity], result of:
            0.032583732 = score(doc=4,freq=3.0), product of:
              0.07194484 = queryWeight, product of:
                1.3454219 = boost
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.019172195 = queryNorm
              0.4528988 = fieldWeight in 4, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.09375 = fieldNorm(doc=4)
          0.03499076 = weight(abstract_txt:approach in 4) [ClassicSimilarity], result of:
            0.03499076 = score(doc=4,freq=1.0), product of:
              0.09886187 = queryWeight, product of:
                1.3658521 = boost
                3.7753158 = idf(docFreq=2654, maxDocs=42596)
                0.019172195 = queryNorm
              0.35393584 = fieldWeight in 4, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7753158 = idf(docFreq=2654, maxDocs=42596)
                0.09375 = fieldNorm(doc=4)
          0.0811697 = weight(abstract_txt:survey in 4) [ClassicSimilarity], result of:
            0.0811697 = score(doc=4,freq=1.0), product of:
              0.17324308 = queryWeight, product of:
                1.8080783 = boost
                4.997661 = idf(docFreq=781, maxDocs=42596)
                0.019172195 = queryNorm
              0.4685307 = fieldWeight in 4, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.997661 = idf(docFreq=781, maxDocs=42596)
                0.09375 = fieldNorm(doc=4)
          0.18769756 = weight(abstract_txt:classifiers in 4) [ClassicSimilarity], result of:
            0.18769756 = score(doc=4,freq=1.0), product of:
              0.26464796 = queryWeight, product of:
                1.8246431 = boost
                7.5651712 = idf(docFreq=59, maxDocs=42596)
                0.019172195 = queryNorm
              0.70923483 = fieldWeight in 4, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5651712 = idf(docFreq=59, maxDocs=42596)
                0.09375 = fieldNorm(doc=4)
          0.22162214 = weight(abstract_txt:learner in 4) [ClassicSimilarity], result of:
            0.22162214 = score(doc=4,freq=1.0), product of:
              0.2956457 = queryWeight, product of:
                1.9285436 = boost
                7.995954 = idf(docFreq=38, maxDocs=42596)
                0.019172195 = queryNorm
              0.7496207 = fieldWeight in 4, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.995954 = idf(docFreq=38, maxDocs=42596)
                0.09375 = fieldNorm(doc=4)
        0.2 = coord(5/25)
    
  3. Ko, Y.; Park, J.; Seo, J.: Improving text categorization using the importance of sentences (2004) 0.11
    0.105655946 = sum of:
      0.105655946 = product of:
        0.4402331 = sum of:
          0.0823936 = weight(abstract_txt:naive in 3558) [ClassicSimilarity], result of:
            0.0823936 = score(doc=3558,freq=1.0), product of:
              0.15898006 = queryWeight, product of:
                8.29222 = idf(docFreq=28, maxDocs=42596)
                0.019172195 = queryNorm
              0.51826376 = fieldWeight in 3558, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.29222 = idf(docFreq=28, maxDocs=42596)
                0.0625 = fieldNorm(doc=3558)
          0.02679896 = weight(abstract_txt:techniques in 3558) [ClassicSimilarity], result of:
            0.02679896 = score(doc=3558,freq=1.0), product of:
              0.094733216 = queryWeight, product of:
                1.0916786 = boost
                4.52622 = idf(docFreq=1252, maxDocs=42596)
                0.019172195 = queryNorm
              0.28288874 = fieldWeight in 3558, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.52622 = idf(docFreq=1252, maxDocs=42596)
                0.0625 = fieldNorm(doc=3558)
          0.056841865 = weight(abstract_txt:sets in 3558) [ClassicSimilarity], result of:
            0.056841865 = score(doc=3558,freq=2.0), product of:
              0.12412499 = queryWeight, product of:
                1.2496065 = boost
                5.181006 = idf(docFreq=650, maxDocs=42596)
                0.019172195 = queryNorm
              0.45794055 = fieldWeight in 3558, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.181006 = idf(docFreq=650, maxDocs=42596)
                0.0625 = fieldNorm(doc=3558)
          0.012541485 = weight(abstract_txt:from in 3558) [ClassicSimilarity], result of:
            0.012541485 = score(doc=3558,freq=1.0), product of:
              0.07194484 = queryWeight, product of:
                1.3454219 = boost
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.019172195 = queryNorm
              0.17432085 = fieldWeight in 3558, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.0625 = fieldNorm(doc=3558)
          0.084694244 = weight(abstract_txt:categorization in 3558) [ClassicSimilarity], result of:
            0.084694244 = score(doc=3558,freq=1.0), product of:
              0.20401382 = queryWeight, product of:
                1.6020404 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.019172195 = queryNorm
              0.41513973 = fieldWeight in 3558, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.0625 = fieldNorm(doc=3558)
          0.17696294 = weight(abstract_txt:classifiers in 3558) [ClassicSimilarity], result of:
            0.17696294 = score(doc=3558,freq=2.0), product of:
              0.26464796 = queryWeight, product of:
                1.8246431 = boost
                7.5651712 = idf(docFreq=59, maxDocs=42596)
                0.019172195 = queryNorm
              0.668673 = fieldWeight in 3558, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5651712 = idf(docFreq=59, maxDocs=42596)
                0.0625 = fieldNorm(doc=3558)
        0.24 = coord(6/25)
    
  4. Liu, R.-L.: A passage extractor for classification of disease aspect information (2013) 0.09
    0.090841785 = sum of:
      0.090841785 = product of:
        0.4542089 = sum of:
          0.025534457 = weight(abstract_txt:three in 2108) [ClassicSimilarity], result of:
            0.025534457 = score(doc=2108,freq=1.0), product of:
              0.0917293 = queryWeight, product of:
                1.074231 = boost
                4.4538803 = idf(docFreq=1346, maxDocs=42596)
                0.019172195 = queryNorm
              0.27836752 = fieldWeight in 2108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4538803 = idf(docFreq=1346, maxDocs=42596)
                0.0625 = fieldNorm(doc=2108)
          0.037899453 = weight(abstract_txt:techniques in 2108) [ClassicSimilarity], result of:
            0.037899453 = score(doc=2108,freq=2.0), product of:
              0.094733216 = queryWeight, product of:
                1.0916786 = boost
                4.52622 = idf(docFreq=1252, maxDocs=42596)
                0.019172195 = queryNorm
              0.4000651 = fieldWeight in 2108, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.52622 = idf(docFreq=1252, maxDocs=42596)
                0.0625 = fieldNorm(doc=2108)
          0.012541485 = weight(abstract_txt:from in 2108) [ClassicSimilarity], result of:
            0.012541485 = score(doc=2108,freq=1.0), product of:
              0.07194484 = queryWeight, product of:
                1.3454219 = boost
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.019172195 = queryNorm
              0.17432085 = fieldWeight in 2108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.0625 = fieldNorm(doc=2108)
          0.071724705 = weight(abstract_txt:category in 2108) [ClassicSimilarity], result of:
            0.071724705 = score(doc=2108,freq=1.0), product of:
              0.18261488 = queryWeight, product of:
                1.5156947 = boost
                6.2842374 = idf(docFreq=215, maxDocs=42596)
                0.019172195 = queryNorm
              0.39276484 = fieldWeight in 2108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2842374 = idf(docFreq=215, maxDocs=42596)
                0.0625 = fieldNorm(doc=2108)
          0.3065088 = weight(abstract_txt:classifiers in 2108) [ClassicSimilarity], result of:
            0.3065088 = score(doc=2108,freq=6.0), product of:
              0.26464796 = queryWeight, product of:
                1.8246431 = boost
                7.5651712 = idf(docFreq=59, maxDocs=42596)
                0.019172195 = queryNorm
              1.1581756 = fieldWeight in 2108, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.5651712 = idf(docFreq=59, maxDocs=42596)
                0.0625 = fieldNorm(doc=2108)
        0.2 = coord(5/25)
    
  5. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.08
    0.08375029 = sum of:
      0.08375029 = product of:
        0.34895957 = sum of:
          0.03191807 = weight(abstract_txt:three in 4390) [ClassicSimilarity], result of:
            0.03191807 = score(doc=4390,freq=1.0), product of:
              0.0917293 = queryWeight, product of:
                1.074231 = boost
                4.4538803 = idf(docFreq=1346, maxDocs=42596)
                0.019172195 = queryNorm
              0.3479594 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4538803 = idf(docFreq=1346, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.0334987 = weight(abstract_txt:techniques in 4390) [ClassicSimilarity], result of:
            0.0334987 = score(doc=4390,freq=1.0), product of:
              0.094733216 = queryWeight, product of:
                1.0916786 = boost
                4.52622 = idf(docFreq=1252, maxDocs=42596)
                0.019172195 = queryNorm
              0.35361093 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.52622 = idf(docFreq=1252, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.015676856 = weight(abstract_txt:from in 4390) [ClassicSimilarity], result of:
            0.015676856 = score(doc=4390,freq=1.0), product of:
              0.07194484 = queryWeight, product of:
                1.3454219 = boost
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.019172195 = queryNorm
              0.21790105 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7891335 = idf(docFreq=7117, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.050504815 = weight(abstract_txt:approach in 4390) [ClassicSimilarity], result of:
            0.050504815 = score(doc=4390,freq=3.0), product of:
              0.09886187 = queryWeight, product of:
                1.3658521 = boost
                3.7753158 = idf(docFreq=2654, maxDocs=42596)
                0.019172195 = queryNorm
              0.5108624 = fieldWeight in 4390, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.7753158 = idf(docFreq=2654, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.14971969 = weight(abstract_txt:categorization in 4390) [ClassicSimilarity], result of:
            0.14971969 = score(doc=4390,freq=2.0), product of:
              0.20401382 = queryWeight, product of:
                1.6020404 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.019172195 = queryNorm
              0.73387027 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.06764142 = weight(abstract_txt:survey in 4390) [ClassicSimilarity], result of:
            0.06764142 = score(doc=4390,freq=1.0), product of:
              0.17324308 = queryWeight, product of:
                1.8080783 = boost
                4.997661 = idf(docFreq=781, maxDocs=42596)
                0.019172195 = queryNorm
              0.39044228 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.997661 = idf(docFreq=781, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
        0.24 = coord(6/25)
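
The content scores combine a per-term product of queryWeight and fieldWeight with a coordination factor. As a rough check, the sketch below recomputes the score of the first content hit (doc 2809) from the boost, idf, freq, fieldNorm, queryNorm, and coord values copied from the listing above. The helper name and the tuple layout are assumptions made for illustration, and "naive" is given a boost of 1.0 because the listing shows none for it.

```python
import math

QUERY_NORM = 0.019172195  # queryNorm from the listing


def term_score(boost, idf, freq, field_norm, query_norm=QUERY_NORM):
    """score = queryWeight * fieldWeight for a single matching term."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


# (boost, idf, freq) for each matching term in doc 2809, fieldNorm 0.0625.
terms_2809 = {
    "naive": (1.0, 8.29222, 2.0),
    "from": (1.3454219, 2.7891335, 2.0),
    "category": (1.5156947, 6.2842374, 3.0),
    "survey": (1.8080783, 4.997661, 1.0),
    "classifiers": (1.8246431, 7.5651712, 5.0),
}

term_sum = sum(
    term_score(boost, idf, freq, 0.0625)
    for boost, idf, freq in terms_2809.values()
)
coord = 5 / 25  # 5 of the 25 query terms matched
total_2809 = term_sum * coord
print(round(total_2809, 4))
```

The coord factor scales the summed term scores by the fraction of query terms that matched, so a document matching 6 of 25 terms is multiplied by 0.24 instead of 0.2.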