Document (#30171)

Author
Giorgetti, D.
Sebastiani, F.
Title
Automating survey coding by multiclass text categorization techniques
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.14, pp.1269-1277
Year
2003
Abstract
In this issue Giorgetti and Sebastiani suggest that answers to open-ended questions in survey instruments can be coded automatically by creating classifiers that learn from training sets of manually coded answers. The manual effort required is only that of classifying a representative set of documents, not that of creating a dictionary of words that trigger an assignment. They use a naive Bayesian probabilistic learner from McCallum's RAINBOW package and the multi-class support vector machine learner from Hsu and Lin's BSVM package, both examples of text categorization techniques. Data from the 1996 General Social Survey by the U.S. National Opinion Research Center provided a set of answers to three questions (previously tested by Viechnicki using a dictionary approach), their associated manually assigned category codes, and a complete set of predefined category codes. The learners were run on three random disjoint subsets of the answer sets to create the classifiers, and a remaining set was used as a test set. The dictionary approach is outperformed by 18% for RAINBOW and by 17% for BSVM, while the standard deviation of the results is reduced by 28% and 34%, respectively, relative to the dictionary approach.
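The idea of learning a classifier from manually coded answers can be illustrated with a minimal sketch. This is not the RAINBOW implementation: it is a plain multinomial naive Bayes learner with add-one smoothing, and the function names, the toy answers, and the category codes below are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(coded_answers):
    """Build count tables from (answer_text, category_code) pairs coded by hand."""
    doc_counts = Counter()              # number of training answers per code
    word_counts = defaultdict(Counter)  # token counts per code
    vocab = set()
    for text, code in coded_answers:
        tokens = text.lower().split()
        doc_counts[code] += 1
        word_counts[code].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab

def classify(model, text):
    """Return the code with the highest log posterior (add-one smoothing assumed)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_code, best_score = None, -math.inf
    for code in doc_counts:
        score = math.log(doc_counts[code] / total_docs)  # log prior
        total_words = sum(word_counts[code].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # tokens never seen in training are ignored
            score += math.log((word_counts[code][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_code, best_score = code, score
    return best_code

# Invented toy training set: open-ended answers with hand-assigned codes.
training = [
    ("low pay and long hours", "work"),
    ("my job pays too little", "work"),
    ("worried about my health", "health"),
    ("doctor visits cost too much", "health"),
]
model = train_naive_bayes(training)
predicted = classify(model, "long hours at my job")
```

The only manual input is the list of coded answers; no trigger-word dictionary is written by hand, which is the contrast the abstract draws with the dictionary approach.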
Theme
Automatic classification

Similar documents (author)

  1. Sebastiani, F.: On the role of logic in information retrieval (1998) 6.00
    6.0014763 = sum of:
      6.0014763 = weight(author_txt:sebastiani in 2138) [ClassicSimilarity], result of:
        6.0014763 = fieldWeight in 2138, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.625 = fieldNorm(doc=2138)
    
  2. Sebastiani, F.: Machine learning in automated text categorization (2002) 6.00
    6.0014763 = sum of:
      6.0014763 = weight(author_txt:sebastiani in 4387) [ClassicSimilarity], result of:
        6.0014763 = fieldWeight in 4387, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.625 = fieldNorm(doc=4387)
    
  3. Sebastiani, F.: A tutorial on automated text categorisation (1999) 6.00
    6.0014763 = sum of:
      6.0014763 = weight(author_txt:sebastiani in 4388) [ClassicSimilarity], result of:
        6.0014763 = fieldWeight in 4388, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.625 = fieldNorm(doc=4388)
    
  4. Sebastiani, F.: Classification of text, automatic (2006) 6.00
    6.0014763 = sum of:
      6.0014763 = weight(author_txt:sebastiani in 1) [ClassicSimilarity], result of:
        6.0014763 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.625 = fieldNorm(doc=1)
    
  5. Debole, F.; Sebastiani, F.: An analysis of the relative hardness of Reuters-21578 subsets (2005) 4.80
    4.801181 = sum of:
      4.801181 = weight(author_txt:sebastiani in 4454) [ClassicSimilarity], result of:
        4.801181 = fieldWeight in 4454, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.5 = fieldNorm(doc=4454)
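The author-match scores above can be reproduced by hand. The sketch below assumes the tf and idf definitions that the listed numbers are consistent with, namely tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are illustrative and are not Lucene API calls. The fieldNorm values are taken from the listing as given.

```python
import math

def classic_tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)

def classic_idf(doc_freq, max_docs):
    # Inverse document frequency, as inferred from the listed values.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as in the explain tree.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

idf_sebastiani = classic_idf(7, 43556)           # listed as 9.602362
w_entry_1 = field_weight(1.0, 7, 43556, 0.625)   # listed as 6.0014763
w_entry_5 = field_weight(1.0, 7, 43556, 0.5)     # listed as 4.801181
```

The four single-author records share fieldNorm 0.625 and therefore the same score, while the co-authored record has a longer author field, a smaller fieldNorm (0.5), and a lower score.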
    

Similar documents (content)

  1. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.12
    0.11757052 = sum of:
      0.11757052 = product of:
        0.5878526 = sum of:
          0.11779187 = weight(abstract_txt:naive in 2806) [ClassicSimilarity], result of:
            0.11779187 = score(doc=2806,freq=2.0), product of:
              0.16028166 = queryWeight, product of:
                8.314507 = idf(docFreq=28, maxDocs=43556)
                0.019277351 = queryNorm
              0.7349055 = fieldWeight in 2806, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.314507 = idf(docFreq=28, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.017565707 = weight(abstract_txt:from in 2806) [ClassicSimilarity], result of:
            0.017565707 = score(doc=2806,freq=2.0), product of:
              0.07154908 = queryWeight, product of:
                1.3362573 = boost
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.019277351 = queryNorm
              0.2455057 = fieldWeight in 2806, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.123216346 = weight(abstract_txt:category in 2806) [ClassicSimilarity], result of:
            0.123216346 = score(doc=2806,freq=3.0), product of:
              0.18178809 = queryWeight, product of:
                1.5061069 = boost
                6.2612677 = idf(docFreq=225, maxDocs=43556)
                0.019277351 = queryNorm
              0.6778021 = fieldWeight in 2806, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2612677 = idf(docFreq=225, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.05337005 = weight(abstract_txt:survey in 2806) [ClassicSimilarity], result of:
            0.05337005 = score(doc=2806,freq=1.0), product of:
              0.1718129 = queryWeight, product of:
                1.7932738 = boost
                4.9700623 = idf(docFreq=821, maxDocs=43556)
                0.019277351 = queryNorm
              0.3106289 = fieldWeight in 2806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9700623 = idf(docFreq=821, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.2759086 = weight(abstract_txt:classifiers in 2806) [ClassicSimilarity], result of:
            0.2759086 = score(doc=2806,freq=5.0), product of:
              0.26243016 = queryWeight, product of:
                1.809589 = boost
                7.5229197 = idf(docFreq=63, maxDocs=43556)
                0.019277351 = queryNorm
              1.05136 = fieldWeight in 2806, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.5229197 = idf(docFreq=63, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
        0.2 = coord(5/25)
    
  2. Sebastiani, F.: Classification of text, automatic (2006) 0.11
    0.110793725 = sum of:
      0.110793725 = product of:
        0.5539686 = sum of:
          0.032270268 = weight(abstract_txt:from in 1) [ClassicSimilarity], result of:
            0.032270268 = score(doc=1,freq=3.0), product of:
              0.07154908 = queryWeight, product of:
                1.3362573 = boost
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.019277351 = queryNorm
              0.4510228 = fieldWeight in 1, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.09375 = fieldNorm(doc=1)
          0.034570366 = weight(abstract_txt:approach in 1) [ClassicSimilarity], result of:
            0.034570366 = score(doc=1,freq=1.0), product of:
              0.09815954 = queryWeight, product of:
                1.3554546 = boost
                3.7566452 = idf(docFreq=2765, maxDocs=43556)
                0.019277351 = queryNorm
              0.3521855 = fieldWeight in 1, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7566452 = idf(docFreq=2765, maxDocs=43556)
                0.09375 = fieldNorm(doc=1)
          0.08005508 = weight(abstract_txt:survey in 1) [ClassicSimilarity], result of:
            0.08005508 = score(doc=1,freq=1.0), product of:
              0.1718129 = queryWeight, product of:
                1.7932738 = boost
                4.9700623 = idf(docFreq=821, maxDocs=43556)
                0.019277351 = queryNorm
              0.46594334 = fieldWeight in 1, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9700623 = idf(docFreq=821, maxDocs=43556)
                0.09375 = fieldNorm(doc=1)
          0.1850851 = weight(abstract_txt:classifiers in 1) [ClassicSimilarity], result of:
            0.1850851 = score(doc=1,freq=1.0), product of:
              0.26243016 = queryWeight, product of:
                1.809589 = boost
                7.5229197 = idf(docFreq=63, maxDocs=43556)
                0.019277351 = queryNorm
              0.70527375 = fieldWeight in 1, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5229197 = idf(docFreq=63, maxDocs=43556)
                0.09375 = fieldNorm(doc=1)
          0.22198777 = weight(abstract_txt:learner in 1) [ClassicSimilarity], result of:
            0.22198777 = score(doc=1,freq=1.0), product of:
              0.29624575 = queryWeight, product of:
                1.9226452 = boost
                7.9929233 = idf(docFreq=39, maxDocs=43556)
                0.019277351 = queryNorm
              0.74933654 = fieldWeight in 1, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9929233 = idf(docFreq=39, maxDocs=43556)
                0.09375 = fieldNorm(doc=1)
        0.2 = coord(5/25)
    
  3. Ko, Y.; Park, J.; Seo, J.: Improving text categorization using the importance of sentences (2004) 0.11
    0.10521445 = sum of:
      0.10521445 = product of:
        0.43839353 = sum of:
          0.083291434 = weight(abstract_txt:naive in 3555) [ClassicSimilarity], result of:
            0.083291434 = score(doc=3555,freq=1.0), product of:
              0.16028166 = queryWeight, product of:
                8.314507 = idf(docFreq=28, maxDocs=43556)
                0.019277351 = queryNorm
              0.51965666 = fieldWeight in 3555, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.314507 = idf(docFreq=28, maxDocs=43556)
                0.0625 = fieldNorm(doc=3555)
          0.026904888 = weight(abstract_txt:techniques in 3555) [ClassicSimilarity], result of:
            0.026904888 = score(doc=3555,freq=1.0), product of:
              0.09507093 = queryWeight, product of:
                1.0891732 = boost
                4.527969 = idf(docFreq=1278, maxDocs=43556)
                0.019277351 = queryNorm
              0.28299806 = fieldWeight in 3555, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.527969 = idf(docFreq=1278, maxDocs=43556)
                0.0625 = fieldNorm(doc=3555)
          0.05698416 = weight(abstract_txt:sets in 3555) [ClassicSimilarity], result of:
            0.05698416 = score(doc=3555,freq=2.0), product of:
              0.124447554 = queryWeight, product of:
                1.2461383 = boost
                5.180513 = idf(docFreq=665, maxDocs=43556)
                0.019277351 = queryNorm
              0.45789698 = fieldWeight in 3555, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.180513 = idf(docFreq=665, maxDocs=43556)
                0.0625 = fieldNorm(doc=3555)
          0.012420831 = weight(abstract_txt:from in 3555) [ClassicSimilarity], result of:
            0.012420831 = score(doc=3555,freq=1.0), product of:
              0.07154908 = queryWeight, product of:
                1.3362573 = boost
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.019277351 = queryNorm
              0.17359875 = fieldWeight in 3555, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.0625 = fieldNorm(doc=3555)
          0.08429232 = weight(abstract_txt:categorization in 3555) [ClassicSimilarity], result of:
            0.08429232 = score(doc=3555,freq=1.0), product of:
              0.2035568 = queryWeight, product of:
                1.5937343 = boost
                6.625557 = idf(docFreq=156, maxDocs=43556)
                0.019277351 = queryNorm
              0.4140973 = fieldWeight in 3555, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.625557 = idf(docFreq=156, maxDocs=43556)
                0.0625 = fieldNorm(doc=3555)
          0.1744999 = weight(abstract_txt:classifiers in 3555) [ClassicSimilarity], result of:
            0.1744999 = score(doc=3555,freq=2.0), product of:
              0.26243016 = queryWeight, product of:
                1.809589 = boost
                7.5229197 = idf(docFreq=63, maxDocs=43556)
                0.019277351 = queryNorm
              0.66493845 = fieldWeight in 3555, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5229197 = idf(docFreq=63, maxDocs=43556)
                0.0625 = fieldNorm(doc=3555)
        0.24 = coord(6/25)
    
  4. Liu, R.-L.: A passage extractor for classification of disease aspect information (2013) 0.09
    0.08982813 = sum of:
      0.08982813 = product of:
        0.44914067 = sum of:
          0.025288876 = weight(abstract_txt:three in 3105) [ClassicSimilarity], result of:
            0.025288876 = score(doc=3105,freq=1.0), product of:
              0.09122488 = queryWeight, product of:
                1.0669148 = boost
                4.435435 = idf(docFreq=1402, maxDocs=43556)
                0.019277351 = queryNorm
              0.27721468 = fieldWeight in 3105, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.435435 = idf(docFreq=1402, maxDocs=43556)
                0.0625 = fieldNorm(doc=3105)
          0.038049255 = weight(abstract_txt:techniques in 3105) [ClassicSimilarity], result of:
            0.038049255 = score(doc=3105,freq=2.0), product of:
              0.09507093 = queryWeight, product of:
                1.0891732 = boost
                4.527969 = idf(docFreq=1278, maxDocs=43556)
                0.019277351 = queryNorm
              0.40021968 = fieldWeight in 3105, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.527969 = idf(docFreq=1278, maxDocs=43556)
                0.0625 = fieldNorm(doc=3105)
          0.012420831 = weight(abstract_txt:from in 3105) [ClassicSimilarity], result of:
            0.012420831 = score(doc=3105,freq=1.0), product of:
              0.07154908 = queryWeight, product of:
                1.3362573 = boost
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.019277351 = queryNorm
              0.17359875 = fieldWeight in 3105, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.0625 = fieldNorm(doc=3105)
          0.07113899 = weight(abstract_txt:category in 3105) [ClassicSimilarity], result of:
            0.07113899 = score(doc=3105,freq=1.0), product of:
              0.18178809 = queryWeight, product of:
                1.5061069 = boost
                6.2612677 = idf(docFreq=225, maxDocs=43556)
                0.019277351 = queryNorm
              0.39132923 = fieldWeight in 3105, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2612677 = idf(docFreq=225, maxDocs=43556)
                0.0625 = fieldNorm(doc=3105)
          0.3022427 = weight(abstract_txt:classifiers in 3105) [ClassicSimilarity], result of:
            0.3022427 = score(doc=3105,freq=6.0), product of:
              0.26243016 = queryWeight, product of:
                1.809589 = boost
                7.5229197 = idf(docFreq=63, maxDocs=43556)
                0.019277351 = queryNorm
              1.1517072 = fieldWeight in 3105, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.5229197 = idf(docFreq=63, maxDocs=43556)
                0.0625 = fieldNorm(doc=3105)
        0.2 = coord(5/25)
    
  5. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.08
    0.08313312 = sum of:
      0.08313312 = product of:
        0.34638798 = sum of:
          0.031611092 = weight(abstract_txt:three in 4387) [ClassicSimilarity], result of:
            0.031611092 = score(doc=4387,freq=1.0), product of:
              0.09122488 = queryWeight, product of:
                1.0669148 = boost
                4.435435 = idf(docFreq=1402, maxDocs=43556)
                0.019277351 = queryNorm
              0.34651834 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.435435 = idf(docFreq=1402, maxDocs=43556)
                0.078125 = fieldNorm(doc=4387)
          0.03363111 = weight(abstract_txt:techniques in 4387) [ClassicSimilarity], result of:
            0.03363111 = score(doc=4387,freq=1.0), product of:
              0.09507093 = queryWeight, product of:
                1.0891732 = boost
                4.527969 = idf(docFreq=1278, maxDocs=43556)
                0.019277351 = queryNorm
              0.35374758 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.527969 = idf(docFreq=1278, maxDocs=43556)
                0.078125 = fieldNorm(doc=4387)
          0.015526039 = weight(abstract_txt:from in 4387) [ClassicSimilarity], result of:
            0.015526039 = score(doc=4387,freq=1.0), product of:
              0.07154908 = queryWeight, product of:
                1.3362573 = boost
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.019277351 = queryNorm
              0.21699844 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.77758 = idf(docFreq=7362, maxDocs=43556)
                0.078125 = fieldNorm(doc=4387)
          0.04989802 = weight(abstract_txt:approach in 4387) [ClassicSimilarity], result of:
            0.04989802 = score(doc=4387,freq=3.0), product of:
              0.09815954 = queryWeight, product of:
                1.3554546 = boost
                3.7566452 = idf(docFreq=2765, maxDocs=43556)
                0.019277351 = queryNorm
              0.50833595 = fieldWeight in 4387, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.7566452 = idf(docFreq=2765, maxDocs=43556)
                0.078125 = fieldNorm(doc=4387)
          0.14900918 = weight(abstract_txt:categorization in 4387) [ClassicSimilarity], result of:
            0.14900918 = score(doc=4387,freq=2.0), product of:
              0.2035568 = queryWeight, product of:
                1.5937343 = boost
                6.625557 = idf(docFreq=156, maxDocs=43556)
                0.019277351 = queryNorm
              0.73202753 = fieldWeight in 4387, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.625557 = idf(docFreq=156, maxDocs=43556)
                0.078125 = fieldNorm(doc=4387)
          0.066712566 = weight(abstract_txt:survey in 4387) [ClassicSimilarity], result of:
            0.066712566 = score(doc=4387,freq=1.0), product of:
              0.1718129 = queryWeight, product of:
                1.7932738 = boost
                4.9700623 = idf(docFreq=821, maxDocs=43556)
                0.019277351 = queryNorm
              0.3882861 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9700623 = idf(docFreq=821, maxDocs=43556)
                0.078125 = fieldNorm(doc=4387)
        0.24 = coord(6/25)
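The content-match trees combine several term scores. As a worked check, the sketch below recomputes the score of the first content match (doc 2806) from the frequencies, document frequencies, boosts, queryNorm, fieldNorm, and coord factor shown in its tree. It assumes the same tf and idf definitions that the listed values are consistent with; the helper names are illustrative, not Lucene API calls.

```python
import math

MAX_DOCS = 43556
QUERY_NORM = 0.019277351   # copied from the listing
FIELD_NORM = 0.0625        # fieldNorm(doc=2806)

def classic_idf(doc_freq, max_docs=MAX_DOCS):
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

def term_score(freq, doc_freq, boost=1.0):
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM          # queryWeight in the tree
    field_weight = math.sqrt(freq) * idf * FIELD_NORM  # fieldWeight in the tree
    return query_weight * field_weight

# (term, freq in doc 2806, docFreq, boost) as shown in the explain tree.
matched_terms = [
    ("naive",       2.0,   28, 1.0),
    ("from",        2.0, 7362, 1.3362573),
    ("category",    3.0,  225, 1.5061069),
    ("survey",      1.0,  821, 1.7932738),
    ("classifiers", 5.0,   63, 1.809589),
]

term_scores = {t: term_score(f, df, b) for t, f, df, b in matched_terms}
coord = 5 / 25  # 5 of the 25 query terms matched
total = sum(term_scores.values()) * coord  # listed as 0.11757052
```

The coord factor scales the summed term scores by the fraction of query terms found in the document, which is why a record matching 6 of 25 terms (coord 0.24) is penalized less than one matching 5 of 25.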