Document (#28458)

Author
Debole, F.
Sebastiani, F.
Title
An analysis of the relative hardness of Reuters-21578 subsets
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, p.584-596
Year
2005
Abstract
The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have "carved" different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have, or will be, tested on these different subsets.
Theme
Retrieval studies

Similar documents (author)

  1. Sebastiani, F.: On the role of logic in information retrieval (1998) 5.99
    5.9875464 = sum of:
      5.9875464 = weight(author_txt:sebastiani in 2141) [ClassicSimilarity], result of:
        5.9875464 = fieldWeight in 2141, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.625 = fieldNorm(doc=2141)
    
  2. Sebastiani, F.: Machine learning in automated text categorization (2002) 5.99
    5.9875464 = sum of:
      5.9875464 = weight(author_txt:sebastiani in 4390) [ClassicSimilarity], result of:
        5.9875464 = fieldWeight in 4390, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.625 = fieldNorm(doc=4390)
    
  3. Sebastiani, F.: A tutorial on automated text categorisation (1999) 5.99
    5.9875464 = sum of:
      5.9875464 = weight(author_txt:sebastiani in 4391) [ClassicSimilarity], result of:
        5.9875464 = fieldWeight in 4391, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.625 = fieldNorm(doc=4391)
    
  4. Sebastiani, F.: Classification of text, automatic (2006) 5.99
    5.9875464 = sum of:
      5.9875464 = weight(author_txt:sebastiani in 4) [ClassicSimilarity], result of:
        5.9875464 = fieldWeight in 4, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.625 = fieldNorm(doc=4)
    
  5. Giorgetti, D.; Sebastiani, F.: Automating survey coding by multiclass text categorization techniques (2003) 4.79
    4.790037 = sum of:
      4.790037 = weight(author_txt:sebastiani in 173) [ClassicSimilarity], result of:
        4.790037 = fieldWeight in 173, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.580074 = idf(docFreq=7, maxDocs=42596)
          0.5 = fieldNorm(doc=173)
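
The author-match scores listed above follow the arithmetic fieldWeight = tf x idf x fieldNorm. The following is a minimal Python sketch of that calculation, assuming the usual ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are illustrative and not part of Lucene's API, and the fieldNorm values are copied from the listing rather than derived.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency, assuming idf = 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq):
    """Term frequency factor, assuming tf = sqrt(freq)."""
    return math.sqrt(freq)


def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm, as laid out in the explain tree."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values taken from the listing: docFreq=7, maxDocs=42596, termFreq=1.0.
single_author = field_weight(1.0, 7, 42596, 0.625)  # entries 1 to 4
two_authors = field_weight(1.0, 7, 42596, 0.5)      # entry 5

print(round(single_author, 2), round(two_authors, 2))  # 5.99 4.79
```

The only factor that differs between the two cases is fieldNorm, which is consistent with length normalization: the two-author record has a longer author field and therefore a lower norm.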
    

Similar documents (content)

  1. Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.24
    0.24163227 = sum of:
      0.24163227 = product of:
        0.6040807 = sum of:
          0.009134622 = weight(abstract_txt:they in 102) [ClassicSimilarity], result of:
            0.009134622 = score(doc=102,freq=1.0), product of:
              0.038631782 = queryWeight, product of:
                1.0087833 = boost
                3.7832568 = idf(docFreq=2633, maxDocs=42596)
                0.010122342 = queryNorm
              0.23645355 = fieldWeight in 102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7832568 = idf(docFreq=2633, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.017752472 = weight(abstract_txt:standard in 102) [ClassicSimilarity], result of:
            0.017752472 = score(doc=102,freq=1.0), product of:
              0.060162183 = queryWeight, product of:
                1.2588887 = boost
                4.721231 = idf(docFreq=1030, maxDocs=42596)
                0.010122342 = queryNorm
              0.29507694 = fieldWeight in 102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.721231 = idf(docFreq=1030, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.008446796 = weight(abstract_txt:these in 102) [ClassicSimilarity], result of:
            0.008446796 = score(doc=102,freq=1.0), product of:
              0.04197359 = queryWeight, product of:
                1.287832 = boost
                3.2198517 = idf(docFreq=4626, maxDocs=42596)
                0.010122342 = queryNorm
              0.20124073 = fieldWeight in 102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2198517 = idf(docFreq=4626, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.00579891 = weight(abstract_txt:that in 102) [ClassicSimilarity], result of:
            0.00579891 = score(doc=102,freq=1.0), product of:
              0.038728315 = queryWeight, product of:
                1.5970181 = boost
                2.3957293 = idf(docFreq=10548, maxDocs=42596)
                0.010122342 = queryNorm
              0.14973308 = fieldWeight in 102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3957293 = idf(docFreq=10548, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.012318927 = weight(abstract_txt:this in 102) [ClassicSimilarity], result of:
            0.012318927 = score(doc=102,freq=4.0), product of:
              0.04031744 = queryWeight, product of:
                1.6294537 = boost
                2.4443867 = idf(docFreq=10047, maxDocs=42596)
                0.010122342 = queryNorm
              0.30554834 = fieldWeight in 102, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4443867 = idf(docFreq=10047, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.028081788 = weight(abstract_txt:been in 102) [ClassicSimilarity], result of:
            0.028081788 = score(doc=102,freq=3.0), product of:
              0.07135161 = queryWeight, product of:
                1.938842 = boost
                3.6356356 = idf(docFreq=3052, maxDocs=42596)
                0.010122342 = queryNorm
              0.39356908 = fieldWeight in 102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.6356356 = idf(docFreq=3052, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.029054701 = weight(abstract_txt:researchers in 102) [ClassicSimilarity], result of:
            0.029054701 = score(doc=102,freq=1.0), product of:
              0.09564429 = queryWeight, product of:
                1.9440198 = boost
                4.86046 = idf(docFreq=896, maxDocs=42596)
                0.010122342 = queryNorm
              0.30377874 = fieldWeight in 102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.86046 = idf(docFreq=896, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.024197616 = weight(abstract_txt:have in 102) [ClassicSimilarity], result of:
            0.024197616 = score(doc=102,freq=2.0), product of:
              0.08466356 = queryWeight, product of:
                2.5866308 = boost
                3.233561 = idf(docFreq=4563, maxDocs=42596)
                0.010122342 = queryNorm
              0.2858091 = fieldWeight in 102, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.233561 = idf(docFreq=4563, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.14512077 = weight(abstract_txt:reuters in 102) [ClassicSimilarity], result of:
            0.14512077 = score(doc=102,freq=1.0), product of:
              0.30759603 = queryWeight, product of:
                4.025603 = boost
                7.5486417 = idf(docFreq=60, maxDocs=42596)
                0.010122342 = queryNorm
              0.4717901 = fieldWeight in 102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5486417 = idf(docFreq=60, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
          0.3241741 = weight(abstract_txt:21578 in 102) [ClassicSimilarity], result of:
            0.3241741 = score(doc=102,freq=1.0), product of:
              0.5256297 = queryWeight, product of:
                5.262359 = boost
                9.867756 = idf(docFreq=5, maxDocs=42596)
                0.010122342 = queryNorm
              0.61673474 = fieldWeight in 102, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.867756 = idf(docFreq=5, maxDocs=42596)
                0.0625 = fieldNorm(doc=102)
        0.4 = coord(10/25)
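
The score of this first content match can be recomputed from the tree above: each matching term contributes queryWeight x fieldWeight, and the sum is scaled by coord (matching terms divided by query terms). The following Python sketch is one way to reproduce it, assuming ClassicSimilarity's tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The boost, queryNorm and fieldNorm values are taken from the listing as given, and the helper names are hypothetical.

```python
import math

MAX_DOCS = 42596          # maxDocs from the listing
QUERY_NORM = 0.010122342  # queryNorm from the listing


def classic_idf(doc_freq, max_docs=MAX_DOCS):
    """Assumed ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, freq, doc_freq, field_norm):
    """score = queryWeight * fieldWeight for one matching term."""
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


# (boost, termFreq, docFreq) for the ten terms matched in doc 102.
TERMS_DOC_102 = [
    (1.0087833, 1.0, 2633),   # they
    (1.2588887, 1.0, 1030),   # standard
    (1.287832, 1.0, 4626),    # these
    (1.5970181, 1.0, 10548),  # that
    (1.6294537, 4.0, 10047),  # this
    (1.938842, 3.0, 3052),    # been
    (1.9440198, 1.0, 896),    # researchers
    (2.5866308, 2.0, 4563),   # have
    (4.025603, 1.0, 60),      # reuters
    (5.262359, 1.0, 5),       # 21578
]


def document_score(terms, field_norm, query_terms):
    """Sum the per-term scores, then apply coord = matched / query_terms."""
    total = sum(term_score(b, f, df, field_norm) for b, f, df in terms)
    coord = len(terms) / query_terms
    return coord * total


score_102 = document_score(TERMS_DOC_102, 0.0625, 25)
print(round(score_102, 2))  # 0.24
```

Most of the total comes from the two rare terms "reuters" and "21578", whose high idf enters both queryWeight and fieldWeight.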
    
  2. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.15
    0.14831068 = sum of:
      0.14831068 = product of:
        0.52968097 = sum of:
          0.017071707 = weight(abstract_txt:collection in 2809) [ClassicSimilarity], result of:
            0.017071707 = score(doc=2809,freq=1.0), product of:
              0.05861413 = queryWeight, product of:
                1.2425867 = boost
                4.6600933 = idf(docFreq=1095, maxDocs=42596)
                0.010122342 = queryNorm
              0.29125583 = fieldWeight in 2809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6600933 = idf(docFreq=1095, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.008446796 = weight(abstract_txt:these in 2809) [ClassicSimilarity], result of:
            0.008446796 = score(doc=2809,freq=1.0), product of:
              0.04197359 = queryWeight, product of:
                1.287832 = boost
                3.2198517 = idf(docFreq=4626, maxDocs=42596)
                0.010122342 = queryNorm
              0.20124073 = fieldWeight in 2809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2198517 = idf(docFreq=4626, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.01159782 = weight(abstract_txt:that in 2809) [ClassicSimilarity], result of:
            0.01159782 = score(doc=2809,freq=4.0), product of:
              0.038728315 = queryWeight, product of:
                1.5970181 = boost
                2.3957293 = idf(docFreq=10548, maxDocs=42596)
                0.010122342 = queryNorm
              0.29946616 = fieldWeight in 2809, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3957293 = idf(docFreq=10548, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.0061594634 = weight(abstract_txt:this in 2809) [ClassicSimilarity], result of:
            0.0061594634 = score(doc=2809,freq=1.0), product of:
              0.04031744 = queryWeight, product of:
                1.6294537 = boost
                2.4443867 = idf(docFreq=10047, maxDocs=42596)
                0.010122342 = queryNorm
              0.15277417 = fieldWeight in 2809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4443867 = idf(docFreq=10047, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.0171103 = weight(abstract_txt:have in 2809) [ClassicSimilarity], result of:
            0.0171103 = score(doc=2809,freq=1.0), product of:
              0.08466356 = queryWeight, product of:
                2.5866308 = boost
                3.233561 = idf(docFreq=4563, maxDocs=42596)
                0.010122342 = queryNorm
              0.20209756 = fieldWeight in 2809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.233561 = idf(docFreq=4563, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.14512077 = weight(abstract_txt:reuters in 2809) [ClassicSimilarity], result of:
            0.14512077 = score(doc=2809,freq=1.0), product of:
              0.30759603 = queryWeight, product of:
                4.025603 = boost
                7.5486417 = idf(docFreq=60, maxDocs=42596)
                0.010122342 = queryNorm
              0.4717901 = fieldWeight in 2809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5486417 = idf(docFreq=60, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
          0.3241741 = weight(abstract_txt:21578 in 2809) [ClassicSimilarity], result of:
            0.3241741 = score(doc=2809,freq=1.0), product of:
              0.5256297 = queryWeight, product of:
                5.262359 = boost
                9.867756 = idf(docFreq=5, maxDocs=42596)
                0.010122342 = queryNorm
              0.61673474 = fieldWeight in 2809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.867756 = idf(docFreq=5, maxDocs=42596)
                0.0625 = fieldNorm(doc=2809)
        0.28 = coord(7/25)
    
  3. Egghe, L.; Rousseau, R.: A theoretical study of recall and precision using a topological approach to information retrieval (1998) 0.13
    0.13160814 = sum of:
      0.13160814 = product of:
        0.8225509 = sum of:
          0.035504945 = weight(abstract_txt:standard in 4268) [ClassicSimilarity], result of:
            0.035504945 = score(doc=4268,freq=1.0), product of:
              0.060162183 = queryWeight, product of:
                1.2588887 = boost
                4.721231 = idf(docFreq=1030, maxDocs=42596)
                0.010122342 = queryNorm
              0.5901539 = fieldWeight in 4268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.721231 = idf(docFreq=1030, maxDocs=42596)
                0.125 = fieldNorm(doc=4268)
          0.053693723 = weight(abstract_txt:systems in 4268) [ClassicSimilarity], result of:
            0.053693723 = score(doc=4268,freq=4.0), product of:
              0.062912464 = queryWeight, product of:
                1.8205763 = boost
                3.4138687 = idf(docFreq=3810, maxDocs=42596)
                0.010122342 = queryNorm
              0.85346717 = fieldWeight in 4268, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4138687 = idf(docFreq=3810, maxDocs=42596)
                0.125 = fieldNorm(doc=4268)
          0.042503115 = weight(abstract_txt:different in 4268) [ClassicSimilarity], result of:
            0.042503115 = score(doc=4268,freq=1.0), product of:
              0.092057295 = queryWeight, product of:
                2.4622076 = boost
                3.6936228 = idf(docFreq=2880, maxDocs=42596)
                0.010122342 = queryNorm
              0.46170285 = fieldWeight in 4268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6936228 = idf(docFreq=2880, maxDocs=42596)
                0.125 = fieldNorm(doc=4268)
          0.6908491 = weight(abstract_txt:subsets in 4268) [ClassicSimilarity], result of:
            0.6908491 = score(doc=4268,freq=1.0), product of:
              0.66080886 = queryWeight, product of:
                7.805442 = boost
                8.363679 = idf(docFreq=26, maxDocs=42596)
                0.010122342 = queryNorm
              1.0454599 = fieldWeight in 4268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.363679 = idf(docFreq=26, maxDocs=42596)
                0.125 = fieldNorm(doc=4268)
        0.16 = coord(4/25)
    
  4. Hung, C.-M.; Chien, L.-F.: Web-based text classification in the absence of manually labeled training documents (2007) 0.12
    0.12319746 = sum of:
      0.12319746 = product of:
        0.6159873 = sum of:
          0.011418277 = weight(abstract_txt:they in 1267) [ClassicSimilarity], result of:
            0.011418277 = score(doc=1267,freq=1.0), product of:
              0.038631782 = queryWeight, product of:
                1.0087833 = boost
                3.7832568 = idf(docFreq=2633, maxDocs=42596)
                0.010122342 = queryNorm
              0.29556695 = fieldWeight in 1267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7832568 = idf(docFreq=2633, maxDocs=42596)
                0.078125 = fieldNorm(doc=1267)
          0.010251121 = weight(abstract_txt:that in 1267) [ClassicSimilarity], result of:
            0.010251121 = score(doc=1267,freq=2.0), product of:
              0.038728315 = queryWeight, product of:
                1.5970181 = boost
                2.3957293 = idf(docFreq=10548, maxDocs=42596)
                0.010122342 = queryNorm
              0.26469317 = fieldWeight in 1267, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3957293 = idf(docFreq=10548, maxDocs=42596)
                0.078125 = fieldNorm(doc=1267)
          0.007699329 = weight(abstract_txt:this in 1267) [ClassicSimilarity], result of:
            0.007699329 = score(doc=1267,freq=1.0), product of:
              0.04031744 = queryWeight, product of:
                1.6294537 = boost
                2.4443867 = idf(docFreq=10047, maxDocs=42596)
                0.010122342 = queryNorm
              0.19096771 = fieldWeight in 1267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4443867 = idf(docFreq=10047, maxDocs=42596)
                0.078125 = fieldNorm(doc=1267)
          0.18140095 = weight(abstract_txt:reuters in 1267) [ClassicSimilarity], result of:
            0.18140095 = score(doc=1267,freq=1.0), product of:
              0.30759603 = queryWeight, product of:
                4.025603 = boost
                7.5486417 = idf(docFreq=60, maxDocs=42596)
                0.010122342 = queryNorm
              0.58973765 = fieldWeight in 1267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5486417 = idf(docFreq=60, maxDocs=42596)
                0.078125 = fieldNorm(doc=1267)
          0.40521762 = weight(abstract_txt:21578 in 1267) [ClassicSimilarity], result of:
            0.40521762 = score(doc=1267,freq=1.0), product of:
              0.5256297 = queryWeight, product of:
                5.262359 = boost
                9.867756 = idf(docFreq=5, maxDocs=42596)
                0.010122342 = queryNorm
              0.7709184 = fieldWeight in 1267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.867756 = idf(docFreq=5, maxDocs=42596)
                0.078125 = fieldNorm(doc=1267)
        0.2 = coord(5/25)
    
  5. Thelwall, M.; Harries, G.: The connection between the research of a university and counts of links to its Web pages : an investigation based upon a classification of the relationships of pages to the research of the host university (2003) 0.12
    0.117095605 = sum of:
      0.117095605 = product of:
        0.585478 = sum of:
          0.012301345 = weight(abstract_txt:that in 2677) [ClassicSimilarity], result of:
            0.012301345 = score(doc=2677,freq=2.0), product of:
              0.038728315 = queryWeight, product of:
                1.5970181 = boost
                2.3957293 = idf(docFreq=10548, maxDocs=42596)
                0.010122342 = queryNorm
              0.3176318 = fieldWeight in 2677, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3957293 = idf(docFreq=10548, maxDocs=42596)
                0.09375 = fieldNorm(doc=2677)
          0.009239195 = weight(abstract_txt:this in 2677) [ClassicSimilarity], result of:
            0.009239195 = score(doc=2677,freq=1.0), product of:
              0.04031744 = queryWeight, product of:
                1.6294537 = boost
                2.4443867 = idf(docFreq=10047, maxDocs=42596)
                0.010122342 = queryNorm
              0.22916126 = fieldWeight in 2677, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4443867 = idf(docFreq=10047, maxDocs=42596)
                0.09375 = fieldNorm(doc=2677)
          0.020135146 = weight(abstract_txt:systems in 2677) [ClassicSimilarity], result of:
            0.020135146 = score(doc=2677,freq=1.0), product of:
              0.062912464 = queryWeight, product of:
                1.8205763 = boost
                3.4138687 = idf(docFreq=3810, maxDocs=42596)
                0.010122342 = queryNorm
              0.32005018 = fieldWeight in 2677, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4138687 = idf(docFreq=3810, maxDocs=42596)
                0.09375 = fieldNorm(doc=2677)
          0.02566545 = weight(abstract_txt:have in 2677) [ClassicSimilarity], result of:
            0.02566545 = score(doc=2677,freq=1.0), product of:
              0.08466356 = queryWeight, product of:
                2.5866308 = boost
                3.233561 = idf(docFreq=4563, maxDocs=42596)
                0.010122342 = queryNorm
              0.30314636 = fieldWeight in 2677, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.233561 = idf(docFreq=4563, maxDocs=42596)
                0.09375 = fieldNorm(doc=2677)
          0.51813686 = weight(abstract_txt:subsets in 2677) [ClassicSimilarity], result of:
            0.51813686 = score(doc=2677,freq=1.0), product of:
              0.66080886 = queryWeight, product of:
                7.805442 = boost
                8.363679 = idf(docFreq=26, maxDocs=42596)
                0.010122342 = queryNorm
              0.7840949 = fieldWeight in 2677, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.363679 = idf(docFreq=26, maxDocs=42596)
                0.09375 = fieldNorm(doc=2677)
        0.2 = coord(5/25)