Document (#28457)

Author
Debole, F.
Sebastiani, F.
Title
An analysis of the relative hardness of Reuters-21578 subsets
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, pp. 584-596
Year
2005
Abstract
The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have "carved" different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have been, or will be, tested on these different subsets.
Theme
Retrieval studies

Similar documents (author)

  1. Sebastiani, F.: On the role of logic in information retrieval (1998) 5.94
    5.937289 = sum of:
      5.937289 = weight(author_txt:sebastiani in 1140) [ClassicSimilarity], result of:
        5.937289 = fieldWeight in 1140, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.625 = fieldNorm(doc=1140)
    
  2. Sebastiani, F.: Machine learning in automated text categorization (2002) 5.94
    5.937289 = sum of:
      5.937289 = weight(author_txt:sebastiani in 3389) [ClassicSimilarity], result of:
        5.937289 = fieldWeight in 3389, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.625 = fieldNorm(doc=3389)
    
  3. Sebastiani, F.: A tutorial on automated text categorisation (1999) 5.94
    5.937289 = sum of:
      5.937289 = weight(author_txt:sebastiani in 3390) [ClassicSimilarity], result of:
        5.937289 = fieldWeight in 3390, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.625 = fieldNorm(doc=3390)
    
  4. Sebastiani, F.: Classification of text, automatic (2006) 5.94
    5.937289 = sum of:
      5.937289 = weight(author_txt:sebastiani in 5003) [ClassicSimilarity], result of:
        5.937289 = fieldWeight in 5003, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.625 = fieldNorm(doc=5003)
    
  5. Giorgetti, D.; Sebastiani, F.: Automating survey coding by multiclass text categorization techniques (2003) 4.75
    4.749831 = sum of:
      4.749831 = weight(author_txt:sebastiani in 5172) [ClassicSimilarity], result of:
        4.749831 = fieldWeight in 5172, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.5 = fieldNorm(doc=5172)
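
The author-match explanations above all follow the same pattern: a field weight obtained as tf × idf × fieldNorm. The following is a minimal sketch of that arithmetic, assuming the classic Lucene TF-IDF formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which are consistent with the values printed in this record. The function names are illustrative and do not belong to any library, and Lucene itself computes in single precision, so the last digits may differ slightly.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency (assumed formula)."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)) (assumed formula)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """Field weight as shown in the explanation tree: tf * idf * fieldNorm."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Values taken from the explanation trees above (author_txt:sebastiani).
single_author = field_weight(1.0, 8, 44218, 0.625)  # entries 1 to 4
two_authors = field_weight(1.0, 8, 44218, 0.5)      # entry 5

print(round(single_author, 4))  # 5.9373
print(round(two_authors, 4))    # 4.7498
```

The only factor that differs between entries 1 to 4 and entry 5 is fieldNorm (0.625 versus 0.5), which accounts for the lower score of the co-authored record.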
    

Similar documents (content)

  1. Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.24
    0.24197517 = sum of:
      0.24197517 = product of:
        0.6049379 = sum of:
          0.008896571 = weight(abstract_txt:they in 4101) [ClassicSimilarity], result of:
            0.008896571 = score(doc=4101,freq=1.0), product of:
              0.03793806 = queryWeight, product of:
                1.0015866 = boost
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.010095296 = queryNorm
              0.23450254 = fieldWeight in 4101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.017769933 = weight(abstract_txt:standard in 4101) [ClassicSimilarity], result of:
            0.017769933 = score(doc=4101,freq=1.0), product of:
              0.06017053 = queryWeight, product of:
                1.261371 = boost
                4.725219 = idf(docFreq=1065, maxDocs=44218)
                0.010095296 = queryNorm
              0.29532617 = fieldWeight in 4101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.725219 = idf(docFreq=1065, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.008186894 = weight(abstract_txt:these in 4101) [ClassicSimilarity], result of:
            0.008186894 = score(doc=4101,freq=1.0), product of:
              0.041086882 = queryWeight, product of:
                1.2765803 = boost
                3.1881294 = idf(docFreq=4957, maxDocs=44218)
                0.010095296 = queryNorm
              0.19925809 = fieldWeight in 4101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1881294 = idf(docFreq=4957, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.005601657 = weight(abstract_txt:that in 4101) [ClassicSimilarity], result of:
            0.005601657 = score(doc=4101,freq=1.0), product of:
              0.037825473 = queryWeight, product of:
                1.5812958 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.010095296 = queryNorm
              0.1480922 = fieldWeight in 4101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.011832362 = weight(abstract_txt:this in 4101) [ClassicSimilarity], result of:
            0.011832362 = score(doc=4101,freq=4.0), product of:
              0.039228432 = queryWeight, product of:
                1.6103542 = boost
                2.4130175 = idf(docFreq=10762, maxDocs=44218)
                0.010095296 = queryNorm
              0.3016272 = fieldWeight in 4101, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4130175 = idf(docFreq=10762, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.02803253 = weight(abstract_txt:researchers in 4101) [ClassicSimilarity], result of:
            0.02803253 = score(doc=4101,freq=1.0), product of:
              0.09333946 = queryWeight, product of:
                1.9241068 = boost
                4.805261 = idf(docFreq=983, maxDocs=44218)
                0.010095296 = queryNorm
              0.30032882 = fieldWeight in 4101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.805261 = idf(docFreq=983, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.027622612 = weight(abstract_txt:been in 4101) [ClassicSimilarity], result of:
            0.027622612 = score(doc=4101,freq=3.0), product of:
              0.07053523 = queryWeight, product of:
                1.9313855 = boost
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.010095296 = queryNorm
              0.3916144 = fieldWeight in 4101, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.023516815 = weight(abstract_txt:have in 4101) [ClassicSimilarity], result of:
            0.023516815 = score(doc=4101,freq=2.0), product of:
              0.083025105 = queryWeight, product of:
                2.5663524 = boost
                3.2046018 = idf(docFreq=4876, maxDocs=44218)
                0.010095296 = queryNorm
              0.28324944 = fieldWeight in 4101, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.2046018 = idf(docFreq=4876, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.14611511 = weight(abstract_txt:reuters in 4101) [ClassicSimilarity], result of:
            0.14611511 = score(doc=4101,freq=1.0), product of:
              0.30883992 = queryWeight, product of:
                4.0414076 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.010095296 = queryNorm
              0.47310954 = fieldWeight in 4101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
          0.32736346 = weight(abstract_txt:21578 in 4101) [ClassicSimilarity], result of:
            0.32736346 = score(doc=4101,freq=1.0), product of:
              0.5287984 = queryWeight, product of:
                5.288238 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.010095296 = queryNorm
              0.6190705 = fieldWeight in 4101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.0625 = fieldNorm(doc=4101)
        0.4 = coord(10/25)
    
  2. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.15
    0.14905302 = sum of:
      0.14905302 = product of:
        0.53233224 = sum of:
          0.016918382 = weight(abstract_txt:collection in 1808) [ClassicSimilarity], result of:
            0.016918382 = score(doc=1808,freq=1.0), product of:
              0.05823256 = queryWeight, product of:
                1.2408917 = boost
                4.648501 = idf(docFreq=1150, maxDocs=44218)
                0.010095296 = queryNorm
              0.2905313 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.648501 = idf(docFreq=1150, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.008186894 = weight(abstract_txt:these in 1808) [ClassicSimilarity], result of:
            0.008186894 = score(doc=1808,freq=1.0), product of:
              0.041086882 = queryWeight, product of:
                1.2765803 = boost
                3.1881294 = idf(docFreq=4957, maxDocs=44218)
                0.010095296 = queryNorm
              0.19925809 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1881294 = idf(docFreq=4957, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.011203314 = weight(abstract_txt:that in 1808) [ClassicSimilarity], result of:
            0.011203314 = score(doc=1808,freq=4.0), product of:
              0.037825473 = queryWeight, product of:
                1.5812958 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.010095296 = queryNorm
              0.2961844 = fieldWeight in 1808, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.005916181 = weight(abstract_txt:this in 1808) [ClassicSimilarity], result of:
            0.005916181 = score(doc=1808,freq=1.0), product of:
              0.039228432 = queryWeight, product of:
                1.6103542 = boost
                2.4130175 = idf(docFreq=10762, maxDocs=44218)
                0.010095296 = queryNorm
              0.1508136 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4130175 = idf(docFreq=10762, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.0166289 = weight(abstract_txt:have in 1808) [ClassicSimilarity], result of:
            0.0166289 = score(doc=1808,freq=1.0), product of:
              0.083025105 = queryWeight, product of:
                2.5663524 = boost
                3.2046018 = idf(docFreq=4876, maxDocs=44218)
                0.010095296 = queryNorm
              0.20028761 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2046018 = idf(docFreq=4876, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.14611511 = weight(abstract_txt:reuters in 1808) [ClassicSimilarity], result of:
            0.14611511 = score(doc=1808,freq=1.0), product of:
              0.30883992 = queryWeight, product of:
                4.0414076 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.010095296 = queryNorm
              0.47310954 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.32736346 = weight(abstract_txt:21578 in 1808) [ClassicSimilarity], result of:
            0.32736346 = score(doc=1808,freq=1.0), product of:
              0.5287984 = queryWeight, product of:
                5.288238 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.010095296 = queryNorm
              0.6190705 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
        0.28 = coord(7/25)
    
  3. Egghe, L.; Rousseau, R.: A theoretical study of recall and precision using a topological approach to information retrieval (1998) 0.13
    0.13128957 = sum of:
      0.13128957 = product of:
        0.8205598 = sum of:
          0.035539865 = weight(abstract_txt:standard in 3267) [ClassicSimilarity], result of:
            0.035539865 = score(doc=3267,freq=1.0), product of:
              0.06017053 = queryWeight, product of:
                1.261371 = boost
                4.725219 = idf(docFreq=1065, maxDocs=44218)
                0.010095296 = queryNorm
              0.59065235 = fieldWeight in 3267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.725219 = idf(docFreq=1065, maxDocs=44218)
                0.125 = fieldNorm(doc=3267)
          0.05351686 = weight(abstract_txt:systems in 3267) [ClassicSimilarity], result of:
            0.05351686 = score(doc=3267,freq=4.0), product of:
              0.062741816 = queryWeight, product of:
                1.8215642 = boost
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.010095296 = queryNorm
              0.8529696 = fieldWeight in 3267, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.125 = fieldNorm(doc=3267)
          0.041475374 = weight(abstract_txt:different in 3267) [ClassicSimilarity], result of:
            0.041475374 = score(doc=3267,freq=1.0), product of:
              0.090520486 = queryWeight, product of:
                2.4462137 = boost
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.010095296 = queryNorm
              0.45818773 = fieldWeight in 3267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6655018 = idf(docFreq=3075, maxDocs=44218)
                0.125 = fieldNorm(doc=3267)
          0.6900277 = weight(abstract_txt:subsets in 3267) [ClassicSimilarity], result of:
            0.6900277 = score(doc=3267,freq=1.0), product of:
              0.65994394 = queryWeight, product of:
                7.815171 = boost
                8.364683 = idf(docFreq=27, maxDocs=44218)
                0.010095296 = queryNorm
              1.0455854 = fieldWeight in 3267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.364683 = idf(docFreq=27, maxDocs=44218)
                0.125 = fieldNorm(doc=3267)
        0.16 = coord(4/25)
    
  4. Hung, C.-M.; Chien, L.-F.: Web-based text classification in the absence of manually labeled training documents (2007) 0.12
    0.12405332 = sum of:
      0.12405332 = product of:
        0.6202666 = sum of:
          0.011120713 = weight(abstract_txt:they in 87) [ClassicSimilarity], result of:
            0.011120713 = score(doc=87,freq=1.0), product of:
              0.03793806 = queryWeight, product of:
                1.0015866 = boost
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.010095296 = queryNorm
              0.29312816 = fieldWeight in 87, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.078125 = fieldNorm(doc=87)
          0.009902424 = weight(abstract_txt:that in 87) [ClassicSimilarity], result of:
            0.009902424 = score(doc=87,freq=2.0), product of:
              0.037825473 = queryWeight, product of:
                1.5812958 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.010095296 = queryNorm
              0.26179248 = fieldWeight in 87, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=87)
          0.007395226 = weight(abstract_txt:this in 87) [ClassicSimilarity], result of:
            0.007395226 = score(doc=87,freq=1.0), product of:
              0.039228432 = queryWeight, product of:
                1.6103542 = boost
                2.4130175 = idf(docFreq=10762, maxDocs=44218)
                0.010095296 = queryNorm
              0.18851699 = fieldWeight in 87, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4130175 = idf(docFreq=10762, maxDocs=44218)
                0.078125 = fieldNorm(doc=87)
          0.18264389 = weight(abstract_txt:reuters in 87) [ClassicSimilarity], result of:
            0.18264389 = score(doc=87,freq=1.0), product of:
              0.30883992 = queryWeight, product of:
                4.0414076 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.010095296 = queryNorm
              0.5913869 = fieldWeight in 87, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.078125 = fieldNorm(doc=87)
          0.40920436 = weight(abstract_txt:21578 in 87) [ClassicSimilarity], result of:
            0.40920436 = score(doc=87,freq=1.0), product of:
              0.5287984 = queryWeight, product of:
                5.288238 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.010095296 = queryNorm
              0.7738381 = fieldWeight in 87, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.078125 = fieldNorm(doc=87)
        0.2 = coord(5/25)
    
  5. Whyte, G.; Bytheway, A.; Edwards, C.: Understanding user perceptions of information systems success (1997) 0.12
    0.116682105 = sum of:
      0.116682105 = product of:
        0.5834105 = sum of:
          0.013344856 = weight(abstract_txt:they in 1367) [ClassicSimilarity], result of:
            0.013344856 = score(doc=1367,freq=1.0), product of:
              0.03793806 = queryWeight, product of:
                1.0015866 = boost
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.010095296 = queryNorm
              0.3517538 = fieldWeight in 1367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.09375 = fieldNorm(doc=1367)
          0.01228034 = weight(abstract_txt:these in 1367) [ClassicSimilarity], result of:
            0.01228034 = score(doc=1367,freq=1.0), product of:
              0.041086882 = queryWeight, product of:
                1.2765803 = boost
                3.1881294 = idf(docFreq=4957, maxDocs=44218)
                0.010095296 = queryNorm
              0.29888713 = fieldWeight in 1367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1881294 = idf(docFreq=4957, maxDocs=44218)
                0.09375 = fieldNorm(doc=1367)
          0.01188291 = weight(abstract_txt:that in 1367) [ClassicSimilarity], result of:
            0.01188291 = score(doc=1367,freq=2.0), product of:
              0.037825473 = queryWeight, product of:
                1.5812958 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.010095296 = queryNorm
              0.314151 = fieldWeight in 1367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.09375 = fieldNorm(doc=1367)
          0.028381603 = weight(abstract_txt:systems in 1367) [ClassicSimilarity], result of:
            0.028381603 = score(doc=1367,freq=2.0), product of:
              0.062741816 = queryWeight, product of:
                1.8215642 = boost
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.010095296 = queryNorm
              0.45235544 = fieldWeight in 1367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.09375 = fieldNorm(doc=1367)
          0.5175208 = weight(abstract_txt:subsets in 1367) [ClassicSimilarity], result of:
            0.5175208 = score(doc=1367,freq=1.0), product of:
              0.65994394 = queryWeight, product of:
                7.815171 = boost
                8.364683 = idf(docFreq=27, maxDocs=44218)
                0.010095296 = queryNorm
              0.78418905 = fieldWeight in 1367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.364683 = idf(docFreq=27, maxDocs=44218)
                0.09375 = fieldNorm(doc=1367)
        0.2 = coord(5/25)
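
The content-match explanations above combine a per-term score (queryWeight × fieldWeight) with a coordination factor for the share of query terms that matched. The sketch below reproduces that arithmetic under the assumption that the classic Lucene TF-IDF formulas apply (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); these agree with the numbers printed in this record. All function names are hypothetical, and small differences in the last digits are expected because Lucene works in single precision.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)) (assumed formula)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    """Per-term score: queryWeight * fieldWeight, as in the explanation tree."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(term_scores, total_query_terms):
    """Sum of the matching term scores, scaled by coord = matched / total."""
    coord = len(term_scores) / total_query_terms
    return sum(term_scores) * coord


# Term "21578" in doc 4101, values copied from the first explanation tree.
score_21578 = term_score(
    freq=1.0, doc_freq=5, max_docs=44218,
    boost=5.288238, query_norm=0.010095296, field_norm=0.0625,
)

# The ten per-term scores listed for doc 4101 (10 of 25 query terms matched).
doc_4101_terms = [
    0.008896571, 0.017769933, 0.008186894, 0.005601657, 0.011832362,
    0.02803253, 0.027622612, 0.023516815, 0.14611511, 0.32736346,
]
score_doc_4101 = document_score(doc_4101_terms, total_query_terms=25)

print(round(score_21578, 4))     # 0.3274
print(round(score_doc_4101, 4))  # 0.242
```

The coordination factor explains why entry 3 ranks below entries 1 and 2 despite its larger raw sum: only 4 of the 25 query terms matched, so its sum of 0.8205598 is scaled by 0.16.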