Document (#28458)

Author
Debole, F.
Sebastiani, F.
Title
¬An analysis of the relative hardness of Reuters-21578 subsets
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, pp.584-596
Year
2005
Abstract
The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have "carved" different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have, or will be, tested on these different subsets.
Theme
Retrievalstudien

Similar documents (author)

  1. Sebastiani, F.: On the role of logic in information retrieval (1998) 6.00
    5.9971275 = sum of:
      5.9971275 = weight(author_txt:sebastiani in 3141) [ClassicSimilarity], result of:
        5.9971275 = fieldWeight in 3141, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.625 = fieldNorm(doc=3141)
    
  2. Sebastiani, F.: Machine learning in automated text categorization (2002) 6.00
    5.9971275 = sum of:
      5.9971275 = weight(author_txt:sebastiani in 5390) [ClassicSimilarity], result of:
        5.9971275 = fieldWeight in 5390, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.625 = fieldNorm(doc=5390)
    
  3. Sebastiani, F.: ¬A tutorial on automated text categorisation (1999) 6.00
    5.9971275 = sum of:
      5.9971275 = weight(author_txt:sebastiani in 5391) [ClassicSimilarity], result of:
        5.9971275 = fieldWeight in 5391, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.625 = fieldNorm(doc=5391)
    
  4. Sebastiani, F.: Classification of text, automatic (2006) 6.00
    5.9971275 = sum of:
      5.9971275 = weight(author_txt:sebastiani in 4) [ClassicSimilarity], result of:
        5.9971275 = fieldWeight in 4, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.625 = fieldNorm(doc=4)
    
  5. Giorgetti, D.; Sebastiani, F.: Automating survey coding by multiclass text categorization techniques (2003) 4.80
    4.797702 = sum of:
      4.797702 = weight(author_txt:sebastiani in 173) [ClassicSimilarity], result of:
        4.797702 = fieldWeight in 173, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.5 = fieldNorm(doc=173)
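
The author-match breakdowns above each reduce to one product: tf × idf × fieldNorm. The following is a minimal Python sketch of that arithmetic, assuming the usual ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are illustrative only and are not part of Lucene's API; the input numbers are copied from the listing.

```python
import math


def tf(freq: float) -> float:
    # Assumed ClassicSimilarity term-frequency factor: square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    # Assumed ClassicSimilarity inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight as shown in the explain output: tf * idf * fieldNorm.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


if __name__ == "__main__":
    # Entries 1-4: author_txt:sebastiani, docFreq=7, maxDocs=43254, fieldNorm=0.625
    print(round(field_weight(1.0, 7, 43254, 0.625), 4))  # 5.9971
    # Entry 5: same term, fieldNorm=0.5 (two authors in the field)
    print(round(field_weight(1.0, 7, 43254, 0.5), 4))    # 4.7977
```

The only factor that differs between entries 1-4 and entry 5 is fieldNorm, which is why the two-author record scores 4.80 instead of 6.00.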
    

Similar documents (content)

  1. Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.24
    0.24225132 = sum of:
      0.24225132 = product of:
        0.6056283 = sum of:
          0.009061378 = weight(abstract_txt:they in 566) [ClassicSimilarity], result of:
            0.009061378 = score(doc=566,freq=1.0), product of:
              0.038451575 = queryWeight, product of:
                1.005408 = boost
                3.7705102 = idf(docFreq=2708, maxDocs=43254)
                0.010143123 = queryNorm
              0.23565689 = fieldWeight in 566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7705102 = idf(docFreq=2708, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.017810164 = weight(abstract_txt:standard in 566) [ClassicSimilarity], result of:
            0.017810164 = score(doc=566,freq=1.0), product of:
              0.060334157 = queryWeight, product of:
                1.2594093 = boost
                4.723073 = idf(docFreq=1044, maxDocs=43254)
                0.010143123 = queryNorm
              0.29519206 = fieldWeight in 566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.723073 = idf(docFreq=1044, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.008370374 = weight(abstract_txt:these in 566) [ClassicSimilarity], result of:
            0.008370374 = score(doc=566,freq=1.0), product of:
              0.041748893 = queryWeight, product of:
                1.2830789 = boost
                3.2078931 = idf(docFreq=4754, maxDocs=43254)
                0.010143123 = queryNorm
              0.20049332 = fieldWeight in 566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2078931 = idf(docFreq=4754, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.0057362285 = weight(abstract_txt:that in 566) [ClassicSimilarity], result of:
            0.0057362285 = score(doc=566,freq=1.0), product of:
              0.03847532 = queryWeight, product of:
                1.5901804 = boost
                2.3854163 = idf(docFreq=10822, maxDocs=43254)
                0.010143123 = queryNorm
              0.14908852 = fieldWeight in 566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3854163 = idf(docFreq=10822, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.012149645 = weight(abstract_txt:this in 566) [ClassicSimilarity], result of:
            0.012149645 = score(doc=566,freq=4.0), product of:
              0.039974865 = queryWeight, product of:
                1.6208723 = boost
                2.4314568 = idf(docFreq=10335, maxDocs=43254)
                0.010143123 = queryNorm
              0.3039321 = fieldWeight in 566, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4314568 = idf(docFreq=10335, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.028647425 = weight(abstract_txt:researchers in 566) [ClassicSimilarity], result of:
            0.028647425 = score(doc=566,freq=1.0), product of:
              0.09481392 = queryWeight, product of:
                1.9336014 = boost
                4.8342986 = idf(docFreq=934, maxDocs=43254)
                0.010143123 = queryNorm
              0.30214366 = fieldWeight in 566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8342986 = idf(docFreq=934, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.02805913 = weight(abstract_txt:been in 566) [ClassicSimilarity], result of:
            0.02805913 = score(doc=566,freq=3.0), product of:
              0.07136255 = queryWeight, product of:
                1.937026 = boost
                3.6321454 = idf(docFreq=3110, maxDocs=43254)
                0.010143123 = queryNorm
              0.39319128 = fieldWeight in 566, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.6321454 = idf(docFreq=3110, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.02406223 = weight(abstract_txt:have in 566) [ClassicSimilarity], result of:
            0.02406223 = score(doc=566,freq=2.0), product of:
              0.0844058 = queryWeight, product of:
                2.5800734 = boost
                3.2252884 = idf(docFreq=4672, maxDocs=43254)
                0.010143123 = queryNorm
              0.2850779 = fieldWeight in 566, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.2252884 = idf(docFreq=4672, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.14536826 = weight(abstract_txt:reuters in 566) [ClassicSimilarity], result of:
            0.14536826 = score(doc=566,freq=1.0), product of:
              0.30815864 = queryWeight, product of:
                4.0252 = boost
                7.5477104 = idf(docFreq=61, maxDocs=43254)
                0.010143123 = queryNorm
              0.4717319 = fieldWeight in 566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5477104 = idf(docFreq=61, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
          0.32636347 = weight(abstract_txt:21578 in 566) [ClassicSimilarity], result of:
            0.32636347 = score(doc=566,freq=1.0), product of:
              0.5283589 = queryWeight, product of:
                5.270657 = boost
                9.883085 = idf(docFreq=5, maxDocs=43254)
                0.010143123 = queryNorm
              0.6176928 = fieldWeight in 566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.883085 = idf(docFreq=5, maxDocs=43254)
                0.0625 = fieldNorm(doc=566)
        0.4 = coord(10/25)
    
  2. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.15
    0.1488764 = sum of:
      0.1488764 = product of:
        0.53170145 = sum of:
          0.01703751 = weight(abstract_txt:collection in 3809) [ClassicSimilarity], result of:
            0.01703751 = score(doc=3809,freq=1.0), product of:
              0.058576316 = queryWeight, product of:
                1.2409272 = boost
                4.653761 = idf(docFreq=1119, maxDocs=43254)
                0.010143123 = queryNorm
              0.29086006 = fieldWeight in 3809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.653761 = idf(docFreq=1119, maxDocs=43254)
                0.0625 = fieldNorm(doc=3809)
          0.008370374 = weight(abstract_txt:these in 3809) [ClassicSimilarity], result of:
            0.008370374 = score(doc=3809,freq=1.0), product of:
              0.041748893 = queryWeight, product of:
                1.2830789 = boost
                3.2078931 = idf(docFreq=4754, maxDocs=43254)
                0.010143123 = queryNorm
              0.20049332 = fieldWeight in 3809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2078931 = idf(docFreq=4754, maxDocs=43254)
                0.0625 = fieldNorm(doc=3809)
          0.011472457 = weight(abstract_txt:that in 3809) [ClassicSimilarity], result of:
            0.011472457 = score(doc=3809,freq=4.0), product of:
              0.03847532 = queryWeight, product of:
                1.5901804 = boost
                2.3854163 = idf(docFreq=10822, maxDocs=43254)
                0.010143123 = queryNorm
              0.29817703 = fieldWeight in 3809, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3854163 = idf(docFreq=10822, maxDocs=43254)
                0.0625 = fieldNorm(doc=3809)
          0.0060748225 = weight(abstract_txt:this in 3809) [ClassicSimilarity], result of:
            0.0060748225 = score(doc=3809,freq=1.0), product of:
              0.039974865 = queryWeight, product of:
                1.6208723 = boost
                2.4314568 = idf(docFreq=10335, maxDocs=43254)
                0.010143123 = queryNorm
              0.15196605 = fieldWeight in 3809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4314568 = idf(docFreq=10335, maxDocs=43254)
                0.0625 = fieldNorm(doc=3809)
          0.017014565 = weight(abstract_txt:have in 3809) [ClassicSimilarity], result of:
            0.017014565 = score(doc=3809,freq=1.0), product of:
              0.0844058 = queryWeight, product of:
                2.5800734 = boost
                3.2252884 = idf(docFreq=4672, maxDocs=43254)
                0.010143123 = queryNorm
              0.20158052 = fieldWeight in 3809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2252884 = idf(docFreq=4672, maxDocs=43254)
                0.0625 = fieldNorm(doc=3809)
          0.14536826 = weight(abstract_txt:reuters in 3809) [ClassicSimilarity], result of:
            0.14536826 = score(doc=3809,freq=1.0), product of:
              0.30815864 = queryWeight, product of:
                4.0252 = boost
                7.5477104 = idf(docFreq=61, maxDocs=43254)
                0.010143123 = queryNorm
              0.4717319 = fieldWeight in 3809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5477104 = idf(docFreq=61, maxDocs=43254)
                0.0625 = fieldNorm(doc=3809)
          0.32636347 = weight(abstract_txt:21578 in 3809) [ClassicSimilarity], result of:
            0.32636347 = score(doc=3809,freq=1.0), product of:
              0.5283589 = queryWeight, product of:
                5.270657 = boost
                9.883085 = idf(docFreq=5, maxDocs=43254)
                0.010143123 = queryNorm
              0.6176928 = fieldWeight in 3809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.883085 = idf(docFreq=5, maxDocs=43254)
                0.0625 = fieldNorm(doc=3809)
        0.28 = coord(7/25)
    
  3. Egghe, L.; Rousseau, R.: ¬A theoretical study of recall and precision using a topological approach to information retrieval (1998) 0.13
    0.1309893 = sum of:
      0.1309893 = product of:
        0.81868315 = sum of:
          0.035620328 = weight(abstract_txt:standard in 5268) [ClassicSimilarity], result of:
            0.035620328 = score(doc=5268,freq=1.0), product of:
              0.060334157 = queryWeight, product of:
                1.2594093 = boost
                4.723073 = idf(docFreq=1044, maxDocs=43254)
                0.010143123 = queryNorm
              0.5903841 = fieldWeight in 5268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.723073 = idf(docFreq=1044, maxDocs=43254)
                0.125 = fieldNorm(doc=5268)
          0.05379141 = weight(abstract_txt:systems in 5268) [ClassicSimilarity], result of:
            0.05379141 = score(doc=5268,freq=4.0), product of:
              0.063032314 = queryWeight, product of:
                1.8204632 = boost
                3.4135768 = idf(docFreq=3870, maxDocs=43254)
                0.010143123 = queryNorm
              0.8533942 = fieldWeight in 5268, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4135768 = idf(docFreq=3870, maxDocs=43254)
                0.125 = fieldNorm(doc=5268)
          0.04219875 = weight(abstract_txt:different in 5268) [ClassicSimilarity], result of:
            0.04219875 = score(doc=5268,freq=1.0), product of:
              0.09168065 = queryWeight, product of:
                2.4546757 = boost
                3.6822383 = idf(docFreq=2958, maxDocs=43254)
                0.010143123 = queryNorm
              0.4602798 = fieldWeight in 5268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6822383 = idf(docFreq=2958, maxDocs=43254)
                0.125 = fieldNorm(doc=5268)
          0.68707263 = weight(abstract_txt:subsets in 5268) [ClassicSimilarity], result of:
            0.68707263 = score(doc=5268,freq=1.0), product of:
              0.6588538 = queryWeight, product of:
                7.7859898 = boost
                8.342641 = idf(docFreq=27, maxDocs=43254)
                0.010143123 = queryNorm
              1.0428301 = fieldWeight in 5268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.342641 = idf(docFreq=27, maxDocs=43254)
                0.125 = fieldNorm(doc=5268)
        0.16 = coord(4/25)
    
  4. Hung, C.-M.; Chien, L.-F.: Web-based text classification in the absence of manually labeled training documents (2007) 0.12
    0.12374505 = sum of:
      0.12374505 = product of:
        0.61872524 = sum of:
          0.011326723 = weight(abstract_txt:they in 2088) [ClassicSimilarity], result of:
            0.011326723 = score(doc=2088,freq=1.0), product of:
              0.038451575 = queryWeight, product of:
                1.005408 = boost
                3.7705102 = idf(docFreq=2708, maxDocs=43254)
                0.010143123 = queryNorm
              0.2945711 = fieldWeight in 2088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7705102 = idf(docFreq=2708, maxDocs=43254)
                0.078125 = fieldNorm(doc=2088)
          0.010140315 = weight(abstract_txt:that in 2088) [ClassicSimilarity], result of:
            0.010140315 = score(doc=2088,freq=2.0), product of:
              0.03847532 = queryWeight, product of:
                1.5901804 = boost
                2.3854163 = idf(docFreq=10822, maxDocs=43254)
                0.010143123 = queryNorm
              0.26355374 = fieldWeight in 2088, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3854163 = idf(docFreq=10822, maxDocs=43254)
                0.078125 = fieldNorm(doc=2088)
          0.007593528 = weight(abstract_txt:this in 2088) [ClassicSimilarity], result of:
            0.007593528 = score(doc=2088,freq=1.0), product of:
              0.039974865 = queryWeight, product of:
                1.6208723 = boost
                2.4314568 = idf(docFreq=10335, maxDocs=43254)
                0.010143123 = queryNorm
              0.18995756 = fieldWeight in 2088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4314568 = idf(docFreq=10335, maxDocs=43254)
                0.078125 = fieldNorm(doc=2088)
          0.18171032 = weight(abstract_txt:reuters in 2088) [ClassicSimilarity], result of:
            0.18171032 = score(doc=2088,freq=1.0), product of:
              0.30815864 = queryWeight, product of:
                4.0252 = boost
                7.5477104 = idf(docFreq=61, maxDocs=43254)
                0.010143123 = queryNorm
              0.5896649 = fieldWeight in 2088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5477104 = idf(docFreq=61, maxDocs=43254)
                0.078125 = fieldNorm(doc=2088)
          0.40795437 = weight(abstract_txt:21578 in 2088) [ClassicSimilarity], result of:
            0.40795437 = score(doc=2088,freq=1.0), product of:
              0.5283589 = queryWeight, product of:
                5.270657 = boost
                9.883085 = idf(docFreq=5, maxDocs=43254)
                0.010143123 = queryNorm
              0.77211607 = fieldWeight in 2088, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.883085 = idf(docFreq=5, maxDocs=43254)
                0.078125 = fieldNorm(doc=2088)
        0.2 = coord(5/25)
    
  5. Thelwall, M.; Harries, G.: ¬The connection between the research of a university and counts of links to its Web pages : an investigation based upon a classification of the relationships of pages to the research of the host university (2003) 0.12
    0.11645575 = sum of:
      0.11645575 = product of:
        0.5822787 = sum of:
          0.012168379 = weight(abstract_txt:that in 3677) [ClassicSimilarity], result of:
            0.012168379 = score(doc=3677,freq=2.0), product of:
              0.03847532 = queryWeight, product of:
                1.5901804 = boost
                2.3854163 = idf(docFreq=10822, maxDocs=43254)
                0.010143123 = queryNorm
              0.3162645 = fieldWeight in 3677, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3854163 = idf(docFreq=10822, maxDocs=43254)
                0.09375 = fieldNorm(doc=3677)
          0.009112233 = weight(abstract_txt:this in 3677) [ClassicSimilarity], result of:
            0.009112233 = score(doc=3677,freq=1.0), product of:
              0.039974865 = queryWeight, product of:
                1.6208723 = boost
                2.4314568 = idf(docFreq=10335, maxDocs=43254)
                0.010143123 = queryNorm
              0.22794908 = fieldWeight in 3677, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4314568 = idf(docFreq=10335, maxDocs=43254)
                0.09375 = fieldNorm(doc=3677)
          0.020171778 = weight(abstract_txt:systems in 3677) [ClassicSimilarity], result of:
            0.020171778 = score(doc=3677,freq=1.0), product of:
              0.063032314 = queryWeight, product of:
                1.8204632 = boost
                3.4135768 = idf(docFreq=3870, maxDocs=43254)
                0.010143123 = queryNorm
              0.32002282 = fieldWeight in 3677, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4135768 = idf(docFreq=3870, maxDocs=43254)
                0.09375 = fieldNorm(doc=3677)
          0.025521848 = weight(abstract_txt:have in 3677) [ClassicSimilarity], result of:
            0.025521848 = score(doc=3677,freq=1.0), product of:
              0.0844058 = queryWeight, product of:
                2.5800734 = boost
                3.2252884 = idf(docFreq=4672, maxDocs=43254)
                0.010143123 = queryNorm
              0.3023708 = fieldWeight in 3677, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2252884 = idf(docFreq=4672, maxDocs=43254)
                0.09375 = fieldNorm(doc=3677)
          0.5153045 = weight(abstract_txt:subsets in 3677) [ClassicSimilarity], result of:
            0.5153045 = score(doc=3677,freq=1.0), product of:
              0.6588538 = queryWeight, product of:
                7.7859898 = boost
                8.342641 = idf(docFreq=27, maxDocs=43254)
                0.010143123 = queryNorm
              0.7821226 = fieldWeight in 3677, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.342641 = idf(docFreq=27, maxDocs=43254)
                0.09375 = fieldNorm(doc=3677)
        0.2 = coord(5/25)
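
The content-match breakdowns combine several terms: each term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is scaled by coord (matched terms / query terms). As a hedged illustration, the Python sketch below recomputes the final entry (doc 3677) from the boost, docFreq, and freq values in its listing. It assumes the same ClassicSimilarity formulas as the explain output appears to use; all names are hypothetical, and small differences from the listed figures are expected because Lucene works in single precision.

```python
import math

MAX_DOCS = 43254
QUERY_NORM = 0.010143123  # copied from the listing


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    # Assumed ClassicSimilarity inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def term_score(boost: float, doc_freq: int, freq: float, field_norm: float) -> float:
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    # fieldWeight = sqrt(freq) * idf * fieldNorm
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm: float, matched: int, total: int) -> float:
    # Sum the per-term scores, then apply the coordination factor.
    raw = sum(term_score(boost, df, freq, field_norm) for boost, df, freq in terms)
    return raw * (matched / total)


# (boost, docFreq, freq) for the five matching terms of doc 3677
DOC_3677_TERMS = [
    (1.5901804, 10822, 2.0),  # that
    (1.6208723, 10335, 1.0),  # this
    (1.8204632, 3870, 1.0),   # systems
    (2.5800734, 4672, 1.0),   # have
    (7.7859898, 27, 1.0),     # subsets
]

if __name__ == "__main__":
    score = document_score(DOC_3677_TERMS, field_norm=0.09375, matched=5, total=25)
    print(score)  # compare with 0.11645575 in the listing
```

Note how the rare term "subsets" (docFreq=27) supplies most of the score, while common words such as "that" and "this" contribute very little.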