Document (#28455)

Author
Debole, F.
Sebastiani, F.
Title
An analysis of the relative hardness of Reuters-21578 subsets
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, p. 584-596
Year
2005
Abstract
The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have "carved" different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have been, or will be, tested on these different subsets.
Theme
Retrieval studies

Similar documents (author)

  1. Sebastiani, F.: On the role of logic in information retrieval (1998) 6.00
    6.0014763 = sum of:
      6.0014763 = weight(author_txt:sebastiani in 2138) [ClassicSimilarity], result of:
        6.0014763 = fieldWeight in 2138, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.625 = fieldNorm(doc=2138)
    
  2. Sebastiani, F.: Machine learning in automated text categorization (2002) 6.00
    6.0014763 = sum of:
      6.0014763 = weight(author_txt:sebastiani in 4387) [ClassicSimilarity], result of:
        6.0014763 = fieldWeight in 4387, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.625 = fieldNorm(doc=4387)
    
  3. Sebastiani, F.: A tutorial on automated text categorisation (1999) 6.00
    6.0014763 = sum of:
      6.0014763 = weight(author_txt:sebastiani in 4388) [ClassicSimilarity], result of:
        6.0014763 = fieldWeight in 4388, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.625 = fieldNorm(doc=4388)
    
  4. Sebastiani, F.: Classification of text, automatic (2006) 6.00
    6.0014763 = sum of:
      6.0014763 = weight(author_txt:sebastiani in 1) [ClassicSimilarity], result of:
        6.0014763 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.625 = fieldNorm(doc=1)
    
  5. Giorgetti, D.; Sebastiani, F.: Automating survey coding by multiclass text categorization techniques (2003) 4.80
    4.801181 = sum of:
      4.801181 = weight(author_txt:sebastiani in 170) [ClassicSimilarity], result of:
        4.801181 = fieldWeight in 170, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.602362 = idf(docFreq=7, maxDocs=43556)
          0.5 = fieldNorm(doc=170)
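
The author-match scores above each reduce to one product, tf × idf × fieldNorm. The following is a minimal sketch of that arithmetic, assuming the usual ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are illustrative and are not part of any Lucene API; the fieldNorm values are taken as given from the listing rather than recomputed.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency (assumed ClassicSimilarity form)."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)) (assumed ClassicSimilarity form)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight as shown in the explain trees: tf * idf * fieldNorm."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the listing: author term "sebastiani", docFreq=7, maxDocs=43556.
single_author = field_weight(freq=1.0, doc_freq=7, max_docs=43556, field_norm=0.625)
two_authors = field_weight(freq=1.0, doc_freq=7, max_docs=43556, field_norm=0.5)
print(single_author, two_authors)
```

Under these assumptions, the only factor that differs between entries 1 to 4 and entry 5 is fieldNorm (0.625 versus 0.5), which is lower for the record whose author field holds two names.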
    

Similar documents (content)

  1. Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.24
    0.24222241 = sum of:
      0.24222241 = product of:
        0.605556 = sum of:
          0.009023394 = weight(abstract_txt:they in 1099) [ClassicSimilarity], result of:
            0.009023394 = score(doc=1099,freq=1.0), product of:
              0.038327906 = queryWeight, product of:
                1.0034931 = boost
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.010139718 = queryNorm
              0.23542623 = fieldWeight in 1099, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.017812375 = weight(abstract_txt:standard in 1099) [ClassicSimilarity], result of:
            0.017812375 = score(doc=1099,freq=1.0), product of:
              0.060313754 = queryWeight, product of:
                1.258824 = boost
                4.7252574 = idf(docFreq=1049, maxDocs=43556)
                0.010139718 = queryNorm
              0.2953286 = fieldWeight in 1099, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7252574 = idf(docFreq=1049, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.008314631 = weight(abstract_txt:these in 1099) [ClassicSimilarity], result of:
            0.008314631 = score(doc=1099,freq=1.0), product of:
              0.041545838 = queryWeight, product of:
                1.2795764 = boost
                3.2021039 = idf(docFreq=4815, maxDocs=43556)
                0.010139718 = queryNorm
              0.20013149 = fieldWeight in 1099, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2021039 = idf(docFreq=4815, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.005702349 = weight(abstract_txt:that in 1099) [ClassicSimilarity], result of:
            0.005702349 = score(doc=1099,freq=1.0), product of:
              0.038307544 = queryWeight, product of:
                1.5862404 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.010139718 = queryNorm
              0.14885707 = fieldWeight in 1099, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.012074352 = weight(abstract_txt:this in 1099) [ClassicSimilarity], result of:
            0.012074352 = score(doc=1099,freq=4.0), product of:
              0.039792787 = queryWeight, product of:
                1.6166985 = boost
                2.4274454 = idf(docFreq=10449, maxDocs=43556)
                0.010139718 = queryNorm
              0.30343068 = fieldWeight in 1099, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4274454 = idf(docFreq=10449, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.02852722 = weight(abstract_txt:researchers in 1099) [ClassicSimilarity], result of:
            0.02852722 = score(doc=1099,freq=1.0), product of:
              0.094508715 = queryWeight, product of:
                1.929916 = boost
                4.8295603 = idf(docFreq=945, maxDocs=43556)
                0.010139718 = queryNorm
              0.30184752 = fieldWeight in 1099, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8295603 = idf(docFreq=945, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.027962679 = weight(abstract_txt:been in 1099) [ClassicSimilarity], result of:
            0.027962679 = score(doc=1099,freq=3.0), product of:
              0.07116895 = queryWeight, product of:
                1.9338248 = boost
                3.6295063 = idf(docFreq=3140, maxDocs=43556)
                0.010139718 = queryNorm
              0.3929056 = fieldWeight in 1099, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.6295063 = idf(docFreq=3140, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.023911854 = weight(abstract_txt:have in 1099) [ClassicSimilarity], result of:
            0.023911854 = score(doc=1099,freq=2.0), product of:
              0.0840184 = queryWeight, product of:
                2.5733845 = boost
                3.2199109 = idf(docFreq=4730, maxDocs=43556)
                0.010139718 = queryNorm
              0.28460258 = fieldWeight in 1099, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.2199109 = idf(docFreq=4730, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.1455867 = weight(abstract_txt:reuters in 1099) [ClassicSimilarity], result of:
            0.1455867 = score(doc=1099,freq=1.0), product of:
              0.30833745 = queryWeight, product of:
                4.025177 = boost
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.010139718 = queryNorm
              0.47216678 = fieldWeight in 1099, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
          0.3266405 = weight(abstract_txt:21578 in 1099) [ClassicSimilarity], result of:
            0.3266405 = score(doc=1099,freq=1.0), product of:
              0.5284353 = queryWeight, product of:
                5.2694798 = boost
                9.890043 = idf(docFreq=5, maxDocs=43556)
                0.010139718 = queryNorm
              0.6181277 = fieldWeight in 1099, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.890043 = idf(docFreq=5, maxDocs=43556)
                0.0625 = fieldNorm(doc=1099)
        0.4 = coord(10/25)
    
  2. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.15
    0.14892016 = sum of:
      0.14892016 = product of:
        0.5318577 = sum of:
          0.016965792 = weight(abstract_txt:collection in 2806) [ClassicSimilarity], result of:
            0.016965792 = score(doc=2806,freq=1.0), product of:
              0.05838723 = queryWeight, product of:
                1.2385564 = boost
                4.6491785 = idf(docFreq=1132, maxDocs=43556)
                0.010139718 = queryNorm
              0.29057366 = fieldWeight in 2806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6491785 = idf(docFreq=1132, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.008314631 = weight(abstract_txt:these in 2806) [ClassicSimilarity], result of:
            0.008314631 = score(doc=2806,freq=1.0), product of:
              0.041545838 = queryWeight, product of:
                1.2795764 = boost
                3.2021039 = idf(docFreq=4815, maxDocs=43556)
                0.010139718 = queryNorm
              0.20013149 = fieldWeight in 2806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2021039 = idf(docFreq=4815, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.011404698 = weight(abstract_txt:that in 2806) [ClassicSimilarity], result of:
            0.011404698 = score(doc=2806,freq=4.0), product of:
              0.038307544 = queryWeight, product of:
                1.5862404 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.010139718 = queryNorm
              0.29771414 = fieldWeight in 2806, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.006037176 = weight(abstract_txt:this in 2806) [ClassicSimilarity], result of:
            0.006037176 = score(doc=2806,freq=1.0), product of:
              0.039792787 = queryWeight, product of:
                1.6166985 = boost
                2.4274454 = idf(docFreq=10449, maxDocs=43556)
                0.010139718 = queryNorm
              0.15171534 = fieldWeight in 2806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4274454 = idf(docFreq=10449, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.016908236 = weight(abstract_txt:have in 2806) [ClassicSimilarity], result of:
            0.016908236 = score(doc=2806,freq=1.0), product of:
              0.0840184 = queryWeight, product of:
                2.5733845 = boost
                3.2199109 = idf(docFreq=4730, maxDocs=43556)
                0.010139718 = queryNorm
              0.20124443 = fieldWeight in 2806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2199109 = idf(docFreq=4730, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.1455867 = weight(abstract_txt:reuters in 2806) [ClassicSimilarity], result of:
            0.1455867 = score(doc=2806,freq=1.0), product of:
              0.30833745 = queryWeight, product of:
                4.025177 = boost
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.010139718 = queryNorm
              0.47216678 = fieldWeight in 2806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
          0.3266405 = weight(abstract_txt:21578 in 2806) [ClassicSimilarity], result of:
            0.3266405 = score(doc=2806,freq=1.0), product of:
              0.5284353 = queryWeight, product of:
                5.2694798 = boost
                9.890043 = idf(docFreq=5, maxDocs=43556)
                0.010139718 = queryNorm
              0.6181277 = fieldWeight in 2806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.890043 = idf(docFreq=5, maxDocs=43556)
                0.0625 = fieldNorm(doc=2806)
        0.28 = coord(7/25)
    
  3. Egghe, L.; Rousseau, R.: A theoretical study of recall and precision using a topological approach to information retrieval (1998) 0.13
    0.13109419 = sum of:
      0.13109419 = product of:
        0.8193387 = sum of:
          0.03562475 = weight(abstract_txt:standard in 4265) [ClassicSimilarity], result of:
            0.03562475 = score(doc=4265,freq=1.0), product of:
              0.060313754 = queryWeight, product of:
                1.258824 = boost
                4.7252574 = idf(docFreq=1049, maxDocs=43556)
                0.010139718 = queryNorm
              0.5906572 = fieldWeight in 4265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7252574 = idf(docFreq=1049, maxDocs=43556)
                0.125 = fieldNorm(doc=4265)
          0.053820904 = weight(abstract_txt:systems in 4265) [ClassicSimilarity], result of:
            0.053820904 = score(doc=4265,freq=4.0), product of:
              0.06302881 = queryWeight, product of:
                1.8198744 = boost
                3.4156382 = idf(docFreq=3889, maxDocs=43556)
                0.010139718 = queryNorm
              0.85390955 = fieldWeight in 4265, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4156382 = idf(docFreq=3889, maxDocs=43556)
                0.125 = fieldNorm(doc=4265)
          0.041969422 = weight(abstract_txt:different in 4265) [ClassicSimilarity], result of:
            0.041969422 = score(doc=4265,freq=1.0), product of:
              0.09130975 = queryWeight, product of:
                2.448981 = boost
                3.6771033 = idf(docFreq=2994, maxDocs=43556)
                0.010139718 = queryNorm
              0.4596379 = fieldWeight in 4265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6771033 = idf(docFreq=2994, maxDocs=43556)
                0.125 = fieldNorm(doc=4265)
          0.6879236 = weight(abstract_txt:subsets in 4265) [ClassicSimilarity], result of:
            0.6879236 = score(doc=4265,freq=1.0), product of:
              0.6591202 = queryWeight, product of:
                7.7852607 = boost
                8.349598 = idf(docFreq=27, maxDocs=43556)
                0.010139718 = queryNorm
              1.0436997 = fieldWeight in 4265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.349598 = idf(docFreq=27, maxDocs=43556)
                0.125 = fieldNorm(doc=4265)
        0.16 = coord(4/25)
    
  4. Hung, C.-M.; Chien, L.-F.: Web-based text classification in the absence of manually labeled training documents (2007) 0.12
    0.12383803 = sum of:
      0.12383803 = product of:
        0.61919016 = sum of:
          0.011279243 = weight(abstract_txt:they in 2085) [ClassicSimilarity], result of:
            0.011279243 = score(doc=2085,freq=1.0), product of:
              0.038327906 = queryWeight, product of:
                1.0034931 = boost
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.010139718 = queryNorm
              0.2942828 = fieldWeight in 2085, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.078125 = fieldNorm(doc=2085)
          0.010080424 = weight(abstract_txt:that in 2085) [ClassicSimilarity], result of:
            0.010080424 = score(doc=2085,freq=2.0), product of:
              0.038307544 = queryWeight, product of:
                1.5862404 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.010139718 = queryNorm
              0.2631446 = fieldWeight in 2085, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.078125 = fieldNorm(doc=2085)
          0.00754647 = weight(abstract_txt:this in 2085) [ClassicSimilarity], result of:
            0.00754647 = score(doc=2085,freq=1.0), product of:
              0.039792787 = queryWeight, product of:
                1.6166985 = boost
                2.4274454 = idf(docFreq=10449, maxDocs=43556)
                0.010139718 = queryNorm
              0.18964417 = fieldWeight in 2085, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4274454 = idf(docFreq=10449, maxDocs=43556)
                0.078125 = fieldNorm(doc=2085)
          0.18198338 = weight(abstract_txt:reuters in 2085) [ClassicSimilarity], result of:
            0.18198338 = score(doc=2085,freq=1.0), product of:
              0.30833745 = queryWeight, product of:
                4.025177 = boost
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.010139718 = queryNorm
              0.5902085 = fieldWeight in 2085, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5546684 = idf(docFreq=61, maxDocs=43556)
                0.078125 = fieldNorm(doc=2085)
          0.40830064 = weight(abstract_txt:21578 in 2085) [ClassicSimilarity], result of:
            0.40830064 = score(doc=2085,freq=1.0), product of:
              0.5284353 = queryWeight, product of:
                5.2694798 = boost
                9.890043 = idf(docFreq=5, maxDocs=43556)
                0.010139718 = queryNorm
              0.77265966 = fieldWeight in 2085, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.890043 = idf(docFreq=5, maxDocs=43556)
                0.078125 = fieldNorm(doc=2085)
        0.2 = coord(5/25)
    
  5. Thelwall, M.; Harries, G.: The connection between the research of a university and counts of links to its Web pages : an investigation based upon a classification of the relationships of pages to the research of the host university (2003) 0.12
    0.116528034 = sum of:
      0.116528034 = product of:
        0.5826402 = sum of:
          0.012096509 = weight(abstract_txt:that in 2674) [ClassicSimilarity], result of:
            0.012096509 = score(doc=2674,freq=2.0), product of:
              0.038307544 = queryWeight, product of:
                1.5862404 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.010139718 = queryNorm
              0.31577355 = fieldWeight in 2674, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.09375 = fieldNorm(doc=2674)
          0.009055764 = weight(abstract_txt:this in 2674) [ClassicSimilarity], result of:
            0.009055764 = score(doc=2674,freq=1.0), product of:
              0.039792787 = queryWeight, product of:
                1.6166985 = boost
                2.4274454 = idf(docFreq=10449, maxDocs=43556)
                0.010139718 = queryNorm
              0.227573 = fieldWeight in 2674, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4274454 = idf(docFreq=10449, maxDocs=43556)
                0.09375 = fieldNorm(doc=2674)
          0.02018284 = weight(abstract_txt:systems in 2674) [ClassicSimilarity], result of:
            0.02018284 = score(doc=2674,freq=1.0), product of:
              0.06302881 = queryWeight, product of:
                1.8198744 = boost
                3.4156382 = idf(docFreq=3889, maxDocs=43556)
                0.010139718 = queryNorm
              0.3202161 = fieldWeight in 2674, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4156382 = idf(docFreq=3889, maxDocs=43556)
                0.09375 = fieldNorm(doc=2674)
          0.025362354 = weight(abstract_txt:have in 2674) [ClassicSimilarity], result of:
            0.025362354 = score(doc=2674,freq=1.0), product of:
              0.0840184 = queryWeight, product of:
                2.5733845 = boost
                3.2199109 = idf(docFreq=4730, maxDocs=43556)
                0.010139718 = queryNorm
              0.30186665 = fieldWeight in 2674, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2199109 = idf(docFreq=4730, maxDocs=43556)
                0.09375 = fieldNorm(doc=2674)
          0.5159427 = weight(abstract_txt:subsets in 2674) [ClassicSimilarity], result of:
            0.5159427 = score(doc=2674,freq=1.0), product of:
              0.6591202 = queryWeight, product of:
                7.7852607 = boost
                8.349598 = idf(docFreq=27, maxDocs=43556)
                0.010139718 = queryNorm
              0.7827748 = fieldWeight in 2674, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.349598 = idf(docFreq=27, maxDocs=43556)
                0.09375 = fieldNorm(doc=2674)
        0.2 = coord(5/25)
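
The content-match scores above combine a query-side and a document-side weight for each matching term, sum them, and scale the sum by the coord factor (matching terms divided by the 25 query terms). The following is a minimal sketch of that calculation, checked against entry 3 (doc 4265). It assumes tf = sqrt(freq) as in ClassicSimilarity; the function names are illustrative, and the boost, idf, queryNorm and fieldNorm values are copied from the listing rather than derived.

```python
import math


def term_score(boost: float, idf: float, query_norm: float,
               freq: float, field_norm: float) -> float:
    """Per-term score as shown in the explain trees: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm          # query-side weight
    field_weight = math.sqrt(freq) * idf * field_norm  # document-side weight
    return query_weight * field_weight


def document_score(term_scores: list[float], total_query_terms: int) -> float:
    """Sum of per-term scores scaled by coord = matched terms / query terms."""
    coord = len(term_scores) / total_query_terms
    return sum(term_scores) * coord


QUERY_NORM = 0.010139718  # queryNorm, identical for every term in the listing

# (boost, idf, freq) for the four terms matched in doc 4265; fieldNorm is 0.125.
matched_terms = [
    (1.258824, 4.7252574, 1.0),   # standard
    (1.8198744, 3.4156382, 4.0),  # systems
    (2.448981, 3.6771033, 1.0),   # different
    (7.7852607, 8.349598, 1.0),   # subsets
]

scores = [term_score(boost, idf, QUERY_NORM, freq, 0.125)
          for boost, idf, freq in matched_terms]
total = document_score(scores, total_query_terms=25)
print(total)
```

Because coord rewards documents that match more of the query terms, entry 1 (10 of 25 terms) ranks above entry 3 (4 of 25 terms) even though entry 3 has the larger unscaled sum.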