Document (#33955)

Author
Ringlstetter, C.
Stubbe, A.
Title
Practical aspects of automatic genre classification
Source
Bulletin of the American Society for Information Science and Technology. 34(2008) no.5, pp. 27-30
Year
2008
Abstract
In the field of automatic text processing, the technical term genre refers to the partition of documents into classes of documents with similar function and form. Genre represents an independent dimension, ideally orthogonal to topic. Traditionally, most work in the area of text classification, from a practical as well as from a theoretical perspective, has focused on the problem of how to recognize thematic domains. However, given a user's information need, even prior to content, the genre of a document leads to a first coarse binary classification of the recall space into immediately rejected documents and those that require further processing. Depending on the information task at hand, each genre can represent a class of documents that should be filtered. For example, cooking recipes represent a kind of "noise" if someone needs to find articles about the economic outlook on fish breeding; one person might be interested only in prose about the Spanish Civil War, another only in military documents. In cases like these, a genre-triggered search can deliver significantly higher precision than a simple keyword search. If the documents are not tagged initially and the document base is too big for manual annotation, we need an automatic classification system.
Footnote
Available online at: http://www.asis.org/Bulletin/Jun-08/JunJul08_Ringlstetter_Stubbe.html.

Similar documents (content)

  1. Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.29
    0.29396093 = sum of:
      0.29396093 = product of:
        1.0498605 = sum of:
          0.023317793 = weight(abstract_txt:text in 6010) [ClassicSimilarity], result of:
            0.023317793 = score(doc=6010,freq=1.0), product of:
              0.073807515 = queryWeight, product of:
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.018251719 = queryNorm
              0.3159271 = fieldWeight in 6010, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=6010)
          0.025761269 = weight(abstract_txt:need in 6010) [ClassicSimilarity], result of:
            0.025761269 = score(doc=6010,freq=1.0), product of:
              0.07887762 = queryWeight, product of:
                1.0337764 = boost
                4.180454 = idf(docFreq=1837, maxDocs=44218)
                0.018251719 = queryNorm
              0.32659796 = fieldWeight in 6010, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.180454 = idf(docFreq=1837, maxDocs=44218)
                0.078125 = fieldNorm(doc=6010)
          0.027890785 = weight(abstract_txt:document in 6010) [ClassicSimilarity], result of:
            0.027890785 = score(doc=6010,freq=1.0), product of:
              0.083166696 = queryWeight, product of:
                1.0615108 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.018251719 = queryNorm
              0.33536002 = fieldWeight in 6010, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=6010)
          0.10490828 = weight(abstract_txt:automatic in 6010) [ClassicSimilarity], result of:
            0.10490828 = score(doc=6010,freq=2.0), product of:
              0.182755 = queryWeight, product of:
                1.927214 = boost
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.018251719 = queryNorm
              0.57403785 = fieldWeight in 6010, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.078125 = fieldNorm(doc=6010)
          0.06345095 = weight(abstract_txt:classification in 6010) [ClassicSimilarity], result of:
            0.06345095 = score(doc=6010,freq=2.0), product of:
              0.14385812 = queryWeight, product of:
                1.9743851 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.018251719 = queryNorm
              0.44106615 = fieldWeight in 6010, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.078125 = fieldNorm(doc=6010)
          0.18138334 = weight(abstract_txt:documents in 6010) [ClassicSimilarity], result of:
            0.18138334 = score(doc=6010,freq=6.0), product of:
              0.22998378 = queryWeight, product of:
                3.057447 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.018251719 = queryNorm
              0.7886788 = fieldWeight in 6010, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.078125 = fieldNorm(doc=6010)
          0.6231481 = weight(abstract_txt:genre in 6010) [ClassicSimilarity], result of:
            0.6231481 = score(doc=6010,freq=4.0), product of:
              0.5994094 = queryWeight, product of:
                4.9359655 = boost
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.018251719 = queryNorm
              1.0396035 = fieldWeight in 6010, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.078125 = fieldNorm(doc=6010)
        0.28 = coord(7/25)
    
  2. Lim, C.S.; Lee, K.J.; Kim, G.C.: Multiple sets of features for automatic genre classification of web documents (2005) 0.15
    0.15164015 = sum of:
      0.15164015 = product of:
        0.7582007 = sum of:
          0.038646605 = weight(abstract_txt:document in 1048) [ClassicSimilarity], result of:
            0.038646605 = score(doc=1048,freq=3.0), product of:
              0.083166696 = queryWeight, product of:
                1.0615108 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.018251719 = queryNorm
              0.46468848 = fieldWeight in 1048, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=1048)
          0.05934509 = weight(abstract_txt:automatic in 1048) [ClassicSimilarity], result of:
            0.05934509 = score(doc=1048,freq=1.0), product of:
              0.182755 = queryWeight, product of:
                1.927214 = boost
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.018251719 = queryNorm
              0.32472485 = fieldWeight in 1048, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.0625 = fieldNorm(doc=1048)
          0.050760757 = weight(abstract_txt:classification in 1048) [ClassicSimilarity], result of:
            0.050760757 = score(doc=1048,freq=2.0), product of:
              0.14385812 = queryWeight, product of:
                1.9743851 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.018251719 = queryNorm
              0.3528529 = fieldWeight in 1048, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=1048)
          0.17771864 = weight(abstract_txt:documents in 1048) [ClassicSimilarity], result of:
            0.17771864 = score(doc=1048,freq=9.0), product of:
              0.22998378 = queryWeight, product of:
                3.057447 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.018251719 = queryNorm
              0.77274424 = fieldWeight in 1048, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=1048)
          0.4317296 = weight(abstract_txt:genre in 1048) [ClassicSimilarity], result of:
            0.4317296 = score(doc=1048,freq=3.0), product of:
              0.5994094 = queryWeight, product of:
                4.9359655 = boost
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.018251719 = queryNorm
              0.72025836 = fieldWeight in 1048, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.0625 = fieldNorm(doc=1048)
        0.2 = coord(5/25)
    
  3. Santini, M.: Zero, single, or multi? : genre of web pages through the users' perspective (2008) 0.13
    0.13089342 = sum of:
      0.13089342 = product of:
        0.8180839 = sum of:
          0.020609016 = weight(abstract_txt:need in 2059) [ClassicSimilarity], result of:
            0.020609016 = score(doc=2059,freq=1.0), product of:
              0.07887762 = queryWeight, product of:
                1.0337764 = boost
                4.180454 = idf(docFreq=1837, maxDocs=44218)
                0.018251719 = queryNorm
              0.26127836 = fieldWeight in 2059, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.180454 = idf(docFreq=1837, maxDocs=44218)
                0.0625 = fieldNorm(doc=2059)
          0.030294325 = weight(abstract_txt:only in 2059) [ClassicSimilarity], result of:
            0.030294325 = score(doc=2059,freq=2.0), product of:
              0.08093689 = queryWeight, product of:
                1.0471839 = boost
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.018251719 = queryNorm
              0.37429565 = fieldWeight in 2059, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.0625 = fieldNorm(doc=2059)
          0.06216898 = weight(abstract_txt:classification in 2059) [ClassicSimilarity], result of:
            0.06216898 = score(doc=2059,freq=3.0), product of:
              0.14385812 = queryWeight, product of:
                1.9743851 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.018251719 = queryNorm
              0.4321548 = fieldWeight in 2059, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=2059)
          0.70501155 = weight(abstract_txt:genre in 2059) [ClassicSimilarity], result of:
            0.70501155 = score(doc=2059,freq=8.0), product of:
              0.5994094 = queryWeight, product of:
                4.9359655 = boost
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.018251719 = queryNorm
              1.176177 = fieldWeight in 2059, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.0625 = fieldNorm(doc=2059)
        0.16 = coord(4/25)
    
  4. Morato, J.; Llorens, J.; Genova, G.; Moreiro, J.A.: Experiments in discourse analysis impact on information classification and retrieval algorithms (2003) 0.11
    0.11031421 = sum of:
      0.11031421 = product of:
        0.45964256 = sum of:
          0.026381072 = weight(abstract_txt:text in 1083) [ClassicSimilarity], result of:
            0.026381072 = score(doc=1083,freq=2.0), product of:
              0.073807515 = queryWeight, product of:
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.018251719 = queryNorm
              0.3574307 = fieldWeight in 1083, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=1083)
          0.021421323 = weight(abstract_txt:only in 1083) [ClassicSimilarity], result of:
            0.021421323 = score(doc=1083,freq=1.0), product of:
              0.08093689 = queryWeight, product of:
                1.0471839 = boost
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.018251719 = queryNorm
              0.264667 = fieldWeight in 1083, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.0625 = fieldNorm(doc=1083)
          0.031554822 = weight(abstract_txt:document in 1083) [ClassicSimilarity], result of:
            0.031554822 = score(doc=1083,freq=2.0), product of:
              0.083166696 = queryWeight, product of:
                1.0615108 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.018251719 = queryNorm
              0.37941656 = fieldWeight in 1083, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=1083)
          0.07178655 = weight(abstract_txt:classification in 1083) [ClassicSimilarity], result of:
            0.07178655 = score(doc=1083,freq=4.0), product of:
              0.14385812 = queryWeight, product of:
                1.9743851 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.018251719 = queryNorm
              0.4990094 = fieldWeight in 1083, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=1083)
          0.059239548 = weight(abstract_txt:documents in 1083) [ClassicSimilarity], result of:
            0.059239548 = score(doc=1083,freq=1.0), product of:
              0.22998378 = queryWeight, product of:
                3.057447 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.018251719 = queryNorm
              0.2575814 = fieldWeight in 1083, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=1083)
          0.24925923 = weight(abstract_txt:genre in 1083) [ClassicSimilarity], result of:
            0.24925923 = score(doc=1083,freq=1.0), product of:
              0.5994094 = queryWeight, product of:
                4.9359655 = boost
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.018251719 = queryNorm
              0.41584137 = fieldWeight in 1083, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.0625 = fieldNorm(doc=1083)
        0.24 = coord(6/25)
    
  5. Altinel, B.; Ganiz, M.C.: Semantic text classification : a survey of past and recent advances (2018) 0.11
    0.10625727 = sum of:
      0.10625727 = product of:
        0.37949023 = sum of:
          0.058851447 = weight(abstract_txt:text in 5051) [ClassicSimilarity], result of:
            0.058851447 = score(doc=5051,freq=13.0), product of:
              0.073807515 = queryWeight, product of:
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.018251719 = queryNorm
              0.7973639 = fieldWeight in 5051, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.018743658 = weight(abstract_txt:only in 5051) [ClassicSimilarity], result of:
            0.018743658 = score(doc=5051,freq=1.0), product of:
              0.08093689 = queryWeight, product of:
                1.0471839 = boost
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.018251719 = queryNorm
              0.23158363 = fieldWeight in 5051, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.03381578 = weight(abstract_txt:document in 5051) [ClassicSimilarity], result of:
            0.03381578 = score(doc=5051,freq=3.0), product of:
              0.083166696 = queryWeight, product of:
                1.0615108 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.018251719 = queryNorm
              0.4066024 = fieldWeight in 5051, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.029609025 = weight(abstract_txt:processing in 5051) [ClassicSimilarity], result of:
            0.029609025 = score(doc=5051,freq=1.0), product of:
              0.10978078 = queryWeight, product of:
                1.2195872 = boost
                4.931848 = idf(docFreq=866, maxDocs=44218)
                0.018251719 = queryNorm
              0.26971045 = fieldWeight in 5051, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.931848 = idf(docFreq=866, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.05192695 = weight(abstract_txt:automatic in 5051) [ClassicSimilarity], result of:
            0.05192695 = score(doc=5051,freq=1.0), product of:
              0.182755 = queryWeight, product of:
                1.927214 = boost
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.018251719 = queryNorm
              0.28413424 = fieldWeight in 5051, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.11323817 = weight(abstract_txt:classification in 5051) [ClassicSimilarity], result of:
            0.11323817 = score(doc=5051,freq=13.0), product of:
              0.14385812 = queryWeight, product of:
                1.9743851 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.018251719 = queryNorm
              0.78715175 = fieldWeight in 5051, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.0733052 = weight(abstract_txt:documents in 5051) [ClassicSimilarity], result of:
            0.0733052 = score(doc=5051,freq=2.0), product of:
              0.22998378 = queryWeight, product of:
                3.057447 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.018251719 = queryNorm
              0.31874073 = fieldWeight in 5051, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
        0.28 = coord(7/25)
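
The score trees above all have the same shape: each matching term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is scaled by coord (matched query terms / total query terms). The following Python sketch recomputes the first entry (doc 6010) from the raw inputs shown in its tree. It assumes the usual Lucene ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the tf and idf values printed in the listing. The function and variable names are illustrative only and are not part of any Lucene API.

```python
import math

# Constants taken from the explain output above.
MAX_DOCS = 44218
QUERY_NORM = 0.018251719


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, assuming the ClassicSimilarity formula."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, field_norm, boost=1.0):
    """Score of one term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# Inputs for doc 6010 as listed: term -> (freq, docFreq, boost).
# "text" shows no boost line in the tree, so 1.0 is assumed here.
TERMS_6010 = {
    "text": (1.0, 2106, 1.0),
    "need": (1.0, 1837, 1.0337764),
    "document": (1.0, 1642, 1.0615108),
    "automatic": (2.0, 665, 1.927214),
    "classification": (2.0, 2218, 1.9743851),
    "documents": (6.0, 1949, 3.057447),
    "genre": (4.0, 154, 4.9359655),
}
FIELD_NORM_6010 = 0.078125

per_term = {
    term: term_score(freq, doc_freq, FIELD_NORM_6010, boost)
    for term, (freq, doc_freq, boost) in TERMS_6010.items()
}

# coord(7/25): 7 of the 25 query terms matched this document.
coord = len(TERMS_6010) / 25
final_score = sum(per_term.values()) * coord

for term, value in per_term.items():
    print(f"{term:15s} {value:.6f}")
print(f"{'total':15s} {final_score:.6f}")
```

Because Lucene computes these values in single precision, the recomputed numbers should agree with the listing (for example 0.6231481 for "genre" and 0.29396093 for the total) only up to small rounding differences. The breakdown also makes clear why "genre" dominates every ranking: it combines the highest idf (docFreq=154) with the largest boost.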