Document (#30117)

Author
Duwairi, R.M.
Title
Machine learning for Arabic text categorization
Source
Journal of the American Society for Information Science and Technology. 57(2006) no.8, pp.1005-1010
Year
2006
Abstract
In this article we propose a distance-based classifier for categorizing Arabic text. Each category is represented as a vector of words in an m-dimensional space, and documents are classified on the basis of their closeness to feature vectors of categories. The classifier, in its learning phase, scans the set of training documents to extract features of categories that capture inherent category-specific properties; in its testing phase the classifier uses previously determined category-specific features to categorize unclassified documents. Stemming was used to reduce the dimensionality of feature vectors of documents. The accuracy of the classifier was tested by carrying out several categorization tasks on an in-house collected Arabic corpus. The results show that the proposed classifier is very accurate and robust.
Theme
Computational linguistics
Automatic classification
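
The distance-based approach summarized in the abstract (one feature vector per category, documents assigned to the closest category) can be illustrated with a minimal sketch. The abstract does not state the distance measure, the term weighting, or the stemmer, so everything specific here is an assumption: cosine similarity as the closeness measure, raw term counts as weights, whitespace tokenization in place of Arabic stemming, and the hypothetical helper names `tokenize`, `train`, `cosine`, and `classify`. The toy training data is invented for illustration and is not the article's corpus.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization. The article applies stemming
    # to reduce dimensionality; a real system would stem each token here.
    return text.lower().split()


def train(labeled_docs):
    """Learning phase: build one term-count vector per category."""
    category_vectors = {}
    for label, text in labeled_docs:
        category_vectors.setdefault(label, Counter()).update(tokenize(text))
    return category_vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(category_vectors, text):
    """Testing phase: pick the category whose vector is closest to the document."""
    doc = Counter(tokenize(text))
    return max(category_vectors, key=lambda label: cosine(doc, category_vectors[label]))


# Invented toy data, for illustration only.
training = [
    ("sports", "match goal team player goal"),
    ("sports", "team coach match league"),
    ("economy", "bank market price inflation"),
    ("economy", "market trade bank loan"),
]
category_vectors = train(training)
predicted = classify(category_vectors, "the team scored a goal in the match")
print(predicted)
```

Summing term counts per category keeps the learning phase to a single pass over the training set, which matches the abstract's description of scanning training documents once to extract category features; any other aggregation or distance would slot into `train` and `cosine` without changing `classify`.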

Similar documents (content)

  1. Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.35
    0.35184675 = sum of:
      0.35184675 = product of:
        1.0995212 = sum of:
          0.045091596 = weight(abstract_txt:text in 4805) [ClassicSimilarity], result of:
            0.045091596 = score(doc=4805,freq=6.0), product of:
              0.072724134 = queryWeight, product of:
                1.2224864 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.014688355 = queryNorm
              0.6200362 = fieldWeight in 4805, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=4805)
          0.085912265 = weight(abstract_txt:dimensionality in 4805) [ClassicSimilarity], result of:
            0.085912265 = score(doc=4805,freq=1.0), product of:
              0.16119765 = queryWeight, product of:
                1.2869719 = boost
                8.527396 = idf(docFreq=22, maxDocs=42740)
                0.014688355 = queryNorm
              0.53296226 = fieldWeight in 4805, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.527396 = idf(docFreq=22, maxDocs=42740)
                0.0625 = fieldNorm(doc=4805)
          0.022493633 = weight(abstract_txt:specific in 4805) [ClassicSimilarity], result of:
            0.022493633 = score(doc=4805,freq=1.0), product of:
              0.08311989 = queryWeight, product of:
                1.3069447 = boost
                4.3298674 = idf(docFreq=1529, maxDocs=42740)
                0.014688355 = queryNorm
              0.2706167 = fieldWeight in 4805, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3298674 = idf(docFreq=1529, maxDocs=42740)
                0.0625 = fieldNorm(doc=4805)
          0.045569394 = weight(abstract_txt:features in 4805) [ClassicSimilarity], result of:
            0.045569394 = score(doc=4805,freq=3.0), product of:
              0.092272796 = queryWeight, product of:
                1.3770242 = boost
                4.5620384 = idf(docFreq=1212, maxDocs=42740)
                0.014688355 = queryNorm
              0.49385515 = fieldWeight in 4805, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5620384 = idf(docFreq=1212, maxDocs=42740)
                0.0625 = fieldNorm(doc=4805)
          0.15410952 = weight(abstract_txt:feature in 4805) [ClassicSimilarity], result of:
            0.15410952 = score(doc=4805,freq=7.0), product of:
              0.15674207 = queryWeight, product of:
                1.7947234 = boost
                5.945863 = idf(docFreq=303, maxDocs=42740)
                0.014688355 = queryNorm
              0.9832046 = fieldWeight in 4805, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.945863 = idf(docFreq=303, maxDocs=42740)
                0.0625 = fieldNorm(doc=4805)
          0.05462777 = weight(abstract_txt:documents in 4805) [ClassicSimilarity], result of:
            0.05462777 = score(doc=4805,freq=2.0), product of:
              0.15017843 = queryWeight, product of:
                2.4844115 = boost
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.014688355 = queryNorm
              0.36375242 = fieldWeight in 4805, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.0625 = fieldNorm(doc=4805)
          0.102866285 = weight(abstract_txt:category in 4805) [ClassicSimilarity], result of:
            0.102866285 = score(doc=4805,freq=1.0), product of:
              0.26214668 = queryWeight, product of:
                2.8426445 = boost
                6.2783957 = idf(docFreq=217, maxDocs=42740)
                0.014688355 = queryNorm
              0.39239973 = fieldWeight in 4805, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2783957 = idf(docFreq=217, maxDocs=42740)
                0.0625 = fieldNorm(doc=4805)
          0.5888506 = weight(abstract_txt:classifier in 4805) [ClassicSimilarity], result of:
            0.5888506 = score(doc=4805,freq=5.0), product of:
              0.5816459 = queryWeight, product of:
                5.4664335 = boost
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.014688355 = queryNorm
              1.0123868 = fieldWeight in 4805, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.0625 = fieldNorm(doc=4805)
        0.32 = coord(8/25)
    
  2. Duwairi, R.; Al-Refai, M.N.; Khasawneh, N.: Feature reduction techniques for Arabic text categorization (2009) 0.34
    0.3408731 = sum of:
      0.3408731 = product of:
        1.0652285 = sum of:
          0.11427385 = weight(abstract_txt:stemming in 170) [ClassicSimilarity], result of:
            0.11427385 = score(doc=170,freq=4.0), product of:
              0.12281927 = queryWeight, product of:
                1.1233704 = boost
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.014688355 = queryNorm
              0.93042284 = fieldWeight in 170, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.4433827 = idf(docFreq=67, maxDocs=42740)
                0.0625 = fieldNorm(doc=170)
          0.018408567 = weight(abstract_txt:text in 170) [ClassicSimilarity], result of:
            0.018408567 = score(doc=170,freq=1.0), product of:
              0.072724134 = queryWeight, product of:
                1.2224864 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.014688355 = queryNorm
              0.2531287 = fieldWeight in 170, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=170)
          0.03895929 = weight(abstract_txt:categories in 170) [ClassicSimilarity], result of:
            0.03895929 = score(doc=170,freq=1.0), product of:
              0.11987794 = queryWeight, product of:
                1.5695472 = boost
                5.199861 = idf(docFreq=640, maxDocs=42740)
                0.014688355 = queryNorm
              0.32499132 = fieldWeight in 170, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.199861 = idf(docFreq=640, maxDocs=42740)
                0.0625 = fieldNorm(doc=170)
          0.058247928 = weight(abstract_txt:feature in 170) [ClassicSimilarity], result of:
            0.058247928 = score(doc=170,freq=1.0), product of:
              0.15674207 = queryWeight, product of:
                1.7947234 = boost
                5.945863 = idf(docFreq=303, maxDocs=42740)
                0.014688355 = queryNorm
              0.37161642 = fieldWeight in 170, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.945863 = idf(docFreq=303, maxDocs=42740)
                0.0625 = fieldNorm(doc=170)
          0.32368457 = weight(abstract_txt:vectors in 170) [ClassicSimilarity], result of:
            0.32368457 = score(doc=170,freq=6.0), product of:
              0.27062184 = queryWeight, product of:
                2.3582299 = boost
                7.8127427 = idf(docFreq=46, maxDocs=42740)
                0.014688355 = queryNorm
              1.1960771 = fieldWeight in 170, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.8127427 = idf(docFreq=46, maxDocs=42740)
                0.0625 = fieldNorm(doc=170)
          0.06690508 = weight(abstract_txt:documents in 170) [ClassicSimilarity], result of:
            0.06690508 = score(doc=170,freq=3.0), product of:
              0.15017843 = queryWeight, product of:
                2.4844115 = boost
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.014688355 = queryNorm
              0.44550392 = fieldWeight in 170, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.0625 = fieldNorm(doc=170)
          0.18140715 = weight(abstract_txt:arabic in 170) [ClassicSimilarity], result of:
            0.18140715 = score(doc=170,freq=1.0), product of:
              0.38264725 = queryWeight, product of:
                3.4343903 = boost
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.014688355 = queryNorm
              0.47408456 = fieldWeight in 170, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.0625 = fieldNorm(doc=170)
          0.263342 = weight(abstract_txt:classifier in 170) [ClassicSimilarity], result of:
            0.263342 = score(doc=170,freq=1.0), product of:
              0.5816459 = queryWeight, product of:
                5.4664335 = boost
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.014688355 = queryNorm
              0.45275313 = fieldWeight in 170, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.0625 = fieldNorm(doc=170)
        0.32 = coord(8/25)
    
  3. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.25
    0.24694787 = sum of:
      0.24694787 = product of:
        1.0289495 = sum of:
          0.023010708 = weight(abstract_txt:text in 4390) [ClassicSimilarity], result of:
            0.023010708 = score(doc=4390,freq=1.0), product of:
              0.072724134 = queryWeight, product of:
                1.2224864 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.014688355 = queryNorm
              0.3164109 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.078125 = fieldNorm(doc=4390)
          0.066659085 = weight(abstract_txt:learning in 4390) [ClassicSimilarity], result of:
            0.066659085 = score(doc=4390,freq=3.0), product of:
              0.10246866 = queryWeight, product of:
                1.4511098 = boost
                4.807482 = idf(docFreq=948, maxDocs=42740)
                0.014688355 = queryNorm
              0.6505314 = fieldWeight in 4390, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.807482 = idf(docFreq=948, maxDocs=42740)
                0.078125 = fieldNorm(doc=4390)
          0.06887095 = weight(abstract_txt:categories in 4390) [ClassicSimilarity], result of:
            0.06887095 = score(doc=4390,freq=2.0), product of:
              0.11987794 = queryWeight, product of:
                1.5695472 = boost
                5.199861 = idf(docFreq=640, maxDocs=42740)
                0.014688355 = queryNorm
              0.5745089 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.199861 = idf(docFreq=640, maxDocs=42740)
                0.078125 = fieldNorm(doc=4390)
          0.14376909 = weight(abstract_txt:categorization in 4390) [ClassicSimilarity], result of:
            0.14376909 = score(doc=4390,freq=2.0), product of:
              0.19580582 = queryWeight, product of:
                2.005938 = boost
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.014688355 = queryNorm
              0.73424315 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.078125 = fieldNorm(doc=4390)
          0.068284705 = weight(abstract_txt:documents in 4390) [ClassicSimilarity], result of:
            0.068284705 = score(doc=4390,freq=2.0), product of:
              0.15017843 = queryWeight, product of:
                2.4844115 = boost
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.014688355 = queryNorm
              0.45469052 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.078125 = fieldNorm(doc=4390)
          0.658355 = weight(abstract_txt:classifier in 4390) [ClassicSimilarity], result of:
            0.658355 = score(doc=4390,freq=4.0), product of:
              0.5816459 = queryWeight, product of:
                5.4664335 = boost
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.014688355 = queryNorm
              1.1318828 = fieldWeight in 4390, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.078125 = fieldNorm(doc=4390)
        0.24 = coord(6/25)
    
  4. Maghsoodi, N.; Homayounpour, M.M.: Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection (2011) 0.23
    0.22931273 = sum of:
      0.22931273 = product of:
        0.7166023 = sum of:
          0.018408567 = weight(abstract_txt:text in 1776) [ClassicSimilarity], result of:
            0.018408567 = score(doc=1776,freq=1.0), product of:
              0.072724134 = queryWeight, product of:
                1.2224864 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.014688355 = queryNorm
              0.2531287 = fieldWeight in 1776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=1776)
          0.045569394 = weight(abstract_txt:features in 1776) [ClassicSimilarity], result of:
            0.045569394 = score(doc=1776,freq=3.0), product of:
              0.092272796 = queryWeight, product of:
                1.3770242 = boost
                4.5620384 = idf(docFreq=1212, maxDocs=42740)
                0.014688355 = queryNorm
              0.49385515 = fieldWeight in 1776, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5620384 = idf(docFreq=1212, maxDocs=42740)
                0.0625 = fieldNorm(doc=1776)
          0.055096757 = weight(abstract_txt:categories in 1776) [ClassicSimilarity], result of:
            0.055096757 = score(doc=1776,freq=2.0), product of:
              0.11987794 = queryWeight, product of:
                1.5695472 = boost
                5.199861 = idf(docFreq=640, maxDocs=42740)
                0.014688355 = queryNorm
              0.45960712 = fieldWeight in 1776, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.199861 = idf(docFreq=640, maxDocs=42740)
                0.0625 = fieldNorm(doc=1776)
          0.13024633 = weight(abstract_txt:feature in 1776) [ClassicSimilarity], result of:
            0.13024633 = score(doc=1776,freq=5.0), product of:
              0.15674207 = queryWeight, product of:
                1.7947234 = boost
                5.945863 = idf(docFreq=303, maxDocs=42740)
                0.014688355 = queryNorm
              0.8309596 = fieldWeight in 1776, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.945863 = idf(docFreq=303, maxDocs=42740)
                0.0625 = fieldNorm(doc=1776)
          0.06798345 = weight(abstract_txt:phase in 1776) [ClassicSimilarity], result of:
            0.06798345 = score(doc=1776,freq=1.0), product of:
              0.1737537 = queryWeight, product of:
                1.8896083 = boost
                6.2602134 = idf(docFreq=221, maxDocs=42740)
                0.014688355 = queryNorm
              0.39126334 = fieldWeight in 1776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2602134 = idf(docFreq=221, maxDocs=42740)
                0.0625 = fieldNorm(doc=1776)
          0.08132808 = weight(abstract_txt:categorization in 1776) [ClassicSimilarity], result of:
            0.08132808 = score(doc=1776,freq=1.0), product of:
              0.19580582 = queryWeight, product of:
                2.005938 = boost
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.014688355 = queryNorm
              0.41535068 = fieldWeight in 1776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.0625 = fieldNorm(doc=1776)
          0.05462777 = weight(abstract_txt:documents in 1776) [ClassicSimilarity], result of:
            0.05462777 = score(doc=1776,freq=2.0), product of:
              0.15017843 = queryWeight, product of:
                2.4844115 = boost
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.014688355 = queryNorm
              0.36375242 = fieldWeight in 1776, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.115389 = idf(docFreq=1895, maxDocs=42740)
                0.0625 = fieldNorm(doc=1776)
          0.263342 = weight(abstract_txt:classifier in 1776) [ClassicSimilarity], result of:
            0.263342 = score(doc=1776,freq=1.0), product of:
              0.5816459 = queryWeight, product of:
                5.4664335 = boost
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.014688355 = queryNorm
              0.45275313 = fieldWeight in 1776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.0625 = fieldNorm(doc=1776)
        0.32 = coord(8/25)
    
  5. Yang, Y.; Liu, X.: A re-examination of text categorization methods (1999) 0.21
    0.2067862 = sum of:
      0.2067862 = product of:
        1.033931 = sum of:
          0.027612848 = weight(abstract_txt:text in 4387) [ClassicSimilarity], result of:
            0.027612848 = score(doc=4387,freq=1.0), product of:
              0.072724134 = queryWeight, product of:
                1.2224864 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.014688355 = queryNorm
              0.37969306 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.09375 = fieldNorm(doc=4387)
          0.058438934 = weight(abstract_txt:categories in 4387) [ClassicSimilarity], result of:
            0.058438934 = score(doc=4387,freq=1.0), product of:
              0.11987794 = queryWeight, product of:
                1.5695472 = boost
                5.199861 = idf(docFreq=640, maxDocs=42740)
                0.014688355 = queryNorm
              0.48748696 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.199861 = idf(docFreq=640, maxDocs=42740)
                0.09375 = fieldNorm(doc=4387)
          0.12199212 = weight(abstract_txt:categorization in 4387) [ClassicSimilarity], result of:
            0.12199212 = score(doc=4387,freq=1.0), product of:
              0.19580582 = queryWeight, product of:
                2.005938 = boost
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.014688355 = queryNorm
              0.623026 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.645611 = idf(docFreq=150, maxDocs=42740)
                0.09375 = fieldNorm(doc=4387)
          0.26725444 = weight(abstract_txt:category in 4387) [ClassicSimilarity], result of:
            0.26725444 = score(doc=4387,freq=3.0), product of:
              0.26214668 = queryWeight, product of:
                2.8426445 = boost
                6.2783957 = idf(docFreq=217, maxDocs=42740)
                0.014688355 = queryNorm
              1.0194844 = fieldWeight in 4387, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2783957 = idf(docFreq=217, maxDocs=42740)
                0.09375 = fieldNorm(doc=4387)
          0.55863273 = weight(abstract_txt:classifier in 4387) [ClassicSimilarity], result of:
            0.55863273 = score(doc=4387,freq=2.0), product of:
              0.5816459 = queryWeight, product of:
                5.4664335 = boost
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.014688355 = queryNorm
              0.96043444 = fieldWeight in 4387, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.09375 = fieldNorm(doc=4387)
        0.2 = coord(5/25)
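
The score breakdowns above follow the usual TF-IDF pattern of Lucene's ClassicSimilarity: each term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is multiplied by the coord factor (matched query terms over total query terms). The sketch below recomputes the score of the first hit (doc 4805) from the numbers in its listing. It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the values shown here; fieldNorm and queryNorm are taken from the listing rather than derived. The function names `idf`, `term_score`, and `document_score` are illustrative, not Lucene API names.

```python
import math

MAX_DOCS = 42740          # maxDocs from the listing
QUERY_NORM = 0.014688355  # queryNorm from the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed form: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, query_term_count=25):
    """Sum of term scores times coord(matched / query_term_count)."""
    total = sum(term_score(freq, df, boost, field_norm) for _, freq, df, boost in terms)
    coord = len(terms) / query_term_count
    return total * coord


# (term, freq, docFreq, boost) copied from the breakdown for doc 4805.
TERMS_4805 = [
    ("text", 6, 2023, 1.2224864),
    ("dimensionality", 1, 22, 1.2869719),
    ("specific", 1, 1529, 1.3069447),
    ("features", 3, 1212, 1.3770242),
    ("feature", 7, 303, 1.7947234),
    ("documents", 2, 1895, 2.4844115),
    ("category", 1, 217, 2.8426445),
    ("classifier", 5, 82, 5.4664335),
]

score_4805 = document_score(TERMS_4805, field_norm=0.0625)
print(round(score_4805, 4))
```

Because Lucene computes these values in single-precision floats, the recomputed score should agree with the listed 0.35184675 only to within rounding.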