Document (#30117)

Author
Duwairi, R.M.
Title
Machine learning for Arabic text categorization
Source
Journal of the American Society for Information Science and Technology. 57(2006) no.8, pp. 1005-1010
Year
2006
Abstract
In this article we propose a distance-based classifier for categorizing Arabic text. Each category is represented as a vector of words in an m-dimensional space, and documents are classified on the basis of their closeness to feature vectors of categories. The classifier, in its learning phase, scans the set of training documents to extract features of categories that capture inherent category-specific properties; in its testing phase the classifier uses previously determined category-specific features to categorize unclassified documents. Stemming was used to reduce the dimensionality of feature vectors of documents. The accuracy of the classifier was tested by carrying out several categorization tasks on an in-house collected Arabic corpus. The results show that the proposed classifier is very accurate and robust.
Theme
Computational linguistics
Automatic classification
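
The abstract describes a classifier that represents each category as a word vector and assigns a document to the closest category. The following is a minimal sketch of that general idea, not the article's implementation. Several things are assumptions: the abstract does not name the distance measure, so cosine similarity over raw term counts is swapped in here; `stem` is a hypothetical placeholder for an Arabic stemmer; and the tiny English training set is toy data invented for illustration.

```python
import math
from collections import Counter


def stem(token):
    # Hypothetical placeholder: returns the token unchanged.
    # A real system would apply an Arabic stemmer here to shrink the vocabulary.
    return token


def vectorize(text):
    # Bag-of-words vector: stemmed term -> count.
    return Counter(stem(t) for t in text.split())


def train(labeled_docs):
    # Learning phase: sum the term counts of each category's training documents
    # to obtain one feature vector per category.
    categories = {}
    for label, text in labeled_docs:
        categories.setdefault(label, Counter()).update(vectorize(text))
    return categories


def cosine(a, b):
    # Cosine similarity between two sparse vectors (assumed closeness measure).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(categories, text):
    # Testing phase: pick the category whose vector is closest to the document.
    doc = vectorize(text)
    return max(categories, key=lambda c: cosine(categories[c], doc))


# Toy data, invented for illustration only.
training = [
    ("sports", "match goal team player goal"),
    ("sports", "team coach match"),
    ("economy", "market bank price trade"),
    ("economy", "bank loan price"),
]
model = train(training)
print(classify(model, "player team goal"))
print(classify(model, "bank price market"))
```

Summing counts per category keeps the sketch short; a weighting scheme such as tf-idf or a different distance function could be substituted without changing the overall structure.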

Similar documents (content)

  1. Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.35
    0.35143188 = sum of:
      0.35143188 = product of:
        1.0982246 = sum of:
          0.045040637 = weight(abstract_txt:text in 3984) [ClassicSimilarity], result of:
            0.045040637 = score(doc=3984,freq=6.0), product of:
              0.07265812 = queryWeight, product of:
                1.2228372 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.014674077 = queryNorm
              0.6198982 = fieldWeight in 3984, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.0625 = fieldNorm(doc=3984)
          0.08577057 = weight(abstract_txt:dimensionality in 3984) [ClassicSimilarity], result of:
            0.08577057 = score(doc=3984,freq=1.0), product of:
              0.16099551 = queryWeight, product of:
                1.2871182 = boost
                8.524021 = idf(docFreq=22, maxDocs=42596)
                0.014674077 = queryNorm
              0.5327513 = fieldWeight in 3984, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.524021 = idf(docFreq=22, maxDocs=42596)
                0.0625 = fieldNorm(doc=3984)
          0.02249186 = weight(abstract_txt:specific in 3984) [ClassicSimilarity], result of:
            0.02249186 = score(doc=3984,freq=1.0), product of:
              0.083102696 = queryWeight, product of:
                1.3077782 = boost
                4.330422 = idf(docFreq=1523, maxDocs=42596)
                0.014674077 = queryNorm
              0.27065137 = fieldWeight in 3984, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.330422 = idf(docFreq=1523, maxDocs=42596)
                0.0625 = fieldNorm(doc=3984)
          0.04549666 = weight(abstract_txt:features in 3984) [ClassicSimilarity], result of:
            0.04549666 = score(doc=3984,freq=3.0), product of:
              0.09216036 = queryWeight, product of:
                1.3772051 = boost
                4.5603137 = idf(docFreq=1210, maxDocs=42596)
                0.014674077 = queryNorm
              0.49366844 = fieldWeight in 3984, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5603137 = idf(docFreq=1210, maxDocs=42596)
                0.0625 = fieldNorm(doc=3984)
          0.15403195 = weight(abstract_txt:feature in 3984) [ClassicSimilarity], result of:
            0.15403195 = score(doc=3984,freq=7.0), product of:
              0.15666528 = queryWeight, product of:
                1.7956138 = boost
                5.9457827 = idf(docFreq=302, maxDocs=42596)
                0.014674077 = queryNorm
              0.9831914 = fieldWeight in 3984, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.9457827 = idf(docFreq=302, maxDocs=42596)
                0.0625 = fieldNorm(doc=3984)
          0.054531205 = weight(abstract_txt:documents in 3984) [ClassicSimilarity], result of:
            0.054531205 = score(doc=3984,freq=2.0), product of:
              0.14997825 = queryWeight, product of:
                2.4845955 = boost
                4.1135974 = idf(docFreq=1892, maxDocs=42596)
                0.014674077 = queryNorm
              0.36359408 = fieldWeight in 3984, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1135974 = idf(docFreq=1892, maxDocs=42596)
                0.0625 = fieldNorm(doc=3984)
          0.10310596 = weight(abstract_txt:category in 3984) [ClassicSimilarity], result of:
            0.10310596 = score(doc=3984,freq=1.0), product of:
              0.26251322 = queryWeight, product of:
                2.8467398 = boost
                6.2842374 = idf(docFreq=215, maxDocs=42596)
                0.014674077 = queryNorm
              0.39276484 = fieldWeight in 3984, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2842374 = idf(docFreq=215, maxDocs=42596)
                0.0625 = fieldNorm(doc=3984)
          0.58775574 = weight(abstract_txt:classifier in 3984) [ClassicSimilarity], result of:
            0.58775574 = score(doc=3984,freq=5.0), product of:
              0.58083504 = queryWeight, product of:
                5.4666715 = boost
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.014674077 = queryNorm
              1.0119151 = fieldWeight in 3984, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.0625 = fieldNorm(doc=3984)
        0.32 = coord(8/25)
    
  2. Duwairi, R.; Al-Refai, M.N.; Khasawneh, N.: Feature reduction techniques for Arabic text categorization (2009) 0.34
    0.34054306 = sum of:
      0.34054306 = product of:
        1.0641971 = sum of:
          0.114748426 = weight(abstract_txt:stemming in 4349) [ClassicSimilarity], result of:
            0.114748426 = score(doc=4349,freq=4.0), product of:
              0.12314007 = queryWeight, product of:
                1.1256704 = boost
                7.454823 = idf(docFreq=66, maxDocs=42596)
                0.014674077 = queryNorm
              0.9318529 = fieldWeight in 4349, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.454823 = idf(docFreq=66, maxDocs=42596)
                0.0625 = fieldNorm(doc=4349)
          0.018387765 = weight(abstract_txt:text in 4349) [ClassicSimilarity], result of:
            0.018387765 = score(doc=4349,freq=1.0), product of:
              0.07265812 = queryWeight, product of:
                1.2228372 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.014674077 = queryNorm
              0.25307238 = fieldWeight in 4349, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.0625 = fieldNorm(doc=4349)
          0.0390061 = weight(abstract_txt:categories in 4349) [ClassicSimilarity], result of:
            0.0390061 = score(doc=4349,freq=1.0), product of:
              0.11995543 = queryWeight, product of:
                1.5712183 = boost
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.014674077 = queryNorm
              0.32517162 = fieldWeight in 4349, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.0625 = fieldNorm(doc=4349)
          0.058218606 = weight(abstract_txt:feature in 4349) [ClassicSimilarity], result of:
            0.058218606 = score(doc=4349,freq=1.0), product of:
              0.15666528 = queryWeight, product of:
                1.7956138 = boost
                5.9457827 = idf(docFreq=302, maxDocs=42596)
                0.014674077 = queryNorm
              0.37161142 = fieldWeight in 4349, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9457827 = idf(docFreq=302, maxDocs=42596)
                0.0625 = fieldNorm(doc=4349)
          0.32311568 = weight(abstract_txt:vectors in 4349) [ClassicSimilarity], result of:
            0.32311568 = score(doc=4349,freq=6.0), product of:
              0.27026293 = queryWeight, product of:
                2.3584127 = boost
                7.809368 = idf(docFreq=46, maxDocs=42596)
                0.014674077 = queryNorm
              1.1955605 = fieldWeight in 4349, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.809368 = idf(docFreq=46, maxDocs=42596)
                0.0625 = fieldNorm(doc=4349)
          0.06678681 = weight(abstract_txt:documents in 4349) [ClassicSimilarity], result of:
            0.06678681 = score(doc=4349,freq=3.0), product of:
              0.14997825 = queryWeight, product of:
                2.4845955 = boost
                4.1135974 = idf(docFreq=1892, maxDocs=42596)
                0.014674077 = queryNorm
              0.44530997 = fieldWeight in 4349, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1135974 = idf(docFreq=1892, maxDocs=42596)
                0.0625 = fieldNorm(doc=4349)
          0.1810813 = weight(abstract_txt:arabic in 4349) [ClassicSimilarity], result of:
            0.1810813 = score(doc=4349,freq=1.0), product of:
              0.3821299 = queryWeight, product of:
                3.4346123 = boost
                7.5819783 = idf(docFreq=58, maxDocs=42596)
                0.014674077 = queryNorm
              0.47387365 = fieldWeight in 4349, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5819783 = idf(docFreq=58, maxDocs=42596)
                0.0625 = fieldNorm(doc=4349)
          0.26285237 = weight(abstract_txt:classifier in 4349) [ClassicSimilarity], result of:
            0.26285237 = score(doc=4349,freq=1.0), product of:
              0.58083504 = queryWeight, product of:
                5.4666715 = boost
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.014674077 = queryNorm
              0.4525422 = fieldWeight in 4349, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.0625 = fieldNorm(doc=4349)
        0.32 = coord(8/25)
    
  3. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.25
    0.24659248 = sum of:
      0.24659248 = product of:
        1.0274687 = sum of:
          0.022984704 = weight(abstract_txt:text in 4390) [ClassicSimilarity], result of:
            0.022984704 = score(doc=4390,freq=1.0), product of:
              0.07265812 = queryWeight, product of:
                1.2228372 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.014674077 = queryNorm
              0.31634048 = fieldWeight in 4390, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.06675169 = weight(abstract_txt:learning in 4390) [ClassicSimilarity], result of:
            0.06675169 = score(doc=4390,freq=3.0), product of:
              0.102547705 = queryWeight, product of:
                1.4527454 = boost
                4.810449 = idf(docFreq=942, maxDocs=42596)
                0.014674077 = queryNorm
              0.650933 = fieldWeight in 4390, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.810449 = idf(docFreq=942, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.0689537 = weight(abstract_txt:categories in 4390) [ClassicSimilarity], result of:
            0.0689537 = score(doc=4390,freq=2.0), product of:
              0.11995543 = queryWeight, product of:
                1.5712183 = boost
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.014674077 = queryNorm
              0.5748277 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.14348368 = weight(abstract_txt:categorization in 4390) [ClassicSimilarity], result of:
            0.14348368 = score(doc=4390,freq=2.0), product of:
              0.19551642 = queryWeight, product of:
                2.0059412 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.014674077 = queryNorm
              0.73387027 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.068164006 = weight(abstract_txt:documents in 4390) [ClassicSimilarity], result of:
            0.068164006 = score(doc=4390,freq=2.0), product of:
              0.14997825 = queryWeight, product of:
                2.4845955 = boost
                4.1135974 = idf(docFreq=1892, maxDocs=42596)
                0.014674077 = queryNorm
              0.4544926 = fieldWeight in 4390, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1135974 = idf(docFreq=1892, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
          0.65713096 = weight(abstract_txt:classifier in 4390) [ClassicSimilarity], result of:
            0.65713096 = score(doc=4390,freq=4.0), product of:
              0.58083504 = queryWeight, product of:
                5.4666715 = boost
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.014674077 = queryNorm
              1.1313555 = fieldWeight in 4390, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.078125 = fieldNorm(doc=4390)
        0.24 = coord(6/25)
    
  4. Maghsoodi, N.; Homayounpour, M.M.: Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection (2011) 0.23
    0.22899856 = sum of:
      0.22899856 = product of:
        0.7156205 = sum of:
          0.018387765 = weight(abstract_txt:text in 776) [ClassicSimilarity], result of:
            0.018387765 = score(doc=776,freq=1.0), product of:
              0.07265812 = queryWeight, product of:
                1.2228372 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.014674077 = queryNorm
              0.25307238 = fieldWeight in 776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.0625 = fieldNorm(doc=776)
          0.04549666 = weight(abstract_txt:features in 776) [ClassicSimilarity], result of:
            0.04549666 = score(doc=776,freq=3.0), product of:
              0.09216036 = queryWeight, product of:
                1.3772051 = boost
                4.5603137 = idf(docFreq=1210, maxDocs=42596)
                0.014674077 = queryNorm
              0.49366844 = fieldWeight in 776, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5603137 = idf(docFreq=1210, maxDocs=42596)
                0.0625 = fieldNorm(doc=776)
          0.055162955 = weight(abstract_txt:categories in 776) [ClassicSimilarity], result of:
            0.055162955 = score(doc=776,freq=2.0), product of:
              0.11995543 = queryWeight, product of:
                1.5712183 = boost
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.014674077 = queryNorm
              0.4598621 = fieldWeight in 776, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.0625 = fieldNorm(doc=776)
          0.13018076 = weight(abstract_txt:feature in 776) [ClassicSimilarity], result of:
            0.13018076 = score(doc=776,freq=5.0), product of:
              0.15666528 = queryWeight, product of:
                1.7956138 = boost
                5.9457827 = idf(docFreq=302, maxDocs=42596)
                0.014674077 = queryNorm
              0.8309484 = fieldWeight in 776, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.9457827 = idf(docFreq=302, maxDocs=42596)
                0.0625 = fieldNorm(doc=776)
          0.06784213 = weight(abstract_txt:phase in 776) [ClassicSimilarity], result of:
            0.06784213 = score(doc=776,freq=1.0), product of:
              0.17348605 = queryWeight, product of:
                1.889552 = boost
                6.2568383 = idf(docFreq=221, maxDocs=42596)
                0.014674077 = queryNorm
              0.3910524 = fieldWeight in 776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2568383 = idf(docFreq=221, maxDocs=42596)
                0.0625 = fieldNorm(doc=776)
          0.08116663 = weight(abstract_txt:categorization in 776) [ClassicSimilarity], result of:
            0.08116663 = score(doc=776,freq=1.0), product of:
              0.19551642 = queryWeight, product of:
                2.0059412 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.014674077 = queryNorm
              0.41513973 = fieldWeight in 776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.0625 = fieldNorm(doc=776)
          0.054531205 = weight(abstract_txt:documents in 776) [ClassicSimilarity], result of:
            0.054531205 = score(doc=776,freq=2.0), product of:
              0.14997825 = queryWeight, product of:
                2.4845955 = boost
                4.1135974 = idf(docFreq=1892, maxDocs=42596)
                0.014674077 = queryNorm
              0.36359408 = fieldWeight in 776, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1135974 = idf(docFreq=1892, maxDocs=42596)
                0.0625 = fieldNorm(doc=776)
          0.26285237 = weight(abstract_txt:classifier in 776) [ClassicSimilarity], result of:
            0.26285237 = score(doc=776,freq=1.0), product of:
              0.58083504 = queryWeight, product of:
                5.4666715 = boost
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.014674077 = queryNorm
              0.4525422 = fieldWeight in 776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.0625 = fieldNorm(doc=776)
        0.32 = coord(8/25)
    
  5. Yang, Y.; Liu, X.: A re-examination of text categorization methods (1999) 0.21
    0.2066624 = sum of:
      0.2066624 = product of:
        1.033312 = sum of:
          0.027581645 = weight(abstract_txt:text in 4387) [ClassicSimilarity], result of:
            0.027581645 = score(doc=4387,freq=1.0), product of:
              0.07265812 = queryWeight, product of:
                1.2228372 = boost
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.014674077 = queryNorm
              0.37960857 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.049158 = idf(docFreq=2018, maxDocs=42596)
                0.09375 = fieldNorm(doc=4387)
          0.058509152 = weight(abstract_txt:categories in 4387) [ClassicSimilarity], result of:
            0.058509152 = score(doc=4387,freq=1.0), product of:
              0.11995543 = queryWeight, product of:
                1.5712183 = boost
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.014674077 = queryNorm
              0.48775744 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.202746 = idf(docFreq=636, maxDocs=42596)
                0.09375 = fieldNorm(doc=4387)
          0.12174996 = weight(abstract_txt:categorization in 4387) [ClassicSimilarity], result of:
            0.12174996 = score(doc=4387,freq=1.0), product of:
              0.19551642 = queryWeight, product of:
                2.0059412 = boost
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.014674077 = queryNorm
              0.62270963 = fieldWeight in 4387, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6422358 = idf(docFreq=150, maxDocs=42596)
                0.09375 = fieldNorm(doc=4387)
          0.26787713 = weight(abstract_txt:category in 4387) [ClassicSimilarity], result of:
            0.26787713 = score(doc=4387,freq=3.0), product of:
              0.26251322 = queryWeight, product of:
                2.8467398 = boost
                6.2842374 = idf(docFreq=215, maxDocs=42596)
                0.014674077 = queryNorm
              1.020433 = fieldWeight in 4387, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2842374 = idf(docFreq=215, maxDocs=42596)
                0.09375 = fieldNorm(doc=4387)
          0.55759406 = weight(abstract_txt:classifier in 4387) [ClassicSimilarity], result of:
            0.55759406 = score(doc=4387,freq=2.0), product of:
              0.58083504 = queryWeight, product of:
                5.4666715 = boost
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.014674077 = queryNorm
              0.9599869 = fieldWeight in 4387, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.240675 = idf(docFreq=82, maxDocs=42596)
                0.09375 = fieldNorm(doc=4387)
        0.2 = coord(5/25)
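
The score trees above can be recomputed by hand, which helps when reading them. The sketch below assumes the formulas that reproduce the listed numbers: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final score of coord times the sum of the term scores. The function names are illustrative and not part of any library. All input values are copied from the listing.

```python
import math


def idf(doc_freq, max_docs):
    # Inverse document frequency, in the form that matches the listed idf values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # One "weight(...)" node: queryWeight times fieldWeight.
    i = idf(doc_freq, max_docs)
    query_weight = boost * i * query_norm
    field_weight = tf(freq) * i * field_norm
    return query_weight * field_weight


# Term "text" in document 3984 (first entry in the list).
text_score = term_score(
    freq=6.0,
    doc_freq=2018,
    max_docs=42596,
    boost=1.2228372,
    query_norm=0.014674077,
    field_norm=0.0625,
)
print(text_score)  # should be close to the listed 0.045040637

# Document 4387 (last entry): five matching terms out of 25 query terms.
term_scores_4387 = [0.027581645, 0.058509152, 0.12174996, 0.26787713, 0.55759406]
total_4387 = sum(term_scores_4387) * (5 / 25)
print(total_4387)  # should be close to the listed 0.2066624
```

Small differences in the last digits are expected, since the listed values appear to be single-precision floats while Python uses double precision.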