Document (#35097)

Author
Kanaan, G.
Al-Shalabi, R.
Ghwanmeh, S.
Al-Ma'adeed, H.
Title
A comparison of text-classification techniques applied to Arabic text
Source
Journal of the American Society for Information Science and Technology. 60(2009) no.9, pp. 1836-1844
Year
2009
Abstract
Many algorithms have been implemented for the problem of text classification. Most of the work in this area was carried out for English text. Very little research has been carried out on Arabic text. The nature of Arabic text is different from that of English text, and preprocessing of Arabic text is more challenging. This paper presents an implementation of three automatic text-classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories has been automatically classified using the kNN, Rocchio, and naïve Bayes algorithms. The research results reveal that naïve Bayes was the best performer, followed by kNN and Rocchio.
Theme
Automatic classification
Object
Bayes algorithm
Naive Bayes algorithm
Rocchio algorithm
kNN algorithm

Similar documents (content)

  1. Rushdi-Saleh, M.; Martín-Valdivia, M.T.; Ureña-López, L.A.; Perea-Ortega, J.M.: OCA: Opinion corpus for Arabic (2011) 0.46
    0.45994836 = sum of:
      0.45994836 = product of:
        1.1498709 = sum of:
          0.05846463 = weight(abstract_txt:corpus in 4360) [ClassicSimilarity], result of:
            0.05846463 = score(doc=4360,freq=3.0), product of:
              0.07084709 = queryWeight, product of:
                1.142239 = boost
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.010170551 = queryNorm
              0.8252228 = fieldWeight in 4360, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.013413207 = weight(abstract_txt:research in 4360) [ClassicSimilarity], result of:
            0.013413207 = score(doc=4360,freq=2.0), product of:
              0.03829323 = queryWeight, product of:
                1.1876049 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.010170551 = queryNorm
              0.35027617 = fieldWeight in 4360, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.04530536 = weight(abstract_txt:challenging in 4360) [ClassicSimilarity], result of:
            0.04530536 = score(doc=4360,freq=1.0), product of:
              0.08620517 = queryWeight, product of:
                1.259977 = boost
                6.727074 = idf(docFreq=143, maxDocs=44218)
                0.010170551 = queryNorm
              0.5255527 = fieldWeight in 4360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.727074 = idf(docFreq=143, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.029892433 = weight(abstract_txt:been in 4360) [ClassicSimilarity], result of:
            0.029892433 = score(doc=4360,freq=2.0), product of:
              0.074789084 = queryWeight, product of:
                2.0327113 = boost
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.010170551 = queryNorm
              0.3996898 = fieldWeight in 4360, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.051557746 = weight(abstract_txt:english in 4360) [ClassicSimilarity], result of:
            0.051557746 = score(doc=4360,freq=1.0), product of:
              0.1183876 = queryWeight, product of:
                2.0881615 = boost
                5.574394 = idf(docFreq=455, maxDocs=44218)
                0.010170551 = queryNorm
              0.43549955 = fieldWeight in 4360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.574394 = idf(docFreq=455, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.055352326 = weight(abstract_txt:algorithms in 4360) [ClassicSimilarity], result of:
            0.055352326 = score(doc=4360,freq=1.0), product of:
              0.124127366 = queryWeight, product of:
                2.1381824 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.010170551 = queryNorm
              0.4459317 = fieldWeight in 4360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.05586885 = weight(abstract_txt:carried in 4360) [ClassicSimilarity], result of:
            0.05586885 = score(doc=4360,freq=1.0), product of:
              0.12489837 = queryWeight, product of:
                2.1448126 = boost
                5.7256255 = idf(docFreq=391, maxDocs=44218)
                0.010170551 = queryNorm
              0.4473145 = fieldWeight in 4360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7256255 = idf(docFreq=391, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.18400994 = weight(abstract_txt:bayes in 4360) [ClassicSimilarity], result of:
            0.18400994 = score(doc=4360,freq=1.0), product of:
              0.27648473 = queryWeight, product of:
                3.191145 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.010170551 = queryNorm
              0.66553384 = fieldWeight in 4360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.10825654 = weight(abstract_txt:text in 4360) [ClassicSimilarity], result of:
            0.10825654 = score(doc=4360,freq=1.0), product of:
              0.34266305 = queryWeight, product of:
                8.3315525 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.010170551 = queryNorm
              0.3159271 = fieldWeight in 4360, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
          0.5477498 = weight(abstract_txt:arabic in 4360) [ClassicSimilarity], result of:
            0.5477498 = score(doc=4360,freq=2.0), product of:
              0.65493095 = queryWeight, product of:
                8.506862 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.010170551 = queryNorm
              0.8363474 = fieldWeight in 4360, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.078125 = fieldNorm(doc=4360)
        0.4 = coord(10/25)
    
  2. Atlam, E.-S.; Morita, K.; Fuketa, M.; Aoe, J.-i.: A new approach for Arabic text classification using Arabic field-association terms (2011) 0.37
    0.36856914 = sum of:
      0.36856914 = product of:
        1.0238031 = sum of:
          0.028271852 = weight(abstract_txt:automatically in 4927) [ClassicSimilarity], result of:
            0.028271852 = score(doc=4927,freq=2.0), product of:
              0.057978433 = queryWeight, product of:
                1.0333066 = boost
                5.5168705 = idf(docFreq=482, maxDocs=44218)
                0.010170551 = queryNorm
              0.48762706 = fieldWeight in 4927, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5168705 = idf(docFreq=482, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.02575726 = weight(abstract_txt:followed in 4927) [ClassicSimilarity], result of:
            0.02575726 = score(doc=4927,freq=1.0), product of:
              0.068649925 = queryWeight, product of:
                1.1243875 = boost
                6.003155 = idf(docFreq=296, maxDocs=44218)
                0.010170551 = queryNorm
              0.3751972 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.003155 = idf(docFreq=296, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.007587655 = weight(abstract_txt:research in 4927) [ClassicSimilarity], result of:
            0.007587655 = score(doc=4927,freq=1.0), product of:
              0.03829323 = queryWeight, product of:
                1.1876049 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.010170551 = queryNorm
              0.19814612 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.041246198 = weight(abstract_txt:english in 4927) [ClassicSimilarity], result of:
            0.041246198 = score(doc=4927,freq=1.0), product of:
              0.1183876 = queryWeight, product of:
                2.0881615 = boost
                5.574394 = idf(docFreq=455, maxDocs=44218)
                0.010170551 = queryNorm
              0.34839964 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.574394 = idf(docFreq=455, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.04469508 = weight(abstract_txt:carried in 4927) [ClassicSimilarity], result of:
            0.04469508 = score(doc=4927,freq=1.0), product of:
              0.12489837 = queryWeight, product of:
                2.1448126 = boost
                5.7256255 = idf(docFreq=391, maxDocs=44218)
                0.010170551 = queryNorm
              0.3578516 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7256255 = idf(docFreq=391, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.022723662 = weight(abstract_txt:classification in 4927) [ClassicSimilarity], result of:
            0.022723662 = score(doc=4927,freq=1.0), product of:
              0.091075085 = queryWeight, product of:
                2.24314 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.010170551 = queryNorm
              0.2495047 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.14720796 = weight(abstract_txt:bayes in 4927) [ClassicSimilarity], result of:
            0.14720796 = score(doc=4927,freq=1.0), product of:
              0.27648473 = queryWeight, product of:
                3.191145 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.010170551 = queryNorm
              0.5324271 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.08660523 = weight(abstract_txt:text in 4927) [ClassicSimilarity], result of:
            0.08660523 = score(doc=4927,freq=1.0), product of:
              0.34266305 = queryWeight, product of:
                8.3315525 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.010170551 = queryNorm
              0.25274166 = fieldWeight in 4927, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
          0.6197082 = weight(abstract_txt:arabic in 4927) [ClassicSimilarity], result of:
            0.6197082 = score(doc=4927,freq=4.0), product of:
              0.65493095 = queryWeight, product of:
                8.506862 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.010170551 = queryNorm
              0.9462191 = fieldWeight in 4927, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.0625 = fieldNorm(doc=4927)
        0.36 = coord(9/25)
    
  3. Kanan, T.; Fox, E.A.: Automated Arabic text classification with P-Stemmer, machine learning, and a tailored news article taxonomy (2016) 0.23
    0.2301899 = sum of:
      0.2301899 = product of:
        0.9591246 = sum of:
          0.010730565 = weight(abstract_txt:research in 3151) [ClassicSimilarity], result of:
            0.010730565 = score(doc=3151,freq=2.0), product of:
              0.03829323 = queryWeight, product of:
                1.1876049 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.010170551 = queryNorm
              0.28022093 = fieldWeight in 3151, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0625 = fieldNorm(doc=3151)
          0.03130093 = weight(abstract_txt:techniques in 3151) [ClassicSimilarity], result of:
            0.03130093 = score(doc=3151,freq=2.0), product of:
              0.0781769 = queryWeight, product of:
                1.6968763 = boost
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.010170551 = queryNorm
              0.40038592 = fieldWeight in 3151, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.0625 = fieldNorm(doc=3151)
          0.023913946 = weight(abstract_txt:been in 3151) [ClassicSimilarity], result of:
            0.023913946 = score(doc=3151,freq=2.0), product of:
              0.074789084 = queryWeight, product of:
                2.0327113 = boost
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.010170551 = queryNorm
              0.31975183 = fieldWeight in 3151, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.0625 = fieldNorm(doc=3151)
          0.041246198 = weight(abstract_txt:english in 3151) [ClassicSimilarity], result of:
            0.041246198 = score(doc=3151,freq=1.0), product of:
              0.1183876 = queryWeight, product of:
                2.0881615 = boost
                5.574394 = idf(docFreq=455, maxDocs=44218)
                0.010170551 = queryNorm
              0.34839964 = fieldWeight in 3151, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.574394 = idf(docFreq=455, maxDocs=44218)
                0.0625 = fieldNorm(doc=3151)
          0.03213611 = weight(abstract_txt:classification in 3151) [ClassicSimilarity], result of:
            0.03213611 = score(doc=3151,freq=2.0), product of:
              0.091075085 = queryWeight, product of:
                2.24314 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.010170551 = queryNorm
              0.3528529 = fieldWeight in 3151, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=3151)
          0.81979686 = weight(abstract_txt:arabic in 3151) [ClassicSimilarity], result of:
            0.81979686 = score(doc=3151,freq=7.0), product of:
              0.65493095 = queryWeight, product of:
                8.506862 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.010170551 = queryNorm
              1.2517302 = fieldWeight in 3151, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.0625 = fieldNorm(doc=3151)
        0.24 = coord(6/25)
    
  4. Hmeidi, I.I.; Al-Shalabi, R.F.; Al-Taani, A.T.; Najadat, H.; Al-Hazaimeh, S.A.: A novel approach to the extraction of roots from Arabic words using bigrams (2010) 0.18
    0.17719947 = sum of:
      0.17719947 = product of:
        0.88599735 = sum of:
          0.027003657 = weight(abstract_txt:corpus in 3426) [ClassicSimilarity], result of:
            0.027003657 = score(doc=3426,freq=1.0), product of:
              0.07084709 = queryWeight, product of:
                1.142239 = boost
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.010170551 = queryNorm
              0.3811541 = fieldWeight in 3426, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.016909713 = weight(abstract_txt:been in 3426) [ClassicSimilarity], result of:
            0.016909713 = score(doc=3426,freq=1.0), product of:
              0.074789084 = queryWeight, product of:
                2.0327113 = boost
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.010170551 = queryNorm
              0.22609869 = fieldWeight in 3426, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.617579 = idf(docFreq=3226, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.06262401 = weight(abstract_txt:algorithms in 3426) [ClassicSimilarity], result of:
            0.06262401 = score(doc=3426,freq=2.0), product of:
              0.124127366 = queryWeight, product of:
                2.1381824 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.010170551 = queryNorm
              0.5045141 = fieldWeight in 3426, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.08660523 = weight(abstract_txt:text in 3426) [ClassicSimilarity], result of:
            0.08660523 = score(doc=3426,freq=1.0), product of:
              0.34266305 = queryWeight, product of:
                8.3315525 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.010170551 = queryNorm
              0.25274166 = fieldWeight in 3426, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
          0.69285476 = weight(abstract_txt:arabic in 3426) [ClassicSimilarity], result of:
            0.69285476 = score(doc=3426,freq=5.0), product of:
              0.65493095 = queryWeight, product of:
                8.506862 = boost
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.010170551 = queryNorm
              1.0579051 = fieldWeight in 3426, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.5697527 = idf(docFreq=61, maxDocs=44218)
                0.0625 = fieldNorm(doc=3426)
        0.2 = coord(5/25)
    
  5. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.17
    0.17034517 = sum of:
      0.17034517 = product of:
        0.6083756 = sum of:
          0.022839053 = weight(abstract_txt:implemented in 831) [ClassicSimilarity], result of:
            0.022839053 = score(doc=831,freq=1.0), product of:
              0.06336153 = queryWeight, product of:
                1.0802115 = boost
                5.767298 = idf(docFreq=375, maxDocs=44218)
                0.010170551 = queryNorm
              0.36045614 = fieldWeight in 831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.767298 = idf(docFreq=375, maxDocs=44218)
                0.0625 = fieldNorm(doc=831)
          0.007587655 = weight(abstract_txt:research in 831) [ClassicSimilarity], result of:
            0.007587655 = score(doc=831,freq=1.0), product of:
              0.03829323 = queryWeight, product of:
                1.1876049 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.010170551 = queryNorm
              0.19814612 = fieldWeight in 831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0625 = fieldNorm(doc=831)
          0.03130093 = weight(abstract_txt:techniques in 831) [ClassicSimilarity], result of:
            0.03130093 = score(doc=831,freq=2.0), product of:
              0.0781769 = queryWeight, product of:
                1.6968763 = boost
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.010170551 = queryNorm
              0.40038592 = fieldWeight in 831, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.0625 = fieldNorm(doc=831)
          0.06817099 = weight(abstract_txt:classification in 831) [ClassicSimilarity], result of:
            0.06817099 = score(doc=831,freq=9.0), product of:
              0.091075085 = queryWeight, product of:
                2.24314 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.010170551 = queryNorm
              0.7485141 = fieldWeight in 831, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=831)
          0.13761382 = weight(abstract_txt:naïve in 831) [ClassicSimilarity], result of:
            0.13761382 = score(doc=831,freq=1.0), product of:
              0.2643372 = queryWeight, product of:
                3.1202552 = boost
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.010170551 = queryNorm
              0.5205995 = fieldWeight in 831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.0625 = fieldNorm(doc=831)
          0.14720796 = weight(abstract_txt:bayes in 831) [ClassicSimilarity], result of:
            0.14720796 = score(doc=831,freq=1.0), product of:
              0.27648473 = queryWeight, product of:
                3.191145 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.010170551 = queryNorm
              0.5324271 = fieldWeight in 831, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.0625 = fieldNorm(doc=831)
          0.1936552 = weight(abstract_txt:text in 831) [ClassicSimilarity], result of:
            0.1936552 = score(doc=831,freq=5.0), product of:
              0.34266305 = queryWeight, product of:
                8.3315525 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.010170551 = queryNorm
              0.5651476 = fieldWeight in 831, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=831)
        0.28 = coord(7/25)
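
The score breakdowns above follow the classic Lucene TF-IDF pattern: each term score is queryWeight times fieldWeight, the term scores are summed, and the sum is scaled by the coord factor. The following Python sketch reproduces the arithmetic for the first similar document (doc 4360). The function names are illustrative and not part of any library; the formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)) are assumed from the printed values, and all inputs (boost, queryNorm, fieldNorm, term scores) are copied from the listing rather than derived.

```python
import math


def classic_tf(freq):
    # Term-frequency factor as it appears in the listing: sqrt(freq).
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    # Inverse document frequency as it appears in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm
    field_weight = classic_tf(freq) * idf * field_norm
    return query_weight * field_weight


# Term "corpus" in doc 4360, inputs copied from the listing.
corpus_score = term_score(
    freq=3.0,
    doc_freq=269,
    max_docs=44218,
    boost=1.142239,
    query_norm=0.010170551,
    field_norm=0.078125,
)

# The ten per-term scores listed for doc 4360.
term_scores_4360 = [
    0.05846463, 0.013413207, 0.04530536, 0.029892433, 0.051557746,
    0.055352326, 0.05586885, 0.18400994, 0.10825654, 0.5477498,
]

# coord(10/25): 10 of the 25 query terms matched this document.
coord = 10 / 25
final_score = sum(term_scores_4360) * coord

print(f"corpus term score: {corpus_score:.6f}")
print(f"final score:       {final_score:.6f}")
```

Small differences in the last digits are expected, because the listing was produced with single-precision floats while this sketch uses double precision.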