Document (#35098)

Author
Kanaan, G.
Al-Shalabi, R.
Ghwanmeh, S.
Al-Ma'adeed, H.
Title
A comparison of text-classification techniques applied to Arabic text
Source
Journal of the American Society for Information Science and Technology. 60(2009) no.9, pp.1836-1844
Year
2009
Abstract
Many algorithms have been implemented for the problem of text classification. Most of the work in this area was carried out for English text. Very little research has been carried out on Arabic text. The nature of Arabic text is different from that of English text, and preprocessing of Arabic text is more challenging. This paper presents an implementation of three automatic text-classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories has been automatically classified using the kNN, Rocchio, and naïve Bayes algorithms. The research results reveal that naïve Bayes was the best performer, followed by kNN and Rocchio.
Theme
Automatic classification
Object
Bayes algorithm
Naive Bayes algorithm
Rocchio algorithm
kNN algorithm

Similar documents (content)

  1. Rushdi-Saleh, M.; Martín-Valdivia, M.T.; Ureña-López, L.A.; Perea-Ortega, J.M.: OCA: Opinion corpus for Arabic (2011) 0.46
    0.46298012 = sum of:
      0.46298012 = product of:
        1.1574503 = sum of:
          0.05931114 = weight(abstract_txt:corpus in 1361) [ClassicSimilarity], result of:
            0.05931114 = score(doc=1361,freq=3.0), product of:
              0.071462795 = queryWeight, product of:
                1.1387264 = boost
                6.1334615 = idf(docFreq=251, maxDocs=42740)
                0.010231869 = queryNorm
              0.8299583 = fieldWeight in 1361, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.1334615 = idf(docFreq=251, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.013935791 = weight(abstract_txt:research in 1361) [ClassicSimilarity], result of:
            0.013935791 = score(doc=1361,freq=2.0), product of:
              0.03924497 = queryWeight, product of:
                1.1934009 = boost
                3.2139761 = idf(docFreq=4669, maxDocs=42740)
                0.010231869 = queryNorm
              0.35509753 = fieldWeight in 1361, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.2139761 = idf(docFreq=4669, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.04672778 = weight(abstract_txt:challenging in 1361) [ClassicSimilarity], result of:
            0.04672778 = score(doc=1361,freq=1.0), product of:
              0.08791837 = queryWeight, product of:
                1.263046 = boost
                6.803078 = idf(docFreq=128, maxDocs=42740)
                0.010231869 = queryNorm
              0.5314905 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.803078 = idf(docFreq=128, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.030284801 = weight(abstract_txt:been in 1361) [ClassicSimilarity], result of:
            0.030284801 = score(doc=1361,freq=2.0), product of:
              0.07537196 = queryWeight, product of:
                2.025559 = boost
                3.6367204 = idf(docFreq=3059, maxDocs=42740)
                0.010231869 = queryNorm
              0.40180463 = fieldWeight in 1361, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6367204 = idf(docFreq=3059, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.052494086 = weight(abstract_txt:english in 1361) [ClassicSimilarity], result of:
            0.052494086 = score(doc=1361,freq=1.0), product of:
              0.11970524 = queryWeight, product of:
                2.0842555 = boost
                5.6131573 = idf(docFreq=423, maxDocs=42740)
                0.010231869 = queryNorm
              0.4385279 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6131573 = idf(docFreq=423, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.05593749 = weight(abstract_txt:carried in 1361) [ClassicSimilarity], result of:
            0.05593749 = score(doc=1361,freq=1.0), product of:
              0.124884404 = queryWeight, product of:
                2.1288667 = boost
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.010231869 = queryNorm
              0.44791415 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.056649633 = weight(abstract_txt:algorithms in 1361) [ClassicSimilarity], result of:
            0.056649633 = score(doc=1361,freq=1.0), product of:
              0.12594211 = queryWeight, product of:
                2.137863 = boost
                5.757529 = idf(docFreq=366, maxDocs=42740)
                0.010231869 = queryNorm
              0.44980693 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.757529 = idf(docFreq=366, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.18405095 = weight(abstract_txt:bayes in 1361) [ClassicSimilarity], result of:
            0.18405095 = score(doc=1361,freq=1.0), product of:
              0.27626863 = queryWeight, product of:
                3.166359 = boost
                8.527396 = idf(docFreq=22, maxDocs=42740)
                0.010231869 = queryNorm
              0.66620284 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.527396 = idf(docFreq=22, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.10845151 = weight(abstract_txt:text in 1361) [ClassicSimilarity], result of:
            0.10845151 = score(doc=1361,freq=1.0), product of:
              0.3427553 = queryWeight, product of:
                8.271186 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.010231869 = queryNorm
              0.3164109 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
          0.54960716 = weight(abstract_txt:arabic in 1361) [ClassicSimilarity], result of:
            0.54960716 = score(doc=1361,freq=2.0), product of:
              0.6558002 = queryWeight, product of:
                8.44969 = boost
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.010231869 = queryNorm
              0.83807105 = fieldWeight in 1361, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.078125 = fieldNorm(doc=1361)
        0.4 = coord(10/25)
    
  2. Atlam, E.-S.; Morita, K.; Fuketa, M.; Aoe, J.-i.: A new approach for Arabic text classification using Arabic field-association terms (2011) 0.37
    0.36991847 = sum of:
      0.36991847 = product of:
        1.0275513 = sum of:
          0.02812285 = weight(abstract_txt:automatically in 1928) [ClassicSimilarity], result of:
            0.02812285 = score(doc=1928,freq=2.0), product of:
              0.057720814 = queryWeight, product of:
                1.0234004 = boost
                5.5122876 = idf(docFreq=468, maxDocs=42740)
                0.010231869 = queryNorm
              0.487222 = fieldWeight in 1928, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5122876 = idf(docFreq=468, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
          0.026193675 = weight(abstract_txt:followed in 1928) [ClassicSimilarity], result of:
            0.026193675 = score(doc=1928,freq=1.0), product of:
              0.06935863 = queryWeight, product of:
                1.1218367 = boost
                6.0424895 = idf(docFreq=275, maxDocs=42740)
                0.010231869 = queryNorm
              0.3776556 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0424895 = idf(docFreq=275, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
          0.007883275 = weight(abstract_txt:research in 1928) [ClassicSimilarity], result of:
            0.007883275 = score(doc=1928,freq=1.0), product of:
              0.03924497 = queryWeight, product of:
                1.1934009 = boost
                3.2139761 = idf(docFreq=4669, maxDocs=42740)
                0.010231869 = queryNorm
              0.20087351 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2139761 = idf(docFreq=4669, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
          0.041995272 = weight(abstract_txt:english in 1928) [ClassicSimilarity], result of:
            0.041995272 = score(doc=1928,freq=1.0), product of:
              0.11970524 = queryWeight, product of:
                2.0842555 = boost
                5.6131573 = idf(docFreq=423, maxDocs=42740)
                0.010231869 = queryNorm
              0.35082233 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6131573 = idf(docFreq=423, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
          0.044749994 = weight(abstract_txt:carried in 1928) [ClassicSimilarity], result of:
            0.044749994 = score(doc=1928,freq=1.0), product of:
              0.124884404 = queryWeight, product of:
                2.1288667 = boost
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.010231869 = queryNorm
              0.35833132 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
          0.022794738 = weight(abstract_txt:classification in 1928) [ClassicSimilarity], result of:
            0.022794738 = score(doc=1928,freq=1.0), product of:
              0.09118003 = queryWeight, product of:
                2.22787 = boost
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.010231869 = queryNorm
              0.24999705 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
          0.14724076 = weight(abstract_txt:bayes in 1928) [ClassicSimilarity], result of:
            0.14724076 = score(doc=1928,freq=1.0), product of:
              0.27626863 = queryWeight, product of:
                3.166359 = boost
                8.527396 = idf(docFreq=22, maxDocs=42740)
                0.010231869 = queryNorm
              0.53296226 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.527396 = idf(docFreq=22, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
          0.08676121 = weight(abstract_txt:text in 1928) [ClassicSimilarity], result of:
            0.08676121 = score(doc=1928,freq=1.0), product of:
              0.3427553 = queryWeight, product of:
                8.271186 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.010231869 = queryNorm
              0.2531287 = fieldWeight in 1928, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
          0.62180954 = weight(abstract_txt:arabic in 1928) [ClassicSimilarity], result of:
            0.62180954 = score(doc=1928,freq=4.0), product of:
              0.6558002 = queryWeight, product of:
                8.44969 = boost
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.010231869 = queryNorm
              0.9481691 = fieldWeight in 1928, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.0625 = fieldNorm(doc=1928)
        0.36 = coord(9/25)
    
  3. Kanan, T.; Fox, E.A.: Automated Arabic text classification with P-Stemmer, machine learning, and a tailored news article taxonomy (2016) 0.23
    0.23119467 = sum of:
      0.23119467 = product of:
        0.9633112 = sum of:
          0.011148633 = weight(abstract_txt:research in 5152) [ClassicSimilarity], result of:
            0.011148633 = score(doc=5152,freq=2.0), product of:
              0.03924497 = queryWeight, product of:
                1.1934009 = boost
                3.2139761 = idf(docFreq=4669, maxDocs=42740)
                0.010231869 = queryNorm
              0.28407803 = fieldWeight in 5152, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.2139761 = idf(docFreq=4669, maxDocs=42740)
                0.0625 = fieldNorm(doc=5152)
          0.031126168 = weight(abstract_txt:techniques in 5152) [ClassicSimilarity], result of:
            0.031126168 = score(doc=5152,freq=2.0), product of:
              0.07781321 = queryWeight, product of:
                1.6804323 = boost
                4.525612 = idf(docFreq=1257, maxDocs=42740)
                0.010231869 = queryNorm
              0.40001136 = fieldWeight in 5152, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.525612 = idf(docFreq=1257, maxDocs=42740)
                0.0625 = fieldNorm(doc=5152)
          0.02422784 = weight(abstract_txt:been in 5152) [ClassicSimilarity], result of:
            0.02422784 = score(doc=5152,freq=2.0), product of:
              0.07537196 = queryWeight, product of:
                2.025559 = boost
                3.6367204 = idf(docFreq=3059, maxDocs=42740)
                0.010231869 = queryNorm
              0.3214437 = fieldWeight in 5152, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6367204 = idf(docFreq=3059, maxDocs=42740)
                0.0625 = fieldNorm(doc=5152)
          0.041995272 = weight(abstract_txt:english in 5152) [ClassicSimilarity], result of:
            0.041995272 = score(doc=5152,freq=1.0), product of:
              0.11970524 = queryWeight, product of:
                2.0842555 = boost
                5.6131573 = idf(docFreq=423, maxDocs=42740)
                0.010231869 = queryNorm
              0.35082233 = fieldWeight in 5152, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6131573 = idf(docFreq=423, maxDocs=42740)
                0.0625 = fieldNorm(doc=5152)
          0.03223663 = weight(abstract_txt:classification in 5152) [ClassicSimilarity], result of:
            0.03223663 = score(doc=5152,freq=2.0), product of:
              0.09118003 = queryWeight, product of:
                2.22787 = boost
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.010231869 = queryNorm
              0.3535492 = fieldWeight in 5152, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.0625 = fieldNorm(doc=5152)
          0.82257664 = weight(abstract_txt:arabic in 5152) [ClassicSimilarity], result of:
            0.82257664 = score(doc=5152,freq=7.0), product of:
              0.6558002 = queryWeight, product of:
                8.44969 = boost
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.010231869 = queryNorm
              1.2543098 = fieldWeight in 5152, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.0625 = fieldNorm(doc=5152)
        0.24 = coord(6/25)
    
  4. Hmeidi, I.I.; Al-Shalabi, R.F.; Al-Taani, A.T.; Najadat, H.; Al-Hazaimeh, S.A.: A novel approach to the extraction of roots from Arabic words using bigrams (2010) 0.18
    0.1781167 = sum of:
      0.1781167 = product of:
        0.89058346 = sum of:
          0.027394643 = weight(abstract_txt:corpus in 427) [ClassicSimilarity], result of:
            0.027394643 = score(doc=427,freq=1.0), product of:
              0.071462795 = queryWeight, product of:
                1.1387264 = boost
                6.1334615 = idf(docFreq=251, maxDocs=42740)
                0.010231869 = queryNorm
              0.38334134 = fieldWeight in 427, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1334615 = idf(docFreq=251, maxDocs=42740)
                0.0625 = fieldNorm(doc=427)
          0.017131671 = weight(abstract_txt:been in 427) [ClassicSimilarity], result of:
            0.017131671 = score(doc=427,freq=1.0), product of:
              0.07537196 = queryWeight, product of:
                2.025559 = boost
                3.6367204 = idf(docFreq=3059, maxDocs=42740)
                0.010231869 = queryNorm
              0.22729503 = fieldWeight in 427, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6367204 = idf(docFreq=3059, maxDocs=42740)
                0.0625 = fieldNorm(doc=427)
          0.06409174 = weight(abstract_txt:algorithms in 427) [ClassicSimilarity], result of:
            0.06409174 = score(doc=427,freq=2.0), product of:
              0.12594211 = queryWeight, product of:
                2.137863 = boost
                5.757529 = idf(docFreq=366, maxDocs=42740)
                0.010231869 = queryNorm
              0.50889844 = fieldWeight in 427, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.757529 = idf(docFreq=366, maxDocs=42740)
                0.0625 = fieldNorm(doc=427)
          0.08676121 = weight(abstract_txt:text in 427) [ClassicSimilarity], result of:
            0.08676121 = score(doc=427,freq=1.0), product of:
              0.3427553 = queryWeight, product of:
                8.271186 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.010231869 = queryNorm
              0.2531287 = fieldWeight in 427, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=427)
          0.6952042 = weight(abstract_txt:arabic in 427) [ClassicSimilarity], result of:
            0.6952042 = score(doc=427,freq=5.0), product of:
              0.6558002 = queryWeight, product of:
                8.44969 = boost
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.010231869 = queryNorm
              1.0600853 = fieldWeight in 427, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.0625 = fieldNorm(doc=427)
        0.2 = coord(5/25)
    
  5. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.17
    0.17054208 = sum of:
      0.17054208 = product of:
        0.6090788 = sum of:
          0.023156144 = weight(abstract_txt:implemented in 2832) [ClassicSimilarity], result of:
            0.023156144 = score(doc=2832,freq=1.0), product of:
              0.06388718 = queryWeight, product of:
                1.076679 = boost
                5.799259 = idf(docFreq=351, maxDocs=42740)
                0.010231869 = queryNorm
              0.3624537 = fieldWeight in 2832, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.799259 = idf(docFreq=351, maxDocs=42740)
                0.0625 = fieldNorm(doc=2832)
          0.007883275 = weight(abstract_txt:research in 2832) [ClassicSimilarity], result of:
            0.007883275 = score(doc=2832,freq=1.0), product of:
              0.03924497 = queryWeight, product of:
                1.1934009 = boost
                3.2139761 = idf(docFreq=4669, maxDocs=42740)
                0.010231869 = queryNorm
              0.20087351 = fieldWeight in 2832, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2139761 = idf(docFreq=4669, maxDocs=42740)
                0.0625 = fieldNorm(doc=2832)
          0.031126168 = weight(abstract_txt:techniques in 2832) [ClassicSimilarity], result of:
            0.031126168 = score(doc=2832,freq=2.0), product of:
              0.07781321 = queryWeight, product of:
                1.6804323 = boost
                4.525612 = idf(docFreq=1257, maxDocs=42740)
                0.010231869 = queryNorm
              0.40001136 = fieldWeight in 2832, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.525612 = idf(docFreq=1257, maxDocs=42740)
                0.0625 = fieldNorm(doc=2832)
          0.068384215 = weight(abstract_txt:classification in 2832) [ClassicSimilarity], result of:
            0.068384215 = score(doc=2832,freq=9.0), product of:
              0.09118003 = queryWeight, product of:
                2.22787 = boost
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.010231869 = queryNorm
              0.7499912 = fieldWeight in 2832, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                3.9999528 = idf(docFreq=2127, maxDocs=42740)
                0.0625 = fieldNorm(doc=2832)
          0.13728432 = weight(abstract_txt:naïve in 2832) [ClassicSimilarity], result of:
            0.13728432 = score(doc=2832,freq=1.0), product of:
              0.26366967 = queryWeight, product of:
                3.0933173 = boost
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.010231869 = queryNorm
              0.52066785 = fieldWeight in 2832, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.0625 = fieldNorm(doc=2832)
          0.14724076 = weight(abstract_txt:bayes in 2832) [ClassicSimilarity], result of:
            0.14724076 = score(doc=2832,freq=1.0), product of:
              0.27626863 = queryWeight, product of:
                3.166359 = boost
                8.527396 = idf(docFreq=22, maxDocs=42740)
                0.010231869 = queryNorm
              0.53296226 = fieldWeight in 2832, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.527396 = idf(docFreq=22, maxDocs=42740)
                0.0625 = fieldNorm(doc=2832)
          0.19400394 = weight(abstract_txt:text in 2832) [ClassicSimilarity], result of:
            0.19400394 = score(doc=2832,freq=5.0), product of:
              0.3427553 = queryWeight, product of:
                8.271186 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.010231869 = queryNorm
              0.566013 = fieldWeight in 2832, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=2832)
        0.28 = coord(7/25)
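
The score explanations above all follow the same pattern, so the arithmetic can be checked by hand. The Python sketch below is an illustration only, not the catalogue's actual code. The function names (idf, term_score, document_score) are hypothetical. The idf formula 1 + ln(maxDocs / (docFreq + 1)) is an assumption, chosen because it reproduces the idf values listed (for example 6.1334615 for docFreq=251, maxDocs=42740). The boost, queryNorm and fieldNorm values are copied from the listing as given rather than derived.

```python
import math


def idf(doc_freq, max_docs):
    # Assumed idf form; it matches the idf(docFreq=..., maxDocs=...) lines above.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in each "product of:" pair.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm           # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf * idf * fieldNorm, tf = sqrt(freq)
    return query_weight * field_weight


def document_score(term_scores, total_query_terms):
    # Final score = coord * sum of matching term scores,
    # where coord = matched terms / total query terms.
    coord = len(term_scores) / total_query_terms
    return coord * sum(term_scores)


# Term "corpus" in doc 1361 (first entry): listed as 0.05931114.
corpus_1361 = term_score(
    freq=3.0, doc_freq=251, max_docs=42740,
    boost=1.1387264, query_norm=0.010231869, field_norm=0.078125,
)
print(round(corpus_1361, 4))  # approximately 0.0593

# Doc 2832 (fifth entry): seven matching terms out of 25, listed as 0.17054208.
weights_2832 = [
    0.023156144, 0.007883275, 0.031126168, 0.068384215,
    0.13728432, 0.14724076, 0.19400394,
]
print(round(document_score(weights_2832, 25), 4))  # approximately 0.1705
```

Small differences in the last digits are expected, since the listing appears to use single-precision arithmetic while Python uses double precision.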