Document (#43447)

Author
Ikae, C.
Savoy, J.
Title
Gender identification on Twitter
Source
Journal of the Association for Information Science and Technology. 73(2022) no.1, pp. 58-69
Year
2022
Abstract
To determine the gender of a text's author, various feature types have been suggested (e.g., function words, n-grams of letters) leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k-nearest neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when considering similar corpora under the same conditions. Thus, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection to reduce the feature size to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that reducing the feature set size to around 300 without penalizing the effectiveness is possible. Finally, based on such reduced feature sizes, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
Content
Cf.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24541.
Theme
Informetrics
Field
Communication studies
Object
Twitter

Similar documents (author)

  1. Savoy, J.: Stemming of French words based on grammatical categories (1993) 5.21
    5.2059946 = sum of:
      5.2059946 = weight(author_txt:savoy in 4650) [ClassicSimilarity], result of:
        5.2059946 = fieldWeight in 4650, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.329592 = idf(docFreq=28, maxDocs=44218)
          0.625 = fieldNorm(doc=4650)
    
  2. Savoy, J.: Effectiveness of information retrieval systems used in a hypertext environment (1993) 5.21
    5.2059946 = sum of:
      5.2059946 = weight(author_txt:savoy in 6511) [ClassicSimilarity], result of:
        5.2059946 = fieldWeight in 6511, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.329592 = idf(docFreq=28, maxDocs=44218)
          0.625 = fieldNorm(doc=6511)
    
  3. Savoy, J.: A learning scheme for information retrieval in hypertext (1994) 5.21
    5.2059946 = sum of:
      5.2059946 = weight(author_txt:savoy in 7292) [ClassicSimilarity], result of:
        5.2059946 = fieldWeight in 7292, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.329592 = idf(docFreq=28, maxDocs=44218)
          0.625 = fieldNorm(doc=7292)
    
  4. Savoy, J.: Bayesian inference networks and spreading activation in hypertext systems (1992) 5.21
    5.2059946 = sum of:
      5.2059946 = weight(author_txt:savoy in 192) [ClassicSimilarity], result of:
        5.2059946 = fieldWeight in 192, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.329592 = idf(docFreq=28, maxDocs=44218)
          0.625 = fieldNorm(doc=192)
    
  5. Savoy, J.: Searching information in legal hypertext systems (1993/94) 5.21
    5.2059946 = sum of:
      5.2059946 = weight(author_txt:savoy in 757) [ClassicSimilarity], result of:
        5.2059946 = fieldWeight in 757, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.329592 = idf(docFreq=28, maxDocs=44218)
          0.625 = fieldNorm(doc=757)
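
The author-match scores above all reduce to fieldWeight = tf × idf × fieldNorm. The sketch below recomputes that product from the numbers in the listing. It assumes the older ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the displayed values; newer Lucene releases define idf slightly differently. The function names are illustrative and not part of any Lucene API, and fieldNorm is taken as given rather than derived.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency in the form that matches the listing (assumed)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explain output."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Values copied from the author_txt:savoy entries in the listing.
author_score = field_weight(freq=1.0, doc_freq=28, max_docs=44218, field_norm=0.625)
print(f"idf         = {idf(28, 44218):.6f}")
print(f"fieldWeight = {author_score:.6f}")
```

Because the author query has a single term with no boost or coord factor shown, fieldWeight is the whole score, which is why all five entries share the same value.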
    

Similar documents (content)

  1. Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.23
    0.22908705 = sum of:
      0.22908705 = product of:
        0.9545294 = sum of:
          0.11656691 = weight(abstract_txt:naïve in 2804) [ClassicSimilarity], result of:
            0.11656691 = score(doc=2804,freq=2.0), product of:
              0.15832758 = queryWeight, product of:
                1.0230979 = boost
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.018578714 = queryNorm
              0.73623884 = fieldWeight in 2804, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.12469368 = weight(abstract_txt:bayes in 2804) [ClassicSimilarity], result of:
            0.12469368 = score(doc=2804,freq=2.0), product of:
              0.16560343 = queryWeight, product of:
                1.0463418 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.018578714 = queryNorm
              0.75296557 = fieldWeight in 2804, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.041952346 = weight(abstract_txt:machine in 2804) [ClassicSimilarity], result of:
            0.041952346 = score(doc=2804,freq=1.0), product of:
              0.1271639 = queryWeight, product of:
                1.296689 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.018578714 = queryNorm
              0.32990766 = fieldWeight in 2804, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.099208064 = weight(abstract_txt:selection in 2804) [ClassicSimilarity], result of:
            0.099208064 = score(doc=2804,freq=5.0), product of:
              0.13199809 = queryWeight, product of:
                1.3211062 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.018578714 = queryNorm
              0.7515871 = fieldWeight in 2804, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.10692033 = weight(abstract_txt:effectiveness in 2804) [ClassicSimilarity], result of:
            0.10692033 = score(doc=2804,freq=2.0), product of:
              0.23726475 = queryWeight, product of:
                2.5048718 = boost
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.018578714 = queryNorm
              0.45063722 = fieldWeight in 2804, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
          0.4651881 = weight(abstract_txt:feature in 2804) [ClassicSimilarity], result of:
            0.4651881 = score(doc=2804,freq=7.0), product of:
              0.4767454 = queryWeight, product of:
                4.3486834 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.018578714 = queryNorm
              0.9757579 = fieldWeight in 2804, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=2804)
        0.24 = coord(6/25)
    
  2. Wang, P.; Li, X.: Assessing the quality of information on Wikipedia : a deep-learning approach (2020) 0.13
    0.1288495 = sum of:
      0.1288495 = product of:
        0.64424753 = sum of:
          0.04100644 = weight(abstract_txt:determine in 5505) [ClassicSimilarity], result of:
            0.04100644 = score(doc=5505,freq=1.0), product of:
              0.12524518 = queryWeight, product of:
                1.2868693 = boost
                5.2385488 = idf(docFreq=637, maxDocs=44218)
                0.018578714 = queryNorm
              0.3274093 = fieldWeight in 5505, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2385488 = idf(docFreq=637, maxDocs=44218)
                0.0625 = fieldNorm(doc=5505)
          0.041952346 = weight(abstract_txt:machine in 5505) [ClassicSimilarity], result of:
            0.041952346 = score(doc=5505,freq=1.0), product of:
              0.1271639 = queryWeight, product of:
                1.296689 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.018578714 = queryNorm
              0.32990766 = fieldWeight in 5505, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0625 = fieldNorm(doc=5505)
          0.13403547 = weight(abstract_txt:neural in 5505) [ClassicSimilarity], result of:
            0.13403547 = score(doc=5505,freq=2.0), product of:
              0.21894222 = queryWeight, product of:
                1.701448 = boost
                6.926203 = idf(docFreq=117, maxDocs=44218)
                0.018578714 = queryNorm
              0.6121956 = fieldWeight in 5505, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.926203 = idf(docFreq=117, maxDocs=44218)
                0.0625 = fieldNorm(doc=5505)
          0.07560409 = weight(abstract_txt:effectiveness in 5505) [ClassicSimilarity], result of:
            0.07560409 = score(doc=5505,freq=1.0), product of:
              0.23726475 = queryWeight, product of:
                2.5048718 = boost
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.018578714 = queryNorm
              0.31864864 = fieldWeight in 5505, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.0625 = fieldNorm(doc=5505)
          0.35164917 = weight(abstract_txt:feature in 5505) [ClassicSimilarity], result of:
            0.35164917 = score(doc=5505,freq=4.0), product of:
              0.4767454 = queryWeight, product of:
                4.3486834 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.018578714 = queryNorm
              0.73760366 = fieldWeight in 5505, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.0625 = fieldNorm(doc=5505)
        0.2 = coord(5/25)
    
  3. Ma, Z.; Sun, A.; Cong, G.: On predicting the popularity of newly emerging hashtags in Twitter (2013) 0.12
    0.12018781 = sum of:
      0.12018781 = product of:
        0.50078255 = sum of:
          0.07696774 = weight(abstract_txt:nearest in 967) [ClassicSimilarity], result of:
            0.07696774 = score(doc=967,freq=1.0), product of:
              0.15125933 = queryWeight, product of:
                8.14154 = idf(docFreq=34, maxDocs=44218)
                0.018578714 = queryNorm
              0.5088462 = fieldWeight in 967, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.14154 = idf(docFreq=34, maxDocs=44218)
                0.0625 = fieldNorm(doc=967)
          0.11248274 = weight(abstract_txt:logistic in 967) [ClassicSimilarity], result of:
            0.11248274 = score(doc=967,freq=2.0), product of:
              0.1546074 = queryWeight, product of:
                1.0110067 = boost
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.018578714 = queryNorm
              0.7275379 = fieldWeight in 967, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.0625 = fieldNorm(doc=967)
          0.08242526 = weight(abstract_txt:naïve in 967) [ClassicSimilarity], result of:
            0.08242526 = score(doc=967,freq=1.0), product of:
              0.15832758 = queryWeight, product of:
                1.0230979 = boost
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.018578714 = queryNorm
              0.5205995 = fieldWeight in 967, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.0625 = fieldNorm(doc=967)
          0.08817175 = weight(abstract_txt:bayes in 967) [ClassicSimilarity], result of:
            0.08817175 = score(doc=967,freq=1.0), product of:
              0.16560343 = queryWeight, product of:
                1.0463418 = boost
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.018578714 = queryNorm
              0.5324271 = fieldWeight in 967, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.518833 = idf(docFreq=23, maxDocs=44218)
                0.0625 = fieldNorm(doc=967)
          0.013862316 = weight(abstract_txt:based in 967) [ClassicSimilarity], result of:
            0.013862316 = score(doc=967,freq=1.0), product of:
              0.06957405 = queryWeight, product of:
                1.1746898 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.018578714 = queryNorm
              0.19924548 = fieldWeight in 967, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=967)
          0.12687275 = weight(abstract_txt:neighbors in 967) [ClassicSimilarity], result of:
            0.12687275 = score(doc=967,freq=1.0), product of:
              0.21107101 = queryWeight, product of:
                1.181281 = boost
                9.617446 = idf(docFreq=7, maxDocs=44218)
                0.018578714 = queryNorm
              0.6010904 = fieldWeight in 967, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.617446 = idf(docFreq=7, maxDocs=44218)
                0.0625 = fieldNorm(doc=967)
        0.24 = coord(6/25)
    
  4. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thorough evaluation of various methods (2000) 0.12
    0.11549755 = sum of:
      0.11549755 = product of:
        0.57748777 = sum of:
          0.017327894 = weight(abstract_txt:based in 5480) [ClassicSimilarity], result of:
            0.017327894 = score(doc=5480,freq=1.0), product of:
              0.06957405 = queryWeight, product of:
                1.1746898 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.018578714 = queryNorm
              0.24905685 = fieldWeight in 5480, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.05244043 = weight(abstract_txt:machine in 5480) [ClassicSimilarity], result of:
            0.05244043 = score(doc=5480,freq=1.0), product of:
              0.1271639 = queryWeight, product of:
                1.296689 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.018578714 = queryNorm
              0.41238457 = fieldWeight in 5480, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.078430854 = weight(abstract_txt:selection in 5480) [ClassicSimilarity], result of:
            0.078430854 = score(doc=5480,freq=2.0), product of:
              0.13199809 = queryWeight, product of:
                1.3211062 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.018578714 = queryNorm
              0.5941817 = fieldWeight in 5480, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.118471734 = weight(abstract_txt:neural in 5480) [ClassicSimilarity], result of:
            0.118471734 = score(doc=5480,freq=1.0), product of:
              0.21894222 = queryWeight, product of:
                1.701448 = boost
                6.926203 = idf(docFreq=117, maxDocs=44218)
                0.018578714 = queryNorm
              0.54110956 = fieldWeight in 5480, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.926203 = idf(docFreq=117, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
          0.31081685 = weight(abstract_txt:feature in 5480) [ClassicSimilarity], result of:
            0.31081685 = score(doc=5480,freq=2.0), product of:
              0.4767454 = queryWeight, product of:
                4.3486834 = boost
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.018578714 = queryNorm
              0.65195566 = fieldWeight in 5480, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.9008293 = idf(docFreq=328, maxDocs=44218)
                0.078125 = fieldNorm(doc=5480)
        0.2 = coord(5/25)
    
  5. Cui, C.; Ma, J.; Lian, T.; Chen, Z.; Wang, S.: Improving image annotation via ranking-oriented neighbor search and learning-based keyword propagation (2015) 0.11
    0.11099934 = sum of:
      0.11099934 = product of:
        0.55499667 = sum of:
          0.15393548 = weight(abstract_txt:nearest in 1609) [ClassicSimilarity], result of:
            0.15393548 = score(doc=1609,freq=4.0), product of:
              0.15125933 = queryWeight, product of:
                8.14154 = idf(docFreq=34, maxDocs=44218)
                0.018578714 = queryNorm
              1.0176924 = fieldWeight in 1609, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.14154 = idf(docFreq=34, maxDocs=44218)
                0.0625 = fieldNorm(doc=1609)
          0.03099708 = weight(abstract_txt:based in 1609) [ClassicSimilarity], result of:
            0.03099708 = score(doc=1609,freq=5.0), product of:
              0.06957405 = queryWeight, product of:
                1.1746898 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.018578714 = queryNorm
              0.44552645 = fieldWeight in 1609, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=1609)
          0.2537455 = weight(abstract_txt:neighbors in 1609) [ClassicSimilarity], result of:
            0.2537455 = score(doc=1609,freq=4.0), product of:
              0.21107101 = queryWeight, product of:
                1.181281 = boost
                9.617446 = idf(docFreq=7, maxDocs=44218)
                0.018578714 = queryNorm
              1.2021807 = fieldWeight in 1609, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.617446 = idf(docFreq=7, maxDocs=44218)
                0.0625 = fieldNorm(doc=1609)
          0.04071451 = weight(abstract_txt:without in 1609) [ClassicSimilarity], result of:
            0.04071451 = score(doc=1609,freq=1.0), product of:
              0.12465006 = queryWeight, product of:
                1.2838082 = boost
                5.2260876 = idf(docFreq=645, maxDocs=44218)
                0.018578714 = queryNorm
              0.32663047 = fieldWeight in 1609, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2260876 = idf(docFreq=645, maxDocs=44218)
                0.0625 = fieldNorm(doc=1609)
          0.07560409 = weight(abstract_txt:effectiveness in 1609) [ClassicSimilarity], result of:
            0.07560409 = score(doc=1609,freq=1.0), product of:
              0.23726475 = queryWeight, product of:
                2.5048718 = boost
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.018578714 = queryNorm
              0.31864864 = fieldWeight in 1609, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.098378 = idf(docFreq=733, maxDocs=44218)
                0.0625 = fieldNorm(doc=1609)
        0.2 = coord(5/25)
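
For the content-based matches, each term contributes queryWeight × fieldWeight, and the sum is scaled by coord, the fraction of query terms found in the document. The sketch below recomputes the score of the first content match (doc 2804) from the boost, idf, freq, queryNorm and fieldNorm values printed in the listing. The helper names are hypothetical, and the idf values are copied from the listing rather than recomputed.

```python
import math


def term_score(boost: float, idf: float, query_norm: float,
               freq: float, field_norm: float) -> float:
    """score = queryWeight * fieldWeight for one matching term."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


def document_score(term_scores: list[float], total_query_terms: int) -> float:
    """Sum of term scores scaled by coord = matched terms / query terms."""
    coord = len(term_scores) / total_query_terms
    return sum(term_scores) * coord


QUERY_NORM = 0.018578714   # queryNorm from the listing
FIELD_NORM = 0.0625        # fieldNorm(doc=2804) from the listing

# term -> (boost, idf, freq), copied from the doc 2804 explanation.
terms = {
    "naïve":         (1.0230979, 8.329592, 2.0),
    "bayes":         (1.0463418, 8.518833, 2.0),
    "machine":       (1.296689, 5.2785225, 1.0),
    "selection":     (1.3211062, 5.377919, 5.0),
    "effectiveness": (2.5048718, 5.098378, 2.0),
    "feature":       (4.3486834, 5.9008293, 7.0),
}

scores = {
    term: term_score(boost, idf, QUERY_NORM, freq, FIELD_NORM)
    for term, (boost, idf, freq) in terms.items()
}
total = document_score(list(scores.values()), total_query_terms=25)

for term, value in scores.items():
    print(f"{term:14s} {value:.6f}")
print(f"document score {total:.6f}")
```

Note that idf enters each term score twice, once through queryWeight and once through fieldWeight, so rare terms such as "neighbors" weigh heavily even at freq=1.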