Search (4 results, page 1 of 1)

  • Filter: author_ss:"Savoy, J."
  1. Savoy, J.: Estimating the probability of an authorship attribution (2016) 0.03
    0.032671984 = sum of:
      0.0149865085 = product of:
        0.059946034 = sum of:
          0.059946034 = weight(_text_:authors in 2937) [ClassicSimilarity], result of:
            0.059946034 = score(doc=2937,freq=2.0), product of:
              0.23803101 = queryWeight, product of:
                4.558814 = idf(docFreq=1258, maxDocs=44218)
                0.052213363 = queryNorm
              0.25184128 = fieldWeight in 2937, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.558814 = idf(docFreq=1258, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2937)
        0.25 = coord(1/4)
      0.017685475 = product of:
        0.03537095 = sum of:
          0.03537095 = weight(_text_:22 in 2937) [ClassicSimilarity], result of:
            0.03537095 = score(doc=2937,freq=2.0), product of:
              0.1828423 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.052213363 = queryNorm
              0.19345059 = fieldWeight in 2937, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2937)
        0.5 = coord(1/2)
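    The breakdown above is Lucene's ClassicSimilarity (TF-IDF) explain output. As a rough cross-check, the sketch below recomputes the listed document score from the per-term factors; the helper function and the way the coord factors are applied are our own reading of the explain tree, not part of the catalogue itself.

```python
import math

def classic_weight(freq, doc_freq, max_docs, field_norm, query_norm):
    """One weight(...) entry of Lucene's ClassicSimilarity:
       idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq),
       weight = (idf * queryNorm) * (tf * idf * fieldNorm)."""
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))
    tf = math.sqrt(freq)
    return (idf * query_norm) * (tf * idf * field_norm)

# term "authors" and term "22" in doc 2937, values taken from the explain tree above
w_authors = classic_weight(2.0, 1258, 44218, 0.0390625, 0.052213363)
w_22 = classic_weight(2.0, 3622, 44218, 0.0390625, 0.052213363)

# apply the coord factors (1 of 4 and 1 of 2 matching clauses) and sum
score = w_authors * 0.25 + w_22 * 0.5
print(round(score, 6))  # ~0.032672, the document score listed above
```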
    
    Abstract
    In authorship attribution, various distance-based metrics have been proposed to determine the most probable author of a disputed text. In this paradigm, a distance is computed between each author profile and the query text. These values are then employed only to rank the possible authors. In this article, we analyze their distribution and show that we can model it as a mixture of 2 Beta distributions. Based on this finding, we demonstrate how we can derive a more accurate probability that the closest author is, in fact, the real author. To evaluate this approach, we have chosen 4 authorship attribution methods (Burrows' Delta, Kullback-Leibler divergence, Labbé's intertextual distance, and a naïve Bayes classifier). As the first test collection, we have downloaded 224 State of the Union addresses (from 1790 to 2014) delivered by 41 U.S. presidents. The second test collection is formed by the Federalist Papers. The evaluations indicate that the accuracy rate of some authorship decisions can be improved. The suggested method can also signal when a proposed assignment should be interpreted as merely possible, without strong certainty. Being able to quantify the certainty associated with an authorship decision can be a useful component when important decisions must be made.
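    To illustrate the idea behind the mixture model described in the abstract, the sketch below applies Bayes' rule to two Beta components to turn a normalized distance into a probability that the closest profile belongs to the real author. The mixture parameters here are invented for illustration; in the paper they are estimated from the observed distance distributions.

```python
from scipy.stats import beta

# Hypothetical, hand-picked parameters (not the fitted values from the paper):
# component 1 models distances when the closest profile IS the true author,
# component 2 when it is NOT; w1 and w2 are the mixture weights.
a1, b1, w1 = 2.0, 8.0, 0.5
a2, b2, w2 = 6.0, 3.0, 0.5

def prob_true_author(d):
    """Posterior probability that a normalized distance d in [0, 1]
    was generated by the 'true author' component (Bayes' rule)."""
    p1 = w1 * beta.pdf(d, a1, b1)
    p2 = w2 * beta.pdf(d, a2, b2)
    return p1 / (p1 + p2)

print(prob_true_author(0.15))  # small distance -> attribution looks safe
print(prob_true_author(0.70))  # large distance -> treat as merely possible
```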
    Date
    7. 5.2016 21:22:27
  2. Savoy, J.; Ndarugendamwo, M.; Vrajitoru, D.: Report on the TREC-4 experiment : combining probabilistic and vector-space schemes (1996) 0.02
    0.022054153 = product of:
      0.044108305 = sum of:
        0.044108305 = product of:
          0.08821661 = sum of:
            0.08821661 = weight(_text_:k in 7574) [ClassicSimilarity], result of:
              0.08821661 = score(doc=7574,freq=2.0), product of:
                0.18639012 = queryWeight, product of:
                  3.569778 = idf(docFreq=3384, maxDocs=44218)
                  0.052213363 = queryNorm
                0.47329018 = fieldWeight in 7574, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.569778 = idf(docFreq=3384, maxDocs=44218)
                  0.09375 = fieldNorm(doc=7574)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Source
    The Fourth Text Retrieval Conference (TREC-4). Ed.: K. Harman
  3. Savoy, J.: Text representation strategies : an example with the State of the union addresses (2016) 0.01
    0.00918923 = product of:
      0.01837846 = sum of:
        0.01837846 = product of:
          0.03675692 = sum of:
            0.03675692 = weight(_text_:k in 3042) [ClassicSimilarity], result of:
              0.03675692 = score(doc=3042,freq=2.0), product of:
                0.18639012 = queryWeight, product of:
                  3.569778 = idf(docFreq=3384, maxDocs=44218)
                  0.052213363 = queryNorm
                0.19720423 = fieldWeight in 3042, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.569778 = idf(docFreq=3384, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3042)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Based on State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 presidents), this paper describes and evaluates different text representation strategies. To determine the most important words of a given text, the term frequencies (tf) or the tf-idf weighting scheme can be applied. Recently, latent Dirichlet allocation (LDA) has been proposed to define the topics included in a corpus. As another strategy, this study proposes to apply a vocabulary specificity measure (Z-score) to determine the most significantly overused word-types or short sequences of them. Our experiments show that the simple term frequency measure is not able to discriminate between specific terms associated with a document or a set of texts. Using the tf-idf or LDA approach, the selection requires some arbitrary decisions. Based on the term specificity measure (Z-score), the term selection has a clear theoretical basis. Moreover, the most significant sentences for each presidency can be determined. As another facet, we can visualize the dynamic evolution of the usage of some terms together with their specificity measures. Finally, this technique can be employed to identify the most important lexical leaders, those introducing terms overused by the k following presidencies.
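    As a rough illustration of the Z-score selection described in the abstract, the sketch below standardizes the observed frequency of each word-type in a sub-corpus against its expected frequency under a binomial model using the whole-corpus rate. This is one common formulation and the toy data is invented; the exact definition used in the paper may differ.

```python
import math
from collections import Counter

def z_scores(sub_tokens, corpus_tokens):
    """Z-score of each word-type in a sub-corpus: observed count minus the
    count expected from the whole-corpus relative frequency, divided by the
    binomial standard deviation. Large positive values mark overused terms."""
    corpus_counts = Counter(corpus_tokens)
    sub_counts = Counter(sub_tokens)
    n, big_n = len(sub_tokens), len(corpus_tokens)
    scores = {}
    for term, observed in sub_counts.items():
        p = corpus_counts[term] / big_n          # whole-corpus relative frequency
        expected = n * p
        sd = math.sqrt(n * p * (1 - p))
        scores[term] = (observed - expected) / sd if sd > 0 else 0.0
    return scores

# toy usage: terms one presidency overuses relative to all addresses
sub = "union must endure the union must stand".split()
corpus = sub + "the state of the union is strong and the people prosper".split()
top = sorted(z_scores(sub, corpus).items(), key=lambda kv: -kv[1])[:3]
print(top)
```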
  4. Ikae, C.; Savoy, J.: Gender identification on Twitter (2022) 0.01
    0.00918923 = product of:
      0.01837846 = sum of:
        0.01837846 = product of:
          0.03675692 = sum of:
            0.03675692 = weight(_text_:k in 445) [ClassicSimilarity], result of:
              0.03675692 = score(doc=445,freq=2.0), product of:
                0.18639012 = queryWeight, product of:
                  3.569778 = idf(docFreq=3384, maxDocs=44218)
                  0.052213363 = queryNorm
                0.19720423 = fieldWeight in 445, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.569778 = idf(docFreq=3384, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=445)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    To determine the gender of a text's author, various feature types have been suggested (e.g., function words, n-grams of letters, etc.), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k nearest neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to determine whether or not the same model always achieves the best effectiveness when similar corpora are considered under the same conditions. Thus, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection that reduces the feature set to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (an increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that the feature set can be reduced to around 300 terms without penalizing the effectiveness. Finally, based on such reduced feature sets, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
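    For a concrete picture of this kind of pipeline, the sketch below vectorizes a few invented tweets, trims the feature space with a single chi-squared selection step (a stand-in for the paper's own two-stage procedure), and compares two of the classifiers mentioned. Data, feature sizes, and classifier settings are illustrative only, not those used in the study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# toy tweets and toy gender labels (invented)
texts = ["love this new bag omg", "match tonight lads",
         "so excited for the show", "great goal last night"]
labels = ["f", "m", "f", "m"]

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("random forest", RandomForestClassifier(n_estimators=100))]:
    pipe = Pipeline([
        ("vec", CountVectorizer(ngram_range=(1, 2))),  # words and word bigrams
        ("sel", SelectKBest(chi2, k=5)),               # ~300 in a real corpus
        ("clf", clf),
    ])
    pipe.fit(texts, labels)
    print(name, pipe.score(texts, labels))  # training accuracy on the toy data
```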