Search (6 results, page 1 of 1)

  • year_i:[2010 TO 2020}
  • author_ss:"Savoy, J."
  1. Savoy, J.: Estimating the probability of an authorship attribution (2016) 0.01
    0.014725267 = product of:
      0.0441758 = sum of:
        0.0441758 = sum of:
          0.014496832 = weight(_text_:of in 2937) [ClassicSimilarity], result of:
            0.014496832 = score(doc=2937,freq=12.0), product of:
              0.06850986 = queryWeight, product of:
                1.5637573 = idf(docFreq=25162, maxDocs=44218)
                0.043811057 = queryNorm
              0.21160212 = fieldWeight in 2937, product of:
                3.4641016 = tf(freq=12.0), with freq of:
                  12.0 = termFreq=12.0
                1.5637573 = idf(docFreq=25162, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2937)
          0.029678967 = weight(_text_:22 in 2937) [ClassicSimilarity], result of:
            0.029678967 = score(doc=2937,freq=2.0), product of:
              0.15341885 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.043811057 = queryNorm
              0.19345059 = fieldWeight in 2937, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2937)
      0.33333334 = coord(1/3)
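
    The indented breakdown above is Solr/Lucene "explain" output for the ClassicSimilarity (TF-IDF) ranking model. As a minimal sketch, assuming the standard ClassicSimilarity formulas tf = sqrt(freq), idf = 1 + ln(maxDocs/(docFreq+1)), queryWeight = idf x queryNorm and fieldWeight = tf x idf x fieldNorm (these reproduce the printed values), the per-term scores and the coord(1/3) factor of result 1 can be recomputed as follows; the helper names are illustrative only.

      import math

      def idf(doc_freq, max_docs):
          # ClassicSimilarity inverse document frequency
          return 1.0 + math.log(max_docs / (doc_freq + 1))

      def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
          i = idf(doc_freq, max_docs)
          query_weight = i * query_norm                     # e.g. 0.06850986 for "of"
          field_weight = math.sqrt(freq) * i * field_norm   # tf(freq) = sqrt(freq)
          return query_weight * field_weight

      # Term "of" (freq=12) and term "22" (freq=2) in document 2937:
      s_of = term_score(12, 25162, 44218, 0.043811057, 0.0390625)  # ~0.0144968
      s_22 = term_score(2, 3622, 44218, 0.043811057, 0.0390625)    # ~0.0296790
      # Only one of three query clauses matched, hence coord(1/3):
      total = (s_of + s_22) * (1.0 / 3.0)                          # ~0.0147253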
    
    Abstract
    In authorship attribution, various distance-based metrics have been proposed to determine the most probable author of a disputed text. In this paradigm, a distance is computed between each author profile and the query text. These values are then employed only to rank the possible authors. In this article, we analyze their distribution and show that we can model it as a mixture of 2 Beta distributions. Based on this finding, we demonstrate how we can derive a more accurate probability that the closest author is, in fact, the real author. To evaluate this approach, we have chosen 4 authorship attribution methods (Burrows' Delta, Kullback-Leibler divergence, Labbé's intertextual distance, and the naïve Bayes). As the first test collection, we have downloaded 224 State of the Union addresses (from 1790 to 2014) delivered by 41 U.S. presidents. The second test collection is formed by the Federalist Papers. The evaluations indicate that the accuracy rate of some authorship decisions can be improved. The suggested method can signal that the proposed assignment should be interpreted as possible, without strong certainty. Being able to quantify the certainty associated with an authorship decision can be a useful component when important decisions must be taken.
    Date
    7.5.2016 21:22:27
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.6, pp. 1462-1472
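
    The core idea of this article is to turn ranked author-text distances into a calibrated probability by modelling their distribution as a mixture of two Beta components (roughly, "true author" versus "other authors"). The sketch below only illustrates that idea with a crude EM loop using weighted method-of-moments updates and made-up starting parameters; the paper's actual estimation procedure is not specified in the abstract.

      import numpy as np
      from scipy.stats import beta

      def fit_two_beta_mixture(x, n_iter=200):
          # x: distances rescaled into (0, 1); two Beta components fitted by a
          # simple EM loop with weighted method-of-moments parameter updates.
          w = np.array([0.5, 0.5])
          a = np.array([2.0, 2.0])
          b = np.array([5.0, 2.0])
          for _ in range(n_iter):
              dens = np.vstack([w[k] * beta.pdf(x, a[k], b[k]) for k in range(2)])
              resp = dens / dens.sum(axis=0, keepdims=True)      # E-step
              for k in range(2):                                 # M-step
                  r = resp[k]
                  m = np.average(x, weights=r)
                  v = max(np.average((x - m) ** 2, weights=r), 1e-9)
                  common = max(m * (1 - m) / v - 1, 1e-3)
                  a[k], b[k] = m * common, (1 - m) * common
                  w[k] = r.mean()
          return w, a, b

      def prob_true_author(d, w, a, b, true_component=0):
          # Posterior probability that distance d was generated by the
          # "true author" component of the fitted mixture.
          dens = np.array([w[k] * beta.pdf(d, a[k], b[k]) for k in range(2)])
          return float(dens[true_component] / dens.sum())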
  2. Savoy, J.: Text clustering : an application with the 'State of the Union' addresses (2015) 0.00
    0.0029591531 = product of:
      0.008877459 = sum of:
        0.008877459 = product of:
          0.017754918 = sum of:
            0.017754918 = weight(_text_:of in 2128) [ClassicSimilarity], result of:
              0.017754918 = score(doc=2128,freq=18.0), product of:
                0.06850986 = queryWeight, product of:
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.043811057 = queryNorm
                0.25915858 = fieldWeight in 2128, product of:
                  4.2426405 = tf(freq=18.0), with freq of:
                    18.0 = termFreq=18.0
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2128)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    This paper describes a clustering and authorship attribution study over the State of the Union addresses from 1790 to 2014 (224 speeches delivered by 41 presidents). To define the style of each presidency, we have applied a principal component analysis (PCA) based on part-of-speech (POS) frequencies. From Roosevelt (1934) onward, each president tends to have a distinctive style, whereas earlier presidents usually share some stylistic aspects with others. Applying an automatic classification based on the frequencies of all content-bearing word-types, we show that chronology tends to play a central role in forming clusters, a factor more important than political affiliation. Using the 300 most frequent word-types, we generate another clustering representation based on the style of each president. This second view shares similarities with the first one, but usually produces more numerous and smaller clusters. Finally, an authorship attribution approach for each speech can reach a success rate of around 95.7% under some constraints. When an incorrect assignment is detected, the proposed author often belongs to the same party and lived during roughly the same period as the presumed author. A deeper analysis of some incorrect assignments reveals interesting reasons justifying these difficult attributions.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.8, pp. 1645-1654
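
    As a hedged illustration of the kind of pipeline described in the abstract above: PCA over part-of-speech frequencies to obtain a "style map" of the presidencies, and agglomerative clustering over word-type frequencies. The exact preprocessing, linkage and cluster counts are not given in the abstract, so every concrete choice below (Ward linkage, 8 clusters, random data) is an assumption.

      import numpy as np
      from sklearn.decomposition import PCA
      from scipy.cluster.hierarchy import linkage, fcluster

      rng = np.random.default_rng(0)

      # pos_freq: presidencies x POS-tag relative frequencies (made-up data)
      pos_freq = rng.random((41, 12))
      pos_freq /= pos_freq.sum(axis=1, keepdims=True)
      style_map = PCA(n_components=2).fit_transform(pos_freq)   # 2-D style coordinates

      # word_freq: presidencies x relative frequencies of the 300 most frequent word-types
      word_freq = rng.random((41, 300))
      word_freq /= word_freq.sum(axis=1, keepdims=True)

      # Agglomerative clustering (Ward linkage chosen here for illustration),
      # cut into a fixed number of clusters.
      Z = linkage(word_freq, method="ward")
      labels = fcluster(Z, t=8, criterion="maxclust")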
  3. Dolamic, L.; Savoy, J.: Retrieval effectiveness of machine translated queries (2010) 0.00
    0.0028993662 = product of:
      0.008698098 = sum of:
        0.008698098 = product of:
          0.017396197 = sum of:
            0.017396197 = weight(_text_:of in 4102) [ClassicSimilarity], result of:
              0.017396197 = score(doc=4102,freq=12.0), product of:
                0.06850986 = queryWeight, product of:
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.043811057 = queryNorm
                0.25392252 = fieldWeight in 4102, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4102)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    This article describes and evaluates various information retrieval models used to search document collections written in English by submitting queries written in various other languages, either members of the Indo-European family (English, French, German, and Spanish) or from radically different language groups such as Chinese. The evaluation method involves searching a rather large number of topics (around 300) and using two commercial machine translation systems to translate across the language barriers. In this study, mean average precision is used to measure variances in retrieval effectiveness when the query language differs from the document language. Although performance differences are rather large for certain language pairs, this does not mean that bilingual search methods are not commercially viable. Causes of the difficulties incurred when searching or during translation are analyzed and the results of concrete examples are explained.
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.11, pp. 2266-2273
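
    Mean average precision (MAP), the effectiveness measure used in the study above, averages per-topic average precision over all topics. A small self-contained sketch with made-up rankings and relevance judgments:

      def average_precision(ranking, relevant):
          # ranking: document ids in retrieval order; relevant: set of relevant ids
          hits, prec_sum = 0, 0.0
          for rank, doc_id in enumerate(ranking, start=1):
              if doc_id in relevant:
                  hits += 1
                  prec_sum += hits / rank
          return prec_sum / len(relevant) if relevant else 0.0

      def mean_average_precision(runs):
          # runs: one (ranking, relevant-set) pair per topic
          return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

      runs = [([1, 7, 3, 9], {1, 3}), ([5, 2, 8], {2})]
      print(mean_average_precision(runs))   # (0.8333 + 0.5) / 2 = ~0.6667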
  4. Savoy, J.: Text representation strategies : an example with the State of the Union addresses (2016) 0.00
    0.0027899165 = product of:
      0.008369749 = sum of:
        0.008369749 = product of:
          0.016739499 = sum of:
            0.016739499 = weight(_text_:of in 3042) [ClassicSimilarity], result of:
              0.016739499 = score(doc=3042,freq=16.0), product of:
                0.06850986 = queryWeight, product of:
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.043811057 = queryNorm
                0.24433708 = fieldWeight in 3042, product of:
                  4.0 = tf(freq=16.0), with freq of:
                    16.0 = termFreq=16.0
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3042)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Based on State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 presidents), this paper describes and evaluates different text representation strategies. To determine the most important words of a given text, the term frequencies (tf) or the tf-idf weighting scheme can be applied. Recently, latent Dirichlet allocation (LDA) has been proposed to define the topics included in a corpus. As another strategy, this study proposes to apply a vocabulary specificity measure (Z-score) to determine the most significantly overused word-types or short sequences of them. Our experiments show that the simple term frequency measure is not able to discriminate between specific terms associated with a document or a set of texts. Using the tf-idf or LDA approach, the selection requires some arbitrary decisions. Based on the term-specific measure (Z-score), the term selection has a clear theoretical basis. Moreover, the most significant sentences for each presidency can be determined. As another facet, we can visualize the dynamic evolution of usage of some terms associated with their specificity measures. Finally, this technique can be employed to define the most important lexical leaders introducing terms overused by the k following presidencies.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.8, pp. 1858-1870
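
    One common way to formalise such a vocabulary specificity Z-score is to standardise the observed term count in a text against its expectation under a binomial model driven by the corpus-wide rate. The paper's exact definition may differ in detail, so treat the following as an illustrative sketch with made-up numbers.

      import math

      def z_score(count_in_text, text_length, count_in_corpus, corpus_length):
          # Observed count versus binomial expectation given the corpus-wide rate
          p = count_in_corpus / corpus_length
          expected = text_length * p
          std = math.sqrt(text_length * p * (1.0 - p))
          return (count_in_text - expected) / std if std > 0 else 0.0

      # A word-type used 40 times in a 5,000-token speech, against a corpus rate
      # of 0.2% (2,000 occurrences in 1,000,000 tokens): strongly overused.
      print(z_score(40, 5_000, 2_000, 1_000_000))   # ~9.5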
  5. Kocher, M.; Savoy, J.: A simple and efficient algorithm for authorship verification (2017) 0.00
    0.0023673228 = product of:
      0.0071019684 = sum of:
        0.0071019684 = product of:
          0.014203937 = sum of:
            0.014203937 = weight(_text_:of in 3330) [ClassicSimilarity], result of:
              0.014203937 = score(doc=3330,freq=8.0), product of:
                0.06850986 = queryWeight, product of:
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.043811057 = queryNorm
                0.20732689 = fieldWeight in 3330, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3330)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    This paper describes and evaluates an unsupervised and effective authorship verification model called Spatium-L1. As features, we suggest using the 200 most frequent terms of the disputed text (isolated words and punctuation symbols). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was written by the proposed author. Moreover, based on a simple rule we can define when there is enough evidence to propose an answer or when the attribution scheme is unable to make a decision with a high degree of certainty. Evaluations based on 6 test collections (PAN CLEF 2014 evaluation campaign) indicate that Spatium-L1 usually appears among the top 3 verification systems and, on an aggregate measure, presents the best performance. The suggested strategy can be adapted without difficulty to different Indo-European languages (such as English, Dutch, Spanish, and Greek) or genres (essay, novel, review, and newspaper article).
    Source
    Journal of the Association for Information Science and Technology. 68(2017) no.1, pp. 259-269
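
    The abstract above presents Spatium-L1 as an L1 (city-block) distance over the 200 most frequent terms of the disputed text, compared against a set of impostors. The sketch below illustrates that idea with a deliberately simplified decision rule (the candidate author must beat every impostor); the paper's actual rule and impostor sampling are not given in the abstract, so the names and thresholds here are assumptions.

      from collections import Counter

      def profile(tokens, vocabulary):
          # Relative frequencies restricted to a fixed vocabulary
          counts = Counter(t for t in tokens if t in vocabulary)
          total = sum(counts.values()) or 1
          return {term: counts[term] / total for term in vocabulary}

      def l1(p, q):
          return sum(abs(p[t] - q[t]) for t in p)

      def verify(disputed, candidate, impostors, top_k=200):
          # Vocabulary = the top_k most frequent terms of the disputed text
          vocab = [t for t, _ in Counter(disputed).most_common(top_k)]
          d_prof = profile(disputed, vocab)
          d_candidate = l1(d_prof, profile(candidate, vocab))
          d_impostors = [l1(d_prof, profile(toks, vocab)) for toks in impostors]
          # Simplified decision: accept only if the candidate beats every impostor
          return all(d_candidate < d for d in d_impostors)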
  6. Savoy, J.: Authorship of Pauline epistles revisited (2019) 0.00
    0.001972769 = product of:
      0.0059183068 = sum of:
        0.0059183068 = product of:
          0.0118366135 = sum of:
            0.0118366135 = weight(_text_:of in 5386) [ClassicSimilarity], result of:
              0.0118366135 = score(doc=5386,freq=8.0), product of:
                0.06850986 = queryWeight, product of:
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.043811057 = queryNorm
                0.17277241 = fieldWeight in 5386, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  1.5637573 = idf(docFreq=25162, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5386)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    The name Paul appears in 13 epistles, but is he the real author? According to different biblical scholars, the number of letters actually attributable to Paul varies from 4 to 13, with a majority agreeing on seven. This article revisits this authorship attribution problem by considering two effective methods (Burrows' Delta, Labbé's intertextual distance). Based on these results, a hierarchical clustering is then applied, showing that four clusters can be derived, namely: {Colossians-Ephesians}, {1 and 2 Thessalonians}, {Titus, 1 and 2 Timothy}, and {Romans, Galatians, 1 and 2 Corinthians}. Moreover, a verification method based on the impostors' strategy indicates clearly that the group {Colossians-Ephesians} was written by a single author, who does not seem to be Paul. The same conclusion holds for the cluster {Titus, 1 and 2 Timothy}. The Letter to Philemon remains a singleton, without any close stylistic relationship to the other epistles. Finally, the group of four letters {Romans, Galatians, 1 and 2 Corinthians} is certainly written by the same author (Paul), but the verification protocol also indicates that 2 Corinthians is related to 1 Thessalonians, rendering a clear and simple interpretation difficult.
    Source
    Journal of the Association for Information Science and Technology. 70(2019) no.10, pp. 1089-1097
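
    Of the two distance measures mentioned in the abstract above, Burrows' Delta has a compact standard definition: standardise (z-score) the relative frequencies of the most frequent word-types across the corpus and take the mean absolute difference of the two texts' z-score profiles. A small sketch with a made-up frequency matrix (the paper's word-type selection and corpus are not reproduced here):

      import numpy as np

      def burrows_delta(freq, i, j):
          # freq: texts x relative frequencies of the most frequent word-types
          mu = freq.mean(axis=0)
          sigma = freq.std(axis=0)
          sigma[sigma == 0] = 1.0          # guard against constant columns
          z = (freq - mu) / sigma
          return float(np.mean(np.abs(z[i] - z[j])))

      m = np.array([[0.050, 0.020, 0.010, 0.030, 0.010],
                    [0.040, 0.030, 0.010, 0.020, 0.020],
                    [0.060, 0.010, 0.020, 0.030, 0.010],
                    [0.030, 0.040, 0.010, 0.010, 0.030]])
      print(burrows_delta(m, 0, 1))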