Search (4 results, page 1 of 1)

  • author_ss:"Savoy, J."
  1. Dolamic, L.; Savoy, J.: Retrieval effectiveness of machine translated queries (2010) 0.03
    0.030525716 = product of:
      0.061051432 = sum of:
        0.061051432 = product of:
          0.122102864 = sum of:
            0.122102864 = weight(_text_:300 in 4102) [ClassicSimilarity], result of:
              0.122102864 = score(doc=4102,freq=2.0), product of:
                0.3045538 = queryWeight, product of:
                  6.047913 = idf(docFreq=283, maxDocs=44218)
                  0.050356843 = queryNorm
                0.4009238 = fieldWeight in 4102, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  6.047913 = idf(docFreq=283, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4102)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
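The tree above is Lucene's ClassicSimilarity (TF-IDF) score explanation. A minimal sketch reproducing it from the reported figures (docFreq, maxDocs, queryNorm, fieldNorm, and the two coord(1/2) factors); the idf and tf formulas below are the standard ClassicSimilarity definitions:

```python
import math

# Values taken from the explain tree above.
doc_freq, max_docs = 283, 44218
query_norm = 0.050356843
field_norm = 0.046875          # per-field length normalization for doc 4102
freq = 2.0                     # term "300" occurs twice in the field

idf = 1.0 + math.log(max_docs / (doc_freq + 1))   # ~ 6.047913
tf = math.sqrt(freq)                              # ~ 1.4142135
query_weight = idf * query_norm                   # ~ 0.3045538
field_weight = tf * idf * field_norm              # ~ 0.4009238
raw_score = query_weight * field_weight           # ~ 0.122102864
score = raw_score * 0.5 * 0.5                     # two coord(1/2) factors
print(score)                                      # ~ 0.030525716
```

Multiplying out the nested products this way recovers the displayed score of 0.03 for the first result.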
    
    Abstract
    This article describes and evaluates various information retrieval models used to search document collections written in English by submitting queries written in various other languages, either members of the Indo-European family (English, French, German, and Spanish) or radically different language groups such as Chinese. This evaluation method involves searching a rather large number of topics (around 300) and using two commercial machine translation systems to translate across the language barriers. In this study, mean average precision is used to measure variances in retrieval effectiveness when a query language differs from the document language. Although performance differences are rather large for certain language pairs, this does not mean that bilingual search methods are not commercially viable. Causes of the difficulties incurred when searching or during translation are analyzed, and the results of concrete examples are explained.
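The abstract's effectiveness measure, mean average precision (MAP), averages over topics the precision obtained at each rank where a relevant document is retrieved. A minimal sketch with toy data (the document IDs and relevance sets below are hypothetical, not from the paper):

```python
def average_precision(ranked_ids, relevant):
    """Precision at each relevant hit, averaged over all relevant docs."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy topics: relevant docs at ranks 1 and 3, and at rank 2.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2 = 5/6
    (["d4", "d5"], {"d5"}),               # AP = (1/2) / 1 = 1/2
]
print(mean_average_precision(runs))       # (5/6 + 1/2) / 2 = 2/3
```

In a cross-lingual setting such as the study above, MAP computed for translated queries is compared against a monolingual baseline to quantify the translation penalty.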
  2. Savoy, J.: Text clustering : an application with the 'State of the Union' addresses (2015) 0.03
    0.025438096 = product of:
      0.050876193 = sum of:
        0.050876193 = product of:
          0.101752385 = sum of:
            0.101752385 = weight(_text_:300 in 2128) [ClassicSimilarity], result of:
              0.101752385 = score(doc=2128,freq=2.0), product of:
                0.3045538 = queryWeight, product of:
                  6.047913 = idf(docFreq=283, maxDocs=44218)
                  0.050356843 = queryNorm
                0.33410314 = fieldWeight in 2128, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  6.047913 = idf(docFreq=283, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2128)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This paper describes a clustering and authorship attribution study of the State of the Union addresses from 1790 to 2014 (224 speeches delivered by 41 presidents). To define the style of each presidency, we applied a principal component analysis (PCA) based on part-of-speech (POS) frequencies. From Roosevelt (1934) onward, each president tends to have a distinctive style, whereas earlier presidents usually share some stylistic aspects with one another. Applying an automatic classification based on the frequencies of all content-bearing word-types, we show that chronology tends to play a central role in forming clusters, a factor more important than political affiliation. Using the 300 most frequent word-types, we generate another clustering representation based on the style of each president. This second view shares similarities with the first, but usually with more numerous and smaller clusters. Finally, an authorship attribution approach for each speech can reach a success rate of around 95.7% under some constraints. When an incorrect assignment is detected, the proposed author often belongs to the same party and lived during roughly the same period as the presumed author. A deeper analysis of some incorrect assignments reveals interesting reasons behind these difficult attributions.
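PCA over POS frequencies, as used in the abstract above, amounts to centering a speeches-by-tags frequency matrix and projecting it onto its leading principal axes. A hedged sketch with a toy matrix (the numbers are hypothetical, not the paper's data; NumPy's SVD is used in place of whatever toolkit the authors employed):

```python
import numpy as np

# Toy POS-frequency profiles: rows = speeches, columns = relative
# frequencies of POS tags (e.g., noun, verb, adjective, determiner).
pos_freq = np.array([
    [0.28, 0.17, 0.08, 0.12],
    [0.30, 0.15, 0.09, 0.11],
    [0.22, 0.21, 0.12, 0.09],
    [0.21, 0.22, 0.11, 0.10],
])

# PCA: center each column, then take the leading right-singular vectors.
centered = pos_freq - pos_freq.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:2]                   # first two principal axes
projected = centered @ components.T   # 2-D stylistic coordinates per speech
```

Plotting `projected` gives the kind of stylistic map the paper uses to compare presidencies; speeches with similar POS profiles land close together.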
  3. Ikae, C.; Savoy, J.: Gender identification on Twitter (2022) 0.03
    0.025438096 = product of:
      0.050876193 = sum of:
        0.050876193 = product of:
          0.101752385 = sum of:
            0.101752385 = weight(_text_:300 in 445) [ClassicSimilarity], result of:
              0.101752385 = score(doc=445,freq=2.0), product of:
                0.3045538 = queryWeight, product of:
                  6.047913 = idf(docFreq=283, maxDocs=44218)
                  0.050356843 = queryNorm
                0.33410314 = fieldWeight in 445, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  6.047913 = idf(docFreq=283, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=445)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n-grams), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k-nearest neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to know whether or not the same model always achieves the best effectiveness when considering similar corpora under the same conditions. Thus, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection to reduce the feature size to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks and random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that reducing the feature set size to around 300 without penalizing the effectiveness is possible. Finally, based on such reduced feature sizes, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
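The 2-stage feature selection mentioned above can be illustrated as: first drop rare terms by document frequency, then keep the top-k terms whose relative frequency differs most between the two classes. A hedged sketch, not the paper's exact method or scoring function, with hypothetical toy data:

```python
from collections import Counter

def two_stage_selection(docs, labels, min_df=2, top_k=300):
    """Stage 1: keep terms appearing in at least min_df documents.
    Stage 2: keep the top_k terms whose relative frequency differs
    most between class 0 and class 1."""
    df = Counter(t for doc in docs for t in set(doc))
    vocab = [t for t, n in df.items() if n >= min_df]          # stage 1

    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
        totals[y] += len(doc)

    def gap(term):  # absolute difference of relative frequencies
        return abs(counts[0][term] / totals[0] - counts[1][term] / totals[1])

    return sorted(vocab, key=gap, reverse=True)[:top_k]        # stage 2

# Toy tokenized corpus, labels 0/1 for the two classes.
docs = [["she", "the", "lovely", "the"], ["she", "lovely", "day"],
        ["he", "the", "game", "the"], ["he", "game", "score"]]
labels = [0, 0, 1, 1]
selected = two_stage_selection(docs, labels, min_df=2, top_k=3)
```

On this toy corpus the class-neutral function word "the" is filtered out in stage 2, while class-skewed content words survive; the paper's point is that reducing to roughly 300 such terms leaves effectiveness essentially unchanged.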
  4. Savoy, J.: Estimating the probability of an authorship attribution (2016) 0.01
    0.008528322 = product of:
      0.017056644 = sum of:
        0.017056644 = product of:
          0.034113288 = sum of:
            0.034113288 = weight(_text_:22 in 2937) [ClassicSimilarity], result of:
              0.034113288 = score(doc=2937,freq=2.0), product of:
                0.17634109 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050356843 = queryNorm
                0.19345059 = fieldWeight in 2937, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2937)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    7.5.2016 21:22:27