Search (24 results, page 1 of 2)

  • author_ss:"Savoy, J."
  1. Savoy, J.: Estimating the probability of an authorship attribution (2016) 0.01
    Abstract
    In authorship attribution, various distance-based metrics have been proposed to determine the most probable author of a disputed text. In this paradigm, a distance is computed between each author profile and the query text. These values are then employed only to rank the possible authors. In this article, we analyze their distribution and show that we can model it as a mixture of 2 Beta distributions. Based on this finding, we demonstrate how we can derive a more accurate probability that the closest author is, in fact, the real author. To evaluate this approach, we have chosen 4 authorship attribution methods (Burrows' Delta, Kullback-Leibler divergence, Labbé's intertextual distance, and the naïve Bayes). As the first test collection, we have downloaded 224 State of the Union addresses (from 1790 to 2014) delivered by 41 U.S. presidents. The second test collection is formed by the Federalist Papers. The evaluations indicate that the accuracy rate of some authorship decisions can be improved. The suggested method can signal that the proposed assignment should be interpreted as possible, without strong certainty. Being able to quantify the certainty associated with an authorship decision can be a useful component when important decisions must be taken.
    Date
    7. 5.2016 21:22:27
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.6, S.1462-1472
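
A minimal sketch of the idea described in this abstract, not the paper's exact estimation procedure: attribution distances are modelled as a mixture of two Beta distributions (one for the true-author case, one for the wrong-author case), and Bayes' rule converts the smallest observed distance into a probability. All parameter values below are made-up illustrative assumptions.

from scipy.stats import beta

# Hypothetical mixture components fitted on training material (a, b per Beta).
true_author = beta(a=2.0, b=8.0)    # small distances when the author matches
wrong_author = beta(a=6.0, b=3.0)   # larger distances otherwise
prior_true = 0.5                    # prior P(the closest profile is the true author)

def p_true_author(distance: float) -> float:
    """P(true author | normalized distance) under the two-component mixture."""
    num = prior_true * true_author.pdf(distance)
    den = num + (1.0 - prior_true) * wrong_author.pdf(distance)
    return num / den

print(p_true_author(0.15))  # small distance -> probability close to 1
print(p_true_author(0.60))  # larger distance -> much weaker evidence
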
  2. Savoy, J.: Stemming of French words based on grammatical categories (1993) 0.00
    Source
    Journal of the American Society for Information Science. 44(1993) no.1, S.1-9
  3. Savoy, J.: Effectiveness of information retrieval systems used in a hypertext environment (1993) 0.00
    Abstract
    In most hypertext systems, information retrieval techniques emphasize browsing or navigational methods, which are not thorough enough to find all relevant material, especially when the number of nodes and/or links becomes very large. Reviews the main query-based search techniques currently used in hypertext environments. Explains the experimental methodology. Concentrates on the retrieval effectiveness of these retrieval strategies. Considers ways of improving search effectiveness.
  4. Savoy, J.: Text clustering : an application with the 'State of the Union' addresses (2015) 0.00
    Abstract
    This paper describes a clustering and authorship attribution study over the State of the Union addresses from 1790 to 2014 (224 speeches delivered by 41 presidents). To define the style of each presidency, we have applied a principal component analysis (PCA) based on the part-of-speech (POS) frequencies. From Roosevelt (1934) onwards, each president tends to have a distinctive style, whereas earlier presidents usually tend to share some stylistic aspects with others. Applying an automatic classification based on the frequencies of all content-bearing word-types, we show that chronology tends to play a central role in forming clusters, a factor that is more important than political affiliation. Using the 300 most frequent word-types, we generate another clustering representation based on the style of each president. This second view shares similarities with the first one, but usually with more numerous and smaller clusters. Finally, an authorship attribution approach for each speech can reach a success rate of around 95.7% under some constraints. When an incorrect assignment is detected, the proposed author often belongs to the same party and has lived during roughly the same time period as the presumed author. A deeper analysis of some incorrect assignments reveals interesting reasons justifying difficult attributions.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.8, S.1645-1654
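
A minimal sketch of the two views described in this abstract, on made-up data: a PCA of part-of-speech frequency profiles and a Ward clustering of frequent-word frequency profiles. The array shapes, president labels, and cluster count are illustrative assumptions, not the paper's data.

import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
presidents = ["Washington", "Lincoln", "F.D. Roosevelt", "Obama"]
pos_freq = rng.dirichlet(np.ones(12), size=len(presidents))    # 12 POS tags
word_freq = rng.dirichlet(np.ones(300), size=len(presidents))  # 300 frequent word-types

# View 1: project each presidency onto the first two principal components.
coords = PCA(n_components=2).fit_transform(pos_freq)
for name, (x, y) in zip(presidents, coords):
    print(f"{name:15s} PC1={x:+.3f} PC2={y:+.3f}")

# View 2: hierarchical (Ward) clustering on word-frequency profiles.
labels = fcluster(linkage(word_freq, method="ward"), t=2, criterion="maxclust")
print(dict(zip(presidents, labels)))
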
  5. Dolamic, L.; Savoy, J.: Retrieval effectiveness of machine translated queries (2010) 0.00
    Abstract
    This article describes and evaluates various information retrieval models used to search document collections written in English through submitting queries written in various other languages, either members of the Indo-European family (English, French, German, and Spanish) or radically different language groups such as Chinese. This evaluation method involves searching a rather large number of topics (around 300) and using two commercial machine translation systems to translate across the language barriers. In this study, mean average precision is used to measure variances in retrieval effectiveness when a query language differs from the document language. Although performance differences are rather large for certain language pairs, this does not mean that bilingual search methods are not commercially viable. Causes of the difficulties incurred when searching or during translation are analyzed and the results of concrete examples are explained.
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.11, S.2266-2273
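
A minimal sketch of the evaluation measure named in this abstract, mean average precision, computed over a set of topics from each topic's ranked list and set of relevant documents. The ranked lists are made-up illustrative data.

def average_precision(ranking, relevant):
    """AP for one topic: mean of the precision values at each relevant hit."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

runs = {
    "topic-1": (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    "topic-2": (["d5", "d9", "d4"], {"d9"}),
}
mean_ap = sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)
print(f"MAP = {mean_ap:.3f}")
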
  6. Picard, J.; Savoy, J.: Enhancing retrieval with hyperlinks : a general model based on propositional argumentation systems (2003) 0.00
    Abstract
    Fast, effective, and adaptable techniques are needed to automatically organize and retrieve information on the ever-increasing World Wide Web. In that respect, different strategies have been suggested to take hypertext links into account. For example, hyperlinks have been used to (1) enhance document representation, (2) improve document ranking by propagating document scores, (3) provide an indicator of popularity, and (4) find hubs and authorities for a given topic. Although the TREC experiments have not demonstrated the usefulness of hyperlinks for retrieval, the hypertext structure is nevertheless an essential aspect of the Web, and as such, should not be ignored. The development of abstract models of the IR task was a key factor in the improvement of search engines. However, at this time conceptual tools for modeling the hypertext retrieval task are lacking, making it difficult to compare, improve, and reason about the existing techniques. This article proposes a general model for using hyperlinks based on Probabilistic Argumentation Systems, in which each of the above-mentioned techniques can be stated. This model makes it possible to discover some inconsistencies in the mentioned techniques, and to take a higher-level and systematic approach to using hyperlinks for retrieval.
    Source
    Journal of the American Society for Information Science and technology. 54(2003) no.4, S.347-355
  7. Savoy, J.: Text representation strategies : an example with the State of the union addresses (2016) 0.00
    Abstract
    Based on State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 presidents), this paper describes and evaluates different text representation strategies. To determine the most important words of a given text, the term frequencies (tf) or the tf-idf weighting scheme can be applied. Recently, latent Dirichlet allocation (LDA) has been proposed to define the topics included in a corpus. As another strategy, this study proposes to apply a vocabulary specificity measure (Z-score) to determine the most significantly overused word-types or short sequences of them. Our experiments show that the simple term frequency measure is not able to discriminate between specific terms associated with a document or a set of texts. Using the tf-idf or LDA approach, the selection requires some arbitrary decisions. Based on the term-specific measure (Z-score), the term selection has a clear theoretical basis. Moreover, the most significant sentences for each presidency can be determined. As another facet, we can visualize the dynamic evolution of usage of some terms associated with their specificity measures. Finally, this technique can be employed to define the most important lexical leaders introducing terms overused by the k following presidencies.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.8, S.1858-1870
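
A minimal sketch of a vocabulary-specificity Z-score in the spirit of this abstract: a term's observed count in one text is compared with its expected count under the whole corpus, standardized by a binomial standard deviation. The counts are made-up, and the paper's exact formulation may differ in detail.

import math

def z_score(count_in_text, text_length, count_in_corpus, corpus_length):
    p = count_in_corpus / corpus_length           # corpus-wide rate of the term
    expected = text_length * p
    std = math.sqrt(text_length * p * (1.0 - p))  # binomial standard deviation
    return (count_in_text - expected) / std

# A term occurring 40 times in a 10,000-token address but only 500 times in a
# 1,000,000-token corpus is strongly overused in that address.
print(round(z_score(40, 10_000, 500, 1_000_000), 2))
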
  8. Savoy, J.: A new probabilistic scheme for information retrieval in hypertext (1995) 0.00
    Abstract
    The aim of probabilistic models is to define a retrieval strategy within which documents can be optimally ranked according to their relevance probability with respect to a given request. Presents a study which suggests representing documents not only by index term vectors, as proposed by previous probabilistic models, but also by considering relevant hypertext links. To enhance retrieval effectiveness, the learning retrieval scheme should modify the weight assigned to each indexing term, the importance attached to each search term, and the relationships between documents. Evaluation of the proposed retrieval scheme with a hypertext based on the CACM test collection, which includes 3,204 documents, and the CISI corpus (1,460 documents), yields interesting results on the retrieval effectiveness of this approach.
    Source
    New review of hypermedia and multimedia. 1995, no.1, S.107-134
  9. Savoy, J.: Searching information in legal hypertext systems (1993/94) 0.00
    Abstract
    Hypertext may represent a new paradigm capable of exploring legal sources within which links are established according to pertinent relationships found between statute texts and case law. However, to discover relevant information in such a network, a browsing mechanism is not enough when faced with a large volume of texts. Describes a new retrieval model where documents are represented according to both their content and their relationships with other sources of information.
  10. Savoy, J.: Ranking schemes in hybrid Boolean systems : a new approach (1997) 0.00
    Abstract
    In most commercial online systems, the retrieval system is based on the Boolean model and its inverted file organization. Since the investment in these systems is so great and changing them could be economically unfeasible, this article suggests a new ranking scheme especially adapted for hypertext environments in order to produce more effective retrieval results and yet maintain the effectiveness of the investment made to date in the Boolean model. To select the retrieved documents, the suggested ranking strategy uses multiple sources of document content evidence. The proposed scheme integrates both the information provided by the index and query terms, and the inherent relationships between documents such as bibliographic references or hypertext links. We will demonstrate that our scheme represents an integration of both subject and citation indexing, and results in a significant improvement over classical ranking schemes used in hybrid Boolean systems, while preserving its efficiency. Moreover, since the nearest neighbors and the hypertext links constitute additional sources of evidence, our strategy takes them into account in order to further improve retrieval effectiveness and to provide 'good' starting points for browsing in a hypertext or hypermedia environment.
    Source
    Journal of the American Society for Information Science. 48(1997) no.3, S.235-253
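
A minimal, generic sketch of combining evidence sources as this abstract describes: each document's content-based score is blended with evidence propagated from the documents it is linked to (bibliographic references or hypertext links). The scores, links, and mixing weight alpha are made-up; the paper's actual combination scheme may differ.

content_score = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
links = {"d1": ["d2"], "d2": ["d1", "d3"], "d3": []}
alpha = 0.3  # weight given to linked-document evidence

def combined_score(doc):
    neighbours = links[doc]
    link_part = (sum(content_score[n] for n in neighbours) / len(neighbours)
                 if neighbours else 0.0)
    return (1 - alpha) * content_score[doc] + alpha * link_part

for d in sorted(content_score, key=combined_score, reverse=True):
    print(d, round(combined_score(d), 3))
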
  11. Ikae, C.; Savoy, J.: Gender identification on Twitter (2022) 0.00
    Abstract
    To determine the gender of a text's author, various feature types have been suggested (e.g., function words, n-grams of letters, etc.), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k nearest-neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to know whether or not the same model always achieves the best effectiveness when considering similar corpora under the same conditions. Thus, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection to reduce the feature size to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural network or random forest tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that reducing the feature set size to around 300 without penalizing the effectiveness is possible. Finally, based on such reduced feature sizes, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
    Source
    Journal of the Association for Information Science and Technology. 73(2022) no.1, S.58-69
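
A minimal sketch of a two-stage feature selection pipeline of the kind this abstract describes: stage 1 keeps only reasonably frequent terms, stage 2 keeps the terms with the highest chi-squared association with the class, and a classifier is trained on the reduced matrix. This is a generic pipeline on a toy corpus, not the paper's exact procedure or data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

texts = ["short example tweet one", "another example tweet",
         "more sample text here", "yet another short sample"]  # toy corpus
labels = [0, 1, 0, 1]                                          # two gender classes

# Stage 1: frequency-based cut (a term must occur in at least two documents).
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(texts)

# Stage 2: keep the k terms most associated with the label (chi-squared score).
k = min(300, X.shape[1])
X_reduced = SelectKBest(chi2, k=k).fit_transform(X, labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_reduced, labels)
print(X.shape, "->", X_reduced.shape)
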
  12. Savoy, J.: Bibliographic database access using free-text and controlled vocabulary : an evaluation (2005) 0.00
    Abstract
    This paper evaluates and compares the retrieval effectiveness of various search models, based on either automatic text-word indexing or on manually assigned controlled descriptors. Retrieval is from a relatively large collection of bibliographic material written in French. Moreover, for this French collection we evaluate improvements that result from combining automatic and manual indexing. First, when considering various contexts, this study reveals that the combined indexing strategy always obtains the best retrieval performance. Second, when users wish to conduct exhaustive searches with minimal effort, we demonstrate that manually assigned terms are essential. Third, the evaluations presented in this study reveal the comparative retrieval performances that result from manual and automatic indexing in a variety of circumstances.
  13. Savoy, J.; Ndarugendamwo, M.; Vrajitoru, D.: Report on the TREC-4 experiment : combining probabilistic and vector-space schemes (1996) 0.00
    Imprint
    Gaithersburg, MD : National Institute of Standards and Technology
  14. Savoy, J.; Calvé, A. le; Vrajitoru, D.: Report on the TREC5 experiment : data fusion and collection fusion (1997) 0.00
    Imprint
    Gaithersburg, MD : National Institute of Standards and Technology
  15. Savoy, J.: A stemming procedure and stopword list for general French Corpora (1999) 0.00
    Source
    Journal of the American Society for Information Science. 50(1999) no.10, S.944-954
  16. Kocher, M.; Savoy, J.: ¬A simple and efficient algorithm for authorship verification (2017) 0.00
    Abstract
    This paper describes and evaluates an unsupervised and effective authorship verification model called Spatium-L1. As features, we suggest using the 200 most frequent terms of the disputed text (isolated words and punctuation symbols). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was written by the proposed author. Moreover, based on a simple rule we can define when there is enough evidence to propose an answer or when the attribution scheme is unable to make a decision with a high degree of certainty. Evaluations based on 6 test collections (PAN CLEF 2014 evaluation campaign) indicate that Spatium-L1 usually appears in the top 3 best verification systems, and on an aggregate measure, presents the best performance. The suggested strategy can be adapted without any problem to different Indo-European languages (such as English, Dutch, Spanish, and Greek) or genres (essay, novel, review, and newspaper article).
    Source
    Journal of the Association for Information Science and Technology. 68(2017) no.1, S.259-269
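
A simplified sketch of the verification idea in this abstract: take the most frequent terms of the disputed text, build relative-frequency profiles, measure L1 distances, and accept the candidate author only when the disputed text is closer to the candidate than to every impostor. The toy texts and the decision rule are illustrative simplifications of the published Spatium-L1 procedure.

from collections import Counter

def profile(tokens, vocabulary):
    counts, total = Counter(tokens), len(tokens)
    return [counts[t] / total for t in vocabulary]

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def verify(disputed, candidate, impostors, top_n=200):
    # Features: the top_n most frequent terms of the disputed text itself.
    vocab = [t for t, _ in Counter(disputed).most_common(top_n)]
    d_prof = profile(disputed, vocab)
    dist_candidate = l1(d_prof, profile(candidate, vocab))
    dist_impostors = [l1(d_prof, profile(i, vocab)) for i in impostors]
    return all(dist_candidate < d for d in dist_impostors)

disputed = "the cat sat on the mat and the dog sat too".split()
candidate = "the cat and the dog sat on the mat".split()
impostors = ["a completely different text about retrieval models".split()]
print(verify(disputed, candidate, impostors))
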
  17. Savoy, J.; Desbois, D.: Information retrieval in hypertext systems (1991) 0.00
    Abstract
    The emphasis in most hypertext systems is on the navigational methods, rather than on the global document retrieval mechanisms. When a search mechanism is provided, it is often restricted to simple string matching or to the Boolean model (as an alternate method). Proposes a retrieval mechanism using Bayesian inference networks. The main contribution of this approach is the automatic construction of this network, using the expected mutual information measure to build the inference tree, and using Jaccard's formula to define fixed conditional probability relationships.
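
A minimal sketch of the two quantities this abstract mentions, computed from binary term-occurrence data: the expected mutual information measure (EMIM) between two terms, which can weight the edges of the inference tree, and Jaccard's coefficient. The document sets are made-up, and the paper's full network construction is more involved.

import math

def emim(docs_a, docs_b, all_docs):
    """Expected mutual information between two binary term-occurrence variables."""
    n, total = len(all_docs), 0.0
    for set_a in (docs_a, all_docs - docs_a):
        for set_b in (docs_b, all_docs - docs_b):
            joint = len(set_a & set_b)
            if joint > 0:
                # p(a,b) * log( p(a,b) / (p(a) * p(b)) )
                total += (joint / n) * math.log(joint * n / (len(set_a) * len(set_b)))
    return total

def jaccard(docs_a, docs_b):
    return len(docs_a & docs_b) / len(docs_a | docs_b)

all_docs = set(range(10))
hypertext = {0, 1, 2, 5}      # documents containing "hypertext"
retrieval = {1, 2, 5, 7, 8}   # documents containing "retrieval"
print(emim(hypertext, retrieval, all_docs), jaccard(hypertext, retrieval))
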
  18. Savoy, J.: An extended vector-processing scheme for searching information in hypertext systems (1996) 0.00
    Abstract
    When searching for information in a hypertext is limited to navigation, the task is not easy, especially when the number of nodes and/or links becomes very large. A query-based access mechanism must therefore be provided to complement the navigational tools inherent in hypertext systems. Most mechanisms currently proposed are based on conventional information retrieval models, which consider documents as independent entities and ignore hypertext links. To promote the use of other information retrieval mechanisms adapted to hypertext systems, responds to the following questions: how can we integrate information given by hypertext links into an information retrieval scheme; are these hypertext links (and link semantics) clues to the enhancement of retrieval effectiveness; if so, how can we use them? Two solutions are: using a default weight function based on link type, or assigning the same strength to all link types; or using a specific weight for each particular link, i.e. the level of association or a similarity measure. Proposes an extended vector-processing scheme which extracts additional information from hypertext links to enhance retrieval effectiveness. A hypertext based on two medium-size collections, the CACM and the CISI collections, has been built. The hypergraph is composed of explicit links (bibliographic references), computed links based on bibliographic information, or hypertext links established according to document representatives (nearest neighbour).
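
A minimal sketch of the link-weighting idea this abstract outlines: a document's query score is extended with contributions from its linked documents, each link type (bibliographic reference, nearest neighbour, ...) carrying its own default weight. The scores, links, and weights are made-up illustrative values, not the paper's scheme.

base_score = {"d1": 0.70, "d2": 0.20, "d3": 0.05}
links = [("d1", "d2", "reference"), ("d1", "d3", "nearest_neighbour")]  # (source, target, type)
link_weight = {"reference": 0.4, "nearest_neighbour": 0.2}

def extended_score(doc):
    score = base_score[doc]
    for source, target, kind in links:
        if source == doc:
            score += link_weight[kind] * base_score[target]
    return score

for d in base_score:
    print(d, round(extended_score(d), 3))
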
  19. Abdou, S.; Savoy, J.: Searching in Medline : query expansion and manual indexing evaluation (2008) 0.00
    Abstract
    Based on a relatively large subset representing one third of the Medline collection, this paper evaluates ten different IR models, including recent developments in both probabilistic and language models. We show that the best performing IR model is a probabilistic model developed within the Divergence from Randomness framework [Amati, G., & van Rijsbergen, C.J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems 20(4), 357-389], which results in a 170% enhancement in mean average precision when compared to the classical tf-idf vector-space model. This paper also reports on our evaluations of the impact of manually assigned descriptors (MeSH or Medical Subject Headings) on retrieval effectiveness, showing that by including these terms retrieval performance can improve from 2.4% to 13.5%, depending on the underlying IR model. Finally, we design a new general blind-query expansion approach showing improved retrieval performance compared to that obtained using the Rocchio approach.
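
A minimal sketch of the Rocchio blind (pseudo-relevance) query expansion used as the baseline in this abstract: the query vector is moved towards the centroid of the top-ranked documents. The vectors and the alpha/beta parameters are made-up illustrative values.

import numpy as np

def rocchio_expand(query_vec, top_doc_vecs, alpha=1.0, beta=0.75):
    centroid = np.mean(top_doc_vecs, axis=0)    # centroid of pseudo-relevant docs
    return alpha * query_vec + beta * centroid

query = np.array([1.0, 0.0, 0.5, 0.0])          # tf-idf weights of query terms
top_docs = np.array([[0.8, 0.1, 0.4, 0.0],
                     [0.6, 0.0, 0.7, 0.2]])     # top-2 retrieved documents
print(rocchio_expand(query, top_docs))
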
  20. Savoy, J.: Authorship of Pauline epistles revisited (2019) 0.00
    Abstract
    The name Paul appears in 13 epistles, but is he the real author? According to different biblical scholars, the number of letters really attributed to Paul varies from 4 to 13, with a majority agreeing on seven. This article proposes to revisit this authorship attribution problem by considering two effective methods (Burrows' Delta, Labbé's intertextual distance). Based on these results, a hierarchical clustering is then applied showing that four clusters can be derived, namely: {Colossians-Ephesians}, {1 and 2 Thessalonians}, {Titus, 1 and 2 Timothy}, and {Romans, Galatians, 1 and 2 Corinthians}. Moreover, a verification method based on the impostors' strategy indicates clearly that the group {Colossians-Ephesians} is written by the same author who seems not to be Paul. The same conclusion can be found for the cluster {Titus, 1 and 2 Timothy}. The Letter to Philemon stays as a singleton, without any close stylistic relationship with the other epistles. Finally, a group of four letters {Romans, Galatians, 1 and 2 Corinthians} is certainly written by the same author (Paul), but the verification protocol also indicates that 2 Corinthians is related to 1 Thessalonians, rendering a clear and simple interpretation difficult.
    Source
    Journal of the Association for Information Science and Technology. 70(2019) no.10, S.1089-1097
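
A minimal sketch of Burrows' Delta, one of the two distances named in this abstract: relative frequencies of the most frequent words are standardized into z-scores across the corpus, and Delta is the mean absolute difference of z-scores between a disputed text and an author profile. The frequency matrix is made-up illustrative data.

import numpy as np

# Rows = texts (the last row plays the disputed epistle), columns = frequent words.
rel_freq = np.array([[0.031, 0.012, 0.004],
                     [0.028, 0.015, 0.006],
                     [0.035, 0.010, 0.003],
                     [0.029, 0.014, 0.005]])
z = (rel_freq - rel_freq.mean(axis=0)) / rel_freq.std(axis=0)

def burrows_delta(z_disputed, z_profile):
    return np.mean(np.abs(z_disputed - z_profile))

print(burrows_delta(z[-1], z[0]))   # disputed text vs first author profile
print(burrows_delta(z[-1], z[1]))
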