Search (3 results, page 1 of 1)

  • author_ss:"Savoy, J."
  • language_ss:"e"
  1. Dolamic, L.; Savoy, J.: Indexing and searching strategies for the Russian language (2009) 0.04
    0.040342852 = product of:
      0.080685705 = sum of:
        0.080685705 = product of:
          0.16137141 = sum of:
            0.16137141 = weight(_text_:light in 3301) [ClassicSimilarity], result of:
              0.16137141 = score(doc=3301,freq=6.0), product of:
                0.2920221 = queryWeight, product of:
                  5.7753086 = idf(docFreq=372, maxDocs=44218)
                  0.050563898 = queryNorm
                0.55259997 = fieldWeight in 3301, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  5.7753086 = idf(docFreq=372, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3301)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
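
    The nested breakdown above is Lucene's explain output for a ClassicSimilarity (TF-IDF) match on the term "light". As a rough illustration, the sketch below redoes the same arithmetic in Python; every constant is copied from the tree, while the tf and idf formulas (square root of the term frequency, 1 + ln(maxDocs/(docFreq+1))) are the standard ClassicSimilarity definitions, not anything specific to this search system.

      import math

      # Values copied from the explanation tree for doc 3301 and the term "light"
      term_freq  = 6.0          # termFreq
      doc_freq   = 372          # docFreq used for idf
      max_docs   = 44218        # maxDocs used for idf
      query_norm = 0.050563898  # queryNorm
      field_norm = 0.0390625    # fieldNorm(doc=3301)
      coord      = 0.5          # coord(1/2), applied twice in the tree

      tf  = math.sqrt(term_freq)                     # 2.4494898
      idf = 1 + math.log(max_docs / (doc_freq + 1))  # 5.7753086

      query_weight = idf * query_norm                # 0.2920221  (queryWeight)
      field_weight = tf * idf * field_norm           # 0.55259997 (fieldWeight)
      term_score   = query_weight * field_weight     # 0.16137141 (weight of _text_:light)

      print(term_score * coord * coord)              # ~0.040342852, the listed score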
    
    Abstract
    This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strategies we apply various probabilistic information retrieval (IR) models, including the Okapi, the Divergence from Randomness (DFR), a statistical language model (LM), as well as two vector-space approaches, namely, the classical tf idf scheme and the dtu-dtn model. We find that the vector-space dtu-dtn and the DFR models tend to result in better retrieval effectiveness than the Okapi, LM, or tf idf models, while only the latter two IR approaches result in statistically significant performance differences. Ignoring stemming generally reduces the MAP by more than 50%, and these differences are always significant. When applying an n-gram approach, performance differences are usually smaller than with an approach involving stemming. Finally, our light stemmer tends to perform best, although performance differences between the light, aggressive, and Snowball stemmers are not statistically significant.
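
    The language-independent n-gram indexing mentioned in this abstract can be pictured as replacing each word by its overlapping character n-grams, so that no language-specific stemmer is required. The helper below is only a generic sketch of that idea; the function name char_ngrams and the choice n = 4 are assumptions for illustration, not taken from the paper.

      def char_ngrams(token, n=4):
          # Return the overlapping character n-grams of a token,
          # or the token itself if it is shorter than n characters.
          if len(token) <= n:
              return [token]
          return [token[i:i + n] for i in range(len(token) - n + 1)]

      print(char_ngrams("searching"))  # ['sear', 'earc', 'arch', 'rchi', 'chin', 'hing']
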
  2. Savoy, J.: Searching strategies for the Hungarian language (2008) 0.04
    0.039527763 = product of:
      0.079055525 = sum of:
        0.079055525 = product of:
          0.15811105 = sum of:
            0.15811105 = weight(_text_:light in 2037) [ClassicSimilarity], result of:
              0.15811105 = score(doc=2037,freq=4.0), product of:
                0.2920221 = queryWeight, product of:
                  5.7753086 = idf(docFreq=372, maxDocs=44218)
                  0.050563898 = queryNorm
                0.5414352 = fieldWeight in 2037, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.7753086 = idf(docFreq=372, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2037)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations carried out on two general stemming strategies for this language, and also demonstrates that a light stemming approach could be quite effective. Based on searches done on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP. When compared to an IR scheme without stemming or one based on only a light stemmer, we find the differences to be statistically significant. When compared with probabilistic, vector-space and language models, we find that the Okapi model results in the best retrieval effectiveness. The resulting MAP is found to be about 35% better than the classical tf idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure for both queries and documents significantly improves IR performance (+10%), compared to word-based indexing strategies.
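
    A "light" stemmer in the sense used in this abstract strips only a small set of frequent inflectional suffixes and leaves the rest of the word untouched, in contrast to a more aggressive suffix-stripping approach. The sketch below illustrates the general idea with a made-up suffix list; it is not Savoy's actual Hungarian stemmer, and the decompounding step discussed in the abstract is omitted.

      # Hypothetical suffix list, longest first, for illustration only
      LIGHT_SUFFIXES = ("okat", "eket", "nak", "nek", "ban", "ben", "ok", "ek", "t")

      def light_stem(word, min_stem=3):
          # Strip at most one suffix, keeping a minimum stem length.
          for suffix in LIGHT_SUFFIXES:
              if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                  return word[:-len(suffix)]
          return word

      print(light_stem("asztalokat"))  # -> "asztal" (hypothetical example)
      print(light_stem("ház"))         # -> "ház"    (no suffix match, left unchanged)
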
  3. Savoy, J.: Estimating the probability of an authorship attribution (2016) 0.01
    0.008563388 = product of:
      0.017126776 = sum of:
        0.017126776 = product of:
          0.034253553 = sum of:
            0.034253553 = weight(_text_:22 in 2937) [ClassicSimilarity], result of:
              0.034253553 = score(doc=2937,freq=2.0), product of:
                0.17706616 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050563898 = queryNorm
                0.19345059 = fieldWeight in 2937, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2937)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    7.5.2016 21:22:27