Search (286 results, page 1 of 15)

  • Filter: theme_ss:"Computerlinguistik"
  1. Baayen, R.H.; Lieber, R.: Word frequency distributions and lexical semantics (1997) 0.20
    0.20030972 = product of:
      0.40061945 = sum of:
        0.35609803 = weight(_text_:frequency in 3117) [ClassicSimilarity], result of:
          0.35609803 = score(doc=3117,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            1.288163 = fieldWeight in 3117, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.109375 = fieldNorm(doc=3117)
        0.04452143 = product of:
          0.08904286 = sum of:
            0.08904286 = weight(_text_:22 in 3117) [ClassicSimilarity], result of:
              0.08904286 = score(doc=3117,freq=2.0), product of:
                0.16438834 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04694356 = queryNorm
                0.5416616 = fieldWeight in 3117, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=3117)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Relation between meaning, lexical productivity and frequency of use
    Date
    28. 2.1999 10:48:22
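
The indented breakdown under each hit is Lucene's ClassicSimilarity explain output. For readers who want to check the arithmetic, here is a minimal Python sketch that reproduces the first clause of the tree above; the constants (docFreq, maxDocs, queryNorm, fieldNorm) are copied from the tree, and the formulas are ClassicSimilarity's standard definitions.

```python
import math

# ClassicSimilarity building blocks (Lucene's classic TF-IDF):
#   tf  = sqrt(freq)
#   idf = 1 + ln(maxDocs / (docFreq + 1))
def tf(freq: float) -> float:
    return math.sqrt(freq)

def idf(doc_freq: int, max_docs: int) -> float:
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Constants copied from the explain tree for weight(_text_:frequency in 3117)
query_norm = 0.04694356
field_norm = 0.109375

idf_freq = idf(332, 44218)                      # ~5.888745
query_weight = idf_freq * query_norm            # ~0.27643865
field_weight = tf(4.0) * idf_freq * field_norm  # ~1.288163
clause_score = query_weight * field_weight      # ~0.35609803, as in the tree

print(round(clause_score, 8))
```

The sibling clause for the term "22" and the final coord(2/4) = 0.5 factor (two of the four query clauses matched) combine the same way into the listed total of 0.20030972.
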
  2. Doko, A.; Stula, M.; Seric, L.: Improved sentence retrieval using local context and sentence length (2013) 0.17
    0.17057168 = product of:
      0.22742891 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 2705) [ClassicSimilarity], result of:
              0.028250674 = score(doc=2705,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 2705, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2705)
          0.25 = coord(1/4)
        0.06775281 = weight(_text_:term in 2705) [ClassicSimilarity], result of:
          0.06775281 = score(doc=2705,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.309317 = fieldWeight in 2705, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=2705)
        0.15261345 = weight(_text_:frequency in 2705) [ClassicSimilarity], result of:
          0.15261345 = score(doc=2705,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 2705, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=2705)
      0.75 = coord(3/4)
    
    Abstract
     In this paper we propose improved variants of the sentence retrieval method TF-ISF (a TF-IDF or Term Frequency-Inverse Document Frequency variant for sentence retrieval). The improvement is achieved by using context consisting of neighboring sentences while at the same time promoting the retrieval of longer sentences. We thoroughly compare the new modified TF-ISF methods to the TF-ISF baseline, to an earlier attempt to include context in TF-ISF named tfmix, and to a language-modeling-based method named 3MMPDS, which uses context and promotes the retrieval of long sentences. Experimental results show that the TF-ISF method can be improved using local context. Results also show that the TF-ISF method can be improved by promoting the retrieval of longer sentences. Finally we show that the best results are achieved when combining both modifications. All new methods (TF-ISF variants) also show statistically significantly better results than the other tested methods.
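
As a concrete illustration of the baseline being modified here, the sketch below implements plain TF-ISF (sentences playing the role of documents in tf-idf) together with a naive local-context variant. The neighbour weight `alpha` and the simple additive blend are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter

def isf(term, sentences):
    # inverse sentence frequency: log(n / number of sentences containing term)
    n = len(sentences)
    sf = sum(1 for s in sentences if term in s)
    return math.log(n / sf) if sf else 0.0

def tf_isf(query, sentence, sentences):
    counts = Counter(sentence)
    return sum(counts[t] * isf(t, sentences) for t in query)

def tf_isf_with_context(query, i, sentences, alpha=0.25):
    # blend the sentence's own score with its neighbours' scores (assumption)
    own = tf_isf(query, sentences[i], sentences)
    neighbours = [sentences[j] for j in (i - 1, i + 1) if 0 <= j < len(sentences)]
    ctx = sum(tf_isf(query, s, sentences) for s in neighbours)
    return own + alpha * ctx

sentences = [["sentence", "retrieval", "uses", "tf", "isf"],
             ["local", "context", "helps", "retrieval"],
             ["longer", "sentences", "score", "higher"]]
query = ["retrieval", "context"]
ranked = sorted(range(len(sentences)),
                key=lambda i: tf_isf_with_context(query, i, sentences),
                reverse=True)
print(ranked)  # sentence 1 wins: it matches both terms and borrows from neighbours
```
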
  3. Ko, Y.: A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.17
    0.16593526 = product of:
      0.33187053 = sum of:
        0.17925708 = weight(_text_:term in 2339) [ClassicSimilarity], result of:
          0.17925708 = score(doc=2339,freq=14.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.8183758 = fieldWeight in 2339, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
        0.15261345 = weight(_text_:frequency in 2339) [ClassicSimilarity], result of:
          0.15261345 = score(doc=2339,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 2339, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
      0.5 = coord(2/4)
    
    Abstract
     Text classification (TC) is a core technique for text mining and information retrieval. It has been applied in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain high TC performance. Although term weighting is one of the important modules for TC, and TC has different peculiarities from information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that exploits class information, using positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than other schemes using class information as well as traditional schemes such as tf-idf.
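
The abstract does not spell out the TRR formula, but the general shape of such class-based schemes can be sketched: a log-tf component scaled by the (smoothed) ratio of the term's positive-class and negative-class probabilities. Treat the ratio below as a hypothetical stand-in for the paper's TRR factor, not its exact definition.

```python
import math
from collections import Counter

def class_ratio(term, pos_docs, neg_docs):
    # smoothed ratio of the term's document probability in the two classes
    # (add-one smoothing keeps the ratio finite; an assumption of this sketch)
    p_pos = (sum(term in d for d in pos_docs) + 1) / (len(pos_docs) + 2)
    p_neg = (sum(term in d for d in neg_docs) + 1) / (len(neg_docs) + 2)
    return p_pos / p_neg

def weight(term, doc_counts, pos_docs, neg_docs):
    # log-tf component times the (log of the) class ratio
    return math.log(1 + doc_counts[term]) * math.log2(class_ratio(term, pos_docs, neg_docs))

pos_docs = [{"spam", "offer"}, {"spam", "free"}]
neg_docs = [{"meeting", "agenda"}, {"report"}]
doc = Counter({"spam": 3, "meeting": 1})
print(weight("spam", doc, pos_docs, neg_docs))     # positive: term favours the class
print(weight("meeting", doc, pos_docs, neg_docs))  # negative: term argues against it
```
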
  4. Savoy, J.: Text representation strategies : an example with the State of the union addresses (2016) 0.16
    0.15837984 = product of:
      0.21117312 = sum of:
        0.008323434 = product of:
          0.033293735 = sum of:
            0.033293735 = weight(_text_:based in 3042) [ClassicSimilarity], result of:
              0.033293735 = score(doc=3042,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23539014 = fieldWeight in 3042, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3042)
          0.25 = coord(1/4)
        0.11292135 = weight(_text_:term in 3042) [ClassicSimilarity], result of:
          0.11292135 = score(doc=3042,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5155283 = fieldWeight in 3042, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3042)
        0.08992833 = weight(_text_:frequency in 3042) [ClassicSimilarity], result of:
          0.08992833 = score(doc=3042,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.32531026 = fieldWeight in 3042, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3042)
      0.75 = coord(3/4)
    
    Abstract
     Based on State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 presidents), this paper describes and evaluates different text representation strategies. To determine the most important words of a given text, the term frequencies (tf) or the tf-idf weighting scheme can be applied. Recently, latent Dirichlet allocation (LDA) has been proposed to define the topics included in a corpus. As another strategy, this study proposes to apply a vocabulary specificity measure (Z-score) to determine the most significantly overused word-types or short sequences of them. Our experiments show that the simple term frequency measure is not able to discriminate between specific terms associated with a document or a set of texts. Using the tf-idf or LDA approach, the selection requires some arbitrary decisions. Based on the term-specific measure (Z-score), the term selection has a clear theoretical basis. Moreover, the most significant sentences for each presidency can be determined. As another facet, we can visualize the dynamic evolution of the usage of some terms associated with their specificity measures. Finally, this technique can be employed to define the most important lexical leaders introducing terms overused by the k following presidencies.
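
The Z-score specificity measure lends itself to a compact sketch. Under the usual binomial reading (a term's observed count in one part of the corpus compared with its expected count given its corpus-wide rate), it can be computed as below; the exact variance treatment in the paper may differ.

```python
import math

def z_score(tf_part: int, len_part: int, tf_corpus: int, len_corpus: int) -> float:
    # binomial model: expected count in the part is its length times the
    # term's corpus-wide rate; standardize the observed count against that
    p = tf_corpus / len_corpus
    expected = len_part * p
    variance = len_part * p * (1.0 - p)
    return (tf_part - expected) / math.sqrt(variance)

# e.g. a word used 40 times in a 10,000-token presidency, but occurring at
# only a 0.1% rate corpus-wide, is strongly overused (z ~ 9.5):
print(z_score(tf_part=40, len_part=10_000, tf_corpus=500, len_corpus=500_000))
```
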
  5. Arsenault, C.: Aggregation consistency and frequency of Chinese words and characters (2006) 0.14
    0.14046668 = product of:
      0.28093335 = sum of:
        0.07984746 = weight(_text_:term in 609) [ClassicSimilarity], result of:
          0.07984746 = score(doc=609,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.3645336 = fieldWeight in 609, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=609)
        0.20108588 = weight(_text_:frequency in 609) [ClassicSimilarity], result of:
          0.20108588 = score(doc=609,freq=10.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7274159 = fieldWeight in 609, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=609)
      0.5 = coord(2/4)
    
    Abstract
    Purpose - Aims to measure syllable aggregation consistency of Romanized Chinese data in the title fields of bibliographic records. Also aims to verify if the term frequency distributions satisfy conventional bibliometric laws. Design/methodology/approach - Uses Cooper's interindexer formula to evaluate aggregation consistency within and between two sets of Chinese bibliographic data. Compares the term frequency distributions of polysyllabic words and monosyllabic characters (for vernacular and Romanized data) with the Lotka and the generalised Zipf theoretical distributions. The fits are tested with the Kolmogorov-Smirnov test. Findings - Finds high internal aggregation consistency within each data set but some aggregation discrepancy between sets. Shows that word (polysyllabic) distributions satisfy Lotka's law but that character (monosyllabic) distributions do not abide by the law. Research limitations/implications - The findings are limited to only two sets of bibliographic data (for aggregation consistency analysis) and to one set of data for the frequency distribution analysis. Only two bibliometric distributions are tested. Internal consistency within each database remains fairly high. Therefore the main argument against syllable aggregation does not appear to hold true. The analysis revealed that Chinese words and characters behave differently in terms of frequency distribution but that there is no noticeable difference between vernacular and Romanized data. The distribution of Romanized characters exhibits the worst case in terms of fit to either Lotka's or Zipf's laws, which indicates that Romanized data in aggregated form appear to be a preferable option. Originality/value - Provides empirical data on consistency and distribution of Romanized Chinese titles in bibliographic records.
  6. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.14
    0.13997644 = product of:
      0.18663526 = sum of:
        0.0066587473 = product of:
          0.02663499 = sum of:
            0.02663499 = weight(_text_:based in 5188) [ClassicSimilarity], result of:
              0.02663499 = score(doc=5188,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.18831211 = fieldWeight in 5188, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5188)
          0.25 = coord(1/4)
        0.07823421 = weight(_text_:term in 5188) [ClassicSimilarity], result of:
          0.07823421 = score(doc=5188,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.35716853 = fieldWeight in 5188, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.03125 = fieldNorm(doc=5188)
        0.10174229 = weight(_text_:frequency in 5188) [ClassicSimilarity], result of:
          0.10174229 = score(doc=5188,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.36804655 = fieldWeight in 5188, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.03125 = fieldNorm(doc=5188)
      0.75 = coord(3/4)
    
    Abstract
     Kim and Wilbur present three techniques for the algorithmic identification in text of content-bearing terms and phrases intended for human use as entry points or hyperlinks. Using a set of 1,075 terms from MEDLINE evaluated on a zero-to-four scale (stop word to definite content word), they evaluate the ranked lists of their three methods based on their placement of content words in the top ranks. Data consist of the natural language elements of 304,057 MEDLINE records from 1996, and 173,252 Wall Street Journal records from the TIPSTER collection. Phrases are extracted by breaking at punctuation marks and stop words, normalized by lower-casing, replacement of non-alphanumerics with spaces, and the reduction of multiple spaces. In the "strength of context" approach each document is a vector of binary values for each word or word pair. The words or word pairs are removed from all documents, and the Robertson-Sparck Jones relevance weight for each term computed; negative weights are replaced with zero, those below a randomness threshold ignored, and the remainder summed for each document to yield a document score, and finally the term is assigned the average document score for documents in which it occurred. The average of these word scores is assigned to the original phrase. The "frequency clumping" approach defines a random phrase as one whose distribution among documents is Poisson in character. A p-value, the probability that a phrase frequency of occurrence would be equal to, or less than, Poisson expectations, is computed, and a score assigned which is the negative log of that value. In the "database comparison" approach, if a phrase occurring in a document allows prediction that the document is in MEDLINE rather than in the Wall Street Journal, it is considered to be content-bearing for MEDLINE. The score is computed by dividing the number of occurrences of the term in MEDLINE by occurrences in the Journal, and taking the product of all these values. The one hundred top- and bottom-ranked phrases that occurred in at least 500 documents were collected for each method. The union set had 476 phrases. A second selection was made of two-word phrases each occurring in only three documents, with a union of 599 phrases. A judge then ranked the two sets of terms as to subject specificity on a 0 to 4 scale. Precision was the average subject specificity of the first r ranks, recall the fraction of the subject-specific phrases in the first r ranks, and eleven-point average precision was used as a summary measure. The three methods all move content-bearing terms forward in the lists, as does the use of the sum of the logs of the three methods.
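
Of the three methods, frequency clumping is the easiest to sketch. The version below is a plausible reading of the description above, not the authors' exact derivation: scatter the phrase's occurrences at random, model the resulting document frequency as approximately binomial, and score by the negative log of the lower-tail probability (a normal approximation keeps the sketch short).

```python
import math

def clumping_score(df_observed: int, total_occ: int, n_docs: int) -> float:
    # if total_occ occurrences were scattered at random, the chance a given
    # document receives at least one is 1 - e^(-T/N) (Poisson scattering)
    p = 1.0 - math.exp(-total_occ / n_docs)
    # lower-tail p-value P(DF <= observed) under Binomial(n_docs, p),
    # via a normal approximation with continuity correction
    mean = n_docs * p
    sd = math.sqrt(n_docs * p * (1.0 - p))
    z = (df_observed - mean + 0.5) / sd
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return -math.log(max(p_value, 1e-300))  # clamp to avoid log(0)

# a phrase with 500 occurrences packed ("clumped") into only 60 of 100,000
# documents is far below the ~499 documents random scattering would predict:
print(clumping_score(df_observed=60, total_occ=500, n_docs=100_000))
```
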
  7. Kauchak, D.; Leroy, G.; Hogue, A.: Measuring text difficulty using parse-tree frequency (2017) 0.13
    0.12985206 = product of:
      0.2597041 = sum of:
        0.07984746 = weight(_text_:term in 3786) [ClassicSimilarity], result of:
          0.07984746 = score(doc=3786,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.3645336 = fieldWeight in 3786, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3786)
        0.17985666 = weight(_text_:frequency in 3786) [ClassicSimilarity], result of:
          0.17985666 = score(doc=3786,freq=8.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.6506205 = fieldWeight in 3786, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3786)
      0.5 = coord(2/4)
    
    Abstract
    Text simplification often relies on dated, unproven readability formulas. As an alternative and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3rd level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and corpus analysis. For the user study, we selected 20 sentences randomly from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N = 6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, perceived as easier, and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.
  8. Tseng, Y.-H.: Automatic thesaurus generation for Chinese documents (2002) 0.11
    0.11420593 = product of:
      0.15227456 = sum of:
        0.005885557 = product of:
          0.023542227 = sum of:
            0.023542227 = weight(_text_:based in 5226) [ClassicSimilarity], result of:
              0.023542227 = score(doc=5226,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.16644597 = fieldWeight in 5226, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5226)
          0.25 = coord(1/4)
        0.056460675 = weight(_text_:term in 5226) [ClassicSimilarity], result of:
          0.056460675 = score(doc=5226,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.25776416 = fieldWeight in 5226, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5226)
        0.08992833 = weight(_text_:frequency in 5226) [ClassicSimilarity], result of:
          0.08992833 = score(doc=5226,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.32531026 = fieldWeight in 5226, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5226)
      0.75 = coord(3/4)
    
    Abstract
     Tseng constructs a word co-occurrence based thesaurus by means of the automatic analysis of Chinese text. Words are identified by a longest dictionary match, supplemented by a keyword extraction algorithm that merges back nearby tokens and accepts shorter strings of characters if they occur more often than the longest string. Single-character auxiliary words are a major source of error, but this can be greatly reduced with the use of a 70-character, 2680-word stop list. Extracted terms with their associated document weights are sorted by decreasing frequency, and the top of this list is associated using a Dice coefficient, modified to account for longer documents, on the weights of term pairs. Co-occurrence is computed not over the document as a whole but over paragraph- or sentence-sized sections in order to reduce computation time. A window of 29 characters or 11 words was found to be sufficient. A thesaurus was produced from 25,230 Chinese news articles, and judges were asked to review the top 50 terms associated with each of 30 single-word query terms. They determined 69% to be relevant.
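
The association step translates into a few lines of code. The sketch below computes Dice coefficients over sentence-sized co-occurrence windows as described; Tseng's modification for longer documents and the document-weight sorting are omitted for brevity.

```python
from itertools import combinations
from collections import Counter

def dice_associations(windows, min_count=1):
    # windows: iterable of term sets, one per sentence/paragraph-sized section
    term_freq = Counter()
    pair_freq = Counter()
    for window in windows:
        terms = set(window)
        term_freq.update(terms)
        pair_freq.update(frozenset(p) for p in combinations(sorted(terms), 2))
    scores = {}
    for pair, co in pair_freq.items():
        a, b = tuple(pair)
        if co >= min_count:
            # Dice coefficient: 2 * co-occurrences / (freq(a) + freq(b))
            scores[(a, b)] = 2.0 * co / (term_freq[a] + term_freq[b])
    return scores

windows = [{"automatic", "thesaurus", "generation"},
           {"thesaurus", "generation", "chinese"},
           {"chinese", "news", "articles"}]
print(sorted(dice_associations(windows).items(), key=lambda kv: -kv[1])[:3])
```
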
  9. Lee, G.E.; Sun, A.: Understanding the stability of medical concept embeddings (2021) 0.11
    0.11308204 = product of:
      0.22616407 = sum of:
        0.005885557 = product of:
          0.023542227 = sum of:
            0.023542227 = weight(_text_:based in 159) [ClassicSimilarity], result of:
              0.023542227 = score(doc=159,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.16644597 = fieldWeight in 159, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=159)
          0.25 = coord(1/4)
        0.22027852 = weight(_text_:frequency in 159) [ClassicSimilarity], result of:
          0.22027852 = score(doc=159,freq=12.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7968441 = fieldWeight in 159, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=159)
      0.5 = coord(2/4)
    
    Abstract
     Frequency is one of the major factors in training quality word embeddings. Several studies have recently discussed the stability of word embeddings in the general domain and suggested factors influencing that stability. In this work, we conduct a detailed analysis of the stability of concept embeddings in the medical domain, particularly in relation to concept frequency. The analysis reveals the surprisingly high stability of low-frequency concepts: low-frequency (<100) concepts have the same high stability as high-frequency (>1,000) concepts. To develop a deeper understanding of this finding, we propose a new factor, the noisiness of context words, which influences the stability of medical concept embeddings regardless of high or low frequency. We evaluate the proposed factor by showing the linear correlation with the stability of medical concept embeddings. The correlations are clear and consistent across various groups of medical concepts. Based on the linear relations, we make suggestions on ways to adjust the noisiness of context words to improve stability. Finally, we demonstrate that the linear relation of the proposed factor extends to word embedding stability in the general domain.
  10. Ruge, G.; Schwarz, C.: Term association and computational linguistics (1991) 0.10
    0.10367832 = product of:
      0.20735665 = sum of:
        0.011771114 = product of:
          0.047084454 = sum of:
            0.047084454 = weight(_text_:based in 2310) [ClassicSimilarity], result of:
              0.047084454 = score(doc=2310,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.33289194 = fieldWeight in 2310, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.078125 = fieldNorm(doc=2310)
          0.25 = coord(1/4)
        0.19558553 = weight(_text_:term in 2310) [ClassicSimilarity], result of:
          0.19558553 = score(doc=2310,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.8929213 = fieldWeight in 2310, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.078125 = fieldNorm(doc=2310)
      0.5 = coord(2/4)
    
    Abstract
     Most systems for term association are statistically based; in general they exploit term co-occurrences. A critical overview of statistical approaches in this field is given. A new approach, based on a linguistic analysis of large amounts of textual data, is outlined.
  11. Abu-Salem, H.; Al-Omari, M.; Evens, M.W.: Stemming methodologies over individual query words for an Arabic information retrieval system (1999) 0.10
    0.10351266 = product of:
      0.20702532 = sum of:
        0.07984746 = weight(_text_:term in 3672) [ClassicSimilarity], result of:
          0.07984746 = score(doc=3672,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.3645336 = fieldWeight in 3672, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3672)
        0.12717786 = weight(_text_:frequency in 3672) [ClassicSimilarity], result of:
          0.12717786 = score(doc=3672,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 3672, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3672)
      0.5 = coord(2/4)
    
    Abstract
     Stemming is one of the most important factors that affect the performance of information retrieval systems. This article investigates how to improve the performance of an Arabic information retrieval system by imposing the retrieval method over individual words of a query depending on the importance of the WORD, the STEM, or the ROOT of the query terms in the database. This method, called Mixed Stemming, computes term importance using a weighting scheme that uses the Term Frequency (TF) and the Inverse Document Frequency (IDF), called TFxIDF. An extended version of the Arabic IRS system is designed, implemented, and evaluated to reduce the number of irrelevant documents retrieved. The results of the experiment suggest that the proposed method outperforms the Word index method using the TFxIDF weighting scheme. It also outperforms the Stem index method using the Binary weighting scheme but does not outperform the Stem index method using the TFxIDF weighting scheme, and it likewise outperforms the Root index method using the Binary weighting scheme but not the Root index method using the TFxIDF weighting scheme.
  12. Warner, J.: Linguistics and information theory : analytic advantages (2007) 0.09
    0.087833405 = product of:
      0.17566681 = sum of:
        0.06775281 = weight(_text_:term in 77) [ClassicSimilarity], result of:
          0.06775281 = score(doc=77,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.309317 = fieldWeight in 77, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=77)
        0.107914 = weight(_text_:frequency in 77) [ClassicSimilarity], result of:
          0.107914 = score(doc=77,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.39037234 = fieldWeight in 77, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=77)
      0.5 = coord(2/4)
    
    Abstract
    The analytic advantages of central concepts from linguistics and information theory, and the analogies demonstrated between them, for understanding patterns of retrieval from full-text indexes to documents are developed. The interaction between the syntagm and the paradigm in computational operations on written language in indexing, searching, and retrieval is used to account for transformations of the signified or meaning between documents and their representation and between queries and documents retrieved. Characteristics of the message, and messages for selection for written language, are brought to explain the relative frequency of occurrence of words and multiple word sequences in documents. The examples given in the companion article are revisited and a fuller example introduced. The signified of the sequence stood for, the term classically used in the definitions of the sign, as something standing for something else, can itself change rapidly according to its syntagm. A greater than ordinary discourse understanding of patterns in retrieval is obtained.
  13. Nakagawa, H.; Mori, T.: Automatic term recognition based on statistics of compound nouns and their components (2003) 0.09
    0.08728473 = product of:
      0.17456946 = sum of:
        0.01647956 = product of:
          0.06591824 = sum of:
            0.06591824 = weight(_text_:based in 4123) [ClassicSimilarity], result of:
              0.06591824 = score(doc=4123,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.46604872 = fieldWeight in 4123, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.109375 = fieldNorm(doc=4123)
          0.25 = coord(1/4)
        0.15808989 = weight(_text_:term in 4123) [ClassicSimilarity], result of:
          0.15808989 = score(doc=4123,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.72173965 = fieldWeight in 4123, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.109375 = fieldNorm(doc=4123)
      0.5 = coord(2/4)
    
  14. Ruge, G.: Experiments on linguistically-based term associations (1992) 0.08
    0.08294266 = product of:
      0.16588531 = sum of:
        0.009416891 = product of:
          0.037667565 = sum of:
            0.037667565 = weight(_text_:based in 1810) [ClassicSimilarity], result of:
              0.037667565 = score(doc=1810,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.26631355 = fieldWeight in 1810, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1810)
          0.25 = coord(1/4)
        0.15646842 = weight(_text_:term in 1810) [ClassicSimilarity], result of:
          0.15646842 = score(doc=1810,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.71433705 = fieldWeight in 1810, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=1810)
      0.5 = coord(2/4)
    
    Abstract
     Describes the hyperterm system REALIST (REtrieval Aids by LInguistics and STatistics), focusing on its semantic component. The semantic component of REALIST generates semantic term relations such as synonyms. It takes as input a free-text database and generates as output term pairs that are semantically related with respect to their meanings in the database. In the first step an automatic syntactic analysis provides linguistic knowledge about the terms of the database. In the second step this knowledge is compared by statistical similarity computation. Various experiments with different similarity measures are described.
  15. Hammwöhner, R.: TransRouter revisited : Decision support in the routing of translation projects (2000) 0.08
    0.08171405 = product of:
      0.1634281 = sum of:
        0.00823978 = product of:
          0.03295912 = sum of:
            0.03295912 = weight(_text_:based in 5483) [ClassicSimilarity], result of:
              0.03295912 = score(doc=5483,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23302436 = fieldWeight in 5483, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5483)
          0.25 = coord(1/4)
        0.15518832 = sum of:
          0.11066689 = weight(_text_:assessment in 5483) [ClassicSimilarity], result of:
            0.11066689 = score(doc=5483,freq=2.0), product of:
              0.25917634 = queryWeight, product of:
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.04694356 = queryNorm
              0.4269946 = fieldWeight in 5483, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5483)
          0.04452143 = weight(_text_:22 in 5483) [ClassicSimilarity], result of:
            0.04452143 = score(doc=5483,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.2708308 = fieldWeight in 5483, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5483)
      0.5 = coord(2/4)
    
    Abstract
     This paper gives an outline of the final results of the TransRouter project, in the scope of which a decision support system for translation managers has been developed that supports the selection of appropriate routes for translation projects. Emphasis is put on the decision model, which is based on a stepwise refined assessment of translation routes. The workflow of using this system is considered as well.
    Date
    10.12.2000 18:22:35
  16. Yang, Y.; Lu, Q.; Zhao, T.: A delimiter-based general approach for Chinese term extraction (2009) 0.08
    0.07978749 = product of:
      0.15957499 = sum of:
        0.010194084 = product of:
          0.040776335 = sum of:
            0.040776335 = weight(_text_:based in 3315) [ClassicSimilarity], result of:
              0.040776335 = score(doc=3315,freq=6.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28829288 = fieldWeight in 3315, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3315)
          0.25 = coord(1/4)
        0.1493809 = weight(_text_:term in 3315) [ClassicSimilarity], result of:
          0.1493809 = score(doc=3315,freq=14.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.6819799 = fieldWeight in 3315, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3315)
      0.5 = coord(2/4)
    
    Abstract
    This article addresses a two-step approach for term extraction. In the first step on term candidate extraction, a new delimiter-based approach is proposed to identify features of the delimiters of term candidates rather than those of the term candidates themselves. This delimiter-based method is much more stable and domain independent than the previous approaches. In the second step on term verification, an algorithm using link analysis is applied to calculate the relevance between term candidates and the sentences from which the terms are extracted. All information is obtained from the working domain corpus without the need for prior domain knowledge. The approach is not targeted at any specific domain and there is no need for extensive training when applying it to new domains. In other words, the method is not domain dependent and it is especially useful for resource-limited domains. Evaluations of Chinese text in two different domains show quite significant improvements over existing techniques and also verify its efficiency and its relatively domain-independent nature. The proposed method is also very effective for extracting new terms so that it can serve as an efficient tool for updating domain knowledge, especially for expanding lexicons.
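
A toy version of the candidate-extraction step can make the idea concrete: model the delimiters rather than the terms, and take the token spans between delimiters as term candidates. Here delimiters are crudely approximated as punctuation plus the corpus's most frequent tokens, which is an illustrative assumption rather than the paper's trained delimiter model.

```python
from collections import Counter

def candidates(tokenised_corpus, top_n_delimiters=20):
    # crude delimiter model: punctuation plus the most frequent tokens
    freq = Counter(t for sent in tokenised_corpus for t in sent)
    delimiters = {t for t, _ in freq.most_common(top_n_delimiters)}
    for sent in tokenised_corpus:
        span = []
        for tok in sent + ["."]:  # sentinel to flush the final span
            if tok in delimiters or not tok.isalnum():
                if span:
                    yield tuple(span)  # span between delimiters = candidate
                span = []
            else:
                span.append(tok)

corpus = [["the", "delimiter", "based", "method", "is", "domain", "independent"],
          ["the", "method", "extracts", "term", "candidates"],
          ["the", "candidates", "are", "verified"]]
print(list(candidates(corpus, top_n_delimiters=1)))  # "the" acts as the delimiter
```
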
  17. Magennis, M.: Expert rule-based query expansion (1995) 0.08
    0.07766729 = product of:
      0.15533458 = sum of:
        0.018424708 = product of:
          0.07369883 = sum of:
            0.07369883 = weight(_text_:based in 5181) [ClassicSimilarity], result of:
              0.07369883 = score(doc=5181,freq=10.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.5210583 = fieldWeight in 5181, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5181)
          0.25 = coord(1/4)
        0.13690987 = weight(_text_:term in 5181) [ClassicSimilarity], result of:
          0.13690987 = score(doc=5181,freq=6.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.62504494 = fieldWeight in 5181, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5181)
      0.5 = coord(2/4)
    
    Abstract
     Examines how, for term-based free-text retrieval, Interactive Query Expansion (IQE) provides better retrieval performance than Automatic Query Expansion (AQE), but the performance of IQE depends on the strategy employed by the user to select expansion terms. The aim is to build an expert query expansion system using term selection rules based on expert users' strategies. It is expected that such a system will achieve better performance for novice or inexperienced users than either AQE or IQE. The procedure is to discover expert IQE users' term selection strategies through observation and interrogation, to construct a rule-based query expansion (RQE) system based on these, and to compare the resulting retrieval performance with that of comparable AQE and IQE systems.
  18. Ruge, G.: Sprache und Computer : Wortbedeutung und Termassoziation. Methoden zur automatischen semantischen Klassifikation (1995) 0.08
    0.07659837 = product of:
      0.15319674 = sum of:
        0.12775593 = weight(_text_:term in 1534) [ClassicSimilarity], result of:
          0.12775593 = score(doc=1534,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.58325374 = fieldWeight in 1534, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0625 = fieldNorm(doc=1534)
        0.025440816 = product of:
          0.05088163 = sum of:
            0.05088163 = weight(_text_:22 in 1534) [ClassicSimilarity], result of:
              0.05088163 = score(doc=1534,freq=2.0), product of:
                0.16438834 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04694356 = queryNorm
                0.30952093 = fieldWeight in 1534, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1534)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Content
     Contains the following chapters: (1) Motivation; (2) Language philosophical foundations; (3) Structural comparison of extensions; (4) Earlier approaches towards term association; (5) Experiments; (6) Spreading-activation networks or memory models; (7) Perspective. Appendices: Heads and modifiers of 'car'. Glossary. Index. Language and computer. Word semantics and term association. Methods towards an automatic semantic classification
    Footnote
     Review in: Knowledge organization 22(1995) no.3/4, pp.182-184 (M.T. Rolland)
  19. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.08
    0.07544754 = product of:
      0.15089507 = sum of:
        0.13181446 = product of:
          0.26362893 = sum of:
            0.22367644 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.22367644 = score(doc=562,freq=2.0), product of:
                0.39798802 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.04694356 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
            0.039952483 = weight(_text_:based in 562) [ClassicSimilarity], result of:
              0.039952483 = score(doc=562,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28246817 = fieldWeight in 562, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.5 = coord(2/4)
        0.019080611 = product of:
          0.038161222 = sum of:
            0.038161222 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
              0.038161222 = score(doc=562,freq=2.0), product of:
                0.16438834 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23214069 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well known text corpora support our approach through consistent improvement of the results.
    Content
     Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
    Date
    8. 1.2013 10:22:32
  20. Kishida, K.: Term disambiguation techniques based on target document collection for cross-language information retrieval : an empirical comparison of performance between techniques (2007) 0.07
    0.07033326 = product of:
      0.14066651 = sum of:
        0.014416611 = product of:
          0.057666443 = sum of:
            0.057666443 = weight(_text_:based in 897) [ClassicSimilarity], result of:
              0.057666443 = score(doc=897,freq=12.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.4077077 = fieldWeight in 897, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=897)
          0.25 = coord(1/4)
        0.12624991 = weight(_text_:term in 897) [ClassicSimilarity], result of:
          0.12624991 = score(doc=897,freq=10.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5763782 = fieldWeight in 897, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=897)
      0.5 = coord(2/4)
    
    Abstract
     Dictionary-based query translation for cross-language information retrieval often yields various translation candidates having different meanings for a source term in the query. This paper examines methods for resolving the ambiguity of translations based only on the target document collections. First, we discuss two kinds of disambiguation technique: (1) a method using term co-occurrence statistics in the collection, and (2) a technique based on pseudo-relevance feedback. Next, these techniques are empirically compared using the CLEF 2003 test collection for German to Italian bilingual searches, which are executed using English as a pivot language. The experiments showed that a variation of the term co-occurrence based techniques, in which the best-sequence algorithm for selecting translations is used with the Cosine coefficient, is dominant, and that the PRF method shows comparably high search performance, although statistical tests did not sufficiently support these conclusions. Furthermore, we repeat the same experiments for the case of French to Italian (pivot) and English to Italian (non-pivot) searches on the same CLEF 2003 test collection in order to verify our findings. Again, similar results were observed, except that the Dice coefficient slightly outperforms the Cosine coefficient in the case of disambiguation based on term co-occurrence for English to Italian searches.
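
A greedy sketch of the co-occurrence disambiguation idea: for each source term, keep the translation candidate most strongly associated (here via a cosine association over document frequencies) with the candidates chosen for the other terms. The paper's best-sequence selection is more elaborate; all data below are toy values.

```python
import math

def cosine(t1, t2, df, co_df):
    # cosine association from document frequencies in the target collection:
    # co-occurrence count normalised by sqrt(df(t1) * df(t2))
    denom = math.sqrt(df[t1] * df[t2])
    return co_df.get(frozenset((t1, t2)), 0) / denom if denom else 0.0

def disambiguate(candidates, df, co_df):
    # candidates: one list of candidate translations per source term
    chosen = [cands[0] for cands in candidates]  # arbitrary starting point
    for i, cands in enumerate(candidates):
        others = [t for j, t in enumerate(chosen) if j != i]
        chosen[i] = max(cands,
                        key=lambda c: sum(cosine(c, o, df, co_df) for o in others))
    return chosen

df = {"bank": 900, "shore": 200, "finance": 400, "money": 600}
co_df = {frozenset(("bank", "finance")): 150, frozenset(("shore", "finance")): 5,
         frozenset(("bank", "money")): 200, frozenset(("shore", "money")): 8}
# "bank" wins over "shore" because it co-occurs with the other translations
print(disambiguate([["bank", "shore"], ["finance"], ["money"]], df, co_df))
```
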

Languages

  • e 258
  • d 22
  • chi 2
  • ru 2
  • f 1
  • m 1

Types

  • a 245
  • el 22
  • m 17
  • s 10
  • p 6
  • x 4
  • d 1
  • pat 1
  • r 1
