Search (2 results, page 1 of 1)

  • author_ss:"Tseng, Y.-H."
  1. Tseng, Y.-H.: Automatic thesaurus generation for Chinese documents (2002) 0.02
    0.024234196 = product of:
      0.12117098 = sum of:
        0.12117098 = weight(_text_:thesaurus in 5226) [ClassicSimilarity], result of:
          0.12117098 = score(doc=5226,freq=8.0), product of:
            0.23732872 = queryWeight, product of:
              4.6210785 = idf(docFreq=1182, maxDocs=44218)
              0.051357865 = queryNorm
            0.5105618 = fieldWeight in 5226, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.6210785 = idf(docFreq=1182, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5226)
      0.2 = coord(1/5)
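    The explain tree above can be recomputed by hand. The following is a minimal sketch, assuming the standard ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function name `classic_score` and its parameter names are illustrative, not part of the search system. The input values are copied from the tree. The engine works in single-precision floats, so the last digits may differ slightly.

```python
import math

def classic_score(freq, doc_freq, max_docs, field_norm, query_norm, coord):
    """Recompute a single-term ClassicSimilarity score from explain values.

    Hypothetical helper for illustration; argument names are assumptions.
    """
    tf = math.sqrt(freq)                              # tf(freq=8.0) -> 2.828427
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))   # idf(docFreq, maxDocs)
    query_weight = idf * query_norm                   # queryWeight
    field_weight = tf * idf * field_norm              # fieldWeight
    # coord is the fraction of query clauses that matched (here 1 of 5).
    return query_weight * field_weight * coord

# Values for doc 5226, taken from the explain output above.
score_5226 = classic_score(
    freq=8.0,
    doc_freq=1182,
    max_docs=44218,
    field_norm=0.0390625,
    query_norm=0.051357865,
    coord=1 / 5,
)
print(score_5226)
```

    The result should agree with the reported 0.024234196 to within float rounding.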
    
    Abstract
    Tseng constructs a thesaurus based on word co-occurrence through automatic analysis of Chinese text. Words are identified by a longest dictionary match, supplemented by a keyword extraction algorithm that merges nearby tokens back together and accepts shorter character strings if they occur more often than the longest string. Single-character auxiliary words are a major source of error, but this can be greatly reduced with the use of a 70-character, 2,680-word stop list. Extracted terms with their associated document weights are sorted by decreasing frequency, and the terms at the top of this list are associated using a Dice coefficient, modified to account for longer documents, on the weights of term pairs. Co-occurrence is measured not across the document as a whole but within paragraph- or sentence-sized sections, in order to reduce computation time. A window of 29 characters or 11 words was found to be sufficient. A thesaurus was produced from 25,230 Chinese news articles, and judges were asked to review the top 50 terms associated with each of 30 single-word query terms. They determined 69% to be relevant.
    Theme
    Conception and application of the thesaurus principle
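    The co-occurrence step described in the abstract above can be illustrated with a small sketch. It uses the plain, unweighted Dice coefficient, 2 * |A and B| / (|A| + |B|), counted over sentence- or paragraph-sized sections. The abstract does not spell out the length-adjusted, weight-based variant used in the paper, so that variant is not reproduced here. The function name `dice_associations` and the English sample terms are hypothetical.

```python
from collections import Counter
from itertools import combinations

def dice_associations(sections):
    """Plain Dice coefficient for every term pair that co-occurs in a section.

    sections: list of term lists, one per sentence or paragraph.
    This is the textbook coefficient, not the modified one from the paper.
    """
    term_count = Counter()   # number of sections containing each term
    pair_count = Counter()   # number of sections containing both terms
    for terms in sections:
        unique = sorted(set(terms))          # count each term once per section
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return {
        (a, b): 2 * n / (term_count[a] + term_count[b])
        for (a, b), n in pair_count.items()
    }

# Toy input: three short "sections" of already-extracted terms.
sections = [
    ["thesaurus", "term", "weight"],
    ["thesaurus", "term"],
    ["term", "document"],
]
scores = dice_associations(sections)

# Print pairs from strongest to weakest association.
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

    Restricting the count to sections rather than whole documents keeps the number of candidate pairs small, which is the computational motivation given in the abstract.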
  2. Tseng, Y.-H.: Keyword extraction techniques and relevance feedback (1997) 0.02
    0.01696394 = product of:
      0.0848197 = sum of:
        0.0848197 = weight(_text_:thesaurus in 1830) [ClassicSimilarity], result of:
          0.0848197 = score(doc=1830,freq=2.0), product of:
            0.23732872 = queryWeight, product of:
              4.6210785 = idf(docFreq=1182, maxDocs=44218)
              0.051357865 = queryNorm
            0.3573933 = fieldWeight in 1830, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.6210785 = idf(docFreq=1182, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1830)
      0.2 = coord(1/5)
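    Both results share the same queryWeight, idf, and coord, so their ranking is decided by tf and fieldNorm alone. A short check, using only numbers copied from the two explain trees (variable names are illustrative):

```python
import math

# Term-frequency factor: tf = sqrt(freq), doc 5226 (freq=8) vs doc 1830 (freq=2).
tf_ratio = math.sqrt(8.0) / math.sqrt(2.0)

# fieldNorm factor: doc 5226 has the smaller norm, which offsets part of its
# tf advantage.
norm_ratio = 0.0390625 / 0.0546875

# idf, queryNorm, and coord cancel, so this product is the expected score ratio.
expected_ratio = tf_ratio * norm_ratio

# Ratio of the final scores as reported in the result list.
reported_ratio = 0.024234196 / 0.01696394

print(expected_ratio, reported_ratio)
```

    The two ratios should agree closely. Doc 5226 has four times the term frequency, which doubles tf, while its lower fieldNorm reduces the overall advantage to roughly 1.43.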
    
    Abstract
    Automatic keyword extraction is an important and fundamental technology in advanced information retrieval systems. Briefly compares several major keyword extraction methods, lists their advantages and disadvantages, and reports recent research progress in Taiwan. Also describes the application of a keyword extraction algorithm in an information retrieval system for relevance feedback. Preliminary analysis shows that the error rate of extracting relevant keywords is 18%, and that the precision rate is over 50%. The main disadvantage of this approach is that the extraction results depend on the retrieval results, which in turn depend on the data held by the database. Apart from collecting more data, this problem can be alleviated by the application of a thesaurus constructed by the same keyword extraction algorithm.