Search (1 result, page 1 of 1)

Tseng, Y.-H.: Automatic thesaurus generation for Chinese documents (2002) 0.01

0.008743494 = product of:
  0.017486988 = sum of:
    0.017486988 = product of:
      0.02623048 = sum of:
        0.010627648 = weight(_text_:a in 5226) [ClassicSimilarity], result of:
          0.010627648 = score(doc=5226,freq=20.0), product of:
            0.052761257 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.045758117 = queryNorm
            0.20142901 = fieldWeight in 5226, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5226)
        0.015602832 = weight(_text_:h in 5226) [ClassicSimilarity], result of:
          0.015602832 = score(doc=5226,freq=2.0), product of:
            0.113683715 = queryWeight, product of:
              2.4844491 = idf(docFreq=10020, maxDocs=44218)
              0.045758117 = queryNorm
            0.13724773 = fieldWeight in 5226, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.4844491 = idf(docFreq=10020, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5226)
      0.6666667 = coord(2/3)
  0.5 = coord(1/2)
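
The explanation tree above can be reproduced by hand. The following is a minimal sketch, assuming the standard ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names `idf` and `term_score` are illustrative and are not part of any library. The constants queryNorm, fieldNorm, and the two coord factors are copied from the tree rather than derived.

```python
import math

# Values copied from the explanation tree above.
QUERY_NORM = 0.045758117
FIELD_NORM = 0.0390625
MAX_DOCS = 44218


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assuming idf = 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int) -> float:
    """Score of one term: queryWeight * fieldWeight, as laid out in the tree."""
    term_idf = idf(doc_freq, MAX_DOCS)
    query_weight = term_idf * QUERY_NORM            # idf * queryNorm
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM  # tf * idf * fieldNorm
    return query_weight * field_weight


score_a = term_score(freq=20.0, doc_freq=37942)  # term "a"
score_h = term_score(freq=2.0, doc_freq=10020)   # term "h"

# Sum of the term scores, scaled by the two coord factors from the tree.
total = (score_a + score_h) * (2.0 / 3.0) * 0.5

print(f"a: {score_a:.6f}  h: {score_h:.6f}  total: {total:.6f}")
```

The result should agree with the listed 0.008743494 up to small rounding differences, since the tree shows single-precision values while this sketch uses double precision.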

Abstract: Tseng constructs a thesaurus based on word co-occurrence through the automatic analysis of Chinese text. Words are identified by a longest dictionary match, supplemented by a keyword extraction algorithm that merges back nearby tokens and accepts shorter strings of characters if they occur more often than the longest string. Single-character auxiliary words are a major source of error, but these errors can be greatly reduced with a stop list of 70 characters and 2,680 words. Extracted terms, with their associated document weights, are sorted by decreasing frequency, and the terms at the top of this list are associated using a Dice coefficient that is computed on the weights of term pairs and modified to account for longer documents. To reduce computation time, co-occurrence is measured not over the document as a whole but within paragraph- or sentence-sized sections. A window of 29 characters or 11 words was found to be sufficient. A thesaurus was produced from 25,230 Chinese news articles, and judges were asked to review the top 50 terms associated with each of 30 single-word query terms. They determined 69% to be relevant.
Type: a
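
The section-level co-occurrence association described in the abstract can be illustrated with a small sketch. This is an assumption-laden simplification: it uses the plain Dice coefficient on binary term presence per section, not the weight-based, length-adjusted variant the abstract attributes to Tseng, whose exact form is not given here. The function name `dice_associations` and the sample sections are hypothetical.

```python
from collections import Counter
from itertools import combinations


def dice_associations(sections):
    """Plain Dice coefficient for every term pair that shares a section.

    `sections` is a list of term lists, one per paragraph- or
    sentence-sized section. Dice(a, b) = 2 * n_ab / (n_a + n_b), where
    n_a and n_b count the sections containing each term and n_ab counts
    the sections containing both.
    """
    term_count = Counter()
    pair_count = Counter()
    for terms in sections:
        unique = sorted(set(terms))  # count each term once per section
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return {
        (a, b): 2.0 * n / (term_count[a] + term_count[b])
        for (a, b), n in pair_count.items()
    }


# Hypothetical, already segmented sections (illustrative data only).
sections = [
    ["thesaurus", "term", "weight"],
    ["thesaurus", "term"],
    ["term", "stop list"],
]

scores = dice_associations(sections)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Restricting the pair counting to short sections, rather than whole documents, keeps the number of candidate pairs small, which is the computational motivation the abstract gives.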