Schneider, J.W.; Borlund, P.: A bibliometric-based semiautomatic approach to identification of candidate thesaurus terms : parsing and filtering of noun phrases from citation contexts (2005)
0.00
0.0039484566 = product of:
  0.0302715 = sum of:
    0.00895379 = weight(_text_:und in 156) [ClassicSimilarity], result of:
      0.00895379 = score(doc=156,freq=2.0), product of:
        0.052235067 = queryWeight, product of:
          2.216367 = idf(docFreq=13101, maxDocs=44218)
          0.023567878 = queryNorm
        0.17141339 = fieldWeight in 156, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.216367 = idf(docFreq=13101, maxDocs=44218)
          0.0546875 = fieldNorm(doc=156)
    0.010141784 = product of:
      0.020283569 = sum of:
        0.020283569 = weight(_text_:international in 156) [ClassicSimilarity], result of:
          0.020283569 = score(doc=156,freq=2.0), product of:
            0.078619614 = queryWeight, product of:
              3.33588 = idf(docFreq=4276, maxDocs=44218)
              0.023567878 = queryNorm
            0.2579963 = fieldWeight in 156, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.33588 = idf(docFreq=4276, maxDocs=44218)
              0.0546875 = fieldNorm(doc=156)
      0.5 = coord(1/2)
    0.011175927 = product of:
      0.022351854 = sum of:
        0.022351854 = weight(_text_:22 in 156) [ClassicSimilarity], result of:
          0.022351854 = score(doc=156,freq=2.0), product of:
            0.08253069 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.023567878 = queryNorm
            0.2708308 = fieldWeight in 156, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=156)
      0.5 = coord(1/2)
  0.13043478 = coord(3/23)
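The score explanation for this record can be reproduced by hand. The following is a minimal sketch, assuming the standard Lucene ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names `classic_idf` and `term_score` are illustrative and not part of Lucene. The input values (docFreq, maxDocs, queryNorm, fieldNorm, coord) are copied from the explanation for doc 156.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency as used by Lucene's ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Score of one term: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm                      # idf * queryNorm
    field_weight = math.sqrt(freq) * idf * field_norm    # tf * idf * fieldNorm
    return query_weight * field_weight


# Values copied from the explanation for doc 156.
MAX_DOCS = 44218
QUERY_NORM = 0.023567878
FIELD_NORM = 0.0546875

und = term_score(2.0, 13101, MAX_DOCS, QUERY_NORM, FIELD_NORM)
# The next two terms sit in nested clauses that each carry coord(1/2).
international = 0.5 * term_score(2.0, 4276, MAX_DOCS, QUERY_NORM, FIELD_NORM)
twenty_two = 0.5 * term_score(2.0, 3622, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# Three of the 23 query clauses matched, hence coord(3/23).
total = (und + international + twenty_two) * (3 / 23)
print(f"{total:.10f}")
```

Lucene computes these values in 32-bit floats, so the last digits may differ slightly from a double-precision recomputation.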
- Date
- 8. 3.2007 19:55:22
- Source
- Context: nature, impact and role. 5th International Conference on Conceptions of Library and Information Sciences, CoLIS 2005 Glasgow, UK, June 2005. Ed. by F. Crestani and I. Ruthven
- Theme
- Design and application of the thesaurus principle (Konzeption und Anwendung des Prinzips Thesaurus)
Tseng, Y.-H.: Automatic thesaurus generation for Chinese documents (2002)
0.00
0.0012565941 = product of:
  0.014450832 = sum of:
    0.0063955644 = weight(_text_:und in 5226) [ClassicSimilarity], result of:
      0.0063955644 = score(doc=5226,freq=2.0), product of:
        0.052235067 = queryWeight, product of:
          2.216367 = idf(docFreq=13101, maxDocs=44218)
          0.023567878 = queryNorm
        0.12243814 = fieldWeight in 5226, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          2.216367 = idf(docFreq=13101, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5226)
    0.008055268 = product of:
      0.016110536 = sum of:
        0.016110536 = weight(_text_:29 in 5226) [ClassicSimilarity], result of:
          0.016110536 = score(doc=5226,freq=2.0), product of:
            0.08290443 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.023567878 = queryNorm
            0.19432661 = fieldWeight in 5226, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5226)
      0.5 = coord(1/2)
  0.08695652 = coord(2/23)
- Abstract
- Tseng constructs a thesaurus based on word co-occurrence through automatic analysis of Chinese text. Words are identified by a longest dictionary match, supplemented by a keyword extraction algorithm that merges back nearby tokens and accepts shorter strings of characters if they occur more often than the longest string. Single-character auxiliary words are a major source of error, but this can be greatly reduced by using a 70-character, 2,680-word stop list. Extracted terms with their associated document weights are sorted by decreasing frequency, and the terms at the top of this list are associated using a Dice coefficient, modified to account for longer documents, applied to the weights of term pairs. Co-occurrence is measured not over the document as a whole but within paragraph- or sentence-sized sections, in order to reduce computation time. A window of 29 characters or 11 words was found to be sufficient. A thesaurus was produced from 25,230 Chinese news articles, and judges were asked to review the top 50 terms associated with each of 30 single-word query terms. They determined 69% to be relevant.
- Theme
- Design and application of the thesaurus principle (Konzeption und Anwendung des Prinzips Thesaurus)
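The windowed co-occurrence step described in the abstract above can be illustrated with a small sketch. It uses the plain Dice coefficient, 2 * n(a, b) / (n(a) + n(b)), counted over windows; the abstract mentions a modified coefficient that accounts for longer documents and term weights, but does not give its form, so that modification is not reproduced here. The function name `dice_associations` and the toy windows are hypothetical.

```python
from collections import Counter
from itertools import combinations


def dice_associations(windows):
    """Plain Dice coefficient for every term pair that shares a window.

    `windows` is an iterable of term lists, one per sentence- or
    paragraph-sized section (an assumption made for this sketch).
    """
    term_count = Counter()   # number of windows containing each term
    pair_count = Counter()   # number of windows containing both terms
    for window in windows:
        terms = sorted(set(window))          # count each term once per window
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))
    return {
        (a, b): 2.0 * both / (term_count[a] + term_count[b])
        for (a, b), both in pair_count.items()
    }


# Toy input: three windows of already-extracted terms.
windows = [
    ["thesaurus", "term", "weight"],
    ["thesaurus", "term"],
    ["term", "window"],
]
scores = dice_associations(windows)
for pair, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(value, 3))
```

Counting within windows rather than whole documents keeps the number of candidate pairs small, which matches the abstract's stated motivation of reducing computation time.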