Search (6 results, page 1 of 1)

  • author_ss:"Tseng, Y.-H."
  1. Tseng, Y.-H.: Keyword extraction techniques and relevance feedback (1997) 0.01
    0.0066157626 = product of:
      0.046310335 = sum of:
        0.009988253 = weight(_text_:information in 1830) [ClassicSimilarity], result of:
          0.009988253 = score(doc=1830,freq=4.0), product of:
            0.052020688 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.029633347 = queryNorm
            0.1920054 = fieldWeight in 1830, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1830)
        0.036322083 = weight(_text_:retrieval in 1830) [ClassicSimilarity], result of:
          0.036322083 = score(doc=1830,freq=6.0), product of:
            0.08963835 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.029633347 = queryNorm
            0.40520695 = fieldWeight in 1830, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1830)
      0.14285715 = coord(2/14)
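
     The explain tree above is standard Lucene ClassicSimilarity (TF-IDF) output: for each matching term, queryWeight = idf × queryNorm and fieldWeight = tf × idf × fieldNorm with tf(freq) = √freq; the per-term products are summed and multiplied by the coordination factor coord(2/14), since 2 of the 14 query clauses matched. A minimal sketch reproducing the top-level score from the components shown above (the function name and structure are illustrative, not Lucene API):

```python
import math

def classic_similarity_score(terms, matched_clauses, total_clauses):
    """Recompute a Lucene ClassicSimilarity score from explain components.

    terms: iterable of (freq, idf, query_norm, field_norm) per matching term.
    """
    total = 0.0
    for freq, idf, query_norm, field_norm in terms:
        query_weight = idf * query_norm        # queryWeight = idf * queryNorm
        tf = math.sqrt(freq)                   # tf(freq) = sqrt(freq)
        field_weight = tf * idf * field_norm   # fieldWeight = tf * idf * fieldNorm
        total += query_weight * field_weight   # weight(term) = queryWeight * fieldWeight
    return total * (matched_clauses / total_clauses)  # coord factor

# Components of result 1 (doc 1830): _text_:information and _text_:retrieval.
score = classic_similarity_score(
    [(4.0, 1.7554779, 0.029633347, 0.0546875),
     (6.0, 3.024915, 0.029633347, 0.0546875)],
    matched_clauses=2, total_clauses=14,
)
print(score)  # ~0.0066157626, matching the explain output above
```

     The same computation explains every score in this result list; only the term statistics and field norms change from entry to entry.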
    
    Abstract
     Automatic keyword extraction is an important and fundamental technology in advanced information retrieval systems. Briefly compares several major keyword extraction methods, lists their advantages and disadvantages, and reports recent research progress in Taiwan. Also describes the application of a keyword extraction algorithm in an information retrieval system for relevance feedback. Preliminary analysis shows that the error rate of extracting relevant keywords is 18%, and that the precision rate is over 50%. The main disadvantage of this approach is that the extraction results depend on the retrieval results, which in turn depend on the data held by the database. Apart from collecting more data, this problem can be alleviated by applying a thesaurus constructed with the same keyword extraction algorithm.
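
     The abstract does not spell out the extraction algorithm itself, so the following is only a generic sketch of the feedback loop it describes, with a toy frequency-based extractor standing in for the paper's method; all names are hypothetical:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def extract_keywords(docs, top_k=5):
    """Toy frequency-based extractor; a stand-in for the paper's algorithm."""
    counts = Counter(
        tok
        for doc in docs
        for tok in re.findall(r"[a-z]+", doc.lower())
        if tok not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(top_k)]

def feedback_query(query_terms, top_ranked_docs):
    """Relevance feedback: expand the query with keywords from the results."""
    extracted = extract_keywords(top_ranked_docs)
    return query_terms + [kw for kw in extracted if kw not in query_terms]
```

     The dependency the abstract warns about is visible here: whatever the first retrieval returns determines what gets extracted and fed back.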
  2. Tseng, Y.-H.: Automatic cataloguing and searching for retrospective data by use of OCR text (2001) 0.01
    0.0059455284 = product of:
      0.041618697 = sum of:
        0.0104854815 = weight(_text_:information in 5421) [ClassicSimilarity], result of:
          0.0104854815 = score(doc=5421,freq=6.0), product of:
            0.052020688 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.029633347 = queryNorm
            0.20156369 = fieldWeight in 5421, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=5421)
        0.031133216 = weight(_text_:retrieval in 5421) [ClassicSimilarity], result of:
          0.031133216 = score(doc=5421,freq=6.0), product of:
            0.08963835 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.029633347 = queryNorm
            0.34732026 = fieldWeight in 5421, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.046875 = fieldNorm(doc=5421)
      0.14285715 = coord(2/14)
    
    Abstract
     This article describes our efforts in supporting information retrieval from OCR-degraded text. In particular, we report our approach to an automatic cataloguing and searching contest for books in multiple languages. In this contest, 500 books in English, German, French, and Italian, published from the 1770s to the 1970s, are scanned into images and OCRed into digital text. The goal is to extract information for sophisticated searching using only automatic methods. We adopted the vector space retrieval model, an n-gram indexing method, and a special weighting scheme to tackle this problem. Although the performance of this approach is slightly inferior to that of the best approach, which is mainly based on regular expression matching, one advantage of ours is that it is less language dependent and less layout sensitive, and thus readily applicable to other languages and document collections. Problems of OCR text retrieval for some Asian languages are also discussed, and solutions are suggested.
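
     The n-gram indexing method is what buys the language independence and OCR robustness: a misrecognized character corrupts only the few overlapping grams that cross it, so most grams of a word still match. A minimal sketch of character n-gram matching, assuming trigrams and Dice overlap (the paper's special weighting scheme is not reproduced here):

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams; no dictionary, so language independent."""
    text = "".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def gram_overlap(query, ocr_text, n=3):
    """Dice coefficient between n-gram sets; tolerant of isolated OCR errors."""
    q, d = char_ngrams(query, n), char_ngrams(ocr_text, n)
    return 2 * len(q & d) / (len(q) + len(d)) if q and d else 0.0

# "catalogue" vs. the OCR-garbled "cata1ogue" still share 4 of 7 trigrams each.
print(gram_overlap("catalogue", "cata1ogue"))  # ~0.571 rather than 0
```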
    Source
     Journal of the American Society for Information Science and Technology. 52(2001) no.5, pp.378-390
  3. Lee, L.-H.; Juan, Y.-C.; Tseng, W.-L.; Chen, H.-H.; Tseng, Y.-H.: Mining browsing behaviors for objectionable content filtering (2015) 0.01
    0.0050347717 = product of:
      0.0352434 = sum of:
        0.03019857 = weight(_text_:web in 1818) [ClassicSimilarity], result of:
          0.03019857 = score(doc=1818,freq=6.0), product of:
            0.09670874 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.029633347 = queryNorm
            0.3122631 = fieldWeight in 1818, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1818)
        0.0050448296 = weight(_text_:information in 1818) [ClassicSimilarity], result of:
          0.0050448296 = score(doc=1818,freq=2.0), product of:
            0.052020688 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.029633347 = queryNorm
            0.09697737 = fieldWeight in 1818, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1818)
      0.14285715 = coord(2/14)
    
    Abstract
     This article explores users' browsing intents to predict the category of a user's next access during web surfing, and applies the results to filtering objectionable content such as pornography, gambling, violence, and drugs. Users' access trails, in terms of category sequences in click-through data, are employed to mine users' web browsing behaviors. Contextual relationships between URL categories are learned by a hidden Markov model. The top-level domains (TLDs) extracted from the URLs themselves, together with their corresponding categories, are captured by a TLD model. Given a URL to be predicted, its TLD and the current context are empirically combined in an aggregation model. In addition to the current context, predictions for the same URL accessed previously in different contexts by various users are combined by majority rule to improve the aggregation model. Large-scale experiments show that the advanced aggregation approach achieves promising performance while maintaining an acceptably low false-positive rate. Different strategies are introduced to integrate the model with the blacklist it generates, filtering objectionable web pages without analyzing their content. In practice, this complements existing content analysis with a user-behavior perspective.
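
     As a rough sketch of the contextual component, a first-order transition model over URL categories can stand in for the paper's hidden Markov model, blended with a TLD-based prior; the linear blend and the alpha weight are illustrative assumptions (the paper combines the two models empirically):

```python
from collections import Counter, defaultdict

def train_transitions(trails):
    """Estimate P(next_category | current_category) from click-through trails."""
    counts = defaultdict(Counter)
    for trail in trails:
        for cur, nxt in zip(trail, trail[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for cur, nxts in counts.items()
    }

def predict_next(context_model, tld_model, current_cat, tld, alpha=0.5):
    """Blend context and TLD evidence; alpha is a hypothetical mixing weight."""
    ctx = context_model.get(current_cat, {})
    prior = tld_model.get(tld, {})
    cats = set(ctx) | set(prior)
    scores = {c: alpha * ctx.get(c, 0.0) + (1 - alpha) * prior.get(c, 0.0)
              for c in cats}
    return max(scores, key=scores.get) if scores else None
```

     A predicted objectionable category can then feed the blacklist described in the abstract, so pages are filtered without fetching or analyzing their content.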
    Source
     Journal of the Association for Information Science and Technology. 66(2015) no.5, pp.930-942
  4. Tseng, Y.-H.: Solving vocabulary problems with interactive query expansion (1998) 0.00
    0.004274482 = product of:
      0.029921371 = sum of:
        0.008737902 = weight(_text_:information in 5159) [ClassicSimilarity], result of:
          0.008737902 = score(doc=5159,freq=6.0), product of:
            0.052020688 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.029633347 = queryNorm
            0.16796975 = fieldWeight in 5159, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5159)
        0.021183468 = weight(_text_:retrieval in 5159) [ClassicSimilarity], result of:
          0.021183468 = score(doc=5159,freq=4.0), product of:
            0.08963835 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.029633347 = queryNorm
            0.23632148 = fieldWeight in 5159, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5159)
      0.14285715 = coord(2/14)
    
    Abstract
     One of the major causes of search failures in information retrieval systems is vocabulary mismatch. Presents a solution to the vocabulary problem through two strategies known as term suggestion (TS) and term relevance feedback (TRF). In TS, collection-specific terms are extracted from the text collection. These terms and their frequencies constitute the keyword database for suggesting terms in response to users' queries. One effect of this term suggestion is that it functions as a dynamic directory when the query is a general term with broad meaning. In TRF, terms extracted from the top-ranked documents retrieved by the previous query are shown to users for relevance feedback. In the experiment, interactive TS provides very high precision rates while achieving recall rates similar to those of n-gram matching. Local TRF improves both precision and recall in a full-text news database, and degrades slightly in recall in bibliographic databases due to the very limited source of information for feedback. In terms of van Rijsbergen's combined measure of recall and precision, both TS and TRF perform better than n-gram matching, which implies that the larger improvement in precision compensates for the slight degradation in recall for TS and TRF.
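
     The combined measure referred to is van Rijsbergen's, which folds precision and recall into one score and so makes "a large precision gain outweighs a slight recall loss" checkable. A short sketch with hypothetical numbers (not figures from the paper):

```python
def f_measure(precision, recall, beta=1.0):
    """van Rijsbergen's combined measure: F = (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values only: a larger precision gain outweighs a small recall loss.
print(f_measure(0.50, 0.60))  # ~0.545 (baseline)
print(f_measure(0.70, 0.58))  # ~0.634 (higher precision, slightly lower recall)
```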
    Source
     Journal of Library and Information Science. 24(1998) no.1, pp.1-18
    Theme
     Semantic environment in indexing and retrieval
  5. Tseng, Y.-H.; Lin, C.-J.; Lin, Y.-I.: Text mining techniques for patent analysis (2007) 0.00
    5.0960475E-4 = product of:
      0.0071344664 = sum of:
        0.0071344664 = weight(_text_:information in 935) [ClassicSimilarity], result of:
          0.0071344664 = score(doc=935,freq=4.0), product of:
            0.052020688 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.029633347 = queryNorm
            0.13714671 = fieldWeight in 935, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=935)
      0.071428575 = coord(1/14)
    
    Abstract
     Patent documents contain important research results. However, they are lengthy and rich in technical terminology, so analyzing them takes considerable human effort. Automatic tools for assisting patent engineers or decision makers in patent analysis are in great demand. This paper describes a series of text mining techniques that conform to the analytical process used by patent analysts. These techniques include text segmentation, summary extraction, feature selection, term association, cluster generation, topic identification, and information mapping. The issues of efficiency and effectiveness are considered in the design of these techniques. Important features of the proposed methodology include a rigorous approach to verifying the usefulness of segment extracts as document surrogates, a corpus- and dictionary-free algorithm for keyphrase extraction, an efficient co-word analysis method that can be applied to a large volume of patents, and an automatic procedure for creating generic cluster titles for ease of result interpretation. An evaluation of these techniques was conducted. The results confirm that the machine-generated summaries preserve more important content words than some other sections for classification. To demonstrate feasibility, the proposed methodology was applied to a real-world patent set for domain analysis and mapping, which shows that our approach is more effective than existing classification systems. The attempt in this paper to automate the whole process not only helps create final patent maps for topic analyses, but also facilitates or improves other patent analysis tasks such as patent classification, organization, knowledge sharing, and prior art searches.
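
     Of the listed steps, co-word analysis is the easiest to make concrete: count how often term pairs co-occur in the same document and normalize the count into an association strength. The sketch below assumes the standard equivalence coefficient c_ij^2 / (c_i * c_j); the paper's own, efficiency-oriented variant is not specified here:

```python
from collections import Counter
from itertools import combinations

def coword_counts(docs_terms):
    """Count term and term-pair occurrences across documents."""
    pair_counts, term_counts = Counter(), Counter()
    for terms in docs_terms:
        uniq = sorted(set(terms))       # each pair counted once per document
        term_counts.update(uniq)
        pair_counts.update(combinations(uniq, 2))
    return pair_counts, term_counts

def association(pair_counts, term_counts):
    """Equivalence coefficient: c_ij^2 / (c_i * c_j), in [0, 1]."""
    return {
        (a, b): c * c / (term_counts[a] * term_counts[b])
        for (a, b), c in pair_counts.items()
    }
```

     The resulting association matrix is the usual input to the cluster generation and mapping steps listed in the abstract.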
    Source
     Information Processing and Management. 43(2007) no.5, pp.1216-1247
  6. Tseng, Y.-H.: Automatic thesaurus generation for Chinese documents (2002) 0.00
    3.6034497E-4 = product of:
      0.0050448296 = sum of:
        0.0050448296 = weight(_text_:information in 5226) [ClassicSimilarity], result of:
          0.0050448296 = score(doc=5226,freq=2.0), product of:
            0.052020688 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.029633347 = queryNorm
            0.09697737 = fieldWeight in 5226, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5226)
      0.071428575 = coord(1/14)
    
    Source
     Journal of the American Society for Information Science and Technology. 53(2002) no.13, pp.1130-1138