Search (5 results, page 1 of 1)

Ko, Y.; Park, J.; Seo, J.: Improving text categorization using the importance of sentences (2004) 0.00
```
0.0028703054 = product of:
  0.005740611 = sum of:
    0.005740611 = product of:
      0.011481222 = sum of:
        0.011481222 = weight(_text_:a in 2557) [ClassicSimilarity], result of:
          0.011481222 = score(doc=2557,freq=16.0), product of:
            0.053105544 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046056706 = queryNorm
            0.2161963 = fieldWeight in 2557, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=2557)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
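The score breakdown above can be checked by hand. The following is a minimal sketch, not the search engine's own code: the function names `classic_idf` and `classic_term_score` are hypothetical, and the idf formula `1 + ln(maxDocs / (docFreq + 1))` is an assumption inferred from the values shown (it reproduces `1.153047` for `docFreq=37942, maxDocs=44218`). The `tf` value is taken as the square root of the term frequency, which matches `4.0 = tf(freq=16.0)`.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed idf form, inferred from the numbers in the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_term_score(freq: float, doc_freq: int, max_docs: int,
                       query_norm: float, field_norm: float) -> float:
    tf = math.sqrt(freq)                    # 4.0 for freq=16.0
    idf = classic_idf(doc_freq, max_docs)   # about 1.153047
    query_weight = idf * query_norm         # idf * queryNorm
    field_weight = tf * idf * field_norm    # tf * idf * fieldNorm
    return query_weight * field_weight


# Values copied from the explain output for doc 2557.
raw = classic_term_score(freq=16.0, doc_freq=37942, max_docs=44218,
                         query_norm=0.046056706, field_norm=0.046875)

# The two coord(1/2) factors each halve the term score.
final = raw * 0.5 * 0.5
print(f"term score: {raw:.9f}, final score: {final:.10f}")
```

Because the idf term appears in both `queryWeight` and `fieldWeight`, it enters the term score squared. With an idf this close to 1, the matched term occurs in most of the collection, so the ranking among these results is driven almost entirely by `tf` and `fieldNorm`.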
Abstract

Automatic text categorization is the problem of assigning text documents to pre-defined categories. In order to classify text documents, we must extract useful features. In previous research, a text document is commonly represented by the term frequency and the inverse document frequency of each feature. Since there is a difference between important sentences and unimportant sentences in a document, the features from more important sentences should be weighted more heavily than other features. In this paper, we measure the importance of sentences using text summarization techniques. Then we represent a document as a vector of features with different weights according to the importance of each sentence. To verify our new method, we conduct experiments using newsgroup data sets in two languages: one written in English and the other written in Korean. Four kinds of classifiers are used in our experiments: Naive Bayes, Rocchio, k-NN, and SVM. We observe that our new method yields a significant improvement for all these classifiers on both data sets.

Type

a
Ko, Y.; Seo, J.: Text classification from unlabeled documents with bootstrapping and feature projection techniques (2009) 0.00
```
0.0028703054 = product of:
  0.005740611 = sum of:
    0.005740611 = product of:
      0.011481222 = sum of:
        0.011481222 = weight(_text_:a in 2452) [ClassicSimilarity], result of:
          0.011481222 = score(doc=2452,freq=16.0), product of:
            0.053105544 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046056706 = queryNorm
            0.2161963 = fieldWeight in 2452, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=2452)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, supervised learning approaches have some problems. The most notable problem is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are plentiful and easily collected, labeled documents are difficult to generate because the labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then it automatically learns a text classifier by using bootstrapping and feature projection techniques. The results of experiments showed that the proposed method achieved reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.

Type

a
Kim, S.; Ko, Y.; Oard, D.W.: Combining lexical and statistical translation evidence for cross-language information retrieval (2015) 0.00
```
0.0028703054 = product of:
  0.005740611 = sum of:
    0.005740611 = product of:
      0.011481222 = sum of:
        0.011481222 = weight(_text_:a in 1606) [ClassicSimilarity], result of:
          0.011481222 = score(doc=1606,freq=16.0), product of:
            0.053105544 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046056706 = queryNorm
            0.2161963 = fieldWeight in 1606, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=1606)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

This article explores how best to use lexical and statistical translation evidence together for cross-language information retrieval (CLIR). Lexical translation evidence is assembled from Wikipedia and from a large machine-readable dictionary, statistical translation evidence is drawn from parallel corpora, and evidence from co-occurrence in the document language provides a basis for limiting the adverse effect of translation ambiguity. Coverage statistics for NII Testbeds and Community for Information Access Research (NTCIR) queries confirm that these resources have complementary strengths. Experiments with translation evidence from a small parallel corpus indicate that even rather rough estimates of translation probabilities can yield further improvements over a strong technique for translation weighting based on using Jensen-Shannon divergence as a term-association measure. Finally, a novel approach to posttranslation query expansion using a random walk over the Wikipedia concept link graph is shown to yield further improvements over alternative techniques for posttranslation query expansion. Evaluation results on the NTCIR-5 English-Korean test collection show statistically significant improvements over strong baselines.

Type

a
Ko, Y.: A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.00
```
0.0024857575 = product of:
  0.004971515 = sum of:
    0.004971515 = product of:
      0.00994303 = sum of:
        0.00994303 = weight(_text_:a in 2339) [ClassicSimilarity], result of:
          0.00994303 = score(doc=2339,freq=12.0), product of:
            0.053105544 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046056706 = queryNorm
            0.18723148 = fieldWeight in 2339, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

Text classification (TC) is a core technique for text mining and information retrieval. It has been applied in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain a high TC performance. Although term weighting is one of the important modules for TC and TC has different peculiarities from those in information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that exploits class information through positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than other schemes using class information as well as traditional schemes such as tf-idf.

Type

a
Bae, K.; Ko, Y.: Improving question retrieval in community question answering service using dependency relations and question classification (2019) 0.00
```
0.0016913437 = product of:
  0.0033826875 = sum of:
    0.0033826875 = product of:
      0.006765375 = sum of:
        0.006765375 = weight(_text_:a in 5412) [ClassicSimilarity], result of:
          0.006765375 = score(doc=5412,freq=8.0), product of:
            0.053105544 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046056706 = queryNorm
            0.12739488 = fieldWeight in 5412, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5412)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

To build an effective community question answering (cQA) service, determining ways to obtain questions similar to an input query question is a significant research issue. The major challenges for question retrieval in cQA are related to solving the lexical gap problem and estimating the relevance between questions. In this study, we first solve the lexical gap problem using a translation-based language model (TRLM). Thereafter, we determine features and methods that are suitable for estimating the relevance between two questions. For this purpose, we explore ways to use the results of a dependency parser and question classification for category information. Head-dependent pairs are first extracted as bigram features, called dependency bigrams, from the analysis results of the dependency parser. The probability of each category is estimated using the softmax approach based on the scores of the classification results. Subsequently, we propose two retrieval models, the dependency-based model (DM) and the category-based model (CM), and apply them to the previous model, TRLM. The experimental results demonstrate that the proposed methods significantly improve the performance of question retrieval in cQA services.

Type

a
