Search (5 results, page 1 of 1)

  • Filter: author_ss:"Sun, A."
  1. Qu, B.; Cong, G.; Li, C.; Sun, A.; Chen, H.: An evaluation of classification models for question topic categorization (2012) 0.01
    0.008005621 = product of:
      0.04803372 = sum of:
        0.04803372 = weight(_text_:problem in 237) [ClassicSimilarity], result of:
          0.04803372 = score(doc=237,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.23447686 = fieldWeight in 237, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=237)
      0.16666667 = coord(1/6)
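    The score breakdown shown for each result follows Lucene's ClassicSimilarity (TF-IDF): the term's query weight (idf × queryNorm) is multiplied by its field weight (tf × idf × fieldNorm) and by a coordination factor for the fraction of query clauses matched. A minimal Python sketch reproducing the arithmetic for this result, using only the constants shown in the explanation tree:

    ```python
    import math

    # Constants taken directly from the explanation tree for doc 237.
    freq = 2.0                                 # termFreq of "problem" in the field
    tf = math.sqrt(freq)                       # 1.4142135 = tf(freq=2.0)
    idf = 1.0 + math.log(44218 / (1723 + 1))   # 4.244485 = idf(docFreq=1723, maxDocs=44218)
    query_norm = 0.04826377                    # queryNorm
    field_norm = 0.0390625                     # fieldNorm(doc=237)
    coord = 1.0 / 6.0                          # coord(1/6): 1 of 6 query clauses matched

    query_weight = idf * query_norm            # 0.20485485 = queryWeight
    field_weight = tf * idf * field_norm       # 0.23447686 = fieldWeight
    score = coord * query_weight * field_weight
    print(f"{score:.9f}")                      # ≈ 0.008005621, up to rounding of the displayed constants
    ```

    The same computation, with a different document number, accounts for the identical 0.008005621 score of results 1 through 4.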
    
    Abstract
    We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset comprises 3.9 million questions, organized into more than 1,000 categories in a hierarchy. To the best of our knowledge, this is the first systematic evaluation of the performance of different classification methods on question topic classification, and on short texts more generally. Specifically, we empirically evaluate the following in classifying questions into CQA categories: (a) the usefulness of n-gram features and bag-of-word features; (b) the performance of three standard classification algorithms (naive Bayes, maximum entropy, and support vector machines); (c) the performance of state-of-the-art hierarchical classification algorithms; (d) the effect of training data size on performance; and (e) the effectiveness of the different components of CQA data, including subject, content, asker, and the best answer. The experimental results show which aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the experimental findings from this study will be useful in real-world classification problems.
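    The evaluation setup the abstract describes, bag-of-words versus n-gram features across naive Bayes, maximum entropy, and SVM classifiers, can be sketched with scikit-learn. The toy questions and categories below are hypothetical placeholders, not the Yahoo! Answers data used in the paper:

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Toy questions and categories -- hypothetical placeholders, not the
    # Yahoo! Answers CQA data used in the paper.
    train_q = [
        "How do I change a flat tire?",
        "What is a good pasta recipe?",
        "Why is the sky blue?",
        "How long should I boil pasta?",
    ]
    train_y = ["cars", "food", "science", "food"]
    test_q = ["What sauce goes with pasta?"]

    # (a) bag-of-words vs. word n-gram features
    feature_sets = {
        "bag-of-words": CountVectorizer(ngram_range=(1, 1)),
        "uni+bigrams": CountVectorizer(ngram_range=(1, 2)),
    }
    # (b) the three standard classifiers evaluated in the paper
    classifiers = {
        "naive Bayes": MultinomialNB(),
        "maximum entropy": LogisticRegression(max_iter=1000),
        "linear SVM": LinearSVC(),
    }

    for f_name, vectorizer in feature_sets.items():
        X_train = vectorizer.fit_transform(train_q)
        X_test = vectorizer.transform(test_q)
        for c_name, clf in classifiers.items():
            clf.fit(X_train, train_y)
            print(f"{f_name} + {c_name}: {clf.predict(X_test)[0]}")
    ```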
  2. Li, C.; Sun, A.; Datta, A.: TSDW: Two-stage word sense disambiguation using Wikipedia (2013) 0.01
    0.008005621 = product of:
      0.04803372 = sum of:
        0.04803372 = weight(_text_:problem in 956) [ClassicSimilarity], result of:
          0.04803372 = score(doc=956,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.23447686 = fieldWeight in 956, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=956)
      0.16666667 = coord(1/6)
    
    Abstract
    The semantic knowledge of Wikipedia has proved useful for many tasks, for example, named entity disambiguation. Among these applications, identifying the word sense based on Wikipedia is a crucial component, because its output is often used in subsequent tasks. In this article, we present a two-stage framework (called TSDW) for word sense disambiguation using knowledge latent in Wikipedia. A given phrase is disambiguated in a two-stage process: (a) the first stage explores the contextual semantic information, pruning noisy information for better effectiveness and efficiency; and (b) the second stage uses the high-confidence phrases disambiguated in the first stage to reach better decisions for the phrases that are difficult to disambiguate. Moreover, existing studies have addressed the disambiguation problem for English text only. Considering the popular use of Wikipedia in different languages, we study the performance of TSDW and the existing state-of-the-art approaches on both English and Traditional Chinese articles. The experimental results show that TSDW generalizes well to different semantic relatedness measures and to text in different languages. More importantly, TSDW significantly outperforms the state-of-the-art approaches in both effectiveness and efficiency.
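    The two-stage flow described above, disambiguate with a confidence score first, then re-disambiguate the low-confidence phrases using the high-confidence senses as a cleaner context, can be illustrated schematically. In the sketch below, the sense inventory and the relatedness table are hand-coded hypothetical stand-ins for Wikipedia and a semantic relatedness measure, not TSDW's actual components:

    ```python
    # Hypothetical sense inventory and relatedness table, standing in for
    # Wikipedia and a semantic relatedness measure.
    INVENTORY = {
        "jaguar": ["Jaguar (animal)", "Jaguar Cars"],
        "rainforest": ["Rainforest"],
    }
    RELATED = {
        ("Jaguar (animal)", "Rainforest"): 1.0,
        ("Jaguar Cars", "Le Mans"): 1.0,
    }

    def relatedness(sense, context):
        # Relatedness of a candidate sense to the current context: use the
        # hand-coded table, falling back to surface-string matching.
        score = max((RELATED.get((sense, c), 0.0) for c in context), default=0.0)
        if any(sense.lower() == c.lower() for c in context):
            score = max(score, 0.9)
        return score

    def disambiguate(phrases, context):
        # Pick the best-related sense per phrase, with a confidence score.
        return {
            p: max((relatedness(s, context), s) for s in INVENTORY.get(p, [p]))
            for p in phrases
        }

    def two_stage(phrases, threshold=0.8):
        # Stage 1: disambiguate every phrase against the raw textual context.
        stage1 = disambiguate(phrases, context=phrases)
        confident = [s for conf, s in stage1.values() if conf >= threshold]
        uncertain = [p for p, (conf, _) in stage1.items() if conf < threshold]
        # Stage 2: re-disambiguate the low-confidence phrases, now using the
        # high-confidence senses from stage 1 as a cleaner context.
        stage1.update(disambiguate(uncertain, context=confident))
        return {p: s for p, (conf, s) in stage1.items()}

    print(two_stage(["jaguar", "rainforest"]))
    # -> {'jaguar': 'Jaguar (animal)', 'rainforest': 'Rainforest'}
    ```

    In the toy run, "jaguar" is ambiguous in stage 1, but once "rainforest" is resolved with high confidence, stage 2 uses it to select the animal sense.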
  3. Ma, Z.; Sun, A.; Cong, G.: On predicting the popularity of newly emerging hashtags in Twitter (2013) 0.01
    0.008005621 = product of:
      0.04803372 = sum of:
        0.04803372 = weight(_text_:problem in 967) [ClassicSimilarity], result of:
          0.04803372 = score(doc=967,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.23447686 = fieldWeight in 967, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=967)
      0.16666667 = coord(1/6)
    
    Abstract
    Because of Twitter's popularity and the viral nature of information dissemination on the platform, predicting which Twitter topics will become popular in the near future is a task of considerable economic importance. Many Twitter topics are annotated by hashtags. In this article, we propose methods to predict the popularity of new hashtags on Twitter by formulating the problem as a classification task. We use five standard classification models (i.e., naïve Bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression) for prediction. The main challenge is the identification of effective features for describing new hashtags. We extract 7 content features from a hashtag string and the collection of tweets containing the hashtag, and 11 contextual features from the social graph formed by users who have adopted the hashtag. We conducted experiments on a Twitter dataset consisting of 31 million tweets from 2 million Singapore-based users. The experimental results show that the standard classifiers using the extracted features significantly outperform the baseline methods that do not use these features. Among the five classifiers, the logistic regression model performs best in terms of the Micro-F1 measure. We also observe that contextual features are more effective than content features.
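    The prediction setup, content features from the hashtag string and its tweets plus contextual features from the adopting users, fed to a standard classifier, can be sketched as follows. The features and data here are illustrative stand-ins, not the paper's 7 content and 11 contextual features:

    ```python
    from sklearn.linear_model import LogisticRegression

    def content_features(tag, tweets):
        # Content features derived from the hashtag string and its tweets
        # (illustrative; not the paper's actual 7 features).
        return [
            len(tag),                                           # hashtag length
            sum(ch.isdigit() for ch in tag),                    # digits in the tag
            sum(len(t.split()) for t in tweets) / len(tweets),  # mean tweet length
        ]

    def contextual_features(adopters, followers):
        # Contextual features derived from the users who adopted the hashtag;
        # `followers` is a toy stand-in for the social graph.
        counts = [followers.get(u, 0) for u in adopters]
        return [
            len(adopters),                       # number of early adopters
            max(counts, default=0),              # most-followed adopter
            sum(counts) / max(len(counts), 1),   # mean follower count
        ]

    # Hypothetical training data: (hashtag, tweets, adopters, became_popular).
    followers = {"alice": 1200, "bob": 90, "carol": 40}
    examples = [
        ("#sgelection", ["#sgelection results soon", "vote! #sgelection"], ["alice", "bob"], 1),
        ("#lunch123", ["#lunch123 yum"], ["carol"], 0),
        ("#haze", ["#haze is back", "stay indoors #haze"], ["alice", "carol"], 1),
        ("#mycat", ["#mycat is asleep again"], ["bob"], 0),
    ]

    X = [content_features(t, tw) + contextual_features(a, followers) for t, tw, a, _ in examples]
    y = [label for *_, label in examples]

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    new_x = content_features("#rain", ["#rain again?"]) + contextual_features(["alice"], followers)
    print(clf.predict([new_x]))  # 1 / 0: predicted popular / not popular
    ```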
  4. Yu, M.; Sun, A.: Dataset versus reality : understanding model performance from the perspective of information need (2023) 0.01
    0.008005621 = product of:
      0.04803372 = sum of:
        0.04803372 = weight(_text_:problem in 1073) [ClassicSimilarity], result of:
          0.04803372 = score(doc=1073,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.23447686 = fieldWeight in 1073, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1073)
      0.16666667 = coord(1/6)
    
    Abstract
    Deep learning technologies have brought us many models that outperform human beings on a few benchmarks. An interesting question is: can these models solve real-world problems with similar settings (e.g., identical input/output) to the benchmark datasets? We argue that a model is trained to answer the same information need, in a similar context (e.g., the information available), for which its training dataset was created. The trained model may then be used to solve real-world problems for a similar information need in a similar context. However, information need is independent of the format of dataset input/output. Although some datasets may share high structural similarities, they may represent different research tasks aimed at answering different information needs. Examples are question-answer pairs for the question answering (QA) task, and image-caption pairs for the image captioning (IC) task. In this paper, we use the QA task and the IC task as two case studies and compare their widely used benchmark datasets. From the perspective of information need in the context of information retrieval, we show the differences in the dataset creation processes and in the morphosyntactic properties of the datasets. These differences can be attributed to the different information needs and contexts of the specific research tasks. We encourage all researchers to consider the information need perspective of a research task when selecting the appropriate datasets to train a model. Likewise, when creating a dataset, researchers may incorporate the information need perspective as a factor to determine the degree to which the dataset accurately reflects the real-world problem or research task they intend to tackle.
  5. Sun, A.; Lim, E.-P.: Web unit-based mining of homepage relationships (2006) 0.01
    0.0054492294 = product of:
      0.032695375 = sum of:
        0.032695375 = weight(_text_:22 in 5274) [ClassicSimilarity], result of:
          0.032695375 = score(doc=5274,freq=2.0), product of:
            0.1690115 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04826377 = queryNorm
            0.19345059 = fieldWeight in 5274, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5274)
      0.16666667 = coord(1/6)
    
    Date
    22. 7.2006 16:18:25