Search (4 results, page 1 of 1)

  • author_ss:"Chen, H."
  • language_ss:"e"
  • year_i:[2010 TO 2020}
  1. Qu, B.; Cong, G.; Li, C.; Sun, A.; Chen, H.: An evaluation of classification models for question topic categorization (2012) 0.02
    0.020654965 = product of:
      0.04130993 = sum of:
        0.04130993 = product of:
          0.08261986 = sum of:
            0.08261986 = weight(_text_:classification in 237) [ClassicSimilarity], result of:
              0.08261986 = score(doc=237,freq=16.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.49761042 = fieldWeight in 237, product of:
                  4.0 = tf(freq=16.0), with freq of:
                    16.0 = termFreq=16.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=237)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset comprises 3.9 million questions, which are organized into more than 1,000 categories in a hierarchy. To the best of our knowledge, this is the first systematic evaluation of the performance of different classification methods on question topic classification as well as short texts. Specifically, we empirically evaluate the following in classifying questions into CQA categories: (a) the usefulness of n-gram features and bag-of-words features; (b) the performance of three standard classification algorithms (naive Bayes, maximum entropy, and support vector machines); (c) the performance of the state-of-the-art hierarchical classification algorithms; (d) the effect of training data size on performance; and (e) the effectiveness of the different components of CQA data, including subject, content, asker, and the best answer. The experimental results show what aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the experimental findings from this study will be useful in real-world classification problems.
  2. Huang, C.; Fu, T.; Chen, H.: Text-based video content classification for online video-sharing sites (2010) 0.01
    0.014605265 = product of:
      0.02921053 = sum of:
        0.02921053 = product of:
          0.05842106 = sum of:
            0.05842106 = weight(_text_:classification in 3452) [ClassicSimilarity], result of:
              0.05842106 = score(doc=3452,freq=8.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.35186368 = fieldWeight in 3452, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3452)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    With the emergence of Web 2.0, sharing personal content, communicating ideas, and interacting with other online users in Web 2.0 communities have become daily routines for online users. User-generated data from Web 2.0 sites provide rich personal information (e.g., personal preferences and interests) and can be utilized to obtain insight about cyber communities and their social networks. Many studies have focused on leveraging user-generated information to analyze blogs and forums, but few studies have applied this approach to video-sharing Web sites. In this study, we propose a text-based framework for video content classification of online video-sharing Web sites. Different types of user-generated data (e.g., titles, descriptions, and comments) were used as proxies for online videos, and three types of text features (lexical, syntactic, and content-specific features) were extracted. Three feature-based classification techniques (C4.5, Naïve Bayes, and Support Vector Machine) were used to classify videos. To evaluate the proposed framework, user-generated data from candidate videos, which were identified by searching user-given keywords on YouTube, were first collected. Then, a subset of the collected data was randomly selected and manually tagged by users as our experiment data. The experimental results showed that the proposed approach was able to classify online videos based on users' interests with accuracy rates up to 87.2%, and all three types of text features contributed to discriminating videos. Support Vector Machine outperformed C4.5 and Naïve Bayes techniques in our experiments. In addition, our case study further demonstrated that accurate video-classification results are very useful for identifying implicit cyber communities on video-sharing Web sites.
  3. Ku, Y.; Chiu, C.; Zhang, Y.; Chen, H.; Su, H.: Text mining self-disclosing health information for public health service (2014) 0.01
    0.00876316 = product of:
      0.01752632 = sum of:
        0.01752632 = product of:
          0.03505264 = sum of:
            0.03505264 = weight(_text_:classification in 1262) [ClassicSimilarity], result of:
              0.03505264 = score(doc=1262,freq=2.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.21111822 = fieldWeight in 1262, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1262)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Understanding specific patterns or knowledge of self-disclosing health information could support public health surveillance and healthcare. This study aimed to develop an analytical framework to identify self-disclosing health information with unusual messages on web forums by leveraging advanced text-mining techniques. To demonstrate the performance of the proposed analytical framework, we conducted an experimental study on 2 major human immunodeficiency virus (HIV)/acquired immune deficiency syndrome (AIDS) forums in Taiwan. The experimental results show that the classification accuracy increased significantly (up to 83.83%) when using features selected by the information gain technique. The results also show the importance of adopting domain-specific features in analyzing unusual messages on web forums. This study has practical implications for the prevention and support of HIV/AIDS healthcare. For example, public health agencies can re-allocate resources and deliver services to people who need help via social media sites. In addition, individuals can also join a social media site to get better suggestions and support from each other.
  4. Yang, M.; Kiang, M.; Chen, H.; Li, Y.: Artificial immune system for illicit content identification in social media (2012) 0.01
    0.0073026326 = product of:
      0.014605265 = sum of:
        0.014605265 = product of:
          0.02921053 = sum of:
            0.02921053 = weight(_text_:classification in 4980) [ClassicSimilarity], result of:
              0.02921053 = score(doc=4980,freq=2.0), product of:
                0.16603322 = queryWeight, product of:
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.05213454 = queryNorm
                0.17593184 = fieldWeight in 4980, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.1847067 = idf(docFreq=4974, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4980)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Social media is frequently used as a platform for the exchange of information and opinions as well as propaganda dissemination. But online content can be misused for the distribution of illicit information, such as violent postings in web forums. Illicit content is highly distributed in social media, while non-illicit content is unspecific and topically diverse. It is costly and time-consuming to label a large amount of illicit content (positive examples) and non-illicit content (negative examples) to train classification systems. Nevertheless, it is relatively easy to obtain large volumes of unlabeled content in social media. In this article, an artificial immune system-based technique is presented to address the difficulties of illicit content identification in social media. Inspired by the positive selection principle in the immune system, we designed a novel labeling heuristic based on partially supervised learning to extract high-quality positive and negative examples from unlabeled datasets. The empirical evaluation results from two large hate group web forums suggest that our proposed approach generally outperforms the benchmark techniques and exhibits more stable performance.
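
A note on the score explanations listed with each result: the nested values appear to follow Lucene's ClassicSimilarity (TF-IDF) scoring, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The sketch below recomputes the listed scores from the figures shown above, assuming those standard formulas. The function names are hypothetical and chosen for illustration only, and queryNorm and fieldNorm are taken as given from the listing rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assuming the ClassicSimilarity formula."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_score(freq, doc_freq, max_docs, query_norm, field_norm,
                  coords=(0.5, 0.5)):
    """Recompute one result's score from the values in its explanation.

    coords holds the coord(1/2) factors applied at the two outer levels.
    """
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm                     # queryWeight
    field_weight = math.sqrt(freq) * idf * field_norm   # fieldWeight
    score = query_weight * field_weight
    for c in coords:
        score *= c
    return score


# Values copied from the explanations for results 1 and 3 (docs 237 and 1262).
score_237 = classic_score(freq=16.0, doc_freq=4974, max_docs=44218,
                          query_norm=0.05213454, field_norm=0.0390625)
score_1262 = classic_score(freq=2.0, doc_freq=4974, max_docs=44218,
                           query_norm=0.05213454, field_norm=0.046875)
print(score_237, score_1262)
```

Under these assumptions the recomputed values agree with the listed scores (0.020654965 and 0.00876316) up to single-precision rounding. Results 3 and 4 share the same term frequency, so the difference in their scores comes only from fieldNorm, which reflects field length.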