This database contains more than 40,000 documents on topics in descriptive cataloguing, subject indexing, and information retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft / Powered by litecat, BIS Oldenburg (as of 21 January 2019)
1. Zheng, X. ; Sun, A.: Collecting event-related tweets from Twitter stream.
In: Journal of the Association for Information Science and Technology. 70(2019) no.2, pp. 176-186.
Abstract: Twitter provides a channel of collecting and publishing instant information on major events like natural disasters. However, information flow on Twitter is of great volume. For a specific event, messages collected from the Twitter Stream based on either location constraint or predefined keywords would contain a lot of noise. In this article, we propose a method to achieve both high-precision and high-recall in collecting event-related tweets. Our method involves an automatic keyword generation component, and an event-related tweet identification component. For keyword generation, we consider three properties of candidate keywords, namely relevance, coverage, and evolvement. The keyword updating mechanism enables our method to track the main topics of tweets along event development. To minimize annotation effort in identifying event-related tweets, we adopt active learning and incorporate multiple-instance learning which assigns labels to bags instead of instances (that is, individual tweets). Through experiments on two real-world events, we demonstrate the superiority of our method against state-of-the-art alternatives.
Content: Cf. https://onlinelibrary.wiley.com/doi/10.1002/asi.24096.
2. Li, J. ; Sun, A. ; Xing, Z.: To do or not to do : distill crowdsourced negative caveats to augment API documentation.
In: Journal of the Association for Information Science and Technology. 69(2018) no.12, pp. 1460-1475.
Abstract: Negative caveats of application programming interfaces (APIs) are about "how not to use an API," which are often absent from the official API documentation. When these caveats are overlooked, programming errors may emerge from misusing APIs, leading to heavy discussions on Q&A websites like Stack Overflow. If the overlooked caveats could be mined from these discussions, they would be beneficial for programmers to avoid misuse of APIs. However, this is challenging because the discussions are informal, redundant, and diverse. To this end, we propose Disca, a novel approach for automatically Distilling desirable API negative caveats from unstructured Q&A discussions. Through sentence selection and prominent term clustering, Disca ensures that distilled caveats are context-independent, prominent, semantically diverse, and nonredundant. Quantitative evaluation in our experiments shows that the proposed Disca significantly outperforms four text-summarization techniques. We also show that the distilled API negative caveats could greatly augment API documentation through qualitative analysis.
3. Sedhai, S. ; Sun, A.: An analysis of 14 million tweets on hashtag-oriented spamming.
In: Journal of the Association for Information Science and Technology. 68(2017) no.7, pp. 1638-1651.
Abstract: Over the years, Twitter has become a popular platform for information dissemination and information gathering. However, the popularity of Twitter has attracted not only legitimate users but also spammers who exploit social graphs, popular keywords, and hashtags for malicious purposes. In this paper, we present a detailed analysis of the HSpam14 dataset, which contains 14 million tweets with spam and ham (i.e., nonspam) labels, to understand spamming activities on Twitter. The primary focus of this paper is to analyze various aspects of spam on Twitter based on hashtags, tweet contents, and user profiles, which are useful for both tweet-level and user-level spam detection. First, we compare the usage of hashtags in spam and ham tweets based on frequency, position, orthography, and co-occurrence. Second, for content-based analysis, we analyze the variations in word usage, metadata, and near-duplicate tweets. Third, for user-based analysis, we investigate user profile information. In our study, we validate that spammers use popular hashtags to promote their tweets. We also observe differences in the usage of words in spam and ham tweets. Spam tweets are more likely to be emphasized using exclamation points and capitalized words. Furthermore, we observe that spammers use multiple accounts to post near-duplicate tweets to promote their services and products. Unlike spammers, legitimate users are likely to provide more information such as their locations and personal descriptions in their profiles. In summary, this study presents a comprehensive analysis of hashtags, tweet contents, and user profiles in Twitter spamming.
Content: Cf. http://onlinelibrary.wiley.com/doi/10.1002/asi.23836/full.
4. Li, C. ; Sun, A.: Extracting fine-grained location with temporal awareness in tweets : a two-stage approach.
In: Journal of the Association for Information Science and Technology. 68(2017) no.7, pp. 1652-1670.
Abstract: Twitter has attracted billions of users for life logging and sharing activities and opinions. In their tweets, users often reveal their location information and short-term visiting histories or plans. Capturing users' short-term activities could benefit many applications for providing the right context at the right time and location. In this paper we are interested in extracting locations mentioned in tweets at fine-grained granularity, with temporal awareness. Specifically, we recognize the points-of-interest (POIs) mentioned in a tweet and predict whether the user has visited, is currently at, or will soon visit the mentioned POIs. A POI can be a restaurant, a shopping mall, a bookstore, or any other fine-grained location. Our proposed framework, named TS-Petar (Two-Stage POI Extractor with Temporal Awareness), consists of two main components: a POI inventory and a two-stage time-aware POI tagger. The POI inventory is built by exploiting the crowd wisdom of the Foursquare community. It contains both POIs' formal names and their informal abbreviations, commonly observed in Foursquare check-ins. The time-aware POI tagger, based on the Conditional Random Field (CRF) model, is devised to disambiguate the POI mentions and to resolve their associated temporal awareness accordingly. Three sets of contextual features (linguistic, temporal, and inventory features) and two labeling schema features (OP and BILOU schemas) are explored for the time-aware POI extraction task. Our empirical study shows that the subtask of POI disambiguation and the subtask of temporal awareness resolution call for different feature settings for best performance. We have also evaluated the proposed TS-Petar against several strong baseline methods. The experimental results demonstrate that the two-stage approach achieves the best accuracy and outperforms all baseline methods in terms of both effectiveness and efficiency.
Content: Cf. http://onlinelibrary.wiley.com/doi/10.1002/asi.23816/full.
5. Li, C. ; Sun, A. ; Datta, A.: TSDW: Two-stage word sense disambiguation using Wikipedia.
In: Journal of the American Society for Information Science and Technology. 64(2013) no.6, pp. 1203-1223.
Abstract: The semantic knowledge of Wikipedia has proved to be useful for many tasks, for example, named entity disambiguation. Among these applications, the task of identifying the word sense based on Wikipedia is a crucial component because the output of this component is often used in subsequent tasks. In this article, we present a two-stage framework (called TSDW) for word sense disambiguation using knowledge latent in Wikipedia. The disambiguation of a given phrase is applied through a two-stage disambiguation process: (a) The first-stage disambiguation explores the contextual semantic information, where the noisy information is pruned for better effectiveness and efficiency; and (b) the second-stage disambiguation explores the disambiguated phrases of high confidence from the first stage to achieve better redisambiguation decisions for the phrases that are difficult to disambiguate in the first stage. Moreover, existing studies have addressed the disambiguation problem for English text only. Considering the popular usage of Wikipedia in different languages, we study the performance of TSDW and the existing state-of-the-art approaches over both English and Traditional Chinese articles. The experimental results show that TSDW generalizes well to different semantic relatedness measures and text in different languages. More important, TSDW significantly outperforms the state-of-the-art approaches with both better effectiveness and efficiency.
6. Ma, Z. ; Sun, A. ; Cong, G.: On predicting the popularity of newly emerging hashtags in Twitter.
In: Journal of the American Society for Information Science and Technology. 64(2013) no.7, pp. 1399-1410.
Abstract: Because of Twitter's popularity and the viral nature of information dissemination on Twitter, predicting which Twitter topics will become popular in the near future becomes a task of considerable economic importance. Many Twitter topics are annotated by hashtags. In this article, we propose methods to predict the popularity of new hashtags on Twitter by formulating the problem as a classification task. We use five standard classification models (i.e., naïve Bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression) for prediction. The main challenge is the identification of effective features for describing new hashtags. We extract 7 content features from a hashtag string and the collection of tweets containing the hashtag and 11 contextual features from the social graph formed by users who have adopted the hashtag. We conducted experiments on a Twitter data set consisting of 31 million tweets from 2 million Singapore-based users. The experimental results show that the standard classifiers using the extracted features significantly outperform the baseline methods that do not use these features. Among the five classifiers, the logistic regression model performs the best in terms of the Micro-F1 measure. We also observe that contextual features are more effective than content features.
Subject area: Automatic classification ; Data mining
7. Qu, B. ; Cong, G. ; Li, C. ; Sun, A. ; Chen, H.: An evaluation of classification models for question topic categorization.
In: Journal of the American Society for Information Science and Technology. 63(2012) no.5, pp. 889-903.
Abstract: We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset comprises 3.9 million questions, and these questions are organized into more than 1,000 categories in a hierarchy. To the best of our knowledge, this is the first systematic evaluation of the performance of different classification methods on question topic classification as well as short texts. Specifically, we empirically evaluate the following in classifying questions into CQA categories: (a) the usefulness of n-gram features and bag-of-word features; (b) the performance of three standard classification algorithms (naive Bayes, maximum entropy, and support vector machines); (c) the performance of the state-of-the-art hierarchical classification algorithms; (d) the effect of training data size on performance; and (e) the effectiveness of the different components of CQA data, including subject, content, asker, and the best answer. The experimental results show what aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the experimental findings from this study will be useful in real-world classification problems.
Subject area: Automatic classification
8. Li, H. ; Bhowmick, S.S. ; Sun, A.: AffRank: affinity-driven ranking of products in online social rating networks.
In: Journal of the American Society for Information Science and Technology. 62(2011) no.7, pp. 1345-1359.
Abstract: Large online social rating networks (e.g., Epinions, Blippr) have recently come into being containing information related to various types of products. Typically, each product in these networks is associated with a group of members who have provided ratings and comments on it. These people form a product community. A potential member can join a product community by giving a new rating to the product. We refer to this phenomenon of a product community's ability to "attract" new members as product affinity. The knowledge of a ranked list of products based on product affinity is of much importance for implementing policies, marketing research, online advertisement, and other applications. In this article, we identify and analyze an array of features that exert effect on product affinity and propose a novel model, called AffRank, that utilizes these features to predict the future rank of products according to their affinities. Evaluated on two real-world datasets, we demonstrate the effectiveness and superior prediction quality of AffRank compared with baseline methods. Our experiments show that features such as affinity rank history, affinity evolution distance, and average rating are the most important factors affecting future rank of products. At the same time, interestingly, traditional community features (e.g., community size, member connectivity, and social context) have negligible influence on product affinities.
9. Sun, A. ; Bhowmick, S.S. ; Nguyen, K.T.N. ; Bai, G.: Tag-based social image retrieval : an empirical evaluation.
In: Journal of the American Society for Information Science and Technology. 62(2011) no.12, pp. 2364-2381.
Abstract: Tags associated with social images are a valuable information source for superior image search and retrieval experiences. Although various heuristics are valuable to boost tag-based search for images, there is a lack of a general framework to study the impact of these heuristics. Specifically, the task of ranking images matching a given tag query based on their associated tags in descending order of relevance has not been well studied. In this article, we take the first step to propose a generic, flexible, and extensible framework for this task and exploit it for a systematic and comprehensive empirical evaluation of various methods for ranking images. To this end, we identified five orthogonal dimensions to quantify the matching score between a tagged image and a tag query. These five dimensions are: (i) tag relatedness to measure the degree of effectiveness of a tag describing the tagged image; (ii) tag discrimination to quantify the degree of discrimination of a tag with respect to the entire tagged image collection; (iii) tag length normalization analogous to document length normalization in web search; (iv) tag-query matching model for the matching score computation between an image tag and a query tag; and (v) query model for tag query rewriting. For each dimension, we identify a few implementations and evaluate their impact on the NUS-WIDE dataset, the largest human-annotated dataset consisting of more than 269K tagged images from Flickr. We evaluated 81 single-tag queries and 443 multi-tag queries over 288 search methods and systematically compare their performances using standard metrics including Precision at top-K, Mean Average Precision (MAP), Recall, and Normalized Discounted Cumulative Gain (NDCG).
Subject area: Social tagging
Form covered: Images
10. Sun, A. ; Lim, E.-P.: Web unit-based mining of homepage relationships.
In: Journal of the American Society for Information Science and Technology. 57(2006) no.3, pp. 394-407.
Abstract: Homepages usually describe important semantic information about conceptual or physical entities; hence, they are the main targets for searching and browsing. To facilitate semantic-based information retrieval (IR) at a Web site, homepages can be identified and classified under some predefined concepts and these concepts are then used in query or browsing criteria, e.g., finding professor homepages containing information retrieval. In some Web sites, relationships may also exist among homepages. These relationship instances (also known as homepage relationships) enrich our knowledge about these Web sites and allow more expressive semantic-based IR. In this article, we investigate the features to be used in mining homepage relationships. We systematically develop different classes of inter-homepage features, namely, navigation, relative-location, and common-item features. We also propose deriving for each homepage a set of support pages to obtain richer and more complete content about the entity described by the homepage. The homepage together with its support pages are known to be a Web unit. By extracting inter-homepage features from Web units, our experiments on the WebKB dataset show that better homepage relationship mining accuracies can be achieved.
11. Sun, A. ; Lim, E.-P. ; Ng, W.-K.: Performance measurement framework for hierarchical text classification.
In: Journal of the American Society for Information Science and Technology. 54(2003) no.11, pp. 1014-1028.
Abstract: Hierarchical text classification, or simply hierarchical classification, refers to assigning a document to one or more suitable categories from a hierarchical category space. In our literature survey, we have found that the existing hierarchical classification experiments used a variety of measures to evaluate performance. These performance measures often assume independence between categories and do not consider documents misclassified into categories that are similar or not far from the correct categories in the category tree. In this paper, we therefore propose new performance measures for hierarchical classification. The proposed performance measures consist of category similarity measures and distance-based measures that consider the contributions of misclassified documents. Our experiments on hierarchical classification methods based on SVM classifiers and binary Naive Bayes classifiers showed that SVM classifiers perform better than Naive Bayes classifiers on the Reuters-21578 collection according to the extended measures. A new classifier-centric measure called blocking measure is also defined to examine the performance of subtree classifiers in a top-down level-based hierarchical classification method.
Subject area: Automatic classification