This database contains over 40,000 documents on topics from the fields of descriptive cataloging, subject indexing, and information retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft / Powered by litecat, BIS Oldenburg (as of 4 June 2021)
1. Ferreira, R.S. ; Graça Pimentel, M. de ; Cristo, M.: A wikification prediction model based on the combination of latent, dyadic, and monadic features.
In: Journal of the Association for Information Science and Technology. 69(2018) no.3, pp.380-394.
Abstract: Considering repositories of web documents that are semantically linked and created in a collaborative fashion, as in the case of Wikipedia, a key problem faced by content providers is the placement of links in the articles. These links must support user navigation and provide a deeper semantic interpretation of the content. Current wikification methods exploit machine learning techniques to capture characteristics of the concepts and their associations. In previous work, we proposed a preliminary prediction model combining traditional predictors with a latent component which captures the concept graph topology by means of matrix factorization. In this work, we provide a detailed description of our method and a deeper comparison with a state-of-the-art wikification method using a sample of Wikipedia and report a gain of up to 13% in F1 score. We also provide a comprehensive analysis of the model performance showing the importance of the latent predictor component and the attributes derived from the associations between the concepts. Moreover, we include an analysis that allows us to conclude that the model is resilient to ambiguity without including a disambiguation phase. We finally report the positive impact of selecting training samples from specific content quality classes.
Content: Cf. http://onlinelibrary.wiley.com/doi/10.1002/asi.23922/full.
Topic: Hypertext ; Semantic context in indexing and retrieval
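The abstract above combines traditional (monadic and dyadic) predictors with a latent term learned by factorizing the concept link graph. A minimal sketch of that scoring idea, assuming a linear predictor plus a latent dot product; all names and numbers are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical link-placement score: traditional features plus a latent
# dyadic term u_i . v_j from matrix factorization of the concept graph.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score_link(features, weights, u_i, v_j):
    """Traditional-feature score plus the latent (factorization) term."""
    return dot(features, weights) + dot(u_i, v_j)

w = [0.5, 1.0]            # assumed learned weights for two dyadic features
U = {"A": [1.0, 0.0]}     # assumed latent vector of source concept A
V = {"B": [0.8, 0.3]}     # assumed latent vector of target concept B
s = score_link([0.2, 0.4], w, U["A"], V["B"])   # higher score -> place a link
```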
2. Dalip, D.H. ; Gonçalves, M.A. ; Cristo, M. ; Calado, P.: A general multiview framework for assessing the quality of collaboratively created content on Web 2.0.
In: Journal of the Association for Information Science and Technology. 68(2017) no.2, pp.286-308.
Abstract: User-generated content is one of the most interesting phenomena of current published media, as users are now able not only to consume, but also to produce content in a much faster and easier manner. However, such freedom also carries concerns about content quality. In this work, we propose an automatic framework to assess the quality of collaboratively generated content. Quality is addressed as a multidimensional concept, modeled as a combination of independent assessments, each regarding different quality dimensions. Accordingly, we adopt a machine-learning (ML)-based multiview approach to assess content quality. We perform a thorough analysis of our framework on two different domains: Question and Answer forums and collaborative encyclopedias. This allowed us to better understand when and how the proposed multiview approach is able to provide accurate quality assessments. Our main contributions are: (a) a general ML multiview framework that takes advantage of different views of quality indicators; (b) the improvement (up to 30%) in quality assessment over the best state-of-the-art baseline methods; (c) a thorough feature and view analysis regarding impact, informativeness, and correlation, based on two distinct domains.
Content: Cf. http://onlinelibrary.wiley.com/doi/10.1002/asi.23650/full.
Object: Web 2.0
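The multiview idea above — independent assessments per quality dimension, merged by a combiner — can be sketched as follows. The views, signals, and weights here are illustrative assumptions, not the paper's actual feature sets:

```python
# Hypothetical quality views producing independent scores in [0, 1],
# merged by a simple weighted combiner.

def view_style(doc):       # assumed "style" view: crude length-based signal
    return min(1.0, len(doc["text"]) / 100.0)

def view_structure(doc):   # assumed "structure" view: section-count signal
    return min(1.0, doc["sections"] / 5.0)

def combine(scores, weights):
    """Weighted combination of the independent per-view assessments."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

doc = {"text": "x" * 80, "sections": 4}
quality = combine([view_style(doc), view_structure(doc)], [1.0, 1.0])
```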
3. Souza, J. ; Carvalho, A. ; Cristo, M. ; Moura, E. ; Calado, P. ; Chirita, P.-A. ; Nejdl, W.: Using site-level connections to estimate link confidence.
In: Journal of the American Society for Information Science and Technology. 63(2012) no.11, pp.2294-2312.
Abstract: Search engines are essential tools for web users today. They rely on a large number of features to compute the rank of search results for each given query. The estimated reputation of pages is among the effective features available for search engine designers, probably being adopted by most current commercial search engines. Page reputation is estimated by analyzing the linkage relationships between pages. This information is used by link analysis algorithms as a query-independent feature, to be taken into account when computing the rank of the results. Unfortunately, several types of links found on the web may damage the estimated page reputation and thus cause a negative effect on the quality of search results. This work studies alternatives to reduce the negative impact of such noisy links. More specifically, the authors propose and evaluate new methods that deal with noisy links, considering scenarios where the reputation of pages is computed using the PageRank algorithm. They show, through experiments with real web content, that their methods achieve significant improvements when compared to previous solutions proposed in the literature.
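The scenario studied above — reputation computed with PageRank, with noisy links damped rather than simply removed — can be sketched as a weighted PageRank. The graph, weights, and interface are illustrative assumptions, not the authors' method:

```python
# Hypothetical weighted PageRank: each out-link carries a confidence weight,
# so a suspected noisy link confers proportionally less reputation.

def pagerank(out_links, damping=0.85, iters=50):
    """out_links[p] maps target page -> link weight in (0, 1]."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, targets in out_links.items():
            total = sum(targets.values())
            if total == 0:                      # dangling page: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q, w in targets.items():    # share rank by link weight
                    new[q] += damping * rank[p] * (w / total)
        rank = new
    return rank

# a's link to b is flagged as noisy and down-weighted to 0.1.
ranks = pagerank({"a": {"b": 0.1, "c": 1.0}, "b": {"c": 1.0}, "c": {}})
```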
4. Silva, A.J.C. ; Gonçalves, M.A. ; Laender, A.H.F. ; Modesto, M.A.B. ; Cristo, M. ; Ziviani, N.: Finding what is missing from a digital library : a case study in the computer science field.
In: Information processing and management. 45(2009) no.3, pp.380-391.
Abstract: This article proposes a process to retrieve the URL of a document for which metadata records exist in a digital library catalog but a pointer to the full text of the document is not available. The process uses results from queries submitted to Web search engines for finding the URL of the corresponding full text or any related material. We present a comprehensive study of this process in different situations by investigating different query strategies applied to three general purpose search engines (Google, Yahoo!, MSN) and two specialized ones (Scholar and CiteSeer), considering five user scenarios. Specifically, we have conducted experiments with metadata records taken from the Brazilian Digital Library of Computing (BDBComp) and The DBLP Computer Science Bibliography (DBLP). We found that Scholar was the most effective search engine for this task in all considered scenarios and that simple strategies for combining and re-ranking results from Scholar and Google significantly improve the retrieval quality. Moreover, we study the influence of the number of query results on the effectiveness of finding missing information as well as the coverage of the proposed scenarios.
Topic: Information Gateway
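The query-strategy step described above — forming search-engine queries from a metadata record that lacks a full-text pointer — can be sketched as below. The strategy names and query formats are illustrative assumptions, not the paper's exact strategies:

```python
# Hypothetical query builder: derive candidate web-search queries from a
# catalog metadata record whose full-text URL is missing.

def build_queries(record):
    title = record["title"]
    surnames = " ".join(a.split(",")[0] for a in record["authors"])
    return [
        f'"{title}"',                # exact title phrase
        f'"{title}" {surnames}',     # title phrase plus author surnames
        f'{title} filetype:pdf',     # bias results toward full-text PDFs
    ]

record = {"title": "Finding what is missing from a digital library",
          "authors": ["Silva, A.J.C.", "Gonçalves, M.A."]}
queries = build_queries(record)
```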
5. Calado, P. ; Cristo, M. ; Gonçalves, M.A. ; Moura, E.S. de ; Ribeiro-Neto, B. ; Ziviani, N.: Link-based similarity measures for the classification of Web documents.
In: Journal of the American Society for Information Science and Technology. 57(2006) no.2, pp.208-221.
Abstract: Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
Topic: Automatic classification
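One link-based similarity measure of the kind evaluated above is co-citation similarity: the overlap between the sets of pages that link to two documents. A minimal sketch; the Jaccard normalization and the toy graph are assumptions, as the paper compares several variants:

```python
# Hypothetical co-citation similarity: documents linked from the same
# pages are likely to share a topic.

def cocitation_sim(inlinks_a, inlinks_b):
    """Jaccard overlap of the two documents' in-link sets."""
    union = inlinks_a | inlinks_b
    return len(inlinks_a & inlinks_b) / len(union) if union else 0.0

# Both pages are linked from hubs h1 and h2 -> likely related topics.
sim = cocitation_sim({"h1", "h2", "h3"}, {"h1", "h2"})
```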
6. Couto, T. ; Cristo, M. ; Gonçalves, M.A. ; Calado, P. ; Ziviani, N. ; Moura, E. ; Ribeiro-Neto, B.: A comparative study of citations and links in document classification.
In: International Conference on Digital Libraries: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, Chapel Hill, NC, USA. New York : ACM, 2006. pp.75-84.
Abstract: It is well known that links are an important source of information when dealing with Web collections. However, the question remains whether the same techniques that are used on the Web can be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links, in the context of automatic text classification. We show that there are in fact differences between citations and links in this context. For the comparison, we run a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains up to 37% over text-based classifiers, while measures based on bibliographic coupling perform better in a digital library. We also propose a simple and effective way of combining a traditional text-based classifier with a citation-link-based classifier. This combination is based on the notion of classifier reliability and presented gains of up to 14% in micro-averaged F1 in the Web collection. However, no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes for these results. We discovered that misclassifications by the citation-link-based classifiers are in fact difficult cases, hard to classify even for humans.
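Two ideas from the abstract above can be sketched together: bibliographic coupling (shared references between documents) and a reliability-based combination that trusts the citation/link classifier only when it is confident. The threshold, labels, and interface are illustrative assumptions:

```python
# Hypothetical coupling measure and classifier combination.

def bib_coupling(refs_a, refs_b):
    """Bibliographic coupling strength: number of shared references."""
    return len(refs_a & refs_b)

def combine(text_label, link_label, link_reliability, threshold=0.7):
    """Prefer the citation/link-based prediction only when it is reliable."""
    return link_label if link_reliability >= threshold else text_label

coupling = bib_coupling({"r1", "r2", "r3"}, {"r2", "r3"})
label = combine("databases", "info retrieval", link_reliability=0.9)
```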