Search (15 results, page 1 of 1)

  • author_ss:"Gonçalves, M.A."
  1. Couto, T.; Cristo, M.; Gonçalves, M.A.; Calado, P.; Ziviani, N.; Moura, E.; Ribeiro-Neto, B.: A comparative study of citations and links in document classification (2006) 0.02
    0.02017508 = product of:
      0.07061278 = sum of:
        0.021635616 = weight(_text_:with in 2531) [ClassicSimilarity], result of:
          0.021635616 = score(doc=2531,freq=6.0), product of:
            0.09383348 = queryWeight, product of:
              2.409771 = idf(docFreq=10797, maxDocs=44218)
              0.038938753 = queryNorm
            0.2305746 = fieldWeight in 2531, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              2.409771 = idf(docFreq=10797, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2531)
        0.048977163 = product of:
          0.097954325 = sum of:
            0.097954325 = weight(_text_:humans in 2531) [ClassicSimilarity], result of:
              0.097954325 = score(doc=2531,freq=2.0), product of:
                0.26276368 = queryWeight, product of:
                  6.7481275 = idf(docFreq=140, maxDocs=44218)
                  0.038938753 = queryNorm
                0.37278488 = fieldWeight in 2531, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  6.7481275 = idf(docFreq=140, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2531)
          0.5 = coord(1/2)
      0.2857143 = coord(2/7)
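The scoring tree above is Lucene's ClassicSimilarity explanation. A minimal sketch that reproduces the final score from the constants shown (the formula is the standard Lucene TF-IDF one; queryNorm, frequencies, and fieldNorm are copied from the explanation):

```python
import math

# Constants copied from the scoring explanation above.
MAX_DOCS = 44218
QUERY_NORM = 0.038938753

def idf(doc_freq):
    # ClassicSimilarity: idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1))

def term_score(freq, doc_freq, field_norm):
    tf = math.sqrt(freq)                      # tf = sqrt(termFreq)
    query_weight = idf(doc_freq) * QUERY_NORM
    field_weight = tf * idf(doc_freq) * field_norm
    return query_weight * field_weight

# Term "with": freq=6, docFreq=10797; term "humans": freq=2, docFreq=140.
with_part = term_score(6.0, 10797, 0.0390625)
humans_part = term_score(2.0, 140, 0.0390625) * 0.5  # coord(1/2)
score = (with_part + humans_part) * (2.0 / 7.0)      # coord(2/7)
print(f"{score:.8f}")  # ≈ 0.02017508
```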
    
    Abstract
    It is well known that links are an important source of information when dealing with Web collections. However, the question remains whether the same techniques used on the Web can be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links in the context of automatic text classification. We show that there are in fact differences between citations and links in this context. For the comparison, we ran a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains of up to 37% over text-based classifiers, while measures based on bibliographic coupling perform better in a digital library. We also propose a simple and effective way of combining a traditional text-based classifier with a citation-link-based classifier. This combination is based on the notion of classifier reliability and yielded gains of up to 14% in micro-averaged F1 in the Web collection. However, no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes of these results. We discovered that misclassifications by the citation-link-based classifiers are in fact difficult cases, hard to classify even for humans.
  2. Pereira, D.A.; Ribeiro-Neto, B.; Ziviani, N.; Laender, A.H.F.; Gonçalves, M.A.: A generic Web-based entity resolution framework (2011) 0.00
    
    Abstract
    Web data repositories usually contain references to thousands of real-world entities from multiple sources. It is not uncommon that multiple entities share the same label (polysemes) and that distinct label variations are associated with the same entity (synonyms), which frequently leads to ambiguous interpretations. Further, spelling variants, acronyms, abbreviated forms, and misspellings compound the problem. Solving this problem requires identifying which labels correspond to the same real-world entity, a process known as entity resolution. One approach to the entity resolution problem is to associate an authority identifier and a list of variant forms with each entity, a data structure known as an authority file. In this work, we propose a generic framework for implementing a method for generating authority files. Our method uses information from the Web to improve the quality of the authority file and, because of that, is referred to as WER (Web-based Entity Resolution). Our contribution here is threefold: (a) we discuss how to implement the WER framework, which is flexible and easy to adapt to new domains; (b) we report extensive experiments with our WER framework to show that it outperforms selected baselines; and (c) we compare the results of a specialized solution for author name resolution with those produced by the generic WER framework, and show that the WER results remain competitive.
  3. Martins, E.F.; Belém, F.M.; Almeida, J.M.; Gonçalves, M.A.: On cold start for associative tag recommendation (2016) 0.00
    
    Abstract
    Tag recommendation strategies that exploit term co-occurrence patterns with tags previously assigned to the target object have consistently produced state-of-the-art results. However, such techniques work only for objects with previously assigned tags. Here we focus on tag recommendation for objects with no tags, a variation of the well-known cold start problem. We start by evaluating state-of-the-art co-occurrence-based methods in cold start. Our results show that the effectiveness of these methods suffers in this situation. Moreover, we show that employing various automatic filtering strategies to generate an initial tag set that enables the use of co-occurrence patterns produces only marginal improvements. We then propose a new approach that exploits both positive and negative user feedback to iteratively select input tags along with a genetic programming strategy to learn the recommendation function. Our experimental results indicate that extending the methods to include user relevance feedback leads to gains in precision of up to 58% over the best baseline in cold start scenarios and gains of up to 43% over the best baseline in objects that contain some initial tags (i.e., no cold start). We also show that our best relevance-feedback-driven strategy performs well even in scenarios that lack user cooperation (i.e., users may refuse to provide feedback) and user reliability (i.e., users may provide the wrong feedback).
  4. Calado, P.; Cristo, M.; Gonçalves, M.A.; Moura, E.S. de; Ribeiro-Neto, B.; Ziviani, N.: Link-based similarity measures for the classification of Web documents (2006) 0.00
    
    Abstract
    Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
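Two of the link-based measures usually compared in this line of work, bibliographic coupling (shared out-links) and co-citation (shared in-links), can be sketched as set overlaps. The Jaccard normalization and the toy link data below are illustrative assumptions, not the article's exact definitions:

```python
# Hypothetical link data: out_links[d] = pages d links to; in_links[d] = pages linking to d.
out_links = {"a": {"x", "y", "z"}, "b": {"x", "y", "w"}}
in_links = {"a": {"p", "q"}, "b": {"p", "r"}}

def bibliographic_coupling(d1, d2):
    """Overlap of citations *made by* both documents (Jaccard-normalized)."""
    s1, s2 = out_links[d1], out_links[d2]
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def co_citation(d1, d2):
    """Overlap of documents that cite/link to *both* documents (Jaccard-normalized)."""
    s1, s2 = in_links[d1], in_links[d2]
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0
```

Either measure can then feed a nearest-neighbor style classifier: a page inherits the topic of the pages it is most link-similar to.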
  5. Cota, R.G.; Ferreira, A.A.; Nascimento, C.; Gonçalves, M.A.; Laender, A.H.F.: An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations (2010) 0.00
    
    Abstract
    Name ambiguity in the context of bibliographic citations is a difficult problem which, despite many efforts from the research community, still leaves much room for improvement. In this article, we present a heuristic-based hierarchical clustering method to deal with this problem. The method successively fuses clusters of citations of similar author names based on several heuristics and similarity measures on the components of the citations (e.g., coauthor names, work title, and publication venue title). During the disambiguation task, the information about fused clusters is aggregated, providing more information for the next round of fusion. To demonstrate the effectiveness of our method, we ran a series of experiments on two different collections extracted from real-world digital libraries and compared it, under two metrics, with four representative methods described in the literature. We present comparisons of results using each considered attribute separately (i.e., coauthor names, work title, and publication venue title) with the author name attribute and using all attributes together. These results show that our unsupervised method, when using all attributes, performs competitively against all other methods, under both metrics, losing in only one case against a supervised method, whose result was very close to ours. Moreover, such results are achieved without the burden of any training and without using any privileged information such as knowing a priori the correct number of clusters.
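The successive-fusion idea can be sketched as repeated pairwise merging of citation clusters. The coauthor-overlap similarity, the threshold, and the record layout below are assumptions for illustration; the article uses several heuristics over coauthor names, titles, and venues:

```python
def similar(c1, c2, threshold=0.5):
    """Clusters are fusable if their coauthor-name sets overlap enough (Jaccard)."""
    a = {co for rec in c1 for co in rec["coauthors"]}
    b = {co for rec in c2 for co in rec["coauthors"]}
    return bool(a and b) and len(a & b) / len(a | b) >= threshold

def fuse(clusters):
    """Successively merge the first similar pair until no pair qualifies."""
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if similar(clusters[i], clusters[j]):
                    clusters[i] = clusters[i] + clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

clusters = fuse([
    [{"coauthors": {"silva", "souza"}}],   # hypothetical citation records
    [{"coauthors": {"souza", "silva"}}],
    [{"coauthors": {"oliveira"}}],
])  # the first two clusters fuse; the third stays separate
```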
  6. Silva, R.M.; Gonçalves, M.A.; Veloso, A.: A two-stage active learning method for learning to rank (2014) 0.00
    
    Abstract
    Learning to rank (L2R) algorithms use a labeled training set to generate a ranking model that can later be used to rank new query results. These training sets are costly and laborious to produce, requiring human annotators to assess the relevance or order of the documents in relation to a query. Active learning algorithms are able to reduce the labeling effort by selectively sampling an unlabeled set and choosing data instances that maximize a learning function's effectiveness. In this article, we propose a novel two-stage active learning method for L2R that combines and exploits interesting properties of its constituent parts, thus being effective and practical. In the first stage, an association rule active sampling algorithm is used to select a very small but effective initial training set. In the second stage, a query-by-committee strategy trained with the first-stage set is used to iteratively select more examples until a preset labeling budget is met or a target effectiveness is achieved. We test our method with various LETOR benchmarking data sets and compare it with several baselines to show that it achieves good results using only a small portion of the original training sets.
  7. Ferreira, A.A.; Veloso, A.; Gonçalves, M.A.; Laender, A.H.F.: Self-training author name disambiguation for information scarce scenarios (2014) 0.00
    
    Abstract
    We present a novel 3-step self-training method for author name disambiguation, SAND (Self-training Associative Name Disambiguator), which requires no manual labeling, no parameterization (in real-world scenarios), and is particularly suitable for the common situation in which only the most basic information about a citation record is available (i.e., author names, and work and venue titles). During the first step, real-world heuristics on coauthors are able to produce highly pure (although fragmented) clusters. The most representative of these clusters are then selected to serve as training data for the third, supervised author assignment step. The third step exploits a state-of-the-art transductive disambiguation method capable of detecting unseen authors not included in any training example and incorporating reliable predictions into the training data. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation, demonstrate that our proposed method outperforms all representative unsupervised author grouping disambiguation methods and is very competitive with fully supervised author assignment methods. Thus, unlike other bootstrapping methods that explore privileged, hard-to-obtain information such as self-citations and personal information, our proposed method produces top-notch performance with no (manual) training data or parameterization, even in the presence of scarce information.
  8. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.00
    
    Abstract
    In this work, we investigate the problem of using the block structure of Web pages to improve ranking results. Starting with basic intuitions provided by the concepts of term frequency (TF) and inverse document frequency (IDF), we propose nine block-weight functions to distinguish the impact of term occurrences inside page blocks, instead of inside whole pages. These are then used to compute a modified BM25 ranking function. Using four distinct Web collections, we ran extensive experiments to compare our block-weight ranking formulas with two baselines: (a) a BM25 ranking applied to full pages, and (b) a BM25 ranking that takes into account best blocks. Our results suggest that our block-weight ranking method is superior to all baselines across all collections we used, and that it yields average gains in precision of 5 to 20%.
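The core move, replacing a page-level term frequency with block-weighted occurrence counts before applying BM25, can be sketched as follows. The single weighted-sum block function, the weights, and the BM25 variant below are illustrative assumptions; the article proposes nine block-weight functions:

```python
import math

def bm25_term(tf_eff, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """One term's contribution in a common BM25 formulation."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf_eff * (k1 + 1) / (tf_eff + k1 * (1 - b + b * doc_len / avg_len))

def block_weighted_tf(block_freqs, block_weights):
    """Effective term frequency: occurrences in each block scaled by that block's weight."""
    return sum(w * f for f, w in zip(block_freqs, block_weights))

# E.g., a title block counts more than a navigation block (weights are hypothetical).
tf_eff = block_weighted_tf([2, 1, 3], [2.0, 1.0, 0.25])
score = bm25_term(tf_eff, df=100, n_docs=10_000, doc_len=300, avg_len=250)
```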
  9. Gonçalves, M.A.; Moreira, B.L.; Fox, E.A.; Watson, L.T.: "What is a good digital library?" : a quality model for digital libraries (2007) 0.00
    
    Abstract
    In this article, we elaborate on the meaning of quality in digital libraries (DLs) by proposing a model that is deeply grounded in a formal framework for digital libraries: 5S (Streams, Structures, Spaces, Scenarios, and Societies). For each major DL concept in the framework we formally define a number of dimensions of quality and propose a set of numerical indicators for those quality dimensions. In particular, we consider key concepts of a minimal DL: catalog, collection, digital object, metadata specification, repository, and services. Regarding quality dimensions, we consider: accessibility, accuracy, completeness, composability, conformance, consistency, effectiveness, efficiency, extensibility, pertinence, preservability, relevance, reliability, reusability, significance, similarity, and timeliness. Regarding measurement, we consider characteristics like: response time (with regard to efficiency), cost of migration (with respect to preservability), and number of service failures (to assess reliability). For some key DL concepts, the (quality dimension, numerical indicator) pairs are illustrated through their application to a number of "real-world" digital libraries. We also discuss connections between the proposed dimensions of DL quality and an expanded version of a workshop's consensus view of the life cycle of information in digital libraries. Such connections can be used to determine when and where quality issues can be measured, assessed, and improved - as well as how possible quality problems can be prevented, detected, and eliminated.
  10. Salles, T.; Rocha, L.; Gonçalves, M.A.; Almeida, J.M.; Mourão, F.; Meira Jr., W.; Viegas, F.: A quantitative analysis of the temporal effects on automatic text classification (2016) 0.00
    
    Abstract
    Automatic text classification (TC) continues to be a relevant research topic and several TC algorithms have been proposed. However, the majority of TC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual data sets. We provide evidence of the existence of temporal effects in three textual data sets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known TC algorithms. We show that these temporal effects affect each analyzed data set differently and that they restrict the performance of each considered TC algorithm to different extents. The reported quantitative analyses, which are the original contributions of this article, provide valuable new insights to better understand the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models.
  11. Cavalcante Dourado, Í.; Galante, R.; Gonçalves, M.A.; Silva Torres, R. de: Bag of textual graphs (BoTG) : a general graph-based text representation model (2019) 0.00
    
    Abstract
    Text representation models are the fundamental basis for information retrieval and text mining tasks. Although different text models have been proposed, they typically target specific task aspects in isolation, such as time efficiency, accuracy, or applicability to different scenarios. Here we present Bag of Textual Graphs (BoTG), a general text representation model that addresses these three requirements at the same time. The proposed textual representation is based on a graph-based scheme that encodes term proximity and term ordering, and maps text documents into an efficient vector space that addresses all these aspects as well as provides discriminative textual patterns. Extensive experiments are conducted in two experimental scenarios, classification and retrieval, considering multiple well-known text collections. We also compare our model against several methods from the literature. Experimental results demonstrate that our model is generic enough to handle different tasks and collections. It is also more efficient than the widely used state-of-the-art methods in textual classification and retrieval tasks, with competitive effectiveness, sometimes with gains by large margins.
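A graph that encodes term proximity and ordering, the building block of a BoTG-style representation, can be sketched with a sliding window. The window size and edge weighting are assumptions, and the article's projection of graphs into a vector space is omitted:

```python
from collections import defaultdict

def text_graph(tokens, window=2):
    """Directed edges from each term to the next `window` terms, preserving order."""
    edges = defaultdict(int)
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + 1 + window]:
            edges[(t, u)] += 1  # edge weight = number of close co-occurrences
    return dict(edges)

g = text_graph("the quick brown fox".split())
```

Because edges are directed, the graph distinguishes "quick brown" from "brown quick", which a plain bag of words cannot.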
  12. Dalip, D.H.; Gonçalves, M.A.; Cristo, M.; Calado, P.: A general multiview framework for assessing the quality of collaboratively created content on web 2.0 (2017) 0.00
    
    Date
    16.11.2017 13:04:22
  13. Belém, F.M.; Almeida, J.M.; Gonçalves, M.A.: A survey on tag recommendation methods : a review (2017) 0.00
    
    Date
    16.11.2017 13:30:22
  14. Cortez, E.; Silva, A.S. da; Gonçalves, M.A.; Mesquita, F.; Moura, E.S. de: A flexible approach for extracting metadata from bibliographic citations (2009) 0.00
    
    Abstract
    In this article we present FLUX-CiM, a novel method for extracting components (e.g., author names, article titles, venues, page numbers) from bibliographic citations. Our method does not rely on patterns encoding specific delimiters used in a particular citation style. This feature yields a high degree of automation and flexibility, and allows FLUX-CiM to extract components from citations in any given format. Unlike previous methods based on models learned from user-driven training, our method relies on a knowledge base automatically constructed from an existing set of sample metadata records from a given field (e.g., computer science, health sciences, social sciences, etc.). These records are usually available on the Web or in other public data repositories. To demonstrate the effectiveness and applicability of our proposed method, we present a series of experiments in which we apply it to extract bibliographic data from citations in articles of different fields. Results of these experiments exhibit precision and recall levels above 94% for all fields, and perfect extraction for the large majority of citations tested. In addition, in a comparison against a state-of-the-art information-extraction method, ours produced superior results without the training phase required by that method. Finally, we present a strategy for using bibliographic data resulting from the extraction process with FLUX-CiM to automatically update and expand the knowledge base of a given domain. We show that this strategy can be used to achieve good extraction results even if only a very small initial sample of bibliographic records is available for building the knowledge base.
  15. Silva, A.J.C.; Gonçalves, M.A.; Laender, A.H.F.; Modesto, M.A.B.; Cristo, M.; Ziviani, N.: Finding what is missing from a digital library : a case study in the computer science field (2009) 0.00
    
    Abstract
    This article proposes a process to retrieve the URL of a document for which metadata records exist in a digital library catalog but a pointer to the full text of the document is not available. The process uses results from queries submitted to Web search engines to find the URL of the corresponding full text or any related material. We present a comprehensive study of this process in different situations by investigating different query strategies applied to three general-purpose search engines (Google, Yahoo!, MSN) and two specialized ones (Scholar and CiteSeer), considering five user scenarios. Specifically, we have conducted experiments with metadata records taken from the Brazilian Digital Library of Computing (BDBComp) and The DBLP Computer Science Bibliography (DBLP). We found that Scholar was the most effective search engine for this task in all considered scenarios and that simple strategies for combining and re-ranking results from Scholar and Google significantly improve the retrieval quality. Moreover, we study the influence of the number of query results on the effectiveness of finding missing information, as well as the coverage of the proposed scenarios.