Search (6 results, page 1 of 1)

Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.02

```
0.024373945 = product of:
  0.036560915 = sum of:
    0.00890397 = weight(_text_:a in 4119) [ClassicSimilarity], result of:
      0.00890397 = score(doc=4119,freq=10.0), product of:
        0.05209492 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.045180224 = queryNorm
        0.1709182 = fieldWeight in 4119, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.046875 = fieldNorm(doc=4119)
    0.027656946 = product of:
      0.055313893 = sum of:
        0.055313893 = weight(_text_:de in 4119) [ClassicSimilarity], result of:
          0.055313893 = score(doc=4119,freq=2.0), product of:
            0.19416152 = queryWeight, product of:
              4.297489 = idf(docFreq=1634, maxDocs=44218)
              0.045180224 = queryNorm
            0.28488597 = fieldWeight in 4119, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.297489 = idf(docFreq=1634, maxDocs=44218)
              0.046875 = fieldNorm(doc=4119)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
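
The score explanation above for doc 4119 can be reproduced by hand. The sketch below assumes the usual ClassicSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which are consistent with the numbers shown. The function and constant names are illustrative and do not come from the listing; every input value is copied from the tree.

```python
import math

# Values copied from the explanation tree for doc 4119.
QUERY_NORM = 0.045180224
MAX_DOCS = 44218
FIELD_NORM = 0.046875


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as assumed for ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_weight(freq: float, doc_freq: int, max_docs: int,
                query_norm: float, field_norm: float) -> float:
    """Score of one query term in one document: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm                    # idf * queryNorm
    field_weight = math.sqrt(freq) * idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


# Term "a": freq=10, docFreq=37942. Term "de": freq=2, docFreq=1634.
w_a = term_weight(10.0, 37942, MAX_DOCS, QUERY_NORM, FIELD_NORM)
w_de = term_weight(2.0, 1634, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# The "de" clause sits in a nested group with coord(1/2);
# the outer query matched 2 of 3 clauses, hence coord(2/3).
score = (w_a + w_de * 0.5) * (2.0 / 3.0)

print(f"weight(a)  = {w_a:.9f}")   # the listing reports 0.00890397
print(f"weight(de) = {w_de:.9f}")  # the listing reports 0.055313893
print(f"score      = {score:.9f}")  # the listing reports 0.024373945
```

The listing uses single-precision floats, so a double-precision recomputation should be expected to agree only to about six significant digits.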

Abstract: In this work, we investigate the problem of using the block structure of Web pages to improve ranking results. Starting with basic intuitions provided by the concepts of term frequency (TF) and inverse document frequency (IDF), we propose nine block-weight functions to distinguish the impact of term occurrences inside page blocks, instead of inside whole pages. These are then used to compute a modified BM25 ranking function. Using four distinct Web collections, we ran extensive experiments to compare our block-weight ranking formulas with two other baselines: (a) a BM25 ranking applied to full pages, and (b) a BM25 ranking that takes into account best blocks. Our methods suggest that our block-weighting ranking method is superior to all baselines across all collections we used and that average gain in precision figures from 5 to 20% are generated.
Type: a

Calado, P.; Cristo, M.; Gonçalves, M.A.; Moura, E.S. de; Ribeiro-Neto, B.; Ziviani, N.: Link-based similarity measures for the classification of Web documents (2006) 0.02
```
0.021217927 = product of:
  0.03182689 = sum of:
    0.008779433 = weight(_text_:a in 4921) [ClassicSimilarity], result of:
      0.008779433 = score(doc=4921,freq=14.0), product of:
        0.05209492 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.045180224 = queryNorm
        0.1685276 = fieldWeight in 4921, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=4921)
    0.023047457 = product of:
      0.046094913 = sum of:
        0.046094913 = weight(_text_:de in 4921) [ClassicSimilarity], result of:
          0.046094913 = score(doc=4921,freq=2.0), product of:
            0.19416152 = queryWeight, product of:
              4.297489 = idf(docFreq=1634, maxDocs=44218)
              0.045180224 = queryNorm
            0.23740499 = fieldWeight in 4921, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.297489 = idf(docFreq=1634, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4921)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.

Type

a
Ribeiro-Neto, B.; Laender, A.H.F.; Lima, L.R.S. de: An experimental study in automatically categorizing medical documents (2001) 0.02
```
0.020783756 = product of:
  0.031175632 = sum of:
    0.008128175 = weight(_text_:a in 5702) [ClassicSimilarity], result of:
      0.008128175 = score(doc=5702,freq=12.0), product of:
        0.05209492 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.045180224 = queryNorm
        0.15602624 = fieldWeight in 5702, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5702)
    0.023047457 = product of:
      0.046094913 = sum of:
        0.046094913 = weight(_text_:de in 5702) [ClassicSimilarity], result of:
          0.046094913 = score(doc=5702,freq=2.0), product of:
            0.19416152 = queryWeight, product of:
              4.297489 = idf(docFreq=1634, maxDocs=44218)
              0.045180224 = queryNorm
            0.23740499 = fieldWeight in 5702, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.297489 = idf(docFreq=1634, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5702)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

In this article, we evaluate the retrieval performance of an algorithm that automatically categorizes medical documents. The categorization, which consists in assigning an International Code of Disease (ICD) to the medical document under examination, is based on well-known information retrieval techniques. The algorithm, which we proposed, operates in a fully automatic mode and requires no supervision or training data. Using a database of 20,569 documents, we verify that the algorithm attains levels of average precision in the 70-80% range for category coding and in the 60-70% range for subcategory coding. We also carefully analyze the case of those documents whose categorization is not in accordance with the one provided by the human specialists. The vast majority of them represent cases that can only be fully categorized with the assistance of a human subject (because, for instance, they require specific knowledge of a given pathology). For a slim fraction of all documents (0.77% for category coding and 1.4% for subcategory coding), the algorithm makes assignments that are clearly incorrect. However, this fraction corresponds to only one-fourth of the mistakes made by the human specialists.

Type

a

Silveira, M.; Ribeiro-Neto, B.: Concept-based ranking : a case study in the juridical domain (2004) 0.00

```
0.003754243 = product of:
  0.011262729 = sum of:
    0.011262729 = weight(_text_:a in 2339) [ClassicSimilarity], result of:
      0.011262729 = score(doc=2339,freq=4.0), product of:
        0.05209492 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.045180224 = queryNorm
        0.2161963 = fieldWeight in 2339, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.09375 = fieldNorm(doc=2339)
  0.33333334 = coord(1/3)
```

Type: a

Couto, T.; Cristo, M.; Gonçalves, M.A.; Calado, P.; Ziviani, N.; Moura, E.; Ribeiro-Neto, B.: A comparative study of citations and links in document classification (2006) 0.00
```
0.0036685336 = product of:
  0.011005601 = sum of:
    0.011005601 = weight(_text_:a in 2531) [ClassicSimilarity], result of:
      0.011005601 = score(doc=2531,freq=22.0), product of:
        0.05209492 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.045180224 = queryNorm
        0.21126054 = fieldWeight in 2531, product of:
          4.690416 = tf(freq=22.0), with freq of:
            22.0 = termFreq=22.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2531)
  0.33333334 = coord(1/3)
```
Abstract

It is well known that links are an important source of information when dealing with Web collections. However, the question remains on whether the same techniques that are used on the Web can be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links, in the context of automatic text classification. We show that there are in fact differences between citations and links in this context. For the comparison, we run a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains up to 37% over text based classifiers, while measures based on bibliographic coupling perform better in a digital library. We also propose a simple and effective way of combining a traditional text based classifier with a citation-link based classifier. This combination is based on the notion of classifier reliability and presented gains of up to 14% in micro-averaged F1 in the Web collection. However, no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes for these results. We discovered that misclassifications by the citation-link based classifiers are in fact difficult cases, hard to classify even for humans.

Type

a
Pereira, D.A.; Ribeiro-Neto, B.; Ziviani, N.; Laender, A.H.F.; Gonçalves, M.A.: A generic Web-based entity resolution framework (2011) 0.00
```
0.0033183135 = product of:
  0.0099549405 = sum of:
    0.0099549405 = weight(_text_:a in 4450) [ClassicSimilarity], result of:
      0.0099549405 = score(doc=4450,freq=18.0), product of:
        0.05209492 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.045180224 = queryNorm
        0.19109234 = fieldWeight in 4450, product of:
          4.2426405 = tf(freq=18.0), with freq of:
            18.0 = termFreq=18.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=4450)
  0.33333334 = coord(1/3)
```
Abstract

Web data repositories usually contain references to thousands of real-world entities from multiple sources. It is not uncommon that multiple entities share the same label (polysemes) and that distinct label variations are associated with the same entity (synonyms), which frequently leads to ambiguous interpretations. Further, spelling variants, acronyms, abbreviated forms, and misspellings compound to worsen the problem. Solving this problem requires identifying which labels correspond to the same real-world entity, a process known as entity resolution. One approach to solve the entity resolution problem is to associate an authority identifier and a list of variant forms with each entity, a data structure known as an authority file. In this work, we propose a generic framework for implementing a method for generating authority files. Our method uses information from the Web to improve the quality of the authority file and, because of that, is referred to as WER (Web-based Entity Resolution). Our contribution here is threefold: (a) we discuss how to implement the WER framework, which is flexible and easy to adapt to new domains; (b) we run extended experimentation with our WER framework to show that it outperforms selected baselines; and (c) we compare the results of a specialized solution for author name resolution with those produced by the generic WER framework, and show that the WER results remain competitive.
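
The authority file described in this abstract (an authority identifier plus a list of variant forms per entity) might be modelled as in the minimal sketch below. This is not the WER method itself: the class names, the identifier `E1`, and the whitespace and case normalization are all assumptions made for illustration, and the sketch ignores polysemes, where one label belongs to several entities.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class AuthorityRecord:
    """One entity: an authority identifier and its known variant forms."""
    authority_id: str
    variants: Set[str] = field(default_factory=set)


class AuthorityFile:
    """Hypothetical in-memory authority file with a variant lookup index."""

    def __init__(self) -> None:
        self._records: Dict[str, AuthorityRecord] = {}
        self._by_variant: Dict[str, str] = {}

    @staticmethod
    def _normalize(label: str) -> str:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(label.lower().split())

    def add_variant(self, authority_id: str, label: str) -> None:
        """Attach a variant form to an entity, creating the record if needed."""
        record = self._records.setdefault(
            authority_id, AuthorityRecord(authority_id)
        )
        record.variants.add(label)
        # A later entity with the same normalized label would overwrite this
        # entry; handling such polysemes is out of scope for the sketch.
        self._by_variant[self._normalize(label)] = authority_id

    def resolve(self, label: str) -> Optional[str]:
        """Return the authority identifier for a label, or None if unknown."""
        return self._by_variant.get(self._normalize(label))


authority_file = AuthorityFile()
authority_file.add_variant("E1", "Web-based Entity Resolution")
authority_file.add_variant("E1", "WER")  # acronym as a variant form

print(authority_file.resolve("wer"))
print(authority_file.resolve("unseen label"))
```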

Type

a
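
The three hits that match only the term `a` (docs 2339, 2531, 4450) show how fieldNorm can outweigh raw term frequency: doc 2339 has the lowest freq but the highest score. The sketch below recomputes the three scores from the values in the listing, assuming the same ClassicSimilarity product as in the explanation trees; the variable and function names are illustrative.

```python
import math

# Shared values copied from the explanation trees.
QUERY_NORM = 0.045180224
IDF_A = 1.153047        # idf for the term "a"
COORD = 1.0 / 3.0       # only 1 of 3 query clauses matched

# doc id -> (termFreq, fieldNorm), as listed for each single-term hit.
hits = {
    2339: (4.0, 0.09375),
    2531: (22.0, 0.0390625),
    4450: (18.0, 0.0390625),
}


def single_term_score(freq: float, field_norm: float) -> float:
    """queryWeight * fieldWeight * coord for a hit on the term "a" only."""
    query_weight = IDF_A * QUERY_NORM
    field_weight = math.sqrt(freq) * IDF_A * field_norm
    return query_weight * field_weight * COORD


scores = {doc: single_term_score(freq, norm) for doc, (freq, norm) in hits.items()}
ranking = sorted(scores, key=lambda doc: scores[doc], reverse=True)

for doc in ranking:
    freq, norm = hits[doc]
    print(f"doc={doc} freq={freq:>4} fieldNorm={norm:<9} score={scores[doc]:.9f}")
```

Because tf grows only with the square root of freq, the jump from 4 to 22 occurrences multiplies tf by roughly 2.3, while the fieldNorm of doc 2339 is 2.4 times that of the other two documents, which is enough to keep it ranked first.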
