Search (3 results, page 1 of 1)

  • author_ss:"Gravano, L."
  • year_i:[2010 TO 2020}
  1. Barrio, P.; Gravano, L.: Sampling strategies for information extraction over the deep web (2017) 0.01
    0.005428266 = product of:
      0.021713063 = sum of:
        0.021713063 = weight(_text_:information in 3412) [ClassicSimilarity], result of:
          0.021713063 = score(doc=3412,freq=20.0), product of:
            0.08850355 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.050415643 = queryNorm
            0.2453355 = fieldWeight in 3412, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.03125 = fieldNorm(doc=3412)
      0.25 = coord(1/4)
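
The indented tree above is Lucene's "explain" output for ClassicSimilarity (TF-IDF) scoring; the trees under hits 2 and 3 have the same shape with different term frequencies. As a sanity check, this short Python sketch reproduces the first hit's score from the constants in the tree (only the idf formula in the comment is inferred; it is the standard ClassicSimilarity definition):

```python
import math

# Constants copied from the explain tree for doc 3412
freq       = 20.0         # termFreq of "information" in the field
idf        = 1.7554779    # log(44218 / (20772 + 1)) + 1, per ClassicSimilarity
query_norm = 0.050415643  # queryNorm
field_norm = 0.03125      # fieldNorm(doc=3412)
coord      = 1 / 4        # coord(1/4): 1 of 4 query clauses matched

tf = math.sqrt(freq)                  # 4.472136 = tf(freq=20.0)
field_weight = tf * idf * field_norm  # 0.2453355 = fieldWeight
query_weight = idf * query_norm       # 0.08850355 = queryWeight
final_score = coord * query_weight * field_weight

print(final_score)  # ~0.005428266, the score listed for this hit
```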
    
    Abstract
    Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than is possible over the raw natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical interest. In this paper, we focus on an especially valuable family of text collections, namely, the so-called deep-web text collections, whose contents are not crawlable and are only available via querying. Important steps for efficient information extraction over deep-web text collections (e.g., selecting the collections on which to focus the extraction effort, based on their contents; or learning which documents within these collections to process, and in which order, based on their words and phrases) require a representative document sample from each collection. These document samples have to be collected by querying the deep-web text collections, an expensive process that renders the existing sampling approaches developed for other data scenarios impractical. In this paper, we systematically study the space of query-based document sampling techniques for information extraction over the deep web. Specifically, we consider (i) alternative query execution schedules, which vary in how they account for query effectiveness, and (ii) alternative document retrieval and processing schedules, which vary in how they distribute the extraction effort over documents. We report the results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web. Our results show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
    Source
    Information Processing and Management. 53(2017) no.2, pp.309-331
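
The abstract above contrasts query execution schedules by whether they account for query effectiveness. As a rough illustration of the effectiveness-aware idea (a minimal sketch, not the authors' algorithm: `search`, the yield-based scoring rule, and all names here are hypothetical), a greedy scheduler could keep reissuing whichever query has most recently returned previously unseen documents:

```python
def sample_deep_web_collection(queries, search, max_docs=500):
    """Greedy, effectiveness-aware query scheduling (illustrative only).
    `search(query, page)` stands in for the collection's query-only
    interface and returns a list of document ids for that result page."""
    effectiveness = {q: 1.0 for q in queries}  # optimistic initial estimates
    next_page = {q: 0 for q in queries}
    sample = set()
    while len(sample) < max_docs and any(effectiveness.values()):
        query = max(queries, key=effectiveness.get)  # best-yielding query
        results = search(query, next_page[query])
        next_page[query] += 1
        new_docs = set(results) - sample
        sample.update(new_docs)
        # estimate: fraction of this page's results that were new
        effectiveness[query] = len(new_docs) / len(results) if results else 0.0
    return sample
```

A document retrieval and processing schedule, the abstract's second dimension, would then decide which sampled documents the expensive extractor actually processes, and in what order.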
  2. Naaman, M.; Becker, H.; Gravano, L.: Hip and trendy : characterizing emerging trends on Twitter (2011) 0.00
    0.004797954 = product of:
      0.019191816 = sum of:
        0.019191816 = weight(_text_:information in 4448) [ClassicSimilarity], result of:
          0.019191816 = score(doc=4448,freq=10.0), product of:
            0.08850355 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.050415643 = queryNorm
            0.21684799 = fieldWeight in 4448, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4448)
      0.25 = coord(1/4)
    
    Abstract
    Twitter, Facebook, and other related systems that we call social awareness streams are rapidly changing the information and communication dynamics of our society. These systems, where hundreds of millions of users share short messages in real time, expose the aggregate interests and attention of global and local communities. In particular, emerging temporal trends in these systems, especially those related to a single geographic area, are a significant and revealing source of information for, and about, a local community. This study makes two essential contributions for interpreting emerging temporal trends in these information systems. First, based on a large dataset of Twitter messages from one geographic area, we develop a taxonomy of the trends present in the data. Second, we identify important dimensions according to which trends can be categorized, as well as the key distinguishing features of trends that can be derived from their associated messages. We quantitatively examine the computed features for different categories of trends, and establish that significant differences can be detected across categories. Our study advances the understanding of trends on Twitter and other social awareness streams, which will enable powerful applications and activities, including user-driven real-time information services for local communities.
    Source
    Journal of the American Society for Information Science and Technology. 62(2011) no.5, pp.902-918
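
As a toy illustration of the message-derived trend features the abstract above mentions (not a feature from the paper; the smoothing and threshold are arbitrary), one could score how sharply a term's frequency in the current time window departs from its historical rate:

```python
from collections import Counter

def burst_scores(current_msgs, past_msgs, min_count=5):
    """Toy trend feature: ratio of a term's rate in the current window
    to its smoothed historical rate. Higher = more 'emerging'."""
    now = Counter(w for m in current_msgs for w in m.lower().split())
    past = Counter(w for m in past_msgs for w in m.lower().split())
    n_now, n_past = max(sum(now.values()), 1), max(sum(past.values()), 1)
    return {
        term: (count / n_now) / ((past[term] + 1) / n_past)  # add-one smoothing
        for term, count in now.items() if count >= min_count
    }
```

High-scoring terms would be candidate emerging trends for the window; the paper's contribution is the taxonomy and the richer feature set used to categorize such trends, not the detection step itself.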
  3. McKeown, K.; Daumé III, H.; Chaturvedi, S.; Paparrizos, J.; Thadani, K.; Barrio, P.; Biran, O.; Bothe, S.; Collins, M.; Fleischmann, K.R.; Gravano, L.; Jha, R.; King, B.; McInerney, K.; Moon, T.; Neelakantan, A.; Ó Séaghdha, D.; Radev, D.; Templeton, C.; Teufel, S.: Predicting the impact of scientific concepts using full-text features (2016) 0.00
    0.0037164795 = product of:
      0.014865918 = sum of:
        0.014865918 = weight(_text_:information in 3153) [ClassicSimilarity], result of:
          0.014865918 = score(doc=3153,freq=6.0), product of:
            0.08850355 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.050415643 = queryNorm
            0.16796975 = fieldWeight in 3153, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3153)
      0.25 = coord(1/4)
    
    Abstract
    New scientific concepts, interpreted broadly, are continuously introduced in the literature, but relatively few concepts have a long-term impact on society. The identification of such concepts is a challenging prediction task that would help multiple parties, including researchers and the general public, focus their attention within the vast scientific literature. In this paper we present a system that predicts the future impact of a scientific concept, represented as a technical term, based on the information available from recently published research articles. We analyze the usefulness of rich features derived from the full text of the articles through a variety of approaches, including rhetorical sentence analysis, information extraction, and time-series analysis. The results from two large-scale experiments with 3.8 million full-text articles and 48 million metadata records support the conclusion that full-text features are significantly more useful for prediction than metadata-only features and that the most accurate predictions result from combining the metadata and full-text features. Surprisingly, these results hold even when the metadata features are available for a much larger number of documents than are available for the full-text features.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.11, pp.2684-2696
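
The headline comparison in entry 3, metadata-only features versus metadata plus full-text features, has a simple experimental shape. The sketch below shows that shape only, with random placeholder matrices standing in for the paper's features and a plain logistic regression standing in for its models (so the printed accuracies will hover near chance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
meta_X = rng.random((1000, 10))   # placeholder metadata features
text_X = rng.random((1000, 40))   # placeholder full-text features
y = rng.integers(0, 2, 1000)      # 1 = concept had long-term impact

for name, X in [("metadata only", meta_X),
                ("metadata + full text", np.hstack([meta_X, text_X]))]:
    model = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
    print(name, model.score(X[800:], y[800:]))  # held-out accuracy
```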