Document (#40413)

Author
Barrio, P.
Gravano, L.
Title
Sampling strategies for information extraction over the deep web
Source
Information processing and management. 53(2017) no.2, pp.309-331
Year
2017
Abstract
Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than possible over the natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical interest. In this paper, we focus on an especially valuable family of text collections, namely, the so-called deep-web text collections, whose contents are not crawlable and are only available via querying. Important steps for efficient information extraction over deep-web text collections (e.g., selecting the collections on which to focus the extraction effort, based on their contents; or learning which documents within these collections, and in which order, to process, based on their words and phrases) require having a representative document sample from each collection. These document samples have to be collected by querying the deep-web text collections, an expensive process that renders impractical the existing sampling approaches developed for other data scenarios. In this paper, we systematically study the space of query-based document sampling techniques for information extraction over the deep web. Specifically, we consider (i) alternative query execution schedules, which vary on how they account for the query effectiveness, and (ii) alternative document retrieval and processing schedules, which vary on how they distribute the extraction effort over documents. We report the results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web. Our results show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
Content
Cf.: http://www.sciencedirect.com/science/article/pii/S0306457316306318 [http://dx.doi.org/10.1016/j.ipm.2016.11.006].
Theme
Internet
Search tactics

Similar documents (content)

  1. Zhang, M.; Zhou, G.D.; Aw, A.: Exploring syntactic structured features over parse trees for relation extraction using kernel methods (2008) 0.21
    0.20670453 = sum of:
      0.20670453 = product of:
        0.8612689 = sum of:
          0.059096463 = weight(abstract_txt:structured in 2055) [ClassicSimilarity], result of:
            0.059096463 = score(doc=2055,freq=4.0), product of:
              0.086888395 = queryWeight, product of:
                1.0986506 = boost
                5.4411373 = idf(docFreq=520, maxDocs=44218)
                0.014534915 = queryNorm
              0.68014216 = fieldWeight in 2055, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.4411373 = idf(docFreq=520, maxDocs=44218)
                0.0625 = fieldNorm(doc=2055)
          0.02028575 = weight(abstract_txt:information in 2055) [ClassicSimilarity], result of:
            0.02028575 = score(doc=2055,freq=3.0), product of:
              0.07740433 = queryWeight, product of:
                2.199721 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014534915 = queryNorm
              0.26207513 = fieldWeight in 2055, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0625 = fieldNorm(doc=2055)
          0.042454228 = weight(abstract_txt:text in 2055) [ClassicSimilarity], result of:
            0.042454228 = score(doc=2055,freq=1.0), product of:
              0.16797478 = queryWeight, product of:
                2.8578193 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014534915 = queryNorm
              0.25274166 = fieldWeight in 2055, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=2055)
          0.04909141 = weight(abstract_txt:over in 2055) [ClassicSimilarity], result of:
            0.04909141 = score(doc=2055,freq=1.0), product of:
              0.18505485 = queryWeight, product of:
                2.9995975 = boost
                4.244485 = idf(docFreq=1723, maxDocs=44218)
                0.014534915 = queryNorm
              0.2652803 = fieldWeight in 2055, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.244485 = idf(docFreq=1723, maxDocs=44218)
                0.0625 = fieldNorm(doc=2055)
          0.15711987 = weight(abstract_txt:deep in 2055) [ClassicSimilarity], result of:
            0.15711987 = score(doc=2055,freq=1.0), product of:
              0.38177013 = queryWeight, product of:
                3.9887822 = boost
                6.5848994 = idf(docFreq=165, maxDocs=44218)
                0.014534915 = queryNorm
              0.4115562 = fieldWeight in 2055, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5848994 = idf(docFreq=165, maxDocs=44218)
                0.0625 = fieldNorm(doc=2055)
          0.5332211 = weight(abstract_txt:extraction in 2055) [ClassicSimilarity], result of:
            0.5332211 = score(doc=2055,freq=6.0), product of:
              0.5625381 = queryWeight, product of:
                6.250859 = boost
                6.1915555 = idf(docFreq=245, maxDocs=44218)
                0.014534915 = queryNorm
              0.9478845 = fieldWeight in 2055, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                6.1915555 = idf(docFreq=245, maxDocs=44218)
                0.0625 = fieldNorm(doc=2055)
        0.24 = coord(6/25)
    
  2. Goh, A.; Hui, S.C.: TES: a text extraction system (1996) 0.20
    0.1970413 = sum of:
      0.1970413 = product of:
        0.82100546 = sum of:
          0.03200974 = weight(abstract_txt:process in 6599) [ClassicSimilarity], result of:
            0.03200974 = score(doc=6599,freq=1.0), product of:
              0.0722438 = queryWeight, product of:
                1.2269437 = boost
                4.0510116 = idf(docFreq=2091, maxDocs=44218)
                0.014534915 = queryNorm
              0.44307938 = fieldWeight in 6599, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0510116 = idf(docFreq=2091, maxDocs=44218)
                0.109375 = fieldNorm(doc=6599)
          0.019912343 = weight(abstract_txt:which in 6599) [ClassicSimilarity], result of:
            0.019912343 = score(doc=6599,freq=1.0), product of:
              0.06241807 = queryWeight, product of:
                1.4723257 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.014534915 = queryNorm
              0.31901568 = fieldWeight in 6599, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.109375 = fieldNorm(doc=6599)
          0.08976753 = weight(abstract_txt:document in 6599) [ClassicSimilarity], result of:
            0.08976753 = score(doc=6599,freq=2.0), product of:
              0.13519634 = queryWeight, product of:
                2.1668618 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.014534915 = queryNorm
              0.663979 = fieldWeight in 6599, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.109375 = fieldNorm(doc=6599)
          0.03550006 = weight(abstract_txt:information in 6599) [ClassicSimilarity], result of:
            0.03550006 = score(doc=6599,freq=3.0), product of:
              0.07740433 = queryWeight, product of:
                2.199721 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014534915 = queryNorm
              0.45863146 = fieldWeight in 6599, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.109375 = fieldNorm(doc=6599)
          0.105068855 = weight(abstract_txt:text in 6599) [ClassicSimilarity], result of:
            0.105068855 = score(doc=6599,freq=2.0), product of:
              0.16797478 = queryWeight, product of:
                2.8578193 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014534915 = queryNorm
              0.6255037 = fieldWeight in 6599, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.109375 = fieldNorm(doc=6599)
          0.5387469 = weight(abstract_txt:extraction in 6599) [ClassicSimilarity], result of:
            0.5387469 = score(doc=6599,freq=2.0), product of:
              0.5625381 = queryWeight, product of:
                6.250859 = boost
                6.1915555 = idf(docFreq=245, maxDocs=44218)
                0.014534915 = queryNorm
              0.9577074 = fieldWeight in 6599, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.1915555 = idf(docFreq=245, maxDocs=44218)
                0.109375 = fieldNorm(doc=6599)
        0.24 = coord(6/25)
    
  3. Rui, Y.; Ortega, M.; Huang, T.S.; Mehrotra, S.: Information retrieval beyond the text document (1999) 0.19
    0.19091944 = sum of:
      0.19091944 = product of:
        0.68185514 = sum of:
          0.04441484 = weight(abstract_txt:efficient in 846) [ClassicSimilarity], result of:
            0.04441484 = score(doc=846,freq=1.0), product of:
              0.09825459 = queryWeight, product of:
                1.168302 = boost
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.014534915 = queryNorm
              0.45203832 = fieldWeight in 846, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.078125 = fieldNorm(doc=846)
          0.12786576 = weight(abstract_txt:execution in 846) [ClassicSimilarity], result of:
            0.12786576 = score(doc=846,freq=1.0), product of:
              0.19883995 = queryWeight, product of:
                1.6619982 = boost
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.014534915 = queryNorm
              0.6430587 = fieldWeight in 846, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.078125 = fieldNorm(doc=846)
          0.04926239 = weight(abstract_txt:query in 846) [ClassicSimilarity], result of:
            0.04926239 = score(doc=846,freq=1.0), product of:
              0.13264404 = queryWeight, product of:
                1.9197187 = boost
                4.7537646 = idf(docFreq=1035, maxDocs=44218)
                0.014534915 = queryNorm
              0.37138787 = fieldWeight in 846, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7537646 = idf(docFreq=1035, maxDocs=44218)
                0.078125 = fieldNorm(doc=846)
          0.020704055 = weight(abstract_txt:information in 846) [ClassicSimilarity], result of:
            0.020704055 = score(doc=846,freq=2.0), product of:
              0.07740433 = queryWeight, product of:
                2.199721 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014534915 = queryNorm
              0.2674793 = fieldWeight in 846, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.078125 = fieldNorm(doc=846)
          0.10613557 = weight(abstract_txt:text in 846) [ClassicSimilarity], result of:
            0.10613557 = score(doc=846,freq=4.0), product of:
              0.16797478 = queryWeight, product of:
                2.8578193 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014534915 = queryNorm
              0.6318542 = fieldWeight in 846, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=846)
          0.061364256 = weight(abstract_txt:over in 846) [ClassicSimilarity], result of:
            0.061364256 = score(doc=846,freq=1.0), product of:
              0.18505485 = queryWeight, product of:
                2.9995975 = boost
                4.244485 = idf(docFreq=1723, maxDocs=44218)
                0.014534915 = queryNorm
              0.33160037 = fieldWeight in 846, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.244485 = idf(docFreq=1723, maxDocs=44218)
                0.078125 = fieldNorm(doc=846)
          0.27210826 = weight(abstract_txt:extraction in 846) [ClassicSimilarity], result of:
            0.27210826 = score(doc=846,freq=1.0), product of:
              0.5625381 = queryWeight, product of:
                6.250859 = boost
                6.1915555 = idf(docFreq=245, maxDocs=44218)
                0.014534915 = queryNorm
              0.48371527 = fieldWeight in 846, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1915555 = idf(docFreq=245, maxDocs=44218)
                0.078125 = fieldNorm(doc=846)
        0.28 = coord(7/25)
    
  4. Suakkaphong, N.; Zhang, Z.; Chen, H.: Disease named entity recognition using semisupervised learning and conditional random fields (2011) 0.19
    0.19004257 = sum of:
      0.19004257 = product of:
        0.67872345 = sum of:
          0.024665399 = weight(abstract_txt:strategies in 4367) [ClassicSimilarity], result of:
            0.024665399 = score(doc=4367,freq=1.0), product of:
              0.07703112 = queryWeight, product of:
                1.0344555 = boost
                5.123207 = idf(docFreq=715, maxDocs=44218)
                0.014534915 = queryNorm
              0.32020044 = fieldWeight in 4367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.123207 = idf(docFreq=715, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.029548232 = weight(abstract_txt:structured in 4367) [ClassicSimilarity], result of:
            0.029548232 = score(doc=4367,freq=1.0), product of:
              0.086888395 = queryWeight, product of:
                1.0986506 = boost
                5.4411373 = idf(docFreq=520, maxDocs=44218)
                0.014534915 = queryNorm
              0.34007108 = fieldWeight in 4367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4411373 = idf(docFreq=520, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.023423966 = weight(abstract_txt:information in 4367) [ClassicSimilarity], result of:
            0.023423966 = score(doc=4367,freq=4.0), product of:
              0.07740433 = queryWeight, product of:
                2.199721 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014534915 = queryNorm
              0.3026183 = fieldWeight in 4367, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.060039345 = weight(abstract_txt:text in 4367) [ClassicSimilarity], result of:
            0.060039345 = score(doc=4367,freq=2.0), product of:
              0.16797478 = queryWeight, product of:
                2.8578193 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.014534915 = queryNorm
              0.3574307 = fieldWeight in 4367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.16679995 = weight(abstract_txt:sampling in 4367) [ClassicSimilarity], result of:
            0.16679995 = score(doc=4367,freq=1.0), product of:
              0.34706813 = queryWeight, product of:
                3.1052823 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.014534915 = queryNorm
              0.48059714 = fieldWeight in 4367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.06639117 = weight(abstract_txt:collections in 4367) [ClassicSimilarity], result of:
            0.06639117 = score(doc=4367,freq=1.0), product of:
              0.22630998 = queryWeight, product of:
                3.317146 = boost
                4.693822 = idf(docFreq=1099, maxDocs=44218)
                0.014534915 = queryNorm
              0.29336387 = fieldWeight in 4367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.693822 = idf(docFreq=1099, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
          0.30785537 = weight(abstract_txt:extraction in 4367) [ClassicSimilarity], result of:
            0.30785537 = score(doc=4367,freq=2.0), product of:
              0.5625381 = queryWeight, product of:
                6.250859 = boost
                6.1915555 = idf(docFreq=245, maxDocs=44218)
                0.014534915 = queryNorm
              0.54726136 = fieldWeight in 4367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.1915555 = idf(docFreq=245, maxDocs=44218)
                0.0625 = fieldNorm(doc=4367)
        0.28 = coord(7/25)
    
  5. Bergamaschi, S.; Domnori, E.; Guerra, F.; Rota, S.; Lado, R.T.; Velegrakis, Y.: Understanding the semantics of keyword queries on relational data without accessing the instance (2012) 0.18
    0.17828362 = sum of:
      0.17828362 = product of:
        0.6367272 = sum of:
          0.029548232 = weight(abstract_txt:structured in 431) [ClassicSimilarity], result of:
            0.029548232 = score(doc=431,freq=1.0), product of:
              0.086888395 = queryWeight, product of:
                1.0986506 = boost
                5.4411373 = idf(docFreq=520, maxDocs=44218)
                0.014534915 = queryNorm
              0.34007108 = fieldWeight in 431, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4411373 = idf(docFreq=520, maxDocs=44218)
                0.0625 = fieldNorm(doc=431)
          0.03553187 = weight(abstract_txt:efficient in 431) [ClassicSimilarity], result of:
            0.03553187 = score(doc=431,freq=1.0), product of:
              0.09825459 = queryWeight, product of:
                1.168302 = boost
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.014534915 = queryNorm
              0.36163065 = fieldWeight in 431, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7860904 = idf(docFreq=368, maxDocs=44218)
                0.0625 = fieldNorm(doc=431)
          0.10229261 = weight(abstract_txt:execution in 431) [ClassicSimilarity], result of:
            0.10229261 = score(doc=431,freq=1.0), product of:
              0.19883995 = queryWeight, product of:
                1.6619982 = boost
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.014534915 = queryNorm
              0.514447 = fieldWeight in 431, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.0625 = fieldNorm(doc=431)
          0.07881982 = weight(abstract_txt:query in 431) [ClassicSimilarity], result of:
            0.07881982 = score(doc=431,freq=4.0), product of:
              0.13264404 = queryWeight, product of:
                1.9197187 = boost
                4.7537646 = idf(docFreq=1035, maxDocs=44218)
                0.014534915 = queryNorm
              0.5942206 = fieldWeight in 431, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.7537646 = idf(docFreq=1035, maxDocs=44218)
                0.0625 = fieldNorm(doc=431)
          0.0949711 = weight(abstract_txt:querying in 431) [ClassicSimilarity], result of:
            0.0949711 = score(doc=431,freq=1.0), product of:
              0.21662016 = queryWeight, product of:
                2.1245835 = boost
                7.014756 = idf(docFreq=107, maxDocs=44218)
                0.014534915 = queryNorm
              0.43842226 = fieldWeight in 431, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.014756 = idf(docFreq=107, maxDocs=44218)
                0.0625 = fieldNorm(doc=431)
          0.023423966 = weight(abstract_txt:information in 431) [ClassicSimilarity], result of:
            0.023423966 = score(doc=431,freq=4.0), product of:
              0.07740433 = queryWeight, product of:
                2.199721 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014534915 = queryNorm
              0.3026183 = fieldWeight in 431, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0625 = fieldNorm(doc=431)
          0.2721396 = weight(abstract_txt:deep in 431) [ClassicSimilarity], result of:
            0.2721396 = score(doc=431,freq=3.0), product of:
              0.38177013 = queryWeight, product of:
                3.9887822 = boost
                6.5848994 = idf(docFreq=165, maxDocs=44218)
                0.014534915 = queryNorm
              0.71283627 = fieldWeight in 431, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.5848994 = idf(docFreq=165, maxDocs=44218)
                0.0625 = fieldNorm(doc=431)
        0.28 = coord(7/25)
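
The score breakdowns above follow the usual TF-IDF pattern: each matching term contributes queryWeight (boost x idf x queryNorm) times fieldWeight (tf x idf x fieldNorm), and the sum is scaled by the coord factor. The following is a minimal Python sketch that reproduces the figures listed for entry 1 (doc 2055). It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the values shown; the function and variable names are hypothetical, and the boost, queryNorm, and fieldNorm values are copied from the listing rather than derived.

```python
import math

def classic_idf(doc_freq, max_docs):
    # Assumed idf form; it matches the listed value 5.4411373 for docFreq=520.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as laid out in the breakdown.
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm  # tf assumed to be sqrt(freq)
    return query_weight * field_weight

def document_score(terms, max_docs, query_norm, field_norm, total_query_terms):
    # Sum the per-term scores, then scale by coord = matched / total query terms.
    total = sum(
        term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm)
        for boost, doc_freq, freq in terms
    )
    coord = len(terms) / total_query_terms
    return total * coord

# Values copied from the breakdown for doc 2055: (boost, docFreq, freq).
MAX_DOCS = 44218
QUERY_NORM = 0.014534915
FIELD_NORM_2055 = 0.0625
TERMS_2055 = [
    (1.0986506, 520, 4.0),    # structured
    (2.199721, 10677, 3.0),   # information
    (2.8578193, 2106, 1.0),   # text
    (2.9995975, 1723, 1.0),   # over
    (3.9887822, 165, 1.0),    # deep
    (6.250859, 245, 6.0),     # extraction
]

score_2055 = document_score(TERMS_2055, MAX_DOCS, QUERY_NORM, FIELD_NORM_2055, 25)
print(round(score_2055, 4))
```

The same function can be applied to the other entries by substituting their listed term values, fieldNorm, and coord numerator. Small differences in the last digits are to be expected, since the listing appears to use single-precision arithmetic.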