Document (#38628)

Author
Alqaraleh, S.
Ramadan, O.
Salamah, M.
Title
Efficient watcher based web crawler design
Source
Aslib journal of information management. 67(2015) no.6, pp. 663-686
Year
2015
Abstract
Purpose: The purpose of this paper is to design a watcher-based crawler (WBC) that can crawl both static and dynamic web sites and download only the updated and newly added web pages.
Design/methodology/approach: In the proposed WBC, a watcher file, which can be uploaded to the web sites' servers, prepares a report that contains the addresses of the updated and newly added web pages. In addition, the WBC is split into five units, each of which is responsible for a specific crawling process.
Findings: Several experiments have been conducted, and it has been observed that the proposed WBC increases the number of uniquely visited static and dynamic web sites compared with existing crawling techniques. In addition, the proposed watcher file not only allows crawlers to visit the updated and newly added web pages, but also solves the crawlers' overlapping and communication problems.
Originality/value: The proposed WBC performs all crawling processes in the sense that it detects all updated and newly added pages automatically, without any explicit human intervention and without downloading entire web sites.
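The watcher-file idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the watcher detects changes, so everything here is an assumption made for illustration: the function name `build_report`, the use of file modification times, and the restriction to files on disk (dynamic pages would need a different change signal).

```python
import os
import tempfile


def build_report(site_root, last_report_time):
    """Hypothetical watcher: list files changed since the previous report.

    Assumption: a page counts as "updated or newly added" when its file
    modification time is later than last_report_time (seconds since epoch).
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(site_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_report_time:
                # Report paths relative to the site root, as a crawler would
                # resolve them against the site's base address.
                changed.append(os.path.relpath(path, site_root))
    return sorted(changed)


# Small demonstration on a temporary directory with made-up file names.
demo_root = tempfile.mkdtemp()
for file_name, mtime in (("old.html", 1000), ("new.html", 3000)):
    file_path = os.path.join(demo_root, file_name)
    with open(file_path, "w") as handle:
        handle.write("<html></html>")
    os.utime(file_path, (mtime, mtime))  # force a known modification time

demo_report = build_report(demo_root, last_report_time=2000)
print(demo_report)
```

A crawler that reads such a report only needs to fetch the listed addresses instead of re-downloading the whole site, which is the saving the abstract describes.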
Content
Cf.: http://dx.doi.org/10.1108/AJIM-02-2015-0019.
Theme
Search engines

Similar documents (content)

  1. Thelwall, M.; Stuart, D.: Web crawling ethics revisited : cost, privacy, and denial of service (2006) 0.17
    0.172002 = sum of:
      0.172002 = product of:
        0.86001 = sum of:
          0.009746531 = weight(abstract_txt:that in 6098) [ClassicSimilarity], result of:
            0.009746531 = score(doc=6098,freq=2.0), product of:
              0.03722999 = queryWeight, product of:
                1.1514671 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0136454925 = queryNorm
              0.26179248 = fieldWeight in 6098, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=6098)
          0.19249575 = weight(abstract_txt:crawlers in 6098) [ClassicSimilarity], result of:
            0.19249575 = score(doc=6098,freq=1.0), product of:
              0.27202383 = queryWeight, product of:
                2.2008657 = boost
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.0136454925 = queryNorm
              0.707643 = fieldWeight in 6098, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.078125 = fieldNorm(doc=6098)
          0.08156547 = weight(abstract_txt:sites in 6098) [ClassicSimilarity], result of:
            0.08156547 = score(doc=6098,freq=1.0), product of:
              0.19334832 = queryWeight, product of:
                2.6240692 = boost
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.0136454925 = queryNorm
              0.42185766 = fieldWeight in 6098, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.078125 = fieldNorm(doc=6098)
          0.26051247 = weight(abstract_txt:crawler in 6098) [ClassicSimilarity], result of:
            0.26051247 = score(doc=6098,freq=1.0), product of:
              0.38098592 = queryWeight, product of:
                3.1899962 = boost
                8.752448 = idf(docFreq=18, maxDocs=44218)
                0.0136454925 = queryNorm
              0.683785 = fieldWeight in 6098, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.752448 = idf(docFreq=18, maxDocs=44218)
                0.078125 = fieldNorm(doc=6098)
          0.31568983 = weight(abstract_txt:crawling in 6098) [ClassicSimilarity], result of:
            0.31568983 = score(doc=6098,freq=1.0), product of:
              0.47662473 = queryWeight, product of:
                4.1199636 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0136454925 = queryNorm
              0.66234463 = fieldWeight in 6098, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.078125 = fieldNorm(doc=6098)
        0.2 = coord(5/25)
    
  2. Lee, L.-H.; Luh, C.-J.: Generation of pornographic blacklist and its incremental update using an inverse chi-square based method (2008) 0.13
    0.13096967 = sum of:
      0.13096967 = product of:
        0.545707 = sum of:
          0.0068918387 = weight(abstract_txt:that in 1340) [ClassicSimilarity], result of:
            0.0068918387 = score(doc=1340,freq=1.0), product of:
              0.03722999 = queryWeight, product of:
                1.1514671 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0136454925 = queryNorm
              0.18511525 = fieldWeight in 1340, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=1340)
          0.08640568 = weight(abstract_txt:added in 1340) [ClassicSimilarity], result of:
            0.08640568 = score(doc=1340,freq=1.0), product of:
              0.18255125 = queryWeight, product of:
                2.2081478 = boost
                6.0585327 = idf(docFreq=280, maxDocs=44218)
                0.0136454925 = queryNorm
              0.47332287 = fieldWeight in 1340, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0585327 = idf(docFreq=280, maxDocs=44218)
                0.078125 = fieldNorm(doc=1340)
          0.07174678 = weight(abstract_txt:proposed in 1340) [ClassicSimilarity], result of:
            0.07174678 = score(doc=1340,freq=2.0), product of:
              0.14088382 = queryWeight, product of:
                2.239936 = boost
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.0136454925 = queryNorm
              0.509262 = fieldWeight in 1340, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.078125 = fieldNorm(doc=1340)
          0.115350984 = weight(abstract_txt:sites in 1340) [ClassicSimilarity], result of:
            0.115350984 = score(doc=1340,freq=2.0), product of:
              0.19334832 = queryWeight, product of:
                2.6240692 = boost
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.0136454925 = queryNorm
              0.5965968 = fieldWeight in 1340, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.078125 = fieldNorm(doc=1340)
          0.09125144 = weight(abstract_txt:pages in 1340) [ClassicSimilarity], result of:
            0.09125144 = score(doc=1340,freq=1.0), product of:
              0.20836718 = queryWeight, product of:
                2.7240794 = boost
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.0136454925 = queryNorm
              0.43793574 = fieldWeight in 1340, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.078125 = fieldNorm(doc=1340)
          0.17406023 = weight(abstract_txt:newly in 1340) [ClassicSimilarity], result of:
            0.17406023 = score(doc=1340,freq=1.0), product of:
              0.32048118 = queryWeight, product of:
                3.3783634 = boost
                6.9519553 = idf(docFreq=114, maxDocs=44218)
                0.0136454925 = queryNorm
              0.5431215 = fieldWeight in 1340, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9519553 = idf(docFreq=114, maxDocs=44218)
                0.078125 = fieldNorm(doc=1340)
        0.24 = coord(6/25)
    
  3. Vidmar, D.; Anderson, C.: History of Internet search tools (2002) 0.12
    0.12491949 = sum of:
      0.12491949 = product of:
        0.3903734 = sum of:
          0.013769116 = weight(abstract_txt:only in 4258) [ClassicSimilarity], result of:
            0.013769116 = score(doc=4258,freq=1.0), product of:
              0.059456345 = queryWeight, product of:
                1.0289379 = boost
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.0136454925 = queryNorm
              0.23158363 = fieldWeight in 4258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4258)
          0.011817042 = weight(abstract_txt:that in 4258) [ClassicSimilarity], result of:
            0.011817042 = score(doc=4258,freq=6.0), product of:
              0.03722999 = queryWeight, product of:
                1.1514671 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0136454925 = queryNorm
              0.31740654 = fieldWeight in 4258, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4258)
          0.031482324 = weight(abstract_txt:file in 4258) [ClassicSimilarity], result of:
            0.031482324 = score(doc=4258,freq=1.0), product of:
              0.10319025 = queryWeight, product of:
                1.3555309 = boost
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.0136454925 = queryNorm
              0.3050901 = fieldWeight in 4258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.57879 = idf(docFreq=453, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4258)
          0.034830954 = weight(abstract_txt:dynamic in 4258) [ClassicSimilarity], result of:
            0.034830954 = score(doc=4258,freq=1.0), product of:
              0.11038356 = queryWeight, product of:
                1.4019816 = boost
                5.7699614 = idf(docFreq=374, maxDocs=44218)
                0.0136454925 = queryNorm
              0.31554475 = fieldWeight in 4258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7699614 = idf(docFreq=374, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4258)
          0.06550656 = weight(abstract_txt:static in 4258) [ClassicSimilarity], result of:
            0.06550656 = score(doc=4258,freq=1.0), product of:
              0.16818374 = queryWeight, product of:
                1.7305418 = boost
                7.122176 = idf(docFreq=96, maxDocs=44218)
                0.0136454925 = queryNorm
              0.389494 = fieldWeight in 4258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.122176 = idf(docFreq=96, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4258)
          0.08553726 = weight(abstract_txt:added in 4258) [ClassicSimilarity], result of:
            0.08553726 = score(doc=4258,freq=2.0), product of:
              0.18255125 = queryWeight, product of:
                2.2081478 = boost
                6.0585327 = idf(docFreq=280, maxDocs=44218)
                0.0136454925 = queryNorm
              0.46856573 = fieldWeight in 4258, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0585327 = idf(docFreq=280, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4258)
          0.05709583 = weight(abstract_txt:sites in 4258) [ClassicSimilarity], result of:
            0.05709583 = score(doc=4258,freq=1.0), product of:
              0.19334832 = queryWeight, product of:
                2.6240692 = boost
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.0136454925 = queryNorm
              0.29530036 = fieldWeight in 4258, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4258)
          0.09033431 = weight(abstract_txt:pages in 4258) [ClassicSimilarity], result of:
            0.09033431 = score(doc=4258,freq=2.0), product of:
              0.20836718 = queryWeight, product of:
                2.7240794 = boost
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.0136454925 = queryNorm
              0.43353426 = fieldWeight in 4258, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4258)
        0.32 = coord(8/25)
    
  4. Bidoki, A.M.Z.; Yazdani, N.: An intelligent ranking algorithm for web pages : DistanceRank (2008) 0.11
    0.10639746 = sum of:
      0.10639746 = product of:
        0.6649841 = sum of:
          0.009746531 = weight(abstract_txt:that in 2068) [ClassicSimilarity], result of:
            0.009746531 = score(doc=2068,freq=2.0), product of:
              0.03722999 = queryWeight, product of:
                1.1514671 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0136454925 = queryNorm
              0.26179248 = fieldWeight in 2068, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=2068)
          0.05073263 = weight(abstract_txt:proposed in 2068) [ClassicSimilarity], result of:
            0.05073263 = score(doc=2068,freq=1.0), product of:
              0.14088382 = queryWeight, product of:
                2.239936 = boost
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.0136454925 = queryNorm
              0.36010262 = fieldWeight in 2068, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.078125 = fieldNorm(doc=2068)
          0.15805212 = weight(abstract_txt:pages in 2068) [ClassicSimilarity], result of:
            0.15805212 = score(doc=2068,freq=3.0), product of:
              0.20836718 = queryWeight, product of:
                2.7240794 = boost
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.0136454925 = queryNorm
              0.7585269 = fieldWeight in 2068, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.078125 = fieldNorm(doc=2068)
          0.44645286 = weight(abstract_txt:crawling in 2068) [ClassicSimilarity], result of:
            0.44645286 = score(doc=2068,freq=2.0), product of:
              0.47662473 = queryWeight, product of:
                4.1199636 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0136454925 = queryNorm
              0.93669677 = fieldWeight in 2068, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.078125 = fieldNorm(doc=2068)
        0.16 = coord(4/25)
    
  5. Choi, B.; Peng, X.: Dynamic and hierarchical classification of Web pages (2004) 0.11
    0.10570589 = sum of:
      0.10570589 = product of:
        0.4404412 = sum of:
          0.019670166 = weight(abstract_txt:only in 2555) [ClassicSimilarity], result of:
            0.019670166 = score(doc=2555,freq=1.0), product of:
              0.059456345 = queryWeight, product of:
                1.0289379 = boost
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.0136454925 = queryNorm
              0.33083376 = fieldWeight in 2555, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.234672 = idf(docFreq=1740, maxDocs=44218)
                0.078125 = fieldNorm(doc=2555)
          0.009746531 = weight(abstract_txt:that in 2555) [ClassicSimilarity], result of:
            0.009746531 = score(doc=2555,freq=2.0), product of:
              0.03722999 = queryWeight, product of:
                1.1514671 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0136454925 = queryNorm
              0.26179248 = fieldWeight in 2555, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=2555)
          0.07036916 = weight(abstract_txt:dynamic in 2555) [ClassicSimilarity], result of:
            0.07036916 = score(doc=2555,freq=2.0), product of:
              0.11038356 = queryWeight, product of:
                1.4019816 = boost
                5.7699614 = idf(docFreq=374, maxDocs=44218)
                0.0136454925 = queryNorm
              0.6374967 = fieldWeight in 2555, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7699614 = idf(docFreq=374, maxDocs=44218)
                0.078125 = fieldNorm(doc=2555)
          0.08640568 = weight(abstract_txt:added in 2555) [ClassicSimilarity], result of:
            0.08640568 = score(doc=2555,freq=1.0), product of:
              0.18255125 = queryWeight, product of:
                2.2081478 = boost
                6.0585327 = idf(docFreq=280, maxDocs=44218)
                0.0136454925 = queryNorm
              0.47332287 = fieldWeight in 2555, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0585327 = idf(docFreq=280, maxDocs=44218)
                0.078125 = fieldNorm(doc=2555)
          0.07174678 = weight(abstract_txt:proposed in 2555) [ClassicSimilarity], result of:
            0.07174678 = score(doc=2555,freq=2.0), product of:
              0.14088382 = queryWeight, product of:
                2.239936 = boost
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.0136454925 = queryNorm
              0.509262 = fieldWeight in 2555, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6093135 = idf(docFreq=1196, maxDocs=44218)
                0.078125 = fieldNorm(doc=2555)
          0.18250288 = weight(abstract_txt:pages in 2555) [ClassicSimilarity], result of:
            0.18250288 = score(doc=2555,freq=4.0), product of:
              0.20836718 = queryWeight, product of:
                2.7240794 = boost
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.0136454925 = queryNorm
              0.8758715 = fieldWeight in 2555, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.078125 = fieldNorm(doc=2555)
        0.24 = coord(6/25)
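
The score trees above follow the pattern of Lucene's ClassicSimilarity: each matched term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is multiplied by the coord factor (matched terms / query terms). The sketch below recomputes the score of the first entry (doc 6098) from the numbers in the listing. The function names are illustrative and not part of any library; the formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)) are the ones ClassicSimilarity uses and are consistent with the values shown. Results agree with the listing only up to rounding, since the listing was computed in single precision.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.0136454925
FIELD_NORM = 0.078125  # fieldNorm(doc=6098), taken from the listing as-is


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency as in ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    """Term frequency factor as in ClassicSimilarity."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, boost, field_norm=FIELD_NORM):
    """One 'weight(...)' node: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


# term: (freq, docFreq, boost), copied from the tree for doc 6098
matched_terms = {
    "that": (2.0, 11241, 1.1514671),
    "crawlers": (1.0, 13, 2.2008657),
    "sites": (1.0, 542, 2.6240692),
    "crawler": (1.0, 18, 3.1899962),
    "crawling": (1.0, 24, 4.1199636),
}

term_sum = sum(term_score(*values) for values in matched_terms.values())
coord = len(matched_terms) / 25  # 5 of the 25 query terms matched
doc_score = term_sum * coord

print(round(term_sum, 4), round(doc_score, 4))
```

The same calculation applies to the other entries once their own freq, fieldNorm, and coord values are substituted; rare terms such as "crawlers" (docFreq=13) dominate the sum because idf enters both queryWeight and fieldWeight.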