Document (#28090)

Author
Cothey, V.
Title
Web-crawling reliability
Source
Journal of the American Society for Information Science and Technology. 55(2004) no.14, pp. 1228-1238
Year
2004
Abstract
In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling. The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling. It is shown that Web crawling by search engines is intentionally biased and selective. I also report the results of a large-scale experimental simulation of Web crawling that illustrates the effects of different crawling policies on data collection. It is concluded that the reliability of Web crawling as a data collection technique is improved by fuller reporting of relevant crawling policies.
Footnote
Contribution to a special issue on Webometrics
Theme
Internet
Informetrics
Object
WWW

Similar documents (content)

  1. Bidoki, A.M.Z.; Yazdani, N.: an intelligent ranking algorithm for web pages : DistanceRank (2008) 0.15
    0.14568163 = sum of:
      0.14568163 = product of:
        0.91051024 = sum of:
          0.018841226 = weight(abstract_txt:link in 2068) [ClassicSimilarity], result of:
            0.018841226 = score(doc=2068,freq=1.0), product of:
              0.04225137 = queryWeight, product of:
                1.0215272 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.0072462372 = queryNorm
              0.4459317 = fieldWeight in 2068, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.078125 = fieldNorm(doc=2068)
          0.005718299 = weight(abstract_txt:that in 2068) [ClassicSimilarity], result of:
            0.005718299 = score(doc=2068,freq=2.0), product of:
              0.02184287 = queryWeight, product of:
                1.2721696 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0072462372 = queryNorm
              0.26179248 = fieldWeight in 2068, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=2068)
          0.012836295 = weight(abstract_txt:results in 2068) [ClassicSimilarity], result of:
            0.012836295 = score(doc=2068,freq=1.0), product of:
              0.04718112 = queryWeight, product of:
                1.8697101 = boost
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.0072462372 = queryNorm
              0.27206424 = fieldWeight in 2068, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.078125 = fieldNorm(doc=2068)
          0.8731144 = weight(abstract_txt:crawling in 2068) [ClassicSimilarity], result of:
            0.8731144 = score(doc=2068,freq=2.0), product of:
              0.9321207 = queryWeight, product of:
                15.172795 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0072462372 = queryNorm
              0.93669677 = fieldWeight in 2068, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.078125 = fieldNorm(doc=2068)
        0.16 = coord(4/25)
    
  2. Alqaraleh, S.; Ramadan, O.; Salamah, M.: Efficient watcher based web crawler design (2015) 0.12
    0.12123921 = sum of:
      0.12123921 = product of:
        1.0103267 = sum of:
          0.016040994 = weight(abstract_txt:sense in 1627) [ClassicSimilarity], result of:
            0.016040994 = score(doc=1627,freq=1.0), product of:
              0.04404151 = queryWeight, product of:
                1.0429431 = boost
                5.8275905 = idf(docFreq=353, maxDocs=44218)
                0.0072462372 = queryNorm
              0.3642244 = fieldWeight in 1627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8275905 = idf(docFreq=353, maxDocs=44218)
                0.0625 = fieldNorm(doc=1627)
          0.0064695175 = weight(abstract_txt:that in 1627) [ClassicSimilarity], result of:
            0.0064695175 = score(doc=1627,freq=4.0), product of:
              0.02184287 = queryWeight, product of:
                1.2721696 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0072462372 = queryNorm
              0.2961844 = fieldWeight in 1627, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=1627)
          0.9878162 = weight(abstract_txt:crawling in 1627) [ClassicSimilarity], result of:
            0.9878162 = score(doc=1627,freq=4.0), product of:
              0.9321207 = queryWeight, product of:
                15.172795 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0072462372 = queryNorm
              1.0597514 = fieldWeight in 1627, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0625 = fieldNorm(doc=1627)
        0.12 = coord(3/25)
    
  3. Fu, T.; Abbasi, A.; Chen, H.: A focused crawler for Dark Web forums (2010) 0.12
    0.11817857 = sum of:
      0.11817857 = product of:
        0.73861605 = sum of:
          0.014782505 = weight(abstract_txt:improved in 3471) [ClassicSimilarity], result of:
            0.014782505 = score(doc=3471,freq=1.0), product of:
              0.041706786 = queryWeight, product of:
                1.0149225 = boost
                5.6710215 = idf(docFreq=413, maxDocs=44218)
                0.0072462372 = queryNorm
              0.35443884 = fieldWeight in 3471, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6710215 = idf(docFreq=413, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
          0.015072981 = weight(abstract_txt:link in 3471) [ClassicSimilarity], result of:
            0.015072981 = score(doc=3471,freq=1.0), product of:
              0.04225137 = queryWeight, product of:
                1.0215272 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.0072462372 = queryNorm
              0.35674536 = fieldWeight in 3471, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
          0.010269036 = weight(abstract_txt:results in 3471) [ClassicSimilarity], result of:
            0.010269036 = score(doc=3471,freq=1.0), product of:
              0.04718112 = queryWeight, product of:
                1.8697101 = boost
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.0072462372 = queryNorm
              0.21765138 = fieldWeight in 3471, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
          0.6984915 = weight(abstract_txt:crawling in 3471) [ClassicSimilarity], result of:
            0.6984915 = score(doc=3471,freq=2.0), product of:
              0.9321207 = queryWeight, product of:
                15.172795 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0072462372 = queryNorm
              0.7493574 = fieldWeight in 3471, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
        0.16 = coord(4/25)
    
  4. Marcondes, C.H.; Costa, L.C da.: A model to represent and process scientific knowledge in biomedical articles with semantic Web technologies (2016) 0.11
    0.11187931 = sum of:
      0.11187931 = product of:
        0.55939656 = sum of:
          0.0045746397 = weight(abstract_txt:that in 2829) [ClassicSimilarity], result of:
            0.0045746397 = score(doc=2829,freq=2.0), product of:
              0.02184287 = queryWeight, product of:
                1.2721696 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0072462372 = queryNorm
              0.20943399 = fieldWeight in 2829, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=2829)
          0.030356651 = weight(abstract_txt:reporting in 2829) [ClassicSimilarity], result of:
            0.030356651 = score(doc=2829,freq=1.0), product of:
              0.06738201 = queryWeight, product of:
                1.290035 = boost
                7.208251 = idf(docFreq=88, maxDocs=44218)
                0.0072462372 = queryNorm
              0.4505157 = fieldWeight in 2829, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.208251 = idf(docFreq=88, maxDocs=44218)
                0.0625 = fieldNorm(doc=2829)
          0.012770691 = weight(abstract_txt:data in 2829) [ClassicSimilarity], result of:
            0.012770691 = score(doc=2829,freq=2.0), product of:
              0.043305997 = queryWeight, product of:
                1.7912829 = boost
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.0072462372 = queryNorm
              0.29489428 = fieldWeight in 2829, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.0625 = fieldNorm(doc=2829)
          0.017786492 = weight(abstract_txt:results in 2829) [ClassicSimilarity], result of:
            0.017786492 = score(doc=2829,freq=3.0), product of:
              0.04718112 = queryWeight, product of:
                1.8697101 = boost
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.0072462372 = queryNorm
              0.37698326 = fieldWeight in 2829, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.0625 = fieldNorm(doc=2829)
          0.4939081 = weight(abstract_txt:crawling in 2829) [ClassicSimilarity], result of:
            0.4939081 = score(doc=2829,freq=1.0), product of:
              0.9321207 = queryWeight, product of:
                15.172795 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0072462372 = queryNorm
              0.5298757 = fieldWeight in 2829, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0625 = fieldNorm(doc=2829)
        0.2 = coord(5/25)
    
  5. Menczer, F.: Lexical and semantic clustering by Web links (2004) 0.11
    0.11079978 = sum of:
      0.11079978 = product of:
        0.92333156 = sum of:
          0.04213026 = weight(abstract_txt:link in 3090) [ClassicSimilarity], result of:
            0.04213026 = score(doc=3090,freq=5.0), product of:
              0.04225137 = queryWeight, product of:
                1.0215272 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.0072462372 = queryNorm
              0.9971336 = fieldWeight in 3090, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.078125 = fieldNorm(doc=3090)
          0.0080868965 = weight(abstract_txt:that in 3090) [ClassicSimilarity], result of:
            0.0080868965 = score(doc=3090,freq=4.0), product of:
              0.02184287 = queryWeight, product of:
                1.2721696 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0072462372 = queryNorm
              0.3702305 = fieldWeight in 3090, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=3090)
          0.8731144 = weight(abstract_txt:crawling in 3090) [ClassicSimilarity], result of:
            0.8731144 = score(doc=3090,freq=2.0), product of:
              0.9321207 = queryWeight, product of:
                15.172795 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0072462372 = queryNorm
              0.93669677 = fieldWeight in 3090, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.078125 = fieldNorm(doc=3090)
        0.12 = coord(3/25)
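
The score breakdowns above follow the pattern of Lucene's ClassicSimilarity: each term score is queryWeight (boost x idf x queryNorm) times fieldWeight (tf x idf x fieldNorm), and the sum over matching terms is scaled by the coord factor. The following is a minimal sketch that recomputes the total for entry 1 (doc=2068) from the values listed there. It assumes the standard ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function and variable names are illustrative and not part of the source system.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed ClassicSimilarity definition: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # tf is the square root of the term frequency in the field.
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm   # "queryWeight" in the listing
    field_weight = tf * term_idf * field_norm      # "fieldWeight" in the listing
    return query_weight * field_weight

# Values copied from the breakdown of entry 1 (doc=2068).
MAX_DOCS = 44218
QUERY_NORM = 0.0072462372
FIELD_NORM = 0.078125

# (term, freq, docFreq, boost)
terms = [
    ("link", 1.0, 398, 1.0215272),
    ("that", 2.0, 11241, 1.2721696),
    ("results", 1.0, 3693, 1.8697101),
    ("crawling", 2.0, 24, 15.172795),
]

term_scores = {
    name: term_score(freq, doc_freq, MAX_DOCS, boost, QUERY_NORM, FIELD_NORM)
    for name, freq, doc_freq, boost in terms
}

# coord(4/25): 4 of the 25 query terms matched this document.
coord = 4 / 25
total = sum(term_scores.values()) * coord

# Compare with 0.14568163 in the listing; small differences are expected
# because the listing reflects 32-bit float arithmetic.
print(total)
```

The rare term "crawling" (docFreq=24) carries both the highest idf and by far the largest boost, so it accounts for nearly all of each total; the remaining terms contribute only in the second or third decimal place.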