Document (#28089)

Author
Menczer, F.
Title
Lexical and semantic clustering by Web links
Source
Journal of the American Society for Information Science and Technology. 55(2004) no.14, pp. 1261-1269
Year
2004
Abstract
Recent Web-searching and -mining tools are combining text and link analysis to improve ranking and crawling algorithms. The central assumption behind such approaches is that there is a correlation between the graph structure of the Web and the text and meaning of pages. Here I formalize and empirically evaluate two general conjectures drawing connections from link information to lexical and semantic Web content. The link-content conjecture states that a page is similar to the pages that link to it, and the link-cluster conjecture that pages about the same topic are clustered together. These conjectures are often simply assumed to hold, and Web search tools are built on such assumptions. The present quantitative confirmation sheds light on the connection between the success of the latest Web-mining techniques and the small world topology of the Web, with encouraging implications for the design of better crawling algorithms.
Footnote
Contribution to a special issue on webometrics
Theme
Internet
Informetrics
Semantic context in indexing and retrieval

Similar documents (author)

  1. Menczer, F.; Monge, A.E.: Scalable Web search by adaptive online agents : an InfoSpiders case study (1999) 4.95
    4.9450216 = sum of:
      4.9450216 = weight(author_txt:menczer in 4983) [ClassicSimilarity], result of:
        4.9450216 = fieldWeight in 4983, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.890043 = idf(docFreq=5, maxDocs=43556)
          0.5 = fieldNorm(doc=4983)
    
  2. Menczer, F.; Hills, T.: ¬Die digitale Manipulation (2021) 4.95
    4.9450216 = sum of:
      4.9450216 = weight(author_txt:menczer in 145) [ClassicSimilarity], result of:
        4.9450216 = fieldWeight in 145, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.890043 = idf(docFreq=5, maxDocs=43556)
          0.5 = fieldNorm(doc=145)
    
  3. Lam, W.; Yang, C.C.; Menczer, F.: Introduction to the special topic section on mining Web resources for enhancing information retrieval (2007) 3.71
    3.7087662 = sum of:
      3.7087662 = weight(author_txt:menczer in 2598) [ClassicSimilarity], result of:
        3.7087662 = fieldWeight in 2598, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.890043 = idf(docFreq=5, maxDocs=43556)
          0.375 = fieldNorm(doc=2598)
    
  4. Nikolov, D.; Lalmas, M.; Flammini, A.; Menczer, F.: Quantifying biases in online information exposure (2019) 3.09
    3.0906386 = sum of:
      3.0906386 = weight(author_txt:menczer in 1272) [ClassicSimilarity], result of:
        3.0906386 = fieldWeight in 1272, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.890043 = idf(docFreq=5, maxDocs=43556)
          0.3125 = fieldNorm(doc=1272)
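
The author-match scores above can be reproduced by hand. The following is a minimal sketch, assuming the usual ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are illustrative and not part of Lucene. The fieldNorm values are copied from the listing rather than recomputed, because they are stored in a lossy encoded form.

```python
import math

def classic_tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)

def classic_idf(doc_freq, max_docs):
    # Inverse document frequency as assumed in the lead-in above.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as in the explain tree.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# fieldNorm values taken from the four author matches listed above.
for doc_id, norm in [(4983, 0.5), (145, 0.5), (2598, 0.375), (1272, 0.3125)]:
    weight = field_weight(1.0, 5, 43556, norm)
    print(f"doc {doc_id}: {weight:.4f}")
```

Because the query is the same single author term for every hit, only fieldNorm differs between the four results: records with more author names get a smaller norm and therefore a lower score.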
    

Similar documents (content)

  1. Bidoki, A.M.Z.; Yazdani, N.: An intelligent ranking algorithm for web pages : DistanceRank (2008) 0.18
    0.18315187 = sum of:
      0.18315187 = product of:
        0.7631328 = sum of:
          0.020985952 = weight(abstract_txt:between in 4066) [ClassicSimilarity], result of:
            0.020985952 = score(doc=4066,freq=2.0), product of:
              0.054703627 = queryWeight, product of:
                1.0096924 = boost
                3.4722223 = idf(docFreq=3675, maxDocs=43556)
                0.015603409 = queryNorm
              0.38362998 = fieldWeight in 4066, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4722223 = idf(docFreq=3675, maxDocs=43556)
                0.078125 = fieldNorm(doc=4066)
          0.013545794 = weight(abstract_txt:that in 4066) [ClassicSimilarity], result of:
            0.013545794 = score(doc=4066,freq=2.0), product of:
              0.051476616 = queryWeight, product of:
                1.3851635 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.015603409 = queryNorm
              0.2631446 = fieldWeight in 4066, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.078125 = fieldNorm(doc=4066)
          0.09462813 = weight(abstract_txt:algorithms in 4066) [ClassicSimilarity], result of:
            0.09462813 = score(doc=4066,freq=2.0), product of:
              0.14930598 = queryWeight, product of:
                1.6680907 = boost
                5.736382 = idf(docFreq=381, maxDocs=43556)
                0.015603409 = queryNorm
              0.6337866 = fieldWeight in 4066, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.736382 = idf(docFreq=381, maxDocs=43556)
                0.078125 = fieldNorm(doc=4066)
          0.16110943 = weight(abstract_txt:pages in 4066) [ClassicSimilarity], result of:
            0.16110943 = score(doc=4066,freq=3.0), product of:
              0.21288463 = queryWeight, product of:
                2.439489 = boost
                5.5927577 = idf(docFreq=440, maxDocs=43556)
                0.015603409 = queryNorm
              0.7567922 = fieldWeight in 4066, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.5927577 = idf(docFreq=440, maxDocs=43556)
                0.078125 = fieldNorm(doc=4066)
          0.30827436 = weight(abstract_txt:crawling in 4066) [ClassicSimilarity], result of:
            0.30827436 = score(doc=4066,freq=2.0), product of:
              0.3281119 = queryWeight, product of:
                2.4728172 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.015603409 = queryNorm
              0.9395403 = fieldWeight in 4066, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.078125 = fieldNorm(doc=4066)
          0.1645892 = weight(abstract_txt:link in 4066) [ClassicSimilarity], result of:
            0.1645892 = score(doc=4066,freq=1.0), product of:
              0.36925063 = queryWeight, product of:
                4.1477413 = boost
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.015603409 = queryNorm
              0.44573843 = fieldWeight in 4066, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.078125 = fieldNorm(doc=4066)
        0.24 = coord(6/25)
    
  2. Cothey, V.: Web-crawling reliability (2004) 0.17
    0.16535965 = sum of:
      0.16535965 = product of:
        1.0334978 = sum of:
          0.031344224 = weight(abstract_txt:content in 4087) [ClassicSimilarity], result of:
            0.031344224 = score(doc=4087,freq=1.0), product of:
              0.07974887 = queryWeight, product of:
                1.2191112 = boost
                4.1923904 = idf(docFreq=1788, maxDocs=43556)
                0.015603409 = queryNorm
              0.3930366 = fieldWeight in 4087, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1923904 = idf(docFreq=1788, maxDocs=43556)
                0.09375 = fieldNorm(doc=4087)
          0.01990817 = weight(abstract_txt:that in 4087) [ClassicSimilarity], result of:
            0.01990817 = score(doc=4087,freq=3.0), product of:
              0.051476616 = queryWeight, product of:
                1.3851635 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.015603409 = queryNorm
              0.386742 = fieldWeight in 4087, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.09375 = fieldNorm(doc=4087)
          0.7847384 = weight(abstract_txt:crawling in 4087) [ClassicSimilarity], result of:
            0.7847384 = score(doc=4087,freq=9.0), product of:
              0.3281119 = queryWeight, product of:
                2.4728172 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.015603409 = queryNorm
              2.3916793 = fieldWeight in 4087, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.09375 = fieldNorm(doc=4087)
          0.19750704 = weight(abstract_txt:link in 4087) [ClassicSimilarity], result of:
            0.19750704 = score(doc=4087,freq=1.0), product of:
              0.36925063 = queryWeight, product of:
                4.1477413 = boost
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.015603409 = queryNorm
              0.5348861 = fieldWeight in 4087, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.09375 = fieldNorm(doc=4087)
        0.16 = coord(4/25)
    
  3. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment (1998) 0.16
    0.15876234 = sum of:
      0.15876234 = product of:
        0.5670083 = sum of:
          0.020408878 = weight(abstract_txt:such in 2003) [ClassicSimilarity], result of:
            0.020408878 = score(doc=2003,freq=2.0), product of:
              0.053696144 = queryWeight, product of:
                1.0003514 = boost
                3.4400995 = idf(docFreq=3795, maxDocs=43556)
                0.015603409 = queryNorm
              0.38008088 = fieldWeight in 2003, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4400995 = idf(docFreq=3795, maxDocs=43556)
                0.078125 = fieldNorm(doc=2003)
          0.014839308 = weight(abstract_txt:between in 2003) [ClassicSimilarity], result of:
            0.014839308 = score(doc=2003,freq=1.0), product of:
              0.054703627 = queryWeight, product of:
                1.0096924 = boost
                3.4722223 = idf(docFreq=3675, maxDocs=43556)
                0.015603409 = queryNorm
              0.27126735 = fieldWeight in 2003, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4722223 = idf(docFreq=3675, maxDocs=43556)
                0.078125 = fieldNorm(doc=2003)
          0.026120188 = weight(abstract_txt:content in 2003) [ClassicSimilarity], result of:
            0.026120188 = score(doc=2003,freq=1.0), product of:
              0.07974887 = queryWeight, product of:
                1.2191112 = boost
                4.1923904 = idf(docFreq=1788, maxDocs=43556)
                0.015603409 = queryNorm
              0.3275305 = fieldWeight in 2003, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1923904 = idf(docFreq=1788, maxDocs=43556)
                0.078125 = fieldNorm(doc=2003)
          0.03137044 = weight(abstract_txt:tools in 2003) [ClassicSimilarity], result of:
            0.03137044 = score(doc=2003,freq=1.0), product of:
              0.09010608 = queryWeight, product of:
                1.29586 = boost
                4.4563212 = idf(docFreq=1373, maxDocs=43556)
                0.015603409 = queryNorm
              0.3481501 = fieldWeight in 2003, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4563212 = idf(docFreq=1373, maxDocs=43556)
                0.078125 = fieldNorm(doc=2003)
          0.013545794 = weight(abstract_txt:that in 2003) [ClassicSimilarity], result of:
            0.013545794 = score(doc=2003,freq=2.0), product of:
              0.051476616 = queryWeight, product of:
                1.3851635 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.015603409 = queryNorm
              0.2631446 = fieldWeight in 2003, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.078125 = fieldNorm(doc=2003)
          0.1315453 = weight(abstract_txt:pages in 2003) [ClassicSimilarity], result of:
            0.1315453 = score(doc=2003,freq=2.0), product of:
              0.21288463 = queryWeight, product of:
                2.439489 = boost
                5.5927577 = idf(docFreq=440, maxDocs=43556)
                0.015603409 = queryNorm
              0.61791825 = fieldWeight in 2003, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5927577 = idf(docFreq=440, maxDocs=43556)
                0.078125 = fieldNorm(doc=2003)
          0.3291784 = weight(abstract_txt:link in 2003) [ClassicSimilarity], result of:
            0.3291784 = score(doc=2003,freq=4.0), product of:
              0.36925063 = queryWeight, product of:
                4.1477413 = boost
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.015603409 = queryNorm
              0.89147687 = fieldWeight in 2003, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.078125 = fieldNorm(doc=2003)
        0.28 = coord(7/25)
    
  4. Naing, M.-M.; Lim, E.-P.; Chiang, R.H.L.: Extracting link chains of relationship instances from a Web site (2006) 0.15
    0.14821139 = sum of:
      0.14821139 = product of:
        0.61754745 = sum of:
          0.014431256 = weight(abstract_txt:such in 1109) [ClassicSimilarity], result of:
            0.014431256 = score(doc=1109,freq=1.0), product of:
              0.053696144 = queryWeight, product of:
                1.0003514 = boost
                3.4400995 = idf(docFreq=3795, maxDocs=43556)
                0.015603409 = queryNorm
              0.26875776 = fieldWeight in 1109, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4400995 = idf(docFreq=3795, maxDocs=43556)
                0.078125 = fieldNorm(doc=1109)
          0.014839308 = weight(abstract_txt:between in 1109) [ClassicSimilarity], result of:
            0.014839308 = score(doc=1109,freq=1.0), product of:
              0.054703627 = queryWeight, product of:
                1.0096924 = boost
                3.4722223 = idf(docFreq=3675, maxDocs=43556)
                0.015603409 = queryNorm
              0.27126735 = fieldWeight in 1109, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4722223 = idf(docFreq=3675, maxDocs=43556)
                0.078125 = fieldNorm(doc=1109)
          0.031950474 = weight(abstract_txt:semantic in 1109) [ClassicSimilarity], result of:
            0.031950474 = score(doc=1109,freq=1.0), product of:
              0.09121338 = queryWeight, product of:
                1.3037981 = boost
                4.483619 = idf(docFreq=1336, maxDocs=43556)
                0.015603409 = queryNorm
              0.35028276 = fieldWeight in 1109, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.483619 = idf(docFreq=1336, maxDocs=43556)
                0.078125 = fieldNorm(doc=1109)
          0.019156646 = weight(abstract_txt:that in 1109) [ClassicSimilarity], result of:
            0.019156646 = score(doc=1109,freq=4.0), product of:
              0.051476616 = queryWeight, product of:
                1.3851635 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.015603409 = queryNorm
              0.37214267 = fieldWeight in 1109, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.078125 = fieldNorm(doc=1109)
          0.20799139 = weight(abstract_txt:pages in 1109) [ClassicSimilarity], result of:
            0.20799139 = score(doc=1109,freq=5.0), product of:
              0.21288463 = queryWeight, product of:
                2.439489 = boost
                5.5927577 = idf(docFreq=440, maxDocs=43556)
                0.015603409 = queryNorm
              0.9770146 = fieldWeight in 1109, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.5927577 = idf(docFreq=440, maxDocs=43556)
                0.078125 = fieldNorm(doc=1109)
          0.3291784 = weight(abstract_txt:link in 1109) [ClassicSimilarity], result of:
            0.3291784 = score(doc=1109,freq=4.0), product of:
              0.36925063 = queryWeight, product of:
                4.1477413 = boost
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.015603409 = queryNorm
              0.89147687 = fieldWeight in 1109, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.078125 = fieldNorm(doc=1109)
        0.24 = coord(6/25)
    
  5. Haas, S.W.; Grams, E.S.: Readers, authors, and page structure : a discussion of four questions arising from a content analysis of Web pages (2000) 0.13
    0.12654565 = sum of:
      0.12654565 = product of:
        0.5272736 = sum of:
          0.014839308 = weight(abstract_txt:between in 5385) [ClassicSimilarity], result of:
            0.014839308 = score(doc=5385,freq=1.0), product of:
              0.054703627 = queryWeight, product of:
                1.0096924 = boost
                3.4722223 = idf(docFreq=3675, maxDocs=43556)
                0.015603409 = queryNorm
              0.27126735 = fieldWeight in 5385, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4722223 = idf(docFreq=3675, maxDocs=43556)
                0.078125 = fieldNorm(doc=5385)
          0.023537686 = weight(abstract_txt:text in 5385) [ClassicSimilarity], result of:
            0.023537686 = score(doc=5385,freq=1.0), product of:
              0.0744017 = queryWeight, product of:
                1.1775314 = boost
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.015603409 = queryNorm
              0.31635952 = fieldWeight in 5385, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0494018 = idf(docFreq=2063, maxDocs=43556)
                0.078125 = fieldNorm(doc=5385)
          0.026120188 = weight(abstract_txt:content in 5385) [ClassicSimilarity], result of:
            0.026120188 = score(doc=5385,freq=1.0), product of:
              0.07974887 = queryWeight, product of:
                1.2191112 = boost
                4.1923904 = idf(docFreq=1788, maxDocs=43556)
                0.015603409 = queryNorm
              0.3275305 = fieldWeight in 5385, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1923904 = idf(docFreq=1788, maxDocs=43556)
                0.078125 = fieldNorm(doc=5385)
          0.01659014 = weight(abstract_txt:that in 5385) [ClassicSimilarity], result of:
            0.01659014 = score(doc=5385,freq=3.0), product of:
              0.051476616 = queryWeight, product of:
                1.3851635 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.015603409 = queryNorm
              0.322285 = fieldWeight in 5385, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.078125 = fieldNorm(doc=5385)
          0.16110943 = weight(abstract_txt:pages in 5385) [ClassicSimilarity], result of:
            0.16110943 = score(doc=5385,freq=3.0), product of:
              0.21288463 = queryWeight, product of:
                2.439489 = boost
                5.5927577 = idf(docFreq=440, maxDocs=43556)
                0.015603409 = queryNorm
              0.7567922 = fieldWeight in 5385, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.5927577 = idf(docFreq=440, maxDocs=43556)
                0.078125 = fieldNorm(doc=5385)
          0.28507686 = weight(abstract_txt:link in 5385) [ClassicSimilarity], result of:
            0.28507686 = score(doc=5385,freq=3.0), product of:
              0.36925063 = queryWeight, product of:
                4.1477413 = boost
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.015603409 = queryNorm
              0.7720416 = fieldWeight in 5385, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.705452 = idf(docFreq=393, maxDocs=43556)
                0.078125 = fieldNorm(doc=5385)
        0.24 = coord(6/25)
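
The content-similarity scores follow the same pattern, with a query-side weight and a coordination factor added. The sketch below recomputes the score of the first hit (doc 4066) under the same assumed tf and idf definitions; the boost, docFreq, freq, fieldNorm, and queryNorm values are copied from the listing, and the helper names are hypothetical.

```python
import math

MAX_DOCS = 43556
QUERY_NORM = 0.015603409  # copied from the explain output

def idf(doc_freq):
    # Assumed ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1.0))

def term_score(boost, doc_freq, freq, field_norm):
    # score = queryWeight * fieldWeight for one matching term.
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

def document_score(terms, field_norm, matched, total):
    # Sum the per-term scores, then scale by coord(matched/total).
    raw = sum(term_score(b, df, f, field_norm) for _, b, df, f in terms)
    return raw * (matched / total)

# (term, boost, docFreq, freq) for doc 4066, fieldNorm = 0.078125.
TERMS_4066 = [
    ("between", 1.0096924, 3675, 2.0),
    ("that", 1.3851635, 10938, 2.0),
    ("algorithms", 1.6680907, 381, 2.0),
    ("pages", 2.439489, 440, 3.0),
    ("crawling", 2.4728172, 23, 2.0),
    ("link", 4.1477413, 393, 1.0),
]

score = document_score(TERMS_4066, 0.078125, matched=6, total=25)
print(f"doc 4066: {score:.4f}")
```

The coord factor explains why Kleinberg's paper, which matches 7 of the 25 query terms, ranks below Cothey's, which matches only 4: Cothey's abstract uses the rare term "crawling" nine times, and that single term dominates the sum.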