Document (#28465)

Author
Thelwall, M.
Title
Text characteristics of English language university Web sites
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, p.609-619
Year
2005
Abstract
The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and for search engine and topic-specific crawler designers. We analyze word frequencies in national academic Webs using the Web sites of three English-speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High frequency words include university names and acronyms, Internet terminology, and computing product names: not always words in common usage away from the Web. A minority of low frequency words are spelling mistakes, with other common types including nonwords, proper names, foreign language terms or computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications.

Similar documents (author)

  1. Thelwall, M.: Extracting macroscopic information from Web links (2001) 4.35
    4.345732 = sum of:
      4.345732 = weight(author_txt:thelwall in 852) [ClassicSimilarity], result of:
        4.345732 = fieldWeight in 852, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9531717 = idf(docFreq=108, maxDocs=41962)
          0.625 = fieldNorm(doc=852)
    
  2. Thelwall, M.: Conceptualizing documentation on the Web : an evaluation of different heuristic-based models for counting links between university Web sites (2002) 4.35
    4.345732 = sum of:
      4.345732 = weight(author_txt:thelwall in 1979) [ClassicSimilarity], result of:
        4.345732 = fieldWeight in 1979, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9531717 = idf(docFreq=108, maxDocs=41962)
          0.625 = fieldNorm(doc=1979)
    
  3. Thelwall, M.: Bibliometrics to webometrics (2009) 4.35
    4.345732 = sum of:
      4.345732 = weight(author_txt:thelwall in 240) [ClassicSimilarity], result of:
        4.345732 = fieldWeight in 240, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9531717 = idf(docFreq=108, maxDocs=41962)
          0.625 = fieldNorm(doc=240)
    
  4. Thelwall, M.: A layered approach for investigating the topological structure of communities in the Web (2003) 4.35
    4.345732 = sum of:
      4.345732 = weight(author_txt:thelwall in 451) [ClassicSimilarity], result of:
        4.345732 = fieldWeight in 451, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9531717 = idf(docFreq=108, maxDocs=41962)
          0.625 = fieldNorm(doc=451)
    
  5. Thelwall, M.: Can Google's PageRank be used to find the most important academic Web pages? (2003) 4.35
    4.345732 = sum of:
      4.345732 = weight(author_txt:thelwall in 458) [ClassicSimilarity], result of:
        4.345732 = fieldWeight in 458, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.9531717 = idf(docFreq=108, maxDocs=41962)
          0.625 = fieldNorm(doc=458)
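
The author-match scores above all follow the same arithmetic: fieldWeight = tf * idf * fieldNorm. The following is a minimal sketch of that calculation, not the search engine's own code. The formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)) are assumptions inferred from the numbers in the listing, which they reproduce; the function names are hypothetical.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency (assumed form)."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assumed form: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as laid out in the explain tree."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the listing: author_txt:thelwall, freq=1.0,
# docFreq=108, maxDocs=41962, fieldNorm=0.625.
author_idf = classic_idf(108, 41962)
author_score = field_weight(1.0, 108, 41962, 0.625)

# Both should land close to the listed 6.9531717 and 4.345732; the listing
# uses 32-bit floats, so the last digits may differ.
print(author_idf, author_score)
```

Because every author hit has the same term frequency, document frequency and field norm, all five entries receive the same score.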
    

Similar documents (content)

  1. Price, L.; Thelwall, M.: The clustering power of low frequency words in academic webs (2005) 0.27
    0.2732103 = sum of:
      0.2732103 = product of:
        1.1383762 = sum of:
          0.13950843 = weight(abstract_txt:zealand in 4562) [ClassicSimilarity], result of:
            0.13950843 = score(doc=4562,freq=1.0), product of:
              0.18263225 = queryWeight, product of:
                1.1508213 = boost
                8.148012 = idf(docFreq=32, maxDocs=41962)
                0.019476812 = queryNorm
              0.76387614 = fieldWeight in 4562, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.148012 = idf(docFreq=32, maxDocs=41962)
                0.09375 = fieldNorm(doc=4562)
          0.09506747 = weight(abstract_txt:academic in 4562) [ClassicSimilarity], result of:
            0.09506747 = score(doc=4562,freq=3.0), product of:
              0.12354772 = queryWeight, product of:
                1.3386022 = boost
                4.7387667 = idf(docFreq=997, maxDocs=41962)
                0.019476812 = queryNorm
              0.7694798 = fieldWeight in 4562, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.7387667 = idf(docFreq=997, maxDocs=41962)
                0.09375 = fieldNorm(doc=4562)
          0.08464592 = weight(abstract_txt:word in 4562) [ClassicSimilarity], result of:
            0.08464592 = score(doc=4562,freq=1.0), product of:
              0.16491413 = queryWeight, product of:
                1.5465469 = boost
                5.474909 = idf(docFreq=477, maxDocs=41962)
                0.019476812 = queryNorm
              0.5132727 = fieldWeight in 4562, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.474909 = idf(docFreq=477, maxDocs=41962)
                0.09375 = fieldNorm(doc=4562)
          0.21219999 = weight(abstract_txt:sites in 4562) [ClassicSimilarity], result of:
            0.21219999 = score(doc=4562,freq=3.0), product of:
              0.2415502 = queryWeight, product of:
                2.2923636 = boost
                5.410109 = idf(docFreq=509, maxDocs=41962)
                0.019476812 = queryNorm
              0.87849224 = fieldWeight in 4562, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.410109 = idf(docFreq=509, maxDocs=41962)
                0.09375 = fieldNorm(doc=4562)
          0.3300134 = weight(abstract_txt:frequency in 4562) [ClassicSimilarity], result of:
            0.3300134 = score(doc=4562,freq=4.0), product of:
              0.29459044 = queryWeight, product of:
                2.5315652 = boost
                5.974639 = idf(docFreq=289, maxDocs=41962)
                0.019476812 = queryNorm
              1.1202447 = fieldWeight in 4562, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.974639 = idf(docFreq=289, maxDocs=41962)
                0.09375 = fieldNorm(doc=4562)
          0.27694103 = weight(abstract_txt:words in 4562) [ClassicSimilarity], result of:
            0.27694103 = score(doc=4562,freq=3.0), product of:
              0.31750333 = queryWeight, product of:
                3.0347528 = boost
                5.3716426 = idf(docFreq=529, maxDocs=41962)
                0.019476812 = queryNorm
              0.872246 = fieldWeight in 4562, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.3716426 = idf(docFreq=529, maxDocs=41962)
                0.09375 = fieldNorm(doc=4562)
        0.24 = coord(6/25)
    
  2. Spink, A.; Wolfram, D.; Jansen, B.J.; Saracevic, T.: Searching the Web : the public and their queries (2001) 0.19
    0.19453616 = sum of:
      0.19453616 = product of:
        0.60792553 = sum of:
          0.056637995 = weight(abstract_txt:spelling in 981) [ClassicSimilarity], result of:
            0.056637995 = score(doc=981,freq=1.0), product of:
              0.15895313 = queryWeight, product of:
                1.0736277 = boost
                7.6014686 = idf(docFreq=56, maxDocs=41962)
                0.019476812 = queryNorm
              0.35631883 = fieldWeight in 981, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6014686 = idf(docFreq=56, maxDocs=41962)
                0.046875 = fieldNorm(doc=981)
          0.06685691 = weight(abstract_txt:minority in 981) [ClassicSimilarity], result of:
            0.06685691 = score(doc=981,freq=1.0), product of:
              0.1775394 = queryWeight, product of:
                1.134662 = boost
                8.033602 = idf(docFreq=36, maxDocs=41962)
                0.019476812 = queryNorm
              0.37657508 = fieldWeight in 981, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033602 = idf(docFreq=36, maxDocs=41962)
                0.046875 = fieldNorm(doc=981)
          0.01912671 = weight(abstract_txt:language in 981) [ClassicSimilarity], result of:
            0.01912671 = score(doc=981,freq=1.0), product of:
              0.097118214 = queryWeight, product of:
                1.1868191 = boost
                4.2014413 = idf(docFreq=1707, maxDocs=41962)
                0.019476812 = queryNorm
              0.19694257 = fieldWeight in 981, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2014413 = idf(docFreq=1707, maxDocs=41962)
                0.046875 = fieldNorm(doc=981)
          0.07944286 = weight(abstract_txt:mistakes in 981) [ClassicSimilarity], result of:
            0.07944286 = score(doc=981,freq=1.0), product of:
              0.19917451 = queryWeight, product of:
                1.2018106 = boost
                8.509026 = idf(docFreq=22, maxDocs=41962)
                0.019476812 = queryNorm
              0.39886057 = fieldWeight in 981, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.509026 = idf(docFreq=22, maxDocs=41962)
                0.046875 = fieldNorm(doc=981)
          0.08663028 = weight(abstract_txt:sites in 981) [ClassicSimilarity], result of:
            0.08663028 = score(doc=981,freq=2.0), product of:
              0.2415502 = queryWeight, product of:
                2.2923636 = boost
                5.410109 = idf(docFreq=509, maxDocs=41962)
                0.019476812 = queryNorm
              0.35864294 = fieldWeight in 981, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.410109 = idf(docFreq=509, maxDocs=41962)
                0.046875 = fieldNorm(doc=981)
          0.08250335 = weight(abstract_txt:frequency in 981) [ClassicSimilarity], result of:
            0.08250335 = score(doc=981,freq=1.0), product of:
              0.29459044 = queryWeight, product of:
                2.5315652 = boost
                5.974639 = idf(docFreq=289, maxDocs=41962)
                0.019476812 = queryNorm
              0.2800612 = fieldWeight in 981, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.974639 = idf(docFreq=289, maxDocs=41962)
                0.046875 = fieldNorm(doc=981)
          0.1130607 = weight(abstract_txt:words in 981) [ClassicSimilarity], result of:
            0.1130607 = score(doc=981,freq=2.0), product of:
              0.31750333 = queryWeight, product of:
                3.0347528 = boost
                5.3716426 = idf(docFreq=529, maxDocs=41962)
                0.019476812 = queryNorm
              0.35609296 = fieldWeight in 981, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.3716426 = idf(docFreq=529, maxDocs=41962)
                0.046875 = fieldNorm(doc=981)
          0.10366674 = weight(abstract_txt:names in 981) [ClassicSimilarity], result of:
            0.10366674 = score(doc=981,freq=1.0), product of:
              0.37755203 = queryWeight, product of:
                3.309311 = boost
                5.857622 = idf(docFreq=325, maxDocs=41962)
                0.019476812 = queryNorm
              0.27457604 = fieldWeight in 981, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.857622 = idf(docFreq=325, maxDocs=41962)
                0.046875 = fieldNorm(doc=981)
        0.32 = coord(8/25)
    
  3. Thelwall, M.; Wilkinson, D.: Graph structure in three national academic Webs : power laws with anomalies (2003) 0.18
    0.17944086 = sum of:
      0.17944086 = product of:
        0.74767023 = sum of:
          0.11625701 = weight(abstract_txt:zealand in 2682) [ClassicSimilarity], result of:
            0.11625701 = score(doc=2682,freq=1.0), product of:
              0.18263225 = queryWeight, product of:
                1.1508213 = boost
                8.148012 = idf(docFreq=32, maxDocs=41962)
                0.019476812 = queryNorm
              0.6365634 = fieldWeight in 2682, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.148012 = idf(docFreq=32, maxDocs=41962)
                0.078125 = fieldNorm(doc=2682)
          0.11625701 = weight(abstract_txt:webs in 2682) [ClassicSimilarity], result of:
            0.11625701 = score(doc=2682,freq=1.0), product of:
              0.18263225 = queryWeight, product of:
                1.1508213 = boost
                8.148012 = idf(docFreq=32, maxDocs=41962)
                0.019476812 = queryNorm
              0.6365634 = fieldWeight in 2682, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.148012 = idf(docFreq=32, maxDocs=41962)
                0.078125 = fieldNorm(doc=2682)
          0.1872486 = weight(abstract_txt:anomalies in 2682) [ClassicSimilarity], result of:
            0.1872486 = score(doc=2682,freq=2.0), product of:
              0.19917451 = queryWeight, product of:
                1.2018106 = boost
                8.509026 = idf(docFreq=22, maxDocs=41962)
                0.019476812 = queryNorm
              0.9401233 = fieldWeight in 2682, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.509026 = idf(docFreq=22, maxDocs=41962)
                0.078125 = fieldNorm(doc=2682)
          0.046826772 = weight(abstract_txt:university in 2682) [ClassicSimilarity], result of:
            0.046826772 = score(doc=2682,freq=2.0), product of:
              0.09960799 = queryWeight, product of:
                1.2019358 = boost
                4.254956 = idf(docFreq=1618, maxDocs=41962)
                0.019476812 = queryNorm
              0.47011063 = fieldWeight in 2682, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.254956 = idf(docFreq=1618, maxDocs=41962)
                0.078125 = fieldNorm(doc=2682)
          0.13669704 = weight(abstract_txt:regularities in 2682) [ClassicSimilarity], result of:
            0.13669704 = score(doc=2682,freq=1.0), product of:
              0.2034561 = queryWeight, product of:
                1.2146595 = boost
                8.5999975 = idf(docFreq=20, maxDocs=41962)
                0.019476812 = queryNorm
              0.6718748 = fieldWeight in 2682, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.5999975 = idf(docFreq=20, maxDocs=41962)
                0.078125 = fieldNorm(doc=2682)
          0.1443838 = weight(abstract_txt:sites in 2682) [ClassicSimilarity], result of:
            0.1443838 = score(doc=2682,freq=2.0), product of:
              0.2415502 = queryWeight, product of:
                2.2923636 = boost
                5.410109 = idf(docFreq=509, maxDocs=41962)
                0.019476812 = queryNorm
              0.59773827 = fieldWeight in 2682, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.410109 = idf(docFreq=509, maxDocs=41962)
                0.078125 = fieldNorm(doc=2682)
        0.24 = coord(6/25)
    
  4. Wacholder, N.; Byrd, R.J.: Retrieving information from full text using linguistic knowledge (1994) 0.12
    0.119431436 = sum of:
      0.119431436 = product of:
        0.5971572 = sum of:
          0.09439666 = weight(abstract_txt:spelling in 524) [ClassicSimilarity], result of:
            0.09439666 = score(doc=524,freq=1.0), product of:
              0.15895313 = queryWeight, product of:
                1.0736277 = boost
                7.6014686 = idf(docFreq=56, maxDocs=41962)
                0.019476812 = queryNorm
              0.59386474 = fieldWeight in 524, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6014686 = idf(docFreq=56, maxDocs=41962)
                0.078125 = fieldNorm(doc=524)
          0.055214055 = weight(abstract_txt:language in 524) [ClassicSimilarity], result of:
            0.055214055 = score(doc=524,freq=3.0), product of:
              0.097118214 = queryWeight, product of:
                1.1868191 = boost
                4.2014413 = idf(docFreq=1707, maxDocs=41962)
                0.019476812 = queryNorm
              0.5685242 = fieldWeight in 524, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2014413 = idf(docFreq=1707, maxDocs=41962)
                0.078125 = fieldNorm(doc=524)
          0.14152527 = weight(abstract_txt:acronyms in 524) [ClassicSimilarity], result of:
            0.14152527 = score(doc=524,freq=1.0), product of:
              0.20821916 = queryWeight, product of:
                1.2287952 = boost
                8.700081 = idf(docFreq=18, maxDocs=41962)
                0.019476812 = queryNorm
              0.6796938 = fieldWeight in 524, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.700081 = idf(docFreq=18, maxDocs=41962)
                0.078125 = fieldNorm(doc=524)
          0.13324332 = weight(abstract_txt:words in 524) [ClassicSimilarity], result of:
            0.13324332 = score(doc=524,freq=1.0), product of:
              0.31750333 = queryWeight, product of:
                3.0347528 = boost
                5.3716426 = idf(docFreq=529, maxDocs=41962)
                0.019476812 = queryNorm
              0.41965958 = fieldWeight in 524, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3716426 = idf(docFreq=529, maxDocs=41962)
                0.078125 = fieldNorm(doc=524)
          0.1727779 = weight(abstract_txt:names in 524) [ClassicSimilarity], result of:
            0.1727779 = score(doc=524,freq=1.0), product of:
              0.37755203 = queryWeight, product of:
                3.309311 = boost
                5.857622 = idf(docFreq=325, maxDocs=41962)
                0.019476812 = queryNorm
              0.45762673 = fieldWeight in 524, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.857622 = idf(docFreq=325, maxDocs=41962)
                0.078125 = fieldNorm(doc=524)
        0.2 = coord(5/25)
    
  5. Riggs, F.W.: Information and social science : the need for onomantics (1989) 0.11
    0.11263337 = sum of:
      0.11263337 = product of:
        0.56316686 = sum of:
          0.045082085 = weight(abstract_txt:language in 2911) [ClassicSimilarity], result of:
            0.045082085 = score(doc=2911,freq=2.0), product of:
              0.097118214 = queryWeight, product of:
                1.1868191 = boost
                4.2014413 = idf(docFreq=1707, maxDocs=41962)
                0.019476812 = queryNorm
              0.46419805 = fieldWeight in 2911, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2014413 = idf(docFreq=1707, maxDocs=41962)
                0.078125 = fieldNorm(doc=2911)
          0.14152527 = weight(abstract_txt:acronyms in 2911) [ClassicSimilarity], result of:
            0.14152527 = score(doc=2911,freq=1.0), product of:
              0.20821916 = queryWeight, product of:
                1.2287952 = boost
                8.700081 = idf(docFreq=18, maxDocs=41962)
                0.019476812 = queryNorm
              0.6796938 = fieldWeight in 2911, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.700081 = idf(docFreq=18, maxDocs=41962)
                0.078125 = fieldNorm(doc=2911)
          0.07053827 = weight(abstract_txt:word in 2911) [ClassicSimilarity], result of:
            0.07053827 = score(doc=2911,freq=1.0), product of:
              0.16491413 = queryWeight, product of:
                1.5465469 = boost
                5.474909 = idf(docFreq=477, maxDocs=41962)
                0.019476812 = queryNorm
              0.42772725 = fieldWeight in 2911, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.474909 = idf(docFreq=477, maxDocs=41962)
                0.078125 = fieldNorm(doc=2911)
          0.13324332 = weight(abstract_txt:words in 2911) [ClassicSimilarity], result of:
            0.13324332 = score(doc=2911,freq=1.0), product of:
              0.31750333 = queryWeight, product of:
                3.0347528 = boost
                5.3716426 = idf(docFreq=529, maxDocs=41962)
                0.019476812 = queryNorm
              0.41965958 = fieldWeight in 2911, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3716426 = idf(docFreq=529, maxDocs=41962)
                0.078125 = fieldNorm(doc=2911)
          0.1727779 = weight(abstract_txt:names in 2911) [ClassicSimilarity], result of:
            0.1727779 = score(doc=2911,freq=1.0), product of:
              0.37755203 = queryWeight, product of:
                3.309311 = boost
                5.857622 = idf(docFreq=325, maxDocs=41962)
                0.019476812 = queryNorm
              0.45762673 = fieldWeight in 2911, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.857622 = idf(docFreq=325, maxDocs=41962)
                0.078125 = fieldNorm(doc=2911)
        0.2 = coord(5/25)
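
The content-similarity scores combine several per-term scores: each term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is scaled by the coord factor (matched terms / query terms). The sketch below recomputes the first content hit (doc 4562) from the values in its explain tree. It is an illustration under assumptions: the tf and idf formulas are inferred from the listed numbers, and the function and variable names are hypothetical.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Assumed idf form that reproduces the listed values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, max_docs, query_norm, field_norm):
    """Per-term score = queryWeight * fieldWeight, following the explain tree."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm                # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * term_idf * field_norm      # tf * idf * fieldNorm
    return query_weight * field_weight


MAX_DOCS = 41962
QUERY_NORM = 0.019476812
FIELD_NORM = 0.09375  # fieldNorm(doc=4562)

# (term, freq, docFreq, boost) copied from the explain output for doc 4562.
matched_terms = [
    ("zealand", 1.0, 32, 1.1508213),
    ("academic", 3.0, 997, 1.3386022),
    ("word", 1.0, 477, 1.5465469),
    ("sites", 3.0, 509, 2.2923636),
    ("frequency", 4.0, 289, 2.5315652),
    ("words", 3.0, 529, 3.0347528),
]

raw_sum = sum(
    term_score(freq, doc_freq, boost, MAX_DOCS, QUERY_NORM, FIELD_NORM)
    for _, freq, doc_freq, boost in matched_terms
)
coord = len(matched_terms) / 25  # 6 of the 25 query terms matched
total = raw_sum * coord

# Should be close to the listed 1.1383762 and 0.2732103 (float rounding aside).
print(raw_sum, total)
```

The coord factor explains why the second hit (8 of 25 terms matched) ranks below the first despite matching more terms: its individual term scores are smaller because its fieldNorm is lower.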