Document (#28462)

Author
Thelwall, M.
Title
Text characteristics of English language university Web sites
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, pp. 609-619
Year
2005
Abstract
The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and for search engine and topic-specific crawler designers. We analyze word frequencies in national academic Webs using the Web sites of three English-speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High frequency words include university names and acronyms, Internet terminology, and computing product names: not always words in common usage away from the Web. A minority of low frequency words are spelling mistakes, with other common types including nonwords, proper names, foreign language terms or computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications.

Similar documents (author)

  1. Thelwall, M.; Thelwall, S.: A thematic analysis of highly retweeted early COVID-19 tweets : consensus, information, dissent and lockdown life (2020) 4.91
    4.9051075 = sum of:
      4.9051075 = weight(author_txt:thelwall in 2465) [ClassicSimilarity], result of:
        4.9051075 = score(doc=2465,freq=2.0), product of:
          0.99999994 = queryWeight, product of:
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.14415722 = queryNorm
          4.905108 = fieldWeight in 2465, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.5 = fieldNorm(doc=2465)
    
  2. Thelwall, M.: Extracting macroscopic information from Web links (2001) 4.34
    4.3355436 = sum of:
      4.3355436 = weight(author_txt:thelwall in 849) [ClassicSimilarity], result of:
        4.3355436 = score(doc=849,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.14415722 = queryNorm
          4.335544 = fieldWeight in 849, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.625 = fieldNorm(doc=849)
    
  3. Thelwall, M.: Conceptualizing documentation on the Web : an evaluation of different heuristic-based models for counting links between university Web sites (2002) 4.34
    4.3355436 = sum of:
      4.3355436 = weight(author_txt:thelwall in 1976) [ClassicSimilarity], result of:
        4.3355436 = score(doc=1976,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.14415722 = queryNorm
          4.335544 = fieldWeight in 1976, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.625 = fieldNorm(doc=1976)
    
  4. Thelwall, M.: Bibliometrics to webometrics (2009) 4.34
    4.3355436 = sum of:
      4.3355436 = weight(author_txt:thelwall in 237) [ClassicSimilarity], result of:
        4.3355436 = score(doc=237,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.14415722 = queryNorm
          4.335544 = fieldWeight in 237, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.625 = fieldNorm(doc=237)
    
  5. Thelwall, M.: A layered approach for investigating the topological structure of communities in the Web (2003) 4.34
    4.3355436 = sum of:
      4.3355436 = weight(author_txt:thelwall in 448) [ClassicSimilarity], result of:
        4.3355436 = score(doc=448,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.14415722 = queryNorm
          4.335544 = fieldWeight in 448, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.9368706 = idf(docFreq=114, maxDocs=43556)
            0.625 = fieldNorm(doc=448)
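
The five author-match explanations above all follow the same single-term arithmetic, which can be sketched in Python as below. This is a minimal sketch, not the scoring engine's own code: the function names are hypothetical, and the idf form 1 + ln(maxDocs / (docFreq + 1)) is an assumption inferred from the numbers in the listing (it reproduces 6.9368706 for docFreq=114, maxDocs=43556). fieldNorm and queryNorm are taken as given from the listing rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed idf form, inferred from the values shown in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    # tf is the square root of the raw term frequency,
    # e.g. tf(freq=2.0) = 1.4142135 in the listing.
    return math.sqrt(freq)


def single_term_score(freq, doc_freq, max_docs, field_norm, query_norm):
    idf = classic_idf(doc_freq, max_docs)
    # queryWeight = idf * queryNorm
    query_weight = idf * query_norm
    # fieldWeight = tf * idf * fieldNorm
    field_weight = classic_tf(freq) * idf * field_norm
    # score = queryWeight * fieldWeight
    return query_weight * field_weight


if __name__ == "__main__":
    # Inputs copied from entry 1 of the author list (doc=2465).
    print(single_term_score(2.0, 114, 43556, 0.5, 0.14415722))
    # Inputs copied from entry 2 of the author list (doc=849).
    print(single_term_score(1.0, 114, 43556, 0.625, 0.14415722))
```

With the inputs from entry 1 the function returns about 4.9051, and with those from entries 2 to 5 about 4.3355, matching the listed scores to within floating-point rounding. Since idf and queryNorm are identical across the five entries, the ranking here depends only on tf and fieldNorm.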
    

Similar documents (content)

  1. Price, L.; Thelwall, M.: The clustering power of low frequency words in academic webs (2005) 0.27
    0.2705313 = sum of:
      0.2705313 = product of:
        1.1272137 = sum of:
          0.13751177 = weight(abstract_txt:zealand in 4559) [ClassicSimilarity], result of:
            0.13751177 = score(doc=4559,freq=1.0), product of:
              0.18112384 = queryWeight, product of:
                1.147581 = boost
                8.098284 = idf(docFreq=35, maxDocs=43556)
                0.019489437 = queryNorm
              0.7592141 = fieldWeight in 4559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.098284 = idf(docFreq=35, maxDocs=43556)
                0.09375 = fieldNorm(doc=4559)
          0.09316785 = weight(abstract_txt:academic in 4559) [ClassicSimilarity], result of:
            0.09316785 = score(doc=4559,freq=3.0), product of:
              0.12205699 = queryWeight, product of:
                1.3322688 = boost
                4.700797 = idf(docFreq=1075, maxDocs=43556)
                0.019489437 = queryNorm
              0.7633143 = fieldWeight in 4559, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.700797 = idf(docFreq=1075, maxDocs=43556)
                0.09375 = fieldNorm(doc=4559)
          0.08334641 = weight(abstract_txt:word in 4559) [ClassicSimilarity], result of:
            0.08334641 = score(doc=4559,freq=1.0), product of:
              0.16343696 = queryWeight, product of:
                1.5416496 = boost
                5.4395795 = idf(docFreq=513, maxDocs=43556)
                0.019489437 = queryNorm
              0.5099606 = fieldWeight in 4559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4395795 = idf(docFreq=513, maxDocs=43556)
                0.09375 = fieldNorm(doc=4559)
          0.21091793 = weight(abstract_txt:sites in 4559) [ClassicSimilarity], result of:
            0.21091793 = score(doc=4559,freq=3.0), product of:
              0.24089329 = queryWeight, product of:
                2.2922845 = boost
                5.392087 = idf(docFreq=538, maxDocs=43556)
                0.019489437 = queryNorm
              0.8755658 = fieldWeight in 4559, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.392087 = idf(docFreq=538, maxDocs=43556)
                0.09375 = fieldNorm(doc=4559)
          0.32698372 = weight(abstract_txt:frequency in 4559) [ClassicSimilarity], result of:
            0.32698372 = score(doc=4559,freq=4.0), product of:
              0.29317045 = queryWeight, product of:
                2.5288103 = boost
                5.9484615 = idf(docFreq=308, maxDocs=43556)
                0.019489437 = queryNorm
              1.1153365 = fieldWeight in 4559, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.9484615 = idf(docFreq=308, maxDocs=43556)
                0.09375 = fieldNorm(doc=4559)
          0.27528596 = weight(abstract_txt:words in 4559) [ClassicSimilarity], result of:
            0.27528596 = score(doc=4559,freq=3.0), product of:
              0.31665376 = queryWeight, product of:
                3.0347145 = boost
                5.353866 = idf(docFreq=559, maxDocs=43556)
                0.019489437 = queryNorm
              0.8693595 = fieldWeight in 4559, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.353866 = idf(docFreq=559, maxDocs=43556)
                0.09375 = fieldNorm(doc=4559)
        0.24 = coord(6/25)
    
  2. Spink, A.; Wolfram, D.; Jansen, B.J.; Saracevic, T.: Searching the Web : the public and their queries (2001) 0.19
    0.1937442 = sum of:
      0.1937442 = product of:
        0.60545063 = sum of:
          0.05770296 = weight(abstract_txt:spelling in 978) [ClassicSimilarity], result of:
            0.05770296 = score(doc=978,freq=1.0), product of:
              0.16115153 = queryWeight, product of:
                1.0824622 = boost
                7.6387515 = idf(docFreq=56, maxDocs=43556)
                0.019489437 = queryNorm
              0.35806647 = fieldWeight in 978, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6387515 = idf(docFreq=56, maxDocs=43556)
                0.046875 = fieldNorm(doc=978)
          0.06549627 = weight(abstract_txt:minority in 978) [ClassicSimilarity], result of:
            0.06549627 = score(doc=978,freq=1.0), product of:
              0.1753531 = queryWeight, product of:
                1.1291516 = boost
                7.9682307 = idf(docFreq=40, maxDocs=43556)
                0.019489437 = queryNorm
              0.3735108 = fieldWeight in 978, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9682307 = idf(docFreq=40, maxDocs=43556)
                0.046875 = fieldNorm(doc=978)
          0.019048113 = weight(abstract_txt:language in 978) [ClassicSimilarity], result of:
            0.019048113 = score(doc=978,freq=1.0), product of:
              0.09697959 = queryWeight, product of:
                1.1875466 = boost
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.019489437 = queryNorm
              0.19641364 = fieldWeight in 978, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.046875 = fieldNorm(doc=978)
          0.07960901 = weight(abstract_txt:mistakes in 978) [ClassicSimilarity], result of:
            0.07960901 = score(doc=978,freq=1.0), product of:
              0.19971491 = queryWeight, product of:
                1.2050381 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.019489437 = queryNorm
              0.3986132 = fieldWeight in 978, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.046875 = fieldNorm(doc=978)
          0.086106874 = weight(abstract_txt:sites in 978) [ClassicSimilarity], result of:
            0.086106874 = score(doc=978,freq=2.0), product of:
              0.24089329 = queryWeight, product of:
                2.2922845 = boost
                5.392087 = idf(docFreq=538, maxDocs=43556)
                0.019489437 = queryNorm
              0.35744822 = fieldWeight in 978, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.392087 = idf(docFreq=538, maxDocs=43556)
                0.046875 = fieldNorm(doc=978)
          0.08174593 = weight(abstract_txt:frequency in 978) [ClassicSimilarity], result of:
            0.08174593 = score(doc=978,freq=1.0), product of:
              0.29317045 = queryWeight, product of:
                2.5288103 = boost
                5.9484615 = idf(docFreq=308, maxDocs=43556)
                0.019489437 = queryNorm
              0.27883413 = fieldWeight in 978, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9484615 = idf(docFreq=308, maxDocs=43556)
                0.046875 = fieldNorm(doc=978)
          0.11238501 = weight(abstract_txt:words in 978) [ClassicSimilarity], result of:
            0.11238501 = score(doc=978,freq=2.0), product of:
              0.31665376 = queryWeight, product of:
                3.0347145 = boost
                5.353866 = idf(docFreq=559, maxDocs=43556)
                0.019489437 = queryNorm
              0.35491452 = fieldWeight in 978, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.353866 = idf(docFreq=559, maxDocs=43556)
                0.046875 = fieldNorm(doc=978)
          0.103356466 = weight(abstract_txt:names in 978) [ClassicSimilarity], result of:
            0.103356466 = score(doc=978,freq=1.0), product of:
              0.37729478 = queryWeight, product of:
                3.3125765 = boost
                5.8440723 = idf(docFreq=342, maxDocs=43556)
                0.019489437 = queryNorm
              0.2739409 = fieldWeight in 978, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8440723 = idf(docFreq=342, maxDocs=43556)
                0.046875 = fieldNorm(doc=978)
        0.32 = coord(8/25)
    
  3. Thelwall, M.; Wilkinson, D.: Graph structure in three national academic Webs : power laws with anomalies (2003) 0.18
    0.17925067 = sum of:
      0.17925067 = product of:
        0.74687785 = sum of:
          0.11459314 = weight(abstract_txt:zealand in 2679) [ClassicSimilarity], result of:
            0.11459314 = score(doc=2679,freq=1.0), product of:
              0.18112384 = queryWeight, product of:
                1.147581 = boost
                8.098284 = idf(docFreq=35, maxDocs=43556)
                0.019489437 = queryNorm
              0.6326784 = fieldWeight in 2679, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.098284 = idf(docFreq=35, maxDocs=43556)
                0.078125 = fieldNorm(doc=2679)
          0.11703674 = weight(abstract_txt:webs in 2679) [ClassicSimilarity], result of:
            0.11703674 = score(doc=2679,freq=1.0), product of:
              0.18368964 = queryWeight, product of:
                1.1556807 = boost
                8.155442 = idf(docFreq=33, maxDocs=43556)
                0.019489437 = queryNorm
              0.6371439 = fieldWeight in 2679, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.155442 = idf(docFreq=33, maxDocs=43556)
                0.078125 = fieldNorm(doc=2679)
          0.18764022 = weight(abstract_txt:anomalies in 2679) [ClassicSimilarity], result of:
            0.18764022 = score(doc=2679,freq=2.0), product of:
              0.19971491 = queryWeight, product of:
                1.2050381 = boost
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.019489437 = queryNorm
              0.9395403 = fieldWeight in 2679, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.503749 = idf(docFreq=23, maxDocs=43556)
                0.078125 = fieldNorm(doc=2679)
          0.04729995 = weight(abstract_txt:university in 2679) [ClassicSimilarity], result of:
            0.04729995 = score(doc=2679,freq=2.0), product of:
              0.10041001 = queryWeight, product of:
                1.2083675 = boost
                4.263622 = idf(docFreq=1665, maxDocs=43556)
                0.019489437 = queryNorm
              0.47106808 = fieldWeight in 2679, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.263622 = idf(docFreq=1665, maxDocs=43556)
                0.078125 = fieldNorm(doc=2679)
          0.13679633 = weight(abstract_txt:regularities in 2679) [ClassicSimilarity], result of:
            0.13679633 = score(doc=2679,freq=1.0), product of:
              0.20382282 = queryWeight, product of:
                1.2173681 = boost
                8.59076 = idf(docFreq=21, maxDocs=43556)
                0.019489437 = queryNorm
              0.6711531 = fieldWeight in 2679, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.59076 = idf(docFreq=21, maxDocs=43556)
                0.078125 = fieldNorm(doc=2679)
          0.14351147 = weight(abstract_txt:sites in 2679) [ClassicSimilarity], result of:
            0.14351147 = score(doc=2679,freq=2.0), product of:
              0.24089329 = queryWeight, product of:
                2.2922845 = boost
                5.392087 = idf(docFreq=538, maxDocs=43556)
                0.019489437 = queryNorm
              0.59574705 = fieldWeight in 2679, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.392087 = idf(docFreq=538, maxDocs=43556)
                0.078125 = fieldNorm(doc=2679)
        0.24 = coord(6/25)
    
  4. Wacholder, N.; Byrd, R.J.: Retrieving information from full text using linguistic knowledge (1994) 0.12
    0.119453326 = sum of:
      0.119453326 = product of:
        0.5972666 = sum of:
          0.096171595 = weight(abstract_txt:spelling in 521) [ClassicSimilarity], result of:
            0.096171595 = score(doc=521,freq=1.0), product of:
              0.16115153 = queryWeight, product of:
                1.0824622 = boost
                7.6387515 = idf(docFreq=56, maxDocs=43556)
                0.019489437 = queryNorm
              0.59677744 = fieldWeight in 521, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6387515 = idf(docFreq=56, maxDocs=43556)
                0.078125 = fieldNorm(doc=521)
          0.054987162 = weight(abstract_txt:language in 521) [ClassicSimilarity], result of:
            0.054987162 = score(doc=521,freq=3.0), product of:
              0.09697959 = queryWeight, product of:
                1.1875466 = boost
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.019489437 = queryNorm
              0.5669973 = fieldWeight in 521, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.078125 = fieldNorm(doc=521)
          0.1414001 = weight(abstract_txt:acronyms in 521) [ClassicSimilarity], result of:
            0.1414001 = score(doc=521,freq=1.0), product of:
              0.20837055 = queryWeight, product of:
                1.2308743 = boost
                8.68607 = idf(docFreq=19, maxDocs=43556)
                0.019489437 = queryNorm
              0.67859924 = fieldWeight in 521, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.68607 = idf(docFreq=19, maxDocs=43556)
                0.078125 = fieldNorm(doc=521)
          0.13244702 = weight(abstract_txt:words in 521) [ClassicSimilarity], result of:
            0.13244702 = score(doc=521,freq=1.0), product of:
              0.31665376 = queryWeight, product of:
                3.0347145 = boost
                5.353866 = idf(docFreq=559, maxDocs=43556)
                0.019489437 = queryNorm
              0.4182708 = fieldWeight in 521, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353866 = idf(docFreq=559, maxDocs=43556)
                0.078125 = fieldNorm(doc=521)
          0.17226078 = weight(abstract_txt:names in 521) [ClassicSimilarity], result of:
            0.17226078 = score(doc=521,freq=1.0), product of:
              0.37729478 = queryWeight, product of:
                3.3125765 = boost
                5.8440723 = idf(docFreq=342, maxDocs=43556)
                0.019489437 = queryNorm
              0.45656815 = fieldWeight in 521, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8440723 = idf(docFreq=342, maxDocs=43556)
                0.078125 = fieldNorm(doc=521)
        0.2 = coord(5/25)
    
  5. Riggs, F.W.: Information and social science : the need for onomantics (1989) 0.11
    0.11209201 = sum of:
      0.11209201 = product of:
        0.56046003 = sum of:
          0.044896834 = weight(abstract_txt:language in 2908) [ClassicSimilarity], result of:
            0.044896834 = score(doc=2908,freq=2.0), product of:
              0.09697959 = queryWeight, product of:
                1.1875466 = boost
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.019489437 = queryNorm
              0.46295136 = fieldWeight in 2908, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1901574 = idf(docFreq=1792, maxDocs=43556)
                0.078125 = fieldNorm(doc=2908)
          0.1414001 = weight(abstract_txt:acronyms in 2908) [ClassicSimilarity], result of:
            0.1414001 = score(doc=2908,freq=1.0), product of:
              0.20837055 = queryWeight, product of:
                1.2308743 = boost
                8.68607 = idf(docFreq=19, maxDocs=43556)
                0.019489437 = queryNorm
              0.67859924 = fieldWeight in 2908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.68607 = idf(docFreq=19, maxDocs=43556)
                0.078125 = fieldNorm(doc=2908)
          0.06945534 = weight(abstract_txt:word in 2908) [ClassicSimilarity], result of:
            0.06945534 = score(doc=2908,freq=1.0), product of:
              0.16343696 = queryWeight, product of:
                1.5416496 = boost
                5.4395795 = idf(docFreq=513, maxDocs=43556)
                0.019489437 = queryNorm
              0.42496714 = fieldWeight in 2908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4395795 = idf(docFreq=513, maxDocs=43556)
                0.078125 = fieldNorm(doc=2908)
          0.13244702 = weight(abstract_txt:words in 2908) [ClassicSimilarity], result of:
            0.13244702 = score(doc=2908,freq=1.0), product of:
              0.31665376 = queryWeight, product of:
                3.0347145 = boost
                5.353866 = idf(docFreq=559, maxDocs=43556)
                0.019489437 = queryNorm
              0.4182708 = fieldWeight in 2908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353866 = idf(docFreq=559, maxDocs=43556)
                0.078125 = fieldNorm(doc=2908)
          0.17226078 = weight(abstract_txt:names in 2908) [ClassicSimilarity], result of:
            0.17226078 = score(doc=2908,freq=1.0), product of:
              0.37729478 = queryWeight, product of:
                3.3125765 = boost
                5.8440723 = idf(docFreq=342, maxDocs=43556)
                0.019489437 = queryNorm
              0.45656815 = fieldWeight in 2908, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8440723 = idf(docFreq=342, maxDocs=43556)
                0.078125 = fieldNorm(doc=2908)
        0.2 = coord(5/25)
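
The content-match explanations add two ingredients to the single-term arithmetic: a per-term boost inside queryWeight, and a coord factor that scales the summed term scores by the fraction of query terms matched. The Python sketch below recomputes entry 5 (doc=2908) under those assumptions. The function and variable names are hypothetical, the idf form 1 + ln(maxDocs / (docFreq + 1)) is inferred from the listed values rather than confirmed by the source, and the boosts, queryNorm, and fieldNorm are copied from the listing rather than derived.

```python
import math

MAX_DOCS = 43556
QUERY_NORM = 0.019489437  # copied from the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed idf form, inferred from the values shown in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def boosted_term_score(boost, freq, doc_freq, field_norm, query_norm=QUERY_NORM):
    term_idf = idf(doc_freq)
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * term_idf * query_norm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, total_query_terms):
    # Sum the scores of the matched terms, then scale by coord,
    # the fraction of query terms that matched (e.g. 5/25 = 0.2).
    matched_sum = sum(
        boosted_term_score(boost, freq, doc_freq, field_norm)
        for boost, freq, doc_freq in terms
    )
    coord = len(terms) / total_query_terms
    return matched_sum * coord


# (boost, termFreq, docFreq) for each matched term of entry 5 (doc=2908).
RIGGS_TERMS = [
    (1.1875466, 2.0, 1792),  # language
    (1.2308743, 1.0, 19),    # acronyms
    (1.5416496, 1.0, 513),   # word
    (3.0347145, 1.0, 559),   # words
    (3.3125765, 1.0, 342),   # names
]

if __name__ == "__main__":
    print(document_score(RIGGS_TERMS, field_norm=0.078125, total_query_terms=25))
```

Run on these inputs, the sketch yields about 0.1121, in line with the 0.11209201 listed for entry 5. The coord factor explains why entry 2, which matches 8 of 25 terms, can outrank entry 3 despite a smaller raw sum (0.605 against 0.747).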