Document (#28464)

Author
Thelwall, M.
Title
Text characteristics of English language university Web sites
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, pp. 609-619
Year
2005
Abstract
The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and for search engine and topic-specific crawler designers. We analyze word frequencies in national academic Webs using the Web sites of three English-speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High frequency words include university names and acronyms, Internet terminology, and computing product names: not always words in common usage away from the Web. A minority of low frequency words are spelling mistakes, with other common types including nonwords, proper names, foreign language terms or computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications.

Similar documents (author)

  1. Thelwall, M.; Thelwall, S.: A thematic analysis of highly retweeted early COVID-19 tweets : consensus, information, dissent and lockdown life (2020) 4.90
    4.897565 = sum of:
      4.897565 = weight(author_txt:thelwall in 178) [ClassicSimilarity], result of:
        4.897565 = fieldWeight in 178, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          6.926203 = idf(docFreq=117, maxDocs=44218)
          0.5 = fieldNorm(doc=178)
    
  2. Thelwall, M.: Extracting macroscopic information from Web links (2001) 4.33
    4.3288765 = sum of:
      4.3288765 = weight(author_txt:thelwall in 6851) [ClassicSimilarity], result of:
        4.3288765 = fieldWeight in 6851, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.926203 = idf(docFreq=117, maxDocs=44218)
          0.625 = fieldNorm(doc=6851)
    
  3. Thelwall, M.: Conceptualizing documentation on the Web : an evaluation of different heuristic-based models for counting links between university Web sites (2002) 4.33
    4.3288765 = sum of:
      4.3288765 = weight(author_txt:thelwall in 978) [ClassicSimilarity], result of:
        4.3288765 = fieldWeight in 978, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.926203 = idf(docFreq=117, maxDocs=44218)
          0.625 = fieldNorm(doc=978)
    
  4. Thelwall, M.: Bibliometrics to webometrics (2009) 4.33
    4.3288765 = sum of:
      4.3288765 = weight(author_txt:thelwall in 4239) [ClassicSimilarity], result of:
        4.3288765 = fieldWeight in 4239, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.926203 = idf(docFreq=117, maxDocs=44218)
          0.625 = fieldNorm(doc=4239)
    
  5. Thelwall, M.: A layered approach for investigating the topological structure of communities in the Web (2003) 4.33
    4.3288765 = sum of:
      4.3288765 = weight(author_txt:thelwall in 4450) [ClassicSimilarity], result of:
        4.3288765 = fieldWeight in 4450, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.926203 = idf(docFreq=117, maxDocs=44218)
          0.625 = fieldNorm(doc=4450)
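
The author-match scores above can be checked by hand. The sketch below assumes the default Lucene ClassicSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are hypothetical, and the input numbers are copied from the score trees above. Lucene computes in 32-bit floats, so the last digits may differ slightly.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the score trees above."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Entry 1: author term occurs twice, fieldNorm 0.5 (listed as 4.897565).
w1 = field_weight(freq=2.0, doc_freq=117, max_docs=44218, field_norm=0.5)

# Entries 2 to 5: author term occurs once, fieldNorm 0.625 (listed as 4.3288765).
w2 = field_weight(freq=1.0, doc_freq=117, max_docs=44218, field_norm=0.625)

print(f"{w1:.4f} {w2:.4f}")
```

The first entry outranks the others only because the author term occurs twice in its author field; the idf factor is the same for all five.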
    

Similar documents (content)

  1. Price, L.; Thelwall, M.: The clustering power of low frequency words in academic webs (2005) 0.27
    0.27088743 = sum of:
      0.27088743 = product of:
        1.1286976 = sum of:
          0.13846679 = weight(abstract_txt:zealand in 3561) [ClassicSimilarity], result of:
            0.13846679 = score(doc=3561,freq=1.0), product of:
              0.18204266 = queryWeight, product of:
                1.1550827 = boost
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.019424906 = queryNorm
              0.7606282 = fieldWeight in 3561, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.09375 = fieldNorm(doc=3561)
          0.09207621 = weight(abstract_txt:academic in 3561) [ClassicSimilarity], result of:
            0.09207621 = score(doc=3561,freq=3.0), product of:
              0.12115574 = queryWeight, product of:
                1.332642 = boost
                4.6802773 = idf(docFreq=1114, maxDocs=44218)
                0.019424906 = queryNorm
              0.7599823 = fieldWeight in 3561, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6802773 = idf(docFreq=1114, maxDocs=44218)
                0.09375 = fieldNorm(doc=3561)
          0.083265595 = weight(abstract_txt:word in 3561) [ClassicSimilarity], result of:
            0.083265595 = score(doc=3561,freq=1.0), product of:
              0.16340418 = queryWeight, product of:
                1.547651 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.019424906 = queryNorm
              0.50956833 = fieldWeight in 3561, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.09375 = fieldNorm(doc=3561)
          0.21210536 = weight(abstract_txt:sites in 3561) [ClassicSimilarity], result of:
            0.21210536 = score(doc=3561,freq=3.0), product of:
              0.24190445 = queryWeight, product of:
                2.3062642 = boost
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.019424906 = queryNorm
              0.87681466 = fieldWeight in 3561, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.09375 = fieldNorm(doc=3561)
          0.32726184 = weight(abstract_txt:frequency in 3561) [ClassicSimilarity], result of:
            0.32726184 = score(doc=3561,freq=4.0), product of:
              0.29346755 = queryWeight, product of:
                2.5401955 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.019424906 = queryNorm
              1.1151551 = fieldWeight in 3561, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.09375 = fieldNorm(doc=3561)
          0.2755219 = weight(abstract_txt:words in 3561) [ClassicSimilarity], result of:
            0.2755219 = score(doc=3561,freq=3.0), product of:
              0.316976 = queryWeight, product of:
                3.048384 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.019424906 = queryNorm
              0.86922 = fieldWeight in 3561, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.09375 = fieldNorm(doc=3561)
        0.24 = coord(6/25)
    
  2. Thelwall, M.; Wilkinson, D.: Graph structure in three national academic Webs : power laws with anomalies (2003) 0.23
    0.23064394 = sum of:
      0.23064394 = product of:
        0.8237284 = sum of:
          0.07487299 = weight(abstract_txt:australia in 1681) [ClassicSimilarity], result of:
            0.07487299 = score(doc=1681,freq=1.0), product of:
              0.13644168 = queryWeight, product of:
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.019424906 = queryNorm
              0.5487546 = fieldWeight in 1681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.078125 = fieldNorm(doc=1681)
          0.115389 = weight(abstract_txt:zealand in 1681) [ClassicSimilarity], result of:
            0.115389 = score(doc=1681,freq=1.0), product of:
              0.18204266 = queryWeight, product of:
                1.1550827 = boost
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.019424906 = queryNorm
              0.6338569 = fieldWeight in 1681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.078125 = fieldNorm(doc=1681)
          0.11784494 = weight(abstract_txt:webs in 1681) [ClassicSimilarity], result of:
            0.11784494 = score(doc=1681,freq=1.0), product of:
              0.18461666 = queryWeight, product of:
                1.1632202 = boost
                8.1705265 = idf(docFreq=33, maxDocs=44218)
                0.019424906 = queryNorm
              0.63832235 = fieldWeight in 1681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.1705265 = idf(docFreq=33, maxDocs=44218)
                0.078125 = fieldNorm(doc=1681)
          0.18619062 = weight(abstract_txt:anomalies in 1681) [ClassicSimilarity], result of:
            0.18619062 = score(doc=1681,freq=2.0), product of:
              0.19877364 = queryWeight, product of:
                1.2069961 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.019424906 = queryNorm
              0.93669677 = fieldWeight in 1681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.078125 = fieldNorm(doc=1681)
          0.04740907 = weight(abstract_txt:university in 1681) [ClassicSimilarity], result of:
            0.04740907 = score(doc=1681,freq=2.0), product of:
              0.100609235 = queryWeight, product of:
                1.2143962 = boost
                4.264995 = idf(docFreq=1688, maxDocs=44218)
                0.019424906 = queryNorm
              0.47121984 = fieldWeight in 1681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.264995 = idf(docFreq=1688, maxDocs=44218)
                0.078125 = fieldNorm(doc=1681)
          0.13770235 = weight(abstract_txt:regularities in 1681) [ClassicSimilarity], result of:
            0.13770235 = score(doc=1681,freq=1.0), product of:
              0.20481315 = queryWeight, product of:
                1.2251955 = boost
                8.6058445 = idf(docFreq=21, maxDocs=44218)
                0.019424906 = queryNorm
              0.6723316 = fieldWeight in 1681, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.6058445 = idf(docFreq=21, maxDocs=44218)
                0.078125 = fieldNorm(doc=1681)
          0.14431942 = weight(abstract_txt:sites in 1681) [ClassicSimilarity], result of:
            0.14431942 = score(doc=1681,freq=2.0), product of:
              0.24190445 = queryWeight, product of:
                2.3062642 = boost
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.019424906 = queryNorm
              0.5965968 = fieldWeight in 1681, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.078125 = fieldNorm(doc=1681)
        0.28 = coord(7/25)
    
  3. Spink, A.; Wolfram, D.; Jansen, B.J.; Saracevic, T.: Searching the Web : the public and their queries (2001) 0.19
    0.19296691 = sum of:
      0.19296691 = product of:
        0.6030216 = sum of:
          0.058123205 = weight(abstract_txt:spelling in 6980) [ClassicSimilarity], result of:
            0.058123205 = score(doc=6980,freq=1.0), product of:
              0.16200526 = queryWeight, product of:
                1.08966 = boost
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.019424906 = queryNorm
              0.35877356 = fieldWeight in 6980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.046875 = fieldNorm(doc=6980)
          0.064222306 = weight(abstract_txt:minority in 6980) [ClassicSimilarity], result of:
            0.064222306 = score(doc=6980,freq=1.0), product of:
              0.173149 = queryWeight, product of:
                1.1265137 = boost
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.019424906 = queryNorm
              0.37090772 = fieldWeight in 6980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.046875 = fieldNorm(doc=6980)
          0.018963628 = weight(abstract_txt:language in 6980) [ClassicSimilarity], result of:
            0.018963628 = score(doc=6980,freq=1.0), product of:
              0.09673575 = queryWeight, product of:
                1.1907895 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.019424906 = queryNorm
              0.19603536 = fieldWeight in 6980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.046875 = fieldNorm(doc=6980)
          0.077902734 = weight(abstract_txt:mistakes in 6980) [ClassicSimilarity], result of:
            0.077902734 = score(doc=6980,freq=1.0), product of:
              0.19693877 = queryWeight, product of:
                1.2014123 = boost
                8.43879 = idf(docFreq=25, maxDocs=44218)
                0.019424906 = queryNorm
              0.3955683 = fieldWeight in 6980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.43879 = idf(docFreq=25, maxDocs=44218)
                0.046875 = fieldNorm(doc=6980)
          0.08659165 = weight(abstract_txt:sites in 6980) [ClassicSimilarity], result of:
            0.08659165 = score(doc=6980,freq=2.0), product of:
              0.24190445 = queryWeight, product of:
                2.3062642 = boost
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.019424906 = queryNorm
              0.35795808 = fieldWeight in 6980, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.399778 = idf(docFreq=542, maxDocs=44218)
                0.046875 = fieldNorm(doc=6980)
          0.08181546 = weight(abstract_txt:frequency in 6980) [ClassicSimilarity], result of:
            0.08181546 = score(doc=6980,freq=1.0), product of:
              0.29346755 = queryWeight, product of:
                2.5401955 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.019424906 = queryNorm
              0.27878878 = fieldWeight in 6980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.046875 = fieldNorm(doc=6980)
          0.11248133 = weight(abstract_txt:words in 6980) [ClassicSimilarity], result of:
            0.11248133 = score(doc=6980,freq=2.0), product of:
              0.316976 = queryWeight, product of:
                3.048384 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.019424906 = queryNorm
              0.35485756 = fieldWeight in 6980, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.046875 = fieldNorm(doc=6980)
          0.1029213 = weight(abstract_txt:names in 6980) [ClassicSimilarity], result of:
            0.1029213 = score(doc=6980,freq=1.0), product of:
              0.37640285 = queryWeight, product of:
                3.3218722 = boost
                5.8332562 = idf(docFreq=351, maxDocs=44218)
                0.019424906 = queryNorm
              0.2734339 = fieldWeight in 6980, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8332562 = idf(docFreq=351, maxDocs=44218)
                0.046875 = fieldNorm(doc=6980)
        0.32 = coord(8/25)
    
  4. Wacholder, N.; Byrd, R.J.: Retrieving information from full text using linguistic knowledge (1994) 0.12
    0.11960794 = sum of:
      0.11960794 = product of:
        0.5980397 = sum of:
          0.09687201 = weight(abstract_txt:spelling in 8524) [ClassicSimilarity], result of:
            0.09687201 = score(doc=8524,freq=1.0), product of:
              0.16200526 = queryWeight, product of:
                1.08966 = boost
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.019424906 = queryNorm
              0.59795594 = fieldWeight in 8524, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.078125 = fieldNorm(doc=8524)
          0.05474328 = weight(abstract_txt:language in 8524) [ClassicSimilarity], result of:
            0.05474328 = score(doc=8524,freq=3.0), product of:
              0.09673575 = queryWeight, product of:
                1.1907895 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.019424906 = queryNorm
              0.56590533 = fieldWeight in 8524, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.078125 = fieldNorm(doc=8524)
          0.14232838 = weight(abstract_txt:acronyms in 8524) [ClassicSimilarity], result of:
            0.14232838 = score(doc=8524,freq=1.0), product of:
              0.20937487 = queryWeight, product of:
                1.2387645 = boost
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.019424906 = queryNorm
              0.67977774 = fieldWeight in 8524, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.078125 = fieldNorm(doc=8524)
          0.13256052 = weight(abstract_txt:words in 8524) [ClassicSimilarity], result of:
            0.13256052 = score(doc=8524,freq=1.0), product of:
              0.316976 = queryWeight, product of:
                3.048384 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.019424906 = queryNorm
              0.41820365 = fieldWeight in 8524, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.078125 = fieldNorm(doc=8524)
          0.17153549 = weight(abstract_txt:names in 8524) [ClassicSimilarity], result of:
            0.17153549 = score(doc=8524,freq=1.0), product of:
              0.37640285 = queryWeight, product of:
                3.3218722 = boost
                5.8332562 = idf(docFreq=351, maxDocs=44218)
                0.019424906 = queryNorm
              0.45572314 = fieldWeight in 8524, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8332562 = idf(docFreq=351, maxDocs=44218)
                0.078125 = fieldNorm(doc=8524)
        0.2 = coord(5/25)
    
  5. Riggs, F.W.: Information and social science : the need for onomantics (1989) 0.11
    0.112102024 = sum of:
      0.112102024 = product of:
        0.5605101 = sum of:
          0.0446977 = weight(abstract_txt:language in 2842) [ClassicSimilarity], result of:
            0.0446977 = score(doc=2842,freq=2.0), product of:
              0.09673575 = queryWeight, product of:
                1.1907895 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.019424906 = queryNorm
              0.46205974 = fieldWeight in 2842, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.078125 = fieldNorm(doc=2842)
          0.14232838 = weight(abstract_txt:acronyms in 2842) [ClassicSimilarity], result of:
            0.14232838 = score(doc=2842,freq=1.0), product of:
              0.20937487 = queryWeight, product of:
                1.2387645 = boost
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.019424906 = queryNorm
              0.67977774 = fieldWeight in 2842, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.078125 = fieldNorm(doc=2842)
          0.069388 = weight(abstract_txt:word in 2842) [ClassicSimilarity], result of:
            0.069388 = score(doc=2842,freq=1.0), product of:
              0.16340418 = queryWeight, product of:
                1.547651 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.019424906 = queryNorm
              0.4246403 = fieldWeight in 2842, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.078125 = fieldNorm(doc=2842)
          0.13256052 = weight(abstract_txt:words in 2842) [ClassicSimilarity], result of:
            0.13256052 = score(doc=2842,freq=1.0), product of:
              0.316976 = queryWeight, product of:
                3.048384 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.019424906 = queryNorm
              0.41820365 = fieldWeight in 2842, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.078125 = fieldNorm(doc=2842)
          0.17153549 = weight(abstract_txt:names in 2842) [ClassicSimilarity], result of:
            0.17153549 = score(doc=2842,freq=1.0), product of:
              0.37640285 = queryWeight, product of:
                3.3218722 = boost
                5.8332562 = idf(docFreq=351, maxDocs=44218)
                0.019424906 = queryNorm
              0.45572314 = fieldWeight in 2842, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8332562 = idf(docFreq=351, maxDocs=44218)
                0.078125 = fieldNorm(doc=2842)
        0.2 = coord(5/25)
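
The content-match scores combine a per-term product with a coordination factor. As a minimal sketch, the code below recomputes the score of the last entry (Riggs, doc 2842) from the values in its score tree. It assumes the default ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function and variable names are hypothetical, and the boosts, docFreq values, queryNorm and fieldNorm are copied from the tree above.

```python
import math

MAX_DOCS = 44218          # maxDocs from the score tree
QUERY_NORM = 0.019424906  # queryNorm from the score tree
FIELD_NORM = 0.078125     # fieldNorm(doc=2842)
QUERY_TERMS = 25          # denominator of coord(5/25)


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, boost: float, doc_freq: int, field_norm: float) -> float:
    """score = queryWeight * fieldWeight for a single query term."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# (term, freq in abstract, query boost, docFreq), copied from entry 5 above.
RIGGS_TERMS = [
    ("language", 2.0, 1.1907895, 1834),
    ("acronyms", 1.0, 1.2387645, 19),
    ("word", 1.0, 1.547651, 523),
    ("words", 1.0, 3.048384, 568),
    ("names", 1.0, 3.3218722, 351),
]

term_scores = {
    term: term_score(freq, boost, doc_freq, FIELD_NORM)
    for term, freq, boost, doc_freq in RIGGS_TERMS
}

# coord rewards documents that match more of the query terms.
coord = len(RIGGS_TERMS) / QUERY_TERMS
score = sum(term_scores.values()) * coord

print(f"{score:.4f}")
```

The result should land close to the listed 0.112102024. Because idf enters both queryWeight and fieldWeight, rare terms such as "acronyms" contribute far more than common ones such as "language", even when the common term occurs more often.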