Document (#40415)

Author
Lu, K.
Cai, X.
Ajiferuke, I.
Wolfram, D.
Title
Vocabulary size and its effect on topic representation
Source
Information processing and management. 53(2017) no.3, pp. 653-665
Year
2017
Abstract
This study investigates how computational overhead for topic model training may be reduced by selectively removing terms from the vocabulary of text corpora being modeled. We compare the impact of removing singly occurring terms, the top 0.5%, 1% and 5% most frequently occurring terms and both top 0.5% most frequent and singly occurring terms, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100) using three datasets. Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for all of the measures tested. Document discriminative capacity, as measured by the document space density, is reduced by the removal of frequently occurring terms, but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
Content
Cf.: http://www.sciencedirect.com/science/article/pii/S0306457317300298.
Theme
Computational linguistics

Similar documents (author)

  1. Wolfram, D.: Inter-record linkage structure in a hypertext bibliographic retrieval system (1996) 5.09
    5.0884624 = sum of:
      5.0884624 = weight(author_txt:wolfram in 6761) [ClassicSimilarity], result of:
        5.0884624 = fieldWeight in 6761, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.14154 = idf(docFreq=34, maxDocs=44218)
          0.625 = fieldNorm(doc=6761)
    
  2. Wolfram, D.: Applied informetrics for information retrieval research (2003) 5.09
    5.0884624 = sum of:
      5.0884624 = weight(author_txt:wolfram in 4589) [ClassicSimilarity], result of:
        5.0884624 = fieldWeight in 4589, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.14154 = idf(docFreq=34, maxDocs=44218)
          0.625 = fieldNorm(doc=4589)
    
  3. Wolfram, D.: Search characteristics in different types of Web-based IR environments : are they the same? (2008) 5.09
    5.0884624 = sum of:
      5.0884624 = weight(author_txt:wolfram in 2093) [ClassicSimilarity], result of:
        5.0884624 = fieldWeight in 2093, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.14154 = idf(docFreq=34, maxDocs=44218)
          0.625 = fieldNorm(doc=2093)
    
  4. Wolfram, D.: The symbiotic relationship between information retrieval and informetrics (2015) 5.09
    5.0884624 = sum of:
      5.0884624 = weight(author_txt:wolfram in 1689) [ClassicSimilarity], result of:
        5.0884624 = fieldWeight in 1689, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.14154 = idf(docFreq=34, maxDocs=44218)
          0.625 = fieldNorm(doc=1689)
    
  5. Wolfram, S.: A new kind of science (2002) 5.09
    5.0884624 = sum of:
      5.0884624 = weight(author_txt:wolfram in 1866) [ClassicSimilarity], result of:
        5.0884624 = fieldWeight in 1866, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.14154 = idf(docFreq=34, maxDocs=44218)
          0.625 = fieldNorm(doc=1866)
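
The author-match explanations above all follow the same pattern: fieldWeight is the product of tf, idf and fieldNorm. The sketch below reproduces that arithmetic, assuming the standard Lucene ClassicSimilarity formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are hypothetical and chosen only for illustration, and fieldNorm is taken directly from the listing rather than derived from the field length.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency (assumed formula)."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)) (assumed formula)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as laid out in the explanation tree."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the listing: author_txt:wolfram, freq=1.0,
# docFreq=34, maxDocs=44218, fieldNorm=0.625.
author_score = field_weight(freq=1.0, doc_freq=34, max_docs=44218, field_norm=0.625)
print(f"idf = {classic_idf(34, 44218):.5f}, fieldWeight = {author_score:.5f}")
```

Because every hit matches the same single term with the same frequency and field norm, all five entries receive the same score, which is why "Wolfram, S." ranks alongside "Wolfram, D.".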
    

Similar documents (content)

  1. Shibata, N.; Kajikawa, Y.; Sakata, I.: Measuring relatedness between communities in a citation network (2011) 0.17
    0.16731553 = sum of:
      0.16731553 = product of:
        0.59755546 = sum of:
          0.028966129 = weight(abstract_txt:number in 4484) [ClassicSimilarity], result of:
            0.028966129 = score(doc=4484,freq=2.0), product of:
              0.063439086 = queryWeight, product of:
                1.0041485 = boost
                4.132649 = idf(docFreq=1927, maxDocs=44218)
                0.015287288 = queryNorm
              0.4565975 = fieldWeight in 4484, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.132649 = idf(docFreq=1927, maxDocs=44218)
                0.078125 = fieldNorm(doc=4484)
          0.06590197 = weight(abstract_txt:measures in 4484) [ClassicSimilarity], result of:
            0.06590197 = score(doc=4484,freq=2.0), product of:
              0.1097393 = queryWeight, product of:
                1.320689 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.015287288 = queryNorm
              0.60053205 = fieldWeight in 4484, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.078125 = fieldNorm(doc=4484)
          0.05718326 = weight(abstract_txt:similarity in 4484) [ClassicSimilarity], result of:
            0.05718326 = score(doc=4484,freq=1.0), product of:
              0.1257822 = queryWeight, product of:
                1.4139338 = boost
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.015287288 = queryNorm
              0.4546212 = fieldWeight in 4484, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.078125 = fieldNorm(doc=4484)
          0.07429607 = weight(abstract_txt:measured in 4484) [ClassicSimilarity], result of:
            0.07429607 = score(doc=4484,freq=1.0), product of:
              0.14976735 = queryWeight, product of:
                1.5428654 = boost
                6.3497796 = idf(docFreq=209, maxDocs=44218)
                0.015287288 = queryNorm
              0.49607652 = fieldWeight in 4484, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3497796 = idf(docFreq=209, maxDocs=44218)
                0.078125 = fieldNorm(doc=4484)
          0.18210554 = weight(abstract_txt:removing in 4484) [ClassicSimilarity], result of:
            0.18210554 = score(doc=4484,freq=1.0), product of:
              0.27226305 = queryWeight, product of:
                2.0802417 = boost
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.015287288 = queryNorm
              0.6688588 = fieldWeight in 4484, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.078125 = fieldNorm(doc=4484)
          0.094115615 = weight(abstract_txt:topic in 4484) [ClassicSimilarity], result of:
            0.094115615 = score(doc=4484,freq=1.0), product of:
              0.23797302 = queryWeight, product of:
                3.0750582 = boost
                5.062254 = idf(docFreq=760, maxDocs=44218)
                0.015287288 = queryNorm
              0.3954886 = fieldWeight in 4484, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.062254 = idf(docFreq=760, maxDocs=44218)
                0.078125 = fieldNorm(doc=4484)
          0.09498683 = weight(abstract_txt:terms in 4484) [ClassicSimilarity], result of:
            0.09498683 = score(doc=4484,freq=2.0), product of:
              0.21259916 = queryWeight, product of:
                3.4390166 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.015287288 = queryNorm
              0.44678837 = fieldWeight in 4484, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=4484)
        0.28 = coord(7/25)
    
  2. Zhang, J.; Wolfram, D.; Wang, P.; Hong, Y.; Gillis, R.: Visualization of health-subject analysis based on query term co-occurrences (2008) 0.16
    0.1629513 = sum of:
      0.1629513 = product of:
        0.58196896 = sum of:
          0.022418186 = weight(abstract_txt:impact in 2376) [ClassicSimilarity], result of:
            0.022418186 = score(doc=2376,freq=1.0), product of:
              0.07818323 = queryWeight, product of:
                1.1147469 = boost
                4.5878253 = idf(docFreq=1222, maxDocs=44218)
                0.015287288 = queryNorm
              0.28673908 = fieldWeight in 2376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5878253 = idf(docFreq=1222, maxDocs=44218)
                0.0625 = fieldNorm(doc=2376)
          0.045746606 = weight(abstract_txt:similarity in 2376) [ClassicSimilarity], result of:
            0.045746606 = score(doc=2376,freq=1.0), product of:
              0.1257822 = queryWeight, product of:
                1.4139338 = boost
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.015287288 = queryNorm
              0.36369696 = fieldWeight in 2376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8191514 = idf(docFreq=356, maxDocs=44218)
                0.0625 = fieldNorm(doc=2376)
          0.050309546 = weight(abstract_txt:frequently in 2376) [ClassicSimilarity], result of:
            0.050309546 = score(doc=2376,freq=1.0), product of:
              0.13401298 = queryWeight, product of:
                1.4594624 = boost
                6.006528 = idf(docFreq=295, maxDocs=44218)
                0.015287288 = queryNorm
              0.375408 = fieldWeight in 2376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.006528 = idf(docFreq=295, maxDocs=44218)
                0.0625 = fieldNorm(doc=2376)
          0.053573556 = weight(abstract_txt:vocabulary in 2376) [ClassicSimilarity], result of:
            0.053573556 = score(doc=2376,freq=1.0), product of:
              0.15997201 = queryWeight, product of:
                1.9529319 = boost
                5.358293 = idf(docFreq=565, maxDocs=44218)
                0.015287288 = queryNorm
              0.33489332 = fieldWeight in 2376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.358293 = idf(docFreq=565, maxDocs=44218)
                0.0625 = fieldNorm(doc=2376)
          0.07529249 = weight(abstract_txt:topic in 2376) [ClassicSimilarity], result of:
            0.07529249 = score(doc=2376,freq=1.0), product of:
              0.23797302 = queryWeight, product of:
                3.0750582 = boost
                5.062254 = idf(docFreq=760, maxDocs=44218)
                0.015287288 = queryNorm
              0.31639087 = fieldWeight in 2376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.062254 = idf(docFreq=760, maxDocs=44218)
                0.0625 = fieldNorm(doc=2376)
          0.10746533 = weight(abstract_txt:terms in 2376) [ClassicSimilarity], result of:
            0.10746533 = score(doc=2376,freq=4.0), product of:
              0.21259916 = queryWeight, product of:
                3.4390166 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.015287288 = queryNorm
              0.5054833 = fieldWeight in 2376, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=2376)
          0.22716326 = weight(abstract_txt:occurring in 2376) [ClassicSimilarity], result of:
            0.22716326 = score(doc=2376,freq=1.0), product of:
              0.49688056 = queryWeight, product of:
                4.4434004 = boost
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.015287288 = queryNorm
              0.4571788 = fieldWeight in 2376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.0625 = fieldNorm(doc=2376)
        0.28 = coord(7/25)
    
  3. Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval (2004) 0.16
    0.15764315 = sum of:
      0.15764315 = product of:
        0.78821576 = sum of:
          0.0275445 = weight(abstract_txt:document in 4420) [ClassicSimilarity], result of:
            0.0275445 = score(doc=4420,freq=1.0), product of:
              0.0684451 = queryWeight, product of:
                1.0430152 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.015287288 = queryNorm
              0.40243202 = fieldWeight in 4420, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.09375 = fieldNorm(doc=4420)
          0.07546432 = weight(abstract_txt:frequently in 4420) [ClassicSimilarity], result of:
            0.07546432 = score(doc=4420,freq=1.0), product of:
              0.13401298 = queryWeight, product of:
                1.4594624 = boost
                6.006528 = idf(docFreq=295, maxDocs=44218)
                0.015287288 = queryNorm
              0.563112 = fieldWeight in 4420, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.006528 = idf(docFreq=295, maxDocs=44218)
                0.09375 = fieldNorm(doc=4420)
          0.16423725 = weight(abstract_txt:frequent in 4420) [ClassicSimilarity], result of:
            0.16423725 = score(doc=4420,freq=2.0), product of:
              0.17863102 = queryWeight, product of:
                1.6849923 = boost
                6.9347134 = idf(docFreq=116, maxDocs=44218)
                0.015287288 = queryNorm
              0.91942173 = fieldWeight in 4420, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.9347134 = idf(docFreq=116, maxDocs=44218)
                0.09375 = fieldNorm(doc=4420)
          0.18022484 = weight(abstract_txt:terms in 4420) [ClassicSimilarity], result of:
            0.18022484 = score(doc=4420,freq=5.0), product of:
              0.21259916 = queryWeight, product of:
                3.4390166 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.015287288 = queryNorm
              0.84772134 = fieldWeight in 4420, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=4420)
          0.34074488 = weight(abstract_txt:occurring in 4420) [ClassicSimilarity], result of:
            0.34074488 = score(doc=4420,freq=1.0), product of:
              0.49688056 = queryWeight, product of:
                4.4434004 = boost
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.015287288 = queryNorm
              0.6857682 = fieldWeight in 4420, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.09375 = fieldNorm(doc=4420)
        0.2 = coord(5/25)
    
  4. Wolfram, D.; Zhang, J.: The influence of indexing practices and weighting algorithms on document spaces (2008) 0.13
    0.13252981 = sum of:
      0.13252981 = product of:
        0.66264904 = sum of:
          0.044979982 = weight(abstract_txt:document in 1963) [ClassicSimilarity], result of:
            0.044979982 = score(doc=1963,freq=6.0), product of:
              0.0684451 = queryWeight, product of:
                1.0430152 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.015287288 = queryNorm
              0.65716875 = fieldWeight in 1963, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.08083347 = weight(abstract_txt:discriminative in 1963) [ClassicSimilarity], result of:
            0.08083347 = score(doc=1963,freq=1.0), product of:
              0.14591415 = queryWeight, product of:
                1.0768449 = boost
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.015287288 = queryNorm
              0.55397964 = fieldWeight in 1963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.050309546 = weight(abstract_txt:frequently in 1963) [ClassicSimilarity], result of:
            0.050309546 = score(doc=1963,freq=1.0), product of:
              0.13401298 = queryWeight, product of:
                1.4594624 = boost
                6.006528 = idf(docFreq=295, maxDocs=44218)
                0.015287288 = queryNorm
              0.375408 = fieldWeight in 1963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.006528 = idf(docFreq=295, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.093067706 = weight(abstract_txt:terms in 1963) [ClassicSimilarity], result of:
            0.093067706 = score(doc=1963,freq=3.0), product of:
              0.21259916 = queryWeight, product of:
                3.4390166 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.015287288 = queryNorm
              0.4377614 = fieldWeight in 1963, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
          0.3934583 = weight(abstract_txt:occurring in 1963) [ClassicSimilarity], result of:
            0.3934583 = score(doc=1963,freq=3.0), product of:
              0.49688056 = queryWeight, product of:
                4.4434004 = boost
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.015287288 = queryNorm
              0.7918569 = fieldWeight in 1963, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.0625 = fieldNorm(doc=1963)
        0.2 = coord(5/25)
    
  5. Alipour, O.; Soheili, F.; Khasseh, A.A.: A co-word analysis of global research on knowledge organization: 1900-2019 (2022) 0.13
    0.12853287 = sum of:
      0.12853287 = product of:
        0.45904595 = sum of:
          0.012289288 = weight(abstract_txt:number in 1106) [ClassicSimilarity], result of:
            0.012289288 = score(doc=1106,freq=1.0), product of:
              0.063439086 = queryWeight, product of:
                1.0041485 = boost
                4.132649 = idf(docFreq=1927, maxDocs=44218)
                0.015287288 = queryNorm
              0.19371793 = fieldWeight in 1106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.132649 = idf(docFreq=1927, maxDocs=44218)
                0.046875 = fieldNorm(doc=1106)
          0.03773216 = weight(abstract_txt:frequently in 1106) [ClassicSimilarity], result of:
            0.03773216 = score(doc=1106,freq=1.0), product of:
              0.13401298 = queryWeight, product of:
                1.4594624 = boost
                6.006528 = idf(docFreq=295, maxDocs=44218)
                0.015287288 = queryNorm
              0.281556 = fieldWeight in 1106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.006528 = idf(docFreq=295, maxDocs=44218)
                0.046875 = fieldNorm(doc=1106)
          0.058066636 = weight(abstract_txt:frequent in 1106) [ClassicSimilarity], result of:
            0.058066636 = score(doc=1106,freq=1.0), product of:
              0.17863102 = queryWeight, product of:
                1.6849923 = boost
                6.9347134 = idf(docFreq=116, maxDocs=44218)
                0.015287288 = queryNorm
              0.3250647 = fieldWeight in 1106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9347134 = idf(docFreq=116, maxDocs=44218)
                0.046875 = fieldNorm(doc=1106)
          0.058721583 = weight(abstract_txt:reduced in 1106) [ClassicSimilarity], result of:
            0.058721583 = score(doc=1106,freq=1.0), product of:
              0.17997172 = queryWeight, product of:
                1.6913037 = boost
                6.9606886 = idf(docFreq=113, maxDocs=44218)
                0.015287288 = queryNorm
              0.32628226 = fieldWeight in 1106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9606886 = idf(docFreq=113, maxDocs=44218)
                0.046875 = fieldNorm(doc=1106)
          0.05952112 = weight(abstract_txt:topics in 1106) [ClassicSimilarity], result of:
            0.05952112 = score(doc=1106,freq=3.0), product of:
              0.14413734 = queryWeight, product of:
                1.8537593 = boost
                5.086191 = idf(docFreq=742, maxDocs=44218)
                0.015287288 = queryNorm
              0.41294727 = fieldWeight in 1106, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.086191 = idf(docFreq=742, maxDocs=44218)
                0.046875 = fieldNorm(doc=1106)
          0.15285543 = weight(abstract_txt:removal in 1106) [ClassicSimilarity], result of:
            0.15285543 = score(doc=1106,freq=1.0), product of:
              0.3898433 = queryWeight, product of:
                3.0486681 = boost
                8.364683 = idf(docFreq=27, maxDocs=44218)
                0.015287288 = queryNorm
              0.39209452 = fieldWeight in 1106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.364683 = idf(docFreq=27, maxDocs=44218)
                0.046875 = fieldNorm(doc=1106)
          0.07985975 = weight(abstract_txt:topic in 1106) [ClassicSimilarity], result of:
            0.07985975 = score(doc=1106,freq=2.0), product of:
              0.23797302 = queryWeight, product of:
                3.0750582 = boost
                5.062254 = idf(docFreq=760, maxDocs=44218)
                0.015287288 = queryNorm
              0.3355832 = fieldWeight in 1106, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.062254 = idf(docFreq=760, maxDocs=44218)
                0.046875 = fieldNorm(doc=1106)
        0.28 = coord(7/25)
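
The content-similarity explanations combine more factors: each matching term contributes queryWeight (boost * idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), and the sum over matching terms is scaled by coord (matched query terms / total query terms). The sketch below recomputes one entry under those assumptions, using the figures listed for doc 4420. The names `TermMatch` and `explain_score` are hypothetical, and the idf formula is the assumed ClassicSimilarity one, not something stated in the record itself.

```python
import math
from dataclasses import dataclass


@dataclass
class TermMatch:
    """One matching query term, with values read off the explanation tree."""
    boost: float
    doc_freq: int
    freq: float


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed ClassicSimilarity formula: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def explain_score(terms, max_docs, query_norm, field_norm, total_query_terms):
    """Sum of queryWeight * fieldWeight over matching terms, scaled by coord."""
    total = 0.0
    for t in terms:
        idf = classic_idf(t.doc_freq, max_docs)
        query_weight = t.boost * idf * query_norm
        fld_weight = math.sqrt(t.freq) * idf * field_norm
        total += query_weight * fld_weight
    coord = len(terms) / total_query_terms
    return total * coord


# Figures copied from the listing for doc 4420 (5 of 25 query terms match).
doc_4420 = [
    TermMatch(boost=1.0430152, doc_freq=1642, freq=1.0),  # document
    TermMatch(boost=1.4594624, doc_freq=295, freq=1.0),   # frequently
    TermMatch(boost=1.6849923, doc_freq=116, freq=2.0),   # frequent
    TermMatch(boost=3.4390166, doc_freq=2106, freq=5.0),  # terms
    TermMatch(boost=4.4434004, doc_freq=79, freq=1.0),    # occurring
]
score_4420 = explain_score(
    doc_4420,
    max_docs=44218,
    query_norm=0.015287288,
    field_norm=0.09375,
    total_query_terms=25,
)
print(f"score for doc 4420 = {score_4420:.6f}")
```

The coord factor explains why entries that match seven terms (coord 0.28) can outrank an entry with a larger raw sum that matches only five (coord 0.2).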