Document (#40416)

Author
Lu, K.
Cai, X.
Ajiferuke, I.
Wolfram, D.
Title
Vocabulary size and its effect on topic representation
Source
Information processing and management. 53(2017) no.3, pp. 653-665
Year
2017
Abstract
This study investigates how computational overhead for topic model training may be reduced by selectively removing terms from the vocabulary of text corpora being modeled. We compare the impact of removing singly occurring terms, the top 0.5%, 1% and 5% most frequently occurring terms and both top 0.5% most frequent and singly occurring terms, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100) using three datasets. Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for all of the measures tested. Document discriminative capacity, as measured by the document space density, is reduced by the removal of frequently occurring terms, but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
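The vocabulary reductions compared in the abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `prune_vocabulary`, its parameters, and the choice to take the "top" fraction over distinct vocabulary terms ranked by corpus frequency are all assumptions made for this example.

```python
from collections import Counter
import math


def prune_vocabulary(docs, top_fraction=0.0, drop_singletons=False):
    """Remove selected terms from tokenized documents (hypothetical helper).

    docs            -- list of token lists
    top_fraction    -- share of distinct terms to drop, most frequent first
                       (e.g. 0.005 for the top 0.5%); an assumed interpretation
    drop_singletons -- drop terms that occur exactly once in the whole corpus
    """
    counts = Counter(term for doc in docs for term in doc)
    removed = set()

    if drop_singletons:
        removed |= {term for term, count in counts.items() if count == 1}

    if top_fraction > 0:
        # Round up so that a non-zero fraction removes at least one term.
        n_top = math.ceil(top_fraction * len(counts))
        removed |= {term for term, _ in counts.most_common(n_top)}

    pruned = [[term for term in doc if term not in removed] for doc in docs]
    return pruned, removed


sample = [["topic", "model", "topic"], ["topic", "term", "model"], ["rare"]]
pruned_docs, removed_terms = prune_vocabulary(
    sample, top_fraction=0.25, drop_singletons=True
)
```

The pruned token lists would then be passed to whatever topic modeling tool is in use, once for each number of topics under comparison.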
Content
Cf.: http://www.sciencedirect.com/science/article/pii/S0306457317300298.
Theme
Computational linguistics

Similar documents (author)

  1. Wolfram, D.: Inter-record linkage structure in a hypertext bibliographic retrieval system (1996) 5.08
    5.0789523 = sum of:
      5.0789523 = weight(author_txt:wolfram in 6830) [ClassicSimilarity], result of:
        5.0789523 = fieldWeight in 6830, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.126324 = idf(docFreq=33, maxDocs=42306)
          0.625 = fieldNorm(doc=6830)
    
  2. Wolfram, D.: Applied informetrics for information retrieval research (2003) 5.08
    5.0789523 = sum of:
      5.0789523 = weight(author_txt:wolfram in 590) [ClassicSimilarity], result of:
        5.0789523 = fieldWeight in 590, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.126324 = idf(docFreq=33, maxDocs=42306)
          0.625 = fieldNorm(doc=590)
    
  3. Wolfram, D.: Search characteristics in different types of Web-based IR environments : are they the same? (2008) 5.08
    5.0789523 = sum of:
      5.0789523 = weight(author_txt:wolfram in 4094) [ClassicSimilarity], result of:
        5.0789523 = fieldWeight in 4094, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.126324 = idf(docFreq=33, maxDocs=42306)
          0.625 = fieldNorm(doc=4094)
    
  4. Wolfram, D.: The symbiotic relationship between information retrieval and informetrics (2015) 5.08
    5.0789523 = sum of:
      5.0789523 = weight(author_txt:wolfram in 3690) [ClassicSimilarity], result of:
        5.0789523 = fieldWeight in 3690, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.126324 = idf(docFreq=33, maxDocs=42306)
          0.625 = fieldNorm(doc=3690)
    
  5. Wolfram, S.: A new kind of science (2002) 5.08
    5.0789523 = sum of:
      5.0789523 = weight(author_txt:wolfram in 3867) [ClassicSimilarity], result of:
        5.0789523 = fieldWeight in 3867, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.126324 = idf(docFreq=33, maxDocs=42306)
          0.625 = fieldNorm(doc=3867)
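
The author scores above all follow the same arithmetic: tf times idf times fieldNorm. The sketch below reproduces it, assuming the classic Lucene formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which are consistent with the figures in the listing. The function names are illustrative, and the fieldNorm value is simply copied from the listing rather than derived.

```python
import math


def classic_tf(freq):
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency, assuming 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm, as in the explain trees above."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values taken from the listing: author_txt:wolfram, docFreq=33, maxDocs=42306.
author_score = field_weight(freq=1.0, doc_freq=33, max_docs=42306, field_norm=0.625)
print(f"{author_score:.4f}")
```

Because the query has a single author term and every listed record matches it once with the same fieldNorm, all five entries receive the same score of about 5.08. Small differences in the last digits are expected, since Lucene computes these values in single precision.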
    

Similar documents (content)

  1. Shibata, N.; Kajikawa, Y.; Sakata, I.: Measuring relatedness between communities in a citation network (2011) 0.17
    0.16909824 = sum of:
      0.16909824 = product of:
        0.6039223 = sum of:
          0.028286114 = weight(abstract_txt:number in 1485) [ClassicSimilarity], result of:
            0.028286114 = score(doc=1485,freq=2.0), product of:
              0.062058207 = queryWeight, product of:
                1.003913 = boost
                4.125428 = idf(docFreq=1857, maxDocs=42306)
                0.014984218 = queryNorm
              0.45579973 = fieldWeight in 1485, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.125428 = idf(docFreq=1857, maxDocs=42306)
                0.078125 = fieldNorm(doc=1485)
          0.06493247 = weight(abstract_txt:measures in 1485) [ClassicSimilarity], result of:
            0.06493247 = score(doc=1485,freq=2.0), product of:
              0.107992016 = queryWeight, product of:
                1.324318 = boost
                5.4420843 = idf(docFreq=497, maxDocs=42306)
                0.014984218 = queryNorm
              0.60127103 = fieldWeight in 1485, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4420843 = idf(docFreq=497, maxDocs=42306)
                0.078125 = fieldNorm(doc=1485)
          0.05601315 = weight(abstract_txt:similarity in 1485) [ClassicSimilarity], result of:
            0.05601315 = score(doc=1485,freq=1.0), product of:
              0.12329733 = queryWeight, product of:
                1.415055 = boost
                5.814954 = idf(docFreq=342, maxDocs=42306)
                0.014984218 = queryNorm
              0.45429325 = fieldWeight in 1485, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.814954 = idf(docFreq=342, maxDocs=42306)
                0.078125 = fieldNorm(doc=1485)
          0.07396883 = weight(abstract_txt:measured in 1485) [ClassicSimilarity], result of:
            0.07396883 = score(doc=1485,freq=1.0), product of:
              0.14840876 = queryWeight, product of:
                1.5524808 = boost
                6.3796844 = idf(docFreq=194, maxDocs=42306)
                0.014984218 = queryNorm
              0.49841285 = fieldWeight in 1485, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3796844 = idf(docFreq=194, maxDocs=42306)
                0.078125 = fieldNorm(doc=1485)
          0.1916494 = weight(abstract_txt:removing in 1485) [ClassicSimilarity], result of:
            0.1916494 = score(doc=1485,freq=1.0), product of:
              0.27996174 = queryWeight, product of:
                2.1322877 = boost
                8.762313 = idf(docFreq=17, maxDocs=42306)
                0.014984218 = queryNorm
              0.6845557 = fieldWeight in 1485, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.762313 = idf(docFreq=17, maxDocs=42306)
                0.078125 = fieldNorm(doc=1485)
          0.09472004 = weight(abstract_txt:topic in 1485) [ClassicSimilarity], result of:
            0.09472004 = score(doc=1485,freq=1.0), product of:
              0.23752078 = queryWeight, product of:
                3.1053982 = boost
                5.104465 = idf(docFreq=697, maxDocs=42306)
                0.014984218 = queryNorm
              0.39878634 = fieldWeight in 1485, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.104465 = idf(docFreq=697, maxDocs=42306)
                0.078125 = fieldNorm(doc=1485)
          0.09435232 = weight(abstract_txt:terms in 1485) [ClassicSimilarity], result of:
            0.09435232 = score(doc=1485,freq=2.0), product of:
              0.21034947 = queryWeight, product of:
                3.4578106 = boost
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.014984218 = queryNorm
              0.4485503 = fieldWeight in 1485, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.078125 = fieldNorm(doc=1485)
        0.28 = coord(7/25)
    
  2. Zhang, J.; Wolfram, D.; Wang, P.; Hong, Y.; Gillis, R.: Visualization of health-subject analysis based on query term co-occurrences (2008) 0.16
    0.16314073 = sum of:
      0.16314073 = product of:
        0.5826455 = sum of:
          0.022616863 = weight(abstract_txt:impact in 196) [ClassicSimilarity], result of:
            0.022616863 = score(doc=196,freq=1.0), product of:
              0.07816073 = queryWeight, product of:
                1.1266546 = boost
                4.629816 = idf(docFreq=1121, maxDocs=42306)
                0.014984218 = queryNorm
              0.2893635 = fieldWeight in 196, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.629816 = idf(docFreq=1121, maxDocs=42306)
                0.0625 = fieldNorm(doc=196)
          0.04481052 = weight(abstract_txt:similarity in 196) [ClassicSimilarity], result of:
            0.04481052 = score(doc=196,freq=1.0), product of:
              0.12329733 = queryWeight, product of:
                1.415055 = boost
                5.814954 = idf(docFreq=342, maxDocs=42306)
                0.014984218 = queryNorm
              0.3634346 = fieldWeight in 196, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.814954 = idf(docFreq=342, maxDocs=42306)
                0.0625 = fieldNorm(doc=196)
          0.050297305 = weight(abstract_txt:frequently in 196) [ClassicSimilarity], result of:
            0.050297305 = score(doc=196,freq=1.0), product of:
              0.13316707 = queryWeight, product of:
                1.4706012 = boost
                6.0432124 = idf(docFreq=272, maxDocs=42306)
                0.014984218 = queryNorm
              0.37770078 = fieldWeight in 196, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0432124 = idf(docFreq=272, maxDocs=42306)
                0.0625 = fieldNorm(doc=196)
          0.052456945 = weight(abstract_txt:vocabulary in 196) [ClassicSimilarity], result of:
            0.052456945 = score(doc=196,freq=1.0), product of:
              0.15677114 = queryWeight, product of:
                1.9542277 = boost
                5.353735 = idf(docFreq=543, maxDocs=42306)
                0.014984218 = queryNorm
              0.33460844 = fieldWeight in 196, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353735 = idf(docFreq=543, maxDocs=42306)
                0.0625 = fieldNorm(doc=196)
          0.07577603 = weight(abstract_txt:topic in 196) [ClassicSimilarity], result of:
            0.07577603 = score(doc=196,freq=1.0), product of:
              0.23752078 = queryWeight, product of:
                3.1053982 = boost
                5.104465 = idf(docFreq=697, maxDocs=42306)
                0.014984218 = queryNorm
              0.31902906 = fieldWeight in 196, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.104465 = idf(docFreq=697, maxDocs=42306)
                0.0625 = fieldNorm(doc=196)
          0.10674746 = weight(abstract_txt:terms in 196) [ClassicSimilarity], result of:
            0.10674746 = score(doc=196,freq=4.0), product of:
              0.21034947 = queryWeight, product of:
                3.4578106 = boost
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.014984218 = queryNorm
              0.50747675 = fieldWeight in 196, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.0625 = fieldNorm(doc=196)
          0.22994034 = weight(abstract_txt:occurring in 196) [ClassicSimilarity], result of:
            0.22994034 = score(doc=196,freq=1.0), product of:
              0.4978408 = queryWeight, product of:
                4.495849 = boost
                7.390004 = idf(docFreq=70, maxDocs=42306)
                0.014984218 = queryNorm
              0.46187526 = fieldWeight in 196, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.390004 = idf(docFreq=70, maxDocs=42306)
                0.0625 = fieldNorm(doc=196)
        0.28 = coord(7/25)
    
  3. Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval (2004) 0.16
    0.1581121 = sum of:
      0.1581121 = product of:
        0.7905605 = sum of:
          0.026788818 = weight(abstract_txt:document in 421) [ClassicSimilarity], result of:
            0.026788818 = score(doc=421,freq=1.0), product of:
              0.06677418 = queryWeight, product of:
                1.0413597 = boost
                4.2793097 = idf(docFreq=1592, maxDocs=42306)
                0.014984218 = queryNorm
              0.40118527 = fieldWeight in 421, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2793097 = idf(docFreq=1592, maxDocs=42306)
                0.09375 = fieldNorm(doc=421)
          0.07544596 = weight(abstract_txt:frequently in 421) [ClassicSimilarity], result of:
            0.07544596 = score(doc=421,freq=1.0), product of:
              0.13316707 = queryWeight, product of:
                1.4706012 = boost
                6.0432124 = idf(docFreq=272, maxDocs=42306)
                0.014984218 = queryNorm
              0.56655115 = fieldWeight in 421, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0432124 = idf(docFreq=272, maxDocs=42306)
                0.09375 = fieldNorm(doc=421)
          0.16439427 = weight(abstract_txt:frequent in 421) [ClassicSimilarity], result of:
            0.16439427 = score(doc=421,freq=2.0), product of:
              0.17764542 = queryWeight, product of:
                1.698531 = boost
                6.9798555 = idf(docFreq=106, maxDocs=42306)
                0.014984218 = queryNorm
              0.9254068 = fieldWeight in 421, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.9798555 = idf(docFreq=106, maxDocs=42306)
                0.09375 = fieldNorm(doc=421)
          0.17902094 = weight(abstract_txt:terms in 421) [ClassicSimilarity], result of:
            0.17902094 = score(doc=421,freq=5.0), product of:
              0.21034947 = queryWeight, product of:
                3.4578106 = boost
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.014984218 = queryNorm
              0.8510644 = fieldWeight in 421, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.09375 = fieldNorm(doc=421)
          0.34491053 = weight(abstract_txt:occurring in 421) [ClassicSimilarity], result of:
            0.34491053 = score(doc=421,freq=1.0), product of:
              0.4978408 = queryWeight, product of:
                4.495849 = boost
                7.390004 = idf(docFreq=70, maxDocs=42306)
                0.014984218 = queryNorm
              0.6928129 = fieldWeight in 421, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.390004 = idf(docFreq=70, maxDocs=42306)
                0.09375 = fieldNorm(doc=421)
        0.2 = coord(5/25)
    
  4. Wolfram, D.; Zhang, J.: The influence of indexing practices and weighting algorithms on document spaces (2008) 0.13
    0.1332606 = sum of:
      0.1332606 = product of:
        0.6663029 = sum of:
          0.043745957 = weight(abstract_txt:document in 3964) [ClassicSimilarity], result of:
            0.043745957 = score(doc=3964,freq=6.0), product of:
              0.06677418 = queryWeight, product of:
                1.0413597 = boost
                4.2793097 = idf(docFreq=1592, maxDocs=42306)
                0.014984218 = queryNorm
              0.65513283 = fieldWeight in 3964, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.2793097 = idf(docFreq=1592, maxDocs=42306)
                0.0625 = fieldNorm(doc=3964)
          0.08154531 = weight(abstract_txt:discriminative in 3964) [ClassicSimilarity], result of:
            0.08154531 = score(doc=3964,freq=1.0), product of:
              0.14586677 = queryWeight, product of:
                1.0883276 = boost
                8.944634 = idf(docFreq=14, maxDocs=42306)
                0.014984218 = queryNorm
              0.55903965 = fieldWeight in 3964, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.944634 = idf(docFreq=14, maxDocs=42306)
                0.0625 = fieldNorm(doc=3964)
          0.050297305 = weight(abstract_txt:frequently in 3964) [ClassicSimilarity], result of:
            0.050297305 = score(doc=3964,freq=1.0), product of:
              0.13316707 = queryWeight, product of:
                1.4706012 = boost
                6.0432124 = idf(docFreq=272, maxDocs=42306)
                0.014984218 = queryNorm
              0.37770078 = fieldWeight in 3964, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0432124 = idf(docFreq=272, maxDocs=42306)
                0.0625 = fieldNorm(doc=3964)
          0.092446014 = weight(abstract_txt:terms in 3964) [ClassicSimilarity], result of:
            0.092446014 = score(doc=3964,freq=3.0), product of:
              0.21034947 = queryWeight, product of:
                3.4578106 = boost
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.014984218 = queryNorm
              0.43948776 = fieldWeight in 3964, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.059814 = idf(docFreq=1983, maxDocs=42306)
                0.0625 = fieldNorm(doc=3964)
          0.39826837 = weight(abstract_txt:occurring in 3964) [ClassicSimilarity], result of:
            0.39826837 = score(doc=3964,freq=3.0), product of:
              0.4978408 = queryWeight, product of:
                4.495849 = boost
                7.390004 = idf(docFreq=70, maxDocs=42306)
                0.014984218 = queryNorm
              0.7999914 = fieldWeight in 3964, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.390004 = idf(docFreq=70, maxDocs=42306)
                0.0625 = fieldNorm(doc=3964)
        0.2 = coord(5/25)
    
  5. Yang, Y.; Wilbur, J.: Using corpus statistics to remove redundant words in text categorization (1996) 0.12
    0.11525455 = sum of:
      0.11525455 = product of:
        0.72034097 = sum of:
          0.028271079 = weight(abstract_txt:impact in 4268) [ClassicSimilarity], result of:
            0.028271079 = score(doc=4268,freq=1.0), product of:
              0.07816073 = queryWeight, product of:
                1.1266546 = boost
                4.629816 = idf(docFreq=1121, maxDocs=42306)
                0.014984218 = queryNorm
              0.36170438 = fieldWeight in 4268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.629816 = idf(docFreq=1121, maxDocs=42306)
                0.078125 = fieldNorm(doc=4268)
          0.09846524 = weight(abstract_txt:reduced in 4268) [ClassicSimilarity], result of:
            0.09846524 = score(doc=4268,freq=1.0), product of:
              0.17959008 = queryWeight, product of:
                1.7078025 = boost
                7.0179553 = idf(docFreq=102, maxDocs=42306)
                0.014984218 = queryNorm
              0.54827774 = fieldWeight in 4268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0179553 = idf(docFreq=102, maxDocs=42306)
                0.078125 = fieldNorm(doc=4268)
          0.06557118 = weight(abstract_txt:vocabulary in 4268) [ClassicSimilarity], result of:
            0.06557118 = score(doc=4268,freq=1.0), product of:
              0.15677114 = queryWeight, product of:
                1.9542277 = boost
                5.353735 = idf(docFreq=543, maxDocs=42306)
                0.014984218 = queryNorm
              0.41826054 = fieldWeight in 4268, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353735 = idf(docFreq=543, maxDocs=42306)
                0.078125 = fieldNorm(doc=4268)
          0.52803344 = weight(abstract_txt:removal in 4268) [ClassicSimilarity], result of:
            0.52803344 = score(doc=4268,freq=4.0), product of:
              0.39677572 = queryWeight, product of:
                3.1089566 = boost
                8.51719 = idf(docFreq=22, maxDocs=42306)
                0.014984218 = queryNorm
              1.3308109 = fieldWeight in 4268, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.51719 = idf(docFreq=22, maxDocs=42306)
                0.078125 = fieldNorm(doc=4268)
        0.16 = coord(4/25)
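
The content scores combine several matching terms. Each term contributes queryWeight times fieldWeight, and the sum is scaled by the coord factor (matched terms over query terms). The sketch below recomputes the score of entry 5 (Yang and Wilbur) from the boost, docFreq, freq, queryNorm and fieldNorm values printed in the listing. It assumes the same tf and idf formulas as the classic Lucene similarity; the constant and function names are illustrative.

```python
import math

# Constants copied from the explain output for entry 5 (doc=4268).
QUERY_NORM = 0.014984218
MAX_DOCS = 42306
FIELD_NORM = 0.078125
QUERY_TERMS = 25  # denominator of coord(4/25)

# (term, boost, docFreq, freq) for each matching abstract term.
MATCHES = [
    ("impact", 1.1266546, 1121, 1.0),
    ("reduced", 1.7078025, 102, 1.0),
    ("vocabulary", 1.9542277, 543, 1.0),
    ("removal", 3.1089566, 22, 4.0),
]


def idf(doc_freq, max_docs=MAX_DOCS):
    """Assumed classic idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq):
    """score = queryWeight * fieldWeight for one matching term."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM
    return query_weight * field_weight


def document_score(matches, query_terms=QUERY_TERMS):
    """Sum the per-term scores and apply the coord factor."""
    raw = sum(term_score(boost, df, freq) for _, boost, df, freq in matches)
    coord = len(matches) / query_terms
    return raw * coord


print(f"{document_score(MATCHES):.4f}")
```

The result should land close to the 0.11525455 shown for entry 5. The coord factor explains why documents matching only four or five of the 25 query terms end up with low overall scores even when individual terms such as "removal" weigh heavily.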