Document (#38060)

Author
Rehurek, R.
Sojka, P.
Title
Software framework for topic modelling with large corpora
Source
http://radimrehurek.com/gensim/lrec2010_final.pdf
Year
2010
Abstract
Large corpora are ubiquitous in today's world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.
Content
For the software, see: http://radimrehurek.com/gensim/index.html. For a demo, see: http://dml.cz/handle/10338.dmlcz/100785/SimilarArticles.
Field
Mathematics
Object
Latent Semantic Indexing
Aid
Gensim
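
Note on the abstract: the "document streaming" idea it describes, processing a corpus one document at a time so that memory use does not grow with corpus size, can be illustrated with a minimal sketch. This is not Gensim's API; the function names `stream_documents` and `document_frequencies` and the three-line toy corpus are hypothetical, chosen only for illustration.

```python
import io
from collections import Counter


def stream_documents(handle):
    """Yield one tokenized document per line; only the current line is held in memory."""
    for line in handle:
        tokens = line.lower().split()
        if tokens:
            yield tokens


def document_frequencies(docs):
    """Count, in a single pass over the stream, how many documents contain each term."""
    df = Counter()
    n_docs = 0
    for tokens in docs:
        df.update(set(tokens))  # each term counted at most once per document
        n_docs += 1
    return n_docs, df


# Hypothetical toy corpus; in practice this would be an open file handle.
corpus = io.StringIO(
    "latent semantic analysis\n"
    "latent dirichlet allocation\n"
    "document streaming\n"
)
n_docs, df = document_frequencies(stream_documents(corpus))
print(n_docs, df["latent"], df["streaming"])
```

Because `stream_documents` is a generator, the statistics are accumulated without ever materializing the whole corpus as a list.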

Similar documents (author)

  1. Sojka, P.: Exploiting semantic annotations in math information retrieval (2012) 6.18
    6.176928 = sum of:
      6.176928 = weight(author_txt:sojka in 1497) [ClassicSimilarity], result of:
        6.176928 = score(doc=1497,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            9.883085 = idf(docFreq=5, maxDocs=43254)
            0.101182975 = queryNorm
          6.1769285 = fieldWeight in 1497, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            9.883085 = idf(docFreq=5, maxDocs=43254)
            0.625 = fieldNorm(doc=1497)
    
  2. Líska, M.; Sojka, P.: MIaS 1.5 (2014) 4.94
    4.941542 = sum of:
      4.941542 = weight(author_txt:sojka in 3117) [ClassicSimilarity], result of:
        4.941542 = score(doc=3117,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            9.883085 = idf(docFreq=5, maxDocs=43254)
            0.101182975 = queryNorm
          4.9415426 = fieldWeight in 3117, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            9.883085 = idf(docFreq=5, maxDocs=43254)
            0.5 = fieldNorm(doc=3117)
    
  3. Sojka, P.; Liska, M.: The art of mathematics retrieval (2011) 4.94
    4.941542 = sum of:
      4.941542 = weight(author_txt:sojka in 4915) [ClassicSimilarity], result of:
        4.941542 = score(doc=4915,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            9.883085 = idf(docFreq=5, maxDocs=43254)
            0.101182975 = queryNorm
          4.9415426 = fieldWeight in 4915, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            9.883085 = idf(docFreq=5, maxDocs=43254)
            0.5 = fieldNorm(doc=4915)
    
  4. Sojka, P.; Lee, M.; Rehurek, R.; Hatlapatka, R.; Kucbel, M.; Bouche, T.; Goutorbe, C.; Anghelache, R.; Wojciechowski, K.: Toolset for entity and semantic associations : Final Release (2013) 2.16
    2.1619246 = sum of:
      2.1619246 = weight(author_txt:sojka in 2522) [ClassicSimilarity], result of:
        2.1619246 = score(doc=2522,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            9.883085 = idf(docFreq=5, maxDocs=43254)
            0.101182975 = queryNorm
          2.1619248 = fieldWeight in 2522, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            9.883085 = idf(docFreq=5, maxDocs=43254)
            0.21875 = fieldNorm(doc=2522)
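
The single-term score trees above can be reproduced by hand. The sketch below assumes the standard Lucene ClassicSimilarity formulas, idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq), and takes `queryNorm` and `fieldNorm` as given from the listing for entry 1 (doc=1497). The function names are hypothetical.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq):
    # Assumed ClassicSimilarity tf: square root of the raw term frequency
    return math.sqrt(freq)


# Values copied from the explain output for entry 1 (doc=1497).
idf = classic_idf(doc_freq=5, max_docs=43254)
query_norm = 0.101182975  # taken as given from the listing
field_norm = 0.625        # taken as given from the listing

query_weight = idf * query_norm                    # listing: 0.99999994
field_weight = classic_tf(1.0) * idf * field_norm  # listing: 6.1769285
score = query_weight * field_weight                # listing: 6.176928

print(round(idf, 6), round(field_weight, 6), round(score, 6))
```

Since all four entries share the same idf and term frequency, the ranking in this list is determined by `fieldNorm` alone (0.625, 0.5, 0.5, 0.21875), which is smaller for records with longer author fields.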
    

Similar documents (content)

  1. Li, X.; Zhang, A.; Li, C.; Ouyang, J.; Cai, Y.: Exploring coherent topics by topic modeling with term weighting (2018) 0.16
    0.16403529 = sum of:
      0.16403529 = product of:
        0.58584034 = sum of:
          0.07625398 = weight(abstract_txt:straightforward in 46) [ClassicSimilarity], result of:
            0.07625398 = score(doc=46,freq=1.0), product of:
              0.162997 = queryWeight, product of:
                1.044836 = boost
                7.4851904 = idf(docFreq=65, maxDocs=43254)
                0.020841485 = queryNorm
              0.4678244 = fieldWeight in 46, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4851904 = idf(docFreq=65, maxDocs=43254)
                0.0625 = fieldNorm(doc=46)
          0.081386425 = weight(abstract_txt:allocation in 46) [ClassicSimilarity], result of:
            0.081386425 = score(doc=46,freq=1.0), product of:
              0.17023125 = queryWeight, product of:
                1.0677706 = boost
                7.649493 = idf(docFreq=55, maxDocs=43254)
                0.020841485 = queryNorm
              0.47809333 = fieldWeight in 46, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.649493 = idf(docFreq=55, maxDocs=43254)
                0.0625 = fieldNorm(doc=46)
          0.14743064 = weight(abstract_txt:dirichlet in 46) [ClassicSimilarity], result of:
            0.14743064 = score(doc=46,freq=2.0), product of:
              0.20077972 = queryWeight, product of:
                1.1596268 = boost
                8.307549 = idf(docFreq=28, maxDocs=43254)
                0.020841485 = queryNorm
              0.73429054 = fieldWeight in 46, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.307549 = idf(docFreq=28, maxDocs=43254)
                0.0625 = fieldNorm(doc=46)
          0.03182858 = weight(abstract_txt:world in 46) [ClassicSimilarity], result of:
            0.03182858 = score(doc=46,freq=1.0), product of:
              0.114698954 = queryWeight, product of:
                1.2395186 = boost
                4.4399467 = idf(docFreq=1386, maxDocs=43254)
                0.020841485 = queryNorm
              0.27749667 = fieldWeight in 46, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4399467 = idf(docFreq=1386, maxDocs=43254)
                0.0625 = fieldNorm(doc=46)
          0.06486763 = weight(abstract_txt:existing in 46) [ClassicSimilarity], result of:
            0.06486763 = score(doc=46,freq=3.0), product of:
              0.12783788 = queryWeight, product of:
                1.3085885 = boost
                4.6873546 = idf(docFreq=1082, maxDocs=43254)
                0.020841485 = queryNorm
              0.507421 = fieldWeight in 46, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6873546 = idf(docFreq=1082, maxDocs=43254)
                0.0625 = fieldNorm(doc=46)
          0.12688883 = weight(abstract_txt:latent in 46) [ClassicSimilarity], result of:
            0.12688883 = score(doc=46,freq=1.0), product of:
              0.28837895 = queryWeight, product of:
                1.9654188 = boost
                7.040116 = idf(docFreq=102, maxDocs=43254)
                0.020841485 = queryNorm
              0.44000724 = fieldWeight in 46, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.040116 = idf(docFreq=102, maxDocs=43254)
                0.0625 = fieldNorm(doc=46)
          0.05718426 = weight(abstract_txt:document in 46) [ClassicSimilarity], result of:
            0.05718426 = score(doc=46,freq=1.0), product of:
              0.21357101 = queryWeight, product of:
                2.3919907 = boost
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.020841485 = queryNorm
              0.26775292 = fieldWeight in 46, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.0625 = fieldNorm(doc=46)
        0.28 = coord(7/25)
    
  2. Zheng, H.-T.; Borchert, C.; Kim, H.-G.: Exploiting corpus-related ontologies for conceptualizing document corpora (2009) 0.11
    0.10653334 = sum of:
      0.10653334 = product of:
        0.5326667 = sum of:
          0.0271083 = weight(abstract_txt:within in 166) [ClassicSimilarity], result of:
            0.0271083 = score(doc=166,freq=1.0), product of:
              0.10305827 = queryWeight, product of:
                1.1749375 = boost
                4.208617 = idf(docFreq=1747, maxDocs=43254)
                0.020841485 = queryNorm
              0.26303858 = fieldWeight in 166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.208617 = idf(docFreq=1747, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
          0.06905831 = weight(abstract_txt:algorithms in 166) [ClassicSimilarity], result of:
            0.06905831 = score(doc=166,freq=1.0), product of:
              0.19223182 = queryWeight, product of:
                1.6046709 = boost
                5.747919 = idf(docFreq=374, maxDocs=43254)
                0.020841485 = queryNorm
              0.35924494 = fieldWeight in 166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.747919 = idf(docFreq=374, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
          0.12688883 = weight(abstract_txt:latent in 166) [ClassicSimilarity], result of:
            0.12688883 = score(doc=166,freq=1.0), product of:
              0.28837895 = queryWeight, product of:
                1.9654188 = boost
                7.040116 = idf(docFreq=102, maxDocs=43254)
                0.020841485 = queryNorm
              0.44000724 = fieldWeight in 166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.040116 = idf(docFreq=102, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
          0.11436852 = weight(abstract_txt:document in 166) [ClassicSimilarity], result of:
            0.11436852 = score(doc=166,freq=4.0), product of:
              0.21357101 = queryWeight, product of:
                2.3919907 = boost
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.020841485 = queryNorm
              0.53550583 = fieldWeight in 166, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
          0.19524273 = weight(abstract_txt:corpora in 166) [ClassicSimilarity], result of:
            0.19524273 = score(doc=166,freq=1.0), product of:
              0.43997532 = queryWeight, product of:
                2.9732616 = boost
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.020841485 = queryNorm
              0.44375837 = fieldWeight in 166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.0625 = fieldNorm(doc=166)
        0.2 = coord(5/25)
    
  3. Kaiser, M.; Lieder, H.J.; Majcen, K.; Vallant, H.: New ways of sharing and using authority information : the LEAF project (2003) 0.10
    0.1037858 = sum of:
      0.1037858 = product of:
        0.28829387 = sum of:
          0.019168463 = weight(abstract_txt:within in 3167) [ClassicSimilarity], result of:
            0.019168463 = score(doc=3167,freq=2.0), product of:
              0.10305827 = queryWeight, product of:
                1.1749375 = boost
                4.208617 = idf(docFreq=1747, maxDocs=43254)
                0.020841485 = queryNorm
              0.18599635 = fieldWeight in 3167, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.208617 = idf(docFreq=1747, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
          0.0149838375 = weight(abstract_txt:software in 3167) [ClassicSimilarity], result of:
            0.0149838375 = score(doc=3167,freq=1.0), product of:
              0.11018353 = queryWeight, product of:
                1.2148752 = boost
                4.351674 = idf(docFreq=1514, maxDocs=43254)
                0.020841485 = queryNorm
              0.13598981 = fieldWeight in 3167, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.351674 = idf(docFreq=1514, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
          0.01591429 = weight(abstract_txt:world in 3167) [ClassicSimilarity], result of:
            0.01591429 = score(doc=3167,freq=1.0), product of:
              0.114698954 = queryWeight, product of:
                1.2395186 = boost
                4.4399467 = idf(docFreq=1386, maxDocs=43254)
                0.020841485 = queryNorm
              0.13874833 = fieldWeight in 3167, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4399467 = idf(docFreq=1386, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
          0.022919888 = weight(abstract_txt:large in 3167) [ClassicSimilarity], result of:
            0.022919888 = score(doc=3167,freq=2.0), product of:
              0.1161002 = queryWeight, product of:
                1.2470671 = boost
                4.466985 = idf(docFreq=1349, maxDocs=43254)
                0.020841485 = queryNorm
              0.19741471 = fieldWeight in 3167, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.466985 = idf(docFreq=1349, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
          0.01872567 = weight(abstract_txt:existing in 3167) [ClassicSimilarity], result of:
            0.01872567 = score(doc=3167,freq=1.0), product of:
              0.12783788 = queryWeight, product of:
                1.3085885 = boost
                4.6873546 = idf(docFreq=1082, maxDocs=43254)
                0.020841485 = queryNorm
              0.14647983 = fieldWeight in 3167, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6873546 = idf(docFreq=1082, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
          0.034529153 = weight(abstract_txt:algorithms in 3167) [ClassicSimilarity], result of:
            0.034529153 = score(doc=3167,freq=1.0), product of:
              0.19223182 = queryWeight, product of:
                1.6046709 = boost
                5.747919 = idf(docFreq=374, maxDocs=43254)
                0.020841485 = queryNorm
              0.17962247 = fieldWeight in 3167, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.747919 = idf(docFreq=374, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
          0.035787504 = weight(abstract_txt:independent in 3167) [ClassicSimilarity], result of:
            0.035787504 = score(doc=3167,freq=1.0), product of:
              0.19687426 = queryWeight, product of:
                1.6239318 = boost
                5.8169117 = idf(docFreq=349, maxDocs=43254)
                0.020841485 = queryNorm
              0.18177849 = fieldWeight in 3167, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8169117 = idf(docFreq=349, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
          0.09104631 = weight(abstract_txt:memory in 3167) [ClassicSimilarity], result of:
            0.09104631 = score(doc=3167,freq=3.0), product of:
              0.25439143 = queryWeight, product of:
                1.8459697 = boost
                6.61225 = idf(docFreq=157, maxDocs=43254)
                0.020841485 = queryNorm
              0.3578985 = fieldWeight in 3167, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.61225 = idf(docFreq=157, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
          0.035218738 = weight(abstract_txt:framework in 3167) [ClassicSimilarity], result of:
            0.035218738 = score(doc=3167,freq=1.0), product of:
              0.2454109 = queryWeight, product of:
                2.5641017 = boost
                4.5922966 = idf(docFreq=1190, maxDocs=43254)
                0.020841485 = queryNorm
              0.14350927 = fieldWeight in 3167, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5922966 = idf(docFreq=1190, maxDocs=43254)
                0.03125 = fieldNorm(doc=3167)
        0.36 = coord(9/25)
    
  4. Aerts, D.; Broekaert, J.; Sozzo, S.; Veloz, T.: Meaning-focused and quantum-inspired information retrieval (2013) 0.10
    0.103502065 = sum of:
      0.103502065 = product of:
        0.6468879 = sum of:
          0.055086367 = weight(abstract_txt:processing in 2200) [ClassicSimilarity], result of:
            0.055086367 = score(doc=2200,freq=1.0), product of:
              0.14248551 = queryWeight, product of:
                1.3815248 = boost
                4.9486117 = idf(docFreq=833, maxDocs=43254)
                0.020841485 = queryNorm
              0.3866103 = fieldWeight in 2200, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9486117 = idf(docFreq=833, maxDocs=43254)
                0.078125 = fieldNorm(doc=2200)
          0.15861104 = weight(abstract_txt:latent in 2200) [ClassicSimilarity], result of:
            0.15861104 = score(doc=2200,freq=1.0), product of:
              0.28837895 = queryWeight, product of:
                1.9654188 = boost
                7.040116 = idf(docFreq=102, maxDocs=43254)
                0.020841485 = queryNorm
              0.5500091 = fieldWeight in 2200, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.040116 = idf(docFreq=102, maxDocs=43254)
                0.078125 = fieldNorm(doc=2200)
          0.08804685 = weight(abstract_txt:framework in 2200) [ClassicSimilarity], result of:
            0.08804685 = score(doc=2200,freq=1.0), product of:
              0.2454109 = queryWeight, product of:
                2.5641017 = boost
                4.5922966 = idf(docFreq=1190, maxDocs=43254)
                0.020841485 = queryNorm
              0.35877317 = fieldWeight in 2200, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5922966 = idf(docFreq=1190, maxDocs=43254)
                0.078125 = fieldNorm(doc=2200)
          0.34514365 = weight(abstract_txt:corpora in 2200) [ClassicSimilarity], result of:
            0.34514365 = score(doc=2200,freq=2.0), product of:
              0.43997532 = queryWeight, product of:
                2.9732616 = boost
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.020841485 = queryNorm
              0.7844614 = fieldWeight in 2200, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.100134 = idf(docFreq=96, maxDocs=43254)
                0.078125 = fieldNorm(doc=2200)
        0.16 = coord(4/25)
    
  5. Ye, Z.; Huang, J.X.; Lin, H.: Finding a good query-related topic for boosting pseudo-relevance feedback (2011) 0.10
    0.0994661 = sum of:
      0.0994661 = product of:
        0.4973305 = sum of:
          0.081386425 = weight(abstract_txt:allocation in 837) [ClassicSimilarity], result of:
            0.081386425 = score(doc=837,freq=1.0), product of:
              0.17023125 = queryWeight, product of:
                1.0677706 = boost
                7.649493 = idf(docFreq=55, maxDocs=43254)
                0.020841485 = queryNorm
              0.47809333 = fieldWeight in 837, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.649493 = idf(docFreq=55, maxDocs=43254)
                0.0625 = fieldNorm(doc=837)
          0.10424922 = weight(abstract_txt:dirichlet in 837) [ClassicSimilarity], result of:
            0.10424922 = score(doc=837,freq=1.0), product of:
              0.20077972 = queryWeight, product of:
                1.1596268 = boost
                8.307549 = idf(docFreq=28, maxDocs=43254)
                0.020841485 = queryNorm
              0.51922184 = fieldWeight in 837, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.307549 = idf(docFreq=28, maxDocs=43254)
                0.0625 = fieldNorm(doc=837)
          0.12688883 = weight(abstract_txt:latent in 837) [ClassicSimilarity], result of:
            0.12688883 = score(doc=837,freq=1.0), product of:
              0.28837895 = queryWeight, product of:
                1.9654188 = boost
                7.040116 = idf(docFreq=102, maxDocs=43254)
                0.020841485 = queryNorm
              0.44000724 = fieldWeight in 837, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.040116 = idf(docFreq=102, maxDocs=43254)
                0.0625 = fieldNorm(doc=837)
          0.11436852 = weight(abstract_txt:document in 837) [ClassicSimilarity], result of:
            0.11436852 = score(doc=837,freq=4.0), product of:
              0.21357101 = queryWeight, product of:
                2.3919907 = boost
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.020841485 = queryNorm
              0.53550583 = fieldWeight in 837, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.0625 = fieldNorm(doc=837)
          0.070437476 = weight(abstract_txt:framework in 837) [ClassicSimilarity], result of:
            0.070437476 = score(doc=837,freq=1.0), product of:
              0.2454109 = queryWeight, product of:
                2.5641017 = boost
                4.5922966 = idf(docFreq=1190, maxDocs=43254)
                0.020841485 = queryNorm
              0.28701854 = fieldWeight in 837, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5922966 = idf(docFreq=1190, maxDocs=43254)
                0.0625 = fieldNorm(doc=837)
        0.2 = coord(5/25)
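
For the multi-term content scores above, each matching term contributes queryWeight × fieldWeight, and the sum is scaled by the coordination factor (matched terms / query terms). The sketch below recomputes entry 4 (doc=2200) under that assumption, with every number copied from its explain tree; the helper name `term_score` is hypothetical.

```python
import math


def term_score(freq, boost, idf, query_norm, field_norm):
    """Score of one matching term as laid out in the explain tree."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


QUERY_NORM = 0.020841485  # shared by all terms of the content query
FIELD_NORM = 0.078125     # fieldNorm(doc=2200)

# (term, freq, boost, idf) copied from the listing for doc=2200.
matches = [
    ("processing", 1.0, 1.3815248, 4.9486117),
    ("latent",     1.0, 1.9654188, 7.040116),
    ("framework",  1.0, 2.5641017, 4.5922966),
    ("corpora",    2.0, 2.9732616, 7.100134),
]

term_sum = sum(
    term_score(freq, boost, idf, QUERY_NORM, FIELD_NORM)
    for _, freq, boost, idf in matches
)
coord = len(matches) / 25  # 4 of the 25 query terms matched
total = coord * term_sum

print(round(term_sum, 6), round(total, 6))
```

The coordination factor explains why entry 3, which matches 9 of 25 terms with low individual weights, can score close to entries that match fewer but rarer terms.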