Document (#38059)

Author
Rehurek, R.
Sojka, P.
Title
Software framework for topic modelling with large corpora
Source
http://radimrehurek.com/gensim/lrec2010_final.pdf
Year
2010
Abstract
Large corpora are ubiquitous in today's world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.
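The document-streaming idea in the abstract can be illustrated with a minimal Python sketch. It is not Gensim's API: `stream_bow` and the toy whitespace tokenizer are hypothetical names and choices made for this example. The generator yields one bag-of-words vector at a time, so only the vocabulary, never the corpus, is held in memory.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def stream_bow(lines: Iterable[str], vocab: Dict[str, int]) -> Iterator[List[Tuple[int, int]]]:
    """Yield one sparse (term_id, count) vector per document.

    Hypothetical helper: each input line is treated as one document and is
    tokenized by lowercasing and splitting on whitespace. `vocab` grows as
    new tokens are seen; the documents themselves are never stored.
    """
    for line in lines:
        counts = Counter(line.lower().split())
        for token in counts:
            # Assign the next free integer id to tokens not seen before.
            vocab.setdefault(token, len(vocab))
        yield sorted((vocab[token], n) for token, n in counts.items())


vocab: Dict[str, int] = {}
docs = list(stream_bow(["a b a", "b c"], vocab))
```

Because `lines` can be any iterable, an open file handle can be passed in directly and the corpus is read document after document.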
Content
For the software, see: http://radimrehurek.com/gensim/index.html. For a demo, see: http://dml.cz/handle/10338.dmlcz/100785/SimilarArticles.
Field
Mathematics
Object
Latent Semantic Indexing
Aid
Gensim

Similar documents (author)

  1. Sojka, P.: Exploiting semantic annotations in math information retrieval (2012) 6.19
    6.190705 = sum of:
      6.190705 = weight(author_txt:sojka in 32) [ClassicSimilarity], result of:
        6.190705 = fieldWeight in 32, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.625 = fieldNorm(doc=32)
    
  2. Líska, M.; Sojka, P.: MIaS 1.5 (2014) 4.95
    4.952564 = sum of:
      4.952564 = weight(author_txt:sojka in 1652) [ClassicSimilarity], result of:
        4.952564 = fieldWeight in 1652, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.5 = fieldNorm(doc=1652)
    
  3. Sojka, P.; Liska, M.: The art of mathematics retrieval (2011) 4.95
    4.952564 = sum of:
      4.952564 = weight(author_txt:sojka in 3450) [ClassicSimilarity], result of:
        4.952564 = fieldWeight in 3450, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.5 = fieldNorm(doc=3450)
    
  4. Sojka, P.; Lee, M.; Rehurek, R.; Hatlapatka, R.; Kucbel, M.; Bouche, T.; Goutorbe, C.; Anghelache, R.; Wojciechowski, K.: Toolset for entity and semantic associations : Final Release (2013) 2.17
    2.1667466 = sum of:
      2.1667466 = weight(author_txt:sojka in 1057) [ClassicSimilarity], result of:
        2.1667466 = fieldWeight in 1057, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.21875 = fieldNorm(doc=1057)
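
The author scores above follow the pattern fieldWeight = tf × idf × fieldNorm. A minimal Python sketch of that arithmetic is given here, assuming the classic TF-IDF definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are illustrative only and are not Lucene's API.

```python
import math


def classic_tf(freq: float) -> float:
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency with the +1 smoothing assumed above.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as in the explain trees.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Inputs copied from the first hit (doc 32): freq=1, docFreq=5, fieldNorm=0.625.
top_hit = field_weight(freq=1.0, doc_freq=5, max_docs=44218, field_norm=0.625)
print(round(top_hit, 4))
```

Since every hit matches the same single term, the four scores differ only by their fieldNorm values (0.625, 0.5, 0.5, 0.21875).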
    

Similar documents (content)

  1. Li, X.; Zhang, A.; Li, C.; Ouyang, J.; Cai, Y.: Exploring coherent topics by topic modeling with term weighting (2018) 0.16
    0.16344583 = sum of:
      0.16344583 = product of:
        0.5837351 = sum of:
          0.076370776 = weight(abstract_txt:straightforward in 5045) [ClassicSimilarity], result of:
            0.076370776 = score(doc=5045,freq=1.0), product of:
              0.16373688 = queryWeight, product of:
                1.0416641 = boost
                7.462781 = idf(docFreq=68, maxDocs=44218)
                0.021062898 = queryNorm
              0.4664238 = fieldWeight in 5045, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.462781 = idf(docFreq=68, maxDocs=44218)
                0.0625 = fieldNorm(doc=5045)
          0.080742426 = weight(abstract_txt:allocation in 5045) [ClassicSimilarity], result of:
            0.080742426 = score(doc=5045,freq=1.0), product of:
              0.1699272 = queryWeight, product of:
                1.0611722 = boost
                7.602543 = idf(docFreq=59, maxDocs=44218)
                0.021062898 = queryNorm
              0.47515893 = fieldWeight in 5045, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.602543 = idf(docFreq=59, maxDocs=44218)
                0.0625 = fieldNorm(doc=5045)
          0.14660126 = weight(abstract_txt:dirichlet in 5045) [ClassicSimilarity], result of:
            0.14660126 = score(doc=5045,freq=2.0), product of:
              0.20072901 = queryWeight, product of:
                1.1533457 = boost
                8.2629 = idf(docFreq=30, maxDocs=44218)
                0.021062898 = queryNorm
              0.7303441 = fieldWeight in 5045, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.2629 = idf(docFreq=30, maxDocs=44218)
                0.0625 = fieldNorm(doc=5045)
          0.03198121 = weight(abstract_txt:world in 5045) [ClassicSimilarity], result of:
            0.03198121 = score(doc=5045,freq=1.0), product of:
              0.115469776 = queryWeight, product of:
                1.2370965 = boost
                4.4314575 = idf(docFreq=1429, maxDocs=44218)
                0.021062898 = queryNorm
              0.2769661 = fieldWeight in 5045, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4314575 = idf(docFreq=1429, maxDocs=44218)
                0.0625 = fieldNorm(doc=5045)
          0.06404508 = weight(abstract_txt:existing in 5045) [ClassicSimilarity], result of:
            0.06404508 = score(doc=5045,freq=3.0), product of:
              0.1272004 = queryWeight, product of:
                1.2984154 = boost
                4.6511106 = idf(docFreq=1147, maxDocs=44218)
                0.021062898 = queryNorm
              0.5034975 = fieldWeight in 5045, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6511106 = idf(docFreq=1147, maxDocs=44218)
                0.0625 = fieldNorm(doc=5045)
          0.12585784 = weight(abstract_txt:latent in 5045) [ClassicSimilarity], result of:
            0.12585784 = score(doc=5045,freq=1.0), product of:
              0.2878228 = queryWeight, product of:
                1.9531341 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.021062898 = queryNorm
              0.43727544 = fieldWeight in 5045, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.0625 = fieldNorm(doc=5045)
          0.058136504 = weight(abstract_txt:document in 5045) [ClassicSimilarity], result of:
            0.058136504 = score(doc=5045,freq=1.0), product of:
              0.21669437 = queryWeight, product of:
                2.39667 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.021062898 = queryNorm
              0.26828802 = fieldWeight in 5045, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=5045)
        0.28 = coord(7/25)
    
  2. Purpura, A.; Silvello, G.; Susto, G.A.: Learning to rank from relevance judgments distributions (2022) 0.11
    0.11279902 = sum of:
      0.11279902 = product of:
        0.46999592 = sum of:
          0.027145699 = weight(abstract_txt:within in 645) [ClassicSimilarity], result of:
            0.027145699 = score(doc=645,freq=1.0), product of:
              0.10351557 = queryWeight, product of:
                1.1713111 = boost
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.021062898 = queryNorm
              0.26223782 = fieldWeight in 645, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.0625 = fieldNorm(doc=645)
          0.03198121 = weight(abstract_txt:world in 645) [ClassicSimilarity], result of:
            0.03198121 = score(doc=645,freq=1.0), product of:
              0.115469776 = queryWeight, product of:
                1.2370965 = boost
                4.4314575 = idf(docFreq=1429, maxDocs=44218)
                0.021062898 = queryNorm
              0.2769661 = fieldWeight in 645, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4314575 = idf(docFreq=1429, maxDocs=44218)
                0.0625 = fieldNorm(doc=645)
          0.06834238 = weight(abstract_txt:algorithms in 645) [ClassicSimilarity], result of:
            0.06834238 = score(doc=645,freq=1.0), product of:
              0.19157188 = queryWeight, product of:
                1.5934385 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.021062898 = queryNorm
              0.35674536 = fieldWeight in 645, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.0625 = fieldNorm(doc=645)
          0.08221743 = weight(abstract_txt:document in 645) [ClassicSimilarity], result of:
            0.08221743 = score(doc=645,freq=2.0), product of:
              0.21669437 = queryWeight, product of:
                2.39667 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.021062898 = queryNorm
              0.37941656 = fieldWeight in 645, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=645)
          0.0692752 = weight(abstract_txt:framework in 645) [ClassicSimilarity], result of:
            0.0692752 = score(doc=645,freq=1.0), product of:
              0.24355678 = queryWeight, product of:
                2.5408823 = boost
                4.550903 = idf(docFreq=1268, maxDocs=44218)
                0.021062898 = queryNorm
              0.28443143 = fieldWeight in 645, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.550903 = idf(docFreq=1268, maxDocs=44218)
                0.0625 = fieldNorm(doc=645)
          0.19103399 = weight(abstract_txt:corpora in 645) [ClassicSimilarity], result of:
            0.19103399 = score(doc=645,freq=1.0), product of:
              0.43515354 = queryWeight, product of:
                2.9412796 = boost
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.021062898 = queryNorm
              0.43900365 = fieldWeight in 645, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.0625 = fieldNorm(doc=645)
        0.24 = coord(6/25)
    
  3. Zheng, H.-T.; Borchert, C.; Kim, H.-G.: Exploiting corpus-related ontologies for conceptualizing document corpora (2009) 0.11
    0.105730586 = sum of:
      0.105730586 = product of:
        0.5286529 = sum of:
          0.027145699 = weight(abstract_txt:within in 3165) [ClassicSimilarity], result of:
            0.027145699 = score(doc=3165,freq=1.0), product of:
              0.10351557 = queryWeight, product of:
                1.1713111 = boost
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.021062898 = queryNorm
              0.26223782 = fieldWeight in 3165, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
          0.06834238 = weight(abstract_txt:algorithms in 3165) [ClassicSimilarity], result of:
            0.06834238 = score(doc=3165,freq=1.0), product of:
              0.19157188 = queryWeight, product of:
                1.5934385 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.021062898 = queryNorm
              0.35674536 = fieldWeight in 3165, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
          0.12585784 = weight(abstract_txt:latent in 3165) [ClassicSimilarity], result of:
            0.12585784 = score(doc=3165,freq=1.0), product of:
              0.2878228 = queryWeight, product of:
                1.9531341 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.021062898 = queryNorm
              0.43727544 = fieldWeight in 3165, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
          0.11627301 = weight(abstract_txt:document in 3165) [ClassicSimilarity], result of:
            0.11627301 = score(doc=3165,freq=4.0), product of:
              0.21669437 = queryWeight, product of:
                2.39667 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.021062898 = queryNorm
              0.53657603 = fieldWeight in 3165, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
          0.19103399 = weight(abstract_txt:corpora in 3165) [ClassicSimilarity], result of:
            0.19103399 = score(doc=3165,freq=1.0), product of:
              0.43515354 = queryWeight, product of:
                2.9412796 = boost
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.021062898 = queryNorm
              0.43900365 = fieldWeight in 3165, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.0625 = fieldNorm(doc=3165)
        0.2 = coord(5/25)
    
  4. Kaiser, M.; Lieder, H.J.; Majcen, K.; Vallant, H.: New ways of sharing and using authority information : the LEAF project (2003) 0.10
    0.10381197 = sum of:
      0.10381197 = product of:
        0.2883666 = sum of:
          0.019194907 = weight(abstract_txt:within in 1166) [ClassicSimilarity], result of:
            0.019194907 = score(doc=1166,freq=2.0), product of:
              0.10351557 = queryWeight, product of:
                1.1713111 = boost
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.021062898 = queryNorm
              0.18543014 = fieldWeight in 1166, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
          0.015188123 = weight(abstract_txt:software in 1166) [ClassicSimilarity], result of:
            0.015188123 = score(doc=1166,freq=1.0), product of:
              0.11157351 = queryWeight, product of:
                1.216046 = boost
                4.3560514 = idf(docFreq=1541, maxDocs=44218)
                0.021062898 = queryNorm
              0.13612661 = fieldWeight in 1166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3560514 = idf(docFreq=1541, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
          0.015990606 = weight(abstract_txt:world in 1166) [ClassicSimilarity], result of:
            0.015990606 = score(doc=1166,freq=1.0), product of:
              0.115469776 = queryWeight, product of:
                1.2370965 = boost
                4.4314575 = idf(docFreq=1429, maxDocs=44218)
                0.021062898 = queryNorm
              0.13848305 = fieldWeight in 1166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4314575 = idf(docFreq=1429, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
          0.022962376 = weight(abstract_txt:large in 1166) [ClassicSimilarity], result of:
            0.022962376 = score(doc=1166,freq=2.0), product of:
              0.11665219 = queryWeight, product of:
                1.2434144 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.021062898 = queryNorm
              0.19684479 = fieldWeight in 1166, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
          0.018488223 = weight(abstract_txt:existing in 1166) [ClassicSimilarity], result of:
            0.018488223 = score(doc=1166,freq=1.0), product of:
              0.1272004 = queryWeight, product of:
                1.2984154 = boost
                4.6511106 = idf(docFreq=1147, maxDocs=44218)
                0.021062898 = queryNorm
              0.14534721 = fieldWeight in 1166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6511106 = idf(docFreq=1147, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
          0.03417119 = weight(abstract_txt:algorithms in 1166) [ClassicSimilarity], result of:
            0.03417119 = score(doc=1166,freq=1.0), product of:
              0.19157188 = queryWeight, product of:
                1.5934385 = boost
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.021062898 = queryNorm
              0.17837268 = fieldWeight in 1166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.707926 = idf(docFreq=398, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
          0.036103778 = weight(abstract_txt:independent in 1166) [ClassicSimilarity], result of:
            0.036103778 = score(doc=1166,freq=1.0), product of:
              0.19872849 = queryWeight, product of:
                1.622929 = boost
                5.813565 = idf(docFreq=358, maxDocs=44218)
                0.021062898 = queryNorm
              0.1816739 = fieldWeight in 1166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.813565 = idf(docFreq=358, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
          0.09162978 = weight(abstract_txt:memory in 1166) [ClassicSimilarity], result of:
            0.09162978 = score(doc=1166,freq=3.0), product of:
              0.256375 = queryWeight, product of:
                1.8433478 = boost
                6.603137 = idf(docFreq=162, maxDocs=44218)
                0.021062898 = queryNorm
              0.35740528 = fieldWeight in 1166, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.603137 = idf(docFreq=162, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
          0.0346376 = weight(abstract_txt:framework in 1166) [ClassicSimilarity], result of:
            0.0346376 = score(doc=1166,freq=1.0), product of:
              0.24355678 = queryWeight, product of:
                2.5408823 = boost
                4.550903 = idf(docFreq=1268, maxDocs=44218)
                0.021062898 = queryNorm
              0.14221571 = fieldWeight in 1166, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.550903 = idf(docFreq=1268, maxDocs=44218)
                0.03125 = fieldNorm(doc=1166)
        0.36 = coord(9/25)
    
  5. Aerts, D.; Broekaert, J.; Sozzo, S.; Veloz, T.: Meaning-focused and quantum-inspired information retrieval (2013) 0.10
    0.10187605 = sum of:
      0.10187605 = product of:
        0.6367253 = sum of:
          0.055105407 = weight(abstract_txt:processing in 735) [ClassicSimilarity], result of:
            0.055105407 = score(doc=735,freq=1.0), product of:
              0.14301924 = queryWeight, product of:
                1.3767867 = boost
                4.931848 = idf(docFreq=866, maxDocs=44218)
                0.021062898 = queryNorm
              0.38530064 = fieldWeight in 735, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.931848 = idf(docFreq=866, maxDocs=44218)
                0.078125 = fieldNorm(doc=735)
          0.15732232 = weight(abstract_txt:latent in 735) [ClassicSimilarity], result of:
            0.15732232 = score(doc=735,freq=1.0), product of:
              0.2878228 = queryWeight, product of:
                1.9531341 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.021062898 = queryNorm
              0.5465943 = fieldWeight in 735, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.078125 = fieldNorm(doc=735)
          0.08659401 = weight(abstract_txt:framework in 735) [ClassicSimilarity], result of:
            0.08659401 = score(doc=735,freq=1.0), product of:
              0.24355678 = queryWeight, product of:
                2.5408823 = boost
                4.550903 = idf(docFreq=1268, maxDocs=44218)
                0.021062898 = queryNorm
              0.3555393 = fieldWeight in 735, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.550903 = idf(docFreq=1268, maxDocs=44218)
                0.078125 = fieldNorm(doc=735)
          0.33770356 = weight(abstract_txt:corpora in 735) [ClassicSimilarity], result of:
            0.33770356 = score(doc=735,freq=2.0), product of:
              0.43515354 = queryWeight, product of:
                2.9412796 = boost
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.021062898 = queryNorm
              0.7760561 = fieldWeight in 735, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.0240583 = idf(docFreq=106, maxDocs=44218)
                0.078125 = fieldNorm(doc=735)
        0.16 = coord(4/25)
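
The content scores combine a per-term product queryWeight × fieldWeight with a coordination factor for the fraction of query terms matched. The following Python sketch recomputes the last hit (doc 735) from the inputs listed in its explain tree, assuming the same tf and idf definitions as classic TF-IDF scoring. Names such as `term_score` are hypothetical, and the result should agree with the listed value only up to floating-point rounding.

```python
import math

MAX_DOCS = 44218          # maxDocs from the explain output
QUERY_NORM = 0.021062898  # queryNorm from the explain output


def idf(doc_freq: int) -> float:
    # Assumed classic definition: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float, field_norm: float) -> float:
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    # fieldWeight = sqrt(freq) * idf * fieldNorm
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


# (term, freq, docFreq, boost) for the four terms matched in doc 735.
matched_terms = [
    ("processing", 1.0, 866, 1.3767867),
    ("latent", 1.0, 109, 1.9531341),
    ("framework", 1.0, 1268, 2.5408823),
    ("corpora", 2.0, 106, 2.9412796),
]

raw_sum = sum(term_score(freq, df, boost, 0.078125) for _, freq, df, boost in matched_terms)
final_score = raw_sum * (4 / 25)  # coord(4/25): 4 of 25 query terms matched
print(round(final_score, 4))
```

The coordination factor explains why hit 4, which matches 9 of 25 terms, ranks close to hits that match fewer terms but with higher per-term weights.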