Rehurek, R.; Sojka, P.: Software framework for topic modelling with large corpora (2010)
- Abstract
- Large corpora are ubiquitous in today's world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory-independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library, DML-CZ.
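
The document-streaming idea in the abstract can be illustrated with a minimal sketch. It is not the framework's actual API: `StreamedCorpus`, `cosine`, `most_similar`, the one-document-per-line input, and the fixed vocabulary are all assumptions made for illustration. The point is only that the corpus is an iterable yielding one sparse bag-of-words vector at a time, so memory use does not grow with the number of documents.

```python
import io
import math
from collections import Counter


class StreamedCorpus:
    """Hypothetical streamed corpus: one document per line, read lazily."""

    def __init__(self, open_stream, vocabulary):
        self.open_stream = open_stream  # callable returning a fresh iterator of lines
        self.vocabulary = vocabulary    # dict mapping token -> integer term id

    def __iter__(self):
        # Only the current line is held in memory; each pass reopens the stream.
        for line in self.open_stream():
            counts = Counter(
                self.vocabulary[token]
                for token in line.lower().split()
                if token in self.vocabulary
            )
            # Sparse vector: sorted list of (term_id, frequency) pairs.
            yield sorted(counts.items())


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse (term_id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(term_id, 0) for term_id, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vec, corpus):
    """Single pass over the stream, keeping only the best match so far."""
    best_index, best_score = -1, -1.0
    for index, doc_vec in enumerate(corpus):
        score = cosine(query_vec, doc_vec)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score
```

In use, `open_stream` could be something like `lambda: open(path, encoding="utf-8")` for a file on disk (with `path` chosen by the caller). Because the corpus is an object with `__iter__` rather than a one-shot generator, it can be traversed several times, which multi-pass training algorithms would need.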