Search (2 results, page 1 of 1)

  • × author_ss:"Rehurek, R."
  • × type_ss:"el"
  1. Rehurek, R.; Sojka, P.: Software framework for topic modelling with large corpora (2010) 0.04
    0.037334766 = product of:
      0.07466953 = sum of:
        0.051698197 = weight(_text_:digital in 1058) [ClassicSimilarity], result of:
          0.051698197 = score(doc=1058,freq=2.0), product of:
            0.19770671 = queryWeight, product of:
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.050121464 = queryNorm
            0.26148933 = fieldWeight in 1058, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.944552 = idf(docFreq=2326, maxDocs=44218)
              0.046875 = fieldNorm(doc=1058)
        0.022971334 = weight(_text_:library in 1058) [ClassicSimilarity], result of:
          0.022971334 = score(doc=1058,freq=2.0), product of:
            0.1317883 = queryWeight, product of:
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.050121464 = queryNorm
            0.17430481 = fieldWeight in 1058, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.6293786 = idf(docFreq=8668, maxDocs=44218)
              0.046875 = fieldNorm(doc=1058)
      0.5 = coord(2/4)
    
    Abstract
    Large corpora are ubiquitous in today's world, and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many popular algorithms: their scalability and ease of use. We describe a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora document after document, in a memory-independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library, DML-CZ.
  2. Sojka, P.; Lee, M.; Rehurek, R.; Hatlapatka, R.; Kucbel, M.; Bouche, T.; Goutorbe, C.; Anghelache, R.; Wojciechowski, K.: Toolset for entity and semantic associations : Final Release (2013) 0.01
    0.010464822 = product of:
      0.041859288 = sum of:
        0.041859288 = product of:
          0.083718576 = sum of:
            0.083718576 = weight(_text_:project in 1057) [ClassicSimilarity], result of:
              0.083718576 = score(doc=1057,freq=4.0), product of:
                0.21156175 = queryWeight, product of:
                  4.220981 = idf(docFreq=1764, maxDocs=44218)
                  0.050121464 = queryNorm
                0.39571697 = fieldWeight in 1057, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.220981 = idf(docFreq=1764, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1057)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Abstract
    In this document we describe the final release of the toolset for entity and semantic associations, integrating two versions (language-dependent and language-independent) of Unsupervised Document Similarity implemented by MU (using the gensim tool) and Citation Indexing, Resolution and Matching (UJF/CMD). We give a brief description of the tools and the rationale behind the decisions made, and provide an elementary evaluation. The tools are integrated in the main project result, the EuDML website, and they deliver the functionality needed for exploratory searching and browsing of the collected documents. EuDML users and content providers thus benefit from millions of algorithmically generated similarity and citation links, developed using state-of-the-art machine learning and matching methods.
    Content
    See also: https://is.muni.cz/repo/1076213/en/Lee-Sojka-Rehurek-Bolikowski/Toolset-for-Entity-and-Semantic-Associations-Initial-Release-Deliverable-82-of-project-EuDML?lang=en.
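
Note on the relevance scores: the score trees shown for both results follow Lucene's ClassicSimilarity (TF-IDF) scheme, where tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The following is a minimal sketch that recomputes both scores from the figures in the listing. The function and variable names (`idf`, `term_weight`, `score_1058`, `score_1057`) are illustrative and not part of the catalog or of Lucene; queryNorm and fieldNorm are copied from the listing rather than derived, since they depend on the full query and on field lengths that are not shown here.

```python
import math

# Constants copied from the score trees in the listing.
MAX_DOCS = 44218          # maxDocs
QUERY_NORM = 0.050121464  # queryNorm (taken as given, not derived)
FIELD_NORM = 0.046875     # fieldNorm for docs 1058 and 1057


def idf(doc_freq, max_docs=MAX_DOCS):
    """ClassicSimilarity inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_weight(freq, doc_freq, query_norm=QUERY_NORM, field_norm=FIELD_NORM):
    """Weight of one query term in one document: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = term_idf * query_norm                    # idf * queryNorm
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


# Result 1 (doc 1058): "digital" and "library" match, 2 of 4 query clauses.
score_1058 = (term_weight(2.0, 2326) + term_weight(2.0, 8668)) * (2 / 4)

# Result 2 (doc 1057): "project" matches inside a nested clause, so the
# inner coord(1/2) and the outer coord(1/4) both apply.
score_1057 = term_weight(4.0, 1764) * (1 / 2) * (1 / 4)

print(f"doc 1058: {score_1058:.4f}")
print(f"doc 1057: {score_1057:.4f}")
```

The recomputed values agree with the listed 0.037334766 and 0.010464822 up to single-precision rounding, which Lucene uses internally. The coord factors explain most of the gap between the two results: the second record matches only one of four top-level clauses.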