Search (1447 results, page 1 of 73)

  • year_i:[2010 TO 2020}
  1. Kumar, C.A.; Radvansky, M.; Annapurna, J.: Analysis of Vector Space Model, Latent Semantic Indexing and Formal Concept Analysis for information retrieval (2012) 0.30
    0.30362892 = product of:
      0.40483856 = sum of:
        0.2159047 = weight(_text_:vector in 2710) [ClassicSimilarity], result of:
          0.2159047 = score(doc=2710,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.7043085 = fieldWeight in 2710, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2710)
        0.14178926 = weight(_text_:space in 2710) [ClassicSimilarity], result of:
          0.14178926 = score(doc=2710,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.5707601 = fieldWeight in 2710, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2710)
        0.04714458 = product of:
          0.09428916 = sum of:
            0.09428916 = weight(_text_:model in 2710) [ClassicSimilarity], result of:
              0.09428916 = score(doc=2710,freq=6.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.51509297 = fieldWeight in 2710, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2710)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Latent Semantic Indexing (LSI), a variant of the classical Vector Space Model (VSM), is an Information Retrieval (IR) model that attempts to capture the latent semantic relationships between data items. Mathematical lattices, under the framework of Formal Concept Analysis (FCA), represent conceptual hierarchies in data and support information retrieval. Both LSI and FCA, however, use data represented in the form of matrices. The objective of this paper is to systematically analyze VSM, LSI and FCA for the task of IR using standard and real-life datasets.
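    The scoring breakdowns shown with each hit follow Lucene's ClassicSimilarity (TF-IDF) formula: each matching term contributes queryWeight x fieldWeight, where queryWeight = idf x queryNorm and fieldWeight = tf x idf x fieldNorm, nested clauses are scaled by their own coord factor, and the clause sum is multiplied by the top-level coord (here 3 of 4 query clauses matched). A minimal Python sketch, with the constants copied from the explain tree above, reproduces the arithmetic:

      import math

      def term_score(tf, idf, field_norm, query_norm):
          """ClassicSimilarity per-term contribution: queryWeight * fieldWeight."""
          query_weight = idf * query_norm                   # idf(t) * queryNorm
          field_weight = math.sqrt(tf) * idf * field_norm   # tf(t in d) * idf(t) * fieldNorm
          return query_weight * field_weight

      QUERY_NORM = 0.047605187
      FIELD_NORM = 0.0546875

      vector_w = term_score(4.0, 6.439392, FIELD_NORM, QUERY_NORM)        # ~0.2159
      space_w  = term_score(4.0, 5.2183776, FIELD_NORM, QUERY_NORM)       # ~0.1418
      model_w  = term_score(6.0, 3.845226, FIELD_NORM, QUERY_NORM) * 0.5  # nested coord(1/2) -> ~0.0471

      total = (vector_w + space_w + model_w) * 0.75  # coord(3/4)
      print(round(total, 8))                         # ~0.30362892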
  2. Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.27
    0.26987368 = product of:
      0.35983157 = sum of:
        0.130858 = weight(_text_:vector in 690) [ClassicSimilarity], result of:
          0.130858 = score(doc=690,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
        0.08593727 = weight(_text_:space in 690) [ClassicSimilarity], result of:
          0.08593727 = score(doc=690,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
        0.1430363 = sum of:
          0.10433724 = weight(_text_:model in 690) [ClassicSimilarity], result of:
            0.10433724 = score(doc=690,freq=10.0), product of:
              0.1830527 = queryWeight, product of:
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.047605187 = queryNorm
              0.5699847 = fieldWeight in 690, product of:
                3.1622777 = tf(freq=10.0), with freq of:
                  10.0 = termFreq=10.0
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.046875 = fieldNorm(doc=690)
          0.03869907 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
            0.03869907 = score(doc=690,freq=2.0), product of:
              0.16670525 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047605187 = queryNorm
              0.23214069 = fieldWeight in 690, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=690)
      0.75 = coord(3/4)
    
    Abstract
    We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded on singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between the latent semantic indexing (LSI) term subspace and the LSI document subspace. LSISSM performs feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with the random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.
    Date
    23. 3.2013 13:22:36
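    LSISSM is grounded on a singular value decomposition of the term-document matrix; the abstract gives no implementation detail, but a minimal Python sketch of the underlying LSI step it builds on (truncated SVD of a toy term-document matrix; the matrix, k and variable names are illustrative, not the authors' code) looks like this:

      import numpy as np

      # Toy term-document matrix (rows: terms, columns: documents); real matrices are large and sparse.
      A = np.array([
          [2., 0., 1., 0.],
          [1., 1., 0., 0.],
          [0., 2., 1., 1.],
          [0., 0., 1., 2.],
      ])

      k = 2  # number of latent concept dimensions to keep
      U, s, Vt = np.linalg.svd(A, full_matrices=False)
      U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

      A_k = U_k @ np.diag(s_k) @ Vt_k         # rank-k approximation of A
      doc_vectors = (np.diag(s_k) @ Vt_k).T   # one k-dimensional vector per document
      print(doc_vectors.round(3))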
  3. Kiros, R.; Salakhutdinov, R.; Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models (2014) 0.25
    0.25179514 = product of:
      0.33572686 = sum of:
        0.130858 = weight(_text_:vector in 1871) [ClassicSimilarity], result of:
          0.130858 = score(doc=1871,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 1871, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=1871)
        0.17187454 = weight(_text_:space in 1871) [ClassicSimilarity], result of:
          0.17187454 = score(doc=1871,freq=8.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.6918657 = fieldWeight in 1871, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=1871)
        0.03299433 = product of:
          0.06598866 = sum of:
            0.06598866 = weight(_text_:model in 1871) [ClassicSimilarity], result of:
              0.06598866 = score(doc=1871,freq=4.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.36048993 = fieldWeight in 1871, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1871)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore, we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
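    The "blue car" regularity above is ordinary vector arithmetic followed by a nearest-neighbour lookup in the joint embedding space; a minimal sketch with made-up 3-dimensional vectors (the embeddings and keys are toy stand-ins, not the authors' learned model) shows the query it implies:

      import numpy as np

      def nearest(query, vocab):
          """Return the key whose vector has the highest cosine similarity to the query."""
          def cos(a, b):
              return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
          return max(vocab, key=lambda k: cos(query, vocab[k]))

      # Toy vectors standing in for learned image/text embeddings.
      emb = {
          "image_blue_car": np.array([0.9, 0.1, 0.8]),
          "image_red_car":  np.array([0.1, 0.9, 0.8]),
          "blue":           np.array([0.8, 0.0, 0.1]),
          "red":            np.array([0.0, 0.8, 0.1]),
      }

      query = emb["image_blue_car"] - emb["blue"] + emb["red"]
      print(nearest(query, emb))  # image_red_car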
  4. Cribbin, T.: Discovering latent topical structure by second-order similarity analysis (2011) 0.23
    0.23219806 = product of:
      0.3095974 = sum of:
        0.18887727 = weight(_text_:vector in 4470) [ClassicSimilarity], result of:
          0.18887727 = score(doc=4470,freq=6.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.6161416 = fieldWeight in 4470, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4470)
        0.101278044 = weight(_text_:space in 4470) [ClassicSimilarity], result of:
          0.101278044 = score(doc=4470,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.40768576 = fieldWeight in 4470, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4470)
        0.019442094 = product of:
          0.03888419 = sum of:
            0.03888419 = weight(_text_:model in 4470) [ClassicSimilarity], result of:
              0.03888419 = score(doc=4470,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.21242073 = fieldWeight in 4470, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4470)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Computing document similarity directly from a "bag of words" vector space model can be problematic because term independence causes the relationships between synonymous terms and the contextual influences that determine the sense of polysemous terms to be ignored. This study compares two methods that potentially address these problems by deriving the higher order relationships that lie latent within the original first-order space. The first is latent semantic analysis (LSA), a dimension reduction method that is a well-known means of addressing the vocabulary mismatch problem in information retrieval systems. The second is the lesser known yet conceptually simple approach of second-order similarity (SOS) analysis, whereby latent similarity is measured in terms of mutual first-order similarity. Nearest neighbour tests show that SOS analysis derives similarity models that are superior to both first-order and LSA-derived models at both coarse and fine levels of semantic granularity. SOS analysis has been criticized for its computational complexity. A second contribution is the novel application of vector truncation to reduce run-time by a constant factor. Speed-ups of 4 to 10 times are achievable without compromising the structural gains achieved by full-vector SOS analysis.
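    Second-order similarity, as described above, scores two documents by comparing their whole profiles of first-order similarities rather than their term vectors directly; a minimal sketch, assuming cosine similarity at both levels (the toy matrix and function names are illustrative):

      import numpy as np

      def cosine_matrix(X):
          """Row-wise cosine similarity matrix."""
          Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
          return Xn @ Xn.T

      # Toy document-term matrix (rows: documents, columns: terms).
      docs = np.array([
          [2., 1., 0., 0.],
          [1., 2., 0., 0.],
          [0., 0., 1., 2.],
          [0., 1., 2., 1.],
      ])

      first_order = cosine_matrix(docs)           # direct "bag of words" similarity
      second_order = cosine_matrix(first_order)   # similarity of first-order similarity profiles
      print(first_order.round(2))
      print(second_order.round(2))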
  5. Mestrovic, A.; Cali, A.: An ontology-based approach to information retrieval (2017) 0.23
    0.23186485 = product of:
      0.30915314 = sum of:
        0.21809667 = weight(_text_:vector in 3489) [ClassicSimilarity], result of:
          0.21809667 = score(doc=3489,freq=8.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.711459 = fieldWeight in 3489, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3489)
        0.07161439 = weight(_text_:space in 3489) [ClassicSimilarity], result of:
          0.07161439 = score(doc=3489,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.28827736 = fieldWeight in 3489, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3489)
        0.019442094 = product of:
          0.03888419 = sum of:
            0.03888419 = weight(_text_:model in 3489) [ClassicSimilarity], result of:
              0.03888419 = score(doc=3489,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.21242073 = fieldWeight in 3489, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3489)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    We define a general framework for ontology-based information retrieval (IR). In our approach, document and query expansion rely on a base taxonomy that is extracted from a lexical database or a Linked Data set (e.g. WordNet, Wiktionary etc.). Each term from a document or query is modelled as a vector of base concepts from the base taxonomy. We define a set of mapping functions which map multiple ontological layers (dimensions) onto the base taxonomy. This way, each concept from the included ontologies can also be represented as a vector of base concepts from the base taxonomy. We propose a general weighting schema which is used for the vector space model. Our framework can therefore take into account various lexical and semantic relations between terms and concepts (e.g. synonymy, hierarchy, meronymy, antonymy, geo-proximity, etc.). This allows us to avoid certain vocabulary problems (e.g. synonymy, polysemy) as well as to reduce the vector size in the IR tasks.
  6. Akerele, O.; David, A.; Osofisan, A.: Using the concepts of Case Based Reasoning and Basic Categories for enhancing adaptation to the user's level of knowledge in Decision Support System (2014) 0.23
    0.22661653 = product of:
      0.30215538 = sum of:
        0.130858 = weight(_text_:vector in 1449) [ClassicSimilarity], result of:
          0.130858 = score(doc=1449,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 1449, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=1449)
        0.08593727 = weight(_text_:space in 1449) [ClassicSimilarity], result of:
          0.08593727 = score(doc=1449,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 1449, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=1449)
        0.085360095 = sum of:
          0.046661027 = weight(_text_:model in 1449) [ClassicSimilarity], result of:
            0.046661027 = score(doc=1449,freq=2.0), product of:
              0.1830527 = queryWeight, product of:
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.047605187 = queryNorm
              0.25490487 = fieldWeight in 1449, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.046875 = fieldNorm(doc=1449)
          0.03869907 = weight(_text_:22 in 1449) [ClassicSimilarity], result of:
            0.03869907 = score(doc=1449,freq=2.0), product of:
              0.16670525 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047605187 = queryNorm
              0.23214069 = fieldWeight in 1449, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=1449)
      0.75 = coord(3/4)
    
    Abstract
    In most search systems, matching queries to documents employs techniques such as the vector space model, naïve Bayes or Bayes' theorem to classify the resulting documents. In this research we propose using the concept of basic categories to represent the user's level of knowledge, based on the concepts the user employed during his or her search activities, so that the system can propose results adapted to the observed level of knowledge. Our hypothesis is that this approach will enhance decision support systems for solving decisional problems in which information retrieval constitutes the backbone of the technical problem.
    Source
    Knowledge organization in the 21st century: between historical patterns and future prospects. Proceedings of the Thirteenth International ISKO Conference 19-22 May 2014, Kraków, Poland. Ed.: Wieslaw Babik
  7. Kiela, D.; Clark, S.: Detecting compositionality of multi-word expressions using nearest neighbours in vector space models (2013) 0.20
    0.20439655 = product of:
      0.4087931 = sum of:
        0.24674822 = weight(_text_:vector in 1161) [ClassicSimilarity], result of:
          0.24674822 = score(doc=1161,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.804924 = fieldWeight in 1161, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0625 = fieldNorm(doc=1161)
        0.16204487 = weight(_text_:space in 1161) [ClassicSimilarity], result of:
          0.16204487 = score(doc=1161,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.6522972 = fieldWeight in 1161, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0625 = fieldNorm(doc=1161)
      0.5 = coord(2/4)
    
    Abstract
    We present a novel unsupervised approach to detecting the compositionality of multi-word expressions. We compute the compositionality of a phrase through substituting the constituent words with their "neighbours" in a semantic vector space and averaging over the distance between the original phrase and the substituted neighbour phrases. Several methods of obtaining neighbours are presented. The results are compared to existing supervised results and achieve state-of-the-art performance on a verb-object dataset of human compositionality ratings.
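    The method above scores a phrase by replacing each constituent word with its distributional neighbours and averaging the similarity between the original phrase and the substituted phrases; a low average suggests the phrase is not compositional. A minimal sketch of that mechanic (additive phrase vectors and random toy embeddings are assumptions made purely for illustration; the paper derives its vectors from corpus data):

      import numpy as np

      def cos(a, b):
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

      def compositionality(w1, w2, neighbours, emb):
          """Average similarity between the phrase vector and phrases in which one
          constituent is replaced by a semantic neighbour."""
          phrase = emb[w1] + emb[w2]  # additive composition, purely for illustration
          substituted = ([emb[n] + emb[w2] for n in neighbours[w1]]
                         + [emb[w1] + emb[n] for n in neighbours[w2]])
          return sum(cos(phrase, s) for s in substituted) / len(substituted)

      rng = np.random.default_rng(0)
      emb = {w: rng.normal(size=50) for w in
             ["red", "car", "crimson", "scarlet", "auto", "vehicle"]}
      neighbours = {"red": ["crimson", "scarlet"], "car": ["auto", "vehicle"]}

      print(round(compositionality("red", "car", neighbours, emb), 3))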
  8. Liu, X.; Turtle, H.: Real-time user interest modeling for real-time ranking (2013) 0.20
    0.19759221 = product of:
      0.26345628 = sum of:
        0.130858 = weight(_text_:vector in 1035) [ClassicSimilarity], result of:
          0.130858 = score(doc=1035,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 1035, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=1035)
        0.08593727 = weight(_text_:space in 1035) [ClassicSimilarity], result of:
          0.08593727 = score(doc=1035,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 1035, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=1035)
        0.046661027 = product of:
          0.09332205 = sum of:
            0.09332205 = weight(_text_:model in 1035) [ClassicSimilarity], result of:
              0.09332205 = score(doc=1035,freq=8.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.50980973 = fieldWeight in 1035, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1035)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    User interest as a very dynamic information need is often ignored in most existing information retrieval systems. In this research, we present the results of experiments designed to evaluate the performance of a real-time interest model (RIM) that attempts to identify the dynamic and changing query level interests regarding social media outputs. Unlike most existing ranking methods, our ranking approach targets calculation of the probability that user interest in the content of the document is subject to very dynamic user interest change. We describe 2 formulations of the model (real-time interest vector space and real-time interest language model) stemming from classical relevance ranking methods and develop a novel methodology for evaluating the performance of RIM using Amazon Mechanical Turk to collect (interest-based) relevance judgments on a daily basis. Our results show that the model usually, although not always, performs better than baseline results obtained from commercial web search engines. We identify factors that affect RIM performance and outline plans for future research.
  9. Li, D.; Kwong, C.-P.: Understanding latent semantic indexing : a topological structure analysis using Q-analysis (2010) 0.18
    0.18009433 = product of:
      0.24012578 = sum of:
        0.130858 = weight(_text_:vector in 3427) [ClassicSimilarity], result of:
          0.130858 = score(doc=3427,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 3427, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=3427)
        0.08593727 = weight(_text_:space in 3427) [ClassicSimilarity], result of:
          0.08593727 = score(doc=3427,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 3427, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=3427)
        0.023330513 = product of:
          0.046661027 = sum of:
            0.046661027 = weight(_text_:model in 3427) [ClassicSimilarity], result of:
              0.046661027 = score(doc=3427,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.25490487 = fieldWeight in 3427, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3427)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    The method of latent semantic indexing (LSI) is well-known for tackling the synonymy and polysemy problems in information retrieval; however, its performance can be very different for various datasets, and the questions of which characteristics of a dataset contribute to this difference, and why, have not been fully answered. In this article, we propose that the mathematical structure of simplexes can be attached to a term-document matrix in the vector space model (VSM) for information retrieval. The Q-analysis devised by R.H. Atkin ([1974]) may then be applied to analyze the topological structure of the simplexes and their corresponding dataset. Experimental results of this analysis reveal that there is a correlation between the effectiveness of LSI and the topological structure of the dataset. By using the information obtained from the topological analysis, we develop a new method to explore the semantic information in a dataset. Experimental results show that our method can enhance the performance of VSM for datasets over which LSI is not effective.
  10. Rehurek, R.; Sojka, P.: Software framework for topic modelling with large corpora (2010) 0.18
    0.18009433 = product of:
      0.24012578 = sum of:
        0.130858 = weight(_text_:vector in 1058) [ClassicSimilarity], result of:
          0.130858 = score(doc=1058,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 1058, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=1058)
        0.08593727 = weight(_text_:space in 1058) [ClassicSimilarity], result of:
          0.08593727 = score(doc=1058,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 1058, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=1058)
        0.023330513 = product of:
          0.046661027 = sum of:
            0.046661027 = weight(_text_:model in 1058) [ClassicSimilarity], result of:
              0.046661027 = score(doc=1058,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.25490487 = fieldWeight in 1058, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1058)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Large corpora are ubiquitous in today's world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.
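    The framework described here is the basis of the open-source gensim library; a minimal usage sketch in the spirit of the document-streaming design the abstract describes (the toy corpus and parameters are illustrative):

      from gensim import corpora, models

      # A corpus can be any iterable that yields one tokenised document at a time,
      # so the collection never needs to fit in memory.
      texts = [
          ["vector", "space", "model", "retrieval"],
          ["latent", "semantic", "analysis", "retrieval"],
          ["topic", "model", "dirichlet", "allocation"],
      ]

      dictionary = corpora.Dictionary(texts)
      bow_corpus = [dictionary.doc2bow(t) for t in texts]

      # Latent Semantic Analysis and Latent Dirichlet Allocation over the streamed corpus.
      lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
      lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2)

      print(lsi[bow_corpus[0]])  # the first document expressed in the latent dimensions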
  11. Choi, Y.: A complete assessment of tagging quality : a consolidated methodology (2015) 0.18
    0.18009433 = product of:
      0.24012578 = sum of:
        0.130858 = weight(_text_:vector in 1730) [ClassicSimilarity], result of:
          0.130858 = score(doc=1730,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 1730, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=1730)
        0.08593727 = weight(_text_:space in 1730) [ClassicSimilarity], result of:
          0.08593727 = score(doc=1730,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 1730, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=1730)
        0.023330513 = product of:
          0.046661027 = sum of:
            0.046661027 = weight(_text_:model in 1730) [ClassicSimilarity], result of:
              0.046661027 = score(doc=1730,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.25490487 = fieldWeight in 1730, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1730)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    This paper presents a methodological discussion of a study of tagging quality in subject indexing. The data analysis in the study was divided into three phases: analysis of indexing consistency, analysis of tagging effectiveness, and analysis of the semantic values of tags. To analyze indexing consistency, this study employed vector space model-based indexing consistency measures. An analysis of tagging effectiveness with tagging exhaustivity and tag specificity was conducted to ameliorate the drawbacks of consistency analysis based only on quantitative measures of vocabulary matching. To further investigate the semantic values of tags at various levels of specificity, a latent semantic analysis (LSA) was conducted. To test the statistical significance of the relation between tag specificity and semantic quality, a correlation analysis was conducted. This research demonstrates the potential of tags for web document indexing with a complete assessment of tagging quality and provides a basis for further study of the strengths and limitations of tagging.
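    Vector space model-based indexing consistency treats each indexer's (or tagger's) term assignments for a document as a vector over a shared vocabulary and measures agreement as vector similarity rather than exact term overlap; a minimal sketch, assuming binary weights and cosine similarity (the vocabulary and function names are illustrative):

      import numpy as np

      def consistency(tags_a, tags_b, vocabulary):
          """Cosine similarity between two taggers' binary term vectors."""
          a = np.array([1.0 if t in tags_a else 0.0 for t in vocabulary])
          b = np.array([1.0 if t in tags_b else 0.0 for t in vocabulary])
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

      vocabulary = ["indexing", "tagging", "folksonomy", "retrieval", "evaluation"]
      print(consistency({"indexing", "tagging", "evaluation"},
                        {"indexing", "folksonomy", "evaluation"},
                        vocabulary))  # ~0.67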
  12. Rubin, V.L.; Lukoianova, T.: Truth and deception at the rhetorical structure level (2015) 0.17
    0.17232636 = product of:
      0.22976847 = sum of:
        0.10904834 = weight(_text_:vector in 1816) [ClassicSimilarity], result of:
          0.10904834 = score(doc=1816,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.3557295 = fieldWeight in 1816, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1816)
        0.101278044 = weight(_text_:space in 1816) [ClassicSimilarity], result of:
          0.101278044 = score(doc=1816,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.40768576 = fieldWeight in 1816, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1816)
        0.019442094 = product of:
          0.03888419 = sum of:
            0.03888419 = weight(_text_:model in 1816) [ClassicSimilarity], result of:
              0.03888419 = score(doc=1816,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.21242073 = fieldWeight in 1816, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1816)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    This paper furthers the development of methods to distinguish truth from deception in textual data. We use rhetorical structure theory (RST) as the analytic framework to identify systematic differences between deceptive and truthful stories in terms of their coherence and structure. A sample of 36 elicited personal stories, self-ranked as truthful or deceptive, is manually analyzed by assigning RST discourse relations among each story's constituent parts. A vector space model (VSM) assesses each story's position in multidimensional RST space with respect to its distance from truthful and deceptive centers as measures of the story's level of deception and truthfulness. Ten human judges evaluate independently whether each story is deceptive and assign their confidence levels (360 evaluations total), producing measures of the expected human ability to recognize deception. As a robustness check, a test sample of 18 truthful stories (with 180 additional evaluations) is used to determine the reliability of our RST-VSM method in determining deception. The contribution is the demonstration of discourse structure analysis as a significant method for automated deception detection and an effective complement to lexico-semantic analysis. The potential lies in developing novel discourse-based tools to alert information users to potential deception in computer-mediated texts.
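    The RST-VSM step above locates each story in a space of discourse-relation features and scores it by its distance from the truthful and deceptive centres; a minimal sketch of that distance-to-centre scoring (the relation set, counts and decision rule are toy assumptions, not the paper's data):

      import numpy as np

      # Toy vectors of RST relation counts, e.g. (Elaboration, Evidence, Contrast, Condition).
      truthful  = np.array([[5., 3., 1., 0.], [4., 4., 0., 1.]])
      deceptive = np.array([[1., 0., 4., 3.], [2., 1., 3., 4.]])

      truth_centre = truthful.mean(axis=0)
      decep_centre = deceptive.mean(axis=0)

      def assess(story):
          """Compare a story's Euclidean distance to the truthful and deceptive centres."""
          d_truth = float(np.linalg.norm(story - truth_centre))
          d_decep = float(np.linalg.norm(story - decep_centre))
          label = "truthful" if d_truth < d_decep else "deceptive"
          return label, d_truth, d_decep

      print(assess(np.array([4., 3., 1., 1.])))  # ('truthful', 1.0, 5.0)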
  13. Cavalcante Dourado, Í.; Galante, R.; Gonçalves, M.A.; Silva Torres, R. de: Bag of textual graphs (BoTG) : a general graph-based text representation model (2019) 0.16
    0.16466019 = product of:
      0.21954691 = sum of:
        0.10904834 = weight(_text_:vector in 5291) [ClassicSimilarity], result of:
          0.10904834 = score(doc=5291,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.3557295 = fieldWeight in 5291, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5291)
        0.07161439 = weight(_text_:space in 5291) [ClassicSimilarity], result of:
          0.07161439 = score(doc=5291,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.28827736 = fieldWeight in 5291, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5291)
        0.03888419 = product of:
          0.07776838 = sum of:
            0.07776838 = weight(_text_:model in 5291) [ClassicSimilarity], result of:
              0.07776838 = score(doc=5291,freq=8.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.42484146 = fieldWeight in 5291, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5291)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Text representation models are the fundamental basis for information retrieval and text mining tasks. Although different text models have been proposed, they typically target specific task aspects in isolation, such as time efficiency, accuracy, or applicability for different scenarios. Here we present Bag of Textual Graphs (BoTG), a general text representation model that addresses these three requirements at the same time. The proposed textual representation is based on a graph-based scheme that encodes term proximity and term ordering, and maps text documents into an efficient vector space that addresses all these aspects and provides discriminative textual patterns. Extensive experiments are conducted in two experimental scenarios, classification and retrieval, considering multiple well-known text collections. We also compare our model against several methods from the literature. Experimental results demonstrate that our model is generic enough to handle different tasks and collections. It is also more efficient than the widely used state-of-the-art methods in textual classification and retrieval tasks, with competitive effectiveness, sometimes with gains by large margins.
  14. Corrêa, C.A.; Kobashi, N.Y.: Automatic indexing and information visualization : a study based on paraconsistent logic (2012) 0.15
    0.15007861 = product of:
      0.20010482 = sum of:
        0.10904834 = weight(_text_:vector in 869) [ClassicSimilarity], result of:
          0.10904834 = score(doc=869,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.3557295 = fieldWeight in 869, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=869)
        0.07161439 = weight(_text_:space in 869) [ClassicSimilarity], result of:
          0.07161439 = score(doc=869,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.28827736 = fieldWeight in 869, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=869)
        0.019442094 = product of:
          0.03888419 = sum of:
            0.03888419 = weight(_text_:model in 869) [ClassicSimilarity], result of:
              0.03888419 = score(doc=869,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.21242073 = fieldWeight in 869, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=869)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    This paper reports on research to evaluate the potential and the effects of using annotated paraconsistent logic in automatic indexing. This logic attempts to deal with contradictions and is concerned with studying and developing inconsistency-tolerant systems of logic. Being flexible and containing logical states that go beyond the dichotomy of yes and no, it permits us to advance the hypothesis that the results of indexing could be better than those obtained by traditional methods. Interactions between different disciplines, such as information retrieval, automatic indexing, information visualization, and nonclassical logics, were considered in this research. From the methodological point of view, an algorithm for the treatment of uncertainty and imprecision, developed under paraconsistent logic, was used to modify the values of the weights assigned to indexing terms of the text collections. The tests were performed on an information visualization system named Projection Explorer (PEx), created at the Institute of Mathematics and Computer Science (ICMC - USP São Carlos), with source code available. PEx uses the traditional vector space model to represent documents of a collection. The results were evaluated by criteria built into the information visualization system itself, and demonstrated measurable gains in the quality of the displays, confirming the hypothesis that the use of the para-analyser under the conditions of the experiment has the ability to generate more effective clusters of similar documents. This is a point that draws attention, since the constitution of more significant clusters can be used to enhance information indexing and retrieval. It can be argued that the adoption of non-dichotomous (non-exclusive) parameters provides new possibilities to relate similar information.
  15. Borodin, Y.; Polishchuk, V.; Mahmud, J.; Ramakrishnan, I.V.; Stent, A.: Live and learn from mistakes : a lightweight system for document classification (2013) 0.15
    0.15007861 = product of:
      0.20010482 = sum of:
        0.10904834 = weight(_text_:vector in 2722) [ClassicSimilarity], result of:
          0.10904834 = score(doc=2722,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.3557295 = fieldWeight in 2722, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2722)
        0.07161439 = weight(_text_:space in 2722) [ClassicSimilarity], result of:
          0.07161439 = score(doc=2722,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.28827736 = fieldWeight in 2722, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2722)
        0.019442094 = product of:
          0.03888419 = sum of:
            0.03888419 = weight(_text_:model in 2722) [ClassicSimilarity], result of:
              0.03888419 = score(doc=2722,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.21242073 = fieldWeight in 2722, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2722)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
  16. Lu, K.; Wolfram, D.: Measuring author research relatedness : a comparison of word-based, topic-based, and author cocitation approaches (2012) 0.15
    0.14872321 = product of:
      0.29744643 = sum of:
        0.15421765 = weight(_text_:vector in 453) [ClassicSimilarity], result of:
          0.15421765 = score(doc=453,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.5030775 = fieldWeight in 453, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=453)
        0.14322878 = weight(_text_:space in 453) [ClassicSimilarity], result of:
          0.14322878 = score(doc=453,freq=8.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.5765547 = fieldWeight in 453, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=453)
      0.5 = coord(2/4)
    
    Abstract
    Relationships between authors based on characteristics of published literature have been studied for decades. Author cocitation analysis using mapping techniques has been most frequently used to study how closely two authors are thought to be in intellectual space based on how members of the research community co-cite their works. Other approaches exist to study author relatedness based more directly on the text of their published works. In this study we present static and dynamic word-based approaches using vector space modeling, as well as a topic-based approach based on latent Dirichlet allocation for mapping author research relatedness. Vector space modeling is used to define an author space consisting of works by a given author. Outcomes for the two word-based approaches and a topic-based approach for 50 prolific authors in library and information science are compared with more traditional author cocitation analysis using multidimensional scaling and hierarchical cluster analysis. The two word-based approaches produced similar outcomes except where two authors were frequent co-authors for the majority of their articles. The topic-based approach produced the most distinctive map.
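    In the word-based approaches above, an "author space" gathers each author's published text into a single vector and research relatedness is measured between those vectors; a minimal sketch, assuming TF-IDF weighting and cosine similarity (scikit-learn and the toy author texts are illustrative choices, not the study's exact setup):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # One concatenated "document" per author, built from that author's works (toy text).
      authors = {
          "author_a": "information retrieval vector space ranking evaluation",
          "author_b": "retrieval ranking query expansion evaluation",
          "author_c": "citation analysis bibliometrics cocitation mapping",
      }

      vectorizer = TfidfVectorizer()
      X = vectorizer.fit_transform(list(authors.values()))
      print(cosine_similarity(X).round(2))  # pairwise author research relatedness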
  17. Jorge-Botana, G.; León, J.A.; Olmos, R.; Hassan-Montero, Y.: Visualizing polysemy using LSA and the predication algorithm (2010) 0.13
    0.13024583 = product of:
      0.26049167 = sum of:
        0.18887727 = weight(_text_:vector in 3696) [ClassicSimilarity], result of:
          0.18887727 = score(doc=3696,freq=6.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.6161416 = fieldWeight in 3696, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3696)
        0.07161439 = weight(_text_:space in 3696) [ClassicSimilarity], result of:
          0.07161439 = score(doc=3696,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.28827736 = fieldWeight in 3696, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3696)
      0.5 = coord(2/4)
    
    Abstract
    Context is a determining factor in language and plays a decisive role in polysemic words. Several psycholinguistically motivated algorithms have been proposed to emulate human management of context, under the assumption that the value of a word is evanescent and takes on meaning only in interaction with other structures. The predication algorithm (Kintsch, [2001]), for example, uses a vector representation of the words produced by LSA (Latent Semantic Analysis) to dynamically simulate the comprehension of predications and even of predicative metaphors. The objective of this study was to predict some unwanted effects that could be present in vector-space models when extracting different meanings of a polysemic word (predominant meaning inundation, lack of precision, and low-level definition), and propose ideas based on the predication algorithm for avoiding them. Our first step was to visualize such unwanted phenomena and also the effect of solutions. We use different methods to extract the meanings for a polysemic word (without context, vector sum, and predication algorithm). Our second step was to conduct an analysis of variance to compare such methods and measure the impact of potential solutions. Results support the idea that a human-based computational algorithm like the predication algorithm can take into account features that ensure more accurate representations of the structures we seek to extract. Theoretical assumptions and their repercussions are discussed.
  18. Xiong, C.: Knowledge based text representations for information retrieval (2016) 0.13
    0.12872586 = product of:
      0.17163447 = sum of:
        0.050406437 = product of:
          0.15121931 = sum of:
            0.15121931 = weight(_text_:3a in 5820) [ClassicSimilarity], result of:
              0.15121931 = score(doc=5820,freq=2.0), product of:
                0.4035973 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.047605187 = queryNorm
                0.3746787 = fieldWeight in 5820, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5820)
          0.33333334 = coord(1/3)
        0.09923182 = weight(_text_:space in 5820) [ClassicSimilarity], result of:
          0.09923182 = score(doc=5820,freq=6.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.39944884 = fieldWeight in 5820, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.03125 = fieldNorm(doc=5820)
        0.021996219 = product of:
          0.043992437 = sum of:
            0.043992437 = weight(_text_:model in 5820) [ClassicSimilarity], result of:
              0.043992437 = score(doc=5820,freq=4.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.24032663 = fieldWeight in 5820, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5820)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    The successes of information retrieval (IR) in recent decades were built upon bag-of-words representations. Effective as it is, bag-of-words is only a shallow text understanding; there is a limited amount of information for document ranking in the word space. This dissertation goes beyond words and builds knowledge based text representations, which embed the external and carefully curated information from knowledge bases, and provide richer and structured evidence for more advanced information retrieval systems. This thesis research first builds query representations with entities associated with the query. Entities' descriptions are used by query expansion techniques that enrich the query with explanation terms. Then we present a general framework that represents a query with entities that appear in the query, are retrieved by the query, or frequently show up in the top retrieved documents. A latent space model is developed to jointly learn the connections from query to entities and the ranking of documents, modeling the external evidence from knowledge bases and internal ranking features cooperatively. To further improve the quality of relevant entities, a defining factor of our query representations, we introduce learning to rank to entity search and retrieve better entities from knowledge bases. In the document representation part, this thesis research also moves one step forward with a bag-of-entities model, in which documents are represented by their automatic entity annotations, and the ranking is performed in the entity space.
    Content
    Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies. See: https://www.cs.cmu.edu/~cx/papers/knowledge_based_text_representation.pdf.
  19. AlQenaei, Z.M.; Monarchi, D.E.: ¬The use of learning techniques to analyze the results of a manual classification system (2016) 0.13
    0.12774785 = product of:
      0.2554957 = sum of:
        0.15421765 = weight(_text_:vector in 2836) [ClassicSimilarity], result of:
          0.15421765 = score(doc=2836,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.5030775 = fieldWeight in 2836, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2836)
        0.101278044 = weight(_text_:space in 2836) [ClassicSimilarity], result of:
          0.101278044 = score(doc=2836,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.40768576 = fieldWeight in 2836, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2836)
      0.5 = coord(2/4)
    
    Abstract
    Classification is the process of assigning objects to pre-defined classes based on observations or characteristics of those objects, and there are many approaches to performing this task. The overall objective of this study is to demonstrate the use of two learning techniques to analyze the results of a manual classification system. Our sample consisted of 1,026 documents, from the ACM Computing Classification System, classified by their authors as belonging to one of the groups of the classification system: "H.3 Information Storage and Retrieval." A singular value decomposition of the documents' weighted term-frequency matrix was used to represent each document in a 50-dimensional vector space. The analysis of the representation using both supervised (decision tree) and unsupervised (clustering) techniques suggests that two pairs of the ACM classes are closely related to each other in the vector space. Class 1 (Content Analysis and Indexing) is closely related to Class 3 (Information Search and Retrieval), and Class 4 (Systems and Software) is closely related to Class 5 (Online Information Services). Further analysis was performed to test the diffusion of the words in the two classes using both cosine and Euclidean distance.
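    The study above represents each document as a 50-dimensional vector obtained from a singular value decomposition of the weighted term-frequency matrix and then applies supervised (decision tree) and unsupervised (clustering) learners to those vectors; a minimal sketch of the representation plus the clustering step (the random stand-in data and scikit-learn calls are assumptions made for illustration):

      import numpy as np
      from sklearn.decomposition import TruncatedSVD
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(42)
      weighted_tf = rng.random((120, 800))   # stand-in for a documents x terms weighted matrix

      svd = TruncatedSVD(n_components=50, random_state=42)
      doc_vectors = svd.fit_transform(weighted_tf)   # 50-dimensional document representations

      clusters = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(doc_vectors)
      print(clusters[:10])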
  20. Sah, M.; Wade, V.: Personalized concept-based search on the Linked Open Data (2015) 0.12
    0.1248948 = product of:
      0.16652639 = sum of:
        0.08723867 = weight(_text_:vector in 2511) [ClassicSimilarity], result of:
          0.08723867 = score(doc=2511,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.2845836 = fieldWeight in 2511, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.03125 = fieldNorm(doc=2511)
        0.05729151 = weight(_text_:space in 2511) [ClassicSimilarity], result of:
          0.05729151 = score(doc=2511,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.23062189 = fieldWeight in 2511, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.03125 = fieldNorm(doc=2511)
        0.021996219 = product of:
          0.043992437 = sum of:
            0.043992437 = weight(_text_:model in 2511) [ClassicSimilarity], result of:
              0.043992437 = score(doc=2511,freq=4.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.24032663 = fieldWeight in 2511, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.03125 = fieldNorm(doc=2511)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    In this paper, we present a novel personalized concept-based search mechanism for the Web of Data based on results categorization. The innovation of the paper comes from combining novel categorization and personalization techniques, and using categorization for providing personalization. In our approach, search results (Linked Open Data resources) are dynamically categorized into Upper Mapping and Binding Exchange Layer (UMBEL) concepts using a novel fuzzy retrieval model. Then, results with the same concepts are grouped together to form categories, which we call concept lenses. Such categorization enables concept-based browsing of the retrieved results aligned to users' intent or interests. When the user selects a concept lens for exploration, results are immediately personalized. In particular, all concept lenses are personally re-organized according to their similarity to the selected lens. Within the selected concept lens, more relevant results are included using results re-ranking and query expansion, and relevant concept lenses are suggested to support results exploration. This allows dynamic adaptation of results to the user's local choices. We also support interactive personalization; when the user clicks on a result, within the interacted lens, relevant lenses and results are included using results re-ranking and query expansion. Extensive evaluations were performed to assess our approach: (i) Performance of our fuzzy-based categorization approach was evaluated on a particular benchmark (~10,000 mappings). The evaluations showed that we can achieve highly acceptable categorization accuracy and perform better than the vector space model. (ii) Personalized search efficacy was assessed using a user study with 32 participants in a tourist domain. The results revealed that our approach performed significantly better than a non-adaptive baseline search. (iii) Dynamic personalization performance was evaluated, which illustrated that our personalization approach is scalable. (iv) Finally, we compared our system with the existing LOD search engines, which showed that our approach is unique.

Languages

  • e 1239
  • d 191
  • i 3
  • f 2
  • a 1
  • es 1
  • hu 1
  • pt 1

Types

  • a 1309
  • el 123
  • m 69
  • s 22
  • x 21
  • r 8
  • b 5
  • n 4
  • i 1
  • z 1
