Search (62 results, page 1 of 4)

  • theme_ss:"Retrievalalgorithmen"
  1. Soulier, L.; Jabeur, L.B.; Tamine, L.; Bahsoun, W.: On ranking relevant entities in heterogeneous networks using a language-based model (2013) 0.03
    0.0269097 = product of:
      0.0807291 = sum of:
        0.04803372 = weight(_text_:problem in 664) [ClassicSimilarity], result of:
          0.04803372 = score(doc=664,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.23447686 = fieldWeight in 664, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=664)
        0.032695375 = weight(_text_:22 in 664) [ClassicSimilarity], result of:
          0.032695375 = score(doc=664,freq=2.0), product of:
            0.1690115 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04826377 = queryNorm
            0.19345059 = fieldWeight in 664, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=664)
      0.33333334 = coord(2/6)
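
    The score breakdown above is standard Lucene ClassicSimilarity (TF-IDF) explain output. As a minimal sketch, assuming the stock ClassicSimilarity definitions (tf = sqrt(freq), idf = 1 + ln(maxDocs/(docFreq+1)), fieldWeight = tf · idf · fieldNorm, term score = queryWeight · fieldWeight), the figures for hit 1 can be reproduced in a few lines of Python:

    import math

    def tf(freq):                          # ClassicSimilarity: tf = sqrt(freq)
        return math.sqrt(freq)

    def idf(doc_freq, max_docs):           # ClassicSimilarity: 1 + ln(maxDocs/(docFreq+1))
        return 1.0 + math.log(max_docs / (doc_freq + 1))

    def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
        query_weight = idf(doc_freq, max_docs) * query_norm
        field_weight = tf(freq) * idf(doc_freq, max_docs) * field_norm
        return query_weight * field_weight

    # values copied from the explain tree of hit 1 (doc 664)
    QN = 0.04826377                                            # queryNorm
    s_problem = term_score(2.0, 1723, 44218, QN, 0.0390625)    # ~0.04803372
    s_22      = term_score(2.0, 3622, 44218, QN, 0.0390625)    # ~0.032695375
    print((2 / 6) * (s_problem + s_22))                        # coord(2/6) * sum ~ 0.0269097

    The idf figures check out against the tree: 1 + ln(44218/1724) ≈ 4.244485 and 1 + ln(44218/3623) ≈ 3.5018296; fieldNorm encodes the field-length normalisation.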
    
    Abstract
    A new challenge, accessing multiple relevant entities, arises from the availability of linked heterogeneous data. In this article, we address more specifically the problem of accessing relevant entities, such as publications and authors within a bibliographic network, given an information need. We propose a novel algorithm, called BibRank, that estimates a joint relevance of documents and authors within a bibliographic network. This model ranks each type of entity using a score propagation algorithm with respect to the query topic and the structure of the underlying bi-type information entity network. Evidence sources, namely content-based and network-based scores, are both used to estimate the topical similarity between connected entities. For this purpose, authorship relationships are analyzed through a language model-based score on the one hand, and non-topically related entities of the same type are detected through marginal citations on the other. The article reports the results of experiments using the BibRank algorithm for an information retrieval task. The CiteSeerX bibliographic data set forms the basis for the automatic generation and evaluation of topical queries. We show that a statistically significant improvement over closely related ranking models is achieved.
    Date
    22. 3.2013 19:34:49
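    The abstract of hit 1 does not give BibRank's exact update rules; as a rough illustration of score propagation over a bi-type document-author network, here is a generic sketch (the damping factor alpha, the uniform edge weights, and the initial content-based scores are all assumptions, not the published model):

    def propagate(doc_score, auth_score, authorship, alpha=0.5, iters=20):
        """doc_score/auth_score: dict id -> content-based score for the query.
        authorship: list of (doc_id, author_id) edges in the bi-type network."""
        auths_of, docs_of = {}, {}
        for doc, auth in authorship:
            auths_of.setdefault(doc, []).append(auth)
            docs_of.setdefault(auth, []).append(doc)
        d, a = dict(doc_score), dict(auth_score)
        for _ in range(iters):
            # each entity keeps part of its own content-based evidence and
            # receives the rest from its neighbours of the other type
            d_new = {i: alpha * doc_score[i] + (1 - alpha) *
                        sum(a[x] for x in auths_of.get(i, [])) /
                        max(len(auths_of.get(i, [])), 1)
                     for i in d}
            a_new = {j: alpha * auth_score[j] + (1 - alpha) *
                        sum(d[x] for x in docs_of.get(j, [])) /
                        max(len(docs_of.get(j, [])), 1)
                     for j in a}
            d, a = d_new, a_new
        return d, a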
  2. Maron, M.E.: ¬An historical note on the origins of probabilistic indexing (2008) 0.02
    0.018114652 = product of:
      0.108687915 = sum of:
        0.108687915 = weight(_text_:problem in 2047) [ClassicSimilarity], result of:
          0.108687915 = score(doc=2047,freq=4.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.5305606 = fieldWeight in 2047, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0625 = fieldNorm(doc=2047)
      0.16666667 = coord(1/6)
    
    Abstract
    The motivation behind "Probabilistic Indexing" was to replace two-valued thinking about information retrieval with probabilistic notions. This involved a new view of the information retrieval problem - viewing it as a problem of inference and prediction, and introducing probabilistically weighted indexes and probabilistically ranked output. These ideas were first formulated and written up in August 1958.
  3. Voorhees, E.M.: Implementing agglomerative hierarchic clustering algorithms for use in document retrieval (1986) 0.02
    0.017437533 = product of:
      0.104625195 = sum of:
        0.104625195 = weight(_text_:22 in 402) [ClassicSimilarity], result of:
          0.104625195 = score(doc=402,freq=2.0), product of:
            0.1690115 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04826377 = queryNorm
            0.61904186 = fieldWeight in 402, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.125 = fieldNorm(doc=402)
      0.16666667 = coord(1/6)
    
    Source
    Information processing and management. 22(1986) no.6, S.465-476
  4. Sachs, W.M.: ¬An approach to associative retrieval through the theory of fuzzy sets (1976) 0.02
    0.016011242 = product of:
      0.09606744 = sum of:
        0.09606744 = weight(_text_:problem in 7) [ClassicSimilarity], result of:
          0.09606744 = score(doc=7,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.46895373 = fieldWeight in 7, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.078125 = fieldNorm(doc=7)
      0.16666667 = coord(1/6)
    
    Abstract
    The theory of fuzzy sets is used to provide a rigorous formulation of the problem of associative retrieval. This formulation suggests the idea of using fuzzy clustering to organize data for retrieval.
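    As a minimal sketch of the fuzzy-set view (the membership grades below are invented; the paper's construction is more elaborate): each index term defines a fuzzy set over documents, and the standard min operator gives a graded AND.

    index = {                     # term -> fuzzy set of documents (membership grades)
        "fuzzy":     {"d1": 0.9, "d2": 0.3},
        "retrieval": {"d1": 0.7, "d2": 0.8, "d3": 0.5},
    }

    def rsv_and(terms, doc):      # fuzzy AND: minimum membership grade
        return min(index[t].get(doc, 0.0) for t in terms)

    docs = set().union(*index.values())
    query = ["fuzzy", "retrieval"]
    for d in sorted(docs, key=lambda d: rsv_and(query, d), reverse=True):
        print(d, rsv_and(query, d))   # d1 0.7, d2 0.3, d3 0.0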
  5. Cheng, C.-S.; Chung, C.-P.; Shann, J.J.-J.: Fast query evaluation through document identifier assignment for inverted file-based information retrieval systems (2006) 0.02
    0.016011242 = product of:
      0.09606744 = sum of:
        0.09606744 = weight(_text_:problem in 979) [ClassicSimilarity], result of:
          0.09606744 = score(doc=979,freq=8.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.46895373 = fieldWeight in 979, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=979)
      0.16666667 = coord(1/6)
    
    Abstract
    Compressing an inverted file can greatly improve query performance of an information retrieval system (IRS) by reducing disk I/Os. We observe that a good document identifier assignment (DIA) can make the document identifiers in the posting lists more clustered, and result in better compression as well as shorter query processing time. In this paper, we tackle the NP-complete problem of finding an optimal DIA to minimize the average query processing time in an IRS when the probability distribution of query terms is given. We indicate that the greedy nearest neighbor (Greedy-NN) algorithm can provide excellent performance for this problem. However, the Greedy-NN algorithm is inappropriate if used in large-scale IRSs, due to its high complexity O(N² × n), where N denotes the number of documents and n denotes the number of distinct terms. In real-world IRSs, the distribution of query terms is skewed. Based on this fact, we propose a fast O(N × n) heuristic, called the partition-based document identifier assignment (PBDIA) algorithm, which can efficiently assign consecutive document identifiers to those documents containing frequently used query terms, and improve the compression efficiency of the posting lists for those terms. This can result in reduced query processing time. The experimental results show that the PBDIA algorithm can yield competitive performance versus Greedy-NN for the DIA problem, and that this optimization brings significant advantages for both long queries and parallel information retrieval (IR).
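    A toy sketch of the underlying idea (an illustration of the principle, not the published PBDIA algorithm): give consecutive identifiers to documents that contain frequently queried terms, so their posting lists become runs of small d-gaps that compress well.

    def assign_ids(docs, frequent_terms):
        """docs: dict doc_name -> set of terms. Returns doc_name -> new id."""
        hot  = sorted(d for d, terms in docs.items() if terms & frequent_terms)
        cold = sorted(d for d in docs if d not in set(hot))
        return {d: i for i, d in enumerate(hot + cold)}

    def d_gaps(ids):              # first id, then gaps between consecutive ids
        ids = sorted(ids)
        return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

    docs = {"a": {"x", "q"}, "b": {"y"}, "c": {"q"}, "d": {"z"}, "e": {"q", "y"}}
    ids = assign_ids(docs, frequent_terms={"q"})
    posting_q = [ids[d] for d, terms in docs.items() if "q" in terms]
    print(d_gaps(posting_q))      # [0, 1, 1]: consecutive ids, minimal gaps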
  6. Na, S.-H.; Kang, I.-S.; Roh, J.-E.; Lee, J.-H.: ¬An empirical study of query expansion and cluster-based retrieval in language modeling approach (2007) 0.02
    0.015850322 = product of:
      0.09510193 = sum of:
        0.09510193 = weight(_text_:problem in 906) [ClassicSimilarity], result of:
          0.09510193 = score(doc=906,freq=4.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.46424055 = fieldWeight in 906, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0546875 = fieldNorm(doc=906)
      0.16666667 = coord(1/6)
    
    Abstract
    Term mismatch is a critical problem in information retrieval, and several techniques have been developed to resolve it, such as query expansion, cluster-based retrieval and dimensionality reduction. Of these techniques, this paper performs an empirical study on query expansion and cluster-based retrieval. We examine the effect of using parsimony in query expansion and the effect of clustering algorithms in cluster-based retrieval. In addition, query expansion and cluster-based retrieval are compared, and their combinations are evaluated in terms of retrieval performance by performing experiments on seven test collections of NTCIR and TREC.
  7. Sánchez-de-Madariaga, R.; Fernández-del-Castillo, J.R.: ¬The bootstrapping of the Yarowsky algorithm in real corpora (2009) 0.02
    0.015850322 = product of:
      0.09510193 = sum of:
        0.09510193 = weight(_text_:problem in 2451) [ClassicSimilarity], result of:
          0.09510193 = score(doc=2451,freq=4.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.46424055 = fieldWeight in 2451, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2451)
      0.16666667 = coord(1/6)
    
    Abstract
    The Yarowsky bootstrapping algorithm resolves the homograph-level word sense disambiguation (WSD) problem, which is the sense granularity level required for real natural language processing (NLP) applications. At the same time it resolves the knowledge acquisition bottleneck problem affecting most WSD algorithms, and it can easily be applied to foreign-language corpora. However, this paper shows that the Yarowsky algorithm is significantly less accurate when applied to real, domain-fluctuating corpora. This paper also introduces a new bootstrapping methodology that performs much better when applied to these corpora. The accuracy achieved on corpora without domain fluctuation is not reached, owing to inherent domain-fluctuation ambiguities.
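    As a minimal sketch of Yarowsky-style bootstrapping (toy data and thresholds; the paper's modified methodology differs): start from one seed collocation per sense, label whatever the current rules cover, then promote collocations that co-occur almost exclusively with one sense, and repeat.

    def bootstrap(contexts, seeds, rounds=5, purity=0.9):
        """contexts: list of sets of context words; seeds: {collocation: sense}."""
        rules = dict(seeds)
        for _ in range(rounds):
            # label each occurrence with the first matching rule, if any
            labels = [next((rules[w] for w in c if w in rules), None) for c in contexts]
            counts = {}
            for c, sense in zip(contexts, labels):
                if sense is None:
                    continue
                for w in c:
                    counts.setdefault(w, {}).setdefault(sense, 0)
                    counts[w][sense] += 1
            for w, by_sense in counts.items():
                best = max(by_sense, key=by_sense.get)
                if by_sense[best] / sum(by_sense.values()) >= purity:
                    rules[w] = best       # promote a (nearly) unambiguous collocation
        return rules

    rules = bootstrap(
        [{"plant", "factory"}, {"plant", "leaf"}, {"factory", "workers"}],
        seeds={"factory": "industrial", "leaf": "botanical"})
    print(rules)   # "plant" itself stays unlabeled: it occurs with both senses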
  8. Smeaton, A.F.; Rijsbergen, C.J. van: ¬The retrieval effects of query expansion on a feedback document retrieval system (1983) 0.02
    0.015257841 = product of:
      0.09154704 = sum of:
        0.09154704 = weight(_text_:22 in 2134) [ClassicSimilarity], result of:
          0.09154704 = score(doc=2134,freq=2.0), product of:
            0.1690115 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04826377 = queryNorm
            0.5416616 = fieldWeight in 2134, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.109375 = fieldNorm(doc=2134)
      0.16666667 = coord(1/6)
    
    Date
    30. 3.2001 13:32:22
  9. Back, J.: ¬An evaluation of relevancy ranking techniques used by Internet search engines (2000) 0.02
    0.015257841 = product of:
      0.09154704 = sum of:
        0.09154704 = weight(_text_:22 in 3445) [ClassicSimilarity], result of:
          0.09154704 = score(doc=3445,freq=2.0), product of:
            0.1690115 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04826377 = queryNorm
            0.5416616 = fieldWeight in 3445, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.109375 = fieldNorm(doc=3445)
      0.16666667 = coord(1/6)
    
    Date
    25. 8.2005 17:42:22
  10. Fuhr, N.: Ranking-Experimente mit gewichteter Indexierung (1986) 0.01
    0.013078149 = product of:
      0.0784689 = sum of:
        0.0784689 = weight(_text_:22 in 58) [ClassicSimilarity], result of:
          0.0784689 = score(doc=58,freq=2.0), product of:
            0.1690115 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04826377 = queryNorm
            0.46428138 = fieldWeight in 58, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.09375 = fieldNorm(doc=58)
      0.16666667 = coord(1/6)
    
    Date
    14. 6.2015 22:12:44
  11. Fuhr, N.: Rankingexperimente mit gewichteter Indexierung (1986) 0.01
    0.013078149 = product of:
      0.0784689 = sum of:
        0.0784689 = weight(_text_:22 in 2051) [ClassicSimilarity], result of:
          0.0784689 = score(doc=2051,freq=2.0), product of:
            0.1690115 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04826377 = queryNorm
            0.46428138 = fieldWeight in 2051, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.09375 = fieldNorm(doc=2051)
      0.16666667 = coord(1/6)
    
    Date
    14. 6.2015 22:12:56
  12. Frants, V.I.; Shapiro, J.: Control and feedback in a documentary information retrieval system (1991) 0.01
    0.0128089925 = product of:
      0.07685395 = sum of:
        0.07685395 = weight(_text_:problem in 416) [ClassicSimilarity], result of:
          0.07685395 = score(doc=416,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.375163 = fieldWeight in 416, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0625 = fieldNorm(doc=416)
      0.16666667 = coord(1/6)
    
    Abstract
    The problem of control in documentary information retrieval systems is analysed, and it is shown why an IR system has to be regarded as an adaptive system. Algorithms of feedback are proposed, and it is shown how they depend on the type of document collection: static (no change in the collection between searches) or dynamic (changes occur between searches). The proposed algorithms are the basis for the development of fully automated information retrieval systems.
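    The abstract does not spell out the proposed feedback algorithms; as a generic illustration of feedback in a retrieval loop, here is the classic Rocchio query reformulation (a different, well-known technique; the alpha/beta/gamma weights are conventional defaults, not taken from the paper):

    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """All vectors are dicts term -> weight; returns the reformulated query."""
        terms = (set(query)
                 | {t for d in relevant for t in d}
                 | {t for d in nonrelevant for t in d})
        def centroid(docs, t):
            return sum(d.get(t, 0.0) for d in docs) / len(docs) if docs else 0.0
        return {t: alpha * query.get(t, 0.0)
                   + beta * centroid(relevant, t)      # pull toward relevant docs
                   - gamma * centroid(nonrelevant, t)  # push away from nonrelevant
                for t in terms}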
  13. Uratani, N.; Takeda, M.: ¬A fast string-searching algorithm for multiple patterns (1993) 0.01
    0.0128089925 = product of:
      0.07685395 = sum of:
        0.07685395 = weight(_text_:problem in 6275) [ClassicSimilarity], result of:
          0.07685395 = score(doc=6275,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.375163 = fieldWeight in 6275, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0625 = fieldNorm(doc=6275)
      0.16666667 = coord(1/6)
    
    Abstract
    The string-searching problem is to find all occurrences of pattern(s) in a text string. The Aho-Corasick string-searching algorithm simultaneously finds all occurrences of multiple patterns in one pass through the text. The Boyer-Moore algorithm is the fastest algorithm for a single pattern. By combining the ideas of these two algorithms, the authors present an efficient string-searching algorithm for multiple patterns. The algorithm runs in sublinear time on average, as the BM algorithm does, and its preprocessing time is linear in the sum of the lengths of the patterns, like that of the AC algorithm.
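    The paper's combined AC-BM algorithm is not reproduced here; as the baseline it builds on, the following is a compact Aho-Corasick matcher (goto trie plus BFS-built failure links) that reports every occurrence of every pattern in one pass over the text:

    from collections import deque

    def build(patterns):
        trie = [{}]; fail = [0]; out = [[]]        # goto, failure, output functions
        for p in patterns:                         # insert each pattern into the trie
            node = 0
            for ch in p:
                if ch not in trie[node]:
                    trie[node][ch] = len(trie)
                    trie.append({}); fail.append(0); out.append([])
                node = trie[node][ch]
            out[node].append(p)
        queue = deque(trie[0].values())            # BFS from depth-1 nodes
        while queue:
            u = queue.popleft()
            for ch, v in trie[u].items():
                queue.append(v)
                f = fail[u]
                while f and ch not in trie[f]:     # follow failures until ch matches
                    f = fail[f]
                fail[v] = trie[f].get(ch, 0) if trie[f].get(ch, 0) != v else 0
                out[v] += out[fail[v]]             # inherit outputs of the suffix state
        return trie, fail, out

    def search(text, patterns):
        trie, fail, out = build(patterns)
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in trie[node]:
                node = fail[node]
            node = trie[node].get(ch, 0)
            hits += [(i - len(p) + 1, p) for p in out[node]]
        return hits

    print(search("ushers", ["he", "she", "his", "hers"]))
    # [(1, 'she'), (2, 'he'), (2, 'hers')]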
  14. Smith, M.; Smith, M.P.; Wade, S.J.: Applying genetic programming to the problem of term weight algorithms (1995) 0.01
    0.0128089925 = product of:
      0.07685395 = sum of:
        0.07685395 = weight(_text_:problem in 5803) [ClassicSimilarity], result of:
          0.07685395 = score(doc=5803,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.375163 = fieldWeight in 5803, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0625 = fieldNorm(doc=5803)
      0.16666667 = coord(1/6)
    
  15. Carpineto, C.; Romano, G.: Order-theoretical ranking (2000) 0.01
    0.011321658 = product of:
      0.067929946 = sum of:
        0.067929946 = weight(_text_:problem in 4766) [ClassicSimilarity], result of:
          0.067929946 = score(doc=4766,freq=4.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.33160037 = fieldWeight in 4766, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4766)
      0.16666667 = coord(1/6)
    
    Abstract
    Current best-match ranking (BMR) systems perform well but cannot handle word mismatch between a query and a document. The best-known alternative ranking method, hierarchical clustering-based ranking (HCR), seems to be more robust than BMR with respect to this problem, but it is hampered by theoretical and practical limitations. We present an approach to document ranking that explicitly addresses the word mismatch problem by exploiting interdocument similarity information in a novel way. Document ranking is seen as a query-document transformation driven by a conceptual representation of the whole document collection, into which the query is merged. Our approach is based on the theory of concept (or Galois) lattices, which, we argue, provides a powerful, well-founded, and computationally tractable framework to model the space in which documents and query are represented and to compute such a transformation. We compared information retrieval using concept lattice-based ranking (CLR) to BMR and HCR. The results showed that HCR was outperformed by CLR as well as BMR, and suggested that, of the two best methods, BMR achieved better performance than CLR on the whole document set, whereas CLR compared more favorably when only the first retrieved documents were used for evaluation. We also evaluated the three methods' specific ability to rank documents that did not match the query, in which case the superiority of CLR over BMR and HCR was apparent.
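    As a minimal illustration of the formal (Galois) concept lattice underlying CLR, the following enumerates all formal concepts of a tiny invented document-term incidence relation by brute force (real lattice construction is far more efficient, and CLR itself adds the ranking transformation):

    from itertools import combinations

    docs = {"d1": {"ranking", "lattice"}, "d2": {"ranking", "cluster"},
            "d3": {"lattice", "cluster"}}
    terms = set().union(*docs.values())

    def extent(ts):        # documents containing every term in ts
        return {d for d, dt in docs.items() if ts <= dt}

    def intent(ds):        # terms shared by every document in ds
        return set.intersection(*(docs[d] for d in ds)) if ds else set(terms)

    concepts = set()
    for r in range(len(terms) + 1):
        for ts in combinations(sorted(terms), r):
            e = extent(set(ts))
            concepts.add((frozenset(e), frozenset(intent(e))))   # closed pair
    for e, i in sorted(concepts, key=lambda c: -len(c[0])):
        print(sorted(e), sorted(i))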
  16. Lewandowski, D.: How can library materials be ranked in the OPAC? (2009) 0.01
    0.011321658 = product of:
      0.067929946 = sum of:
        0.067929946 = weight(_text_:problem in 2810) [ClassicSimilarity], result of:
          0.067929946 = score(doc=2810,freq=4.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.33160037 = fieldWeight in 2810, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2810)
      0.16666667 = coord(1/6)
    
    Abstract
    Some Online Public Access Catalogues offer a ranking component. However, ranking there is merely text-based and is doomed to fail due to the limited text in bibliographic data. The main assumption of the talk is that we are in a situation where the appropriate ranking factors for OPACs should be defined, while the implementation is no major problem. We must define what we want, and not focus so much on the technical work. Some deep thinking is necessary on the "perfect results set" and how we can achieve it through ranking. The talk presents a set of potential ranking factors and clustering possibilities for further discussion. A look at commercial Web search engines could provide us with ideas on how ranking can be improved with additional factors. Search engines are way beyond pure text-based ranking and apply ranking factors in groups such as popularity, freshness, personalisation, etc. The talk describes the main factors used in search engines and how derivatives of these could be used for libraries' purposes. The goal of ranking is to provide the user with the best-suited results at the top of the results list. How can this goal be achieved with the library catalogue, and also with respect to the library's different collections and databases? The assumption is that ranking of such materials is a complex problem that is as yet nowhere near solved. Libraries should focus on ranking to improve the user experience.
  17. Wei, F.; Li, W.; Liu, S.: iRANK: a rank-learn-combine framework for unsupervised ensemble ranking (2010) 0.01
    0.011321658 = product of:
      0.067929946 = sum of:
        0.067929946 = weight(_text_:problem in 3472) [ClassicSimilarity], result of:
          0.067929946 = score(doc=3472,freq=4.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.33160037 = fieldWeight in 3472, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3472)
      0.16666667 = coord(1/6)
    
    Abstract
    The authors address the problem of unsupervised ensemble ranking. Traditional approaches either combine multiple ranking criteria into a unified representation to obtain an overall ranking score or utilize certain rank fusion or aggregation techniques to combine the ranking results. Beyond the aforementioned combine-then-rank and rank-then-combine approaches, the authors propose a novel rank-learn-combine ranking framework, called Interactive Ranking (iRANK), which allows two base rankers to teach each other before combination during the ranking process by providing their own ranking results as feedback to the other, so as to boost the ranking performance. This mutual ranking refinement process continues until the two base rankers cannot learn from each other any more. The overall performance is improved by the enhancement of the base rankers through the mutual learning mechanism. The authors further design two ranking refinement strategies to use the feedback efficiently and effectively, based on reasonable assumptions and rational analysis. Although iRANK is applicable to many applications, as a case study, they apply this framework to the sentence ranking problem in query-focused summarization and evaluate its effectiveness on the DUC 2005 and 2006 data sets. The results are encouraging, with consistent and promising improvements.
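    iRANK's two refinement strategies are more elaborate than shown here; as a schematic sketch of the rank-learn-combine loop, two base rankers repeatedly absorb each other's scores until neither ranking changes (the interpolation weight w and the final averaging are assumptions):

    def irank(scores_a, scores_b, w=0.7, max_rounds=10):
        """scores_a/scores_b: dict item -> initial score of each base ranker."""
        a, b = dict(scores_a), dict(scores_b)
        order = lambda s: tuple(sorted(s, key=s.get, reverse=True))
        for _ in range(max_rounds):
            prev = (order(a), order(b))
            a = {i: w * a[i] + (1 - w) * b[i] for i in a}   # A learns from B's feedback
            b = {i: w * b[i] + (1 - w) * a[i] for i in b}   # B learns from A's feedback
            if (order(a), order(b)) == prev:                # nothing left to learn
                break
        return {i: (a[i] + b[i]) / 2 for i in a}            # combine the refined rankers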
  18. Ozdemiray, A.M.; Altingovde, I.S.: Explicit search result diversification using score and rank aggregation methods (2015) 0.01
    0.011321658 = product of:
      0.067929946 = sum of:
        0.067929946 = weight(_text_:problem in 1856) [ClassicSimilarity], result of:
          0.067929946 = score(doc=1856,freq=4.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.33160037 = fieldWeight in 1856, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1856)
      0.16666667 = coord(1/6)
    
    Abstract
    Search result diversification is one of the key techniques to cope with the ambiguous and underspecified information needs of web users. In the last few years, strategies that are based on the explicit knowledge of query aspects emerged as highly effective ways of diversifying search results. Our contributions in this article are twofold. First, we extensively evaluate the performance of a state-of-the-art explicit diversification strategy and pinpoint its potential weaknesses. We propose basic yet novel optimizations to remedy these weaknesses and boost the performance of this algorithm. As a second contribution, inspired by the success of the current diversification strategies that exploit the relevance of the candidate documents to individual query aspects, we cast the diversification problem into the problem of ranking aggregation. To this end, we propose to materialize the re-rankings of the candidate documents for each query aspect and then merge these rankings by adapting the score(-based) and rank(-based) aggregation methods. Our extensive experimental evaluations show that certain ranking aggregation methods are superior to existing explicit diversification strategies in terms of diversification effectiveness. Furthermore, these ranking aggregation methods have lower computational complexity than the state-of-the-art diversification strategies.
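    As a sketch of the rank-based aggregation step, here is a plain Borda count over the per-aspect re-rankings (uniform aspect weights; the paper adapts several score-based and rank-based methods):

    def borda(rankings):
        """rankings: list of lists, each a ranking of the same candidate docs."""
        scores = {}
        for ranking in rankings:
            n = len(ranking)
            for pos, doc in enumerate(ranking):
                scores[doc] = scores.get(doc, 0) + (n - pos)   # top rank earns most
        return sorted(scores, key=scores.get, reverse=True)

    # one re-ranking per query aspect
    print(borda([["d1", "d2", "d3"], ["d2", "d3", "d1"], ["d2", "d1", "d3"]]))
    # ['d2', 'd1', 'd3']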
  19. Liu, X.; Zheng, W.; Fang, H.: ¬An exploration of ranking models and feedback method for related entity finding (2013) 0.01
    0.011321658 = product of:
      0.067929946 = sum of:
        0.067929946 = weight(_text_:problem in 2714) [ClassicSimilarity], result of:
          0.067929946 = score(doc=2714,freq=4.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.33160037 = fieldWeight in 2714, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2714)
      0.16666667 = coord(1/6)
    
    Abstract
    Most existing search engines focus on document retrieval. However, information needs are certainly not limited to finding relevant documents. Instead, a user may want to find relevant entities such as persons and organizations. In this paper, we study the problem of related entity finding. Our goal is to rank entities based on their relevance to a structured query, which specifies an input entity, the type of related entities and the relation between the input and related entities. We first discuss a general probabilistic framework, derive six possible retrieval models to rank the related entities, and then compare these models both analytically and empirically. To further improve performance, we study the problem of feedback in the context of related entity finding. Specifically, we propose a mixture model based feedback method that can utilize the pseudo feedback entities to estimate an enriched model for the relation between the input and related entities. Experimental results over two standard TREC collections show that the derived relation generation model combined with a relation feedback method performs better than other models.
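    A minimal sketch of the mixture-style feedback step described above, assuming simple linear interpolation (the lambda weight and the maximum-likelihood estimation are assumptions, not the paper's exact estimator):

    def estimate_feedback_lm(pseudo_entities):
        """ML word distribution over the descriptions of pseudo-feedback entities."""
        words = [w for entity in pseudo_entities for w in entity]
        return {w: words.count(w) / len(words) for w in set(words)}

    def enrich(relation_lm, feedback_lm, lam=0.5):
        """Mixture of the original relation model and the feedback model."""
        vocab = set(relation_lm) | set(feedback_lm)
        return {w: lam * relation_lm.get(w, 0.0)
                   + (1 - lam) * feedback_lm.get(w, 0.0)
                for w in vocab}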
  20. French, J.C.; Powell, A.L.; Schulman, E.: Using clustering strategies for creating authority files (2000) 0.01
    0.009606745 = product of:
      0.05764047 = sum of:
        0.05764047 = weight(_text_:problem in 4811) [ClassicSimilarity], result of:
          0.05764047 = score(doc=4811,freq=2.0), product of:
            0.20485485 = queryWeight, product of:
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.04826377 = queryNorm
            0.28137225 = fieldWeight in 4811, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.244485 = idf(docFreq=1723, maxDocs=44218)
              0.046875 = fieldNorm(doc=4811)
      0.16666667 = coord(1/6)
    
    Abstract
    As more online databases are integrated into digital libraries, the issue of quality control of the data becomes increasingly important, especially as it relates to the effective retrieval of information. Authority work, the need to discover and reconcile variant forms of strings in bibliographical entries, will become more critical in the future. Spelling variants, misspellings, and transliteration differences will all increase the difficulty of retrieving information. We investigate a number of approximate string matching techniques that have traditionally been used to help with this problem. We then introduce the notion of approximate word matching and show how it can be used to improve detection and categorization of variant forms. We demonstrate the utility of these approaches using data from the Astrophysics Data System and show how we can reduce the human effort involved in the creation of authority files
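    A minimal sketch of the approximate-string-matching side of authority work (the threshold and the examples are invented; the paper's approximate word matching goes further): flag name strings as probable variants when their edit distance is small relative to their length.

    def edit_distance(a, b):                 # Levenshtein distance, O(len(a)*len(b))
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,            # deletion
                               cur[j - 1] + 1,         # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def probable_variants(name, candidates, rel_threshold=0.2):
        return [c for c in candidates
                if edit_distance(name.lower(), c.lower())
                   <= rel_threshold * max(len(name), len(c))]

    print(probable_variants("Tschaikowsky", ["Tchaikovsky", "Stravinsky"]))
    # ['Tchaikovsky']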

Languages

  • e 57
  • d 5

Types

  • a 55
  • m 4
  • el 2
  • r 1
  • s 1