Diese Datenbank enthält über 40.000 Dokumente zu Themen aus den Bereichen Formalerschließung – Inhaltserschließung – Information Retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft / Powered by litecat, BIS Oldenburg (Stand: 28. April 2022)
1Dang, E.K.F. ; Luk, R.W.P. ; Allan, J. ; Ho, K.S. ; Chung, K.F.L. ; Lee, D.L.: ¬A new context-dependent term weight computed by boost and discount using relevance information.
In: Journal of the American Society for Information Science and Technology. 61(2010) no.12, S.2514-2530.
Abstract: We studied the effectiveness of a new class of context-dependent term weights for information retrieval. Unlike the traditional term frequency-inverse document frequency (TF-IDF), the new weighting of a term t in a document d depends not only on the occurrence statistics of t alone but also on the terms found within a text window (or "document-context") centered on t. We introduce a Boost and Discount (B&D) procedure which utilizes partial relevance information to compute the context-dependent term weights of query terms according to a logistic regression model. We investigate the effectiveness of the new term weights compared with the context-independent BM25 weights in the setting of relevance feedback. We performed experiments with title queries of the TREC-6, -7, -8, and 2005 collections, comparing the residual Mean Average Precision (MAP) measures obtained using B&D term weights and those obtained by a baseline using BM25 weights. Given either 10 or 20 relevance judgments of the top retrieved documents, using the new term weights yields improvement over the baseline for all collections tested. The MAP obtained with the new weights has relative improvement over the baseline by 3.3 to 15.2%, with statistical significance at the 95% confidence level across all four collections.
2Li, D. ; Kwong, C.-P. ; Lee, D.L.: Unified linear subspace approach to semantic analysis.
In: Journal of the American Society for Information Science and Technology. 61(2010) no.1, S.175-189.
Abstract: The Basic Vector Space Model (BVSM) is well known in information retrieval. Unfortunately, its retrieval effectiveness is limited because it is based on literal term matching. The Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) are two prominent semantic retrieval methods, both of which assume there is some underlying latent semantic structure in a dataset that can be used to improve retrieval performance. However, while this structure may be derived from both the term space and the document space, GVSM exploits only the former and LSI the latter. In this article, the latent semantic structure of a dataset is examined from a dual perspective; namely, we consider the term space and the document space simultaneously. This new viewpoint has a natural connection to the notion of kernels. Specifically, a unified kernel function can be derived for a class of vector space models. The dual perspective provides a deeper understanding of the semantic space and makes transparent the geometrical meaning of the unified kernel function. New semantic analysis methods based on the unified kernel function are developed, which combine the advantages of LSI and GVSM. We also prove that the new methods are stable because although the selected rank of the truncated Singular Value Decomposition (SVD) is far from the optimum, the retrieval performance will not be degraded significantly. Experiments performed on standard test collections show that our methods are promising.
Themenfeld: Semantisches Umfeld in Indexierung u. Retrieval
Objekt: Latent Semantic Indexing ; Generalized Vector Space Model
3Dang, E.K.F. ; Luk, R.W.P. ; Ho, K.S. ; Chan, S.C.F. ; Lee, D.L.: ¬A new measure of clustering effectiveness : algorithms and experimental studies.
In: Journal of the American Society for Information Science and Technology. 59(2008) no.3, S.390-406.
Abstract: We propose a new optimal clustering effectiveness measure, called CS1, based on a combination of clusters rather than selecting a single optimal cluster as in the traditional MK1 measure. For hierarchical clustering, we present an algorithm to compute CS1, defined by seeking the optimal combinations of disjoint clusters obtained by cutting the hierarchical structure at a certain similarity level. By reformulating the optimization to a 0-1 linear fractional programming problem, we demonstrate that an exact solution can be obtained by a linear time algorithm. We further discuss how our approach can be generalized to more general problems involving overlapping clusters, and we show how optimal estimates can be obtained by greedy algorithms.
Themenfeld: Automatisches Klassifizieren
4Wong, W.S. ; Luk, R.W.P. ; Leong, H.V. ; Ho, K.S. ; Lee, D.L.: Re-examining the effects of adding relevance information in a relevance feedback environment.
In: Information processing and management. 44(2008) no.3, S.1086-1116.
Abstract: This paper presents an investigation about how to automatically formulate effective queries using full or partial relevance information (i.e., the terms that are in relevant documents) in the context of relevance feedback (RF). The effects of adding relevance information in the RF environment are studied via controlled experiments. The conditions of these controlled experiments are formalized into a set of assumptions that form the framework of our study. This framework is called idealized relevance feedback (IRF) framework. In our IRF settings, we confirm the previous findings of relevance feedback studies. In addition, our experiments show that better retrieval effectiveness can be obtained when (i) we normalize the term weights by their ranks, (ii) we select weighted terms in the top K retrieved documents, (iii) we include terms in the initial title queries, and (iv) we use the best query sizes for each topic instead of the average best query size where they produce at most five percentage points improvement in the mean average precision (MAP) value. We have also achieved a new level of retrieval effectiveness which is about 55-60% MAP instead of 40+% in the previous findings. This new level of retrieval effectiveness was found to be similar to a level using a TREC ad hoc test collection that is about double the number of documents in the TREC-3 test collection used in previous works.
5Lee, D.L. ; Ren, L.: Document ranking on weight-partitioned signature files.
In: ACM transactions on information systems. 14(1996) no.2, S.109-137.
Abstract: Proposes the weight partitioned signature file, a signature file organization for supporting document ranking. It uses multiple signature files each corresponding to one term frequency to represent terms with different term frequencies. Words with the same term frequency in a document are grouped together and hased into the signature file corresponding to that term frequency. Investigates the effect of false drops on retrieval effectiveness. Analyses the performance of the weight partitioned signature file under different search strategies and configurations. Obtains an optimal formula for storage allocation to minimise the effect of false drops on document ranks. Analytical results are supported by experiments on document collections
6Lee, D.L.: Massive parallelism on the hybrid text-retrieval machine.
In: Information processing and management. 31(1995) no.6, S.815-830.
Abstract: Discusses the design of a high-performance, cost effective, machine for retrieving textual data, HYTREM. High performance and cost effectiveness are achieved by a combination of low cost hard discs, software filtering techniques, and a large amount of main memory. Focuses on the signature processor, which is based on the partitioned signature file technique, and the mass storage system, which is based on a disc array. Presents a performance evaluation on the individual system components, i.e. the signature processor and the mass storage system, as well as the entire system
7Couvreur, T.R. ; Benzel, R.N. ; Miller, S.F. ; Zeitler, D.N. ; Lee, D.L. ; Singhal, M. ; Shivaratri, N. ; Wong, W.Y.P.: ¬An analysis of performance and cost factors in searching large text databases using parallel search systems.
In: Journal of the American Society for Information Science. 45(1994) no.7, S.443-464.
Abstract: The results of modelling the performance of searching large text databases (>10 GBytes) via various parallel hardware architectures and search algorithms are discussed. The performance under load and the cost of each configuration are compared. Strengths, weaknesses, performance sensitivities, and search features supported for each configuration are also addressed. In addition, a common search workload used in the modelling is described. The search workload is derived from a set of searches run against the Chemical Abstracts file of bibliographic and abstract text available on STN International. This common workload is applied to all configurations modelled to provide a common basis of comparison
Themenfeld: Retrievalalgorithmen ; Volltextretrieval
8Wong, W.Y.P. ; Lee, D.L.: Implementation of partial document ranking using inverted files.
In: Information processing and management. 29(1993) no.5, S.647-669.
Abstract: Examines the implementations of document ranking based on inverted files. Studies three heuristic methods for implementing the term frequency X inverse document frequency weighting strategy. The basic idea of the heuristic methods is to process the query terms in an order so that as many top documents as possible can be identified without processing all of the query terms. The heuristics were evaluated and compared. The results show improved performance. Two methods for estimating the retrieval accuracy were studied. All experiments were based on four test collection made available with the SMART system