Diese Datenbank enthält über 40.000 Dokumente zu Themen aus den Bereichen Formalerschließung – Inhaltserschließung – Information Retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft / Powered by litecat, BIS Oldenburg (Stand: 03. März 2020)
1Dang, E.K.F. ; Luk, R.W.P. ; Allan, J.: ¬A context-dependent relevance model.
In: Journal of the Association for Information Science and Technology. 67(2016) no.3, S.582-593.
Abstract: Numerous past studies have demonstrated the effectiveness of the relevance model (RM) for information retrieval (IR). This approach enables relevance or pseudo-relevance feedback to be incorporated within the language modeling framework of IR. In the traditional RM, the feedback information is used to improve the estimate of the query language model. In this article, we introduce an extension of RM in the setting of relevance feedback. Our method provides an additional way to incorporate feedback via the improvement of the document language models. Specifically, we make use of the context information of known relevant and nonrelevant documents to obtain weighted counts of query terms for estimating the document language models. The context information is based on the words (unigrams or bigrams) appearing within a text window centered on query terms. Experiments on several Text REtrieval Conference (TREC) collections show that our context-dependent relevance model can improve retrieval performance over the baseline RM. Together with previous studies within the BM25 framework, our current study demonstrates that the effectiveness of our method for using context information in IR is quite general and not limited to any specific retrieval model.
Inhalt: Vgl.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23419/abstract.
2Dang, E.K.F. ; Luk, R.W.P. ; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights.
In: Journal of the Association for Information Science and Technology. 65(2014) no.6, S.1134-1148.
Abstract: While term independence is a widely held assumption in most of the established information retrieval approaches, it is clearly not true and various works in the past have investigated a relaxation of the assumption. One approach is to use n-grams in document representation instead of unigrams. However, the majority of early works on n-grams obtained only modest performance improvement. On the other hand, the use of information based on supporting terms or "contexts" of queries has been found to be promising. In particular, recent studies showed that using new context-dependent term weights improved the performance of relevance feedback (RF) retrieval compared with using traditional bag-of-words BM25 term weights. Calculation of the new term weights requires an estimation of the local probability of relevance of each query term occurrence. In previous studies, the estimation of this probability was based on unigrams that occur in the neighborhood of a query term. We explore an integration of the n-gram and context approaches by computing context-dependent term weights based on a mixture of unigrams and bigrams. Extensive experiments are performed using the title queries of the Text Retrieval Conference (TREC)-6, TREC-7, TREC-8, and TREC-2005 collections, for RF with relevance judgment of either the top 10 or top 20 documents of an initial retrieval. We identify some crucial elements needed in the use of bigrams in our methods, such as proper inverse document frequency (IDF) weighting of the bigrams and noise reduction by pruning bigrams with large document frequency values. We show that enhancing context-dependent term weights with bigrams is effective in further improving retrieval performance.
3Dang, E.K.F. ; Luk, R.W.P. ; Allan, J. ; Ho, K.S. ; Chung, K.F.L. ; Lee, D.L.: ¬A new context-dependent term weight computed by boost and discount using relevance information.
In: Journal of the American Society for Information Science and Technology. 61(2010) no.12, S.2514-2530.
Abstract: We studied the effectiveness of a new class of context-dependent term weights for information retrieval. Unlike the traditional term frequency-inverse document frequency (TF-IDF), the new weighting of a term t in a document d depends not only on the occurrence statistics of t alone but also on the terms found within a text window (or "document-context") centered on t. We introduce a Boost and Discount (B&D) procedure which utilizes partial relevance information to compute the context-dependent term weights of query terms according to a logistic regression model. We investigate the effectiveness of the new term weights compared with the context-independent BM25 weights in the setting of relevance feedback. We performed experiments with title queries of the TREC-6, -7, -8, and 2005 collections, comparing the residual Mean Average Precision (MAP) measures obtained using B&D term weights and those obtained by a baseline using BM25 weights. Given either 10 or 20 relevance judgments of the top retrieved documents, using the new term weights yields improvement over the baseline for all collections tested. The MAP obtained with the new weights has relative improvement over the baseline by 3.3 to 15.2%, with statistical significance at the 95% confidence level across all four collections.
4Dang, E.K.F. ; Luk, R.W.P. ; Ho, K.S. ; Chan, S.C.F. ; Lee, D.L.: ¬A new measure of clustering effectiveness : algorithms and experimental studies.
In: Journal of the American Society for Information Science and Technology. 59(2008) no.3, S.390-406.
Abstract: We propose a new optimal clustering effectiveness measure, called CS1, based on a combination of clusters rather than selecting a single optimal cluster as in the traditional MK1 measure. For hierarchical clustering, we present an algorithm to compute CS1, defined by seeking the optimal combinations of disjoint clusters obtained by cutting the hierarchical structure at a certain similarity level. By reformulating the optimization to a 0-1 linear fractional programming problem, we demonstrate that an exact solution can be obtained by a linear time algorithm. We further discuss how our approach can be generalized to more general problems involving overlapping clusters, and we show how optimal estimates can be obtained by greedy algorithms.
Themenfeld: Automatisches Klassifizieren
5Lan, K.C. ; Ho, K.S. ; Luk, R.W.P. ; Leong, H.V.: Dialogue act recognition using maximum entropy.
In: Journal of the American Society for Information Science and Technology. 59(2008) no.6, S.859-874.
Abstract: A dialogue-based interface for information systems is considered a potentially very useful approach to information access. A key step in computer processing of natural-language dialogues is dialogue-act (DA) recognition. In this paper, we apply a feature-based classification approach for DA recognition, by using the maximum entropy (ME) method to build a classifier for labeling utterances with DA tags. The ME method has the advantage that a large number of heterogeneous features can be flexibly combined in one classifier, which can facilitate feature selection. A unique characteristic of our approach is that it does not need to model the prior probability of DAs directly, and thus avoids the use of a discourse grammar. This simplifies the implementation of the classifier and improves the efficiency of DA recognition, without sacrificing the classification accuracy. We evaluate the classifier using a large data set based on the Switchboard corpus. Encouraging performance is observed; the highest classification accuracy achieved is 75.03%. We also propose a heuristic to address the problem of sparseness of the data set. This problem has resulted in poor classification accuracies of some DA types that have very low occurrence frequencies in the data set. Preliminary evaluation shows that the method is effective in improving the macroaverage classification accuracy of the ME classifier.
6Wong, W.S. ; Luk, R.W.P. ; Leong, H.V. ; Ho, K.S. ; Lee, D.L.: Re-examining the effects of adding relevance information in a relevance feedback environment.
In: Information processing and management. 44(2008) no.3, S.1086-1116.
Abstract: This paper presents an investigation about how to automatically formulate effective queries using full or partial relevance information (i.e., the terms that are in relevant documents) in the context of relevance feedback (RF). The effects of adding relevance information in the RF environment are studied via controlled experiments. The conditions of these controlled experiments are formalized into a set of assumptions that form the framework of our study. This framework is called idealized relevance feedback (IRF) framework. In our IRF settings, we confirm the previous findings of relevance feedback studies. In addition, our experiments show that better retrieval effectiveness can be obtained when (i) we normalize the term weights by their ranks, (ii) we select weighted terms in the top K retrieved documents, (iii) we include terms in the initial title queries, and (iv) we use the best query sizes for each topic instead of the average best query size where they produce at most five percentage points improvement in the mean average precision (MAP) value. We have also achieved a new level of retrieval effectiveness which is about 55-60% MAP instead of 40+% in the previous findings. This new level of retrieval effectiveness was found to be similar to a level using a TREC ad hoc test collection that is about double the number of documents in the TREC-3 test collection used in previous works.
7Wu, H.C. ; Luk, R.W.P. ; Wong, K.F, ; Kwok, K.L.: ¬A retrospective study of a hybrid document-context based retrieval model.
In: Information processing and management. 43(2007) no.5, S.1308-1331.
Abstract: This paper describes our novel retrieval model that is based on contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes into account of the document contexts instead of implicitly using the document contexts to find query expansion terms. Our model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as the log-odds and uses smoothing techniques as found in language models to solve the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of the parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show that whether the model obeys the probability ranking principle. Our model is promising as its mean average precision is 60-80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed that (a) the operators that are consistent with aggregate relevance principle were effective in combining the estimated preferences, and (b) that estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.
8Luk, R.W.P. ; Leong, H.V. ; Dillon, T.S. ; Chan, A.T.S. ; Croft, W.B. ; Allen, J.: ¬A survey in indexing and searching XML documents.
In: Journal of the American Society for Information Science and technology. 53(2002) no.6, S.415-437.
Abstract: XML holds the promise to yield (1) a more precise search by providing additional information in the elements, (2) a better integrated search of documents from heterogeneous sources, (3) a powerful search paradigm using structural as well as content specifications, and (4) data and information exchange to share resources and to support cooperative search. We survey several indexing techniques for XML documents, grouping them into flatfile, semistructured, and structured indexing paradigms. Searching techniques and supporting techniques for searching are reviewed, including full text search and multistage search. Because searching XML documents can be very flexible, various search result presentations are discussed, as well as database and information retrieval system integration and XML query languages. We also survey various retrieval models, examining how they would be used or extended for retrieving XML documents. To conclude the article, we discuss various open issues that XML poses with respect to information retrieval and database research.