Diese Datenbank enthält über 40.000 Dokumente zu Themen aus den Bereichen Formalerschließung – Inhaltserschließung – Information Retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft / Powered by litecat, BIS Oldenburg (Stand: 04. Juni 2021)
1Hawking, D. ; Zobel, J.: Does topic metadata help with Web search?.
In: Journal of the American Society for Information Science and Technology. 58(2007) no.5, S.613-628.
Abstract: It has been claimed that topic metadata can be used to improve the accuracy of text searches. Here, we test this claim by examining the contribution of metadata to effective searching within Web sites published by a university with a strong commitment to and substantial investment in metadata. The authors use four sets of queries, a total of 463, extracted from the university's official query logs and from the university's site map. The results are clear: The available metadata is of little value in ranking answers to those queries. A follow-up experiment with the Web sites published in a particular government jurisdiction confirms that this conclusion is not specific to the particular university. Examination of the metadata present at the university reveals that, in addition to implementation deficiencies, there are inherent problems in trying to use subject and description metadata to enhance the searchability of Web sites. Our experiments show that link anchor text, which can be regarded as metadata created by others, is much more effective in identifying best answers to queries than other textual evidence. Furthermore, query-independent evidence such as link counts and uniform resource locator (URL) length, unlike subject and description metadata, can substantially improve baseline performance.
2Shokouhi, M. ; Zobel, J. ; Tahaghoghi, S. ; Scholer, F.: Using query logs to establish vocabularies in distributed information retrieval.
In: Information processing and management. 43(2007) no.1, S.169-180.
Abstract: Users of search engines express their needs as queries, typically consisting of a small number of terms. The resulting search engine query logs are valuable resources that can be used to predict how people interact with the search system. In this paper, we introduce two novel applications of query logs, in the context of distributed information retrieval. First, we use query log terms to guide sampling from uncooperative distributed collections. We show that while our sampling strategy is at least as efficient as current methods, it consistently performs better. Second, we propose and evaluate a pruning strategy that uses query log information to eliminate terms. Our experiments show that our proposed pruning method maintains the accuracy achieved by complete indexes, while decreasing the index size by up to 60%. While such pruning may not always be desirable in practice, it provides a useful benchmark against which other pruning strategies can be measured.
3Uitdenbogerd, A.L. ; Zobel, J.: ¬An architecture for effective music information retrieval.
In: Journal of the American Society for Information Science and Technology. 55(2004) no.12, S.1053-1057.
Abstract: We have explored methods for music information retrieval for polyphonic music stored in the MIDI format. These methods use a query, expressed as a series of notes that are intended to represent a melody or theme, to identify similar pieces. Our work has shown that a three-phase architecture is appropriate for this task in which the first phase is melody extraction, the second is standardization, and the third is query-to-melody matching. We have investigated and systematically compared algorithms for each of these phases. To ensure that our results are robust, we have applied methodologies that are derived from text information retrieval: We developed test collections and compared different ways of acquiring test queries and relevance judgments. In this article we review this program of work, compare it to other approaches to music information retrieval, and identify outstanding issues.
Anmerkung: Beitrag in einem Themenheft zur Musikerschließung und zum Musikretrieval
4Heinz, S. ; Zobel, J.: Efficient single-pass index construction for text databases.
In: Journal of the American Society for Information Science and technology. 54(2003) no.8, S.713-729.
Abstract: Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
5Hoad, T.C. ; Zobel, J.: Methods for identifying versioned and plagiarized documents.
In: Journal of the American Society for Information Science and technology. 54(2003) no.3, S.203-215.
Abstract: Hoad and Zobel term documents that originate from the same source, whether versions or plagiarisms, co-derivatives. Identification of co-derivatives is normally by a technique called fingerprinting, which uses hashing to generate surrogates in the form of integer strings derived from substrings of text, for comparison purposes, or by ranking using a similarity measure as in information retrieval. Hoad and Zobel derive several variants of what they term an identity measure, where documents with similar numbers of occurrences of words benefit and those with dissimilar numbers are penalized, for use in a ranking technique. They then review fingerprinting strategies, and characterize them by the substring size utilized, i.e. granularity, character of the hashing function, the size of the document fingerprint, i.e. resolution, and the substring selection strategy. In their experiments highest false match, HFM, the highest percentage score given an incorrect result, and separation, the difference between the lowest correct result and HFM were the measures utilized in two collections, one of 3,300 documents, and the other of 80,000 with 53 query documents. The new identity measure demonstrates superior performance to the alternatives. Only one fingerprinting strategy was able to identify all human identified similar documents, the anchor strategy. The key parameter in fingerprinting appears to be granularity, with three to five words producing the best results.
6Kaszkiel, M. ; Zobel, J.: Effective ranking with arbitrary passages.
In: Journal of the American Society for Information Science and technology. 52(2001) no.4, S.344-364.
Abstract: Text retrieval systems store a great variety of documents, from abstracts, newspaper articles, and Web pages to journal articles, books, court transcripts, and legislation. Collections of diverse types of documents expose shortcomings in current approaches to ranking. Use of short fragments of documents, called passages, instead of whole documents can overcome these shortcomings: passage ranking provides convenient units of text to return to the user, can avoid the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material among otherwise irrelevant text. In this article, we compare several kinds of passage in an extensive series of experiments. We introduce a new type of passage, overlapping fragments of either fixed or variable length. We show that ranking with these arbitrary passages gives substantial improvements in retrieval effectiveness over traditional document ranking schemes, particularly for queries on collections of long documents. Ranking with arbitrary passages shows consistent improvements compared to ranking with whole documents, and to ranking with previous passage types that depend on document structure or topic shifts in documents
7Persin, M. ; Zobel, J. ; Sacks-Davis, R.: Filtered document retrieval with frequency-sorted indexes.
In: Journal of the American Society for Information SCience. 47(1996) no.10, S.749-764.
Abstract: Proposes an evaluation technique for ranking that uses early recognition of which documents are likely to be highly ranked to reduce costs. Queries are evaluated in 2% of the memory of standard implementation without degradation in retrieval effectiveness. CPU time and disc traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. Inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces CPU time and disk traffic to around 1/3rd of the original requirement. Frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed
8Moffat, A. ; Zobel, J.: Self-indexing inverted files for fast text retrieval.
In: ACM transactions on information systems. 14(1996) no.4, S.349-379.
Abstract: Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Shows that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly 2 million short documents. The self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness
9Bell, T.C. ; Moffat, A. ; Nevill-Manning, C.G. ; Witten, I.H. ; Zobel, J.: Data compression in full-text retrieval system.
In: Journal of the American Society for Information Science. 44(1993) no.9, S.508-531.
Abstract: When data compression is applied to full-text retrieval systems, intricate relationships emerge between the amount of compression, access speed, and computing resources required. We propose compression methods, and explore corresponding tradeoffs, for all components of static full-text systems such as text databases on CD-ROM. These components include lexical indexes, and the mein text itself. Results are reported on the application of the methods to several substantial full-text databases, and show that a large, unindexed text can be stored, along with indexes that facilitate fast searching, in less than half its original size - at some appreciable cost in primary memory requirements