This database contains more than 40,000 documents on topics in descriptive cataloguing, subject indexing, and information retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft / Powered by litecat, BIS Oldenburg (as of 15 June 2019)
1. Liu, W. ; Doğan, R.I. ; Kim, S. ; Comeau, D.C. ; Kim, W. ; Yeganova, L. ; Lu, Z. ; Wilbur, W.J.: Author name disambiguation for PubMed.
In: Journal of the Association for Information Science and Technology. 65(2014) no.4, pp. 765-781.
Abstract: Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores to estimate the probability of a pair of papers belonging to the same author, which drives an agglomerative clustering algorithm regulated by 2 factors: name compatibility and probability level. With transitivity violation correction, high precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated with manual verification of random samples of pairs from clustering results. When compared with a state-of-the-art system, our evaluation shows that among all the pairs the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rises from 1.8% to 7.7%. This results in an overall error rate of 9.9%, compared with 11.9% for the state-of-the-art method. Other evaluations based on gold standard data also show the increase in accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and the clustering algorithm regulated by a name compatibility scheme preferring precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through-rate of PubMed users on author name query results improved from 34.9% to 36.9%.
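The clustering step described in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the name-compatibility rule, the average-linkage merge criterion, the `threshold` value, and the function names `names_compatible` and `cluster` are all assumptions made for illustration, and the pairwise probabilities are supplied by the caller rather than learned.

```python
from itertools import combinations


def names_compatible(a: str, b: str) -> bool:
    """Hypothetical rule: same last name, and one given name is a prefix
    of the other (so "Smith, J" fits "Smith, John" but "John" != "Jane")."""
    last_a, _, first_a = a.partition(", ")
    last_b, _, first_b = b.partition(", ")
    if last_a.lower() != last_b.lower():
        return False
    fa = first_a.lower().rstrip(".")
    fb = first_b.lower().rstrip(".")
    return fa.startswith(fb) or fb.startswith(fa)


def cluster(papers, prob, threshold=0.9):
    """Greedy agglomerative clustering of papers by author identity.

    papers: dict mapping paper id -> author name string on that paper
    prob:   function (id1, id2) -> estimated same-author probability
    Two clusters may merge only if every cross pair is name compatible
    and the average cross-pair probability reaches the threshold.
    """
    clusters = [{pid} for pid in papers]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(p, q) for p in clusters[i] for q in clusters[j]]
            if not all(names_compatible(papers[p], papers[q]) for p, q in pairs):
                continue  # name compatibility acts as a hard constraint
            score = sum(prob(p, q) for p, q in pairs) / len(pairs)
            if score >= threshold and (best is None or score > best[0]):
                best = (score, i, j)
        if best is None:
            return clusters  # no admissible merge left
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
```

A high threshold keeps the sketch conservative: it prefers leaving one author's papers split over lumping different authors together, which mirrors the precision-oriented trade-off reported in the abstract.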
2. Yeganova, L. ; Comeau, D.C. ; Kim, W. ; Wilbur, W.J.: How to interpret PubMed queries and why it matters.
In: Journal of the American Society for Information Science and Technology. 60(2009) no.2, pp. 264-274.
Abstract: A significant fraction of queries in PubMed(TM) are multiterm queries without parsing instructions. Generally, search engines interpret such queries as collections of terms, and handle them as a Boolean conjunction of these terms. However, analysis of queries in PubMed(TM) indicates that many such queries are meaningful phrases, rather than simple collections of terms. In this study, we examine whether or not it makes a difference, in terms of retrieval quality, if such queries are interpreted as a phrase or as a conjunction of query terms. And, if it does, what is the optimal way of searching with such queries. To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that the class of records that contain all the search terms, but not the phrase, qualitatively differs from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching.
3. Lin, J. ; DiCuccio, M. ; Grigoryan, V. ; Wilbur, W.J.: Navigating information spaces : a case study of related article search in PubMed.
In: Information processing and management. 44(2008) no.5, pp. 1771-1783.
Abstract: The concept of an "information space" provides a powerful metaphor for guiding the design of interactive retrieval systems. We present a case study of related article search, a browsing tool designed to help users navigate the information space defined by results of the PubMed® search engine. This feature leverages content-similarity links that tie MEDLINE® citations together in a vast document network. We examine the effectiveness of related article search from two perspectives: a topological analysis of networks generated from information needs represented in the TREC 2005 genomics track and a query log analysis of real PubMed users. Together, data suggest that related article search is a useful feature and that browsing related articles has become an integral part of how users interact with PubMed.
Subject area: Semantic environment in indexing and retrieval
4. Comeau, D.C. ; Wilbur, W.J.: Non-Word Identification or Spell Checking Without a Dictionary.
In: Journal of the American Society for Information Science and Technology. 55(2004) no.2, pp. 169-177.
Abstract: MEDLINE is a collection of more than 12 million references and abstracts covering recent life science literature. With its continued growth and cutting-edge terminology, spell-checking with a traditional lexicon-based approach requires significant additional manual follow-up. In this work, an internal corpus-based context quality rating α, frequency, and simple misspelling transformations are used to rank words from most likely to be misspellings to least likely. Eleven-point average precisions of 0.891 have been achieved within a class of 42,340 all-alphabetic words having an α score less than 10. Our models predict that 16,274 or 38% of these words are misspellings. Based on test data, this result has a recall of 79% and a precision of 86%. In other words, spell checking can be done by statistics instead of with a dictionary. As an application we examine the time history of low-α words in MEDLINE titles and abstracts.
6. Kim, W. ; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms.
In: Journal of the American Society for Information Science and Technology. 52(2001) no.3, pp. 247-259.
Abstract: Kim and Wilbur present three techniques for the algorithmic identification in text of content-bearing terms and phrases intended for human use as entry points or hyperlinks. Using a set of 1,075 terms from MEDLINE evaluated on a zero-to-four scale, from stop word to definite content word, they evaluate the ranked lists of their three methods based on their placement of content words in the top ranks. Data consist of the natural language elements of 304,057 MEDLINE records from 1996 and 173,252 Wall Street Journal records from the TIPSTER collection. Phrases are extracted by breaking at punctuation marks and stop words, then normalized by lower casing, replacement of nonalphanumerics with spaces, and the reduction of multiple spaces. In the "strength of context" approach, each document is a vector of binary values for each word or word pair. The words or word pairs are removed from all documents, and the Robertson/Sparck Jones relevance weight for each term is computed; negative weights are replaced with zero, those below a randomness threshold are ignored, and the remainder are summed for each document to yield a score for the document. The term is then assigned the average document score for the documents in which it occurred. The average of these word scores is assigned to the original phrase. The "frequency clumping" approach defines a random phrase as one whose distribution among documents is Poisson in character. A p-value, the probability that a phrase's frequency of occurrence would be equal to, or less than, Poisson expectations, is computed, and a score is assigned which is the negative log of that value. In the "database comparison" approach, if a phrase occurring in a document allows prediction that the document is in MEDLINE rather than in the Wall Street Journal, it is considered to be content bearing for MEDLINE. The score is computed by dividing the number of occurrences of the term in MEDLINE by occurrences in the Journal, and taking the product of all these values. The one hundred top-ranked and bottom-ranked phrases that occurred in at least 500 documents were collected for each method. The union set had 476 phrases. A second selection was made of two-word phrases, each occurring in only three documents, with a union of 599 phrases. A judge then ranked the two sets of terms as to subject specificity on a 0 to 4 scale. Precision was the average subject specificity of the first r ranks, recall was the fraction of the subject-specific phrases in the first r ranks, and eleven-point average precision was used as a summary measure. All three methods move content-bearing terms forward in the lists, as does the use of the sum of the logs of the three methods.
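The eleven-point average precision used as the summary measure here can be sketched as follows. This is the textbook interpolated form with binary relevance judgments, which is an assumption: the study itself works with graded 0 to 4 specificity scores, and the abstract does not give its exact computation. The function name is illustrative.

```python
def eleven_point_average_precision(ranked_relevance):
    """Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: sequence of 0/1 flags in ranked order (1 = relevant).
    Assumes binary judgments, unlike the graded scale in the study.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0

    # (recall, precision) observed after each rank position
    points = []
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    interpolated = []
    for i in range(11):
        level = i / 10
        # interpolated precision: best precision at any recall >= level
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11
```

A ranking that places every relevant item first scores 1.0; pushing relevant items down the list lowers the precision at the higher recall levels and therefore the average.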
8. Wilbur, W.J.: A comparison of group and individual performance among subject experts and untrained workers at the document retrieval task.
In: Journal of the American Society for Information Science. 49(1998) no.6, pp. 517-529.
Abstract: Reports on a study that contradicts the hypothesis that building detailed subject knowledge into search systems improves retrieval. A group with a background in molecular biology performed the same judgements when considering document retrieval as another group without subject knowledge. The untrained panel performed better than any of the members of the trained panel and almost at the level of the trained panel as a whole. Explains the method which uses the probability ranking principle to measure retrieval
9. Wilbur, W.J.: Human subjectivity and performance limits in document retrieval.
In: Information processing and management. 32(1996) no.5, pp. 515-527.
Abstract: Test sets for the document retrieval task composed of human relevance judgments have been constructed that allow one to compare human performance directly with that of automatic methods and that place absolute limits on performance by any method. Current retrieval systems are found to generate only about half of the information allowed by these absolute limits. The data suggests that most of the improvement that could be achieved consistent with these limits can only be achieved by incorporating specific subject information into retrieval systems
10. Wilbur, W.J. ; Coffee, L.: The effectiveness of document neighboring in search enhancement.
In: Information processing and management. 30(1994) no.2, pp. 253-277.
Abstract: Considers two kinds of queries that may be applied to a database. The first is a query written by a searcher to express an information need. The second is a request for documents most similar to a document already judged relevant by the searcher. Examines the effectiveness of these two procedures and shows that in important cases the latter query type is more effective than the former. This provides a new view of the cluster hypothesis and a justification for document neighbouring procedures. If all the documents in a database have readily available precomputed nearest neighbours, a new search algorithm, called parallel neighbourhood searching, becomes feasible. Shows that this feedback-based method provides significant improvement in recall over traditional linear searching methods, and appears superior to traditional feedback methods in overall performance
11. Wilbur, W.J. ; Sirotkin, K.: The automatic identification of stop words.
In: Journal of information science. 18(1992) no.1, pp. 45-55.
Abstract: A stop word may be identified as a word that has the same likelihood of occurring in those documents not relevant to a query as in those documents relevant to the query. Shows how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure. Thus it becomes possible to identify the stop words in a collection by automatic statistical testing. Describes the nature of the statistical test as it is realised with a vector retrieval methodology based on the cosine coefficient of document-document similarity. As an example, this technique is then applied to a large MEDLINE subset in the area of biotechnology
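The cosine coefficient of document-document similarity mentioned in this abstract can be sketched in a few lines. Raw term counts and whitespace tokenization are assumptions for illustration; the abstract does not say how the paper weights or tokenizes terms, and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between two raw term-count vectors."""
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    # only terms present in both documents contribute to the dot product
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

In the approach the abstract describes, a high similarity rating of this kind stands in for relevance, so a word that occurs about equally often in highly rated and other document pairs becomes a stop-word candidate.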
12. Wilbur, W.J.: A retrieval system based on automatic relevance weighting of search terms.
In: Proceedings of the 55th Annual Meeting of the American Society for Information Science, Pittsburgh, 26-29 October 1992. Ed.: D. Shaw. Medford, NJ : Learned Information Inc., 1992. pp. 216-220.
Abstract: Describes the development of a retrieval system based on automatic relevance weighting of search terms and founded on the Bayesian formulation of the probability of relevance as a function of term occurrence where the contribution from individual terms is assumed to be independent. The relevance pair (RP) model and the vector cosine (VC) model were compared and in the test environment improved retrieval was obtained with the RP model when compared with the VC model
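Under the term-independence assumption described in this abstract, a document's log-odds of relevance reduces to a sum of per-term weights. The sketch below illustrates that idea with the well-known Robertson/Sparck Jones relevance weight (0.5 smoothing) as a stand-in. The abstract does not specify the paper's actual weighting or the details of the RP model, so the function names and the choice of weight are assumptions.

```python
import math


def rsj_weight(r: int, n: int, R: int, N: int) -> float:
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: relevant documents containing the term
    n: all documents containing the term
    R: relevant documents in total
    N: documents in the collection
    """
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    odds_nonrelevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_nonrelevant)


def score_document(doc_terms, term_weights):
    """Sum the weights of weighted terms that occur in the document.

    Adding log-odds weights is what the independence assumption licenses.
    """
    present = set(doc_terms)
    return sum(w for term, w in term_weights.items() if term in present)
```

A term concentrated in relevant documents receives a positive weight and raises the score of documents containing it; a term that is common overall but absent from the relevant set receives a negative weight.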