Search (12 results, page 1 of 1)

  • author_ss:"Wilbur, W.J."
  1. Wilbur, W.J.; Coffee, L.: The effectiveness of document neighboring in search enhancement (1994) 0.00
    0.0037439493 = product of:
      0.0074878987 = sum of:
        0.0074878987 = product of:
          0.014975797 = sum of:
            0.014975797 = weight(_text_:a in 7419) [ClassicSimilarity], result of:
              0.014975797 = score(doc=7419,freq=20.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.28200063 = fieldWeight in 7419, product of:
                  4.472136 = tf(freq=20.0), with freq of:
                    20.0 = termFreq=20.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=7419)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Considers two kinds of queries that may be applied to a database. The first is a query written by a searcher to express an information need. The second is a request for documents most similar to a document already judged relevant by the searcher. Examines the effectiveness of these two procedures and shows that in important cases the latter query type is more effective than the former. This provides a new view of the cluster hypothesis and a justification for document neighbouring procedures. If all the documents in a database have readily available precomputed nearest neighbours, a new search algorithm, called parallel neighbourhood searching, becomes feasible. Shows that this feedback-based method provides significant improvement in recall over traditional linear searching methods, and appears superior to traditional feedback methods in overall performance.
    Type
    a
  2. Comeau, D.C.; Wilbur, W.J.: Non-Word Identification or Spell Checking Without a Dictionary (2004) 0.00
    0.0033657318 = product of:
      0.0067314636 = sum of:
        0.0067314636 = product of:
          0.013462927 = sum of:
            0.013462927 = weight(_text_:a in 2092) [ClassicSimilarity], result of:
              0.013462927 = score(doc=2092,freq=22.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.25351265 = fieldWeight in 2092, product of:
                  4.690416 = tf(freq=22.0), with freq of:
                    22.0 = termFreq=22.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2092)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    MEDLINE is a collection of more than 12 million references and abstracts covering recent life science literature. With its continued growth and cutting-edge terminology, spell-checking with a traditional lexicon-based approach requires significant additional manual follow-up. In this work, an internal corpus-based context quality rating α, frequency, and simple misspelling transformations are used to rank words from most likely to be misspellings to least likely. Eleven-point average precisions of 0.891 have been achieved within a class of 42,340 all-alphabetic words having an α score less than 10. Our models predict that 16,274 or 38% of these words are misspellings. Based on test data, this result has a recall of 79% and a precision of 86%. In other words, spell checking can be done by statistics instead of with a dictionary. As an application we examine the time history of low-α words in MEDLINE titles and abstracts.
    Type
    a
  3. Wilbur, W.J.; Sirotkin, K.: The automatic identification of stop words (1992) 0.00
    0.00334869 = product of:
      0.00669738 = sum of:
        0.00669738 = product of:
          0.01339476 = sum of:
            0.01339476 = weight(_text_:a in 4853) [ClassicSimilarity], result of:
              0.01339476 = score(doc=4853,freq=16.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.25222903 = fieldWeight in 4853, product of:
                  4.0 = tf(freq=16.0), with freq of:
                    16.0 = termFreq=16.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4853)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    A stop word may be identified as a word that has the same likelihood of occurring in those documents not relevant to a query as in those documents relevant to the query. Shows how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure. Thus it becomes possible to identify the stop words in a collection by automatic statistical testing. Describes the nature of the statistical test as it is realised with a vector retrieval methodology based on the cosine coefficient of document-document similarity. As an example, this technique is then applied to a large MEDLINE subset in the area of biotechnology.
    Type
    a
  4. Lin, J.; DiCuccio, M.; Grigoryan, V.; Wilbur, W.J.: Navigating information spaces : a case study of related article search in PubMed (2008) 0.00
    0.0030444188 = product of:
      0.0060888375 = sum of:
        0.0060888375 = product of:
          0.012177675 = sum of:
            0.012177675 = weight(_text_:a in 2124) [ClassicSimilarity], result of:
              0.012177675 = score(doc=2124,freq=18.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.22931081 = fieldWeight in 2124, product of:
                  4.2426405 = tf(freq=18.0), with freq of:
                    18.0 = termFreq=18.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2124)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The concept of an "information space" provides a powerful metaphor for guiding the design of interactive retrieval systems. We present a case study of related article search, a browsing tool designed to help users navigate the information space defined by results of the PubMed® search engine. This feature leverages content-similarity links that tie MEDLINE® citations together in a vast document network. We examine the effectiveness of related article search from two perspectives: a topological analysis of networks generated from information needs represented in the TREC 2005 genomics track and a query log analysis of real PubMed users. Together, data suggest that related article search is a useful feature and that browsing related articles has become an integral part of how users interact with PubMed.
    Type
    a
  5. Wilbur, W.J.: A comparison of group and individual performance among subject experts and untrained workers at the document retrieval task (1998) 0.00
    0.0029000505 = product of:
      0.005800101 = sum of:
        0.005800101 = product of:
          0.011600202 = sum of:
            0.011600202 = weight(_text_:a in 3263) [ClassicSimilarity], result of:
              0.011600202 = score(doc=3263,freq=12.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.21843673 = fieldWeight in 3263, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=3263)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Reports on a study that contradicts the hypothesis that building detailed subject knowledge into a search system improves retrieval. A group with a background in molecular biology performed the same judgements when considering document retrieval as another group without subject knowledge. The untrained panel performed better than any of the members of the trained panel and almost at the level of the trained panel as a whole. Explains the method, which uses the probability ranking principle to measure retrieval.
    Type
    a
  6. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.00
    0.0027894354 = product of:
      0.005578871 = sum of:
        0.005578871 = product of:
          0.011157742 = sum of:
            0.011157742 = weight(_text_:a in 5188) [ClassicSimilarity], result of:
              0.011157742 = score(doc=5188,freq=34.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.21010503 = fieldWeight in 5188, product of:
                  5.8309517 = tf(freq=34.0), with freq of:
                    34.0 = termFreq=34.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5188)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Kim and Wilbur present three techniques for the algorithmic identification in text of content-bearing terms and phrases intended for human use as entry points or hyperlinks. Using a set of 1,075 terms from MEDLINE evaluated on a zero to four, stop word to definite content word scale, they evaluate the ranked lists of their three methods based on their placement of content words in the top ranks. Data consist of the natural language elements of 304,057 MEDLINE records from 1996, and 173,252 Wall Street Journal records from the TIPSTER collection. Phrases are extracted by breaking at punctuation marks and stop words, normalized by lower casing, replacement of nonalphanumerics with spaces, and the reduction of multiple spaces. In the "strength of context" approach each document is a vector of binary values for each word or word pair. The words or word pairs are removed from all documents, and the Robertson-Sparck Jones relevance weight for each term computed, negative weights replaced with zero, those below a randomness threshold ignored, and the remainder summed for each document, to yield a score for the document and finally to assign to the term the average document score for documents in which it occurred. The average of these word scores is assigned to the original phrase. The "frequency clumping" approach defines a random phrase as one whose distribution among documents is Poisson in character. A p-value, the probability that a phrase frequency of occurrence would be equal to, or less than, Poisson expectations, is computed, and a score assigned which is the negative log of that value. In the "database comparison" approach, if a phrase occurring in a document allows prediction that the document is in MEDLINE rather than in the Wall Street Journal, it is considered to be content bearing for MEDLINE. The score is computed by dividing the number of occurrences of the term in MEDLINE by occurrences in the Journal, and taking the product of all these values.
    The one hundred top and bottom ranked phrases that occurred in at least 500 documents were collected for each method. The union set had 476 phrases. A second selection was made of two-word phrases occurring each in only three documents, with a union of 599 phrases. A judge then ranked the two sets of terms as to subject specificity on a 0 to 4 scale. Precision was the average subject specificity of the first r ranks, recall was the fraction of the subject-specific phrases in the first r ranks, and eleven-point average precision was used as a summary measure. The three methods all move content-bearing terms forward in the lists, as does the use of the sum of the logs of the three methods.
    Type
    a
  7. Wilbur, W.J.: Global term weights for document retrieval learned from TREC data (2001) 0.00
    0.0023678814 = product of:
      0.0047357627 = sum of:
        0.0047357627 = product of:
          0.009471525 = sum of:
            0.009471525 = weight(_text_:a in 2647) [ClassicSimilarity], result of:
              0.009471525 = score(doc=2647,freq=2.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.17835285 = fieldWeight in 2647, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.109375 = fieldNorm(doc=2647)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Type
    a
  8. Wilbur, W.J.: Human subjectivity and performance limits in document retrieval (1999) 0.00
    0.0023678814 = product of:
      0.0047357627 = sum of:
        0.0047357627 = product of:
          0.009471525 = sum of:
            0.009471525 = weight(_text_:a in 4539) [ClassicSimilarity], result of:
              0.009471525 = score(doc=4539,freq=2.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.17835285 = fieldWeight in 4539, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.109375 = fieldNorm(doc=4539)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Type
    a
  9. Wilbur, W.J.: A retrieval system based on automatic relevance weighting of search terms (1992) 0.00
    0.0023435948 = product of:
      0.0046871896 = sum of:
        0.0046871896 = product of:
          0.009374379 = sum of:
            0.009374379 = weight(_text_:a in 5269) [ClassicSimilarity], result of:
              0.009374379 = score(doc=5269,freq=6.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.17652355 = fieldWeight in 5269, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5269)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Describes the development of a retrieval system based on automatic relevance weighting of search terms and founded on the Bayesian formulation of the probability of relevance as a function of term occurrence, where the contribution from individual terms is assumed to be independent. The relevance pair (RP) model and the vector cosine (VC) model were compared, and in the test environment improved retrieval was obtained with the RP model when compared with the VC model.
    Type
    a
  10. Liu, W.; Doğan, R.I.; Kim, S.; Comeau, D.C.; Kim, W.; Yeganova, L.; Lu, Z.; Wilbur, W.J.: Author name disambiguation for PubMed (2014) 0.00
    0.0022374375 = product of:
      0.004474875 = sum of:
        0.004474875 = product of:
          0.00894975 = sum of:
            0.00894975 = weight(_text_:a in 1240) [ClassicSimilarity], result of:
              0.00894975 = score(doc=1240,freq=14.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.1685276 = fieldWeight in 1240, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1240)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores to estimate the probability of a pair of papers belonging to the same author, which drives an agglomerative clustering algorithm regulated by 2 factors: name compatibility and probability level. With transitivity violation correction, high precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated with manual verification of random samples of pairs from clustering results. When compared with a state-of-the-art system, our evaluation shows that among all the pairs the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rises from 1.8% to 7.7%. This results in an overall error rate of 9.9%, compared with 11.9% for the state-of-the-art method. Other evaluations based on gold standard data also show the increase in accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and the clustering algorithm regulated by a name compatibility scheme preferring precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through-rate of PubMed users on author name query results improved from 34.9% to 36.9%.
    Type
    a
  11. Yeganova, L.; Comeau, D.C.; Kim, W.; Wilbur, W.J.: How to interpret PubMed queries and why it matters (2009) 0.00
    0.0020714647 = product of:
      0.0041429293 = sum of:
        0.0041429293 = product of:
          0.008285859 = sum of:
            0.008285859 = weight(_text_:a in 2712) [ClassicSimilarity], result of:
              0.008285859 = score(doc=2712,freq=12.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.15602624 = fieldWeight in 2712, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2712)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    A significant fraction of queries in PubMed(TM) are multiterm queries without parsing instructions. Generally, search engines interpret such queries as collections of terms, and handle them as a Boolean conjunction of these terms. However, analysis of queries in PubMed(TM) indicates that many such queries are meaningful phrases, rather than simple collections of terms. In this study, we examine whether or not it makes a difference, in terms of retrieval quality, if such queries are interpreted as a phrase or as a conjunction of query terms. And, if it does, what is the optimal way of searching with such queries. To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that the class of records that contain all the search terms, but not the phrase, qualitatively differs from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching.
    Type
    a
  12. Wilbur, W.J.: Human subjectivity and performance limits in document retrieval (1996) 0.00
    0.001353075 = product of:
      0.00270615 = sum of:
        0.00270615 = product of:
          0.0054123 = sum of:
            0.0054123 = weight(_text_:a in 6607) [ClassicSimilarity], result of:
              0.0054123 = score(doc=6607,freq=2.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.10191591 = fieldWeight in 6607, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0625 = fieldNorm(doc=6607)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Type
    a
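
The score explanations in the listing all share one structure: queryWeight (idf × queryNorm) times fieldWeight (tf × idf × fieldNorm), scaled by two coord(1/2) factors. The sketch below recomputes that arithmetic. It assumes the ClassicSimilarity-style formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the values displayed above; the function names are hypothetical, and queryNorm, fieldNorm and the coord factors are simply copied from the explain output rather than derived.

```python
import math


def classic_tf(freq):
    """Term-frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency in the form that matches the listing."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def explain_score(freq, field_norm, doc_freq=37942, max_docs=44218,
                  query_norm=0.046056706, coords=(0.5, 0.5)):
    """Recompute one explain tree from its leaf values (assumed layout)."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm                      # queryWeight line
    field_weight = classic_tf(freq) * idf * field_norm   # fieldWeight line
    score = query_weight * field_weight
    for coord in coords:                                 # the two coord(1/2) lines
        score *= coord
    return score


# Leaf values copied from results 1, 7 and 12 in the listing.
first = explain_score(freq=20.0, field_norm=0.0546875)
seventh = explain_score(freq=2.0, field_norm=0.109375)
twelfth = explain_score(freq=2.0, field_norm=0.0625)
print(f"{first:.7f} {seventh:.7f} {twelfth:.7f}")
```

Results 7 and 12 have the same term frequency (2.0) for the matched term, so the gap between their scores comes entirely from fieldNorm (0.109375 versus 0.0625), which in this similarity is smaller for longer fields. Small differences in the last digits against the listing are expected, since the engine works in single precision.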