Document (#27093)

Author
Comeau, D.C.
Wilbur, W.J.
Title
Non-Word Identification or Spell Checking Without a Dictionary
Source
Journal of the American Society for Information Science and Technology. 55(2004) no.2, p.169-177
Year
2004
Abstract
MEDLINE is a collection of more than 12 million references and abstracts covering recent life science literature. With its continued growth and cutting-edge terminology, spell-checking with a traditional lexicon-based approach requires significant additional manual follow-up. In this work, an internal corpus-based context quality rating α, frequency, and simple misspelling transformations are used to rank words from most likely to be misspellings to least likely. Eleven-point average precisions of 0.891 have been achieved within a class of 42,340 all-alphabetic words having an α score less than 10. Our models predict that 16,274 or 38% of these words are misspellings. Based on test data, this result has a recall of 79% and a precision of 86%. In other words, spell checking can be done by statistics instead of with a dictionary. As an application we examine the time history of low-α words in MEDLINE titles and abstracts.
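The abstract reports an eleven-point average precision for the ranked word list. As a rough illustration of how such a figure is commonly computed, the sketch below uses the standard interpolated definition (average of the best precision reachable at recall levels 0.0, 0.1, ..., 1.0). The function name and the toy ranking are hypothetical and are not taken from the paper, and the paper's exact evaluation procedure may differ.

```python
def eleven_point_average_precision(ranked_labels):
    """ranked_labels: booleans in rank order, True = relevant item
    (here: a true misspelling), most confident prediction first."""
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0

    # Recall and precision observed after each rank position.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at a recall level is the highest precision
    # seen at any position whose recall is at least that level.
    interpolated = []
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11


# Hypothetical toy ranking, not data from the paper.
toy_ranking = [True, True, False, True, False]
toy_score = eleven_point_average_precision(toy_ranking)

# Share of the 42,340 low-score words predicted to be misspellings
# (both counts are quoted in the abstract).
predicted_share = 16274 / 42340

print(f"{toy_score:.3f}", f"{predicted_share:.1%}")
```

The last lines only re-derive the rounded 38% quoted in the abstract from the two counts it gives.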
Theme
Computational linguistics
Field
Medicine
Object
Medline

Similar documents (author)

  1. Wilbur, W.J.: Global term weights for document retrieval learned from TREC data (2001) 5.62
    5.6180234 = sum of:
      5.6180234 = weight(author_txt:wilbur in 2647) [ClassicSimilarity], result of:
        5.6180234 = fieldWeight in 2647, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.988837 = idf(docFreq=14, maxDocs=44218)
          0.625 = fieldNorm(doc=2647)
    
  2. Wilbur, W.J.: Human subjectivity and performance limits in document retrieval (1996) 5.62
    5.6180234 = sum of:
      5.6180234 = weight(author_txt:wilbur in 6607) [ClassicSimilarity], result of:
        5.6180234 = fieldWeight in 6607, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.988837 = idf(docFreq=14, maxDocs=44218)
          0.625 = fieldNorm(doc=6607)
    
  3. Wilbur, W.J.: A comparison of group and individual performance among subject experts and untrained workers at the document retrieval task (1998) 5.62
    5.6180234 = sum of:
      5.6180234 = weight(author_txt:wilbur in 3263) [ClassicSimilarity], result of:
        5.6180234 = fieldWeight in 3263, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.988837 = idf(docFreq=14, maxDocs=44218)
          0.625 = fieldNorm(doc=3263)
    
  4. Wilbur, W.J.: Human subjectivity and performance limits in document retrieval (1999) 5.62
    5.6180234 = sum of:
      5.6180234 = weight(author_txt:wilbur in 4539) [ClassicSimilarity], result of:
        5.6180234 = fieldWeight in 4539, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.988837 = idf(docFreq=14, maxDocs=44218)
          0.625 = fieldNorm(doc=4539)
    
  5. Wilbur, W.J.: A retrieval system based on automatic relevance weighting of search terms (1992) 5.62
    5.6180234 = sum of:
      5.6180234 = weight(author_txt:wilbur in 5269) [ClassicSimilarity], result of:
        5.6180234 = fieldWeight in 5269, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.988837 = idf(docFreq=14, maxDocs=44218)
          0.625 = fieldNorm(doc=5269)
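
The five author-match breakdowns above all multiply the same three factors: tf, idf and fieldNorm. The sketch below reproduces that arithmetic. It assumes the classic Lucene TF-IDF formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which are consistent with the numbers shown but are inferred here rather than stated on the page. The function names are hypothetical.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed classic TF-IDF form: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, with tf assumed to be sqrt(freq).
    return math.sqrt(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the author_txt:wilbur breakdown above.
author_idf = classic_idf(doc_freq=14, max_docs=44218)
author_score = field_weight(freq=1.0, doc_freq=14, max_docs=44218, field_norm=0.625)

print(round(author_idf, 6), round(author_score, 6))
```

Because every listed record matches the single query term once and has the same fieldNorm, all five receive the same score of about 5.62.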
    

Similar documents (content)

  1. Lee, K.H.; Ng, M.K.M.; Lu, Q.: Text segmentation for Chinese spell checking (1999) 0.35
    0.3497263 = sum of:
      0.3497263 = product of:
        1.2490225 = sum of:
          0.009098792 = weight(abstract_txt:with in 3913) [ClassicSimilarity], result of:
            0.009098792 = score(doc=3913,freq=2.0), product of:
              0.041180823 = queryWeight, product of:
                1.2063868 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.013655724 = queryNorm
              0.22094731 = fieldWeight in 3913, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=3913)
          0.022948947 = weight(abstract_txt:than in 3913) [ClassicSimilarity], result of:
            0.022948947 = score(doc=3913,freq=2.0), product of:
              0.066657744 = queryWeight, product of:
                1.2531953 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.013655724 = queryNorm
              0.34428027 = fieldWeight in 3913, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.0625 = fieldNorm(doc=3913)
          0.01887244 = weight(abstract_txt:based in 3913) [ClassicSimilarity], result of:
            0.01887244 = score(doc=3913,freq=2.0), product of:
              0.06697683 = queryWeight, product of:
                1.5385138 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.013655724 = queryNorm
              0.28177565 = fieldWeight in 3913, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=3913)
          0.0801826 = weight(abstract_txt:dictionary in 3913) [ClassicSimilarity], result of:
            0.0801826 = score(doc=3913,freq=1.0), product of:
              0.19337732 = queryWeight, product of:
                2.1345003 = boost
                6.634292 = idf(docFreq=157, maxDocs=44218)
                0.013655724 = queryNorm
              0.41464326 = fieldWeight in 3913, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.634292 = idf(docFreq=157, maxDocs=44218)
                0.0625 = fieldNorm(doc=3913)
          0.34467992 = weight(abstract_txt:checking in 3913) [ClassicSimilarity], result of:
            0.34467992 = score(doc=3913,freq=3.0), product of:
              0.40577576 = queryWeight, product of:
                3.786885 = boost
                7.84674 = idf(docFreq=46, maxDocs=44218)
                0.013655724 = queryNorm
              0.8494345 = fieldWeight in 3913, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.84674 = idf(docFreq=46, maxDocs=44218)
                0.0625 = fieldNorm(doc=3913)
          0.56263924 = weight(abstract_txt:spell in 3913) [ClassicSimilarity], result of:
            0.56263924 = score(doc=3913,freq=4.0), product of:
              0.5111118 = queryWeight, product of:
                4.250079 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.013655724 = queryNorm
              1.1008145 = fieldWeight in 3913, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.0625 = fieldNorm(doc=3913)
          0.21060067 = weight(abstract_txt:words in 3913) [ClassicSimilarity], result of:
            0.21060067 = score(doc=3913,freq=4.0), product of:
              0.31474 = queryWeight, product of:
                4.305657 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.013655724 = queryNorm
              0.66912585 = fieldWeight in 3913, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.0625 = fieldNorm(doc=3913)
        0.28 = coord(7/25)
    
  2. Drabenstott, K.M.; Weller, M.S.: Handling spelling errors in online catalog searches (1996) 0.15
    0.14743721 = sum of:
      0.14743721 = product of:
        0.7371861 = sum of:
          0.012867635 = weight(abstract_txt:with in 5973) [ClassicSimilarity], result of:
            0.012867635 = score(doc=5973,freq=4.0), product of:
              0.041180823 = queryWeight, product of:
                1.2063868 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.013655724 = queryNorm
              0.31246668 = fieldWeight in 5973, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=5973)
          0.022948947 = weight(abstract_txt:than in 5973) [ClassicSimilarity], result of:
            0.022948947 = score(doc=5973,freq=2.0), product of:
              0.066657744 = queryWeight, product of:
                1.2531953 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.013655724 = queryNorm
              0.34428027 = fieldWeight in 5973, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.0625 = fieldNorm(doc=5973)
          0.35345122 = weight(abstract_txt:misspellings in 5973) [ClassicSimilarity], result of:
            0.35345122 = score(doc=5973,freq=3.0), product of:
              0.3604663 = queryWeight, product of:
                2.9142432 = boost
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.013655724 = queryNorm
              0.98053885 = fieldWeight in 5973, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.05783 = idf(docFreq=13, maxDocs=44218)
                0.0625 = fieldNorm(doc=5973)
          0.19900104 = weight(abstract_txt:checking in 5973) [ClassicSimilarity], result of:
            0.19900104 = score(doc=5973,freq=1.0), product of:
              0.40577576 = queryWeight, product of:
                3.786885 = boost
                7.84674 = idf(docFreq=46, maxDocs=44218)
                0.013655724 = queryNorm
              0.49042124 = fieldWeight in 5973, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.84674 = idf(docFreq=46, maxDocs=44218)
                0.0625 = fieldNorm(doc=5973)
          0.14891717 = weight(abstract_txt:words in 5973) [ClassicSimilarity], result of:
            0.14891717 = score(doc=5973,freq=2.0), product of:
              0.31474 = queryWeight, product of:
                4.305657 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.013655724 = queryNorm
              0.47314343 = fieldWeight in 5973, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.0625 = fieldNorm(doc=5973)
        0.2 = coord(5/25)
    
  3. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.12
    0.12430026 = sum of:
      0.12430026 = product of:
        0.4439295 = sum of:
          0.07080801 = weight(abstract_txt:score in 5188) [ClassicSimilarity], result of:
            0.07080801 = score(doc=5188,freq=4.0), product of:
              0.1078126 = queryWeight, product of:
                1.1269722 = boost
                7.0055394 = idf(docFreq=108, maxDocs=44218)
                0.013655724 = queryNorm
              0.65676934 = fieldWeight in 5188, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.0055394 = idf(docFreq=108, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.008357774 = weight(abstract_txt:with in 5188) [ClassicSimilarity], result of:
            0.008357774 = score(doc=5188,freq=3.0), product of:
              0.041180823 = queryWeight, product of:
                1.2063868 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.013655724 = queryNorm
              0.20295306 = fieldWeight in 5188, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.012170518 = weight(abstract_txt:than in 5188) [ClassicSimilarity], result of:
            0.012170518 = score(doc=5188,freq=1.0), product of:
              0.066657744 = queryWeight, product of:
                1.2531953 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.013655724 = queryNorm
              0.1825822 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.061056014 = weight(abstract_txt:eleven in 5188) [ClassicSimilarity], result of:
            0.061056014 = score(doc=5188,freq=1.0), product of:
              0.1550435 = queryWeight, product of:
                1.3514663 = boost
                8.401051 = idf(docFreq=26, maxDocs=44218)
                0.013655724 = queryNorm
              0.39379925 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.401051 = idf(docFreq=26, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.010008624 = weight(abstract_txt:based in 5188) [ClassicSimilarity], result of:
            0.010008624 = score(doc=5188,freq=1.0), product of:
              0.06697683 = queryWeight, product of:
                1.5385138 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.013655724 = queryNorm
              0.14943412 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.14473943 = weight(abstract_txt:medline in 5188) [ClassicSimilarity], result of:
            0.14473943 = score(doc=5188,freq=5.0), product of:
              0.20310122 = queryWeight, product of:
                2.1875083 = boost
                6.7990475 = idf(docFreq=133, maxDocs=44218)
                0.013655724 = queryNorm
              0.71264684 = fieldWeight in 5188, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.7990475 = idf(docFreq=133, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.13678916 = weight(abstract_txt:words in 5188) [ClassicSimilarity], result of:
            0.13678916 = score(doc=5188,freq=3.0), product of:
              0.31474 = queryWeight, product of:
                4.305657 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.013655724 = queryNorm
              0.43461 = fieldWeight in 5188, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
        0.28 = coord(7/25)
    
  4. Rubashkin, V.S.; Lakhuti, D.G.: Semanticheskii (kontseptual'nyi) slovar' dlya informatsionnykh tekhnologii, ch.1 (1998) 0.10
    0.099307135 = sum of:
      0.099307135 = product of:
        0.6206696 = sum of:
          0.04056839 = weight(abstract_txt:than in 3253) [ClassicSimilarity], result of:
            0.04056839 = score(doc=3253,freq=1.0), product of:
              0.066657744 = queryWeight, product of:
                1.2531953 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.013655724 = queryNorm
              0.6086073 = fieldWeight in 3253, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.15625 = fieldNorm(doc=3253)
          0.033362076 = weight(abstract_txt:based in 3253) [ClassicSimilarity], result of:
            0.033362076 = score(doc=3253,freq=1.0), product of:
              0.06697683 = queryWeight, product of:
                1.5385138 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.013655724 = queryNorm
              0.4981137 = fieldWeight in 3253, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.15625 = fieldNorm(doc=3253)
          0.2834883 = weight(abstract_txt:dictionary in 3253) [ClassicSimilarity], result of:
            0.2834883 = score(doc=3253,freq=2.0), product of:
              0.19337732 = queryWeight, product of:
                2.1345003 = boost
                6.634292 = idf(docFreq=157, maxDocs=44218)
                0.013655724 = queryNorm
              1.4659853 = fieldWeight in 3253, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.634292 = idf(docFreq=157, maxDocs=44218)
                0.15625 = fieldNorm(doc=3253)
          0.26325083 = weight(abstract_txt:words in 3253) [ClassicSimilarity], result of:
            0.26325083 = score(doc=3253,freq=1.0), product of:
              0.31474 = queryWeight, product of:
                4.305657 = boost
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.013655724 = queryNorm
              0.8364073 = fieldWeight in 3253, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.353007 = idf(docFreq=568, maxDocs=44218)
                0.15625 = fieldNorm(doc=3253)
        0.16 = coord(4/25)
    
  5. Willson, R.; Given, L.M.: The effect of spelling and retrieval system familiarity on search behavior in online public access catalogs : a mixed methods study (2010) 0.09
    0.09294094 = sum of:
      0.09294094 = product of:
        0.7745079 = sum of:
          0.012867635 = weight(abstract_txt:with in 4042) [ClassicSimilarity], result of:
            0.012867635 = score(doc=4042,freq=4.0), product of:
              0.041180823 = queryWeight, product of:
                1.2063868 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.013655724 = queryNorm
              0.31246668 = fieldWeight in 4042, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=4042)
          0.19900104 = weight(abstract_txt:checking in 4042) [ClassicSimilarity], result of:
            0.19900104 = score(doc=4042,freq=1.0), product of:
              0.40577576 = queryWeight, product of:
                3.786885 = boost
                7.84674 = idf(docFreq=46, maxDocs=44218)
                0.013655724 = queryNorm
              0.49042124 = fieldWeight in 4042, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.84674 = idf(docFreq=46, maxDocs=44218)
                0.0625 = fieldNorm(doc=4042)
          0.56263924 = weight(abstract_txt:spell in 4042) [ClassicSimilarity], result of:
            0.56263924 = score(doc=4042,freq=4.0), product of:
              0.5111118 = queryWeight, product of:
                4.250079 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.013655724 = queryNorm
              1.1008145 = fieldWeight in 4042, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.0625 = fieldNorm(doc=4042)
        0.12 = coord(3/25)
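
Each content-match breakdown above combines per-term scores (queryWeight times fieldWeight), sums them, and scales the sum by a coord factor (matched query terms over total query terms). The sketch below recomputes the first entry (doc 3913) from the term frequencies, document frequencies, boosts, queryNorm and fieldNorm printed in its tree. The tf and idf formulas are the classic Lucene TF-IDF forms, assumed here because they agree with the listed values. The function and variable names are hypothetical.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.013655724


def classic_idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed classic TF-IDF form: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM          # queryWeight
    field_weight = math.sqrt(freq) * idf * field_norm  # fieldWeight
    return query_weight * field_weight


# (term, termFreq, docFreq, boost) copied from the doc 3913 breakdown.
terms_doc_3913 = [
    ("with", 2.0, 9868, 1.2063868),
    ("than", 2.0, 2444, 1.2531953),
    ("based", 2.0, 4958, 1.5385138),
    ("dictionary", 1.0, 157, 2.1345003),
    ("checking", 3.0, 46, 3.786885),
    ("spell", 4.0, 17, 4.250079),
    ("words", 4.0, 568, 4.305657),
]

term_sum = sum(
    term_score(freq, doc_freq, boost, field_norm=0.0625)
    for _, freq, doc_freq, boost in terms_doc_3913
)
coord = len(terms_doc_3913) / 25  # 7 of 25 query terms matched
doc_score = coord * term_sum

print(round(term_sum, 6), round(doc_score, 6))
```

The coord factor explains why entries with strong individual term matches can still rank low: entry 5 matches only 3 of 25 query terms, so its term sum is scaled by 0.12.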