Document (#13555)

Author
Taghva, K.
Borsack, J.
Condit, A.
Title
Evaluation of model-based retrieval effectiveness with OCR text
Source
ACM transactions on information systems. 14(1996) no.1, p.64-93
Year
1996
Abstract
Reports on experiments with retrieval from OCR-generated text using systems based on standard models of retrieval. Shows that average precision and recall are not affected by OCR errors across systems for several collections. Both the actual and the simulation experiments include full-text and abstract-length documents. The ranking and feedback methods associated with the retrieval models are generally not robust enough to deal with OCR errors. OCR errors and garbage strings generated from the mistranslation of graphic objects increase the size of the index significantly. Describes the problems of applying OCR text within an information retrieval environment and offers solutions.

Similar documents (content)

  1. Taghva, K.; Borsack, J.; Condit, A.: Effects of OCR errors on ranking and feedback using the vector space model (1996) 0.37
    0.36881796 = sum of:
      0.36881796 = product of:
        1.1525562 = sum of:
          0.08377515 = weight(abstract_txt:recall in 4951) [ClassicSimilarity], result of:
            0.08377515 = score(doc=4951,freq=1.0), product of:
              0.13323428 = queryWeight, product of:
                5.7488523 = idf(docFreq=382, maxDocs=44218)
                0.023175806 = queryNorm
              0.6287807 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7488523 = idf(docFreq=382, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.09101898 = weight(abstract_txt:average in 4951) [ClassicSimilarity], result of:
            0.09101898 = score(doc=4951,freq=1.0), product of:
              0.14080794 = queryWeight, product of:
                1.0280296 = boost
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.023175806 = queryNorm
              0.64640516 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.09261409 = weight(abstract_txt:feedback in 4951) [ClassicSimilarity], result of:
            0.09261409 = score(doc=4951,freq=1.0), product of:
              0.14244828 = queryWeight, product of:
                1.0340002 = boost
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.023175806 = queryNorm
              0.6501594 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.122893944 = weight(abstract_txt:affected in 4951) [ClassicSimilarity], result of:
            0.122893944 = score(doc=4951,freq=1.0), product of:
              0.1720123 = queryWeight, product of:
                1.1362444 = boost
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.023175806 = queryNorm
              0.7144486 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.027549446 = weight(abstract_txt:with in 4951) [ClassicSimilarity], result of:
            0.027549446 = score(doc=4951,freq=1.0), product of:
              0.100763 = queryWeight, product of:
                1.7392921 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.023175806 = queryNorm
              0.27340835 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.11663321 = weight(abstract_txt:text in 4951) [ClassicSimilarity], result of:
            0.11663321 = score(doc=4951,freq=1.0), product of:
              0.2636983 = queryWeight, product of:
                2.8136861 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.023175806 = queryNorm
              0.4422979 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.09252486 = weight(abstract_txt:retrieval in 4951) [ClassicSimilarity], result of:
            0.09252486 = score(doc=4951,freq=1.0), product of:
              0.24342665 = queryWeight, product of:
                3.0224636 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.023175806 = queryNorm
              0.38009337 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.5255464 = weight(abstract_txt:errors in 4951) [ClassicSimilarity], result of:
            0.5255464 = score(doc=4951,freq=2.0), product of:
              0.51877254 = queryWeight, product of:
                3.4177566 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.023175806 = queryNorm
              1.0130575 = fieldWeight in 4951, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
        0.32 = coord(8/25)
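
The per-term lines in the tree above can be recomputed by hand. The following is a minimal sketch, assuming the usual ClassicSimilarity defaults (idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq)), which are consistent with the figures shown. The function names `idf` and `term_score` are illustrative, not part of any library. The fieldNorm and queryNorm values are taken as given from the listing rather than derived.

```python
import math

def idf(doc_freq, max_docs):
    # Inverse document frequency, assumed ClassicSimilarity default.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    i = idf(doc_freq, max_docs)
    query_weight = boost * i * query_norm
    field_weight = math.sqrt(freq) * i * field_norm
    return query_weight * field_weight

# Term "errors" in doc 4951, with the numbers copied from the listing.
score = term_score(
    freq=2.0,
    doc_freq=171,
    max_docs=44218,
    query_norm=0.023175806,
    field_norm=0.109375,
    boost=3.4177566,
)
print(f"errors in 4951: {score:.6f}")
```

The result should agree with the listed 0.5255464 to several decimal places; small differences are expected because the listing was produced with single-precision arithmetic.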
    
  2. Losee, R.M.: Determining information retrieval and filtering performance without experimentation (1995) 0.26
    0.25642613 = sum of:
      0.25642613 = product of:
        0.71229476 = sum of:
          0.06501356 = weight(abstract_txt:average in 3368) [ClassicSimilarity], result of:
            0.06501356 = score(doc=3368,freq=1.0), product of:
              0.14080794 = queryWeight, product of:
                1.0280296 = boost
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.023175806 = queryNorm
              0.46171796 = fieldWeight in 3368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.90999 = idf(docFreq=325, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
          0.06615292 = weight(abstract_txt:feedback in 3368) [ClassicSimilarity], result of:
            0.06615292 = score(doc=3368,freq=1.0), product of:
              0.14244828 = queryWeight, product of:
                1.0340002 = boost
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.023175806 = queryNorm
              0.46439958 = fieldWeight in 3368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
          0.020407936 = weight(abstract_txt:based in 3368) [ClassicSimilarity], result of:
            0.020407936 = score(doc=3368,freq=1.0), product of:
              0.081940874 = queryWeight, product of:
                1.1090658 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.023175806 = queryNorm
              0.24905685 = fieldWeight in 3368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
          0.08801262 = weight(abstract_txt:length in 3368) [ClassicSimilarity], result of:
            0.08801262 = score(doc=3368,freq=1.0), product of:
              0.17231424 = queryWeight, product of:
                1.1372412 = boost
                6.537832 = idf(docFreq=173, maxDocs=44218)
                0.023175806 = queryNorm
              0.5107681 = fieldWeight in 3368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.537832 = idf(docFreq=173, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
          0.0250181 = weight(abstract_txt:systems in 3368) [ClassicSimilarity], result of:
            0.0250181 = score(doc=3368,freq=1.0), product of:
              0.093857884 = queryWeight, product of:
                1.1869773 = boost
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.023175806 = queryNorm
              0.26655298 = fieldWeight in 3368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
          0.16921212 = weight(abstract_txt:simulation in 3368) [ClassicSimilarity], result of:
            0.16921212 = score(doc=3368,freq=2.0), product of:
              0.21146354 = queryWeight, product of:
                1.2598237 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.023175806 = queryNorm
              0.8001952 = fieldWeight in 3368, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
          0.06298972 = weight(abstract_txt:models in 3368) [ClassicSimilarity], result of:
            0.06298972 = score(doc=3368,freq=1.0), product of:
              0.1737058 = queryWeight, product of:
                1.6147829 = boost
                4.6415744 = idf(docFreq=1158, maxDocs=44218)
                0.023175806 = queryNorm
              0.362623 = fieldWeight in 3368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6415744 = idf(docFreq=1158, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
          0.08330944 = weight(abstract_txt:text in 3368) [ClassicSimilarity], result of:
            0.08330944 = score(doc=3368,freq=1.0), product of:
              0.2636983 = queryWeight, product of:
                2.8136861 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.023175806 = queryNorm
              0.3159271 = fieldWeight in 3368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
          0.13217837 = weight(abstract_txt:retrieval in 3368) [ClassicSimilarity], result of:
            0.13217837 = score(doc=3368,freq=4.0), product of:
              0.24342665 = queryWeight, product of:
                3.0224636 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.023175806 = queryNorm
              0.5429905 = fieldWeight in 3368, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.078125 = fieldNorm(doc=3368)
        0.36 = coord(9/25)
    
  3. Lam-Adesina, A.M.; Jones, G.J.F.: Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents (2006) 0.24
    0.23552759 = sum of:
      0.23552759 = product of:
        0.84116995 = sum of:
          0.10584467 = weight(abstract_txt:feedback in 977) [ClassicSimilarity], result of:
            0.10584467 = score(doc=977,freq=4.0), product of:
              0.14244828 = queryWeight, product of:
                1.0340002 = boost
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.023175806 = queryNorm
              0.7430393 = fieldWeight in 977, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.0625 = fieldNorm(doc=977)
          0.01632635 = weight(abstract_txt:based in 977) [ClassicSimilarity], result of:
            0.01632635 = score(doc=977,freq=1.0), product of:
              0.081940874 = queryWeight, product of:
                1.1090658 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.023175806 = queryNorm
              0.19924548 = fieldWeight in 977, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=977)
          0.02830475 = weight(abstract_txt:systems in 977) [ClassicSimilarity], result of:
            0.02830475 = score(doc=977,freq=2.0), product of:
              0.093857884 = queryWeight, product of:
                1.1869773 = boost
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.023175806 = queryNorm
              0.3015703 = fieldWeight in 977, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.0625 = fieldNorm(doc=977)
          0.09912664 = weight(abstract_txt:strings in 977) [ClassicSimilarity], result of:
            0.09912664 = score(doc=977,freq=1.0), product of:
              0.21645027 = queryWeight, product of:
                1.2745917 = boost
                7.3274393 = idf(docFreq=78, maxDocs=44218)
                0.023175806 = queryNorm
              0.45796496 = fieldWeight in 977, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.3274393 = idf(docFreq=78, maxDocs=44218)
                0.0625 = fieldNorm(doc=977)
          0.09425387 = weight(abstract_txt:text in 977) [ClassicSimilarity], result of:
            0.09425387 = score(doc=977,freq=2.0), product of:
              0.2636983 = queryWeight, product of:
                2.8136861 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.023175806 = queryNorm
              0.3574307 = fieldWeight in 977, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=977)
          0.12950782 = weight(abstract_txt:retrieval in 977) [ClassicSimilarity], result of:
            0.12950782 = score(doc=977,freq=6.0), product of:
              0.24342665 = queryWeight, product of:
                3.0224636 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.023175806 = queryNorm
              0.5320199 = fieldWeight in 977, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0625 = fieldNorm(doc=977)
          0.36780587 = weight(abstract_txt:errors in 977) [ClassicSimilarity], result of:
            0.36780587 = score(doc=977,freq=3.0), product of:
              0.51877254 = queryWeight, product of:
                3.4177566 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.023175806 = queryNorm
              0.70899254 = fieldWeight in 977, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.0625 = fieldNorm(doc=977)
        0.28 = coord(7/25)
    
  4. Taghva, K.: The effects of noisy data on text retrieval (1994) 0.22
    0.2221423 = sum of:
      0.2221423 = product of:
        0.9255929 = sum of:
          0.09045669 = weight(abstract_txt:applying in 7227) [ClassicSimilarity], result of:
            0.09045669 = score(doc=7227,freq=1.0), product of:
              0.14022744 = queryWeight, product of:
                1.0259082 = boost
                5.8977947 = idf(docFreq=329, maxDocs=44218)
                0.023175806 = queryNorm
              0.64507127 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8977947 = idf(docFreq=329, maxDocs=44218)
                0.109375 = fieldNorm(doc=7227)
          0.027549446 = weight(abstract_txt:with in 7227) [ClassicSimilarity], result of:
            0.027549446 = score(doc=7227,freq=1.0), product of:
              0.100763 = queryWeight, product of:
                1.7392921 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.023175806 = queryNorm
              0.27340835 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.109375 = fieldNorm(doc=7227)
          0.13356271 = weight(abstract_txt:experiments in 7227) [ClassicSimilarity], result of:
            0.13356271 = score(doc=7227,freq=1.0), product of:
              0.22908995 = queryWeight, product of:
                1.8544282 = boost
                5.3304167 = idf(docFreq=581, maxDocs=44218)
                0.023175806 = queryNorm
              0.5830143 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3304167 = idf(docFreq=581, maxDocs=44218)
                0.109375 = fieldNorm(doc=7227)
          0.20988175 = weight(abstract_txt:generated in 7227) [ClassicSimilarity], result of:
            0.20988175 = score(doc=7227,freq=2.0), product of:
              0.24576628 = queryWeight, product of:
                1.9207381 = boost
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.023175806 = queryNorm
              0.8539892 = fieldWeight in 7227, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.109375 = fieldNorm(doc=7227)
          0.09252486 = weight(abstract_txt:retrieval in 7227) [ClassicSimilarity], result of:
            0.09252486 = score(doc=7227,freq=1.0), product of:
              0.24342665 = queryWeight, product of:
                3.0224636 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.023175806 = queryNorm
              0.38009337 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.109375 = fieldNorm(doc=7227)
          0.37161744 = weight(abstract_txt:errors in 7227) [ClassicSimilarity], result of:
            0.37161744 = score(doc=7227,freq=1.0), product of:
              0.51877254 = queryWeight, product of:
                3.4177566 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.023175806 = queryNorm
              0.7163398 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.109375 = fieldNorm(doc=7227)
        0.24 = coord(6/25)
    
  5. Wien, C.: Sample sizes and composition : their effect on recall and precision in IR experiments with OPACs (2000) 0.20
    0.19650093 = sum of:
      0.19650093 = product of:
        0.8187539 = sum of:
          0.071807265 = weight(abstract_txt:recall in 5368) [ClassicSimilarity], result of:
            0.071807265 = score(doc=5368,freq=1.0), product of:
              0.13323428 = queryWeight, product of:
                5.7488523 = idf(docFreq=382, maxDocs=44218)
                0.023175806 = queryNorm
              0.5389549 = fieldWeight in 5368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7488523 = idf(docFreq=382, maxDocs=44218)
                0.09375 = fieldNorm(doc=5368)
          0.077415034 = weight(abstract_txt:size in 5368) [ClassicSimilarity], result of:
            0.077415034 = score(doc=5368,freq=1.0), product of:
              0.14008358 = queryWeight, product of:
                1.0253818 = boost
                5.8947687 = idf(docFreq=330, maxDocs=44218)
                0.023175806 = queryNorm
              0.5526346 = fieldWeight in 5368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8947687 = idf(docFreq=330, maxDocs=44218)
                0.09375 = fieldNorm(doc=5368)
          0.14896996 = weight(abstract_txt:affected in 5368) [ClassicSimilarity], result of:
            0.14896996 = score(doc=5368,freq=2.0), product of:
              0.1720123 = queryWeight, product of:
                1.1362444 = boost
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.023175806 = queryNorm
              0.8660425 = fieldWeight in 5368, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.532101 = idf(docFreq=174, maxDocs=44218)
                0.09375 = fieldNorm(doc=5368)
          0.25599027 = weight(abstract_txt:experiments in 5368) [ClassicSimilarity], result of:
            0.25599027 = score(doc=5368,freq=5.0), product of:
              0.22908995 = queryWeight, product of:
                1.8544282 = boost
                5.3304167 = idf(docFreq=581, maxDocs=44218)
                0.023175806 = queryNorm
              1.1174226 = fieldWeight in 5368, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.3304167 = idf(docFreq=581, maxDocs=44218)
                0.09375 = fieldNorm(doc=5368)
          0.12720756 = weight(abstract_txt:generated in 5368) [ClassicSimilarity], result of:
            0.12720756 = score(doc=5368,freq=1.0), product of:
              0.24576628 = queryWeight, product of:
                1.9207381 = boost
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.023175806 = queryNorm
              0.51759565 = fieldWeight in 5368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.09375 = fieldNorm(doc=5368)
          0.13736379 = weight(abstract_txt:retrieval in 5368) [ClassicSimilarity], result of:
            0.13736379 = score(doc=5368,freq=3.0), product of:
              0.24342665 = queryWeight, product of:
                3.0224636 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.023175806 = queryNorm
              0.5642923 = fieldWeight in 5368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.09375 = fieldNorm(doc=5368)
        0.24 = coord(6/25)
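
The top-level score of each hit is the sum of its matching term weights multiplied by the coordination factor coord(matched/total). The sketch below checks this against the listing, using only numbers copied from it. The names `HITS` and `final_score` are hypothetical and chosen for illustration.

```python
# doc id -> (sum of term weights, matched query terms, listed final score)
HITS = {
    4951: (1.1525562, 8, 0.36881796),
    3368: (0.71229476, 9, 0.25642613),
    977: (0.84116995, 7, 0.23552759),
    7227: (0.9255929, 6, 0.2221423),
    5368: (0.8187539, 6, 0.19650093),
}
TOTAL_QUERY_TERMS = 25  # denominator of every coord(n/25) line

def final_score(term_sum, matched, total=TOTAL_QUERY_TERMS):
    # coord rewards documents that match more of the query terms.
    return term_sum * matched / total

# The term weights listed for doc 5368 add up to its "sum of" line.
terms_5368 = [0.071807265, 0.077415034, 0.14896996,
              0.25599027, 0.12720756, 0.13736379]
print(f"sum for 5368: {sum(terms_5368):.7f}")

for doc_id, (term_sum, matched, listed) in HITS.items():
    computed = final_score(term_sum, matched)
    print(f"doc {doc_id}: computed {computed:.6f}, listed {listed:.6f}")
```

Because every hit shares the same 25-term query, a document with a larger term sum can still rank lower if it matches fewer terms: doc 7227 has a higher sum than doc 3368 but a lower final score.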