Document (#13555)

Author
Taghva, K.
Borsack, J.
Condit, A.
Title
Evaluation of model-based retrieval effectiveness with OCR text
Source
ACM transactions on information systems. 14(1996) no.1, pp.64-93
Year
1996
Abstract
Reports on experiments with retrieval from OCR-generated text using systems based on standard models of retrieval. Shows that average precision and recall are not affected by OCR errors across systems for several collections. Both the actual and the simulation experiments include full-text and abstract-length documents. The ranking and feedback methods associated with the retrieval models are generally not robust enough to deal with OCR errors. OCR errors and garbage strings generated from the mistranslation of graphic objects increase the size of the index significantly. Describes the problems of applying OCR text within an information retrieval environment and offers solutions.

Similar documents (content)

  1. Taghva, K.; Borsack, J.; Condit, A.: Effects of OCR errors on ranking and feedback using the vector space model (1996) 0.37
    0.36750907 = sum of:
      0.36750907 = product of:
        1.1484659 = sum of:
          0.08199536 = weight(abstract_txt:recall in 5020) [ClassicSimilarity], result of:
            0.08199536 = score(doc=5020,freq=1.0), product of:
              0.13093273 = queryWeight, product of:
                5.725626 = idf(docFreq=371, maxDocs=41962)
                0.022867847 = queryNorm
              0.6262404 = fieldWeight in 5020, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.725626 = idf(docFreq=371, maxDocs=41962)
                0.109375 = fieldNorm(doc=5020)
          0.091281615 = weight(abstract_txt:average in 5020) [ClassicSimilarity], result of:
            0.091281615 = score(doc=5020,freq=1.0), product of:
              0.14064068 = queryWeight, product of:
                1.0364094 = boost
                5.9340925 = idf(docFreq=301, maxDocs=41962)
                0.022867847 = queryNorm
              0.64904135 = fieldWeight in 5020, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9340925 = idf(docFreq=301, maxDocs=41962)
                0.109375 = fieldNorm(doc=5020)
          0.09221083 = weight(abstract_txt:feedback in 5020) [ClassicSimilarity], result of:
            0.09221083 = score(doc=5020,freq=1.0), product of:
              0.1415935 = queryWeight, product of:
                1.0399143 = boost
                5.95416 = idf(docFreq=295, maxDocs=41962)
                0.022867847 = queryNorm
              0.6512363 = fieldWeight in 5020, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.95416 = idf(docFreq=295, maxDocs=41962)
                0.109375 = fieldNorm(doc=5020)
          0.12492097 = weight(abstract_txt:affected in 5020) [ClassicSimilarity], result of:
            0.12492097 = score(doc=5020,freq=1.0), product of:
              0.1733587 = queryWeight, product of:
                1.1506644 = boost
                6.588274 = idf(docFreq=156, maxDocs=41962)
                0.022867847 = queryNorm
              0.7205925 = fieldWeight in 5020, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.588274 = idf(docFreq=156, maxDocs=41962)
                0.109375 = fieldNorm(doc=5020)
          0.028442683 = weight(abstract_txt:with in 5020) [ClassicSimilarity], result of:
            0.028442683 = score(doc=5020,freq=1.0), product of:
              0.102609865 = queryWeight, product of:
                1.7705183 = boost
                2.5343313 = idf(docFreq=9046, maxDocs=41962)
                0.022867847 = queryNorm
              0.27719247 = fieldWeight in 5020, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5343313 = idf(docFreq=9046, maxDocs=41962)
                0.109375 = fieldNorm(doc=5020)
          0.11656672 = weight(abstract_txt:text in 5020) [ClassicSimilarity], result of:
            0.11656672 = score(doc=5020,freq=1.0), product of:
              0.26277968 = queryWeight, product of:
                2.83336 = boost
                4.05569 = idf(docFreq=1975, maxDocs=41962)
                0.022867847 = queryNorm
              0.44359106 = fieldWeight in 5020, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.05569 = idf(docFreq=1975, maxDocs=41962)
                0.109375 = fieldNorm(doc=5020)
          0.09058294 = weight(abstract_txt:retrieval in 5020) [ClassicSimilarity], result of:
            0.09058294 = score(doc=5020,freq=1.0), product of:
              0.23926343 = queryWeight, product of:
                3.0227277 = boost
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.022867847 = queryNorm
              0.37859082 = fieldWeight in 5020, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.109375 = fieldNorm(doc=5020)
          0.52246475 = weight(abstract_txt:errors in 5020) [ClassicSimilarity], result of:
            0.52246475 = score(doc=5020,freq=2.0), product of:
              0.51513827 = queryWeight, product of:
                3.435567 = boost
                6.5569234 = idf(docFreq=161, maxDocs=41962)
                0.022867847 = queryNorm
              1.0142224 = fieldWeight in 5020, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5569234 = idf(docFreq=161, maxDocs=41962)
                0.109375 = fieldNorm(doc=5020)
        0.32 = coord(8/25)
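
The tree above follows the ClassicSimilarity (TF-IDF) arithmetic: each term contributes queryWeight × fieldWeight, and the sum is scaled by the coord factor. As a minimal sketch, the Python below recomputes the `errors` term weight and the final score for doc 5020 from the printed values. The function names `term_score` and `document_score` are illustrative and not part of Lucene, and Python doubles can differ from Lucene's 32-bit floats in the last digits.

```python
import math

def term_score(freq, idf, query_norm, field_norm, boost=1.0):
    """One 'weight(...)' node: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm            # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight

def document_score(term_scores, matched, query_terms):
    """Sum of term scores scaled by coord = matched / query_terms."""
    return sum(term_scores) * (matched / query_terms)

# 'errors' in doc 5020, values copied from the tree above.
errors_score = term_score(freq=2.0, idf=6.5569234, query_norm=0.022867847,
                          field_norm=0.109375, boost=3.435567)

# The eight term scores listed for doc 5020; coord is 8/25.
doc_5020_terms = [0.08199536, 0.091281615, 0.09221083, 0.12492097,
                  0.028442683, 0.11656672, 0.09058294, 0.52246475]
doc_5020_score = document_score(doc_5020_terms, matched=8, query_terms=25)

print(round(errors_score, 6), round(doc_5020_score, 6))
```

Both results should agree with the tree (0.52246475 and 0.36750907) to within rounding.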
    
  2. Losee, R.M.: Determining information retrieval and filtering performance without experimentation (1995) 0.26
    0.25800517 = sum of:
      0.25800517 = product of:
        0.716681 = sum of:
          0.065201156 = weight(abstract_txt:average in 3437) [ClassicSimilarity], result of:
            0.065201156 = score(doc=3437,freq=1.0), product of:
              0.14064068 = queryWeight, product of:
                1.0364094 = boost
                5.9340925 = idf(docFreq=301, maxDocs=41962)
                0.022867847 = queryNorm
              0.463601 = fieldWeight in 3437, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9340925 = idf(docFreq=301, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
          0.065864876 = weight(abstract_txt:feedback in 3437) [ClassicSimilarity], result of:
            0.065864876 = score(doc=3437,freq=1.0), product of:
              0.1415935 = queryWeight, product of:
                1.0399143 = boost
                5.95416 = idf(docFreq=295, maxDocs=41962)
                0.022867847 = queryNorm
              0.46516877 = fieldWeight in 3437, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.95416 = idf(docFreq=295, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
          0.020879444 = weight(abstract_txt:based in 3437) [ClassicSimilarity], result of:
            0.020879444 = score(doc=3437,freq=1.0), product of:
              0.08293987 = queryWeight, product of:
                1.12557 = boost
                3.2222967 = idf(docFreq=4546, maxDocs=41962)
                0.022867847 = queryNorm
              0.25174195 = fieldWeight in 3437, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2222967 = idf(docFreq=4546, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
          0.089229256 = weight(abstract_txt:length in 3437) [ClassicSimilarity], result of:
            0.089229256 = score(doc=3437,freq=1.0), product of:
              0.1733587 = queryWeight, product of:
                1.1506644 = boost
                6.588274 = idf(docFreq=156, maxDocs=41962)
                0.022867847 = queryNorm
              0.5147089 = fieldWeight in 3437, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.588274 = idf(docFreq=156, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
          0.024971025 = weight(abstract_txt:systems in 3437) [ClassicSimilarity], result of:
            0.024971025 = score(doc=3437,freq=1.0), product of:
              0.09344907 = queryWeight, product of:
                1.1947536 = boost
                3.4203563 = idf(docFreq=3729, maxDocs=41962)
                0.022867847 = queryNorm
              0.26721534 = fieldWeight in 3437, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4203563 = idf(docFreq=3729, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
          0.17171308 = weight(abstract_txt:simulation in 3437) [ClassicSimilarity], result of:
            0.17171308 = score(doc=3437,freq=2.0), product of:
              0.2128791 = queryWeight, product of:
                1.2750945 = boost
                7.3007145 = idf(docFreq=76, maxDocs=41962)
                0.022867847 = queryNorm
              0.80662256 = fieldWeight in 3437, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.3007145 = idf(docFreq=76, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
          0.06615599 = weight(abstract_txt:models in 3437) [ClassicSimilarity], result of:
            0.06615599 = score(doc=3437,freq=1.0), product of:
              0.17892192 = queryWeight, product of:
                1.6531895 = boost
                4.7327724 = idf(docFreq=1003, maxDocs=41962)
                0.022867847 = queryNorm
              0.36974785 = fieldWeight in 3437, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7327724 = idf(docFreq=1003, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
          0.083261944 = weight(abstract_txt:text in 3437) [ClassicSimilarity], result of:
            0.083261944 = score(doc=3437,freq=1.0), product of:
              0.26277968 = queryWeight, product of:
                2.83336 = boost
                4.05569 = idf(docFreq=1975, maxDocs=41962)
                0.022867847 = queryNorm
              0.31685078 = fieldWeight in 3437, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.05569 = idf(docFreq=1975, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
          0.1294042 = weight(abstract_txt:retrieval in 3437) [ClassicSimilarity], result of:
            0.1294042 = score(doc=3437,freq=4.0), product of:
              0.23926343 = queryWeight, product of:
                3.0227277 = boost
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.022867847 = queryNorm
              0.540844 = fieldWeight in 3437, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.078125 = fieldNorm(doc=3437)
        0.36 = coord(9/25)
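
The idf and tf figures in these trees are consistent with the ClassicSimilarity definitions idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq). The sketch below checks that against a few printed values; `classic_idf` and `classic_tf` are hypothetical helper names, and small differences in the last digits are expected because Lucene computes in 32-bit floats.

```python
import math

def classic_idf(doc_freq, max_docs):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def classic_tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)

MAX_DOCS = 41962  # maxDocs as printed in every idf line

idf_recall = classic_idf(371, MAX_DOCS)  # printed as 5.725626
idf_errors = classic_idf(161, MAX_DOCS)  # printed as 6.5569234
tf_four = classic_tf(4.0)                # printed as 2.0
tf_six = classic_tf(6.0)                 # printed as 2.4494898

print(idf_recall, idf_errors, tf_four, tf_six)
```

Rare terms such as `errors` (docFreq=161) therefore receive a higher idf than common ones such as `with` (docFreq=9046), which is why a single rare match can dominate a document's score.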
    
  3. Lam-Adesina, A.M.; Jones, G.J.F.: Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents (2006) 0.23
    0.23430903 = sum of:
      0.23430903 = product of:
        0.836818 = sum of:
          0.1053838 = weight(abstract_txt:feedback in 2978) [ClassicSimilarity], result of:
            0.1053838 = score(doc=2978,freq=4.0), product of:
              0.1415935 = queryWeight, product of:
                1.0399143 = boost
                5.95416 = idf(docFreq=295, maxDocs=41962)
                0.022867847 = queryNorm
              0.74427 = fieldWeight in 2978, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.95416 = idf(docFreq=295, maxDocs=41962)
                0.0625 = fieldNorm(doc=2978)
          0.016703555 = weight(abstract_txt:based in 2978) [ClassicSimilarity], result of:
            0.016703555 = score(doc=2978,freq=1.0), product of:
              0.08293987 = queryWeight, product of:
                1.12557 = boost
                3.2222967 = idf(docFreq=4546, maxDocs=41962)
                0.022867847 = queryNorm
              0.20139354 = fieldWeight in 2978, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.2222967 = idf(docFreq=4546, maxDocs=41962)
                0.0625 = fieldNorm(doc=2978)
          0.02825149 = weight(abstract_txt:systems in 2978) [ClassicSimilarity], result of:
            0.02825149 = score(doc=2978,freq=2.0), product of:
              0.09344907 = queryWeight, product of:
                1.1947536 = boost
                3.4203563 = idf(docFreq=3729, maxDocs=41962)
                0.022867847 = queryNorm
              0.30231965 = fieldWeight in 2978, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4203563 = idf(docFreq=3729, maxDocs=41962)
                0.0625 = fieldNorm(doc=2978)
          0.099840164 = weight(abstract_txt:strings in 2978) [ClassicSimilarity], result of:
            0.099840164 = score(doc=2978,freq=1.0), product of:
              0.21681248 = queryWeight, product of:
                1.2868207 = boost
                7.3678536 = idf(docFreq=71, maxDocs=41962)
                0.022867847 = queryNorm
              0.46049085 = fieldWeight in 2978, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.3678536 = idf(docFreq=71, maxDocs=41962)
                0.0625 = fieldNorm(doc=2978)
          0.094200134 = weight(abstract_txt:text in 2978) [ClassicSimilarity], result of:
            0.094200134 = score(doc=2978,freq=2.0), product of:
              0.26277968 = queryWeight, product of:
                2.83336 = boost
                4.05569 = idf(docFreq=1975, maxDocs=41962)
                0.022867847 = queryNorm
              0.35847571 = fieldWeight in 2978, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.05569 = idf(docFreq=1975, maxDocs=41962)
                0.0625 = fieldNorm(doc=2978)
          0.1267897 = weight(abstract_txt:retrieval in 2978) [ClassicSimilarity], result of:
            0.1267897 = score(doc=2978,freq=6.0), product of:
              0.23926343 = queryWeight, product of:
                3.0227277 = boost
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.022867847 = queryNorm
              0.52991676 = fieldWeight in 2978, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.0625 = fieldNorm(doc=2978)
          0.36564913 = weight(abstract_txt:errors in 2978) [ClassicSimilarity], result of:
            0.36564913 = score(doc=2978,freq=3.0), product of:
              0.51513827 = queryWeight, product of:
                3.435567 = boost
                6.5569234 = idf(docFreq=161, maxDocs=41962)
                0.022867847 = queryNorm
              0.70980775 = fieldWeight in 2978, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.5569234 = idf(docFreq=161, maxDocs=41962)
                0.0625 = fieldNorm(doc=2978)
        0.28 = coord(7/25)
    
  4. Taghva, K.: The effects of noisy data on text retrieval (1994) 0.22
    0.22344148 = sum of:
      0.22344148 = product of:
        0.9310062 = sum of:
          0.09143477 = weight(abstract_txt:applying in 7227) [ClassicSimilarity], result of:
            0.09143477 = score(doc=7227,freq=1.0), product of:
              0.14079794 = queryWeight, product of:
                1.0369887 = boost
                5.9374094 = idf(docFreq=300, maxDocs=41962)
                0.022867847 = queryNorm
              0.64940417 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9374094 = idf(docFreq=300, maxDocs=41962)
                0.109375 = fieldNorm(doc=7227)
          0.028442683 = weight(abstract_txt:with in 7227) [ClassicSimilarity], result of:
            0.028442683 = score(doc=7227,freq=1.0), product of:
              0.102609865 = queryWeight, product of:
                1.7705183 = boost
                2.5343313 = idf(docFreq=9046, maxDocs=41962)
                0.022867847 = queryNorm
              0.27719247 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5343313 = idf(docFreq=9046, maxDocs=41962)
                0.109375 = fieldNorm(doc=7227)
          0.13715596 = weight(abstract_txt:experiments in 7227) [ClassicSimilarity], result of:
            0.13715596 = score(doc=7227,freq=1.0), product of:
              0.2324566 = queryWeight, product of:
                1.884351 = boost
                5.3945446 = idf(docFreq=517, maxDocs=41962)
                0.022867847 = queryNorm
              0.5900283 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3945446 = idf(docFreq=517, maxDocs=41962)
                0.109375 = fieldNorm(doc=7227)
          0.21395147 = weight(abstract_txt:generated in 7227) [ClassicSimilarity], result of:
            0.21395147 = score(doc=7227,freq=2.0), product of:
              0.24816027 = queryWeight, product of:
                1.9469599 = boost
                5.573782 = idf(docFreq=432, maxDocs=41962)
                0.022867847 = queryNorm
              0.8621504 = fieldWeight in 7227, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.573782 = idf(docFreq=432, maxDocs=41962)
                0.109375 = fieldNorm(doc=7227)
          0.09058294 = weight(abstract_txt:retrieval in 7227) [ClassicSimilarity], result of:
            0.09058294 = score(doc=7227,freq=1.0), product of:
              0.23926343 = queryWeight, product of:
                3.0227277 = boost
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.022867847 = queryNorm
              0.37859082 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.109375 = fieldNorm(doc=7227)
          0.36943838 = weight(abstract_txt:errors in 7227) [ClassicSimilarity], result of:
            0.36943838 = score(doc=7227,freq=1.0), product of:
              0.51513827 = queryWeight, product of:
                3.435567 = boost
                6.5569234 = idf(docFreq=161, maxDocs=41962)
                0.022867847 = queryNorm
              0.7171635 = fieldWeight in 7227, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5569234 = idf(docFreq=161, maxDocs=41962)
                0.109375 = fieldNorm(doc=7227)
        0.24 = coord(6/25)
    
  5. Wien, C.: Sample sizes and composition : their effect on recall and precision in IR experiments with OPACs (2000) 0.20
    0.1983821 = sum of:
      0.1983821 = product of:
        0.8265921 = sum of:
          0.07028174 = weight(abstract_txt:recall in 369) [ClassicSimilarity], result of:
            0.07028174 = score(doc=369,freq=1.0), product of:
              0.13093273 = queryWeight, product of:
                5.725626 = idf(docFreq=371, maxDocs=41962)
                0.022867847 = queryNorm
              0.53677744 = fieldWeight in 369, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.725626 = idf(docFreq=371, maxDocs=41962)
                0.09375 = fieldNorm(doc=369)
          0.077851065 = weight(abstract_txt:size in 369) [ClassicSimilarity], result of:
            0.077851065 = score(doc=369,freq=1.0), product of:
              0.14017254 = queryWeight, product of:
                1.0346831 = boost
                5.924208 = idf(docFreq=304, maxDocs=41962)
                0.022867847 = queryNorm
              0.55539453 = fieldWeight in 369, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.924208 = idf(docFreq=304, maxDocs=41962)
                0.09375 = fieldNorm(doc=369)
          0.15142708 = weight(abstract_txt:affected in 369) [ClassicSimilarity], result of:
            0.15142708 = score(doc=369,freq=2.0), product of:
              0.1733587 = queryWeight, product of:
                1.1506644 = boost
                6.588274 = idf(docFreq=156, maxDocs=41962)
                0.022867847 = queryNorm
              0.87349 = fieldWeight in 369, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.588274 = idf(docFreq=156, maxDocs=41962)
                0.09375 = fieldNorm(doc=369)
          0.26287723 = weight(abstract_txt:experiments in 369) [ClassicSimilarity], result of:
            0.26287723 = score(doc=369,freq=5.0), product of:
              0.2324566 = queryWeight, product of:
                1.884351 = boost
                5.3945446 = idf(docFreq=517, maxDocs=41962)
                0.022867847 = queryNorm
              1.1308658 = fieldWeight in 369, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.3945446 = idf(docFreq=517, maxDocs=41962)
                0.09375 = fieldNorm(doc=369)
          0.12967418 = weight(abstract_txt:generated in 369) [ClassicSimilarity], result of:
            0.12967418 = score(doc=369,freq=1.0), product of:
              0.24816027 = queryWeight, product of:
                1.9469599 = boost
                5.573782 = idf(docFreq=432, maxDocs=41962)
                0.022867847 = queryNorm
              0.52254206 = fieldWeight in 369, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.573782 = idf(docFreq=432, maxDocs=41962)
                0.09375 = fieldNorm(doc=369)
          0.13448079 = weight(abstract_txt:retrieval in 369) [ClassicSimilarity], result of:
            0.13448079 = score(doc=369,freq=3.0), product of:
              0.23926343 = queryWeight, product of:
                3.0227277 = boost
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.022867847 = queryNorm
              0.5620616 = fieldWeight in 369, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4614017 = idf(docFreq=3579, maxDocs=41962)
                0.09375 = fieldNorm(doc=369)
        0.24 = coord(6/25)