Document (#29340)

Author
Brener, N.E.
Iyengar, S.S.
Pianykh, O.S.
Title
¬A conclusive methodology for rating OCR performance
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.12, p.1274-1287
Year
2005
Abstract
One of the most challenging topics in the automatic document rating process is the development of a rating scheme for the image quality of documents. As part of the Department of Energy (DOE) document declassification program, we have developed a generalized rating system to predict the optical character recognition (OCR) accuracy level that is achieved when processing a document. The need for such a system emerged from the declassification of degraded, typewriter-era documents, which is currently a time-consuming manual process. This article presents the statistical analysis of the most influential document quality features affecting OCR accuracy, develops consistent predictive models for four currently used OCR engines, and studies the applicability of different OCR products to the DOE document declassification process. This study is expected to lead to an efficient and completely automated document declassification system.
Object
OCR

Similar documents (content)

  1. Jiang, X.; Tan, A.-H.: CRCTOL: a semantic-based domain ontology learning system (2009) 0.15
    0.15035014 = sum of:
      0.15035014 = product of:
        0.5369648 = sum of:
          0.052571315 = weight(abstract_txt:generalized in 3320) [ClassicSimilarity], result of:
            0.052571315 = score(doc=3320,freq=1.0), product of:
              0.13593265 = queryWeight, product of:
                1.0854646 = boost
                7.071914 = idf(docFreq=101, maxDocs=44218)
                0.017708067 = queryNorm
              0.3867453 = fieldWeight in 3320, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.071914 = idf(docFreq=101, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3320)
          0.020809965 = weight(abstract_txt:documents in 3320) [ClassicSimilarity], result of:
            0.020809965 = score(doc=3320,freq=1.0), product of:
              0.09233127 = queryWeight, product of:
                1.2651533 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.017708067 = queryNorm
              0.22538373 = fieldWeight in 3320, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3320)
          0.042348813 = weight(abstract_txt:quality in 3320) [ClassicSimilarity], result of:
            0.042348813 = score(doc=3320,freq=2.0), product of:
              0.11768436 = queryWeight, product of:
                1.4283285 = boost
                4.6528544 = idf(docFreq=1145, maxDocs=44218)
                0.017708067 = queryNorm
              0.35985082 = fieldWeight in 3320, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6528544 = idf(docFreq=1145, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3320)
          0.017101768 = weight(abstract_txt:system in 3320) [ClassicSimilarity], result of:
            0.017101768 = score(doc=3320,freq=1.0), product of:
              0.09273115 = queryWeight, product of:
                1.5528418 = boost
                3.3723085 = idf(docFreq=4123, maxDocs=44218)
                0.017708067 = queryNorm
              0.18442312 = fieldWeight in 3320, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3723085 = idf(docFreq=4123, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3320)
          0.06325567 = weight(abstract_txt:accuracy in 3320) [ClassicSimilarity], result of:
            0.06325567 = score(doc=3320,freq=1.0), product of:
              0.19374664 = queryWeight, product of:
                1.8326766 = boost
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.017708067 = queryNorm
              0.32648653 = fieldWeight in 3320, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3320)
          0.07054283 = weight(abstract_txt:document in 3320) [ClassicSimilarity], result of:
            0.07054283 = score(doc=3320,freq=1.0), product of:
              0.30049935 = queryWeight, product of:
                3.9532216 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017708067 = queryNorm
              0.23475201 = fieldWeight in 3320, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3320)
          0.27033445 = weight(abstract_txt:rating in 3320) [ClassicSimilarity], result of:
            0.27033445 = score(doc=3320,freq=1.0), product of:
              0.6428537 = queryWeight, product of:
                4.721063 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.017708067 = queryNorm
              0.4205225 = fieldWeight in 3320, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3320)
        0.28 = coord(7/25)
    
  2. Li, H.; Bhowmick, S.S.; Sun, A.: AffRank: affinity-driven ranking of products in online social rating networks (2011) 0.14
    0.13977778 = sum of:
      0.13977778 = product of:
        0.6988889 = sum of:
          0.052871823 = weight(abstract_txt:predict in 4483) [ClassicSimilarity], result of:
            0.052871823 = score(doc=4483,freq=1.0), product of:
              0.12482822 = queryWeight, product of:
                1.0401839 = boost
                6.7769065 = idf(docFreq=136, maxDocs=44218)
                0.017708067 = queryNorm
              0.42355666 = fieldWeight in 4483, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7769065 = idf(docFreq=136, maxDocs=44218)
                0.0625 = fieldNorm(doc=4483)
          0.055832125 = weight(abstract_txt:affecting in 4483) [ClassicSimilarity], result of:
            0.055832125 = score(doc=4483,freq=1.0), product of:
              0.12944523 = queryWeight, product of:
                1.0592458 = boost
                6.901097 = idf(docFreq=120, maxDocs=44218)
                0.017708067 = queryNorm
              0.43131855 = fieldWeight in 4483, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.901097 = idf(docFreq=120, maxDocs=44218)
                0.0625 = fieldNorm(doc=4483)
          0.020838626 = weight(abstract_txt:most in 4483) [ClassicSimilarity], result of:
            0.020838626 = score(doc=4483,freq=1.0), product of:
              0.08454462 = queryWeight, product of:
                1.2106309 = boost
                3.943693 = idf(docFreq=2328, maxDocs=44218)
                0.017708067 = queryNorm
              0.24648081 = fieldWeight in 4483, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.943693 = idf(docFreq=2328, maxDocs=44218)
                0.0625 = fieldNorm(doc=4483)
          0.034223013 = weight(abstract_txt:quality in 4483) [ClassicSimilarity], result of:
            0.034223013 = score(doc=4483,freq=1.0), product of:
              0.11768436 = queryWeight, product of:
                1.4283285 = boost
                4.6528544 = idf(docFreq=1145, maxDocs=44218)
                0.017708067 = queryNorm
              0.2908034 = fieldWeight in 4483, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6528544 = idf(docFreq=1145, maxDocs=44218)
                0.0625 = fieldNorm(doc=4483)
          0.53512335 = weight(abstract_txt:rating in 4483) [ClassicSimilarity], result of:
            0.53512335 = score(doc=4483,freq=3.0), product of:
              0.6428537 = queryWeight, product of:
                4.721063 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.017708067 = queryNorm
              0.8324186 = fieldWeight in 4483, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.0625 = fieldNorm(doc=4483)
        0.2 = coord(5/25)
    
  3. Taylor, S.L.: Integrating natural language understanding with document structure analysis (1994) 0.14
    0.13859768 = sum of:
      0.13859768 = product of:
        0.5774903 = sum of:
          0.07046689 = weight(abstract_txt:character in 1794) [ClassicSimilarity], result of:
            0.07046689 = score(doc=1794,freq=1.0), product of:
              0.11536989 = queryWeight, product of:
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.017708067 = queryNorm
              0.61079097 = fieldWeight in 1794, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.09375 = fieldNorm(doc=1794)
          0.074193396 = weight(abstract_txt:develops in 1794) [ClassicSimilarity], result of:
            0.074193396 = score(doc=1794,freq=1.0), product of:
              0.11940227 = queryWeight, product of:
                1.0173258 = boost
                6.627983 = idf(docFreq=158, maxDocs=44218)
                0.017708067 = queryNorm
              0.6213734 = fieldWeight in 1794, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.627983 = idf(docFreq=158, maxDocs=44218)
                0.09375 = fieldNorm(doc=1794)
          0.10184587 = weight(abstract_txt:optical in 1794) [ClassicSimilarity], result of:
            0.10184587 = score(doc=1794,freq=1.0), product of:
              0.1474794 = queryWeight, product of:
                1.1306273 = boost
                7.3661537 = idf(docFreq=75, maxDocs=44218)
                0.017708067 = queryNorm
              0.6905769 = fieldWeight in 1794, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.3661537 = idf(docFreq=75, maxDocs=44218)
                0.09375 = fieldNorm(doc=1794)
          0.03125794 = weight(abstract_txt:most in 1794) [ClassicSimilarity], result of:
            0.03125794 = score(doc=1794,freq=1.0), product of:
              0.08454462 = queryWeight, product of:
                1.2106309 = boost
                3.943693 = idf(docFreq=2328, maxDocs=44218)
                0.017708067 = queryNorm
              0.3697212 = fieldWeight in 1794, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.943693 = idf(docFreq=2328, maxDocs=44218)
                0.09375 = fieldNorm(doc=1794)
          0.029317316 = weight(abstract_txt:system in 1794) [ClassicSimilarity], result of:
            0.029317316 = score(doc=1794,freq=1.0), product of:
              0.09273115 = queryWeight, product of:
                1.5528418 = boost
                3.3723085 = idf(docFreq=4123, maxDocs=44218)
                0.017708067 = queryNorm
              0.3161539 = fieldWeight in 1794, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3723085 = idf(docFreq=4123, maxDocs=44218)
                0.09375 = fieldNorm(doc=1794)
          0.27040896 = weight(abstract_txt:document in 1794) [ClassicSimilarity], result of:
            0.27040896 = score(doc=1794,freq=5.0), product of:
              0.30049935 = queryWeight, product of:
                3.9532216 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017708067 = queryNorm
              0.8998654 = fieldWeight in 1794, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.09375 = fieldNorm(doc=1794)
        0.24 = coord(6/25)
    
  4. Taghva, K.; Borsack, J.; Condit, A.: Effects of OCR errors on ranking and feedback using the vector space model (1996) 0.12
    0.12202537 = sum of:
      0.12202537 = product of:
        0.61012685 = sum of:
          0.082211375 = weight(abstract_txt:character in 4951) [ClassicSimilarity], result of:
            0.082211375 = score(doc=4951,freq=1.0), product of:
              0.11536989 = queryWeight, product of:
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.017708067 = queryNorm
              0.7125895 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.11882018 = weight(abstract_txt:optical in 4951) [ClassicSimilarity], result of:
            0.11882018 = score(doc=4951,freq=1.0), product of:
              0.1474794 = queryWeight, product of:
                1.1306273 = boost
                7.3661537 = idf(docFreq=75, maxDocs=44218)
                0.017708067 = queryNorm
              0.80567306 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.3661537 = idf(docFreq=75, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.04161993 = weight(abstract_txt:documents in 4951) [ClassicSimilarity], result of:
            0.04161993 = score(doc=4951,freq=1.0), product of:
              0.09233127 = queryWeight, product of:
                1.2651533 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.017708067 = queryNorm
              0.45076746 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.22638972 = weight(abstract_txt:degraded in 4951) [ClassicSimilarity], result of:
            0.22638972 = score(doc=4951,freq=1.0), product of:
              0.22666042 = queryWeight, product of:
                1.4016565 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.017708067 = queryNorm
              0.9988057 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
          0.14108565 = weight(abstract_txt:document in 4951) [ClassicSimilarity], result of:
            0.14108565 = score(doc=4951,freq=1.0), product of:
              0.30049935 = queryWeight, product of:
                3.9532216 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017708067 = queryNorm
              0.46950403 = fieldWeight in 4951, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.109375 = fieldNorm(doc=4951)
        0.2 = coord(5/25)
    
  5. Broadhurst, R.: ¬The digitisation of library material (1993) 0.11
    0.113619186 = sum of:
      0.113619186 = product of:
        0.5680959 = sum of:
          0.09395585 = weight(abstract_txt:character in 6256) [ClassicSimilarity], result of:
            0.09395585 = score(doc=6256,freq=1.0), product of:
              0.11536989 = queryWeight, product of:
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.017708067 = queryNorm
              0.814388 = fieldWeight in 6256, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.515104 = idf(docFreq=177, maxDocs=44218)
                0.125 = fieldNorm(doc=6256)
          0.13579449 = weight(abstract_txt:optical in 6256) [ClassicSimilarity], result of:
            0.13579449 = score(doc=6256,freq=1.0), product of:
              0.1474794 = queryWeight, product of:
                1.1306273 = boost
                7.3661537 = idf(docFreq=75, maxDocs=44218)
                0.017708067 = queryNorm
              0.9207692 = fieldWeight in 6256, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.3661537 = idf(docFreq=75, maxDocs=44218)
                0.125 = fieldNorm(doc=6256)
          0.10934515 = weight(abstract_txt:currently in 6256) [ClassicSimilarity], result of:
            0.10934515 = score(doc=6256,freq=1.0), product of:
              0.16082478 = queryWeight, product of:
                1.6697261 = boost
                5.4392195 = idf(docFreq=521, maxDocs=44218)
                0.017708067 = queryNorm
              0.67990243 = fieldWeight in 6256, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4392195 = idf(docFreq=521, maxDocs=44218)
                0.125 = fieldNorm(doc=6256)
          0.06775971 = weight(abstract_txt:process in 6256) [ClassicSimilarity], result of:
            0.06775971 = score(doc=6256,freq=1.0), product of:
              0.13381292 = queryWeight, product of:
                1.8653631 = boost
                4.0510116 = idf(docFreq=2091, maxDocs=44218)
                0.017708067 = queryNorm
              0.50637645 = fieldWeight in 6256, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0510116 = idf(docFreq=2091, maxDocs=44218)
                0.125 = fieldNorm(doc=6256)
          0.16124074 = weight(abstract_txt:document in 6256) [ClassicSimilarity], result of:
            0.16124074 = score(doc=6256,freq=1.0), product of:
              0.30049935 = queryWeight, product of:
                3.9532216 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017708067 = queryNorm
              0.53657603 = fieldWeight in 6256, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.125 = fieldNorm(doc=6256)
        0.2 = coord(5/25)
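
Note on reading the score breakdowns: the figures listed above appear to follow the classic Lucene TF-IDF arithmetic, in which each term score is queryWeight times fieldWeight and the document score is the sum of the term scores multiplied by the coord factor. The Python sketch below recomputes one term and one document score from the values printed in entry 2 (doc 4483). The function names (idf, tf, term_score, document_score) are illustrative and are not Lucene API names. The idf formula 1 + ln(maxDocs / (docFreq + 1)) is an assumption that happens to reproduce the listed idf values, and a term shown without a boost line (such as "character") is assumed to have a boost of 1.0.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    # Assumed formula; it reproduces the idf values printed in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    # The listing shows tf = sqrt(termFreq), e.g. 1.7320508 for freq=3.0.
    return math.sqrt(freq)


def term_score(boost: float, doc_freq: int, max_docs: int,
               query_norm: float, freq: float, field_norm: float) -> float:
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = tf * idf * fieldNorm
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(term_scores: list[float], matched: int, total: int) -> float:
    # Final score = (sum of term scores) * coord(matched/total).
    return sum(term_scores) * (matched / total)


# "rating" in doc 4483, using the values printed in entry 2.
rating = term_score(boost=4.721063, doc_freq=54, max_docs=44218,
                    query_norm=0.017708067, freq=3.0, field_norm=0.0625)
print(rating)  # close to the listed 0.53512335

# The five term scores listed for doc 4483, combined with coord(5/25).
entry2_terms = [0.052871823, 0.055832125, 0.020838626, 0.034223013, 0.53512335]
print(document_score(entry2_terms, matched=5, total=25))  # close to 0.13977778
```

Small differences in the last digits are expected, because the listed values look like 32-bit float results while Python computes in 64-bit floats.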