Improvement of Optical Character Recognition on Scanned Historical Documents Using Image Processing
University of Gävle, Faculty of Engineering and Sustainable Development, Department of Computer and Geospatial Sciences, Computer Science.
2021 (English). Independent thesis, Basic level (professional degree), 10 credits / 15 HE credits. Student thesis
Abstract [en]

To improve access to historical documents, many institutions have been digitizing their archives since the advent of Optical Character Recognition (OCR). Old scanned documents often show deterioration acquired over time or introduced by early printing methods. Common visual attributes include variations in style and font, broken characters, uneven ink intensity, noise, and damage from folding or tearing. Many of these attributes are unfavorable for modern OCR tools and can cause character recognition to fail. This study addresses the problem by applying image processing methods to improve recognition results. It also analyzes the image quality characteristics commonly found in scanned historical documents whose text cannot be identified. The OCR tool used was the open-source Tesseract software. Gaussian lowpass filtering, Otsu’s optimum thresholding method, and morphological operations were used to prepare the historical documents for Tesseract. The OCR output was evaluated using precision and recall: recall improved by 63 percentage points and precision by 18 percentage points. This shows that image pre-processing is an effective way to increase the readability of historical documents for OCR tools. The characteristics found to be especially disadvantageous for Tesseract are font deviations, extraneous objects, character fading, broken characters, and Poisson noise.
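Two of the steps named in the abstract, Otsu's thresholding and the precision and recall evaluation, can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation. The function names `otsu_threshold` and `precision_recall` are hypothetical, the threshold works on a plain grey-level histogram rather than an image file, and matching OCR output to ground truth word by word is an assumption, since the abstract does not state the unit of comparison. Gaussian filtering, morphological operations and the call to Tesseract are left out.

```python
from collections import Counter


def otsu_threshold(histogram):
    """Return the grey level t that maximises the between-class variance.

    Pixels with value <= t form the low class, pixels with value > t
    the high class. `histogram[i]` is the number of pixels at level i.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_level, best_variance = 0, -1.0
    count_low = 0  # pixels in the low class so far
    sum_low = 0    # sum of grey levels in the low class so far
    for level, count in enumerate(histogram):
        count_low += count
        sum_low += level * count
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, nothing to separate
        mean_low = sum_low / count_low
        mean_high = (weighted_total - sum_low) / count_high
        # Proportional to the between-class variance (constant factor dropped).
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def precision_recall(recognized_text, ground_truth_text):
    """Word-level precision and recall (assumed unit of comparison).

    A recognized word counts as a true positive if it also occurs in the
    ground truth; repeated words are matched at most as often as they occur.
    """
    recognized = Counter(recognized_text.split())
    truth = Counter(ground_truth_text.split())
    true_positives = sum((recognized & truth).values())
    recognized_total = sum(recognized.values())
    truth_total = sum(truth.values())
    precision = true_positives / recognized_total if recognized_total else 0.0
    recall = true_positives / truth_total if truth_total else 0.0
    return precision, recall


if __name__ == "__main__":
    # Synthetic bimodal histogram: dark ink around 50, light paper around 200.
    hist = [0] * 256
    hist[50], hist[200] = 100, 100
    print("threshold:", otsu_threshold(hist))
    print(precision_recall("the old scanned docurnent",
                           "the old scanned document was torn"))
```

In this sketch, precision is the share of recognized words that are correct and recall is the share of ground-truth words that were recognized, so an improvement "by 63 percentage points" is an absolute change in the recall ratio between unprocessed and pre-processed input.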

Place, publisher, year, edition, pages
2021. p. 41
Keywords [en]
Image pre-processing, Tesseract, Optical Character Recognition, Historical documents, Precision and Recall
National Category
Engineering and Technology Computer Systems
Identifiers
URN: urn:nbn:se:hig:diva-36244
OAI: oai:DiVA.org:hig-36244
DiVA, id: diva2:1566673
External cooperation
Stockholms Stadsarkiv
Subject / course
Computer science
Educational program
Högskoleingenjör
Supervisors
Examiners
Available from: 2021-06-15. Created: 2021-06-15. Last updated: 2021-06-15. Bibliographically approved

Open Access in DiVA

fulltext (1161 kB), 1829 downloads
File information
File name: FULLTEXT01.pdf
File size: 1161 kB
Checksum (SHA-512): 8de8453dd065f20e0e5fd119bb19a6794905a4370d406006c70f7562d2270ddee387de2305aeb6bdfe4fc83208c22293f703287d3303eae7a70d13a641c4342f
Type: fulltext
Mimetype: application/pdf
