A Domain Knowledge-based Approach for Automatic Correction of Printed Invoices

27 Mar 2012

IEEE International Conference on Information Society (iSociety), 2012, London (UK)

Enrico Sorio, Alberto Bartoli, Giorgio Davanzo, Eric Medvet

Although OCR technology is now commonplace, character recognition errors remain a problem, particularly in automated systems for information extraction from printed documents. This paper proposes a method for the automatic detection and correction of OCR errors in an information extraction system. Our algorithm uses domain knowledge about possible character misrecognitions to propose corrections; it then exploits knowledge about the type of the extracted information to perform syntactic and semantic checks that validate the proposed corrections.
We assess our proposal on a real-world, highly challenging dataset composed of nearly 800 values extracted from approximately 100 commercial invoices, and we obtain very good results.
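
To make the propose-then-validate idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small, hypothetical confusion table (`CONFUSIONS`), a single field type (a date in `dd/mm/yyyy` form), and illustrative function names (`candidates`, `is_valid_date`, `correct`); none of these come from the paper. Candidate corrections are generated by optionally substituting characters that OCR commonly confuses, and a candidate is accepted only if it passes a syntactic check (a regular expression) and a semantic check (the date must exist in the calendar).

```python
import re
from datetime import datetime
from itertools import product

# Hypothetical confusion table (an assumption, not taken from the paper):
# maps a character as read by OCR to the character it may actually stand for.
CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}


def candidates(value, max_candidates=1000):
    """Yield variants of value, optionally replacing each confusable character.

    The unmodified value is always yielded first, so a value that is
    already valid is left untouched by correct().
    """
    options = [ch + CONFUSIONS.get(ch, "") for ch in value]
    for i, combo in enumerate(product(*options)):
        if i >= max_candidates:  # guard against combinatorial blow-up
            break
        yield "".join(combo)


# Syntactic check for the assumed field type: dd/mm/yyyy.
DATE_SYNTAX = re.compile(r"\d{2}/\d{2}/\d{4}")


def is_valid_date(text):
    """Syntactic check (pattern) followed by a semantic check (real date)."""
    if not DATE_SYNTAX.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%d/%m/%Y")
    except ValueError:
        return False
    return True


def correct(value, validator):
    """Return the first candidate accepted by validator, or None if none is."""
    for cand in candidates(value):
        if validator(cand):
            return cand
    return None
```

For example, `correct("l2/O3/2O12", is_valid_date)` yields `"12/03/2012"`, whereas `correct("3l/O2/2012", is_valid_date)` yields `None`: the only syntactically valid candidate, `31/02/2012`, is rejected by the semantic check. Other field types (amounts, VAT numbers, and so on) would need their own validators.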