Alberto Bartoli, Giorgio Davanzo, Eric Medvet, Enrico Sorio
Proc. AIA 2010 - Artificial Intelligence and Applications
An essential step in the understanding of printed documents is their
classification, i.e., assigning each document to a class determined by the
nature of the information it contains and by its layout.
In this work we
are concerned with automatic classification of such documents. This
task is usually accomplished by extracting a suitable set of low-level
features from each document which are then fed to a classifier.
The quality of the results depends primarily on the classifier, but it is
also heavily influenced by the specific features used. In this work we
focus on the feature extraction part and propose a method that
characterizes each document based on the spatial density of black pixels
and of image edges.
We assess our proposal on a real-world dataset
composed of 560 invoices belonging to 68 different classes. These
documents were digitized after their printed counterparts had
been handled in a corporate environment, so they contain a substantial
amount of noise, such as big stamps and handwritten signatures at unfortunate
positions. We show that our proposal is accurate, even with a
very small learning set.
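
As an illustration of the kind of features described above, the following is a minimal sketch of block-wise density features, not the exact procedure of this work. It assumes a grayscale page image stored as a NumPy array, a fixed binarization threshold, a regular grid of cells, and a very simple edge definition (a pixel whose black/white label differs from its right or lower neighbour). The function names `block_density` and `density_features`, the grid size, and the threshold are all hypothetical choices made for the example.

```python
import numpy as np


def block_density(mask, rows, cols):
    """Fraction of True pixels in each cell of a rows x cols grid.

    Assumes the mask has at least `rows` rows and `cols` columns,
    so that no grid cell is empty.
    """
    h, w = mask.shape
    row_edges = np.linspace(0, h, rows + 1).astype(int)
    col_edges = np.linspace(0, w, cols + 1).astype(int)
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            cell = mask[row_edges[i]:row_edges[i + 1],
                        col_edges[j]:col_edges[j + 1]]
            out[i, j] = cell.mean()
    return out


def density_features(gray, rows=4, cols=4, black_threshold=128):
    """Concatenate per-cell black-pixel density and per-cell edge density.

    The threshold and the edge definition are illustrative assumptions.
    """
    gray = np.asarray(gray, dtype=float)
    black = gray < black_threshold  # True where the pixel counts as black

    # Mark a pixel as an edge if its label differs from the pixel to its
    # right or from the pixel below it.
    edges = np.zeros_like(black)
    edges[:, :-1] |= black[:, :-1] != black[:, 1:]
    edges[:-1, :] |= black[:-1, :] != black[1:, :]

    return np.concatenate([
        block_density(black, rows, cols).ravel(),
        block_density(edges, rows, cols).ravel(),
    ])


# Usage: a white 8x8 page with a black 4x4 square in the top-left corner.
page = np.full((8, 8), 255)
page[:4, :4] = 0
features = density_features(page, rows=2, cols=2)
```

The resulting vector has one black-pixel density and one edge density per grid cell, and it could then be fed to any standard classifier. Under these assumptions, a coarser grid should make the features less sensitive to localized noise such as stamps, at the cost of layout detail.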