Sorio, Alberto Bartoli, Giorgio Davanzo, Eric Medvet,
Proc. ACM Document Engineering Conference
A key step in understanding printed documents is classifying them by the nature of the information they contain and by their layout. In this work we consider a dynamic scenario in which document classes are not known a priori and new classes can appear at any time. This open-world setting is both realistic and highly challenging. We use an SVM-based classifier that relies only on image-level features, together with a nearest-neighbor approach for detecting new classes. We assess our proposal on a real-world dataset composed of 562 invoices belonging to 68 different classes. These documents were digitized after being handled in a corporate environment, so they are quite noisy (e.g., large stamps and handwritten signatures in unfortunate positions, and the like). The experimental results are highly promising.
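The open-world idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors, the distance threshold, the class names, and the function names are all hypothetical, and the nearest neighbor's label is used as a plain stand-in for the SVM classifier so that the example needs only the Python standard library. The sketch shows only the gating logic: a document whose nearest known example is farther than a threshold is flagged as belonging to a new class.

```python
import math


def nearest(sample, known):
    """Return (distance, label) of the known example closest to sample."""
    return min((math.dist(sample, vec), label) for vec, label in known)


def classify_open_world(sample, known, threshold):
    """Return a known class label, or None if the sample looks like a new class."""
    distance, label = nearest(sample, known)
    if distance > threshold:
        # Too far from every document seen so far: treat it as a new class.
        return None
    # Stand-in for the SVM (assumption): reuse the nearest neighbor's label.
    return label


# Toy 2-D feature vectors, invented for illustration only.
known = [
    ((0.0, 0.0), "vendor_a"),
    ((0.1, 0.2), "vendor_a"),
    ((5.0, 5.0), "vendor_b"),
]

print(classify_open_world((0.2, 0.1), known, threshold=1.0))  # close to vendor_a
print(classify_open_world((9.0, 0.0), known, threshold=1.0))  # far from all classes
```

In a fuller pipeline, the `None` branch would presumably trigger the creation of a new class, and the final `return` would call a trained SVM; how the threshold is chosen is not stated in the abstract and is left open here.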