IEEE International Conference on Information Society (iSociety), 2012, London (UK)
Enrico Sorio, Alberto Bartoli, Giorgio Davanzo, Eric Medvet
Although OCR technology is now commonplace, character recognition errors remain a problem, in particular in automated systems for information extraction from printed documents. This paper proposes a method for the automatic detection and correction of OCR errors in an information extraction system. Our algorithm uses domain knowledge about possible character misrecognitions to propose corrections; it then exploits knowledge about the type of the extracted information to perform syntactic and semantic checks that validate the proposed corrections.
We assessed our proposal on a real-world, highly challenging dataset composed of nearly 800 values extracted from approximately 100 commercial invoices, obtaining very good results.
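The two-step idea (confusion-driven candidate generation followed by a type-specific syntactic check) can be sketched as follows; the confusion map and the date format are illustrative assumptions, not the actual parameters used in the paper:

```python
import itertools
import re

# Illustrative confusion map: characters an OCR engine commonly swaps.
# The real map would be derived from domain knowledge about the engine.
CONFUSIONS = {
    "0": ["O"], "O": ["0"],
    "1": ["l", "I"], "l": ["1"], "I": ["1"],
    "5": ["S"], "S": ["5"],
    "8": ["B"], "B": ["8"],
}

# Syntactic check for one hypothetical field type: a dd/mm/yyyy date.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def candidates(text):
    """Yield every variant obtained by substituting confusable characters.

    The number of variants is exponential in the field length, which is
    acceptable for the short values typically found on invoices.
    """
    options = [[c] + CONFUSIONS.get(c, []) for c in text]
    for combo in itertools.product(*options):
        yield "".join(combo)

def correct(text, is_valid=lambda s: DATE_RE.match(s) is not None):
    """Return the first candidate passing the syntactic check, else None."""
    for cand in candidates(text):
        if is_valid(cand):
            return cand
    return None

print(correct("l2/O5/2O11"))  # -> "12/05/2011"
```

A semantic check (e.g., verifying that a corrected total matches the sum of the line items) would be plugged in as a second `is_valid`-style predicate.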
IEEE International Conference on Information Society (iSociety), 2012, London (UK)
Alberto Bartoli, Eric Medvet, Marco Mauri
Web-based applications have an ever-increasing impact on modern society. Organizations spend an ever-increasing amount of resources developing and maintaining web applications that are, in fact, their public face. Hence, any problem that an external user encounters with these products---whether due to an external attack or to a simple defect in the underlying code---often results in a loss for the organization itself. In order to reduce defects and prevent attacks, web applications require thorough testing. Unfortunately, testing a web application is nowadays a very complicated task, primarily due to the widespread use of AJAX technologies. To simplify and improve testing procedures, we propose a method for recording and replaying web navigations, i.e., sequences of user interactions with web applications. We developed a tool that implements this method and tested it on various web applications, both real and synthetic, obtaining very positive results.
A. Bartoli, G. Davanzo, E. Medvet
in Proc. 11th International Conference on Intelligent Systems Design and Applications (ISDA)
Electronic devices capable of wireless communication are becoming ubiquitous. They enable a wide range of novel applications but these are often difficult to deploy in practice because wireless channels provide ample opportunities to attackers. A number of approaches have been proposed for building secure channels in these scenarios. An approach that would be simple, general and effective consists in establishing a shared secret between two devices by placing them in physical contact for a few seconds. This approach has not been exploited in practice due to the lack of common interfaces. We demonstrate the practical feasibility of the approach for devices equipped with a small LCD screen and cameras. We transfer secret keys between Android-based smartphones put in contact with each other for just a few seconds. The transfer occurs across a visual channel that cannot be intercepted.
E. Medvet, A. Bartoli, G. Davanzo, A. De Lorenzo
in Proc. 10th IEEE/WIC/ACM International Conference on Web Intelligence
We consider the automatic annotation of faces of people mentioned in news stories. News stories provide a constant flow of potentially useful image-indexing information, due to their huge diffusion on the web and to the involvement of human operators in selecting relevant images for the stories. In this work we investigate the possibility of actually exploiting this wealth of information.
We propose and evaluate a system for automatic face annotation of news images that is fully unsupervised and does not require any prior knowledge about the topics or people involved. A key feature of our proposal is that it attempts to identify the essential piece of information---what a person with a given name looks like---by querying popular image search engines. Mining the web allows us to overcome intrinsic limitations of approaches built on a predefined collection of stories: our system can potentially annotate people never handled before, since its knowledge base is constantly expanding, as long as search engines keep indexing the web. On the other hand, relying on image search engines forces the system to cope with the substantial amount of noise in search engine results. Our contribution shows experimentally that automatic face annotation may indeed be achieved based entirely on knowledge that lives on the web.
IEEE/WIC/ACM WI-2011 is a highly selective conference: only 20.5% of the 200 submissions were accepted as regular papers and 19% were accepted as short papers. Papers went through a rigorous review process; each paper was reviewed by at least three program committee members.
Giorgio Davanzo, Alberto Bartoli, Eric Medvet
Expert Systems with Applications
(IF 2010: 2.908), to appear
The defacement of web sites has become a widespread problem. Reaction to these incidents is often quite slow and triggered by occasional checks or even by feedback from users, because organizations usually lack systematic, round-the-clock surveillance of the integrity of their web sites. A more systematic approach is certainly desirable. An attractive option in this respect consists of augmenting availability and performance monitoring services with defacement detection capabilities. Motivated by these considerations, in this paper we assess the performance of several anomaly detection approaches when faced with the problem of detecting web defacements automatically. All these approaches construct a profile of the monitored page automatically, based on machine learning techniques, and raise an alert when the page content does not fit the profile. We assessed their performance in terms of false positives and false negatives on a dataset composed of 300 highly dynamic web pages that we observed for three months and that includes a set of 320 real defacements.
A. Bartoli, G. Davanzo, A. De Lorenzo, E. Medvet
in Proc. 14th European Conference on Genetic Programming (EuroGP 2011)
The electric power market is increasingly relying on competitive mechanisms taking the form of day-ahead auctions, in which buyers and sellers submit their bids in terms of prices and quantities for each hour of the next day. Methods for electricity price forecasting suitable for these contexts are crucial to the success of any bidding strategy. Such methods have thus become very important in practice, due to the economic relevance of electric power auctions.
In this work we propose a novel forecasting method based on Genetic Programming. A key feature of our proposal is the handling of outliers, i.e., regions of the input space rarely seen during learning. Since a predictor generated with Genetic Programming can hardly provide acceptable performance in these regions, we use a classifier that attempts to determine whether the system is shifting toward a difficult-to-learn region. In those cases, we replace the prediction made by Genetic Programming with a constant value determined during learning and tailored to the specific subregion expected.
We evaluate the performance of our proposal against a challenging baseline representative of the state of the art. The baseline analyzes a real-world dataset by means of a number of different methods, each calibrated separately for each hour of the day and recalibrated every day on a progressively growing learning set. Our proposal exhibits a smaller prediction error, even though we construct one single model, valid for every hour of the day and used unmodified across the entire testing set. We believe that our results are highly promising and may open the way to a broad range of novel solutions.
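The outlier-handling scheme described above can be sketched as follows; the component models are toy stand-ins for the evolved GP predictor, the learned classifier, and the learned constant, all of which are assumptions for illustration:

```python
# Hybrid predictor: a classifier flags inputs that fall in a rarely-seen
# ("difficult-to-learn") region; for those, a constant learned on training
# data replaces the unreliable evolved predictor.

def make_hybrid(gp_predict, is_outlier, fallback_value):
    """Combine an evolved GP predictor with a constant fallback."""
    def predict(x):
        if is_outlier(x):
            return fallback_value  # constant tailored to the rare subregion
        return gp_predict(x)
    return predict

# Toy stand-ins for the learned components.
gp_predict = lambda x: 2.0 * x + 1.0   # evolved price model
is_outlier = lambda x: abs(x) > 10.0   # detector of the rare region
hybrid = make_hybrid(gp_predict, is_outlier, fallback_value=50.0)

print(hybrid(3.0))   # normal region -> GP prediction: 7.0
print(hybrid(25.0))  # flagged as outlier -> constant: 50.0
```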
Eric Medvet, Alberto Bartoli, Giorgio Davanzo
International Journal on Document Analysis and Recognition
We propose an approach for information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time.
Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted.
Our approach is based on probability: we derive a general form for the probability that a sequence of blocks contains the sought information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator's actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class.
We experimentally evaluated our proposal on 807 multi-page printed documents from different domains (invoices, patents, datasheets), obtaining very good results---e.g., a success rate often greater than 90%, even for classes with just two samples.
Enrico Sorio, Alberto Bartoli, Giorgio Davanzo, Eric Medvet
Proc. ACM Document Engineering Conference
A key step in the understanding of printed documents is their classification based on the nature of the information they contain and on their layout. In this work we consider a dynamic scenario in which document classes are not known a priori and new classes can appear at any time. This open-world setting is both realistic and highly challenging. We use an SVM-based classifier that relies only on image-level features, together with a nearest-neighbor approach for detecting new classes. We assessed our proposal on a real-world dataset composed of 562 invoices belonging to 68 different classes. These documents were digitized after being handled in a corporate environment, thus they are quite noisy---e.g., large stamps and handwritten signatures at unfortunate positions, and the like. The experimental results are highly promising.
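The nearest-neighbor rejection step of the open-world setting can be sketched as follows; the feature vectors, labels, and threshold are illustrative assumptions, not values from the paper:

```python
import math

# Open-world rejection: if a document's feature vector is farther than a
# threshold from every training sample, report a new class rather than
# forcing one of the known labels.

def nearest(sample, training):
    """Return (label, distance) of the closest training sample."""
    best_label, best_dist = None, math.inf
    for label, vec in training:
        d = math.dist(sample, vec)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

def classify_open_world(sample, training, threshold):
    label, dist = nearest(sample, training)
    return label if dist <= threshold else "NEW_CLASS"

training = [("invoice_A", (0.1, 0.2)), ("invoice_B", (0.8, 0.9))]
print(classify_open_world((0.12, 0.22), training, threshold=0.3))  # invoice_A
print(classify_open_world((0.5, 0.1), training, threshold=0.3))    # NEW_CLASS
```

In the paper's pipeline the known-class decision itself comes from an SVM; the sketch only illustrates the distance-based novelty test.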
Alberto Bartoli, Giorgio Davanzo, Eric Medvet
ACM Transactions on Internet Technology
Web site defacement, the process of introducing unauthorized modifications to a web site, is a very common form of attack. In this paper we describe and evaluate experimentally a framework that may constitute the basis for a defacement detection service capable of monitoring thousands of remote web sites systematically and automatically.
In our framework an organization may join the service by simply providing the URLs of the resources to be monitored along with the contact point of an administrator. The monitored organization may thus take advantage of the service with just a few mouse clicks, without installing any software locally or changing its own daily operational processes. Our approach is based on anomaly detection and allows monitoring the integrity of many remote web resources automatically while remaining fully decoupled from them, in particular, without requiring any prior knowledge about those resources.
We evaluated our approach over a selection of dynamic resources and a set of publicly available defacements. The results are very satisfactory: all attacks are detected while keeping false positives to a minimum. We also assessed the performance and scalability of our proposal and found that it may indeed constitute the basis for actually deploying the proposed service on a large scale.
Alberto Bartoli, Giorgio Davanzo, Eric Medvet, Enrico Sorio
Proc. AIA 2010 - Artificial Intelligence and Applications
An essential step in the understanding of printed documents is their classification based on their class, i.e., on the nature of the information they contain and on their layout. In this work we are concerned with the automatic classification of such documents. This task is usually accomplished by extracting a suitable set of low-level features from each document, which are then fed to a classifier. The quality of the results depends primarily on the classifier, but it is also heavily influenced by the specific features used. In this work we focus on the feature extraction part and propose a method that characterizes each document based on the spatial density of black pixels and of image edges.
We assessed our proposal on a real-world dataset composed of 560 invoices belonging to 68 different classes. These documents were digitized after their printed counterparts had been handled in a corporate environment, thus they contain a substantial amount of noise---large stamps and handwritten signatures at unfortunate positions, and so on. We show that our proposal is accurate, even with a very small learning set.
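The black-pixel density features can be sketched as follows; the grid size and the plain-Python page representation are illustrative assumptions (the edge densities used in the paper would be computed analogously on an edge-detected image):

```python
# Divide a binarized page into a grid of cells and compute, for each cell,
# the fraction of black pixels. The resulting vector characterizes the
# document's layout and feeds the downstream classifier.

def density_features(page, rows, cols):
    """page: 2D list of 0/1 pixels (1 = black). Returns rows*cols densities."""
    h, w = len(page), len(page[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = [page[y][x]
                    for y in range(r * h // rows, (r + 1) * h // rows)
                    for x in range(c * w // cols, (c + 1) * w // cols)]
            feats.append(sum(cell) / len(cell))
    return feats

# Toy 4x4 "page": left half black, right half white.
page = [[1, 1, 0, 0]] * 4
print(density_features(page, rows=2, cols=2))  # [1.0, 0.0, 1.0, 0.0]
```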