Mission-critical distributed systems are typically built from many software components that interact tightly with one another. Because of the inherent complexity of these systems and the asynchrony of their interactions, they can exhibit anomalies that lead to failures and, in some cases, to an abrupt halt of the whole system, even when the software has been extensively tested. To avoid such failures and halts, the system must be monitored at run time so that anomalies are discovered early and failures can be predicted in advance. A system manager can then take appropriate measures before the actual failure occurs, minimizing the damage to the system.
This project introduces a novel approach to failure prediction for mission-critical distributed systems with three distinctive features. It is (i) black-box: no knowledge of the internals or logic of the applications running in the system is required; (ii) non-intrusive: no status information about the nodes (e.g., CPU usage) is used; and (iii) online: the analysis is performed while the monitored system is running.
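To make the three properties concrete, the following is a minimal sketch of a monitor that is black-box, non-intrusive, and online in the sense described above. It is not Casper's prediction algorithm, which this page does not describe. It assumes that only message metadata observed on the network is available, and it uses a simple running-baseline check on the message rate as a stand-in for the real analysis. The names `Message` and `OnlineTrafficMonitor`, the window length, and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Metadata visible on the wire; payloads and host internals are never read."""
    timestamp: float  # seconds
    src: str
    dst: str
    size: int  # bytes


class OnlineTrafficMonitor:
    """Illustrative stand-in: flags windows whose message rate leaves the baseline."""

    def __init__(self, window=1.0, threshold=3.0, warmup=5):
        self.window = window        # window length in seconds
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # windows to learn from before alerting (>= 2)
        self.window_start = None
        self.count = 0              # messages seen in the current window
        self.n = 0                  # windows folded into the baseline
        self.mean = 0.0             # running mean of messages per window
        self.m2 = 0.0               # running sum of squared deviations (Welford)

    def _close_window(self):
        """Score the finished window, then fold it into the baseline if normal."""
        alert = False
        if self.n >= self.warmup:
            std = (self.m2 / (self.n - 1)) ** 0.5
            # Floor the deviation at 1 message so a perfectly flat baseline
            # does not turn every tiny fluctuation into an alert.
            alert = abs(self.count - self.mean) > self.threshold * max(std, 1.0)
        if not alert:
            # Anomalous windows are left out so they do not distort the baseline.
            self.n += 1
            delta = self.count - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (self.count - self.mean)
        return alert

    def observe(self, msg):
        """Consume one message; return True if a window just closed as anomalous."""
        if self.window_start is None:
            self.window_start = msg.timestamp
        alert = False
        # Close every window that ended before this message, including empty ones.
        while msg.timestamp >= self.window_start + self.window:
            alert = self._close_window() or alert
            self.window_start += self.window
            self.count = 0
        self.count += 1
        return alert
```

The monitor is fed one `Message` at a time through `observe`, so the analysis runs while the system is operating rather than on stored logs, and it never queries a host for CPU or memory figures. This sketch uses only the timestamps; the `src` and `dst` fields could be used to keep one baseline per host or per link.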
A comprehensive demo is available in the following video, which shows the Casper Graphical User Interface (GUI) analyzing a real air traffic control (ATC) system.
The video shows a "Topology Viewer" panel in which an operator can observe the behavior of the system at run time: a graph represents the network and the messages exchanged among the hosts. The "Ranking Panel" gives the operator a run-time ranking of the hosts by health, which is important information for detecting problematic nodes in time. The "speed and graph controller" panel lets the operator control the monitoring system. At 1:25 the failure prediction mechanism recognizes a failure-prone situation and triggers an alert by changing the background of the "Topology Viewer" panel from white to gray and the highlight of the "trace state" text box in the other panel from green to yellow. At 2:09, less than a minute after the prediction, one of the hosts turns dark, meaning that it has become completely inactive. The dark host has experienced a failure, which Casper predicted in advance.
Joint work with Selex Sistemi Integrati