Title: Module prototype for online failure prediction for the IBM Blue Gene/L
Year of Publication: 2008
Authors: Solano-Quinde LD, Bode BM
Book Title: 2008 IEEE International Conference on Electro/Information Technology
Keywords: Blue Gene/L, computer fault tolerance, failure analysis, software fault tolerance

The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold more than 200K processors and has been designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since it has been demonstrated that a failure significantly reduces the performance of the system. Although reactive fault-tolerant policies effectively minimize the effects of faults, they have been shown to drastically reduce system performance. Proactive fault-tolerant policies have emerged as an alternative because they impose less performance degradation. These policies are based on the analysis of information about the state of the system. The monitoring system of the IBM Blue Gene/L generates online information about the state of the system's hardware and software and stores that information in the RAS event log. In this study, we design and implement a module prototype for online failure prediction. The prototype is tested and validated in a realistic scenario, using the RAS event log of an IBM Blue Gene/L system. We show that our module prototype predicts up to 70% of the fatal events.