Fanfarillo, A., Bouteiller, A., Bosilca, G., & Del Vento, D. (2017). Fault detection in Fortran 2015. In Workshop on Exascale MPI (ExaMPI) 2017. Association for Computing Machinery (ACM): Denver, CO, US.
The presence of billions of hardware components and several levels of software stack in High Performance Computing machines will likely increase the number of user-visible hardware and software failures. Although several techniques to address this problem have been developed, the support that the programming model gives the user to mitigate or work around this issue is still insufficient. The Fortran 2015 standard defines failed images, a new feature that allows the programmer to detect and manage image failures in a parallel program. In this paper we present the failed images implementation provided by the GNU Fortran Compiler and OpenCoarrays, based on the User-Level Failure Mitigation proposal presented in the MPI Forum. We also provide a tangible example of the differences between fault detection and fault tolerance.