Abstract
High-performance computing (HPC) systems are a critical resource for scientific research and advanced industries. Growing demand for computational power and memory is ushering in the exascale era, in which supercomputers are designed to deliver enormous computing power. These complex supercomputers consist of many compute nodes and are consequently expected to experience frequent faults and crashes. Exact state reconstruction (ESR) has been proposed as a mechanism to alleviate the impact of frequent failures on long-running computations. ESR has shown great potential in the context of iterative linear algebra solvers, a key building block in numerous scientific applications. Recent supercomputer designs feature emerging non-volatile memory (NVM) technology. For example, the exascale Aurora supercomputer is planned to integrate Intel Optane DCPMM. This work investigates how NVM can be used to improve ESR so that it can scale to future exascale systems such as Aurora and provide enhanced resilience. We propose the non-volatile memory ESR (NVM-ESR) mechanism. NVM-ESR demonstrates how NVM can be used in supercomputers to enable efficient recovery from faults while requiring a significantly smaller memory footprint and lower time overhead than ESR. We focus on the preconditioned conjugate gradient (PCG) iterative solver, also studied in prior ESR research, because it is employed by the representative HPCG scientific benchmark.
Original language | English |
---|---|
State | Published - 9 Aug 2022 |
Keywords
- cs.DC