Low-overhead checkpoint for large-scale GPU-accelerated systems

Abstract

In HPC, applications are periodically checkpointed to stable storage to increase the success rate of long executions. Today, remote-disk-based checkpointing imposes an overhead of about 20% of the execution time, and in the coming years this will exceed 50% if the checkpoint frequency rises along with the fault frequency. Diskless checkpointing has been introduced as a way to avoid the I/O bottleneck of remote-disk-based checkpointing. However, its encoding time, its need for spare nodes, and its memory overhead are significant obstacles to its adoption. At the same time, heterogeneous computing is becoming increasingly popular in HPC, with new clusters combining CPUs and GPUs. In this work, we propose a way to checkpoint GPU applications that avoids the I/O bottleneck by using the SSDs in the compute nodes, which significantly increases checkpoint performance and avoids the memory overhead of classic diskless checkpointing. Our technique does not require spare nodes and can tolerate the failure of up to 50% of the processes with a low checkpoint overhead. We plan to evaluate our technique on TSUBAME 2.0 and present the first results.
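
To make the "encoding time" of diskless checkpointing concrete, the following is a minimal sketch of XOR parity encoding, one commonly described encoding for diskless checkpoints. It is an illustration only: the abstract does not state which encoding the proposed technique uses, and the function names (`xor_bytes`, `encode_parity`, `recover_lost`) and the byte-string checkpoints are hypothetical. A single XOR parity block lets a group recover from the loss of one member, which is weaker than the 50% failure tolerance claimed for the proposed technique.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length buffers byte by byte."""
    if len(a) != len(b):
        raise ValueError("checkpoint buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: Sequence[bytes]) -> bytes:
    """Compute the parity block of a group of checkpoints (the encoding step)."""
    return reduce(xor_bytes, checkpoints)


def recover_lost(survivors: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory checkpoints of three processes (equal length).
checkpoints = [b"rank0-state", b"rank1-state", b"rank2-state"]
parity = encode_parity(checkpoints)

# Simulate the failure of process 1 and rebuild its checkpoint.
lost = 1
survivors = [c for i, c in enumerate(checkpoints) if i != lost]
recovered = recover_lost(survivors, parity)
```

The encoding cost grows with the checkpoint size and the group size, and the parity block must be held somewhere other than the failed process, which is why classic schemes rely on extra memory and spare nodes.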
2010-12-09