MPI-CUDA Applications Checkpointing

Abstract

We describe a method for checkpointing MPI applications that use GPUs as accelerators. Current MPI checkpointing tools such as LAM/MPI and Open MPI do not support checkpointing state held on the GPU, which is a major obstacle for users who want to develop hybrid MPI CUDA applications for large-scale clusters with high failure rates. We propose a method that checkpoints MPI CUDA applications by integrating Open MPI, BLCR, and our CUDA checkpointer. The CUDA checkpointer hooks CUDA Runtime API calls to monitor and record the CUDA resources used on the GPU during program execution, and it is integrated into the BLCR checkpoint/restart module in Open MPI. At checkpoint time, the CUDA checkpointer is invoked through our user-defined callback function in BLCR to save the state on the GPU. At restart, it restores the data and CUDA contexts on the GPU in conjunction with Open MPI's restart service. Our implementation demonstrates that MPI CUDA applications written with the CUDA Runtime API can be checkpointed and restarted correctly and transparently. It also shows a checkpoint overhead of about 38 seconds for a 3D stencil application of size 256x256x600 running on 60 GPU-enabled nodes.
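The record, checkpoint, and restore cycle described in the abstract can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the authors' implementation: `FakeDevice` stands in for GPU memory, `hooked_malloc` and `hooked_free` stand in for intercepted CUDA Runtime allocation calls, and `checkpoint_callback` and `restart_callback` stand in for the callbacks that BLCR and Open MPI would trigger. All of these names are hypothetical, and no real CUDA, BLCR, or Open MPI API is called.

```python
class FakeDevice:
    """Hypothetical stand-in for GPU memory: device pointer -> byte buffer."""

    def __init__(self):
        self._next_ptr = 0x1000
        self.mem = {}

    def malloc(self, size):
        ptr = self._next_ptr
        self._next_ptr += size
        self.mem[ptr] = bytearray(size)
        return ptr

    def malloc_at(self, ptr, size):
        # Assumption for this model only: the real CUDA Runtime does not let
        # the caller choose a device address, so a real tool has to deal with
        # pointers that change across a restart.
        self.mem[ptr] = bytearray(size)

    def free(self, ptr):
        del self.mem[ptr]

    def write(self, ptr, data):
        buf = self.mem[ptr]
        if len(data) > len(buf):
            raise ValueError("write exceeds allocation size")
        buf[:len(data)] = data

    def read(self, ptr):
        return bytes(self.mem[ptr])

    def reset(self):
        # Simulates losing all device state, e.g. after a node failure.
        self.mem.clear()


class CudaCheckpointerModel:
    """Records allocations made through the hooks and saves/restores them."""

    def __init__(self, device):
        self.device = device
        self.allocations = {}  # device pointer -> size, kept up to date by hooks
        self.saved = None      # host-side copy taken at checkpoint time

    def hooked_malloc(self, size):
        ptr = self.device.malloc(size)
        self.allocations[ptr] = size
        return ptr

    def hooked_free(self, ptr):
        self.device.free(ptr)
        del self.allocations[ptr]

    def checkpoint_callback(self):
        # Copy every live device buffer to host memory.
        self.saved = {ptr: self.device.read(ptr) for ptr in self.allocations}

    def restart_callback(self):
        # Re-create each recorded buffer and copy its contents back.
        for ptr, data in self.saved.items():
            self.device.malloc_at(ptr, len(data))
            self.device.write(ptr, data)
        self.allocations = {ptr: len(d) for ptr, d in self.saved.items()}


device = FakeDevice()
ckpt = CudaCheckpointerModel(device)

a = ckpt.hooked_malloc(4)
b = ckpt.hooked_malloc(8)
device.write(a, b"\x01\x02\x03\x04")
ckpt.hooked_free(b)          # freed buffers must not be checkpointed

ckpt.checkpoint_callback()
device.reset()               # simulated failure wipes the device
ckpt.restart_callback()

restored = device.read(a)
```

After the simulated failure, `restored` holds the bytes written before the checkpoint, and the freed buffer `b` is not re-created. The model leaves out everything that makes the real problem hard, such as CUDA contexts, streams, and coordination across MPI ranks.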
2010-07-27