Towards an Asynchronous Checkpointing System
スポンサーリンク
概要
- 論文の詳細を見る
The overall failure rate of HPC systems is increasing because the number of components is growing. Checkpoint/Restart, the most common technique to tolerate these faults, enables an application to restart from the last checkpoint even if a failure happens while the application is running. However, writing large checkpoint files may impact application runtime, depending on the bandwidth of the file systems to which checkpoints are written. To minimize the impact, we propose an asynchronous checkpointing system to write checkpoints to the file system in the background. This system uses extra nodes to drain a checkpoint from compute nodes using RDMA (Remote Direct Memory Access) to minimize CPU usage. Our preliminary evaluation shows that our asynchronous checkpointing system reduces checkpointing impact with runtime increases of CPU-bound applications under 1% compared to not checkpointing to a parallel file system.
- 2011-11-21
著者
-
Satoshi Matsuoka
Tokyo Institute of Technology
-
Naoya Maruyama
Tokyo Institute of Technology
-
Satoshi Matsuoka
Tokyo Institute Of Technology|national Institute Of Informatics|japan Science And Technology Agency
-
Satoshi Matsuoka
National Inst. Of Informatics
-
Adam Moody
Lawrence Livermore National Laboratory
-
Todd Gamblin
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
-
Todd Gamblin
Lawrence Livermore National Laboratory
-
Kento Sato
Tokyo Institute of Technology
-
Kathryn Mohror
Lawrence Livermore National Laboratory
関連論文
- MPI-CUDA Applications Checkpointing
- Efficient PageRank on GPU Clusters
- Low-overhead checkpoint for large-scale GPU-accelerated systems
- Low-overhead checkpoint for large-scale GPU-accelerated systems
- Efficient PageRank on GPU Clusters
- Web-site-based partitioning techniques for efficient parallelization of the PageRank computation (ハイパフォーマンスコンピューティング)
- CG on GPU-enhanced Clusters
- CG on GPU-enhanced Clusters
- Fast GPU Read Alignmennt with Burrows Wheeler Transform Based Index
- GPU-based approach for elastic-plastic deformation simulations
- Intuitive Performance Visualization Techniques for Topological Analysis on Capability Machines
- Data Ownership Assurance in the Inter-Cloud supporting data dynamics
- Towards an Asynchronous Checkpointing System
- Towards an Asynchronous Checkpointing System
- Towards an Asynchronous Checkpointing System
- Towards an Asynchronous Checkpointing System
- Towards Fast PGAS Implementation of Multithreaded Asynchronous Large-Scale Graph Traversal for Supercomputers with Local Semi-External Memory
- Towards a Dataflow FMM using the OmpSs Programming Model
- Avoiding silent data corruption in checkpoint files
- Burst SSD Buffer: Checkpoint Strategy at Extreme Scale
- Multi-level Temporal Blocking for Stencil Computation for Memory Hierarchy on TSUBAME2.5
- Performance modeling of a hierarchcial N-body algorithm for arbitrary particle distribution (Unrefereed Workshop Manuscript)
- Increasing GPU batch queue's utilization using rCUDA (Unrefereed Workshop Manuscript)
- Visualizing Collectives over InfiniBand Networks (Unrefereed Workshop Manuscript)