Burst SSD Buffer: Checkpoint Strategy at Extreme Scale
Abstract
Checkpointing is an indispensable fault-tolerance technique, commonly used by HPC applications that run continuously for hours or days at a time. However, on extreme-scale systems, the bursty I/O pattern of checkpointing overburdens file systems and adds large overhead to an application's runtime. To alleviate this overhead and achieve fast checkpoint/restart, we propose a highly resilient mini-SSD-based burst buffer system and explore a checkpoint strategy for that system based on our checkpointing model.
- 2013-09-23
Authors
- Satoshi Matsuoka, Tokyo Institute of Technology
- Satoshi Matsuoka, National Institute of Informatics
- Adam Moody, Lawrence Livermore National Laboratory
- Todd Gamblin, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
- Todd Gamblin, Lawrence Livermore National Laboratory
- Kento Sato, Tokyo Institute of Technology
- Kathryn Mohror, Lawrence Livermore National Laboratory
- Naoya Maruyama, RIKEN
- Kento Sato, Tokyo Institute of Technology; Research Fellow of Japan Society for the Promotion of Science
Related papers
- MPI-CUDA Applications Checkpointing
- Efficient PageRank on GPU Clusters
- Low-overhead checkpoint for large-scale GPU-accelerated systems
- Web-site-based partitioning techniques for efficient parallelization of the PageRank computation (High Performance Computing)
- CG on GPU-enhanced Clusters
- Fast GPU Read Alignment with Burrows Wheeler Transform Based Index
- GPU-based approach for elastic-plastic deformation simulations
- Intuitive Performance Visualization Techniques for Topological Analysis on Capability Machines
- Data Ownership Assurance in the Inter-Cloud supporting data dynamics
- Towards an Asynchronous Checkpointing System
- Towards Fast PGAS Implementation of Multithreaded Asynchronous Large-Scale Graph Traversal for Supercomputers with Local Semi-External Memory
- Towards a Dataflow FMM using the OmpSs Programming Model
- Avoiding silent data corruption in checkpoint files
- Burst SSD Buffer: Checkpoint Strategy at Extreme Scale
- Multi-level Temporal Blocking for Stencil Computation for Memory Hierarchy on TSUBAME2.5
- Performance modeling of a hierarchical N-body algorithm for arbitrary particle distribution (Unrefereed Workshop Manuscript)
- Increasing GPU batch queue's utilization using rCUDA (Unrefereed Workshop Manuscript)
- Visualizing Collectives over InfiniBand Networks (Unrefereed Workshop Manuscript)