MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes (Distributed, Grid and P2P Computing) (Special Section: Hardware/Software Support for High Performance Scientific and Engineering Computing)
Abstract
Fault tolerance is an essential feature of distributed systems, where the probability of failure grows with the size of the system. Despite more than two decades of extensive research, fault-tolerant systems have not found practical use, owing to the high overhead and inconvenience of earlier designs. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to close the gap between the theory and practice of fault-tolerant systems and to provide a checkpointing and recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code and affects the communication characteristics of MPICH as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, namely the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.
- Paper published by the Institute of Electronics, Information and Communication Engineers (IEICE)
- 2004-07-01
Authors
-
Yeom Heon
School Of Computer Science And Engineering Seoul National University
-
Park Taesoon
Department Of Computer Engineering Sejong University
-
WOO Namyoon
School of Computer Science and Engineering, Seoul National University
-
JUNG Hyungsoo
School of Computer Science and Engineering, Seoul National University
-
PARK Hyungwoo
Supercomputing Center, KISTI
Related papers
- A New Approach for Distributed Main Memory Database Systems : A Causal Commit Protocol (Databases)
- A grid computing-based remote-experiment system for wind engineering
- Fault-Tolerance for the Mobile Ad-Hoc Environment
- Fault-Tolerant Execution of Collaborating Mobile Agents (Reliability, Maintainability and Safety Analysis)
- Agent Based Fault Tolerance for the Mobile Environment