Year of publication
- 2005 (2)
Document Type
- Article (2)
Language
- English (2)
Is part of the Bibliography
- yes (2)
Institute
- Institut für Mathematik (2)
In a distributed, inherently dynamic Grid environment, the reliability of individual resources cannot be guaranteed. The more resources and components are involved, the more error-prone the system becomes. It is therefore important to enhance the dependability of the system with fault-tolerance mechanisms. In this paper, we present Migol, a fault-tolerant, self-healing Grid service infrastructure for MPI applications. The benefit of the Grid is that, in case of a failure, an application may be migrated to another site and restarted there from a checkpoint file. This approach requires a service infrastructure that handles the necessary activities transparently for the application. However, a migration framework cannot support fault-tolerant applications if it is not fault-tolerant itself.
Parallel file systems like PVFS2 are a necessary component for high-performance computing. The design of efficient communication layers for these systems is still of great research interest. This paper presents a low-latency messaging method for PVFS2 designed for Gigabit Ethernet networks and discusses relevant design issues. In contrast to other approaches, we argue that zero-copying can also be achieved for large messages without the use of a rendezvous protocol. Further, efficiency within the communication layer, such as a small call stack, plays an important role.