[Mpi3-ft] system-level C/R requirements
joseph.ruscio at librato.com
Mon Oct 27 11:16:27 CDT 2008
On Oct 27, 2008, at 7:18 AM, Thomas Herault wrote:
> So, my question is: what are the benefits of solution b) as compared
> to solution a)?
If you go with solution b), the only thing the MPI implementation
needs to worry about is PREPAREing and RESTOREing the network state.
All of the checkpoint-specific issues and complexity, such as queuing
system integration and checkpoint file storage options like redundant
local storage, stay with the checkpointer.
In case a), MPI implementations need to concern themselves with these
issues AND support the non-standard checkpoint invocation APIs of
every system checkpointer they want to work with. So it's a question of
whether the responsibility for system-level checkpointer integration
lies with the MPI implementor or the individual checkpoint implementors.
Going with b) gives MPI an opportunity to minimally specify an
integration mechanism with system-level checkpointers. If the MPI
implementor does not wish to worry about these classes of CP/R
systems, they just don't implement the minimal set of calls. If they
do want the support, they implement the calls.
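
To make the "minimal set of calls" idea concrete, here is a rough sketch
in C. Everything in it is hypothetical: the cr_hooks table, the
net_state struct, and checkpoint_cycle() are names made up for
illustration, not part of any MPI standard or existing implementation.
It only shows the shape of the contract: an MPI library that wants the
support fills in both hooks, one that doesn't leaves them NULL, and the
checkpointer checks before relying on them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook table; not an existing MPI interface. */
typedef struct {
    int (*prepare)(void *state);  /* quiesce the network before a checkpoint */
    int (*restore)(void *state);  /* rebuild the network afterwards */
} cr_hooks;

/* Toy stand-in for an MPI library's network state. */
typedef struct {
    int in_flight;  /* messages sent but not yet delivered */
    int connected;  /* 1 while connections are up */
} net_state;

/* MPI side: pretend to drain traffic and tear down connections. */
static int mpi_prepare(void *p) {
    net_state *n = p;
    n->in_flight = 0;
    n->connected = 0;
    return 0;
}

/* MPI side: bring connections back up; nothing may be in flight here. */
static int mpi_restore(void *p) {
    net_state *n = p;
    assert(n->in_flight == 0);
    n->connected = 1;
    return 0;
}

/* Stand-in for whatever the checkpointer does to save the process. */
static int fake_snapshot(void) { return 0; }

/*
 * Checkpointer side: PREPARE, snapshot, RESTORE.
 * Returns -1 if the MPI library did not implement the hooks,
 * -2 if a hook failed, otherwise the snapshot's return code.
 */
static int checkpoint_cycle(const cr_hooks *h, void *state,
                            int (*snapshot)(void)) {
    if (h == NULL || h->prepare == NULL || h->restore == NULL)
        return -1;
    if (h->prepare(state) != 0)
        return -2;
    int rc = snapshot();
    if (h->restore(state) != 0)
        return -2;
    return rc;
}
```

A checkpointer would call checkpoint_cycle(&hooks, &state,
fake_snapshot) with hooks set to { mpi_prepare, mpi_restore }. Passing a
table with NULL entries models an MPI implementation that opted out.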
A single implementation of PREPARE and RESTORE would support most
system-level checkpointers. For example, our checkpointer that sits
completely in user-land and BLCR that sits completely in the OS would
have the same set of requirements. VMM level checkpointers have been
suggested as being different. Wouldn't they either have the same
requirements, or be completely transparent to the MPI stack, i.e.,
snapshotting the OS, communication device, etc., and allowing the
protocol to sort out lost messages?
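
As a toy illustration of the "single implementation of PREPARE and
RESTORE" point, the C sketch below drives one pair of MPI-side routines
from two different checkpointer back ends. All names are hypothetical.
userland_save() and kernel_save() are empty stand-ins and do not call
any real checkpointer's API. The assumption being illustrated is only
that both kinds of back end need the same thing from MPI: a quiesced
network while the image is saved.

```c
#include <assert.h>
#include <stddef.h>

/* Toy MPI-side network state, with counters to show the hooks are shared. */
typedef struct {
    int quiesced;
    int prepare_calls;
    int restore_calls;
} mpi_net;

/* The one and only PREPARE/RESTORE implementation on the MPI side. */
static void mpi_prepare(mpi_net *n) { n->quiesced = 1; n->prepare_calls++; }
static void mpi_restore(mpi_net *n) { n->quiesced = 0; n->restore_calls++; }

/* A back end only knows how to save the process image. */
typedef struct {
    const char *name;
    int (*save_image)(const mpi_net *n);
} checkpointer;

/* Stand-ins for a user-land and an in-kernel checkpointer.
 * Both refuse to save unless the network has been quiesced. */
static int userland_save(const mpi_net *n) { return n->quiesced ? 0 : -1; }
static int kernel_save(const mpi_net *n)   { return n->quiesced ? 0 : -1; }

/* Same sequence regardless of which back end is plugged in. */
static int take_checkpoint(const checkpointer *c, mpi_net *n) {
    assert(c != NULL && c->save_image != NULL);
    mpi_prepare(n);
    int rc = c->save_image(n);
    mpi_restore(n);
    return rc;
}
```

Swapping { "userland", userland_save } for { "kernel", kernel_save }
changes nothing on the MPI side, which is the property being argued for.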