[Mpi3-ft] Quiescence Interface
bronevetsky1 at llnl.gov
Wed Jan 28 00:07:13 CST 2009
First, Josh, your proposal has overcome my biggest objection: it
doesn't specify things that cannot be specified in MPI and is
However, having spent some time thinking about it, there is something
that doesn't make sense to me. The quiescence API states that all MPI
communication must be above a certain level of abstraction. Let's say
for the sake of argument that this is the level between MPI and
Infiniband Verbs or sockets, meaning that all messages are in RAM and
not on the network card. However, even in this case this is not
sufficient for recovery because on restart the IDs of the nodes the
application is running on will be different from those in the
original execution. Thus, MPI will restart normally and any
communication that uses the old node IDs will fail. The same applies
to other entities and resources such as pinned memory, Infiniband
connections, etc. As such, we really have two options.
- Require MPI to essentially create a restart image of
itself so that on restart it can recreate all the data that changes,
such as node IDs, pinned memory, connections, etc. This is quite
demanding from the perspective of MPI developers but may be doable.
(MPI implementors, please chime in here!)
- Let the checkpointing library completely virtualize
everything at the preferred abstraction level so that on restart even
if MPI uses node IDs from its original execution, this virtualization
layer will remap them to their new IDs on the fly. The same can be
done for most other resources and entities that exist below the level
of abstraction. However, if the checkpointing library can do that,
there is no need for the quiescence API because the checkpointing
library can ensure quiescence on its own by using the same mechanisms
that MPI would have used.
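
A minimal sketch of the second option, in Python for brevity. The names
NodeIdRemapper and send_via_layer are hypothetical (they are not part of
MPI or of any checkpointing library), and node IDs are modeled as plain
strings. It only illustrates translating pre-checkpoint IDs into
post-restart IDs on the fly:

```python
class NodeIdRemapper:
    """Hypothetical virtualization shim sitting below the MPI library."""

    def __init__(self):
        # Empty until a restart happens: IDs pass through unchanged.
        self._old_to_new = {}

    def on_restart(self, old_ids, new_ids):
        """Record which post-restart ID replaces each pre-checkpoint ID."""
        if len(old_ids) != len(new_ids):
            raise ValueError("old and new ID lists must have the same length")
        self._old_to_new = dict(zip(old_ids, new_ids))

    def translate(self, node_id):
        """Map an ID the MPI library still holds to the current ID."""
        return self._old_to_new.get(node_id, node_id)


def send_via_layer(remapper, wire, dest_id, payload):
    """Stand-in for a low-level send: the library passes the ID it knows."""
    wire.append((remapper.translate(dest_id), payload))


remapper = NodeIdRemapper()
wire = []  # stands in for the network

send_via_layer(remapper, wire, "node17", b"before checkpoint")
remapper.on_restart(["node17", "node18"], ["node03", "node42"])
send_via_layer(remapper, wire, "node17", b"after restart")
print(wire)
```

The MPI side keeps using "node17" throughout; only the layer knows that
it now means "node03". Pinned memory and connections would need
analogous tables, which this sketch does not attempt.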
As such, it appears to me that we either force system-level
checkpointing layers to virtualize everything on their own (already
done by VMM-based checkpointers) or we provide a full
checkpoint/restart API for MPI rather than a simple quiescence API.
So the options are to do nothing or to demand everything, neither of
which is appealing.
Does this make sense? What do you guys think?
1028 Building 451
Lawrence Livermore National Lab
bronevetsky1 at llnl.gov
At 10:26 AM 1/22/2009, Josh Hursey wrote:
>On Jan 22, 2009, at 11:02 AM, Greg Bronevetsky wrote:
>>At 07:21 AM 1/22/2009, Josh Hursey wrote:
>>>I have updated the Quiescence Interface proposal on the wiki:
>>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Quiescence
>>>I am interested in questions and comments from the group regarding the
>>>interface and use-cases. The primary use-case is for application
>>>initiated checkpoint/restart, but the semantics of the proposed API
>>>likely also open doors to other use-cases.
>>>Let me know what you think.
>>I like that this proposal is more detailed than our previous
>>discussions but I don't think that my major concern has been
>>addressed yet. The key text in the proposal is "all of the in-flight
>>messages have been accounted for in the specified comm". What does
>>"accounted for" mean? Does the relevant message information go into
>>some sort of buffer? Is it on the network card? What semantics are
>>we actually guaranteeing here?
>I missed some of this text when copying it over to the wiki. I updated
>the wiki with some of the wording below.
>'accounted for' means that any message sent by one process on the
>specified communicator has been transmitted to the recipient. The
>recipient may either buffer the message in the MPI implementation (if
>no receive has been posted) or place it in the recipient's buffer (if
>a receive has been posted).
>The MPI implementation can choose to buffer the message contents
>however it determines best (network card, internal data structure,
>specialized hardware). The MPI_Info argument set may allow an
>application to express any requirements regarding how the MPI
>implementation buffers the message contents with regard to MPI managed
>devices such as the network card. I think specifying this type of
>key is implementation-specific and should be left as an optional,
>MPI-implementation-defined key.
>>I'm comfortable with the idea that the semantics are application
>>dependent and this is just a standard API for applications to use to
>>access these semantics but in that case the value of having this API
>>is significantly reduced.
>I can see your point. For the system-level checkpointing use-case, an
>MPI library that provides guarantees about the state of the network
>cards during the quiescent region is important.
>I have heard from a couple of application developers that providing a
>mechanism for 'fencing' all of the messages on a communicator in a
>single call is also useful for purely application level checkpointing.
>They just want a (better?) mechanism for forcing all messages to the
>recipients in a synchronized manner.
>So I believe that this API is useful even if the implementation
>specific details of how the message contents are buffered by the
>recipient are more loosely defined.
>I'm interested in hearing if others have suggestions on alternative
>use cases where this API might be useful.
>One use-case might be an application that would like to prevent the
>MPI library's progress engine from using CPU cycles for a region of
>time. It would use this API to define a region of time in which it
>may compute without interference from the MPI library's progress
>engine looking for unexpected receives. Just a thought, but I don't
>have a clear target application for this use-case so I did not include
>it in the proposal.
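
The "accounted for" definition in the quoted message can be illustrated
with a toy model, again in Python. Nothing here is an MPI call: Receiver
and accounted_for are hypothetical names, messages are (tag, payload)
tuples, and the "wire" is a simple queue. The sketch only shows the two
places a transmitted message may end up (the library's internal buffer
or the application's buffer) and the condition under which every sent
message has reached one of them:

```python
from collections import deque


class Receiver:
    """Toy model of one receiving process (not an MPI implementation)."""

    def __init__(self):
        self.posted = deque()      # tags of receives already posted
        self.unexpected = deque()  # messages buffered inside the library
        self.user_buffers = []     # messages placed in application buffers

    def post_receive(self, tag):
        """Post a receive; match a buffered message first if one exists."""
        for msg in self.unexpected:
            if msg[0] == tag:
                self.unexpected.remove(msg)
                self.user_buffers.append(msg)
                return
        self.posted.append(tag)

    def deliver(self, tag, payload):
        """A message arrives from the wire."""
        if tag in self.posted:
            self.posted.remove(tag)
            self.user_buffers.append((tag, payload))
        else:
            self.unexpected.append((tag, payload))

    def received_count(self):
        return len(self.unexpected) + len(self.user_buffers)


def accounted_for(sent_count, wire, receivers):
    """True when nothing is in flight and every sent message is at a recipient."""
    at_recipients = sum(r.received_count() for r in receivers)
    return not wire and at_recipients == sent_count


receiver = Receiver()
receiver.post_receive(1)                # a receive for tag 1 is posted
wire = deque([(1, "a"), (2, "b")])      # two messages still in flight
before = accounted_for(2, wire, [receiver])

while wire:                             # drain the wire
    receiver.deliver(*wire.popleft())
after = accounted_for(2, wire, [receiver])
print(before, after)
```

After draining, the tag 1 message sits in the application buffer and the
tag 2 message sits in the library's unexpected queue. Both count as
accounted for under the quoted definition.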