[Mpi3-ft] Quiescence Interface

Greg Bronevetsky bronevetsky1 at llnl.gov
Wed Jan 28 00:07:13 CST 2009

First, Josh, your proposal has overcome my biggest objection: it 
doesn't specify things that cannot be specified in MPI and is 
therefore well-defined.

However, having spent some time thinking about it, there is something 
that doesn't make sense to me. The quiescence API states that all MPI 
communication must be above a certain level of abstraction. Let's say 
for the sake of argument that this is the level between MPI and 
Infiniband Verbs or sockets, meaning that all messages are in RAM and 
not on the network card. However, even in this case this is not 
sufficient for recovery because on restart the IDs of the nodes the 
application is running on will be different from those in the 
original execution. Thus, MPI will restart normally and any 
communication that uses the old node IDs will fail. The same applies 
to other entities and resources such as pinned memory, Infiniband 
connections, etc. As such, we really have two options.
         - Require MPI to essentially create a restart image of 
itself so that on restart it can recreate all the data that changes, 
such as node IDs, pinned memory, connections, etc. This is quite 
demanding from the perspective of MPI developers but may be doable. 
(MPI implementors, please chime in here!)
         - Let the checkpointing library completely virtualize 
everything at the preferred abstraction level so that on restart even 
if MPI uses node IDs from its original execution, this virtualization 
layer will remap them to their new IDs on the fly. The same can be 
done for most other resources and entities that exist below the level 
of abstraction. However, if the checkpointing library can do that, 
there is no need for the quiescence API because the checkpointing 
library can ensure quiescence on its own by using the same mechanisms 
that MPI would have used.
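
To make the second option concrete, here is a minimal sketch of the remapping idea, written in Python purely for illustration. The class, its method names, and the node IDs are all hypothetical; this is not the interface of any MPI implementation or checkpointing library, only one way such a virtualization layer might translate stale node IDs on the fly.

```python
class NodeIdRemapper:
    """Hypothetical virtualization shim: maps node IDs recorded during the
    original execution to the IDs assigned after restart."""

    def __init__(self):
        # original (checkpoint-time) node ID -> current node ID
        self._current_of = {}

    def register(self, node_id):
        """During the original execution, each ID maps to itself."""
        self._current_of[node_id] = node_id

    def on_restart(self, new_ids):
        """After restart, bind each original ID to its replacement.

        new_ids: dict of original node ID -> node ID in the new execution.
        """
        for original, current in new_ids.items():
            if original not in self._current_of:
                raise KeyError(f"unknown original node ID: {original}")
            self._current_of[original] = current

    def translate(self, original_id):
        """Translate an ID that MPI still holds from the original run."""
        return self._current_of[original_id]


remapper = NodeIdRemapper()
for node in (10, 11, 12):
    remapper.register(node)

# The restart placed the job on different nodes (IDs invented for illustration).
remapper.on_restart({10: 42, 11: 43, 12: 44})
print(remapper.translate(11))  # 43
```

MPI would keep using the original IDs (10, 11, 12) and every call below the abstraction level would pass through translate(). Pinned memory and connections would need analogous tables, which is why this approach amounts to virtualizing everything.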

As such, it appears to me that we either force system-level 
checkpointing layers to virtualize everything on their own (already 
done by VMM-based checkpointers) or we provide a full 
checkpoint/restart API for MPI rather than a simple quiescence API. 
So the options are to do nothing or to demand everything, neither of 
which is appealing.

Does this make sense? What do you guys think?

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov

At 10:26 AM 1/22/2009, Josh Hursey wrote:

>On Jan 22, 2009, at 11:02 AM, Greg Bronevetsky wrote:
>>At 07:21 AM 1/22/2009, Josh Hursey wrote:
>>>I have updated the Quiescence Interface proposal on the wiki:
>>>  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Quiescence
>>>I am interested in questions and comments from the group regarding
>>>the interface and use-cases. The primary use-case is for application
>>>initiated checkpoint/restart, but the semantics of the proposed API
>>>likely also open doors to other use-cases.
>>>Let me know what you think.
>>I like that this proposal is more detailed than our previous
>>discussions but I don't think that my major concern has been
>>addressed yet. The key text in the proposal is "all of the in-flight
>>messages have been accounted for in the specified comm". What does
>>"accounted for" mean? Does the relevant message information go into
>>some sort of buffer? Is it on the network card? What semantics are
>>we actually guaranteeing here?
>I missed some of this text when copying it over to the wiki. I updated
>the wiki with some of the wording below.
>'accounted for' means that any message sent by one process on the
>specified communicator has been transmitted to the recipient. The
>recipient may either buffer the message in the MPI implementation (if
>no receive has been posted) or place it in the recipient's buffer (if
>a receive has been posted).
>The MPI implementation can choose to buffer the message contents
>however it determines best (network card, internal data structure,
>specialized hardware). The MPI_Info argument set may allow an
>application to express any requirements regarding how the MPI
>implementation buffers the message contents with regard to MPI managed
>devices such as the network card. I think specifying this type of a
>key is implementation specific and should be left as an optional, MPI
>implementation defined key.
>>I'm comfortable with the idea that the semantics are application
>>dependent and this is just a standard API for applications to use to
>>access these semantics but in that case the value of having this API
>>is significantly reduced.
>I can see your point. For the system-level checkpointing use-case, an
>MPI library that provides guarantees about the state of the network
>cards during the quiescent region is important.
>I have heard from a couple of application developers that providing a
>mechanism for 'fencing' all of the messages on a communicator in a
>single call is also useful for purely application level checkpointing.
>They just want a (better?) mechanism for forcing all messages to the
>recipients in a synchronized manner.
>So I believe that this API is useful even if the implementation
>specific details of how the message contents are buffered by the
>recipient are more loosely defined.
>I'm interested in hearing if others have suggestions on alternative
>use cases where this API might be useful.
>One use-case might be an application that would like to prevent the
>MPI library's progress engine from using CPU cycles for a region of
>time. It would use this API to define a region of time in which it
>may compute without interference from the MPI library's progress
>engine looking for unexpected receives. Just a thought, but I don't
>have a clear target application for this use-case so I did not include
>it in the proposal.
>-- Josh
>mpi3-ft mailing list
>mpi3-ft at lists.mpi-forum.org
>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
