[Mpi3-ft] Quiescence Interface
Josh Hursey
jjhursey at open-mpi.org
Thu Jan 22 12:26:44 CST 2009
On Jan 22, 2009, at 11:02 AM, Greg Bronevetsky wrote:
> At 07:21 AM 1/22/2009, Josh Hursey wrote:
>> I have updated the Quiescence Interface proposal on the wiki:
>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Quiescence
>>
>> I am interested in questions and comments from the group regarding
>> the interface, and use-cases. The primary use-case is for application
>> initiated checkpoint/restart, but the semantics of the proposed API
>> likely also open doors to other use-cases.
>>
>> Let me know what you think.
> I like that this proposal is more detailed than our previous
> discussions but I don't think that my major concern has been
> addressed yet. The key text in the proposal is "all of the in-flight
> messages have been accounted for in the specified comm". What does
> "accounted for" mean? Does the relevant message information go into
> some sort of buffer? Is it on the network card? What semantics are
> we actually guaranteeing here?
I missed some of this text when copying it over to the wiki. I updated
the wiki with some of the wording below.
'Accounted for' means that every message sent by a process on the
specified communicator has been transmitted to its recipient. On the
receiving side, the message is either buffered inside the MPI
implementation (if no matching receive has been posted) or placed in
the application's receive buffer (if a receive has been posted).
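
As a rough illustration of that definition, here is a toy Python model.
It is not MPI code and not the proposed interface: ToyComm, send,
post_recv, quiesce and accounted_for are hypothetical names made up for
this sketch, and tag/source matching is ignored (a posted receive simply
takes the next message for that rank). It only shows the two places a
message may end up once it has been "accounted for".

```python
from collections import deque


class ToyComm:
    """Toy bookkeeping model of one communicator (hypothetical, not MPI)."""

    def __init__(self, size):
        self.size = size
        # Messages that were sent but have not reached the recipient yet.
        self.in_flight = deque()
        # Per-rank queue of messages buffered by the library because no
        # receive was posted when they arrived.
        self.unexpected = [deque() for _ in range(size)]
        # Per-rank queue of posted receives; each entry is a user buffer.
        self.posted = [deque() for _ in range(size)]

    def send(self, src, dst, payload):
        self.in_flight.append((src, dst, payload))

    def post_recv(self, dst):
        """Post a receive on rank dst and return its (list) buffer."""
        buf = []
        if self.unexpected[dst]:
            # A message was already buffered by the library: hand it over.
            buf.append(self.unexpected[dst].popleft())
        else:
            self.posted[dst].append(buf)
        return buf

    def quiesce(self):
        """Move every in-flight message to its recipient."""
        while self.in_flight:
            src, dst, payload = self.in_flight.popleft()
            if self.posted[dst]:
                # A receive was posted: place it in the user's buffer.
                self.posted[dst].popleft().append((src, payload))
            else:
                # No receive posted: the library buffers it internally.
                self.unexpected[dst].append((src, payload))

    def accounted_for(self):
        return not self.in_flight


# Rank 1 posts one receive, rank 0 sends two messages, then we quiesce.
comm = ToyComm(2)
buf = comm.post_recv(1)
comm.send(0, 1, "a")
comm.send(0, 1, "b")
comm.quiesce()
```

After quiesce() in this model, the first message sits in the posted
receive's buffer, the second sits in the library-side queue for rank 1,
and nothing is left in flight.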
The MPI implementation can choose to buffer the message contents
however it determines best (network card, internal data structure,
specialized hardware). The MPI_Info argument may allow an
application to express requirements regarding how the MPI
implementation buffers the message contents with regard to MPI-managed
devices such as the network card. I think a key of this type is
implementation specific and should be left as an optional,
implementation-defined key.
> I'm comfortable with the idea that the semantics are application
> dependent and this is just a standard API for applications to use to
> access these semantics but in that case the value of having this API
> is significantly reduced.
I can see your point. For the system-level checkpointing use-case, an
MPI library that provides guarantees about the state of the network
cards during the quiescent region is important.
I have heard from a couple of application developers that providing a
mechanism for 'fencing' all of the messages on a communicator in a
single call is also useful for purely application level checkpointing.
They just want a (better?) mechanism for forcing all messages to the
recipients in a synchronized manner.
So I believe that this API is useful even if the implementation
specific details of how the message contents are buffered by the
recipient are more loosely defined.
I'm interested in hearing if others have suggestions on alternative
use cases where this API might be useful.
One use-case might be an application that would like to prevent the
MPI library's progress engine from using CPU cycles for a region of
time. It would use this API to define a region in which it
may compute without interference from the MPI library's progress
engine looking for unexpected receives. Just a thought, but I don't
have a clear target application for this use-case so I did not include
it in the proposal.
-- Josh