[Mpi3-ft] Quiescence Interface

Thu Jan 22 12:26:44 CST 2009

On Jan 22, 2009, at 11:02 AM, Greg Bronevetsky wrote:

> At 07:21 AM 1/22/2009, Josh Hursey wrote:
>> I have updated the Quiescence Interface proposal on the wiki:
>>  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Quiescence
>>
>> I am interested in questions and comments from the group regarding the
>> interface, and use-cases. The primary use-case is for application
>> initiated checkpoint/restart, but the semantics of the proposed API
>> likely also open doors to other use-cases.
>>
>> Let me know what you think.
> I like that this proposal is more detailed than our previous  
> discussions but I don't think that my major concern has been  
> addressed yet. The key text in the proposal is "all of the in-flight  
> messages have been accounted for in the specified comm". What does  
> "accounted for" mean? Does the relevant message information go into  
> some sort of buffer? Is it on the network card? What semantics are  
> we actually guaranteeing here?

I missed some of this text when copying it over to the wiki. I updated  
the wiki with some of the wording below.

'Accounted for' means that every message sent by a process on the
specified communicator has been transmitted to its recipient. The
recipient either buffers the message inside the MPI implementation (if
no matching receive has been posted) or places it in the application's
receive buffer (if a matching receive has been posted).
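
To make that wording concrete, here is a toy model in Python. It is only a
sketch of the semantics described above, not an MPI binding and not the
proposed API: the class ToyComm and its methods are hypothetical names, and
matching is simplified to "any source, any tag" in posting order.

```python
from collections import deque


class ToyComm:
    """Toy model of one communicator (hypothetical, not an MPI binding)."""

    def __init__(self, size):
        self.in_flight = deque()  # messages sent but not yet at the recipient
        self.posted = [deque() for _ in range(size)]      # posted receive buffers
        self.unexpected = [deque() for _ in range(size)]  # library-side buffering
        self.sent = 0
        self.delivered = 0

    def send(self, src, dst, payload):
        self.in_flight.append((src, dst, payload))
        self.sent += 1

    def post_recv(self, dst, buf):
        # buf is a list standing in for the application's receive buffer.
        if self.unexpected[dst]:
            # A message is already buffered by the library: complete now.
            buf.append(self.unexpected[dst].popleft()[1])
        else:
            self.posted[dst].append(buf)

    def quiesce(self):
        # Drain every in-flight message to its recipient.
        while self.in_flight:
            src, dst, payload = self.in_flight.popleft()
            if self.posted[dst]:
                # Receive was posted: place data in the application's buffer.
                self.posted[dst].popleft().append(payload)
            else:
                # No receive posted: buffer inside the "implementation".
                self.unexpected[dst].append((src, payload))
            self.delivered += 1
        # "Accounted for": nothing in flight, every send has been delivered.
        return self.sent == self.delivered


comm = ToyComm(size=2)
buf = []
comm.post_recv(dst=1, buf=buf)  # one receive posted on rank 1
comm.send(0, 1, "a")            # will land in buf
comm.send(0, 1, "b")            # no receive posted, will be buffered
all_accounted = comm.quiesce()
```

After quiesce() returns in this model, "a" sits in the posted buffer, "b" sits
in rank 1's unexpected queue, and nothing remains in flight. Where a real
implementation keeps the buffered message is a separate question.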

The MPI implementation can choose to buffer the message contents
however it determines best (network card, internal data structure,
specialized hardware). The MPI_Info argument may allow an
application to express requirements regarding how the MPI
implementation buffers the message contents with regard to MPI managed
devices such as the network card. I think this type of key is
implementation specific and should be left as an optional key defined
by the MPI implementation.

> I'm comfortable with the idea that the semantics are application  
> dependent and this is just a standard API for applications to use to  
> access these semantics but in that case the value of having this API  
> is significantly reduced.

I can see your point. For the system-level checkpointing use-case, an  
MPI library that provides guarantees about the state of the network  
cards during the quiescent region is important.

I have heard from a couple of application developers that providing a  
mechanism for 'fencing' all of the messages on a communicator in a  
single call is also useful for purely application level checkpointing.  
They just want a (better?) mechanism for forcing all messages to the  
recipients in a synchronized manner.

So I believe that this API is useful even if the implementation  
specific details of how the message contents are buffered by the  
recipient are more loosely defined.

I'm interested in hearing if others have suggestions on alternative  
use cases where this API might be useful.

One use-case might be an application that would like to prevent the
MPI library's progress engine from using CPU cycles for a region of
time. It would use this API to define a region of time in which it
may compute without interference from the MPI library's progress
engine looking for unexpected receives. Just a thought, but I don't
have a clear target application for this use-case, so I did not include
it in the proposal.

-- Josh