[Mpi3-ft] my notes from 11/18/08 ft meeting at SC '08

Howard Pritchard howardp at cray.com
Tue Nov 25 16:10:15 CST 2008

Hello Folks,

Here is a distillation of the notes I had from the meeting for MPI FT at 
SC'08.  I wrote them more in the style of minutes than high-level concept 
notes.  Please add or correct if you see problems/inaccuracies.

In attendance at the meeting were Kannan Narashimhan, Thomas Herault,
Greg Bronevetsky, Erez Hara, Hideyuki Jitsumoto, Greg Koenig, Rich Graham,
Howard Pritchard, and Josh Hursey.

There was an attempt to organize the discussion around the most recent version 
of Rich's "Point-to-Point Communications recovery" document.  Rich updated the 
document during the course of the meeting.

The meeting began with a discussion of the "Purpose" section of this document. 
The document listed symmetric and asymmetric cases.  There were questions 
about whether or not 'asymmetric' made sense here - how can it imply process 
death?  Rich said the document was not intended to cover communication/network 
errors, but what to do when a process was dead/unreachable.  It was decided to 
remove the mention of 'asymmetric' from this part of the document.

At this point Rich brought up that he would like to turn this into a paper 
suitable for submission to the 2009 ACM International Conference on Computing 
Frontiers in about four weeks.

There was general agreement with the "Errors" section of the document.

Discussion next moved to the "Error States" section.  The point was repeated 
that an application process will only be able to determine that MPI has 
detected an error when making an MPI call using a communicator that is in an 
error state.  There was more discussion about communicators within a process 
being in a 'consistent' state.  Greg brought up the idea of a process as a 
group of "local communicators", with Kannan adding the idea of a "global 
communicator" - a notion of local view vs. global view.  Greg made an analogy 
to memory consistency models.  Erez wanted to understand better what it means 
to have an error state associated with a communicator - bringing up the 
notion of virtual connections.  One of the problems discussed was whether to 
keep going if a communicator is being used for ranks that are still okay.  
Thomas advocates lazy notification.  Erez pulled up the VC document and 
discussion continued about allowing a communicator to be used even if one or 
more 'vcs' were 'bad', as long as the application wasn't trying to use those 
'vcs'.  The problem of MPI_ANY_SOURCE was brought into the discussion.  
Questions were raised about what to do with received messages from a 'bad' vc 
which had not yet been matched.  There were also questions as to whether or 
not this model was consistent.

Thomas drew up a diagram on a white board showing where he thought we may 
have a consistency issue - the problem requiring at least three parties to show up.

The conclusions at this point were that the app would get an error if
- it tries to send to a rank/vc in a communicator which is in error
- it tries to receive from a rank/vc in a communicator which is in error
- it posts an MPI_ANY_SOURCE receive on a communicator with one or more bad vcs.

For the last case, if the post does match a message, the data is returned 
along with the error indication.  The app can at this point decide to repair 
the communicator or just keep going.  The app can continue to use 
MPI_ANY_SOURCE, and, if messages are available to be received, MPI will 
continue to return message data and the error status.  It is up to the 
application to know whether repairing the communicator is necessary, however.  
It can ignore the error return but may eventually hang.  There was some 
dissatisfaction with this policy.  Kannan suggested making it an option the 
application could set, presumably at application startup.  Another idea 
considered was to never hang in the MPI_ANY_SOURCE case but simply return 
an error.

Discussion returned to the 'consistent state' problem again - led by Greg and 
Kannan.  The current thinking is two types of repair policies: repair with 
hole, and repair with replace.  Erez brought up the notion that the latter 
would only work with MPI_COMM_WORLD - at least according to previous 
discussions.  Rich said that isn't sufficient.  This then led to a lengthy 
discussion about how to rebuild communicators in the case of the repair with 
replace policy.

A scenario was outlined for the 'restarted/respawned' process:
- the original process dies/goes away
- another process, detecting an error when using MPI with the 'errored' 
communicator, respawns the process as part of its communicator repair
- the new process starts
- out of band, the restarted process (more accurately, the MPI library 
internally) learns about the communicators that existed and somehow repairs them

The last step was the main point of further discussion.  The issue of "naming" 
of communicators came up.  There has to be some way for the restarted process 
to link the opaque communicator handles returned by MPI to the application.  
This "naming" convention would allow for reconstructing communicators.  To 
avoid deadlock and address scalability, repair is considered to be non-
blocking, if not local.  A diagram of the discussion was drawn on the 
whiteboard at this point:

respawning process                     respawned/new process
------------------                     ---------------------

comm repair start                      comm rejoin/patch start
comm repair complete                   comm rejoin/patch complete

The discussion leaned toward having the rejoin/patching of the communicator be 
a local operation - or at least behave as if it were local to the 
respawned process.

There was also discussion of how to handle legacy libraries, with the concept 
of "shutting down" communicators being raised.  This was not discussed in detail.

Erez brought up a final issue before the room had to be given up to the next 
group.  Can the respawning process continue to use the communicator to 
exchange messages with "good" ranks while the communicator is being repaired?


More information about the mpiwg-ft mailing list