[Mpi3-ft] my notes from 11/18/08 ft meeting at SC '08
howardp at cray.com
Tue Nov 25 16:10:15 CST 2008
Here is a distillation of the notes I had from the meeting for MPI FT at
SC'08. I wrote them more in the style of minutes than high-level concept
notes. Please add or correct if you see problems/inaccuracies.
In attendance at the meeting were Kannan Narashimhan, Thomas Herault,
Greg Bronevetsky, Erez Hara, Hideyuki Jitsumoto, Greg Koenig, Rich Graham,
Howard Pritchard, and Josh Hursey.
There was an attempt to organize the discussion around the most recent version
of Rich's "Point-to-Point Communications recovery" document. Rich updated the
document during the course of the meeting.
The meeting began with a discussion of the "Purpose" section of this document.
The document listed symmetric and asymmetric cases. There were questions
about whether or not 'asymmetric' made sense here - how can it imply process
death? Rich said the document was not intended to cover communication/network
errors, but what to do when a process was dead/unreachable. It was decided to
remove the mention of 'asymmetric' from this part of the document.
At this point Rich brought up that he would like to turn this into a paper
suitable for submission to the 2009 ACM International Conference on Computing
Frontiers in about four weeks.
There was general agreement with the "Errors" section of the document.
Discussion next moved to the "Error States" section. The point was repeated
that an application process will only be able to determine that MPI has
detected an error when making an MPI call, using a communicator that is in an
error state. There was more discussion about communicators within a process
being in a 'consistent' state. Greg brought up the idea of a process as a
group of "local communicators", with Kannan adding the idea of "global
communicator" - a notion of local view vs global view. Greg made an analogy
to memory consistency models. Erez wanted to understand better what it means
to have an error state associated with a communicator - bringing up the
notion of virtual connections. One of the problems discussed was whether to
keep going if a communicator is being used for ranks that are still okay.
Thomas advocated lazy notification. Erez pulled up the VC document and
discussion continued about allowing a communicator to be used even if one or
more 'vcs' was 'bad', as long as the application wasn't trying to use these
'vcs'. The problem of MPI_ANY_SOURCE was brought into the discussion.
Questions were raised about what to do with received messages from a 'bad' vc
which had not yet been matched. There were also questions as to whether or
not this model was consistent.
Thomas drew up a diagram on a whiteboard showing where he thought we may
have a consistency issue - the problem requiring at least 3 bodies to show up.
The conclusions at this point were that the app would get an error if it
- tries to send to a rank/vc in a communicator which is in error
- tries to receive from a rank/vc in a communicator which is in error
- posts an MPI_ANY_SOURCE receive on a communicator with one or more bad vcs.
For the last case, if the post does match a message, the data is returned
along with the error indication. The app can at this point decide to repair
the communicator or just keep going. The app can continue to use
MPI_ANY_SOURCE, and, if messages are available to be received, MPI will
continue to return message data and the error status. It's up to the
application, however, to know that it's not necessary to repair the
communicator. It can ignore the error return but may eventually hang. There was some
dissatisfaction with this policy. Kannan suggested making it an option the
application could set, presumably at application startup. Another idea
considered would be to never hang in the MPI_ANY_SOURCE case but just return
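As a rough illustration of the error conditions concluded above, here is a toy Python model. It is only a sketch and not MPI code: ToyComm, VcError, mark_bad and deliver are hypothetical names, and the choice to skip unmatched messages from a 'bad' vc is an assumption, since that question was left open in the meeting.

```python
from collections import deque

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE in this toy model


class VcError(Exception):
    """Raised when an operation names a rank whose vc is marked bad."""


class ToyComm:
    """Hypothetical model of a communicator with per-rank vc state."""

    def __init__(self, size):
        self.bad = set()  # ranks whose vc is 'bad'
        self.queues = {r: deque() for r in range(size)}  # unmatched messages

    def mark_bad(self, rank):
        self.bad.add(rank)

    def deliver(self, src, data):
        # Simulates a message from src arriving and waiting to be matched.
        self.queues[src].append(data)

    def send(self, dest, data):
        # Lazy notification: the error shows up only when the bad vc is used.
        if dest in self.bad:
            raise VcError(f"send to bad rank {dest}")
        return True  # actual transmission is not modelled

    def recv(self, src):
        """Return (data, in_error); data is None if nothing is available."""
        if src == ANY_SOURCE:
            in_error = bool(self.bad)
            for rank, queue in self.queues.items():
                # Assumption: messages queued on a bad vc are not matched.
                if rank not in self.bad and queue:
                    return queue.popleft(), in_error
            # A real blocking receive could hang here.
            return None, in_error
        if src in self.bad:
            raise VcError(f"receive from bad rank {src}")
        queue = self.queues[src]
        return (queue.popleft() if queue else None), False
```

In this model a wildcard receive still hands back data from good ranks, together with the error flag, so the application can decide whether to repair the communicator or keep going.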
Discussion returned to the 'consistent state' problem again - led by Greg and
Kannan. The current thinking is two types of repair policies: repair with
hole, and repair with replace. Erez brought up the notion that the
latter would only work with MPI_COMM_WORLD, at least according to previous
discussions. Rich said that isn't sufficient. This then led to a lengthy
discussion about how to rebuild communicators in the
case of the repair with replace policy.
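To make the two repair policies concrete, here is a small Python sketch. It assumes that "repair with hole" leaves the failed rank's slot empty while surviving ranks keep their numbering, and that "repair with replace" puts a new process into the same slot. The function names and process ids are made up for illustration and were not part of the discussion.

```python
def repair_with_hole(members, failed):
    """Keep rank numbering; the failed processes' slots become holes (None)."""
    return [None if pid in failed else pid for pid in members]


def repair_with_replace(members, replacements):
    """Keep rank numbering; each failed process is swapped for its replacement."""
    return [replacements.get(pid, pid) for pid in members]


# members[i] is the process currently holding rank i in the communicator.
members = ["p0", "p1", "p2", "p3"]
with_hole = repair_with_hole(members, {"p2"})
with_replacement = repair_with_replace(members, {"p2": "p2-new"})
```

Under either policy the surviving processes keep their original ranks. The difference is whether rank 2 is unusable afterwards or is served by a new process.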
A scenario was outlined for the 'restarted/respawned' process:
- original process dies/goes away
- another process, detecting an error when using MPI with the 'errored'
communicator, respawns the process as part of its communicator repair
- new process starts
- out-of-band the restarted process (more accurately the MPI library
internally) knows about communicators that existed and somehow repairs them
The last step was the main point of further discussion. The issue of "naming"
of communicators came up. There has to be some way for the restarted process
to link the opaque communicator handles returned by MPI to the application.
This "naming" convention would allow for reconstructing communicators. To
avoid deadlock and address scalability, repair is considered to be
non-blocking, if not local. A diagram of the discussion was drawn on the
whiteboard at this point:
respawning process          respawned/new process
comm repair start           comm rejoin/patch start
comm repair complete        comm rejoin/patch complete
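The start/complete pairs in the diagram, combined with the "naming" idea, could be modelled roughly as below. This is a sketch under several assumptions: Registry stands in for whatever out-of-band mechanism the MPI library would use, communicator names are plain strings, and repair_start and rejoin are hypothetical names, not proposed API.

```python
class Registry:
    """Hypothetical out-of-band store of named communicators."""

    def __init__(self):
        self.members = {}    # communicator name -> process ids indexed by rank
        self.vacant = set()  # (name, rank) slots waiting for a replacement


def repair_start(reg, name, rank):
    """Respawning side: non-blocking start; returns a completion test."""
    reg.vacant.add((name, rank))
    return lambda: (name, rank) not in reg.vacant


def rejoin(reg, new_pid, dead_pid):
    """Respawned side: patch into every named communicator under repair."""
    patched = []
    for name, rank in sorted(reg.vacant):
        if reg.members[name][rank] == dead_pid:
            reg.members[name][rank] = new_pid
            reg.vacant.discard((name, rank))
            patched.append(name)
    return patched


reg = Registry()
reg.members = {"world": ["a", "b", "c"], "row": ["b", "c"]}
world_done = repair_start(reg, "world", 1)  # comm repair start
row_done = repair_start(reg, "row", 0)
patched = rejoin(reg, "b-new", "b")         # comm rejoin/patch
```

Neither side blocks on the other here. The respawning process polls its completion test, and the new process patches itself in using only the registry, which is one way the rejoin could behave as a local operation.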
The discussion leaned toward having the rejoin/patching of the communicator be
a local operation - or at least have the behavior of being local to the
There was also discussion of how to handle legacy libraries with a concept of
"shutting down" communicators being raised. This was not discussed in detail.
Erez brought up a final issue before the room had to be given up to the next
group. Can the respawning process continue to use the communicator to
exchange messages with "good" ranks while the communicator is being repaired?