[Mpi3-ft] Choosing a BLANK or SHRINK model for the RTS proposal
jjhursey at open-mpi.org
Tue Jan 24 12:38:56 CST 2012
First let me say that I greatly appreciate the effort of Sayantan and
others to push us towards considering alternative techniques, and
stimulating discussion about design decisions. This is exactly the type of
discussion that needs to occur, and the working group is the
most appropriate place to have it.
One of the core suggestions of Sayantan's proposal is the switch from
(using FT-MPI's language) a model like BLANK to a model like SHRINK. I
think many of the other semantics are derived from this core shift, so we
should probably focus the discussion on this point early on.
The current RTS proposal allows for a communicator to contain failed
processes and continue to be used for all operations, including
collectives, after acknowledging them. This matches closely to FT-MPI's
BLANK mode. The user can use MPI_Comm_split() to get the equivalent of
SHRINK if they need it.
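To make the difference concrete, here is a toy model in plain C of how ranks look under the two modes. It is only a sketch and not MPI code: the `alive` array and the function names `blank_rank`, `shrink_rank`, and `shrink_size` are hypothetical, invented for illustration. The dense renumbering in `shrink_rank` mirrors the ordering that MPI_Comm_split() produces when every participant passes the same color and uses its old rank as the key; how failed processes take part in that call is up to the proposal and is not modeled here.

```c
#include <stddef.h>

/* Toy model, not MPI code: alive[r] != 0 means rank r of the original
 * communicator is still running. All names here are hypothetical. */

/* BLANK-like view: ranks keep their numbers and failed ranks leave holes.
 * Returns the unchanged rank if alive, -1 if the slot is blank. */
int blank_rank(const int *alive, size_t n, size_t old_rank)
{
    if (old_rank >= n || !alive[old_rank])
        return -1;
    return (int)old_rank;
}

/* SHRINK-like view: survivors are renumbered densely in old-rank order.
 * Returns the new rank if alive, -1 if the process has failed. */
int shrink_rank(const int *alive, size_t n, size_t old_rank)
{
    if (old_rank >= n || !alive[old_rank])
        return -1;
    int new_rank = 0;
    for (size_t r = 0; r < old_rank; r++) {
        if (alive[r])
            new_rank++;   /* count the survivors ahead of old_rank */
    }
    return new_rank;
}

/* Size of the SHRINK-like communicator: the number of survivors. */
size_t shrink_size(const int *alive, size_t n)
{
    size_t count = 0;
    for (size_t r = 0; r < n; r++) {
        if (alive[r])
            count++;
    }
    return count;
}
```

With `alive = {1, 0, 1, 1, 0}`, old rank 2 stays rank 2 in the BLANK-like view but becomes rank 1 in the SHRINK-like view, and the communicator size drops from 5 to 3.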
The suggested modification allows for only/primarily a SHRINK-like mode in
order to have full functionality in the communicator. As discussed on the
previous call, one can get the BLANK mode by adding a library on top of MPI
that virtualizes the communicators to create shadow communicators. The
argument for the SHRINK mode is that it is -easier- to pass/explain.
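As a rough idea of what such a virtualization layer would have to maintain, the sketch below keeps a translation table from the original (virtual) ranks to ranks in the underlying shrunk communicator. This is an assumption about how a shadow-communicator library might be structured, not a description of any existing one; `shadow_map`, `shadow_rebuild`, `shadow_translate`, and `SHADOW_FAILED` are hypothetical names, and the table has to be rebuilt after every shrink.

```c
#include <stddef.h>

#define SHADOW_FAILED (-1)   /* hypothetical marker for a failed virtual rank */

/* Hypothetical shadow table: to_shrunk[v] is the rank that virtual
 * (original) rank v currently has in the underlying shrunk communicator. */
typedef struct {
    int   *to_shrunk;   /* caller-owned array with n entries */
    size_t n;           /* size of the original communicator */
    size_t live;        /* number of surviving processes */
} shadow_map;

/* Recompute the table after a shrink; alive[v] != 0 means virtual rank v
 * survived. Survivors are assumed to be renumbered in original-rank order. */
void shadow_rebuild(shadow_map *m, const int *alive)
{
    int next = 0;
    for (size_t v = 0; v < m->n; v++)
        m->to_shrunk[v] = alive[v] ? next++ : SHADOW_FAILED;
    m->live = (size_t)next;
}

/* Translate a virtual rank into a rank usable on the shrunk communicator,
 * or SHADOW_FAILED if that process is gone or the rank is out of range. */
int shadow_translate(const shadow_map *m, size_t v)
{
    return v < m->n ? m->to_shrunk[v] : SHADOW_FAILED;
}
```

The application keeps addressing peers by their original ranks, and every send or receive goes through `shadow_translate` first. That extra layer of bookkeeping on each operation is the cost of getting BLANK-like behavior from outside MPI.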
The reason we chose BLANK was derived from the literature reviewed, code
examples available, and feedback from application groups, from which there
seemed to be a strong demand for the BLANK mode. In fact, I had a difficult
time finding good use cases for the SHRINK mode (I'm still looking
though). Additionally, a BLANK mode seems also to make it easier to reason
about process recovery. To reason about process recovery (something like
FT-MPI's REBUILD mode) one needs to be able to reason about the missing
processes without changing the identities of the existing processes, which
can be difficult in a SHRINK mode. So from this review it seemed that there
was an application demand for a BLANK-like mode for the RTS proposal.
In light of this background, it is concerning to me to advise these
application users that MPI will not provide the functionality they require,
but they have to depend upon a non-standard, third-party library because
we shied away from doing the right thing by them. This background is
informed by my review of the state of the art, but others may have
alternative evidence/commentary to present as well that could sway the
discussion. It just seems like a weak argument that we should do the easy
thing at the expense of doing the right thing by the application community.
I certainly meant this email to stimulate conversation for the
teleconference tomorrow. In particular, I would like those on the list with
experience building ABFT/Natural FT applications/libraries (UTK?) to
express their perspective on this topic. Hopefully they can help guide us
towards the right solution, which might just be a SHRINK-like mode.
Postdoctoral Research Associate
Oak Ridge National Laboratory