[Mpi3-ft] An alternative FT proposal

Josh Hursey jjhursey at open-mpi.org
Fri Feb 10 15:10:03 CST 2012

I think that the proposal sketched in this email represents a good step
towards a fault tolerant MPI standard. There are some interfaces and
semantics that we want/need, but, as previously discussed on the list, we
can build them on top (at likely considerable expense). Then sometime later
we might push to standardize them for the sake of programmability and
efficiency. But this gives us something to start with, and points in the
right direction.

I have a couple questions that I was hoping you all could clarify for me,
and a few general comments.
- The wording "...must not return an error" tripped me up a few times. I
know what you are trying to say, but we need to find a cleaner way. "Does
not return an error related to process faults" or something.

- I am not a fan of the error name MPI_ERR_FAILED. It seems overly general.
Since the error is specific to process failure maybe consider

- Should MPI_ERR_INVALIDATED be specific to communicators (e.g.,
MPI_ERR_COMM_INVALIDATED), or can it be generally reused for other
communication objects (e.g., windows)? I mention this partially because the
description of this error class ties it exclusively to communicators, and I
did not know if that was intentional.

- If all communicators use MPI_ERRORS_RETURN and one process calls
MPI_Abort() on a sub-communicator what happens? Are you taking the
semantics from the RTS proposal, or leaving it ambiguous?
  - The ability to contain the abort or MPI_ERRORS_ARE_FATAL semantics to
just those processes in the communicator represented a number of use cases
for ensemble type applications. I would like to see those semantics
preserved in this proposal.

- The section on MPI_ANY_SOURCE was a little unclear to me.
  - If I have a blocking recv ANY_SOURCE and a failure emerges while
waiting - does that return an MPI_ERR_FAILED or MPI_ERR_PENDING? If the
latter, then what is the handle to the message? I think this was just
missed in the description.
  - If I have a nonblocking recv ANY_SOURCE and a failure emerges while
waiting - Then I am returned MPI_ERR_PENDING. If I wait on that request
again, I keep getting MPI_ERR_PENDING? Until I call something like
MPI_Comm_failure_ack(), right?
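
To make the nonblocking case concrete, here is a toy model of the reading
described above: a wait on an ANY_SOURCE receive keeps returning
MPI_ERR_PENDING while an unacknowledged failure exists. It is only a sketch
of that one interpretation, not the proposal's definition. ToyComm,
AnySourceRecv, notice_failure and deliver are hypothetical names, no MPI
library is called, and returning None stands in for "would keep blocking".

```python
# Toy model only: hypothetical classes, no MPI library involved.
MPI_SUCCESS = "MPI_SUCCESS"
MPI_ERR_PENDING = "MPI_ERR_PENDING"


class ToyComm:
    """Tracks locally known failures and which of them were acknowledged."""

    def __init__(self, size):
        self.size = size
        self.known_failed = set()  # failures this process has noticed
        self.acked = set()         # snapshot taken by failure_ack()
        self.inbox = []            # (source, payload) messages that arrived

    def notice_failure(self, rank):
        self.known_failed.add(rank)

    def deliver(self, source, payload):
        self.inbox.append((source, payload))

    def failure_ack(self):
        # Stand-in for the proposed MPI_Comm_failure_ack().
        self.acked = set(self.known_failed)

    def failure_get_acked(self):
        # Stand-in for the proposed MPI_Comm_failure_get_acked().
        return sorted(self.acked)


class AnySourceRecv:
    """A posted nonblocking receive from MPI_ANY_SOURCE."""

    def __init__(self, comm):
        self.comm = comm
        self.message = None

    def wait(self):
        # Assumption: a matching message always completes the request.
        if self.comm.inbox:
            self.message = self.comm.inbox.pop(0)
            return MPI_SUCCESS
        # Assumption: any unacknowledged failure leaves the request pending.
        if self.comm.known_failed - self.comm.acked:
            return MPI_ERR_PENDING
        return None  # a real wait would keep blocking here


comm = ToyComm(size=4)
req = AnySourceRecv(comm)
comm.notice_failure(2)
first = req.wait()                     # pending: rank 2 failed, not acked
second = req.wait()                    # still pending
comm.failure_ack()
acked_list = comm.failure_get_acked()  # snapshot of acknowledged failures
blocked = req.wait()                   # nothing to report, would block
comm.deliver(1, "hello")
third = req.wait()                     # completes with the message from rank 1
```

In this model the request handle stays valid across the MPI_ERR_PENDING
returns, which is the behavior the question above is asking to have
confirmed.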

- In the advice to users on page 2 lines 58-60, the comment about
non-rooted collectives is unclear as worded. I had to read it a few times
to understand both the scenario and the semantic implications. It seems
that the scenario is that the failed process never joined the non-rooted
collective, so all other processes are dependent upon its participation.
Those processes blocking in the collective will eventually find
out that a process failure is blocking the collective, and return an error.
  - It would also be good to point out that if the failed process failed
inside the collective then, even if it was a non-rooted collective, you
could get some alive processes returning MPI_SUCCESS and others returning
MPI_ERR_FAILED, since it depends on when the process failed in the
underlying protocol.
  - There might be wording in the RTS proposal that you can copy over about
this scenario.
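
The mixed-outcome scenario can be illustrated with a toy protocol. The
sketch below assumes a non-rooted collective that is implemented internally
as a fan-in followed by a binary-tree fan-out from rank 0, and that one
process dies after receiving the result but before forwarding it. The
function name fan_out_results and the tree layout are assumptions made for
illustration; real implementations use other algorithms, and the set of
ranks that see an error would differ accordingly.

```python
# Toy model only: one hypothetical internal protocol, no MPI library involved.
MPI_SUCCESS = "MPI_SUCCESS"
MPI_ERR_FAILED = "MPI_ERR_FAILED"


def fan_out_results(size, failed_rank):
    """Return {rank: result} for the alive ranks after a binary-tree fan-out.

    Assumes the fan-in phase already finished at rank 0, and that
    failed_rank dies before forwarding the result to its children.
    """
    reached = {0}  # rank 0 holds the combined result
    for rank in range(size):  # parents always have smaller ranks than children
        if rank not in reached or rank == failed_rank:
            continue  # this rank has nothing to forward, or it is dead
        for child in (2 * rank + 1, 2 * rank + 2):
            if child < size:
                reached.add(child)

    results = {}
    for rank in range(size):
        if rank == failed_rank:
            continue  # the dead process returns nothing
        results[rank] = MPI_SUCCESS if rank in reached else MPI_ERR_FAILED
    return results


results = fan_out_results(size=7, failed_rank=1)
```

With 7 ranks and rank 1 failing, ranks 3 and 4 (the children of rank 1 in
this layout) see MPI_ERR_FAILED while ranks 0, 2, 5 and 6 see MPI_SUCCESS
from the same operation.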

- Communicator Creation (advice to user on page 2 lines 62-68)
  - Minor point, but some of this needs to be normative text since it
clarifies how these operations are expected to behave.
  - Since the object can be partially created (created somewhere, but not
everywhere if a failure emerges) is there any protection from the MPI if
the user decides to use the communicator for point-to-point operations?
Maybe between two peers that were both returned valid communicators (even
though not everyone else did)?
  - You will need to add some text to fix MPI_Comm_free, which is a
collective over the input communicator. So it needs something that allows
the user to 'free' the partially valid communicator object.
  - The example with barrier is correct, but slightly misleading to the
casual reader, since 'success' from the barrier is a local state, and does
not tell them that the other alive processes that called the barrier also
received 'success' from the same operation (due to a failure emerging during
the collective). So they locally know that the communicator was created,
but a peer that received MPI_ERR_FAILED from the operation does not. So
saying that if you receive success locally from the barrier that you can
use the communicator normally is slightly misleading. I suppose that if the
barrier fails anywhere then that process will call MPI_Comm_invalidate, and
the whole thing falls over. But if they only ever intend to use
point-to-point operations and instead call MPI_Comm_failure_ack, what is
the expected behavior or is that erroneous?
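
The "whole thing falls over" path can be sketched as a toy model. It covers
only the case where every rank that sees an error from the barrier calls
MPI_Comm_invalidate; the MPI_Comm_failure_ack variant is left out because
its behavior is exactly the open question. SharedComm, next_operation and
react_to_barrier are hypothetical names, and invalidation is modeled as
instantaneous, whereas a real implementation would propagate it
asynchronously.

```python
# Toy model only: hypothetical names, no MPI library involved.
MPI_SUCCESS = "MPI_SUCCESS"
MPI_ERR_FAILED = "MPI_ERR_FAILED"
MPI_ERR_INVALIDATED = "MPI_ERR_INVALIDATED"


class SharedComm:
    """State of one communicator as seen by all of its ranks."""

    def __init__(self):
        self.invalidated = False

    def invalidate(self):
        # Stand-in for the proposed MPI_Comm_invalidate().
        self.invalidated = True

    def next_operation(self):
        # Assumption: every later operation on an invalidated communicator
        # reports MPI_ERR_INVALIDATED.
        return MPI_ERR_INVALIDATED if self.invalidated else MPI_SUCCESS


def react_to_barrier(comm, barrier_results):
    """Ranks that saw an error invalidate; return each rank's next result."""
    for rank, result in sorted(barrier_results.items()):
        if result != MPI_SUCCESS:
            comm.invalidate()
    return {rank: comm.next_operation() for rank in barrier_results}


# Rank 2 died in the barrier; rank 3 noticed, ranks 0 and 1 did not.
mixed = react_to_barrier(
    SharedComm(), {0: MPI_SUCCESS, 1: MPI_SUCCESS, 3: MPI_ERR_FAILED}
)
# No failure: nobody invalidates and the communicator stays usable.
clean = react_to_barrier(SharedComm(), {0: MPI_SUCCESS, 1: MPI_SUCCESS})
```

In the mixed case, ranks 0 and 1 received local success from the barrier
yet still get MPI_ERR_INVALIDATED on their next operation, which is why
local success alone does not mean the communicator can be used normally.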

- Thinking about MPI_Comm_create_group which is collective over a subgroup
of comm, would it be valid to call that operation even if the input
communicator contained failed processes not in the collective subgroup, or
was invalidated by a peer?
  - The authors of this operation cite fault tolerance applications
as beneficiaries of this operation, I am wondering if we can preserve that
assertion in this model.

- Once a communicator is invalidated you do not state if the outstanding
messages are dropped, completed in error, or completed in success (if
already matched).
  - If I have a nonblocking send request that completed inside MPI (so
marked as completed successfully, but I have not waited on it yet), and a
peer invalidates the communicator. I call a different MPI operation and it
returns MPI_ERR_INVALIDATED. If I then call the wait on that outstanding
  - If I call MPI_Comm_free on an invalidated communicator, that is a
collective operation and should be exempt from the rule that all non-local
  - If I call MPI_Comm_free and there are outstanding requests there is
some troubling language in the standard about what MPI -should- do. Meaning
the standard says that if the request is outstanding and completes in
error, but the original communicator has been freed then
MPI_ERRORS_ARE_FATAL is enacted. We might want to think about how we can
get around this so users can easily 'throw away' communicators that they do
not want any longer.

- If you call MPI_Comm_shrink(MPI_COMM_WORLD) when MCW has outstanding
failures, what is the resulting semantic?

- Do all processes need to call MPI_Comm_invalidate before they call
MPI_Comm_shrink? Or if they received the error MPI_ERR_INVALIDATED, is the
invalidation implied?
  - It is unclear from the sentence on page 3 lines 90-91 whether the user
needs to use MPI_Comm_invalidate to 'invalidate' the communicator or whether a
peer could have done it for them.

- Does MPI_Comm_size() return the number of members of the group regardless
of failed state, or just the number of alive processes?

- MPI_Comm_failure_ack() takes a snapshot of the set of locally known
failed processes on the communicator which can be accessed by
MPI_Comm_failure_get_acked(). This set of failed processes is "stored" per
communicator. When we proposed something similar, we received push back
about the amount of memory per-communicator this would require. I know we
talked about how to mitigate this briefly in the past, but can you speak to
how you would expect an MPI implementation to efficiently manage this state?
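
One possible way to keep the per-communicator cost small, shown purely as a
sketch and not taken from the proposal: keep a single process-wide,
append-only log of failed world ranks, and let each communicator store only
an integer recording how much of that log it has acknowledged. FailureLog,
AckComm and acked_upto are hypothetical names. The model also assumes that
failure knowledge is shared by all communicators in the process, which the
proposal may or may not intend.

```python
# Toy model only: hypothetical names, no MPI library involved.
class FailureLog:
    """Process-wide, append-only list of failed world ranks."""

    def __init__(self):
        self.failed = []

    def record(self, world_rank):
        if world_rank not in self.failed:
            self.failed.append(world_rank)


class AckComm:
    """A communicator whose only fault-related state is one integer."""

    def __init__(self, log, world_ranks):
        self.log = log
        # Member i of this communicator has world rank world_ranks[i].
        self.world_ranks = list(world_ranks)
        self.acked_upto = 0  # length of the log prefix acknowledged so far

    def failure_ack(self):
        # Stand-in for the proposed MPI_Comm_failure_ack().
        self.acked_upto = len(self.log.failed)

    def failure_get_acked(self):
        # Stand-in for the proposed MPI_Comm_failure_get_acked():
        # translate acknowledged world ranks into ranks of this communicator.
        members = {w: i for i, w in enumerate(self.world_ranks)}
        acked = self.log.failed[: self.acked_upto]
        return [members[w] for w in acked if w in members]


log = FailureLog()
world = AckComm(log, range(8))
evens = AckComm(log, [0, 2, 4, 6])

log.record(4)
log.record(5)
evens.failure_ack()   # evens acknowledges world ranks 4 and 5
log.record(6)         # noticed later, not yet acknowledged anywhere

evens_acked = evens.failure_get_acked()
world_acked_before = world.failure_get_acked()
world.failure_ack()
world_acked_after = world.failure_get_acked()
```

The trade-off in this sketch is that MPI_Comm_failure_get_acked() does a
translation pass over the acknowledged log entries on each call instead of
storing a per-communicator set.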

- MPI_Comm_failure_ack() allows the user to acknowledge the locally known
set of failed processes. This allows previously posted nonblocking receive
ANY operations (that may have already returned MPI_ERR_PENDING) to succeed.
  - If you only use directed point-to-point communication do you ever need
to call this operation? I guess only if you need a list of locally known
failures, but the user can track that themselves.
  - As far as message completion goes, this operation only has an effect on ANY
source operations, right? Collectives remain disabled, and directed p2p are
  - So I can have a directed nonblocking p2p operation that spans the
MPI_Comm_failure_ack() call?

- Thinking forward a bit, have you thought about how we might add process
recovery to this model? Maybe MPI_Comm_grow()?


On Mon, Feb 6, 2012 at 7:31 PM, George Bosilca <bosilca at eecs.utk.edu> wrote:

> FT working group,
> As announced, we have been working on an alternative FT proposal. The
> leading idea of this proposal is to relieve the burden of consistency from
> the MPI implementations, while providing the means for the user to regain
> control. As suggested in the mailing list a few days ago, this proposal
> tries to minimize the semantic changes and additional functions. We believe
> the proposed set of functions is minimal, yet sufficient to implement
> stronger consistency models, such as the current working group proposal.
> The proposal is not yet complete, we are still working on intercoms, RMA
> and file operations, but we wanted to submit it for feedback and discussion
> on the call this week.
>  The UTK team.
> PS: The abstract and the proposal are attached below.
> Abstract:
> In this document we propose a flexible approach providing fail-stop
> process fault tolerance by allowing the application to react to failures
> while maintaining a minimal execution path in failure-free executions. Our
> proposal focuses on returning control to the application by avoiding
> deadlocks due to failures within the MPI library. No implicit, asynchronous
> error notification is required. Instead, functions are provided to allow
> processes to invalidate any communication object, thus preventing any
> process from waiting indefinitely on calls involving the invalidated
> objects. We consider the proposed set of functions to constitute a minimal
> basis which allows libraries and applications to increase the fault
> tolerance capabilities by supporting additional types of failures, and to
> build other desired strategies and consistency models to tolerate faults.
> George Bosilca
> Research Assistant Professor
> Innovative Computing Laboratory
> Department of Electrical Engineering and Computer Science
> University of Tennessee, Knoxville
> http://web.eecs.utk.edu/~bosilca/
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory