[Mpi3-ft] An alternative FT proposal

Wesley Bland wbland at eecs.utk.edu
Fri Feb 10 18:04:52 CST 2012


See comments inline:


On Friday, February 10, 2012 at 4:10 PM, Josh Hursey wrote:

> I think that the proposal sketched in this email represents a good step towards a fault tolerant MPI standard. There are some interfaces and semantics that we want/need, but, as previously discussed on the list, we can build them on top (at likely considerable expense). Then sometime later we might push to standardize them for the sake of programmability and efficiency. But this gives us something to start with, and points in the right direction.
>  
>  
> I have a couple questions that I was hoping you all could clarify for me, and a few general comments.
> - The wording "...must not return an error" tripped me up a few times. I know what you are trying to say, but we need to find a cleaner way. "Does not return an error related to process faults" or something.

Noted. We’ll fix it.
>  
> - I am not a fan of the error name MPI_ERR_FAILED. It seems overly general. Since the error is specific to process failure, maybe consider MPI_ERR_PROC_FAILED.
Agreed.  
>  
> - Should MPI_ERR_INVALIDATED be specific to communicators (e.g., MPI_ERR_COMM_INVALIDATED), or can it be generally reused for other communication objects (e.g., windows)? I mention this partially because the description of this error class ties it exclusively to communicators, and I did not know if that was intentional.
This wasn’t necessarily intentional. We wrote the first draft only from the standpoint of intracommunicators. As the rest gets added, we’ll have to go back and edit a few things. This would be one of them.  
>  
> - If all communicators use MPI_ERRORS_RETURN and one process calls MPI_Abort() on a sub-communicator, what happens? Are you taking the semantics from the RTS proposal, or leaving it ambiguous?
>   - The ability to confine the abort or MPI_ERRORS_ARE_FATAL semantics to just those processes in the communicator covered a number of use cases for ensemble-type applications. I would like to see those semantics preserved in this proposal.

This seems like more of a clarification of the current standard rather than a new part of the FT chapter. While we heartily agree with the RTS proposal here, this should probably be in its own ticket.  
>  
> - The section on MPI_ANY_SOURCE was a little unclear to me.
>   - If I have a blocking recv on ANY_SOURCE and a failure emerges while waiting, does that return MPI_ERR_FAILED or MPI_ERR_PENDING? If the latter, then what is the handle to the message? I think this was just missed in the description.

We intended this to return MPI_ERR_PENDING, because the communication (if not the request) is still pending (to stay in sync with p. 4 line 105). If the communication was matched but the process failed during the data transfer, MPI_ERR_FAILED would be returned. However, the naming here can be confusing, so we will clarify it.
>   - If I have a nonblocking recv on ANY_SOURCE and a failure emerges while waiting, then I am returned MPI_ERR_PENDING. If I wait on that request again, do I keep getting MPI_ERR_PENDING until I call something like MPI_Comm_failure_ack()?

Correct.  
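
Roughly, the usage we have in mind looks like the sketch below. It only uses the proposed functions (MPI_Comm_failure_ack and the MPI_ERR_PENDING convention for wildcard requests); the helper name and its arguments are made up for illustration.

    #include <mpi.h>

    /* Minimal sketch of the semantics above, using the proposed interface. */
    int wait_wildcard_recv(MPI_Comm comm, void *buf, int count,
                           MPI_Datatype type, int tag)
    {
        MPI_Request req;
        MPI_Status  status;
        int rc, eclass;

        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
        MPI_Irecv(buf, count, type, MPI_ANY_SOURCE, tag, comm, &req);

        for (;;) {
            rc = MPI_Wait(&req, &status);
            if (rc == MPI_SUCCESS)
                return MPI_SUCCESS;     /* the receive was matched and completed */
            MPI_Error_class(rc, &eclass);
            if (eclass != MPI_ERR_PENDING)
                return rc;              /* e.g. the communicator was invalidated */
            /* A known failure prevents this wildcard receive from completing
             * for now. Acknowledge the failures; the request stays posted and
             * can still be matched by a surviving sender. */
            MPI_Comm_failure_ack(comm);
        }
    }
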
>  
> - In the advice to users on page 2 lines 58-60, there is a comment about non-rooted collectives that is unclear as worded. I had to read it a few times to understand both the scenario and the semantic implications. It seems that the scenario is that the failed process never joined the non-rooted collective, so all other processes are dependent upon its participation. Those processes blocking in the collective will eventually find out that a process failure is blocking the collective, and return an error.
Correct. Better wording is welcome here and generally throughout.  
>   - It would also be good to point out that if the process failed inside the collective then, even if it was a non-rooted collective, you could get some surviving processes returning MPI_SUCCESS and others returning MPI_ERR_FAILED, since the outcome depends on when the process failed in the underlying protocol.
>   - There might be wording in the RTS proposal that you can copy over about this scenario.

We can clear that up.  
>  
> - Communicator Creation (advice to user on page 2 lines 62-68)
>   - Minor point, but some of this needs to be normative text since it clarifies how these operations are expected to behave.
>   - Since the object can be partially created (created somewhere, but not everywhere, if a failure emerges), is there any protection from the MPI library if the user decides to use the communicator for point-to-point operations? Maybe between two peers that were both returned valid communicators (even though not everyone else was)?

Yes. That is valid until someone calls MPI_Comm_invalidate.  
>   - You will need to add some text to fix MPI_Comm_free, which is a collective over the input communicator. So it needs something that allows the user to 'free' the partially valid communicator object.  

We do have a comment in our local doc, but we weren’t editing the full specification at the time, so we didn’t include that in our proposal.  
>   - The example with barrier is correct, but slightly misleading to the casual reader, since 'success' from the barrier is a local state and does not tell them that the other alive processes that called the barrier also received 'success' from the same operation (due to a failure emerging during the collective). So they locally know that the communicator was created, but a peer that received MPI_ERR_FAILED from the operation does not. Saying that if you receive success locally from the barrier you can use the communicator normally is therefore slightly misleading. I suppose that if the barrier fails anywhere then that process will call MPI_Comm_invalidate, and the whole thing falls over. But if they only ever intend to use point-to-point operations and instead call MPI_Comm_failure_ack, what is the expected behavior, or is that erroneous?

I hope the casual reader stays out of the standard, but just in case: as in the previous question, that would be valid until they decide they need collective operations or someone calls MPI_Comm_invalidate. Because you already received MPI_SUCCESS from the communicator creation function, it is valid for you to call FAILURE_ACK after something returns a failure, and GET_ACKED to figure out who is gone.  
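
For the record, the pattern we have in mind is sketched below, assuming the proposed MPI_Comm_invalidate. The helper name is invented; the point is only that the barrier tells you the communicator exists locally, and invalidation is what eventually propagates the bad news to peers that saw MPI_SUCCESS.

    #include <mpi.h>

    /* Sketch only: duplicate a communicator and use a barrier as the check
     * that the duplicate exists everywhere. A local MPI_SUCCESS from the
     * barrier does not guarantee that every peer also succeeded; a peer that
     * saw an error is expected to invalidate, so later calls on the duplicate
     * would return MPI_ERR_INVALIDATED. */
    int create_checked_dup(MPI_Comm comm, MPI_Comm *newcomm)
    {
        int rc;

        rc = MPI_Comm_dup(comm, newcomm);
        if (rc != MPI_SUCCESS)
            return rc;                     /* creation failed locally */

        rc = MPI_Barrier(*newcomm);
        if (rc != MPI_SUCCESS) {
            /* A failure surfaced during the barrier: invalidate so that peers
             * that locally saw MPI_SUCCESS eventually learn the duplicate is
             * unusable, then free the local handle. */
            MPI_Comm_invalidate(*newcomm); /* proposed function */
            MPI_Comm_free(newcomm);
            return rc;
        }
        return MPI_SUCCESS;                /* usable, at least locally */
    }
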
>  
> - Thinking about MPI_Comm_create_group, which is collective over a subgroup of comm: would it be valid to call that operation even if the input communicator contained failed processes not in the collective subgroup, or was invalidated by a peer?  
>   - The authors of this operation cite fault tolerance applications as beneficiaries of this operation, I am wondering if we can preserve that assertion in this model.

That might be an interesting thing to allow, at least in the case where the communicator was not invalidated. We can mention something about that and add it as the specification is filled out.  
>  
> - Once a communicator is invalidated you do not state whether the outstanding messages are dropped, completed in error, or completed successfully (if already matched).  
>   - Suppose I have a nonblocking send request that completed inside MPI (so it is marked as completed successfully, but I have not waited on it yet), and a peer invalidates the communicator. I call a different MPI operation and it returns MPI_ERR_INVALIDATED. If I then call wait on that outstanding request, do I get MPI_SUCCESS or MPI_ERR_INVALIDATED?

That is up to the implementation (advice to implementors, p. 2 line 31). Once the library has reported that the communicator is invalid, it can never return anything else (except in the cases specifically mentioned).  
>   - If I call MPI_Comm_free on an invalidated communicator, that is a collective operation and should be exempt from the rule that all non-local MPI calls return MPI_ERR_INVALIDATED.

Agreed.  
>   - If I call MPI_Comm_free and there are outstanding requests, there is some troubling language in the standard about what MPI -should- do. The standard says that if a request is outstanding and completes in error, but the original communicator has been freed, then MPI_ERRORS_ARE_FATAL is enacted. We might want to think about how we can get around this so users can easily 'throw away' communicators that they do not want any longer.

It’s the user’s responsibility to ensure that there are no pending messages. If the communicator is already invalid, the requests should have been cleaned up already, as there is never a case where they could have been completed. If the communicator is not invalid but there are known failed processes inside, it is up to the user to clean up their own requests before calling free.  
>  
> - If you call MPI_Comm_shrink(MPI_COMM_WORLD) when MCW has an outstanding failure, what are the resulting semantics?
There shouldn’t be anything special here about MCW. As long as you invalidated it first, you will get a new, smaller communicator as expected, every time you do it. This doesn’t replace MCW.  
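
A minimal sketch of that sequence, using the proposed MPI_Comm_invalidate and MPI_Comm_shrink (the helper name is ours):

    #include <mpi.h>

    /* Sketch: after a failure is reported on 'comm' (which may be
     * MPI_COMM_WORLD), build a smaller communicator containing only the
     * surviving processes. */
    int shrink_after_failure(MPI_Comm comm, MPI_Comm *smaller)
    {
        /* Every surviving process must stop trusting 'comm'; invalidation by
         * any one process eventually propagates to all of them. */
        MPI_Comm_invalidate(comm);

        /* Collectively derive the new communicator. 'comm' itself is left
         * unchanged; shrinking MPI_COMM_WORLD does not replace MPI_COMM_WORLD. */
        return MPI_Comm_shrink(comm, smaller);
    }
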
>  
> - Do all processes need to call MPI_Comm_invalidate before they call MPI_Comm_shrink? Or if they received the error MPI_ERR_INVALIDATED, is the invalidation implied?  
>   - It is unclear from the sentence on page 3 lines 90-91 whether the user needs to use MPI_Comm_invalidate to 'invalidate' the communicator or whether a peer could have done it for them.

When any process calls MPI_COMM_INVALIDATE, the communicator becomes invalid (eventually) for all processes (see p. 3 line 79). If it hasn’t yet been invalidated for you, then you need to do it yourself. We’ll clear that up in the doc as follows:

A communicator becomes invalid as soon as:
  - MPI_COMM_INVALIDATE is locally called on it
  - Or any MPI function returned MPI_ERR_INVALIDATED (or such error field was set in the status pertaining to a request on this communicator)  
>  
> - Does MPI_Comm_size() return the number of members of the group regardless of failed state, or just the number of alive processes?
It returns the number of processes in the communicator, regardless of their state. So it would include the failed processes. The user can easily calculate the number of known alive processes using the group returned by GET_ACKED.  
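
For example, something like the following sketch, using the proposed FAILURE_ACK/GET_ACKED functions (the helper name is ours):

    #include <mpi.h>

    /* Sketch: MPI_Comm_size counts every member, failed or not; the locally
     * known number of alive processes can be derived from the acknowledged
     * failed group. */
    int known_alive(MPI_Comm comm)
    {
        MPI_Group failed;
        int size, nfailed;

        MPI_Comm_size(comm, &size);        /* all members, including failed ones */
        MPI_Comm_failure_ack(comm);        /* snapshot the currently known failures */
        MPI_Comm_failure_get_acked(comm, &failed);
        MPI_Group_size(failed, &nfailed);  /* locally known failed members */
        MPI_Group_free(&failed);

        return size - nfailed;             /* a local view, not a global agreement */
    }
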
>  
> - MPI_Comm_failure_ack() takes a snapshot of the set of locally known failed processes on the communicator which can be accessed by MPI_Comm_failure_get_acked(). This set of failed processes is "stored" per communicator. When we proposed something similar, we received push back about the amount of memory per-communicator this would require. I know we talked about how to mitigate this briefly in the past, but can you talk to how you would expect an MPI implementation to efficiently manage this state?
This can be done with a memory requirement of O(1) per communicator.  
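
One way to get there (an assumption about a possible implementation, not something the proposal mandates) is to keep a single global, append-only list of detected failures and have each communicator store only how many entries of that list it has acknowledged:

    #include <stddef.h>

    /* Implementation-side sketch: per-communicator state is a single counter. */

    struct global_failures {
        int    *failed_ranks;   /* global ranks of processes detected as failed */
        size_t  count;          /* grows monotonically as failures are detected */
    };

    struct comm_ft_state {
        size_t acked_count;     /* value of global_failures.count at the last ack */
    };

    /* MPI_Comm_failure_ack: remember how much of the global list has been seen. */
    static void ack(struct comm_ft_state *c, const struct global_failures *g)
    {
        c->acked_count = g->count;
    }

    /* MPI_Comm_failure_get_acked would then return the first 'acked_count'
     * entries of the global list, translated and filtered to the communicator's
     * group on the fly (translation omitted in this sketch). */
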
>  
> - MPI_Comm_failure_ack() allows the user to acknowledge the locally known set of failed processes. This allows previously posted nonblocking ANY_SOURCE receive operations (that may have already returned MPI_ERR_PENDING) to succeed. (Right?)
Yes.  
>   - If you only use directed point-to-point communication do you ever need to call this operation? I guess only if you need a list of locally known failures, but the user can track that themselves.

While that’s true, if you use FAILURE_ACK and GET_ACKED you can learn more than you would from directed point-to-point alone. For example, you might learn about other failures that happened without having to keep communications pending with all peers.  
>   - As far as message completion goes, this operation only has an effect on ANY_SOURCE operations, right? Collectives remain disabled, and directed p2p is undisturbed.

Correct (p. 4 line 105).  
>   - So I can have a directed nonblocking p2p operation that spans the MPI_Comm_failure_ack() call?

Correct.  
>  
> - Thinking forward a bit, have you thought about how we might add process recovery to this model? Maybe MPI_Comm_grow()?
The method proposed by Gropp and Lusk (COMM_SPAWN followed by INTERCOMM_MERGE) can be used, in addition to the provided COMM_SHRINK, to rebuild communicators.  
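
A sketch of that recovery pattern is below. The command name and the size bookkeeping are placeholders; the respawned processes would call MPI_Comm_get_parent and the matching MPI_Intercomm_merge on their side.

    #include <mpi.h>

    /* Sketch of the Gropp/Lusk-style rebuild: respawn the missing processes
     * and merge them with the survivors into a new intracommunicator. */
    int grow_back(MPI_Comm shrunk, int original_size, MPI_Comm *repaired)
    {
        MPI_Comm inter;
        int nalive, nspawn;

        MPI_Comm_size(shrunk, &nalive);
        nspawn = original_size - nalive;

        /* Survivors collectively spawn replacements for the failed processes. */
        MPI_Comm_spawn("./a.out", MPI_ARGV_NULL, nspawn, MPI_INFO_NULL,
                       0 /* root */, shrunk, &inter, MPI_ERRCODES_IGNORE);

        /* Merge survivors and replacements into a single intracommunicator. */
        MPI_Intercomm_merge(inter, 0 /* order survivors first */, repaired);
        MPI_Comm_free(&inter);
        return MPI_SUCCESS;
    }
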
>  
>  
> Thanks,
> Josh

Thanks for the comments. Keep them coming.

Wesley  
>  
> On Mon, Feb 6, 2012 at 7:31 PM, George Bosilca <bosilca at eecs.utk.edu (mailto:bosilca at eecs.utk.edu)> wrote:
> > FT working group,
> >  
> > As announced, we have been working on an alternative FT proposal. The leading idea of this proposal is to relieve the burden of consistency from the MPI implementations, while providing the means for the user to regain control. As suggested in the mailing list a few days ago, this proposal tries to minimize the semantic changes and additional functions. We believe the proposed set of functions is minimal, yet sufficient to implement stronger consistency models, such as the current working group proposal.
> >  
> > The proposal is not yet complete; we are still working on intercommunicators, RMA, and file operations, but we wanted to submit it for feedback and discussion on the call this week.
> >  
> >  The UTK team.
> >  
> > PS: The abstract and the proposal are attached below.
> >  
> > Abstract:
> > In this document we propose a flexible approach providing fail-stop process fault tolerance by allowing the application to react to failures while maintaining a minimal execution path in failure-free executions. Our proposal focuses on returning control to the application by avoiding deadlocks due to failures within the MPI library. No implicit, asynchronous error notification is required. Instead, functions are provided to allow processes to invalidate any communication object, thus preventing any process from waiting indefinitely on calls involving the invalidated objects. We consider the proposed set of functions to constitute a minimal basis which allows libraries and applications to increase the fault tolerance capabilities by supporting additional types of failures, and to build other desired strategies and consistency models to tolerate faults.
> >  
> > George Bosilca
> > Research Assistant Professor
> > Innovative Computing Laboratory
> > Department of Electrical Engineering and Computer Science
> > University of Tennessee, Knoxville
> > http://web.eecs.utk.edu/~bosilca/
> >  
> >  
>  
> --  
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey