[mpiwg-ft] Ticket #324 Reading Slides

Bland, Wesley wesley.bland at intel.com
Thu May 21 13:07:16 CDT 2015


The abort case is pretty simple. I’ve implemented it in MPICH with just an hour or so’s work. If a process calls abort on a communicator, it sends out a message to the other processes in its communicator first to tell them to abort as well. This can also be done by the process manager if it supports it. At the moment, PMI only handles full job aborts, but that’s something that might come in PMI-3.
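
As a rough illustration of the scoping described above, here is a toy model in plain C. It makes no MPI calls and is not the MPICH implementation. Every name in it (toy_comm, toy_job, toy_abort, toy_alive_count) is hypothetical, and a "communicator" is assumed to be nothing more than a list of ranks. The point is only that an abort on a communicator stops that communicator's members and leaves the rest of the job running.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROCS 16

/* Toy stand-in for a communicator: just a list of member ranks.
   This is NOT an MPI type; all names here are hypothetical. */
typedef struct {
    int members[MAX_PROCS];
    size_t count;
} toy_comm;

/* alive[r] is true while simulated process r is still running. */
typedef struct {
    bool alive[MAX_PROCS];
    size_t nprocs;
} toy_job;

/* Start a simulated job with nprocs running processes (capped at MAX_PROCS). */
void toy_job_init(toy_job *job, size_t nprocs) {
    job->nprocs = nprocs > MAX_PROCS ? MAX_PROCS : nprocs;
    for (size_t r = 0; r < MAX_PROCS; r++) {
        job->alive[r] = r < job->nprocs;
    }
}

/* Scoped abort: every live member of comm is told to stop.
   Processes outside comm are untouched. Returns how many were stopped. */
size_t toy_abort(toy_job *job, const toy_comm *comm) {
    size_t stopped = 0;
    for (size_t i = 0; i < comm->count; i++) {
        int r = comm->members[i];
        if (r >= 0 && (size_t)r < job->nprocs && job->alive[r]) {
            job->alive[r] = false;  /* models delivery of the abort message */
            stopped++;
        }
    }
    return stopped;
}

/* Count the processes that are still running. */
size_t toy_alive_count(const toy_job *job) {
    size_t n = 0;
    for (size_t r = 0; r < job->nprocs; r++) {
        if (job->alive[r]) {
            n++;
        }
    }
    return n;
}
```

For example, in a six-process job with a worker communicator holding ranks 2, 3 and 4, calling toy_abort on the workers stops those three ranks, and ranks 0, 1 and 5 keep running. A full job abort corresponds to passing a communicator that lists every rank.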

The failure case is as implementation-specific as it always was. If the implementation detects a failure, it’s supposed to raise an error on the affected communicators, which then invoke their error handlers (which may include aborting as above). If the implementation can’t or doesn’t want to do that, it can still just abort everyone like before.

The use case that this enables in my mind is a more dynamic processing model where processes are spawned/joined/connected to do a specific task, then shut down while the rest of the application continues. There are lots of communicators between the master(s) and workers, and fine-grained failure detection isn’t really necessary because you will just re-spawn a new set of workers; it’s up to the launcher to avoid failures. For the inter-communicators, you’d set MPI_ERRORS_RETURN and, when you see an error, just drop the communicator. The remote processes would have their own communicator that they could use to abort if necessary.

It still doesn’t define whether or not communication is possible. That’s a much bigger issue (obviously) and one that we’re trying to tackle with ULFM, but for this sort of very simple fault tolerance, the implementation may be able to continue communicating without an issue as long as you are going to be destroying all of the communicators with the failed process anyway. You wouldn’t have to worry about “fixing” MPI_COMM_WORLD since it wouldn’t have a failure.

If you want to think about it as part of the overall timeline of improving FT in the MPI Standard, this really is just a step 0 to ensure that MPI_Abort works in the correct way. It doesn’t solve everything, but it does fix a small thing that’s easy to carve off and doesn’t impact other parts of the standard too much.

Thanks,
Wesley

On May 20, 2015, at 5:53 PM, George Bosilca <bosilca at icl.utk.edu> wrote:

Wesley,

I understand the interest in scoping the abort to a single set of processes. However, without support for detecting that some processes have disappeared (failure or abort) I don't see how you can use this for anything constructive. Thus, I am confused by this proposal as it seems to lack a second part where the processes outside of the MPI_ABORT scope deal with the newly defined behavior.

Can you be a little more explicit about how this scoped MPI_ABORT, as defined in this ticket, will be beneficial to applications?

Thanks,
  George.



On Tue, May 19, 2015 at 12:14 PM, Bland, Wesley <wesley.bland at intel.com> wrote:
I plan to go over this during the con call next week, but if you’d like to get a preview and/or comment early, I have a draft of the slides for the reading for ticket #324. You can view the slides here:

https://docs.google.com/presentation/d/10Sz9aCDezSLH1rss6XYApo_wVqTyHH08G3FUe9NcygE/edit?usp=sharing

If you have any questions/comments, let me know.

Aurelien, I’m not sure of the status of this ticket in Open MPI. Is there a way to keep Open MPI from aborting on any error in the trunk? Is the only way to get this in the ULFM repo?

Thanks,
Wesley
_______________________________________________
mpiwg-ft mailing list
mpiwg-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
