[Mpi3-ft] Distinguishing errors from failures
Aurélien Bouteiller
bouteill at icl.utk.edu
Wed Jul 17 13:22:10 CDT 2013
On 17 Jul 2013, at 12:16, Jim Dinan <james.dinan at gmail.com> wrote:
> Hi George and Aurélien,
>
> Thanks for the detailed responses. I looked at the paper, and it indicates that failure detection is needed when ANY_SOURCE is used, and I assume also when passive target RMA is used, since a process can fail while holding the lock. Won't this have an impact on performance?
>
Depending on the implementation, it may. But in practice it can be mitigated to pretty much nothing:
For ANY_SOURCE, the code can go directly into optimized mode, without failure detection, for "some time". After that implementation-dependent period has passed and the implementation suspects something is not normal, it can take actions to detect potential failures (interrogate a failure detector, poke around to see who is still responding, etc.). The result is that failure-free latency is not impacted when it matters; detection latency is added only to messages that already have long latency (presumably due to load imbalance in the user code). Bandwidth is never an issue: once a message is matched, it is not "any source" anymore.
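This two-phase strategy can be sketched in plain Python. Everything below is hypothetical stand-in logic for an implementation's internal progress loop, not any real MPI library code; `try_match` and `probe_failure_detector` are invented callback names.

```python
import time

# Grace period before the implementation starts suspecting a failure.
# Hypothetical, implementation-tuned value.
GRACE_PERIOD = 1.0  # seconds

def recv_any_source(try_match, probe_failure_detector):
    """Two-phase ANY_SOURCE receive: fast path first, detection only when late.

    try_match() polls the matching engine; returns a message or None.
    probe_failure_detector() returns the set of ranks currently suspected dead.
    Both are hypothetical stand-ins for implementation internals.
    """
    deadline = time.monotonic() + GRACE_PERIOD
    # Phase 1: optimized path, zero failure-detection overhead.
    while time.monotonic() < deadline:
        msg = try_match()
        if msg is not None:
            return msg  # matched: no longer "any source", normal latency
    # Phase 2: the wait is abnormally long; interrogate the failure detector
    # between matching attempts.
    while True:
        suspects = probe_failure_detector()
        if suspects:
            raise RuntimeError(f"potential sender failure: {sorted(suspects)}")
        msg = try_match()
        if msg is not None:
            return msg
```

Only receives that are already slow ever pay the detector cost; a message that arrives within the grace period takes exactly the phase-1 path.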
For lock, we discussed together how the typical implementation can work in practice without adding much in terms of failure detection. Another implementation may be more problematic, but an approach similar to the ANY_SOURCE one could be taken.
If one wants to favor raw performance, it is always possible to do so at the expense of failure detection delay.
Also note that in the paper, we had the (optional) failure detection framework enabled at all times during the measurements, and performance is nonetheless unchanged.
Also, many HPC fabrics embed advanced failure detection and reporting. We do not mandate that these be present, but they would provide both fairly accurate detection and good response times.
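The revoke/ack/agree/shrink sequence George describes below can be sketched as a pure-Python simulation of the control flow. The method names mirror the proposal's MPIX_Comm_* routines, but every class and method here is a hypothetical stand-in, not real MPI:

```python
class Comm:
    """Toy communicator tracking alive/failed ranks and revocation state."""

    def __init__(self, ranks, failed=()):
        self.ranks = set(ranks)
        self.failed = set(failed)
        self.revoked = False

    def revoke(self):
        # Stand-in for MPIX_Comm_revoke: once revoked, any further
        # communication on this communicator fails everywhere, which
        # interrupts peers blocked in matching operations.
        self.revoked = True

    def failure_ack(self):
        # Stand-in for MPIX_Comm_failure_ack: locally acknowledge the
        # failures observed so far; returns the acknowledged set.
        return frozenset(self.failed)

    def agree(self, survivor_flags):
        # Stand-in for MPIX_Comm_agree: fault-tolerant consensus, modeled
        # here as a logical AND over the surviving ranks' contributions.
        return all(survivor_flags)

    def shrink(self):
        # Stand-in for MPIX_Comm_shrink: a fresh, fully workable
        # communicator containing only the surviving ranks.
        return Comm(self.ranks - self.failed)


# Typical application-driven recovery after rank 2 is reported dead:
world = Comm(range(4), failed={2})
world.revoke()                                    # stop all pending traffic
acked = world.failure_ack()                       # locally acknowledge {2}
survivors = world.ranks - world.failed
consensus = world.agree([True] * len(survivors))  # all survivors agree
new_world = world.shrink()                        # sane communicator {0,1,3}
```

The point the sketch makes is that every step is an explicit application call; nothing happens implicitly in the background on the failure-free path.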
Aurelien
> ~Jim.
>
>
> On Tue, Jul 16, 2013 at 7:27 PM, George Bosilca <bosilca at icl.utk.edu> wrote:
> In addition to Aurelien's answer, there is something else that I think should be stressed in this context, something I feel the WG failed to make clear enough in its interactions with the forum.
>
> There is one single strong mandate on an MPI implementation in the current FT proposal: a revoked communicator is improper for further communications. That is the full extent of what the current proposal imposes on MPI implementations, both in terms of capabilities and overheads. And still, complaints were raised about it …
>
> Why is this so? Because there is no mandatory error detection and propagation: when the MPI library detects a failure, only local dispatch of this information is necessary. Of course, one would expect a high-quality MPI implementation to do its best to ensure a certain level of quality of service here, a significantly lesser effort than ensuring some other "high quality" type of capabilities (namely progress and fairness). The paper mentioned by Aurélien proves that, at least under certain assumptions, this can be achieved with minimal/unnoticeable overhead.
>
> From there, the application itself is responsible for making good use of the functions provided by the FT proposal to handle the failure in a way meaningful to the application (again, there is no imposed FT model). MPI_Comm_revoke provides a communication channel through which, at the request of the application, knowledge of a process failure is propagated to other MPI processes. The ACK function locally acknowledges the failure. The agreement reaches a consensus for building more complex FT methods, and finally shrink takes you back to a communicator that has all the properties of a sane/workable MPI communicator.
>
> George.
>
> On Jul 16, 2013, at 23:41 , Aurélien Bouteiller <bouteill at icl.utk.edu> wrote:
>
> > Jim,
> >
> > There is no specific cost to enabling FT at all times, so this might not be such a crucial issue after all, since the initial assumption is not true.
> >
> > You can point to the following paper, which investigates the cost sustained by applications and stressful micro-benchmarks, to support this claim with solid evidence.
> >
> > Bland, W., Bouteiller, A., Herault, T., Hursey, J., Bosilca, G., Dongarra, J. J., "An Evaluation of User-Level Failure Mitigation Support in MPI," Computing, Springer, 2013, ISSN 0010-485X, http://dx.doi.org/10.1007/s00607-013-0331-3
> >
> > The intuitive explanation for the no-impact result is that the spec does not change the behavior of any existing MPI function. If a function has local completion, it retains local completion even when it reports a failure. It merely adds a new class of errors to let users know that what killed them is a process failure rather than, say, a network retransmission error.
> >
> > The FT additions kick in only -after- a failure has been reported. Nothing is implicit. The MPI implementation is not expected to "fix" things on its own in the background, and normal MPI functions are not overloaded with failure-related activities. All recovery actions are triggered by explicit calls from the user code, which prevents any recovery activity from "spilling" into the failure-free path and causing performance problems.
> >
> > All this means that the codebase (see the prototype implementation) stays basically unchanged, with no modifications to the transport layer, no modifications to the collective framework, etc.
> >
> > ~~~~~~
> >
> > As you noted, it can also be turned on/off by mpirun switches if deemed necessary. It is standard-compliant (and this is deliberate) to have all FT recovery routines map to no-ops (revoke = no-op) or non-FT equivalents (agree = allreduce) when no fault tolerance is needed, and even FT codes will run fine on such an MPI library (as long as there are no failures, of course). If some implementation really wants to "optimize out" anything FT-related (or even load it in only if required), this would be our recommendation. But again, the performance gain is expected to be minor, if measurable at all.
> >
> >
> > Aurelien
> >
> >
> > On 16 Jul 2013, at 16:23, Jim Dinan <james.dinan at gmail.com> wrote:
> >
> >> Hi FT WG,
> >>
> >> I am doing my best to socialize the FT proposal at Intel and gathered a piece of feedback to bring back to the WG.
> >>
> >> There was a concern that any time the user registers an error handler, fault tolerance could be "switched on" because MPI_Comm_set_errhandler() does not distinguish between error classes. The assumption was that, when switched on, there would be space/time costs associated with fault tolerance. How does the current proposal determine when fault tolerance should be enabled?
> >>
> >> One suggested mechanism was to add a function, MPI_Comm_set_faulthandler(), that allows the programmer to distinguish between errors and failures. This would allow the runtime to determine when fault tolerance is desired. I think the way this is handled currently is to rely on the implementation switching fault tolerance on/off when the job is launched.
> >>
> >> ~Jim.
> >> _______________________________________________
> >> mpi3-ft mailing list
> >> mpi3-ft at lists.mpi-forum.org
> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >
> > --
> > * Dr. Aurélien Bouteiller
> > * Researcher at Innovative Computing Laboratory
> > * University of Tennessee
> > * 1122 Volunteer Boulevard, suite 309b
> > * Knoxville, TN 37996
> > * 865 974 9375
>
>
--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375