[Mpi3-ft] General Network Channel Failure

Bronevetsky, Greg bronevetsky1 at llnl.gov
Thu Jun 9 09:56:35 CDT 2011


I like the idea of having an abstract model of failures that can approximate changes in system functionality due to failures. However, before we go too far with this, I think we should consider the type of model we want to build. One option is a system model whose basic elements are nodes, network links, and other hardware components, and which identifies the points in time when they stop functioning. The other option is an MPI-centric model that describes the status of ranks and the point-to-point communication between them, as well as communicators and the collective communication over them. So in the first type of model we can talk about network resource exhaustion; in the latter, about an intermittent inability to send messages over some or all communicators.

I think the MPI-centric model is the better option, since it talks exclusively about entities that exist in MPI and ignores the physical phenomena that cause a given type of degradation in functionality.
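
To make the contrast concrete, here is a rough illustration of what the basic elements of each model might look like as C types. Neither type comes from MPI or from any proposal; they exist only to make the two options tangible.

    #include <mpi.h>

    /* Option 1: system-centric -- hardware elements with an up/down
     * state and a failure time. (Illustration only.) */
    struct hw_element {
        enum { HW_NODE, HW_LINK, HW_NIC } kind;
        int    id;
        int    functioning;  /* 0 once the element has stopped working */
        double failed_at;    /* time at which it stopped, if it did */
    };

    /* Option 2: MPI-centric -- status attached to MPI entities.
     * (Illustration only.) */
    struct mpi_entity_status {
        MPI_Comm comm;
        int      rank;            /* a rank in comm */
        int      can_send;        /* point-to-point to rank possible? */
        int      collectives_ok;  /* collectives over comm possible? */
    };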

The other question we need to discuss is the set of problems we want to represent. We obviously care about fail-stop failures, but we have not been talking about resource exhaustion. Do we want to add error classes for transient errors, and if so, what about performance slowdowns?
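
As a strawman for what such error classes would let an application do, here is a minimal C sketch. MPI_ERR_RANK_FAIL_STOP is the class proposed in this working group and MPI_ERR_TRANSIENT is purely hypothetical; neither exists in the MPI standard, so the #defines below are placeholders.

    #include <mpi.h>

    /* Placeholder values for classes that are only proposals or
     * hypotheticals, not MPI standard constants. */
    #define MPI_ERR_RANK_FAIL_STOP (MPI_ERR_LASTCODE + 1)
    #define MPI_ERR_TRANSIENT      (MPI_ERR_LASTCODE + 2)

    /* React to a return code from an MPI call; assumes the
     * communicator's error handler is MPI_ERRORS_RETURN. */
    void handle_rc(int rc)
    {
        int eclass;
        if (rc == MPI_SUCCESS)
            return;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPI_ERR_RANK_FAIL_STOP) {
            /* Permanent: the peer is gone; switch to recovery
             * semantics for the affected communicators. */
        } else if (eclass == MPI_ERR_TRANSIENT) {
            /* Temporary: retrying later may succeed. */
        } else {
            /* Severity unknown today, e.g. MPI_ERR_OTHER. */
        }
    }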

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com


> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
> Sent: Wednesday, June 08, 2011 11:36 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] General Network Channel Failure
> 
> It was mentioned in the conversation today that MPI_ERR_RANK_FAIL_STOP
> may not be the first error returned by an MPI call. In particular, the MPI
> call may return an error symptomatic of a fail-stop process failure (e.g., a
> failed network link - currently MPI_ERR_OTHER) before eventually diagnosing
> the event as a process failure. This 'space between' MPI_SUCCESS behavior
> and MPI_ERR_RANK_FAIL_STOP behavior is not currently defined, and probably
> should be, so that the application can cleanly move from the set of
> semantics for one error class to another.
> 
> The suggestion was to create a new general network error class (e.g.,
> MPI_ERR_COMMUNICATION or MPI_ERR_NETWORK - MPI_ERR_COMM is
> taken) that can be returned when the operation cannot complete due to
> network issues (which might be later diagnosed as process failure and
> escalated to the MPI_ERR_RANK_FAIL_STOP semantics). You could also think
> about this error being used for network resource exhaustion (something
> that Tony mentioned during the last MPI Forum meeting), in which case
> retrying at a later time, or taking some other action before trying
> again, would be useful/expected.
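>
> As a rough sketch of that retry pattern (the #define stands in for the
> proposed MPI_ERR_COMMUNICATION class, which does not exist in the
> standard, and the communicator is assumed to use MPI_ERRORS_RETURN):
>
>     #include <mpi.h>
>     #include <unistd.h>
>
>     #define MPI_ERR_COMMUNICATION (MPI_ERR_LASTCODE + 1)
>
>     /* Retry a send across transient network errors; stop once the
>      * error is escalated to something else (e.g., fail-stop process
>      * failure) or the retry budget is exhausted. */
>     int send_with_retry(const void *buf, int n, MPI_Datatype type,
>                         int peer, int tag, MPI_Comm comm)
>     {
>         int rc, eclass, tries = 0;
>         do {
>             rc = MPI_Send(buf, n, type, peer, tag, comm);
>             if (rc == MPI_SUCCESS)
>                 return rc;
>             MPI_Error_class(rc, &eclass);
>             if (eclass != MPI_ERR_COMMUNICATION)
>                 break;           /* e.g., escalated to fail-stop */
>             sleep(1u << tries);  /* back off before trying again */
>         } while (++tries < 5);
>         return rc;               /* permanent or unrecognized error */
>     }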
> 
> There are some issues with matching, and implications for collective
> operations. If the network error is sticky/permanent, then once the error
> is returned it will always be returned, or escalated to a fail-stop process
> failure (or, more generally, to a 'higher/more severe/more detailed' error
> class). A
> recovery proposal (similar to what we are developing for process failure)
> would allow the application to 'recover' the channel and continue
> communicating on it.
> 
> 
> The feeling was that this should be expanded into a full proposal, separate
> from the Run-Through Stabilization (RTS) proposal, so that we can continue
> with the RTS proposal and bring this forward when it is ready.
> 
> 
> What do folks think about this idea?
> 
> -- Josh
> 
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey



