[Mpi3-ft] run through stabilization user-guide

Bronevetsky, Greg bronevetsky1 at llnl.gov
Tue Feb 8 09:18:28 CST 2011


I'll add one more thing to Josh's explanation. A good analogy to the way we approached the FT spec is the design of the user interface for a car. A sufficient interface is a steering wheel, a brake, and a gearshift. But in practice users want to know more about the car to operate it efficiently: the maximum speed, the acceleration, the gearing ratios inside the transmission, the tread pattern on the tires, etc. However, the latter properties are very implementation-specific, with cars of different quality using different technologies to implement the car abstraction. In this case it is best to capture the minimum amount of information that is common to all implementations (wheel, brake, gearshift) and give implementers the freedom to build either high-quality or low-quality implementations, informing the user separately about how well they perform.

If you write an application using the MPI FT interface, you're guaranteed that your application will run on any system. If that system has a good implementation of this interface, failures will be detected quickly and accurately, and your application will thus be able to run efficiently on unreliable hardware. If the system has a poor implementation, your application will frequently waste cycles waiting on failed nodes that have not yet been detected, or will abort mysteriously because of failures that MPI never detected at all; such an implementation is a good fit only for non-mission-critical applications or for those running on reliable hardware. In either case, you can combine the MPI FT specification with the MPI implementer's documentation to run your application efficiently on the hardware of your choice.

Now, it would be nice to have a group outside of the MPI spec that looks at implementations and provides an unbiased guide about their performance and reliability. Unfortunately, such a guide is too low-level to include in the MPI specification.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com 


> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Joshua Hursey
> Sent: Tuesday, February 08, 2011 6:47 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] run through stabilization user-guide
> 
> Toon,
> 
> Thanks for your feedback. I'll fix the first example, thanks for the catch.
> A bit more discussion on the other points below.
> 
> On Feb 8, 2011, at 8:14 AM, Toon Knapen wrote:
> 
> >
> > The spec explicitly doesn't define the meaning of failure because it is a
> very low-level concept. The MPI implementation is responsible for detecting
> failures and defining what qualifies as one. The only guarantee that you
> can rely on is that if the "failure" is bad enough that a given MPI rank
> can't communicate with others, then MPI will have to either abort the
> application completely or eventually report this failure via the FT API.
> >
> >
> > I understand it is really hard to define what exactly a failure is. But
> if the future standard does not describe it, isn't there a risk that apps
> that rely on the FT API will become hard to port from one implementation to
> another?
> >
> > For instance, suppose I have one process constantly sending messages to
> the other one. Suppose the second process is dead: the first process might
> either detect that the second process is down and inform the app, or it
> might not tell the app anything and buffer all messages to be sent. But
> this buffer might grow so big that the first process runs out of memory and
> dies too.
> >
> > So I think it would be useful if the MPI library were given some bounds
> within which it should function, and if it is not able to function within
> these bounds it should raise an error.
> >
> > For instance, a simple limit might be that each node should respond to a
> ping within 1 s (or 1 ms, or ...). If a node cannot be pinged within this
> limit, the node is considered to be dead.
> >
> > Another bound that might be specified and that would serve my example
> above is e.g. the memory bounds within which MPI should function. For
> instance, the app (or user) might decide to allocate 100 MB of (buffer-)
> memory to MPI. If the app fills this buffer and the MPI-lib is not able to
> flush the data sufficiently fast to the other nodes, an error will be
> raised. At that point the app is aware that there is a failure to
> communicate and can take appropriate action: slow down a bit with the
> sending or consider the other node to have failed.
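> 
> The closest thing today's MPI offers to such a bound is the attach/detach
> buffer used by buffered sends: the application decides how much staging
> memory MPI may use, and MPI_Bsend reports an error once that memory is
> exhausted. A minimal sketch (not part of the FT proposal; assumes two
> ranks and MPI_ERRORS_RETURN so errors come back instead of aborting):
> 
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <stdlib.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
> 
>         int rank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>         /* Return errors instead of aborting so the app can react. */
>         MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 
>         /* Hand MPI a fixed 100 MB staging area for buffered sends. */
>         int bufsize = 100 * 1024 * 1024;
>         void *buf = malloc(bufsize);
>         MPI_Buffer_attach(buf, bufsize);
> 
>         if (rank == 0) {
>             int payload = 42;
>             /* MPI_Bsend fails once the attached buffer is exhausted,
>                e.g. because the receiver is not draining messages. */
>             int rc = MPI_Bsend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>             if (rc != MPI_SUCCESS)
>                 fprintf(stderr, "rank 0: buffered send failed, "
>                                 "slowing down or suspecting rank 1\n");
>         } else if (rank == 1) {
>             int payload;
>             MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>         }
> 
>         MPI_Buffer_detach(&buf, &bufsize);
>         free(buf);
>         MPI_Finalize();
>         return 0;
>     }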
> >
> > In the above, it is the MPI lib that always detects the error. However if
> the MPI library does not guarantee what a failure is exactly, I might have
> to detect failures in the app (to be independent of the free interpretation
> of failure). In that case I might e.g. need to always do non-blocking sends
> and recvs (to avoid being blocked until eternity in case of a failure) and
> see if the messages arrive in a timely manner; if not, I consider the other
> process to be dead. In that case, however, I would also need functionality
> to tell the MPI library that a specific (comm, rank) should be considered
> MPI_RANK_STATE_NULL.
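> 
> Roughly, that kind of application-level detection could look like the
> following with plain non-blocking MPI (a sketch only; two ranks assumed,
> and the one-second budget is arbitrary):
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     /* Wait for a request to complete, but give up after 'timeout'
>        seconds. Returns 1 if it completed, 0 if we cancelled it. */
>     static int wait_with_timeout(MPI_Request *req, double timeout)
>     {
>         double start = MPI_Wtime();
>         int done = 0;
>         while (!done && (MPI_Wtime() - start) < timeout)
>             MPI_Test(req, &done, MPI_STATUS_IGNORE);
>         if (!done) {
>             MPI_Cancel(req);
>             MPI_Wait(req, MPI_STATUS_IGNORE); /* finish cancelled req */
>         }
>         return done;
>     }
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>         int msg = rank;
>         MPI_Request req;
>         if (rank == 0) {
>             /* Expect an answer from rank 1 within one second. */
>             MPI_Irecv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
>             if (!wait_with_timeout(&req, 1.0))
>                 fprintf(stderr, "rank 0: treating rank 1 as dead\n");
>         } else if (rank == 1) {
>             MPI_Isend(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
>             MPI_Wait(&req, MPI_STATUS_IGNORE);
>         }
> 
>         MPI_Finalize();
>         return 0;
>     }
> 
> The missing piece is exactly what is asked for above: a way to tell the
> MPI library, after such a local decision, that the peer should be treated
> as gone.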
> >
> > I'm sorry if the above has already been discussed at length and the forum
> has already decided not to define 'failure' (I shouldn't have left the MPI
> scene for the last three years, what was I thinking ;-). Trying to define
> 'failure' might be opening Pandora's box, but I'm looking at this from the
> application point of view. IMHO FT is either just about 'going down
> gracefully' or about 'trying to finish the job'.
> 
> It is important to clarify what we can and cannot specify in the standard.
> So I appreciate your help with explaining how you think the language can be
> more precise or better explained.
> 
> For this proposal, we are focused on graceful degradation of the job.
> Processes will fail(-stop), but the job as a whole is allowed the
> opportunity to continue operating.
> 
> We are defining process failures as fail-stop failures in the standard. So
> we cover processes that crash, and do not continue to participate in the
> parallel environment. How the MPI implementation detects process failure is
> not defined by the standard, except by the properties that the MPI
> implementation must provide to the application. The MPI implementation will
> provide a view of the failure detector that is 'perfect' from the
> perspective of the application (though internally there is a fair amount of
> flexibility on how to provide this guarantee).
> 
> This means that eventually all processes will learn of a process failure,
> and that if the application receives notification of a process failure then
> that process is guaranteed to have fail-stopped. Once a live process is
> notified of a peer's failure, any communication involving the failed
> process will complete with an error, so there is no risk of blocking
> forever on communication with a failed peer. The MPI library should also be
> cleaning up internal buffers, etc., at this point to conserve memory.
> Communication with non-failed processes will complete as normal.
> Collectives are disabled until the application re-enables them with the
> collective MPI_Comm_validate_all().
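> 
> As a rough sketch of what that looks like from the application side
> (standard MPI calls only; the failure-reporting behaviour described above
> is the proposal's semantics, and MPI_Comm_validate_all() appears only in a
> comment because its exact signature is given in the proposal, not here):
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     /* Receive from 'src'; under the proposed semantics an operation
>        involving a failed peer returns an error instead of blocking. */
>     static int recv_from(int src, int *out, MPI_Comm comm)
>     {
>         int rc = MPI_Recv(out, 1, MPI_INT, src, 0, comm,
>                           MPI_STATUS_IGNORE);
>         if (rc != MPI_SUCCESS) {
>             char msg[MPI_MAX_ERROR_STRING];
>             int len;
>             MPI_Error_string(rc, msg, &len);
>             fprintf(stderr, "recv from %d failed: %s\n", src, msg);
>             /* Point-to-point with other live ranks keeps working, but
>                collectives stay disabled until all survivors call the
>                proposal's collective re-enable operation, i.e.
>                MPI_Comm_validate_all() on this communicator. */
>         }
>         return rc;
>     }
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 
>         int rank, val = 0;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         if (rank == 0) {
>             recv_from(1, &val, MPI_COMM_WORLD);
>         } else if (rank == 1) {
>             val = 7;
>             MPI_Send(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
>         }
> 
>         MPI_Finalize();
>         return 0;
>     }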
> 
> Now you could block in the following scenario:
>  - Processes A, B, D are alive, Process C is failed.
>  - Processes are communicating in a ring using point-to-point operations as
> follows: A->B->C->D->A...
>  - Since processes B and D interact directly with C, they will see the
> failure from their point-to-point send/recv operations. But A may block
> waiting on either B or D.
> 
> It is the responsibility of the application to design around this type of
> situation to ensure continued progress, turning an otherwise fault-unaware
> application into a fault-aware one. There are a few ways to do this, but
> the best solution will always be domain specific.
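> 
> For example, for the ring above, one sketch of a fault-aware variant
> (a made-up pattern, not from the proposal text): receive from whichever
> live predecessor forwards the token, and on the sending side skip over a
> successor once a send to it completes in error, as the proposal guarantees
> it eventually will once the failure has been reported:
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 
>         int rank, size, token = 0;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         if (size < 2) { MPI_Finalize(); return 0; }
> 
>         /* Everyone except the originator waits for the token from
>            whichever live predecessor ends up forwarding it. */
>         if (rank != 0)
>             MPI_Recv(&token, 1, MPI_INT, MPI_ANY_SOURCE, 0,
>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         token += rank;
> 
>         /* Hand the token to the nearest live successor: if a send to a
>            failed rank completes in error, try the next rank instead. */
>         for (int hop = 1; hop < size; hop++) {
>             int dest = (rank + hop) % size;
>             if (MPI_Send(&token, 1, MPI_INT, dest, 0, MPI_COMM_WORLD)
>                 == MPI_SUCCESS)
>                 break;
>             fprintf(stderr, "rank %d: skipping failed rank %d\n",
>                     rank, dest);
>         }
> 
>         /* The originator collects the token after one lap. */
>         if (rank == 0) {
>             MPI_Recv(&token, 1, MPI_INT, MPI_ANY_SOURCE, 0,
>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             printf("token after one lap: %d\n", token);
>         }
> 
>         MPI_Finalize();
>         return 0;
>     }
> 
> This only covers a rank that fails before it receives the token; recovering
> a token lost inside a failed rank is exactly the kind of domain-specific
> decision meant above.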
> 
> 
> It is possible that the application detects that a peer process is faulty
> in a way other than fail-stop (e.g., Byzantine). For example, a peer may
> have suffered memory corruption from a soft error and be sending invalid
> data (data that still looks valid from the MPI perspective). A receiving
> process could check the values and determine that the peer is faulty, at
> which point it can either:
>  - Coordinate with the other alive peers to exclude the faulty process, or
>  - Use MPI_Comm_kill() to request that the process be terminated.
> MPI_Comm_kill() is not described in the user's guide, but it is in the main
> proposal. It allows one process to kill another without killing itself
> (which would happen if it used MPI_Abort). Is this a scenario you were
> concerned about?
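> 
> A small sketch of that decision point (the validity check and the data are
> made up, and MPI_Comm_kill() is shown only in a comment since its exact
> signature is in the main proposal):
> 
>     #include <mpi.h>
>     #include <math.h>
>     #include <stdio.h>
> 
>     /* Application-level sanity check: MPI sees a live peer and a valid
>        message, but the values themselves may reveal a corrupted peer.
>        Here the (made-up) contract is that values lie in [0, 1]. */
>     static int looks_corrupted(const double *v, int n)
>     {
>         for (int i = 0; i < n; i++)
>             if (!isfinite(v[i]) || v[i] < 0.0 || v[i] > 1.0)
>                 return 1;
>         return 0;
>     }
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>         double chunk[4] = {0};
>         if (rank == 0) {
>             MPI_Recv(chunk, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>             if (looks_corrupted(chunk, 4)) {
>                 fprintf(stderr, "rank 0: rank 1 is sending garbage\n");
>                 /* Either stop talking to rank 1 and redistribute its
>                    work among the remaining peers, or ask MPI to
>                    terminate it with the proposal's MPI_Comm_kill()
>                    (unlike MPI_Abort, the caller survives). */
>             }
>         } else if (rank == 1) {
>             MPI_Send(chunk, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
>         }
> 
>         MPI_Finalize();
>         return 0;
>     }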
> 
> 
> As an aside, providing a memory boundary or specifying a heartbeat timeout
> value is generally difficult to do in the standard. But I don't think you
> necessarily need this functionality, since how soon the MPI library
> notifies the application of a process failure, and how well it manages its
> internal buffers, are quality-of-implementation issues.
> 
> 
> >
> >
> > Note that this view is focused on process failures. When it is applied to
> things like network partitions (this includes the case you mentioned, where
> one process can't talk to any others because of a failed network card),
> processes on both sides of the partition may be informed that the others
> have failed. As such, when connectivity is restored, since MPI is
> responsible for maintaining self-consistency of its previous notifications,
> it will have to kill processes on one side of the partition to remain
> consistent with the notifications it gave to the other side.
> >
> >
> > Considering many apps work in master-slave mode, I would like to be able
> to guarantee that the side on which the master resides is not killed.
> 
> You should be able to do this by creating subsets of communicators and
> setting error handlers appropriately on the application end. Organize
> 'workers' into volatile groups (error handler = MPI_ERRORS_ARE_FATAL),
> while the 'manager' process(es) only ever participate in communicators
> that do not have a fatal error handler.
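> 
> A minimal sketch of that layout (the split and the 'work' are invented;
> the error return on a worker failure is the behaviour the proposal, or a
> good implementation, would provide):
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
> 
>         int rank, size;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>         /* Rank 0 is the manager; all others are workers grouped into
>            their own communicator, whose failures are fatal only to the
>            worker group. */
>         int is_manager = (rank == 0);
>         MPI_Comm workers;
>         MPI_Comm_split(MPI_COMM_WORLD,
>                        is_manager ? MPI_UNDEFINED : 0, rank, &workers);
>         if (!is_manager)
>             MPI_Comm_set_errhandler(workers, MPI_ERRORS_ARE_FATAL);
> 
>         /* The manager only ever communicates over a communicator that
>            returns errors, so a worker failure is reported to it rather
>            than taking it down. */
>         MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 
>         if (is_manager) {
>             int result;
>             for (int w = 1; w < size; w++) {
>                 int rc = MPI_Recv(&result, 1, MPI_INT, w, 0,
>                                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>                 if (rc != MPI_SUCCESS)
>                     fprintf(stderr,
>                             "manager: worker %d lost, reassigning\n", w);
>             }
>         } else {
>             int result = rank * rank;      /* stand-in for real work */
>             MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
>             MPI_Comm_free(&workers);
>         }
> 
>         MPI_Finalize();
>         return 0;
>     }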
> 
> Does that help clarify?
> 
> Thanks for the feedback,
> Josh
> 
> 
> >
> > toon
> > _______________________________________________
> > mpi3-ft mailing list
> > mpi3-ft at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
