[Mpi3-ft] MPI_Init / MPI_Finalize

Joshua Hursey jjhursey at open-mpi.org
Thu Aug 26 10:08:07 CDT 2010

Neat idea. Below are a couple of thoughts that occurred to me.

For the MPI_Init_version function, I would suggest that it be modeled after MPI_Get_version, so it takes separate major and minor version numbers. I would also suggest changing the order of the arguments slightly to mimic MPI_Init() for familiarity (which opens the door to functional overloading in languages that allow such things):
  int* argc,
  char ***argv,
  MPI_Errhandler errhandler,
  int required_version,
  int required_subversion );

The function would return an error (using the errhandler provided) if it cannot provide (at least?) the required version. The user can check which version they actually got by calling MPI_Get_version() directly after successful completion of MPI_Init (it might be greater than the required version).

Part of the difficulty I have with the versioning is deciding what should raise an error. Do we want an error if the required version cannot be provided exactly (MPI 2.2 or die)? If at least the required version is not available (MPI 2.2, 2.3, or 3.0 is ok, but not 2.1)? Or should we allow the user to specify a range, to get features that were introduced in, say, 3.0 but not the features introduced after 3.3? I think it would be appropriate for this to return success if at least the required (minimal) version is provided; the application can then use MPI_Get_version to decide whether the provided version is acceptable, and call MPI_Abort() if not.

In my mind, versioning gets bogged down in a resurgence of the discussion of (for good or bad) subsetting and backwards compatibility. There is value in figuring out whether the MPI you were compiled with provides at least the run-through stabilization semantics. If we can sidestep the issue of versioning for now, I think that will help focus the discussion of the proposal a bit.

Combining Error Handler registration with MPI_Init
The idea of combining the error handler registration with a new MPI_Init function is interesting. I think this might have been mentioned on the call, though I can't remember by whom.

With the MPI_Comm_set_errhandler call, the error handler is associated with a communicator and inherited by all new descendant communicators. Adding the error handler registration to MPI_Init disconnects it from the communicator. So is this error handler associated with MPI_COMM_WORLD, or with all {inter|intra}communicators present at MPI_Init time? And if an error occurs during MPI_Init and a user-defined error handler is registered, what should we pass for the MPI_Comm argument (maybe MPI_COMM_NULL)?

One advantage of registering the error handler in the MPI_Init function is that it gives the MPI implementation some flexibility in when it handles the registration during init, instead of requiring special code in the MPI_Comm_set_errhandler call to check whether MPI is initialized.

The disadvantage is that it introduces a new API, instead of using an existing API. Though we are already introducing new APIs, so this may not be a big deal.

What do others think?

-- Josh

On Aug 26, 2010, at 12:14 AM, Fab Tillier wrote:

> I think perhaps a new MPI_Init function is in order, one that takes as input an error handler for MPI_COMM_WORLD.  I'd go a step further and have the function take as an input parameter the major and minor version of MPI that the application is requesting.  Something like:
> MPI_Init_version(
>    int version,
>    int* argc,
>    char ***argv,
>    MPI_Errhandler errhandler,
>    int required,
>    int* provided );
> The version parameter would encode the major and minor version of the standard that the application expects.  This allows semantic changes to existing APIs (i.e., preserving the function signature but changing the behavior) without introducing new APIs, which should help control the API footprint of the standard.  For example, with this we could change the behavior of existing APIs to support run-through stabilization without violating previous versions of the standard.
> The errhandler would have to be one of the predefined error handlers, which avoids the need to allow MPI_Comm_set_errhandler to be called (on MPI_COMM_WORLD only) before the MPI implementation is initialized.
> Thoughts?
> -Fab
> Joshua Hursey wrote on Wed, 25 Aug 2010 at 13:24:32
>> During the discussion of the run-though stabilization proposal today on
>> the teleconf, we spent a while discussing the expected behavior of
>> MPI_Init and MPI_Finalize in the presence of process failures. I would
>> like to broaden the discussion a bit to help pin down the expected
>> behavior.
>> MPI_Init():
>> -----------
>> Problem: If a process fails before or during MPI_Init, what should the
>> MPI implementation do?
>> The current standard says nothing about the return value of MPI_Init()
>> (Ch. 8.7). To the greatest possible extent the application should not
>> be put in danger if it wishes to ignore errors (assumes
>> MPI_ERRORS_ARE_FATAL), so returning an error from this function (in
>> contrast to aborting the job) might be dangerous. However, if the
>> application is prepared to handle process failures, it is unable to
>> communicate that information to the MPI implementation until after the
>> completion of MPI_Init().
>> So a couple of solutions were presented each with pros and cons (please
>> fill in if I missed any):
>> 1) If a process fails in MPI_Init() (default error handler is
>> MPI_ERRORS_ARE_FATAL) then the entire job is aborted (similar to
>> calling MPI_Abort on MPI_COMM_WORLD).
>> 2) If a process fails in MPI_Init() the MPI implementation will return
>> an appropriate error code/class (e.g., MPI_ERR_RANK_FAIL_STOP), and all
>> subsequent calls into the MPI implementation will return the error
>> class MPI_ERR_OTHER (should we create an MPI_ERR_NOT_ACTIVE?).
>> Applications should eventually notice the error and terminate.
>> 3) Allow the application to register only the MPI_ERRORS_RETURN handler
>> on MPI_COMM_WORLD before MPI_Init() using the MPI_Errhandler_set()
>> function. Errors that occur before the MPI_Errhandler_set() call are
>> fatal. Errors afterward, including during MPI_Init() are not fatal.
>> In the cases where MPI_Init() returns MPI_ERR_RANK_FAIL_STOP to
>> indicate a process failure, is the library usable or not? If the
>> application can continue running through the failure, then the MPI
>> library should still be usable, thus MPI_Init() must be fault tolerant
>> in its initialization to be able to handle process failures. If the MPI
>> implementation finds itself in trouble and cannot continue it should
>> return MPI_ERR_CANNOT_CONTINUE from all subsequent calls including
>> MPI_Init, if possible.
>> MPI_Finalize():
>> ---------------
>> Problem: If a process fails before or during MPI_Finalize (and the
>> error handler is not MPI_ERRORS_ARE_FATAL), what should this function
>> return? Should that return value be consistent to all processes?
>> To preserve locality of fault handling, a local process should not be
>> explicitly forced to recognize the failure of a peer process that it
>> never interacts with, either directly (e.g., point-to-point) or
>> indirectly (e.g., collective). So MPI_Finalize should be fault tolerant
>> and keep trying to complete even in the presence of failures.
>> MPI_Finalize is not required to be a collective operation, though it is
>> often implemented that way. An implementation may need to delay the
>> return from MPI_Finalize until its role in the failure information
>> distribution channel is complete. But we should not require a multi-
>> phase commit protocol to ensure that everyone either succeeds or
>> returns some error. Implementations may do so internally in order to
>> ensure that MPI_Finalize does not hang.
>> If MPI_Finalize returns an error (say MPI_ERR_RANK_FAIL_STOP indicating
>> a 'new to this rank' failure), what good is this information to the
>> application? It cannot query for which rank(s) failed since MPI has
>> been finalized. Nor can it initiate recovery. The best it could do is
>> assume that all other processes failed and take local action.
>> MPI_Finalize: MPI_COMM_WORLD process rank 0:
>> --------------------------------------------
>> In chapter 8, Example 8.7
>> illustrates that "Although it is not required that all processes return
>> from MPI_Finalize, it is required that at least process 0 in
>> MPI_COMM_WORLD return, so that users can know that the MPI portion of
>> the computation is over."
>> We deduced that the reasoning for this explanation was to allow for MPI
>> implementations that create and destroy MPI processes during
>> init/finalize from rank 0. Or worded differently, rank 0 is the only
>> rank that can be assumed to exist before MPI_Init and after MPI_Finalize.
>> Problem: So what if rank 0 fails at some point during the computation
>> (or just some point during MPI_Finalize)?
>> In the proposal, I added an advice to users to tell them to not depend
>> on any specific ranks to exist before MPI_Init or after MPI_Finalize.
>> So, in a faulty environment, the example will produce incorrect results
>> under certain failure scenarios (e.g., failure of rank 0).
>> In an MPI environment that depends on rank 0 for process creation and
>> destruction, the failure of rank 0 is (should be?) critical and the MPI
>> implementation will either abort the job or return
>> MPI_ERR_CANNOT_CONTINUE from all calls to the MPI implementation. So we
>> believe that the advice to users was a sufficient addition to this
>> section. What do others think?
>> So MPI_Init seems to be a more complex issue than MPI_Finalize. What do
>> folks think about the presented problems and possible solutions? Are
>> there other issues not mentioned here that we should be addressing?
>> -- Josh
>> Run-Through Stabilization Proposal:
>>  https://svn.mpi-forum.org/trac/mpi-forum-
>> web/wiki/ft/run_through_stabilization
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://www.cs.indiana.edu/~jjhursey
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
