[Mpi-forum] MPI_Abort and error handlers

Bland, Wesley wesley.bland at intel.com
Fri Aug 7 09:08:40 CDT 2015


The standard is very unclear on this, which is why we’ve been working on it in the FTWG. Take a look at tickets 324 and 477:

https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/324
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/477

The first ticket clarifies that MPI_ERRORS_ARE_FATAL causes all connected processes to abort when any error is raised. It also tries to clarify the behavior of MPI_ABORT. In particular, it removes the text:

Rationale. The communicator argument is provided to allow for future extensions of MPI to environments with, for example, dynamic process management. In particular, it allows but does not require an MPI implementation to abort a subset of MPI_COMM_WORLD. (End of rationale.)

And replaces it with the text:

Advice to implementors. When aborting a subset of processes, a high quality implementation should be able to provide correct error handling for communicators containing both aborted and non-aborted processes. (End of advice to implementors.)

This is intended to clarify the situation that you point out.

The second ticket creates a new error handler that will abort only the processes in the communicator on which the error handler is called.

For your situation, you should be able to just replace the default error handler with your custom error handler to do cleanup. In fact, that’s what the original definition said anyway:

The purpose of these error handlers is to allow a user to issue user-defined error messages and to take actions unrelated to MPI (such as flushing I/O buffers) before a program exits.

Hope that helps. Feel free to take a look at the PDF on either of the tickets (it’s the same) and let me know if it looks like we missed something.

Thanks,
Wesley

On Aug 7, 2015, at 8:44 AM, Jeff Hammond <jeff.science at gmail.com> wrote:

I am looking at MPI-3.1 8.3 "Error Handling" and 8.7.1 "Allowing User Functions at Process Termination" right now, but I cannot figure out the interaction between MPI_Abort and error handlers.


When MPI_Abort is called by one process, what is the effect on the others, besides "MPI will try to clean them up"?  I suppose one has to assume the worst case of "no cleanup" or "it's like kill -9" right now.  What does a high-quality implementation do?

What I am looking for is a way to have error handlers called when MPI_Abort is called somewhere.  I don't expect this can be required, but "a high-quality implementation will do this" would be very useful.

The motivation is for one-sided job termination, e.g. shmem_global_exit and upc_global_exit (details below).  The challenge is that these functions require I/O flushing and resource release.  I really do not want to have to burn a thread just to satisfy this requirement in OSHMPI.

Thanks,

Jeff


OpenSHMEM 1.2 says this:

shmem_global_exit is a non-collective routine that allows any one PE to force termination of an OpenSHMEM program for all PEs, passing an exit status to the execution environment. This routine terminates the entire program, not just the OpenSHMEM portion. When any PE calls shmem_global_exit, it results in the immediate notification to all PEs to terminate. shmem_global_exit flushes I/O and releases resources in accordance with C/C++/Fortran language requirements for normal program termination. If more than one PE calls shmem_global_exit, then the exit status returned to the environment shall be one of the values passed to shmem_global_exit as the status argument. There is no return to the caller of shmem_global_exit; control is returned from the OpenSHMEM program to the execution environment for all PEs.


shmem_global_exit may be used in situations where one or more PEs have determined that the program has completed and/or should terminate early. Accordingly, the integer status argument can be used to pass any information about the nature of the exit, e.g. an encountered error or a found solution. Since shmem_global_exit is a non-collective routine, there is no implied synchronization, and all PEs must terminate regardless of their current execution state. While I/O must be flushed for standard language I/O calls from C/C++/Fortran, it is implementation dependent as to how I/O done by other means (e.g. third party I/O libraries) is handled. Similarly, resources are released according to C/C++/Fortran standard language requirements, but this may not include all resources allocated for the OpenSHMEM program. However, a quality implementation will make a best effort to flush all I/O and clean up all resources.

UPC says this:

    7.2.1 Termination of all threads

    Synopsis

    1 #include <upc.h>
      void upc_global_exit(int status);

    Description

    2 upc_global_exit() flushes all I/O, releases all storage,
    and terminates the execution for all active threads.

--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
_______________________________________________
mpi-forum mailing list
mpi-forum at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum


