[mpiwg-ft] help with advice to implementors accompanying MPI_Abort

HOLMES Daniel d.holmes at epcc.ed.ac.uk
Wed Jun 26 05:50:31 CDT 2019


Hi Aurélien,

My concern with changing this text to refer to a different communicator was that the user might then have to set an error handler on MPI_COMM_WORLD in addition to the one on that (sub)communicator, to stop MPI from aborting all MPI processes in MPI_COMM_WORLD at the first sign of trouble. The first sentence says “high quality implementation should”, so maybe there is no reason to worry here.

We will drop this item from our to-do list. Thanks!

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Applications Consultant in HPC Research
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 25 Jun 2019, at 19:33, Aurelien Bouteiller via mpiwg-ft <mpiwg-ft at lists.mpi-forum.org> wrote:

It does not appear to me that any change is necessary.

The only part about MPI_COMM_WORLD is in an 'as an example' clause.

That being said, substituting MPI_COMM_WORLD with some generic 'communicator comm' would also work.

Aurelien


On Tue, Jun 25, 2019 at 9:49 AM Pritchard Jr., Howard via mpiwg-ft <mpiwg-ft at lists.mpi-forum.org> wrote:
Hello MPI FTers,

The Sessions WG could use some help/suggestions about how to adjust the following advice to implementors that accompanies the definition of MPI_Abort:


\begin{implementors}
    After aborting a subset of processes, a high quality implementation should
    be able to provide error handling for communicators, windows, and files
    involving both aborted and non-aborted processes. As an example, if the
    user changes the error handler for \const{MPI\_COMM\_WORLD} to
    \const{MPI\_ERRORS\_RETURN} or a custom error handler, when a subset of
    \const{MPI\_COMM\_WORLD} is aborted, the remaining processes in
    \const{MPI\_COMM\_WORLD} should be able to continue communicating with each
    other and receive appropriate error codes when attempting communication
    with an aborted process.
\end{implementors}

We would like to generalize this advice to implementors to the case where MPI_COMM_WORLD isn’t a valid communicator, i.e., when an application is using the Sessions model.
We think the existing text would need some reworking to cover the Sessions use case. Since the FT group has worked quite a bit on this text, we’d defer to your group for suggestions on how to generalize it.

Thanks very much for any help,

Howard

--

Howard Pritchard
HPC-ENV
Los Alamos National Laboratory

_______________________________________________
mpiwg-ft mailing list
mpiwg-ft at lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft