From work at wesbland.com Mon Jun 10 08:49:25 2019
From: work at wesbland.com (Wesley Bland)
Date: Mon, 10 Jun 2019 08:49:25 -0500
Subject: [mpiwg-ft] FTWG Con Call - 2019-06-10
Message-ID: <77EB46B1-0C5F-4730-9878-CE4809DBC60D@wesbland.com>

The Fault Tolerance Working Group's weekly con call is today at 12:00 PM Eastern.

Today's agenda:

* Continue fine- and coarse-grained recovery discussions (Aurelien and Ignacio, feel free to contribute details of what you'd like to talk about).

If there's something else that people would like to discuss, please just send an email to the WG so we can get it on the agenda.

Thanks,
Wesley

Join from PC, Mac, Linux, iOS or Android: https://tennessee.zoom.us/j/632356722?pwd=lI4_169CGcewIumekTziMw
Password: mpiforum

Or iPhone one-tap (US Toll): +16468769923,632356722# or +16699006833,632356722#

Or Telephone:
    Dial: +1 646 876 9923 (US Toll)
          +1 669 900 6833 (US Toll)
    Meeting ID: 632 356 722
    International numbers available: https://zoom.us/u/6uINe

Or an H.323/SIP room system:
    H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East)
    Meeting ID: 632 356 722
    Password: 364216
    SIP: 632356722 at zoomcrc.com
    Password: 364216

From howardp at lanl.gov Tue Jun 25 08:48:47 2019
From: howardp at lanl.gov (Pritchard Jr., Howard)
Date: Tue, 25 Jun 2019 13:48:47 +0000
Subject: [mpiwg-ft] help with advice to implementors accompanying MPI_Abort
Message-ID: <602B958C-12CB-43DC-907F-6082AA538D0C@lanl.gov>

Hello MPI FTer's,

The Sessions WG could use some help/suggestions about how to adjust the following advice to implementors that accompanies the definition of MPI_Abort:

\begin{implementors}
  After aborting a subset of processes, a high quality implementation should
  be able to provide error handling for communicators, windows, and files
  involving both aborted and non-aborted processes. As an example, if the
  user changes the error handler for \const{MPI\_COMM\_WORLD} to
  \const{MPI\_ERRORS\_RETURN} or a custom error handler, when a subset of
  \const{MPI\_COMM\_WORLD} is aborted, the remaining processes in
  \const{MPI\_COMM\_WORLD} should be able to continue communicating with each
  other and receive appropriate error codes when attempting communication
  with an aborted process.
\end{implementors}

We would like to generalize this advice to implementors to the case where MPI_COMM_WORLD isn't a valid communicator, i.e. when an application is using the Sessions model.

We think that there would need to be some reworking of the existing text to cover the sessions use case. Since the FT group has worked quite a bit on this text, we'd defer to your group for suggestions on how to generalize this text to cover the sessions use case.

Thanks very much for any help,

Howard

--
Howard Pritchard
HPC-ENV
Los Alamos National Laboratory

From bouteill at icl.utk.edu Tue Jun 25 13:33:22 2019
From: bouteill at icl.utk.edu (Aurelien Bouteiller)
Date: Tue, 25 Jun 2019 14:33:22 -0400
Subject: [mpiwg-ft] help with advice to implementors accompanying MPI_Abort
In-Reply-To: <602B958C-12CB-43DC-907F-6082AA538D0C@lanl.gov>
References: <602B958C-12CB-43DC-907F-6082AA538D0C@lanl.gov>
Message-ID:

It does not appear to me that any change is necessary. The only part about MPI_COMM_WORLD is in an 'as an example' clause.

That being said, substituting MPI_COMM_WORLD with some generic 'communicator comm' would also work.

Aurelien

From d.holmes at epcc.ed.ac.uk Wed Jun 26 05:50:31 2019
From: d.holmes at epcc.ed.ac.uk (HOLMES Daniel)
Date: Wed, 26 Jun 2019 10:50:31 +0000
Subject: [mpiwg-ft] help with advice to implementors accompanying MPI_Abort
In-Reply-To:
References: <602B958C-12CB-43DC-907F-6082AA538D0C@lanl.gov>
Message-ID:

Hi Aurélien,

My concern with changing this text to refer to a different communicator was that the user might have to set an error handler on MPI_COMM_WORLD in addition to the one on that (sub)communicator, to avoid MPI just aborting all MPI processes in MCW at the first sign of trouble.

The first sentence says "high quality implementation should", so maybe there is no reason to worry here. We will drop this item from our to-do list.

Thanks!

Cheers,
Dan.

--
Dr Daniel Holmes PhD
Applications Consultant in HPC Research
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.