[MPIWG Fortran] Fortran coarrays - failed images

Bland, Wesley wesley.bland at intel.com
Mon Sep 7 22:43:53 CDT 2015


Yep. That's the same challenge we're facing. Pointing people at that link is a reasonable place to start. That's my old group at UTK, where the first ULFM implementation was done, and they're still maintaining it. I think they have some examples and use cases there as well. Right now, I think the MPICH version of PMI is probably the best place to start for what we want from launchers. I think PMIx is doing similar things, but I'm less familiar with it.

> On Sep 7, 2015, at 10:27 PM, Bill Long <longb at cray.com> wrote:
> 
> Hi Wesley,
> 
> Thanks for the detailed reply.  It looks like the OpenCoarrays implementors should not have any problem getting the “did someone fail?” and “who failed?” information from MPI.  It’s not clear how that will get handled for their GASNet alternative, though I suspect the UPC clan will be interested in some form of FT on the really large systems.  Maybe PMI would be a place to start for them.
> 
> The remaining boogeyman is convincing the execution environment to not kill the job.  I’ll try to find out what our MPI group is considering as a first step.  The http://fault-tolerance.org/ link in the main ticket (Open MPI implementation) also looks potentially useful.  If the various MPI implementations as well as the Fortran implementors ask the ALPS/PBS/SLURM/… people for the same thing, we’re more likely to get it.
> 
> Cheers,
> Bill
> 
> 
>> On Sep 7, 2015, at 11:58 AM, Bland, Wesley <wesley.bland at intel.com> wrote:
>> 
>> Hi Bill,
>> 
>> It sounds like at least some of what coarrays require would be provided by the ULFM proposal the FTWG is bringing forward. I’ll drop some comments inline.
>> 
>> 
>> 
>> 
>>> On 9/7/15, 12:45 AM, "mpiwg-fortran on behalf of Bill Long" <mpiwg-fortran-bounces at lists.mpi-forum.org on behalf of longb at cray.com> wrote:
>>> 
>>> Hi Jeff,
>>> 
>>> The current version of the TS is WG5 document N2074.  The failed image feature has not changed much recently, but still better to use the latest version.  N2074 is currently out for WG5 review.  Assuming it passes, this (minus the line numbers that we like but ISO doesn’t) is the version that will be sent to the ISO editors for publication. 
>>> 
>>> I think it would be an excellent idea for the TS/Fortran 2015 and MPI facilities for FT to be able to use common underlying infrastructure if possible.  I have not thought about how a program that uses both the Fortran and the MPI facilities would work (or how the MPI spec should be written in that regard), but if the links into components like PMI or SLURM were the same, that would certainly help.  I wrote up a summary of the TS features for the benefit of the MPI FT experts, pasted in below. 
>>> 
>>> Cheers,
>>> Bill
>>> 
>>> ISO TS 18508 includes features that can be used to help a program
>>> react to the failure of an image. It is intended to be a minimal
>>> capability. Facilities are included for notification, inquiry,
>>> testing, and simple continuation of execution.
>>> 
>>> 
>>> Background:
>>> 
>>> The parallel programming model using coarrays that is included in
>>> Fortran 2008 assumes that the number of executing images remains
>>> constant for the entire program. As a consequence, failure of an image
>>> (typically from a hardware failure) aborts the whole program.  The
>>> addition of teams in TS 18508 allows for the possibility of the number
>>> of images decreasing following a failure by forming a new team
>>> consisting of the active images and continuing execution in that team.
>>> While this was not the main motivation for introducing teams, this
>>> observation led to the addition of minimal resilience facilities to
>>> the TS.
>>> 
>>> 
>>> Notification:
>>> 
>>> The image control statements include an optional STAT= specifier that
>>> will return an error status. The image selector syntax for remote
>>> references also allows an optional STAT= specifier.  The new collective
>>> and atomic subroutines have an optional STAT argument that will also
>>> return an error status. An error status of zero indicates success. If
>>> the operation involved communication with a failed image, the status
>>> returned is equal to the named constant STAT_FAILED_IMAGE that is
>>> defined in the intrinsic module ISO_FORTRAN_ENV, and execution
>>> continues.  If there is no status variable provided and the operation
>>> involves communication with a failed image, the program aborts.  A
>>> negative value of STAT_FAILED_IMAGE indicates that the processor
>>> cannot detect image failure, and a positive value indicates it can.
>>> This effectively makes provision of this facility optional.
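>>> 
>>> A minimal sketch of how this might look in source code (variable names
>>> and the specific statements are just illustrative):
>>> 
>>>   program notify_demo
>>>     use, intrinsic :: iso_fortran_env, only: STAT_FAILED_IMAGE
>>>     implicit none
>>>     integer :: istat
>>>     real    :: x[*]                       ! a scalar coarray
>>> 
>>>     x = real(this_image())
>>> 
>>>     sync all (stat=istat)                 ! image control statement
>>>     if (istat == STAT_FAILED_IMAGE) then
>>>       ! an image has failed; execution continues on the survivors
>>>     end if
>>> 
>>>     if (this_image() == 1 .and. num_images() > 1) then
>>>       x = x[2, stat=istat]                ! STAT= on an image selector
>>>       if (istat == STAT_FAILED_IMAGE) x = 0.0
>>>     end if
>>>   end program notify_demo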
>>> 
>>> [MPI analog: Remote communication operations in MPI either return a
>>> status as a function result (C), or have an optional MPI subroutine
>>> argument that returns an error status (Fortran). A named constant
>>> MPI_FAILED_RANK could be added to the MPI spec to provide a
>>> corresponding capability.]
>> 
>> 
>> WB: We do propose adding a new error class called MPI_ERR_PROC_FAILED, which would tell you that a rank has failed. There are separate functions available to find out exactly which processes failed.
>> 
>>> 
>>> 
>>> Inquiry:
>>> 
>>> Two inquiry functions are provided that can return information on the
>>> status of images. IMAGE_STATUS( N ) will return STAT_FAILED_IMAGE if
>>> image N has failed. This could be used to check on the health of an
>>> image just before a loop that involves many accesses to that
>>> image. FAILED_IMAGES( ) returns a 1-D array containing the numbers of
>>> the failed images. This can be used to compute which images should be
>>> omitted from a new team that can be used for continued execution. If
>>> there are no failed images, the returned array has size zero.
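>>> 
>>> For example, a small helper along these lines (the helper itself is
>>> mine, not part of the TS) could compute the surviving images:
>>> 
>>>   subroutine survivors(alive)
>>>     implicit none
>>>     integer, allocatable, intent(out) :: alive(:)
>>>     integer, allocatable :: failed(:)
>>>     integer :: i
>>> 
>>>     failed = failed_images()      ! size zero if nothing has failed
>>>     alive  = pack([(i, i = 1, num_images())], &
>>>                   [(all(failed /= i), i = 1, num_images())])
>>>   end subroutine survivors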
>>> 
>>> [MPI analog: Adding corresponding functions ( IMAGE -> RANK ) to MPI
>>> would seem straightforward.]
>> 
>> 
>> WB: While there isn’t exactly a function in the proposal to find out whether a communicator or window has a failed rank in it, it’s relatively straightforward to get that functionality with a different function that we propose. MPI_COMM_AGREE will let you know if a rank in the communicator has failed, and then you can use the query functions on the communicator to get the group of failed processes.
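>> 
>> Roughly like this, using the proposed names (the current prototypes spell them with an MPIX_ prefix; this is a sketch against the proposal, not an existing API, and it assumes MPI_ERRORS_RETURN is set on the communicator):
>> 
>>   subroutine who_failed(comm, nfailed)
>>     use mpi
>>     implicit none
>>     integer, intent(in)  :: comm
>>     integer, intent(out) :: nfailed
>>     integer :: flag, eclass, failed_grp, ierr, ierr2
>> 
>>     nfailed = 0
>>     flag = 1
>>     call MPI_Comm_agree(comm, flag, ierr)           ! proposed call
>>     call MPI_Error_class(ierr, eclass, ierr2)
>>     if (eclass == MPI_ERR_PROC_FAILED) then         ! proposed error class
>>       call MPI_Comm_failure_ack(comm, ierr)         ! proposed call
>>       call MPI_Comm_failure_get_acked(comm, failed_grp, ierr)
>>       call MPI_Group_size(failed_grp, nfailed, ierr)
>>       call MPI_Group_free(failed_grp, ierr)
>>     end if
>>   end subroutine who_failed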
>> 
>>> 
>>> 
>>> Testing:
>>> 
>>> A new statement
>>> 
>>> FAIL IMAGE
>>> 
>>> is added. When executed by image N, it causes image N to appear to have
>>> failed as seen from the other images.  This is included so that
>>> programmers can test recovery algorithms without having to wait for an
>>> actual failure. An image that executes this statement does not
>>> continue execution.
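>>> 
>>> A sketch of a test driver (the image number and recovery details are
>>> arbitrary):
>>> 
>>>   program test_recovery
>>>     use, intrinsic :: iso_fortran_env, only: STAT_FAILED_IMAGE
>>>     implicit none
>>>     integer :: istat
>>> 
>>>     ! Inject a failure on image 3 so the recovery code can be
>>>     ! exercised without waiting for real hardware to die.
>>>     if (num_images() >= 3 .and. this_image() == 3) then
>>>       fail image
>>>     end if
>>> 
>>>     sync all (stat=istat)
>>>     if (istat == STAT_FAILED_IMAGE) then
>>>       ! recovery path under test, e.g. form a team of the survivors
>>>     end if
>>>   end program test_recovery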
>>> 
>>> [MPI analog: One more function.]
>> 
>> 
>> WB: You can emulate failures in MPI by just having a process exit without calling MPI_Finalize or MPI_Abort. That’s erroneous behavior and will start kicking off error handlers in a high-quality implementation.
>> 
>>> 
>>> 
>>> Continuation:
>>> 
>>> The FORM TEAM statement allows the program to create a new team.  In
>>> the case of a failed image, the strategy would be to form a new team
>>> consisting of the remaining active images.  The CHANGE TEAM statement
>>> causes a switch in the execution environment to the specified
>>> team. Note that whether the program can meaningfully continue depends
>>> on the algorithm being implemented and whether the program includes
>>> code to switch teams to continue.
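>>> 
>>> The continuation idiom might look like this (the team number and the
>>> comments are illustrative):
>>> 
>>>   program continue_after_failure
>>>     use, intrinsic :: iso_fortran_env, only: team_type
>>>     implicit none
>>>     type(team_type) :: survivors
>>>     integer :: istat
>>> 
>>>     ! Every image that is still alive joins team number 1.
>>>     form team (1, survivors, stat=istat)
>>> 
>>>     change team (survivors)
>>>       ! Inside the team, images are renumbered 1..num_images(), so
>>>       ! the remaining work can be redistributed over the survivors.
>>>     end team
>>>   end program continue_after_failure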
>>> 
>>> [MPI analog: Fortran teams could be mapped to MPI communicators. One
>>> difference is that Fortran allows you to omit a team specification in
>>> most statements involving communication, in which case the "current"
>>> team is used. The MPI spec might want to specify that MPI_COMM_WORLD
>>> is redefined in the case that "team" is shrunk through rank failure,
>>> to reduce the need to modify existing code.]
>> 
>> WB: We decided against changing the makeup of MPI_COMM_WORLD mid-execution to follow the law of least astonishment. What we do provide is a function to shrink an existing communicator to create a new one without the failed processes. Then it’s just a matter of changing your handles to use the new communicator instead of MPI_COMM_WORLD. A well-behaved app should be duping MPI_COMM_WORLD first anyway, right? :)
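>> 
>> As a sketch (proposed name; the prototype spells it MPIX_Comm_shrink, and this is illustrative rather than a worked implementation):
>> 
>>   program shrink_sketch
>>     use mpi
>>     implicit none
>>     integer :: work_comm, new_comm, ierr
>> 
>>     call MPI_Init(ierr)
>>     call MPI_Comm_dup(MPI_COMM_WORLD, work_comm, ierr)        ! dup at startup
>>     call MPI_Comm_set_errhandler(work_comm, MPI_ERRORS_RETURN, ierr)
>> 
>>     ! ... real work; suppose a call came back with MPI_ERR_PROC_FAILED ...
>> 
>>     call MPI_Comm_shrink(work_comm, new_comm, ierr)           ! proposed call
>>     call MPI_Comm_free(work_comm, ierr)
>>     work_comm = new_comm                ! carry on with the survivors
>> 
>>     call MPI_Finalize(ierr)
>>   end program shrink_sketch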
>> 
>>> 
>>> 
>>> Implementation issue:
>>> 
>>> The program launching and management environment (PMI, ALPS, SLURM,
>>> ...) needs to be modified to include an API that can provide failed
>>> image information to the program and keep the remaining images
>>> executing, as opposed to the current behavior of aborting all the
>>> images.  The API also needs a function that the program startup code
>>> can call to inform the management environment whether it is employing
>>> resilient features. It would certainly be advantageous to have the
>>> same API used for both Fortran and MPI.
>> 
>> 
>> WB: Agreed. We have a similar requirement from the launcher to inform MPI about failed processes. Technically, we don’t require it of the launcher, as MPI prefers to remain launcher/runtime neutral, but it would be good if implementations could take advantage of the same systems. I know that for PMI in MPICH, we added a new key that stays updated with a list of failed processes. We don’t yet have an implementation on top of ALPS or SLURM, so there isn’t a state of the art there as far as I know. In most instances, everything just aborts, so that’s a behavior that we’d like to get changed too.
>> 
>> As for informing the runtime about whether to enable resilience or not, we’ve put language in the proposal to say that the implementation is free not to provide the resilience features as long as it provides the API functions as no-ops. It would just never return MPI_ERR_PROC_FAILED. I expect that implementations that can’t or don’t want to support FT would do this, and if people don’t want to give up the little overhead, they could configure it out at compile time (or provide two libraries).
>> 
>>> 
>>> ——————————End of Summary ——————————————
>> 
>> WB: Thanks for the summary. If you’d like a little more detail on what Jeff is discussing, it’s in our ticket #325: https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/325 (main ticket #323: https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/323). It sounds to me like the main points that are needed for the new standard are there. It’d be good to have another set of eyes on the RMA part in particular, though, to make sure we’re providing all the needed semantics. I wouldn’t say any of us in the FTWG are RMA or Fortran experts.
>> 
>> Thanks,
>> Wesley
>> 
>>> 
>>> 
>>> 
>>>> On Sep 5, 2015, at 7:09 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>>>> 
>>>> So Fortran 2015 (TS 18508 - section 6 in the attachment) is going to support failed images.
>>>> 
>>>> The OpenCoarrays folks (who are responsible for enabling GCC 5+ Fortran coarray support) have started looking into how to support this feature.  They currently use MPI-3 RMA and GASNet as communication runtimes, but the need to support FT will likely push them in a new direction.  The options they have mentioned thus far are undesirable.
>>>> 
>>>> It would be great if there were more people who could help look at RMA FT, particularly as it pertains to Fortran 2015.  Are any of the Fortran WG folks savvy on how those failures map to MPI concepts and whether or not the MPI RMA FT discussion is going in the right direction?
>>>> 
>>>> Thanks,
>>>> 
>>>> Jeff
>>>> 
>>>> -- 
>>>> Jeff Hammond
>>>> jeff.science at gmail.com
>>>> http://jeffhammond.github.io/
>>>> <ISO-IECJTC1-SC22-WG5_N2056_Draft_TS_18508_Additional_Paralle.pdf>
>>> 
>>> Bill Long                                                                       longb at cray.com
>>> Fortran Technical Support  &                                  voice:  651-605-9024
>>> Bioinformatics Software Development                     fax:  651-605-9142
>>> Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101
>>> 
>>> 
> 
> Bill Long                                                                       longb at cray.com
> Fortran Technical Support  &                                  voice:  651-605-9024
> Bioinformatics Software Development                     fax:  651-605-9142
> Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101
> 
> 


