[mpiwg-ft] [MPI Forum] #323: User-Level Failure Mitigation

Jeff Hammond jeff.science at gmail.com
Tue Feb 3 15:14:11 CST 2015


How is win_flush not synchronizing? It causes global visibility of updates. I don't see how a non-synchronizing implementation could exist.

Jeff 

> On Feb 3, 2015, at 12:50 PM, MPI Forum <mpi-forum at lists.mpi-forum.org> wrote:
> 
> #323: User-Level Failure Mitigation
> -------------------------------------+-----------------------------------
> Reporter:  bosilca                   |                  Owner:  bosilca
>    Type:  Enhancements to standard  |                 Status:  new
> Priority:  Scheduled                 |              Milestone:  Future
> Version:  MPI 4.0                   |             Resolution:
> Keywords:  FT                        |  Implementation status:  Completed
> -------------------------------------+-----------------------------------
> 
> Comment (by bouteill):
> 
> Replying to [comment:32 jhammond]:
>> This means that page 5 line 11 of the latest FT proposal must be amended
>> somehow, as it pertains to the use of the phrase "epoch closing" (which
>> should be "epoch-closing", no?), unless you deliberately mean to exclude
>> {{{MPI_WIN_FLUSH(_LOCAL)(_ALL)}}} and {{{MPI_WIN_SYNC}}} from the list of
>> functions that must raise a process failure exception.  And if they are
>> excluded, then their relationship to FT is ambiguous, since they are
>> neither communication operations nor epoch-closing synchronization.
>> 
>> I suppose that we should treat {{{MPI_WIN_FLUSH_LOCAL(_ALL)}}}
>> differently from {{{MPI_WIN_FLUSH(_ALL)}}}, since the former is a local
>> operation and the latter is a nonlocal one.  Given that
>> {{{MPI_WIN_FLUSH(_ALL)}}} induce remote completion, they will detect
>> remote process failures and thus can be required to raise these without
>> introducing unreasonable overhead.
> 
> OK, thinking more about this, I came to the conclusion that the current
> text is correct: WIN_FLUSH is not local, but it is more about ordering
> than about remote completion, so it may not always detect errors. When a
> particular implementation does guarantee remote completion, it will
> raise an exception (as is possible in any case); when the implementation
> is not synchronizing, it will not. Mandating that the exception be raised
> may make the operation more expensive.
> 
> We may want to add some rationale/advice to record why we came to that
> conclusion (if you agree with me here)?
> 
> -- 
> Ticket URL: <https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/323#comment:39>
> MPI Forum <https://svn.mpi-forum.org/>
> MPI Forum
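
The two behaviours contrasted in the quoted comment can be sketched with a toy model. This is only an illustration under stated assumptions: it does not use MPI, every name in it (Window, Target, ProcessFailure, waits_for_ack, flush_reports_failure) is hypothetical, and whether a conforming WIN_FLUSH may behave like the ordering-only case is exactly the point in dispute in this thread.

```python
# Toy model only: none of these names are MPI APIs.
from dataclasses import dataclass, field


class ProcessFailure(Exception):
    """Stands in for a process failure exception."""


@dataclass
class Target:
    alive: bool = True
    memory: dict = field(default_factory=dict)


@dataclass
class Window:
    target: Target
    waits_for_ack: bool  # True: flush waits for a remote acknowledgement
    pending: list = field(default_factory=list)

    def put(self, key, value):
        # Queue the update; nothing is delivered until flush().
        self.pending.append((key, value))

    def flush(self):
        if self.waits_for_ack:
            # Acknowledging flush: a dead target cannot answer,
            # so the failure is noticed here.
            if not self.target.alive:
                raise ProcessFailure("target did not acknowledge")
            self.target.memory.update(self.pending)
        elif self.target.alive:
            # Ordering-only flush: updates are handed over in order
            # and the call returns without waiting for an answer.
            self.target.memory.update(self.pending)
        # In the ordering-only case a dead target goes unnoticed.
        self.pending.clear()


def flush_reports_failure(waits_for_ack):
    """Return True if flushing to a dead target raises ProcessFailure."""
    win = Window(Target(alive=False), waits_for_ack)
    win.put("x", 1)
    try:
        win.flush()
    except ProcessFailure:
        return True
    return False
```

In this sketch, flush_reports_failure(True) reports the failure and flush_reports_failure(False) does not. The model says nothing about the cost of either strategy, nor about whether the second one can provide the global visibility raised at the top of this message.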


