[mpiwg-ft] Notes from FTWG Plenary Session

Jeff Hammond jeff.science at gmail.com
Fri Dec 12 02:58:45 CST 2014

I don't want to rely upon work by PNNL to drive FT-RMA; clearly there
are issues there.

The FT WG has clearly made a lot of progress on non-RMA so far and I
don't want RMA users to be left out.  Let's work together to solve the
problem of non-Byzantine fault-tolerance in RMA by leveraging the
existing work you have done and the knowledge that RMA was designed
such that it could be implemented on top of Send-Recv (recall that the
Forum refuses to require progress in passive target RMA...).

The first step should be to make RMA symmetric w.r.t. Isend, in that
the implementation may not modify buffers beyond what is required to
provide the RMA semantics in question (i.e. MPI_Get does not modify
the target window, MPI_Put does not modify the input buffer, etc.).
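
To make the invariant above concrete, here is a minimal, MPI-free toy
model in C. It is only a sketch under stated assumptions: toy_put,
toy_get, checksum, and read_side_untouched are hypothetical names
invented for illustration, a "window" is plain local memory, and
nothing here is an MPI call or says how a real implementation behaves.
It only shows the property being asked for: the side an operation
reads from is bit-for-bit unchanged afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins, NOT MPI calls: a "window" is just local memory here. */

/* 64-bit FNV-1a hash, used only to detect any change to a buffer. */
uint64_t checksum(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Put-like copy: reads the origin buffer, writes the target window.
 * The const qualifier on origin states the "does not modify" contract. */
void toy_put(const void *origin, void *target_win, size_t n)
{
    memcpy(target_win, origin, n);
}

/* Get-like copy: reads the target window, writes the origin buffer.
 * The const qualifier on target_win states the same contract. */
void toy_get(void *origin, const void *target_win, size_t n)
{
    memcpy(origin, target_win, n);
}

/* Returns 1 if the read-only side of each operation is left untouched
 * and the data round-trips correctly, 0 otherwise. */
int read_side_untouched(void)
{
    int origin[4] = {1, 2, 3, 4};
    int window[4] = {0, 0, 0, 0};
    int result[4] = {0, 0, 0, 0};

    uint64_t before = checksum(origin, sizeof origin);
    toy_put(origin, window, sizeof origin);
    int put_ok = (checksum(origin, sizeof origin) == before);

    before = checksum(window, sizeof window);
    toy_get(result, window, sizeof window);
    int get_ok = (checksum(window, sizeof window) == before);

    return put_ok && get_ok &&
           memcmp(result, origin, sizeof origin) == 0;
}
```

Calling read_side_untouched() from a test harness checks both halves
of the property. In a fault-tolerant setting the same check would be
the basis for deciding that a buffer is still usable after a peer
failure, because it could not have been written by the operation.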

I'll read stuff over the holiday and be prepared to work on this with
you in January.



On Fri, Dec 12, 2014 at 12:43 AM, George Bosilca <bosilca at icl.utk.edu> wrote:
> On Thu, Dec 11, 2014 at 1:26 AM, Jeff Hammond <jeff.science at gmail.com>
> wrote:
>> http://pubs.acs.org/doi/abs/10.1021/ct100439u is the paper I was
>> implicitly referencing.  They do RAID inside of GA.
>> I can only do this sanely with MPI RMA (i.e. without resorting to nproc
>> times as many windows as necessary) if and only if I can continue to use
>> data after process failure if I know it could not have been corrupted.
> GA folks understood that minimal (and potentially intrusive) changes were
> required from the GA underlying runtime and communication library in order
> to support highly effective application specific fault tolerance methods.
> The paper you pointed out unfortunately leaves such discussions out, but it
> does prove a point similar to ULFM: that building upon a fault-tolerant
> communication library could have drastic benefits for applications in case
> of faults, while minimizing the impact on failure-free execution.
> Now, for the sake of the discussion let's dig a little deeper. To be
> extremely pedantic, while there is a 2-level RAID inside GA, this cannot be
> compared with MPI (which is more like ARMCI). As you might notice, there is
> a subtle difference here: ARMCI does not guarantee correctness in the
> traditional sense for one-sided operations (except obviously for get-based
> protocols). Instead, they use the GA-level data redundancy, together with
> write-based one-sided communication primitives and fences, to ensure the
> existence of a consistent state. Brilliantly simplistic approach, at the
> opposite end of the spectrum from the MPI Forum, which seems to require data
> integrity on windows in all memory models.
> Anyway, the good thing is that in the context of the FT WG, we are way past
> the point where GA seems to be (for everything but one-sided). We do have a
> clear description of the expected behavior for all communications (except
> for one-sided), with a well described API (except for one-sided), and now 2
> widely available implementations (except for one-sided). Over the last 2
> years, many large-scale applications have shown that, by taking advantage of
> these extensions, drastic improvements in time-to-solution can be achieved
> in faulty environments. This is proof that instead of providing a
> limited-scope fault management model, ULFM exposes a portable API,
> allowing application/library developers to design and implement highly
> efficient application/domain-specific fault-tolerant models.
> Hopefully with the help of interested folks and with the support of the RMA
> WG, we could settle on an approach similar to FT-ARMCI, and start building
> something constructive from there. It is about time.
>> It is possible that the paper doesn't adequately explain things for this
>> context, in which case I will provide them later.
> We should not mix application/domain-specific with communication
> library-level fault tolerance. Most of these papers do a great job of
> exposing high-level strategies, domain-specific or application-level, to
> handle faults. They seem to imply some level of resilience from the
> underlying runtime and communication library, but unfortunately such details
> are extremely scarce.
> There are 2 questions I would appreciate more details on. In light of
> the point raised in the discussion regarding memory reuse in RMA,
> how are the GA folks dealing with this case? My understanding is that
> they leverage the GA-level data redundancy to ensure consistency. But if we
> suppose lingering messages in the network generated by the failed process,
> how do they ensure the viability of the shadow data and especially how do
> they maintain the data consistency across multiple subsequent failures?
> Sorry for the long email.
>   George.
> PS: Reading through these papers I noticed an unsettling thing. In the
> HIPC'10 paper that talks about FT-ARMCI, performance numbers are presented
> using a fault-tolerant ARMCI/GA. However, all the other papers, especially
> those published after the HIPC'10 paper, state that no fault-tolerant GA
> implementation exists, and present instead results obtained using the
> original GA implementation (convenient ...). I wonder what the reaction of
> the MPI Forum would have been if the FT WG had dared to present fault
> tolerance related results using a stock MPI library.
>> Other stuff that may or may not matter:
>> http://hpc.pnl.gov/people/vishnu/public/vishnu_overdecomposition.pdf
>> http://hpc.pnl.gov/people/vishnu/public/vishnu_hipc10.pdf
>> http://dx.doi.org/10.1109/PDP.2011.72
>> http://link.springer.com/chapter/10.1007/978-3-642-23397-5_34
>> I assume someone from Argonne has presented GVR to the WG?
>> Jeff
>> On Dec 10, 2014, at 10:12 PM, George Bosilca <bosilca at icl.utk.edu> wrote:
>> Jeff,
>> I was trying to find some references to the GA FT work you mentioned
>> during the plenary discussion today.
>> The only reference I could find about the FT capabilities of GA is [1] but
>> it is getting dusty. A more recent reference [2] addresses NWCHEM in
>> particular, but represents an application-specific user-level
>> checkpoint/restart strategy, requiring minimal support from the
>> communication library and that has little in common with the ongoing
>> discussion in the WG.
>> I would really appreciate it if you could provide a reference.
>> Thanks,
>>   George.
>> [1] V. Tipparaju, M. Krishnan, B. Palmer, F. Petrini, and J. Nieplocha,
>> “Towards fault resilient Global Arrays.” in International Conference on
>> Parallel Computing, vol. 15, 2007, pp. 339–345.
>> [2] Nawab Ali, Sriram Krishnamoorthy, Niranjan Govind, Bruce Palmer, "A
>> Redundant Communication Approach to Scalable Fault Tolerance in PGAS
>> Programming Models", in PDP'11
>> On Wed, Dec 10, 2014 at 5:14 PM, Wesley Bland <wbland at anl.gov> wrote:
>>> I've posted notes from today's plenary session on the wiki page:
>>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ftwg2014-12-10
>>> I'm also attaching the slides to this email and I believe they'll be
>>> posted on the forum website by Martin at some point.
>>> Thanks,
>>> Wesley

Jeff Hammond
jeff.science at gmail.com
