[mpiwg-ft] Notes from FTWG Plenary Session

Fri Dec 12 02:43:13 CST 2014

On Thu, Dec 11, 2014 at 1:26 AM, Jeff Hammond <jeff.science at gmail.com>
wrote:

> http://pubs.acs.org/doi/abs/10.1021/ct100439u is the paper I was
> implicitly referencing.  They do RAID inside of GA.
>
> I can only do this sanely with MPI RMA (i.e., without resorting to nproc
> times as many windows as necessary) if and only if I can continue to use
> data after process failure if I know it could not have been corrupted.
>

GA folks understood that minimal (and potentially intrusive) changes were
required from the GA underlying runtime and communication library in order
to support highly effective application specific fault tolerance methods.
The paper you pointed out unfortunately leaves such discussions out, but it
does prove a point similar to ULFM: building upon a fault-tolerant
communication library could have drastic benefits for applications in case
of faults, while minimizing the impact on failure-free execution.

Now, for the sake of the discussion let's dig a little deeper. To be
extremely pedantic, while there is a 2-level RAID inside GA, this cannot be
compared with MPI (which is more like ARMCI). As you might notice, there is
a subtle difference here: ARMCI does not guarantee correctness in the
traditional sense for one-sided operations (except obviously for get-based
protocols). Instead, they use the GA-level data redundancy, together with
write-based one-sided communication primitives and fences, to ensure the
existence of a consistent state. A brilliantly simplistic approach, at the
opposite end of the spectrum from the MPI Forum, which seems to require
data integrity on windows in all memory models.

Anyway, the good thing is that in the context of the FT WG, we are way past
the point where GA seems to be (for everything but one-sided). We do have a
clear description of the expected behavior for all communications (except
for one-sided), with a well described API (except for one-sided), and now 2
widely available implementations (except for one-sided). Over the last 2
years, many large-scale applications have shown that, by taking advantage
of these extensions, drastic improvements in time-to-solution can be
achieved in faulty environments. This is proof that, instead of providing
a limited-scope fault management model, ULFM exposes a portable API that
allows application/library developers to design and implement highly
efficient application/domain-specific fault-tolerant models.

Hopefully with the help of interested folks and with the support of the RMA
WG, we could settle on an approach similar to FT-ARMCI, and start building
something constructive from there. It is about time.

> It is possible that the paper doesn't adequately explain things for this
> context, in which case I will provide them later.
>

We should not mix application/domain-specific with communication
library-level fault tolerance. Most of these papers do a great job at
exposing high-level strategies, domain-specific or application-level, to
handle faults. They seem to imply some level of resilience from the
underlying runtime and communication library, but unfortunately such
details are extremely scarce.

There are 2 questions I would appreciate getting more details on. In
light of the point raised in the discussion regarding memory reuse in
RMA, how are the GA folks dealing with this case? My understanding is that
they leverage the GA-level data redundancy to ensure consistency. But if we
suppose lingering messages in the network generated by the failed process,
how do they ensure the viability of the shadow data and especially how do
they maintain the data consistency across multiple subsequent failures?

Sorry for the long email.
  George.

PS: Reading through these papers I noticed an unsettling thing. In the
HIPC'10 paper that talks about FT-ARMCI, performance numbers are presented
using a fault-tolerant ARMCI/GA. However, all the other papers, especially
those published after the HIPC'10 paper, state that no fault-tolerant GA
implementation exists, and present instead results obtained using the
original GA implementation (convenient ...). I wonder what the reaction of
the MPI Forum would have been if the FT WG had dared to present
fault-tolerance-related results using a stock MPI library.

> Other stuff that may or may not matter:
>
> http://hpc.pnl.gov/people/vishnu/public/vishnu_overdecomposition.pdf
> http://hpc.pnl.gov/people/vishnu/public/vishnu_hipc10.pdf
> http://dx.doi.org/10.1109/PDP.2011.72
> http://link.springer.com/chapter/10.1007/978-3-642-23397-5_34
>
> I assume someone from Argonne has presented GVR to the WG?
>
> Jeff
>
> Sent from my iPhone
>
> On Dec 10, 2014, at 10:12 PM, George Bosilca <bosilca at icl.utk.edu> wrote:
>
> Jeff,
>
> I was trying to find some references to the GA FT work you mentioned
> during the plenary discussion today.
>
> The only reference I could find about the FT capabilities of GA is [1] but
> it is getting dusty. A more recent reference [2] addresses NWCHEM in
> particular, but represents an application-specific user-level
> checkpoint/restart strategy, requiring minimal support from the
> communication library and that has little in common with the ongoing
> discussion in the WG.
>
> I would really appreciate it if you could provide a reference.
>
> Thanks,
>   George.
>
> [1] V. Tipparaju, M. Krishnan, B. Palmer, F. Petrini, and J. Nieplocha,
> “Towards fault resilient Global Arrays.” in International Conference on
> Parallel Computing, vol. 15, 2007, pp. 339–345.
> [2] Nawab Ali, Sriram Krishnamoorthy, Niranjan Govind, Bruce Palmer, "A
> Redundant Communication Approach to Scalable Fault Tolerance in PGAS
> Programming Models", in PDP'11
>
> On Wed, Dec 10, 2014 at 5:14 PM, Wesley Bland <wbland at anl.gov> wrote:
>
>> I've posted notes from today's plenary session on the wiki page:
>>
>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ftwg2014-12-10
>>
>> I'm also attaching the slides to this email and I believe they'll be
>> posted on the forum website by Martin at some point.
>>
>> Thanks,
>> Wesley
>>
>> _______________________________________________
>> mpiwg-ft mailing list
>> mpiwg-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
>>
>
>