[Mpi-forum] Discussion points from the MPI-<next> discussion today

Fri Sep 21 13:49:41 CDT 2012

On Fri, Sep 21, 2012 at 12:58 PM, Supalov, Alexander <
alexander.supalov at intel.com> wrote:

> Hi,
>
> Thanks. There may be another interesting twist to that.
>
> Imagine we start challenging the initial assumptions (axioms) that lie at
> the foundation of the MPI standard. Those are, e.g., reliable and pairwise
> ordered message delivery, non-fairness, and maybe a few more. If we take
> one out, what happens to MPI? What part of it survives? What
> applications become possible that were not possible before? This idea is
> somewhat comparable to non-Euclidean geometry, if you wish.
>
> E.g., take out the reliable message delivery. This may be perfect for
> Exascale: you will have to have rather stable algorithms anyway, and the
> loss of a couple of messages should not have a dramatic effect then, in
> this sea of data. And what kind of speedup might be possible with that? Gee.
>

I recommend being _extremely_ careful with this line of thinking. By and
large, these "fault-oblivious" methods converge so much slower than their
"reliable" siblings that you would be better off running good methods on
one of today's workstations than running the "oblivious" methods on an
exascale machine. Don't forget David Keyes' slide showing how
algorithmic innovation has offered the same incredible number of orders of
magnitude improvement in simulation capability that hardware improvements
have provided over the past few decades. Most of these "oblivious"
algorithms necessarily discard decades of algorithmic innovation.
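
As a toy illustration of the convergence cost (a sketch under stated
assumptions, not taken from any of the work mentioned here): the Python below
runs Jacobi on the 1-D model problem tridiag(-1, 2, -1) x = 1, treating each
unknown as one "process" that sends its value to both neighbors after every
sweep. The function names, the loss model (each send dropped independently
with probability drop_prob, receiver reuses the last value it got), and the
problem size are all assumptions made for the example. Jacobi is itself a slow
stationary method, so this only shows the extra sweeps paid for lost messages,
not the much larger gap to a method like multigrid.

```python
import random


def exact_solution(n):
    """Exact solution of tridiag(-1, 2, -1) x = 1 with zero boundary values."""
    return [i * (n + 1 - i) / 2.0 for i in range(1, n + 1)]


def jacobi_iterations(n, drop_prob, tol=1e-6, max_iters=200000, seed=0):
    """Count Jacobi sweeps until the max error drops below tol.

    Each unknown plays the role of one process. After every sweep it "sends"
    its value to both neighbors; each send is lost with probability drop_prob,
    in which case the receiver keeps using the last value it did receive.
    Returns None if tol is not reached within max_iters sweeps.
    """
    rng = random.Random(seed)
    target = exact_solution(n)
    ghost_left = [0.0] * n   # last value received from the left neighbor
    ghost_right = [0.0] * n  # last value received from the right neighbor
    for sweep in range(1, max_iters + 1):
        # Jacobi update using whatever neighbor data each process holds.
        x = [(1.0 + ghost_left[i] + ghost_right[i]) / 2.0 for i in range(n)]
        # Exchange: a message is delivered only if it is not dropped.
        for i in range(n):
            if i > 0 and rng.random() >= drop_prob:
                ghost_left[i] = x[i - 1]
            if i < n - 1 and rng.random() >= drop_prob:
                ghost_right[i] = x[i + 1]
        if max(abs(t - v) for t, v in zip(target, x)) < tol:
            return sweep
    return None


if __name__ == "__main__":
    print("reliable sweeps:", jacobi_iterations(16, 0.0))
    print("lossy sweeps:   ", jacobi_iterations(16, 0.2))
```

Starting from zero, the reliable iterates increase monotonically toward the
solution, and a process working from stale neighbor data can never be ahead
of them, so the lossy run needs at least as many sweeps as the reliable one.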

Furthermore, affordable assertions of correctness (and reasonable
convergence rates) in this unreliable landscape require statistical
characterization of the faults. If the faults are correlated in a way that
you have not accounted for, the claimed result can be wrong. Even worse,
the accuracy of this statistical characterization depends on the regularity
of the problem being solved. Most problems of great scientific and
industrial relevance (i.e., those applications that justify the existence
of an exascale machine) have poorly understood regularity. I'm not saying
that these "oblivious" algorithms don't have a place, but most existing
work is of extraordinarily low quality. The best hope would need huge
advances in the quality of the underlying mathematics (comparable to going
from stationary iterative methods to multigrid), and even then the result
would be a method that was still only relevant for "toy problems" (the
regularity of which is well understood).

Proposals for ASCR's "Resilient Solvers" applied math call are currently in
review. They contain a lot of interesting approaches, but the most
practical involve the use of problem structure to identify faults and to
rapidly recover lost state in the event of node failure and similar faults.
For the methods that I put forward in our proposal, one of the most
interesting things for an MPI/checkpointing library to provide is the
ability to connect a "spare" or restarted node to an existing communicator.
Once the node is attached to the communicator, the application will build a
subcommunicator involving neighbor processes for accelerated algorithmic
recovery. (We should start a new thread if you want to discuss this further.)
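
To make the neighbor-subcommunicator step concrete, here is a minimal sketch
of how an application might pick the ranks involved. It does not call MPI,
and everything in it is an assumption for illustration: the helper name
recovery_group, the row-major non-periodic 2-D process grid, and the idea
that the replacement process takes over the failed rank's number. The
returned list is the kind of thing one could pass to MPI_Group_incl followed
by MPI_Comm_create, or express as a color for MPI_Comm_split. Attaching the
spare to the communicator in the first place is the part that needs library
support and is not shown.

```python
def recovery_group(failed_rank, dims):
    """Ranks that would join a recovery subcommunicator on a 2-D process grid.

    Hypothetical helper: ranks are laid out row-major on a dims[0] x dims[1]
    non-periodic grid, and "neighbor" means sharing an edge with the failed
    rank. The replacement process is assumed to reuse failed_rank.
    """
    rows, cols = dims
    if not 0 <= failed_rank < rows * cols:
        raise ValueError("rank outside the process grid")
    r, c = divmod(failed_rank, cols)
    group = [failed_rank]
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        # Skip positions that fall off the edge of the grid.
        if 0 <= nr < rows and 0 <= nc < cols:
            group.append(nr * cols + nc)
    return sorted(group)


if __name__ == "__main__":
    # Interior rank on a 3 x 4 grid, then a corner rank.
    print(recovery_group(5, (3, 4)))
    print(recovery_group(0, (3, 4)))
```

Keeping the group this small is the point: only the processes that hold data
adjacent to the lost state take part in the recovery, and the rest of the
job does not have to synchronize with them.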