<div class="gmail_quote">On Fri, Sep 21, 2012 at 12:58 PM, Supalov, Alexander <span dir="ltr"><<a href="mailto:alexander.supalov@intel.com" target="_blank">alexander.supalov@intel.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi,<br>
<br>
Thanks. There may be another interesting twist to that.<br>
<br>
Imagine we start challenging the initial assumptions (axioms) that lie at the foundation of the MPI standard. Those are, e.g., reliable and pairwise ordered message delivery, non-fairness, and maybe a few more. If we take one out, what happens to MPI? What part of it survives? What applications become possible that were not possible before? This idea is somewhat comparable to non-Euclidean geometry, if you wish.<br>
<br>
E.g., take out reliable message delivery. This may be perfect for exascale: you will have to have rather stable algorithms anyway, and the loss of a couple of messages should not have a dramatic effect then, in this sea of data. And what kind of speedup might be possible with that? Gee.<br>
</blockquote><div><br></div><div>I recommend being _extremely_ careful with this line of thinking. By and large, these "fault-oblivious" methods converge so much more slowly than their "reliable" siblings that you would be better off running good methods on today's workstation than the "oblivious" methods on an exascale machine. Don't forget David Keyes' slide showing that algorithmic innovation has delivered roughly as many orders of magnitude of improvement in simulation capability over the past few decades as hardware improvements have. Most of these "oblivious" algorithms necessarily discard decades of that algorithmic innovation.</div>
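<div><br></div><div>To make that gap concrete, here is a small sequential C toy (my illustration, not anything from this thread): on a 1D Poisson model problem it compares conjugate gradients, which relies on a reliable global reduction every iteration, with a "fault-oblivious" Jacobi sweep in which each point update is dropped with probability P_DROP, as if the message carrying it had been lost. The constants N, TOL, and P_DROP are arbitrary; the point is only that the iteration counts should differ by orders of magnitude, and that most of the gap comes from the weak iteration rather than from the dropped updates.</div>
<div><br></div><div><pre>
/*
 * A sequential toy, not an MPI code: 1D Poisson (stencil -1 2 -1) solved by
 * (a) conjugate gradients, which needs a reliable global reduction every
 *     iteration, and
 * (b) a "fault-oblivious" Jacobi sweep in which each point update is
 *     dropped with probability P_DROP, as if the message carrying it
 *     had been lost.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N      200     /* number of interior unknowns          */
#define TOL    1e-6    /* relative residual tolerance          */
#define P_DROP 0.05    /* fraction of "lost" updates in Jacobi */

/* y = A*x for the 1D Laplacian with homogeneous Dirichlet boundaries */
static void apply_A(const double *x, double *y)
{
    for (int i = 0; i < N; i++) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i < N - 1) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

/* Conjugate gradients: the "reliable" method. */
static int cg(const double *b, double *x)
{
    double r[N], p[N], Ap[N];
    for (int i = 0; i < N; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rr = dot(r, r), rr0 = rr;
    int it = 0;
    while (sqrt(rr / rr0) > TOL) {
        apply_A(p, Ap);
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r), beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        it++;
    }
    return it;
}

/* Jacobi with randomly dropped updates: the "fault-oblivious" method. */
static int lossy_jacobi(const double *b, double *x)
{
    double xold[N], r[N];
    for (int i = 0; i < N; i++) x[i] = 0.0;
    double rnorm0 = sqrt(dot(b, b)), rnorm = rnorm0;
    int it = 0;
    while (rnorm / rnorm0 > TOL) {
        for (int i = 0; i < N; i++) xold[i] = x[i];
        for (int i = 0; i < N; i++) {
            if ((double)rand() / RAND_MAX < P_DROP) continue; /* update lost */
            double left  = (i > 0)     ? xold[i - 1] : 0.0;
            double right = (i < N - 1) ? xold[i + 1] : 0.0;
            x[i] = 0.5 * (b[i] + left + right);
        }
        apply_A(x, r);
        for (int i = 0; i < N; i++) r[i] = b[i] - r[i];
        rnorm = sqrt(dot(r, r));
        it++;
    }
    return it;
}

int main(void)
{
    double b[N], x[N];
    for (int i = 0; i < N; i++) b[i] = 1.0;
    printf("CG iterations:           %d\n", cg(b, x));
    printf("lossy Jacobi iterations: %d\n", lossy_jacobi(b, x));
    return 0;
}
</pre></div>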
<div><br></div><div>Furthermore, affordable assertions of correctness (and reasonable convergence rates) in this unreliable landscape require a statistical characterization of the faults. If the faults are correlated in a way that you have not accounted for, the claimed result can be wrong. Even worse, the accuracy of this statistical characterization depends on the regularity of the problem being solved, and most problems of great scientific and industrial relevance (i.e., the applications that justify the existence of an exascale machine) have poorly understood regularity. I'm not saying that these "oblivious" algorithms don't have a place, but most existing work is of extraordinarily low quality. Even the best hope, predicated on huge advances in the quality of the underlying mathematics (on the order of the jump from stationary iterative methods to multigrid), would be a method that is still only relevant for "toy problems", whose regularity is well understood.</div>
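<div><br></div><div>A toy illustration of the correlation point (again mine, not taken from any proposal): estimate a global sum when a fraction of the contributions is silently lost. If the losses are independent of the data, the obvious correction (rescale by 1/(1-p)) is essentially unbiased. If the losses are correlated with the size of the contribution, say because the most heavily loaded nodes fail most often, the aggregate loss rate is still exactly p, yet the same "statistically justified" correction is badly biased. All constants below are arbitrary.</div>
<div><br></div><div><pre>
/*
 * A toy demonstration, not from any proposal: estimating a global sum when
 * a fraction of the contributions is silently lost, with the same overall
 * loss rate but different correlation structure.
 */
#include <stdio.h>
#include <stdlib.h>

#define N      1000000
#define P_LOSS 0.05          /* advertised per-contribution loss rate */

static double urand(void) { return (double)rand() / RAND_MAX; }

int main(void)
{
    double exact = 0.0, surv_indep = 0.0, surv_corr = 0.0;
    for (int i = 0; i < N; i++) {
        double v = urand();
        double c = v * v * v * v;   /* skewed contributions: a few terms dominate */
        exact += c;

        /* Case 1: loss independent of the data. */
        if (urand() >= P_LOSS) surv_indep += c;

        /* Case 2: identical aggregate loss rate (E[5*P_LOSS*c] = P_LOSS because
         * E[c] = 1/5), but the largest contributions are the likeliest to vanish. */
        if (urand() >= 5.0 * P_LOSS * c) surv_corr += c;
    }
    /* The "statistically justified" correction in both cases: rescale by 1/(1-p). */
    double fix_indep = surv_indep / (1.0 - P_LOSS);
    double fix_corr  = surv_corr  / (1.0 - P_LOSS);
    printf("exact sum              : %.1f\n", exact);
    printf("corrected, independent : %.1f  (error %+.2f%%)\n",
           fix_indep, 100.0 * (fix_indep - exact) / exact);
    printf("corrected, correlated  : %.1f  (error %+.2f%%)\n",
           fix_corr,  100.0 * (fix_corr  - exact) / exact);
    return 0;
}
</pre></div>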
<div><br></div><div>Proposals for ASCR's "Resilient Solvers" applied math call are currently in review. They contain a lot of interesting approaches, but the most practical ones involve using problem structure to identify faults and to rapidly recover lost state in the event of node failure and similar faults. For the methods that I proposed, one of the most interesting things for an MPI or checkpointing library to provide is the ability to connect a "spare" or restarted process to an existing communicator. Once the new process is attached, the application would build a subcommunicator involving the neighboring processes for accelerated algorithmic recovery; a sketch follows below. (We should start a new thread if you want to discuss this further.)</div>
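<div><br></div><div>To make the request concrete, here is a rough sketch (mine, not part of the proposal text) of that plumbing using only standard MPI-2 dynamic-process calls: MPI_Comm_spawn, MPI_Intercomm_merge, and MPI_Comm_split. It assumes the surviving ranks still share a working communicator, which today would have to come from a checkpoint/restart or run-through-stabilization layer underneath, since MPI-2 by itself cannot repair a communicator after an actual failure. The names survivors, spare.exe, and is_neighbor_of_failed are illustrative, and the spare's own program would issue the matching merge/split calls on the intercommunicator returned by MPI_Comm_get_parent.</div>
<div><br></div><div><pre>
/*
 * A rough sketch using only standard MPI-2 dynamic-process calls.  It is
 * run by the surviving ranks, which are assumed to still share a working
 * communicator `survivors` (in practice supplied by the checkpoint/restart
 * layer; MPI-2 itself cannot repair a communicator after a real failure).
 * The spare's executable, `spare.exe`, would run the matching
 * MPI_Intercomm_merge/MPI_Comm_split calls on the intercommunicator
 * returned by MPI_Comm_get_parent.
 */
#include <mpi.h>

/* Hypothetical application callback: with a 1D domain decomposition the
 * neighbors of the failed rank are simply failed_rank +/- 1. */
static int is_neighbor_of_failed(int myrank, int failed_rank)
{
    return myrank == failed_rank - 1 || myrank == failed_rank + 1;
}

void attach_spare_and_recover(MPI_Comm survivors, int failed_rank)
{
    MPI_Comm spawned;   /* intercommunicator to the replacement process */
    MPI_Comm repaired;  /* intracommunicator: survivors + replacement   */
    MPI_Comm recovery;  /* small subcommunicator used to rebuild state  */
    int myrank, color;

    /* 1. Launch a replacement process.  A checkpointing library could
     *    equally hand us an intercommunicator to a pre-started spare,
     *    connected via MPI_Open_port/MPI_Comm_accept/MPI_Comm_connect.  */
    MPI_Comm_spawn("spare.exe", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0 /* root */, survivors, &spawned, MPI_ERRCODES_IGNORE);

    /* 2. Merge into a single intracommunicator so the replacement can
     *    take part in collectives with the survivors.                   */
    MPI_Intercomm_merge(spawned, 0 /* survivors ordered first */, &repaired);

    /* 3. Build the neighbor subcommunicator: the replacement plus the
     *    ranks whose subdomains border the lost one, which redundantly
     *    hold the boundary data needed for fast algorithmic recovery.   */
    MPI_Comm_rank(survivors, &myrank);
    color = is_neighbor_of_failed(myrank, failed_rank) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(repaired, color, myrank, &recovery);

    if (recovery != MPI_COMM_NULL) {
        /* ... exchange halo/checkpoint data and re-solve locally ... */
        MPI_Comm_free(&recovery);
    }
    MPI_Comm_free(&repaired);
    MPI_Comm_disconnect(&spawned);
}
</pre></div>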
</div>