[Mpi3-ft] Alternative approach/proposals

Ralph Castain rhc at open-mpi.org
Tue Feb 7 08:55:49 CST 2012

Hi folks

I'm a newbie to the group, so please excuse my ignorance of prior discussions. I have been reading the wiki documents with interest and have scanned the latest alternative proposals; these are some initial impressions.

Having worked on fault tolerance issues (albeit outside the MPI layer) for the past decade, it seems to me that the group is confronting the fact that FT techniques aren't as clear cut as we might desire - i.e., there are still multiple ways of dealing with faults, and no single approach clearly outshines the others. Thus, taking a minimalist approach to the standard, as both alternative proposals do, seems to me the best next step, while leaving researchers free to continue exploring the field.

What I'm wondering is whether the group might receive a more positive response from the Forum in general if it took a multi-step approach, rather than trying to fully standardize FT from the very start. I know I'm personally concerned about the probable race conditions and performance impacts of trying to fully resolve FT - having fought some of those battles over the years, I suspect we are some distance away from cleanly handling all scenarios, especially those involving multiple, near-simultaneous failures.

Perhaps we could start by proposing a standard calling for fault notification, possibly allowing that feature to be configured out so that users not requiring FT are unaffected by it. I note that both alternative proposals put forward tonight start with this basic capability, and it makes the most sense to me. If we could make MPI communication the equivalent of the more common "socket" IPC - i.e., if apps could be notified that a message failed to be delivered, much as a typical socket reports a delivery error - that would be a major step forward. Not blocking in a procedure, and letting the app determine how to recover (perhaps with a few simple tools as proposed by UTK), seems like something we can reasonably hope to implement without destabilizing the MPI code base.
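To make the socket analogy concrete, here is a minimal sketch using the Python standard library (purely illustrative of the behavior an app would want, not of any proposed MPI API): a send to a peer that has gone away surfaces as an error to the caller rather than blocking indefinitely, and the application chooses how to recover.

```python
import socket

# A connected pair of local stream sockets, standing in for two peers.
a, b = socket.socketpair()

# Simulate the peer failing by closing its end of the connection.
b.close()

# With sockets, a failed delivery comes back as an error (here, an
# exception) instead of a hang; the application decides what to do next.
try:
    a.send(b"hello")
    status = "delivered"
except BrokenPipeError:
    status = "delivery failed - application may now recover"

a.close()
print(status)
```

The key property is simply that the failure is reported at the call site; everything beyond that (aborting, rebuilding communicators, checkpoint/restart) can remain the application's choice.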

As for my suggestion that the standard allow implementations to enable such features via configuration: my rationale is simply that the fault frequency we see in industry is very low, as our clusters are relatively small. In talking with a few compatriots and customers, and reviewing the literature over the last few weeks, it appears that clusters of even a few thousand nodes see node failures (as opposed to the more common file system errors) only every few days to weeks, and that rapid checkpoint/restart capabilities are under development to render those manageable. The most widely used clusters (in terms of market share) are in the 64-128 node range and report node failures less than once a year after initial burn-in. Thus, while FT is a fascinating research topic, and one that will likely become more applicable to future "extreme scale" clusters, the large majority of MPI systems do not appear to require these capabilities.

It would therefore seem more likely to be acceptable to the broader community if we left FT support optional in some fashion - i.e., it should be present in a "quality" implementation, but not required to be "on" unless the implementation is specifically built that way. This would provide a broader research and "try out" capability without provoking negative reactions from those not requiring it.

HTH a bit, and I look forward to participating more in the discussions.
