[mpiwg-ft] User feedback
james.dinan at gmail.com
Mon Jun 16 10:25:21 CDT 2014
I was speaking recently with a commercial MPI user interested in fault
tolerance. I wanted to pass along the following feedback, which might be
useful to the working group:
- our application could probably continue to run for farmer/worker type loads
- it probably cannot continue to run in those cases where the workers talk
to each other, such as in lattice-Boltzmann type simulations that basically
use domain decomposition with neighbors exchanging halo cells, because in
this case data is lost on the failing node. We would need to roll back to
some kind of checkpoint.
- for farmer/worker type loads, we can work with any number of nodes. For
the other loads, we would need to have the same number of nodes - otherwise
we would need to do some serious reshuffling of data.
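
To make the farmer/worker case concrete, here is a minimal sketch of why such
loads can survive a lost worker and run on any number of nodes. It is a plain
Python simulation, not MPI code: the function run_farm, the worker names, and
the "failing" set are hypothetical and exist only for illustration. The point
is that a failed worker costs only its in-flight task, which the farmer can
hand to someone else.

```python
from collections import deque


def run_farm(tasks, workers, failing=frozenset()):
    """Simulate a farmer handing tasks to workers in round-robin order.

    `failing` is a hypothetical set of worker ids that die on their first
    assignment. Returns the results and the list of surviving workers.
    """
    pending = deque(tasks)
    alive = list(workers)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no workers left")
        worker = alive[turn % len(alive)]
        task = pending.popleft()
        if worker in failing:
            # The worker is lost: only its in-flight task is affected,
            # so drop the worker and put the task back in the queue.
            alive.remove(worker)
            pending.append(task)
            continue
        results[task] = task * task  # stand-in for the real computation
        turn += 1
    return results, alive


if __name__ == "__main__":
    results, alive = run_farm(range(6), ["w0", "w1", "w2"], failing={"w1"})
    print(sorted(results.items()), alive)
```

The halo-exchange case has no equivalent shortcut, since each node holds state
that its neighbors depend on and that state is gone when the node fails.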