<div dir="ltr">Hi All,<div><br></div><div>I was speaking recently with a commercial MPI user interested in fault tolerance. I wanted to pass along the following feedback, which might be useful to the working group:</div><div>
<br></div><div>- Our application could probably continue to run for farmer/worker-type workloads.<br></div><div>
<p class="">- It probably cannot continue to run when the workers communicate with each other, as in lattice-Boltzmann-type simulations that use domain decomposition with neighbors exchanging halo cells, because the data held on the failing node is lost. We would need to roll back to some kind of checkpoint.</p>
<p class="">- For farmer/worker-type workloads, we can work with any number of nodes. For the other workloads, we would need to keep the same number of nodes; otherwise we would need to do some serious reshuffling of data.</p></div>
<div>Best,</div><div> ~Jim.</div></div>