[mpiwg-ft] Issues with MPI_Comm_agree output num_agreed_failed
Whitlock, Matthew Joseph
mwhitlo at sandia.gov
Fri May 1 11:01:39 CDT 2026
Hello all,
The changes we've been suggesting for MPI_Comm_agree aren't as straightforward as we thought.
Change 1 is to output a num_agreed_failed such that the first num_agreed_failed entries of the groups returned by a subsequent call to MPI_Comm_get_failed will be similar across all ranks. However, we cannot guarantee the similarity of the fail groups unless new_failed is false. Consider:
MPI_Comm_iagree(..., &num_agreed_failed, &new_failed, &request);
MPI_Test(&request, ...); // This rank passes known failures upwards in agreement's reduce step, but bcast step is not complete
MPI_Recv(...); // this operation fails, local rank now knows of a new failed rank X
MPI_Wait(&request, ...); // agree completes with new_failed. Rank X is not counted in num_agreed_failed if it contributed to the agreement before failing, but it may be placed before any newly detected failures in the local failed group.
Change 2 is to rename num_known_failed to max_num_expected_failed and allow it to be larger than the number of locally known failures. However, any rank that passes a value larger than its local number of known failed ranks runs into the above problem even if new_failed is false.
So we could only say that the fail groups up to num_agreed_failed are guaranteed similar if new_failed is false and the passed max_num_expected_failed was <= the size of the local fail group on all ranks. That would be pretty awkward in the standard, so the more realistic solutions are to
A. internally repeat the agreement until !new_failed (performance implications; does not resolve the problem with change 2),
B. return a consistent agreed_failed group instead of num_agreed_failed (extra group allocation and potentially extra data in the agreement's bcast step, but both could be avoided with an MPI_GROUP_IGNORE), or
C. revert to the original function syntax.
I prefer B, which lets callers proceed optimistically, and safely, with a consistent returned group of failed ranks instead of calling MPI_Comm_agree in a loop until !new_failed. Implementations can pass a flag indicating whether the local rank called with MPI_GROUP_IGNORE during the reduce step, so the failed ranks need not be passed in the bcast step (if current implementations even have a way to avoid that).
Please consider and let me know your thoughts here or at the next meeting.
Thanks,
Matthew Whitlock