[mpiwg-ft] FTWG Action Items
wbland at mcs.anl.gov
Thu Dec 19 10:06:02 CST 2013
Since we’re not going to have a call next week due to the break, I thought it would be good to get started on the action items from the previous meeting and the surrounding discussions. The first two are things that we should all discuss; the next two are more minor things that I’ve been taking care of; the last one is important but will require more work on our part. Please take a look and provide feedback.
* Add another error code (MPI_ERR_FAILURE_PENDING?) to replace the use of MPI_ERR_PENDING, so that users do not have to scan all return codes to find out whether a failure has left a request pending.
* This is probably a good idea, as it’s become clear that overloading MPI_ERR_PENDING is confusing and adds overhead to what the user has to do to determine whether their request is still valid. Coming up with the name is the only challenge here.
* For now, I’ll change the text to say MPI_ERR_FAILURE_PENDING (23 characters < 30 character limit), but if we come up with something better, we can change it again.
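To make the difference concrete, here is a minimal sketch in plain C. It assumes stand-in constants instead of the real values from <mpi.h>, so it is only an illustration of the user-side check, not of any implementation. The SKETCH_* names and both helper functions are hypothetical; SKETCH_ERR_FAILURE_PENDING simply mirrors the placeholder name above and is not an existing MPI error code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in values for illustration only (assumption). A real program
 * would take MPI_SUCCESS, MPI_ERR_PENDING, etc. from <mpi.h>.
 * SKETCH_ERR_FAILURE_PENDING is hypothetical and mirrors the
 * placeholder name proposed in this thread. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_PENDING = 1,
    SKETCH_ERR_PROC_FAILED = 2,
    SKETCH_ERR_FAILURE_PENDING = 3
};

/* Overloaded code: a pending entry on its own is ambiguous, so the
 * caller has to scan every per-request code to learn whether a
 * process failure is what left requests pending. Returns 1 if so. */
int failure_seen_by_scan(const int *codes, size_t n)
{
    assert(codes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (codes[i] == SKETCH_ERR_PROC_FAILED)
            return 1;
    }
    return 0;
}

/* Dedicated code: the entry for a single request answers the
 * question without looking at any other request. */
int pending_due_to_failure(int code)
{
    return code == SKETCH_ERR_FAILURE_PENDING;
}
```

With the overloaded code the cost of the check grows with the number of requests; with a dedicated code it is a single comparison on the request of interest.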
* Function to query the status of a communicator (is it revoked or not).
* If this were done via a communicator attribute, it would be the only one I’m aware of that provides information to the user that is set by the library. All others are stored by the user and retrieved by the user.
* Do we want to add a new API to take care of this, or can we confirm that it’s ok to do this via an attribute?
* I believe the intention of the description of MPI_COMM_AGREE is to ensure that after the agreement, all failures are propagated to all processes in the communicator, which will result in the next MPI_COMM_FAILURE_ACK/GET_ACKED returning a complete group of failed processes as of the last call to AGREE. However, the text specifically says that all failures have been detected. I think this might be the wrong choice of words, since we don’t talk about detection anywhere else in the chapter other than to say that we don’t specify anything about failure detection. The text here should be tweaked.
* I’ve made changes to the text that I think make this clearer.
* We should add an example of a way to get a uniform group of failed processes via MPI_COMM_AGREE and FAILURE_ACK/GET_ACKED. I think this would be a good example of the relationship between the functions and it’s been a common question that we’ve received from users.
* I’ll work on drafting one of these.
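Until a real example is drafted, the intended relationship between the three functions can be sketched with a toy model in plain C. This is not MPI code and not the proposal text: the toy_* names, the bitmask representation, and the limit of 32 ranks are all assumptions made for illustration. Each function only models the effect described above (AGREE propagates known failures to every survivor, FAILURE_ACK snapshots what a rank knows, GET_ACKED returns that snapshot).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, not MPI (assumption): each rank keeps a bitmask of the
 * ranks it knows to have failed ("known") and a snapshot taken at its
 * last acknowledgement ("acked"). Bit i stands for rank i. */
typedef struct {
    uint32_t known;
    uint32_t acked;
} toy_rank;

/* Models the intended effect of MPI_COMM_AGREE: every surviving rank
 * learns of every failure known to any survivor. "dead" marks the
 * ranks that have actually failed and therefore do not participate. */
void toy_agree(toy_rank *ranks, size_t n, uint32_t dead)
{
    assert(n <= 32);
    uint32_t all = 0;
    for (size_t i = 0; i < n; i++) {
        if (!((dead >> i) & 1u))
            all |= ranks[i].known;
    }
    for (size_t i = 0; i < n; i++) {
        if (!((dead >> i) & 1u))
            ranks[i].known = all;
    }
}

/* Models MPI_COMM_FAILURE_ACK: snapshot the failures known so far. */
void toy_failure_ack(toy_rank *r)
{
    r->acked = r->known;
}

/* Models MPI_COMM_FAILURE_GET_ACKED: return the last snapshot. */
uint32_t toy_get_acked(const toy_rank *r)
{
    return r->acked;
}
```

In this model, if every surviving rank calls toy_agree and then toy_failure_ack, toy_get_acked returns the same set everywhere; without the agreement step, ranks that noticed different failures would report different sets.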
* We need to make sure that we are compatible with the endpoints and init/finalize proposals.
* The next step for this is probably to read through their proposals and come up with a rough outline of what our text would need to add to cover these new concepts (if anything).
* The endpoints text is here: https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/380
* I don’t think we have anything to look at for init/finalize. That proposal seems to have morphed a few times, but the current version is not about re-entrant MPI, but about a thread-safe Init/Finalize, which is completely orthogonal to our work. Probably nothing to do here.