[Mpi3-ft] API draft: Comments, questions, suggestions
rbarrett at ornl.gov
Tue Apr 21 13:20:22 CDT 2009
Pardon me for jumping in to a conversation that may have already taken
place, but I'll offer some comments regarding the proposed API anyway :) My
doc dated 20 Feb 2009, which I've attached here.
0.2 Initializing fault tolerance...
1. There are several parameters that may be set, each with a call to
MPI_COMM_SET_NAME. Would it make sense/be possible to define a structure (or
Fortran derived type) instead, one field per parameter? Default settings
upon instantiation of the struct/type, with user modifying for new settings.
May also want to keep the single-parameter setting function. I don't
think I feel strongly about this, just occurred to me that might be a
2. Also, has the Fortran setting of the recover functions been addressed? I
recall doing this a few years ago, required some acrobatics to pass a
Fortran function into a C world.
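
To make the struct idea in (1) concrete, here is a minimal sketch in plain C. Everything in it is an assumption for illustration: the type ft_params, its fields, the FT_RESTORE_* constants, and both functions are hypothetical names, not taken from the draft or from any MPI implementation. It only shows the pattern of defaults set at initialization, direct field modification by the user, and a single-parameter setter kept alongside.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy values; not part of the draft API. */
enum { FT_RESTORE_NONE = 0, FT_RESTORE_AUTO = 1 };

/* Hypothetical parameter block: one field per fault-tolerance setting. */
typedef struct {
    int    restore_policy;  /* assumed: how lost processes are restored */
    int    max_restarts;    /* assumed: cap on restarts per process */
    double detect_timeout;  /* assumed: failure-detection timeout, seconds */
} ft_params;

/* Give every field a default so the user only overrides what differs. */
void ft_params_init(ft_params *p)
{
    assert(p != NULL);
    p->restore_policy = FT_RESTORE_AUTO;
    p->max_restarts   = 3;
    p->detect_timeout = 30.0;
}

/* Single-parameter setter kept next to the struct; rejects negative counts.
 * Returns 0 on success, -1 if the value is refused. */
int ft_params_set_max_restarts(ft_params *p, int n)
{
    assert(p != NULL);
    if (n < 0)
        return -1;
    p->max_restarts = n;
    return 0;
}
```

A caller would declare an ft_params, call ft_params_init on it, change only the fields of interest, and hand the whole block to a (likewise hypothetical) initialization routine in one call. A Fortran derived type with default component initialization would play the same role on that side.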
0.3 Restoring MPI processes
1. Function names don't adhere to the MPI_COMM_ prefix of the others.
2. MPI_GET_LOST_COMMUNICATOR : Last word s/b shortened to COMM, right? And
with (1), perhaps MPI_COMM_RESTORE_LIST is more descriptive?
0.5 Communicator state
1. Seems that the two versions are more analogous to MPI_Wait and MPI_Test
rather than blocking and non-blocking. For example, the non-blocking query
does not (seem to) have a completion routine, i.e., one analogous to MPI_Wait
for, say, MPI_Irecv. And in fact the current text claims that ivalidate is
the asynchronous version of validate. At the risk of lengthening names,
these seem more like MPI_COMM_VALIDATE_WAIT and MPI_COMM_VALIDATE_TEST. Also,
the mention of "collective" in MPI_COMM_VALIDATE is not made for IVALIDATE,
but IVALIDATE is (effectively) collective, too, correct?
2. Would (1) then lead to function bloat, e.g.,
MPI_COMM_VALIDATE_WAIT/ANY/SOME/ALL? Ok, probably not ANY. And same for
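
The wait/test distinction in (1) can be sketched without MPI at all. The code below is a toy model only: validate_op, validate_test, and validate_wait are hypothetical names, and the "work" is a simulated counter rather than any real communicator validation. It is meant to show the semantics being argued for, namely that the test form returns immediately with a completion flag while the wait form returns only once the operation is done.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a validation that is still in progress. */
typedef struct {
    int steps_left;  /* simulated amount of work remaining */
} validate_op;

/* TEST-like query: makes at most one step of progress and returns at once.
 * *flag is set to 1 if the operation is complete, 0 otherwise. */
int validate_test(validate_op *op, int *flag)
{
    assert(op != NULL && flag != NULL);
    if (op->steps_left > 0)
        op->steps_left--;
    *flag = (op->steps_left == 0);
    return 0;
}

/* WAIT-like query: returns only after the operation has completed.
 * Here it is built by polling the test form, which is one possible design. */
int validate_wait(validate_op *op)
{
    int flag = 0;
    assert(op != NULL);
    while (!flag)
        validate_test(op, &flag);
    return 0;
}
```

In this model the caller of validate_test must keep polling (or fall back to validate_wait) to learn the outcome, which is the MPI_Test / MPI_Wait relationship rather than the MPI_Irecv / MPI_Wait one.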
Could a log file be generated (within MPI_Finalize), perhaps written to
/tmp/$USER, that lists fault tolerant "incidents", etc? For example, the
total number of restored processes, perhaps the mean of the "generation"?
PVM wrote a log file in this manner (forget what was in it), which the user
was (permitted to be) unaware of. Came in handy when, for example, a user
complained of something. Each process would maintain information, aggregated
upon termination. Could be overridden or otherwise managed via some
mechanism. Could envision a tool that monitors the individual process logs,
provides data to user code for writing to their log file, etc.
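
As a rough illustration of the aggregation step, the sketch below combines per-process records into the two figures mentioned above and writes them as text. It is plain C with no MPI calls, and all names (ft_proc_log, ft_summary, ft_aggregate, ft_write_summary) and the output format are assumptions made for the example. In a real implementation the per-process records would presumably be gathered or reduced across ranks inside MPI_Finalize; here they are simply passed in as an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-process record maintained during the run. */
typedef struct {
    int restored;    /* number of times this process was restored */
    int generation;  /* assumed meaning: restart generation of this process */
} ft_proc_log;

/* Hypothetical job-wide summary produced at termination. */
typedef struct {
    int    total_restored;
    double mean_generation;
} ft_summary;

/* Aggregate n per-process records; an empty input yields an all-zero summary. */
ft_summary ft_aggregate(const ft_proc_log *logs, size_t n)
{
    ft_summary s = { 0, 0.0 };
    long gen_sum = 0;
    size_t i;

    assert(logs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        s.total_restored += logs[i].restored;
        gen_sum += logs[i].generation;
    }
    if (n > 0)
        s.mean_generation = (double)gen_sum / (double)n;
    return s;
}

/* Write the summary as simple key=value lines. Returns 0 on success, -1 on
 * a write error. The caller chooses the stream (e.g. a file under /tmp). */
int ft_write_summary(FILE *out, const ft_summary *s)
{
    assert(out != NULL && s != NULL);
    if (fprintf(out, "restored_processes=%d\nmean_generation=%.2f\n",
                s->total_restored, s->mean_generation) < 0)
        return -1;
    return 0;
}
```

Keeping the aggregation separate from the writing would let the same summary be handed to user code for its own log file, as suggested above.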
Again, I hope I'm not intruding into well-trodden ground, but I would
greatly appreciate your feedback on this topic.
Application Performance Tools group
Computer Science and Mathematics Division
Oak Ridge National Laboratory