[Mpi3-ft] radical idea?

Darius Buntinas buntinas at mcs.anl.gov
Tue Jul 19 15:07:18 CDT 2011

Here are some ideas on how we can change the interface based on the feedback we got.  I haven't thought too deeply about all the implications, but it's a starting point for discussion.


Let's not call the (local) VALIDATE "VALIDATE", and let's split the UP/DOWN query out from RANK_NULLification.  To query for UP/DOWN, get a handle to an explicit "state" object (as opposed to an implicit "snapshot") and query that, e.g.:

    MPI_COMM_GET_STATE(comm, state_handle)
        IN:  MPI_COMM comm
        OUT: MPI_PROC_STATE state_handle
    and ditto for GROUP, FILE, WIN as necessary

    MPI_GET_PROC_STATE_SIZE(state_handle, mask, size)
        IN:  MPI_PROC_STATE state_handle
        IN:  int mask
        OUT: int size

    MPI_GET_PROC_STATE_LIST(state_handle, mask, list)
        IN:  MPI_PROC_STATE state_handle
        IN:  int mask
        OUT: int list[]

    MPI_GET_PROC_STATE_NEW(state_handle1, state_handle2, state_handle_new)
        IN:  MPI_PROC_STATE state_handle1
        IN:  MPI_PROC_STATE state_handle2
        OUT: MPI_PROC_STATE state_handle_new
    This gives newly failed processes in state_handle2 since state_handle1.

This addresses issues people had with different threads calling VALIDATE and resetting the "new" flag for other threads.
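
To make the threading point concrete, here is a rough C sketch of the explicit state object idea.  It does not use MPI at all: proc_state, PROC_UP/PROC_DOWN and the function names are hypothetical stand-ins for the calls above, and the choice to leave ranks that are not newly failed with no state bit set in the "new" object is my assumption, not part of the proposal.

```c
#include <assert.h>

/* Hypothetical names throughout: none of this is an actual MPI binding. */
enum { PROC_UP = 1 << 0, PROC_DOWN = 1 << 1 };  /* assumed mask bits */
#define MAX_PROCS 64

/* An explicit snapshot of per-rank UP/DOWN state, owned by the caller. */
typedef struct {
    int nprocs;
    unsigned char state[MAX_PROCS];  /* PROC_UP, PROC_DOWN, or 0 */
} proc_state;

/* Stand-in for MPI_GET_PROC_STATE_SIZE: count ranks matching mask. */
int proc_state_size(const proc_state *s, int mask)
{
    int n = 0;
    for (int r = 0; r < s->nprocs; r++)
        if (s->state[r] & mask)
            n++;
    return n;
}

/* Stand-in for MPI_GET_PROC_STATE_LIST: store matching ranks in list
 * and return how many were stored. */
int proc_state_list(const proc_state *s, int mask, int list[])
{
    int n = 0;
    for (int r = 0; r < s->nprocs; r++)
        if (s->state[r] & mask)
            list[n++] = r;
    return n;
}

/* Stand-in for MPI_GET_PROC_STATE_NEW: ranks that are DOWN in s2 but
 * were not DOWN in s1.  ASSUMPTION: all other ranks get state 0, so
 * they match no mask in the result. */
proc_state proc_state_new(const proc_state *s1, const proc_state *s2)
{
    assert(s1->nprocs == s2->nprocs);
    proc_state out = { .nprocs = s2->nprocs };
    for (int r = 0; r < out.nprocs; r++) {
        int newly = (s2->state[r] & PROC_DOWN) && !(s1->state[r] & PROC_DOWN);
        out.state[r] = newly ? PROC_DOWN : 0;
    }
    return out;
}
```

Since each thread holds its own pair of snapshots and computes the difference itself, no thread can reset another thread's notion of "new".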

We can then have an MPI_COMM_NULLIFY() function (or whatever we decide to call it) that would effectively set the rank to MPI_RANK_NULL:

    MPI_COMM_NULLIFY(comm, rank)
        IN:  MPI_COMM comm
        IN:  int rank

    MPI_COMM_NULLIFY_STATE(comm, mask, state_handle)
    This sets all ranks described by mask and state_handle to PROC_NULL.

    MPI_COMM_NULLIFY_GROUP(comm, group)
    This sets all procs in group to PROC_NULL in comm.  Logically the same as doing:
      foreach p in group
        MPI_COMM_NULLIFY(comm, rank-of-p-in-comm)

The operations would be idempotent and could be called on either live or failed processes.  Note that UP/DOWN and NORMAL/NULL are independent, so a process can be in any of four states: (UP or DOWN) x (NORMAL or NULL).
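
As a minimal sketch of that separation (plain C, no MPI; comm_model and the function names are hypothetical, and the group is simplified to an array of ranks already expressed in comm), the two properties can be tracked as independent flags:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 64

/* Hypothetical per-communicator bookkeeping; not an MPI structure. */
typedef struct {
    int  nprocs;
    bool down[MAX_PROCS];       /* UP/DOWN: changed only by failure detection */
    bool nullified[MAX_PROCS];  /* NORMAL/NULL: changed only by the caller    */
} comm_model;

/* Stand-in for MPI_COMM_NULLIFY.  Setting a flag that is already set
 * changes nothing, so repeated calls are idempotent, and the call never
 * looks at down[], so it works for live and failed ranks alike. */
void comm_nullify(comm_model *c, int rank)
{
    assert(rank >= 0 && rank < c->nprocs);
    c->nullified[rank] = true;
}

/* Stand-in for MPI_COMM_NULLIFY_GROUP, with the group given as an
 * array of n ranks in comm (a simplification of a real group handle). */
void comm_nullify_group(comm_model *c, const int ranks[], int n)
{
    for (int i = 0; i < n; i++)
        comm_nullify(c, ranks[i]);
}
```

Nullifying a rank never touches its UP/DOWN flag, which is what gives the four combinations.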

VALIDATE_ALL can be renamed to VALIDATE.  It returns a state_handle that can be queried for failed processes.  Then we can describe it as having the effect of deciding on a common set of failed processes across the comm, setting state_handle to that set, and calling MPI_COMM_NULLIFY() on each failed process in that set.

    MPI_COMM_VALIDATE(comm, state_handle)
        IN:  MPI_COMM comm
        OUT: MPI_PROC_STATE state_handle
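
A toy, single-process simulation of that description might look like the following.  Everything here is hypothetical (local_view, validate_all, NPROCS), and in particular the agreement step is modeled as a simple union of every process's local suspicions, which is my assumption; the proposal does not say how the common set is decided.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROCS 4

/* One simulated process's local knowledge; hypothetical, not MPI. */
typedef struct {
    bool suspects[NPROCS];   /* ranks this process believes are DOWN   */
    bool nullified[NPROCS];  /* ranks nullified in this process's comm */
} local_view;

/* Simulated collective: decide on a common failed set, report it in
 * agreed[], and nullify every rank in it on every process.
 * ASSUMPTION: "deciding" is modeled as the union of all local views. */
void validate_all(local_view views[NPROCS], bool agreed[NPROCS])
{
    /* Step 1: agree on the common set of failed ranks. */
    for (int r = 0; r < NPROCS; r++) {
        agreed[r] = false;
        for (int p = 0; p < NPROCS; p++)
            if (views[p].suspects[r])
                agreed[r] = true;
    }
    /* Step 2: every process adopts the set and nullifies those ranks. */
    for (int p = 0; p < NPROCS; p++)
        for (int r = 0; r < NPROCS; r++)
            if (agreed[r]) {
                views[p].suspects[r]  = true;
                views[p].nullified[r] = true;
            }
}
```

After the call, all simulated processes hold the same failed set and the same nullified ranks, even if they started with different local views.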

We may also want a function to "link" one comm's PROC_NULLified state with another's, so that if comm_A and comm_B are linked, calling MPI_COMM_NULLIFY on a process in comm_A also NULLifies that process in comm_B.  We can have a restriction that comm_B is a subset of comm_A or vice versa.
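
One way such a link could work is sketched below in plain C.  The names (linked_comm, comm_link, linked_nullify) and the explicit rank translation table are all hypothetical; a real implementation would presumably derive the translation from the communicators' groups.  The sketch propagates in both directions, which is one possible reading of "linked".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROCS 64

/* Hypothetical model of a communicator that may be linked to one peer. */
typedef struct linked_comm {
    int  nprocs;
    bool nullified[MAX_PROCS];
    struct linked_comm *peer;   /* linked communicator, or NULL        */
    int  to_peer[MAX_PROCS];    /* rank in peer, or -1 if absent there */
} linked_comm;

/* Link a and b.  a_to_b[r] is the rank in b of a's rank r, or -1 if
 * that process is not a member of b (b is a subset of a). */
void comm_link(linked_comm *a, linked_comm *b, const int a_to_b[])
{
    a->peer = b;
    b->peer = a;
    for (int r = 0; r < b->nprocs; r++)
        b->to_peer[r] = -1;
    for (int r = 0; r < a->nprocs; r++) {
        a->to_peer[r] = a_to_b[r];
        if (a_to_b[r] >= 0)
            b->to_peer[a_to_b[r]] = r;  /* build the reverse mapping */
    }
}

/* Nullify rank in c and, if c is linked, the same process in its peer.
 * The peer is updated directly rather than recursively, so the call
 * cannot bounce back and forth between the two communicators. */
void linked_nullify(linked_comm *c, int rank)
{
    assert(rank >= 0 && rank < c->nprocs);
    c->nullified[rank] = true;
    if (c->peer != NULL && c->to_peer[rank] >= 0)
        c->peer->nullified[c->to_peer[rank]] = true;
}
```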
