[Mpi3-ft] MPI_ANY_SOURCE

Josh Hursey jjhursey at open-mpi.org
Thu Oct 6 08:30:06 CDT 2011


Yeah that's a tough one...

Thread 0           Thread 1
--------------------------------------
                   Irecv(AS, TagB)[B]
Irecv(AS,TagA)[A]
/******** Process N fails ***********/
Wait()
->Warn
->Reenable()
->Wait()
                   Irecv(AS, TagC)[C]
                   Waitall()
                   --> ???
--------------------------------------

So [A] gets a warning, and Thread 0 fixes the communicator. What
should happen to [B] and [C]?
 - Should [B] and [C] both wait normally, since the warning was handled?
 - Should [B] return a warning (since it was logically before the
Reenable()), and [C] wait normally?
 - Should [B] and [C] both return a warning since it was not in there
thread context in which the Reenable was called?

The first solution seems to be what we have at the moment. The second
seems to imply that Reenable() needs to iterate over all pending
requests to mark them appropriately as 'to warn'. I don't think the
third option is viable since I'm not sure the concept of 'thread
context' is well founded in the standard - except for the 'main'
thread.

One solution is to say that the user is responsible for handling this
in their application though some complex locking scheme. I'm not 100%
convinced that this is a sound statement at the moment.

Another is what Darius suggested, that we Reenable each AS receive
request. We would still need a Reenable for the communicator so
blocking Recv(AS) could complete successfully. So we may have to
differentiate Reenable_blocking_AS(comm) and
Reenable_nonblocking_AS(request). Seems like this could get messy.



I've been thinking about two alternative designs that might be
something to consider (not abandoning the warning solution, but just a
thought experiment to try to and find something more elegant).

Alternative A:
--------------
Do nothing. ANY_SOURCE receives will wait for successful completion
without being interrupted by emerging failures.

The user is then responsible for canceling (or locally completing) the
AS receive operation when they determine that it cannot complete. They
would need to determine this out of band, possibly by polling in a
separate thread. We would need to think through an example
demonstrating how the user could do this.

With this we also lose our mechanism for general notification.
Previously we could use an MPI_Probe(ANY_SOURCE) to receive
notification of new failure without modifying the matching semantics.
With this we would need to introduce something new.

I like this only because it makes the MPI specification/implementation
simplier/easier. I don't like that it pushes the burden on the end
user.


Alternative B:
--------------
Per message callbacks. The user registers a special ANY_SOURCE warning
callback. If there are failures in the communicator, the callback is
fired and the user has to decide on a per-receive basis whether to
cancel or continue waiting.

The callback could be applied to both blocking and nonblocking receive
operations, and have a signature of something like:
  MPI_ANY_SOURCE_FAULT_CALLBACK(tag, comm, failed_grp)
The user would return 0 to continue waiting, and -1 to cancel the
operation and complete it with an error.

The problem with this is that the callback is that it may be difficult
to associate with a specific receive and line in the code. So it may
be difficult to figure out if it is ok to keep waiting from just the
matching bits and the group of failed processes. We also need to
specify what MPI calls can be used in the callback - I would say it is
restricted to something like MPI_Op, only local operations.



On Wed, Oct 5, 2011 at 3:10 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>
> I suspect that we'd get thread issues with the warning approach:
>
>    Two threads post AS receives.
>    Thread A does a Wait and gets the warning, then does a reenable_AS
>    Thread B does a Wait and hangs because all of its senders are dead
>
> We didn't have this problem before because thread B's receives would have completed with an error and so it would have gotten an error when it tried to wait on it.
>
> We handled this case by using reader-writer locks to prevent another thread from reenabling anysources between when the user checks that potential senders are alive and when it posts the receive.  Now we'd need to hold the lock between when we check for senders and when the request actually completes.
>
> Does that mean we need to explicitly reenable each AS receive?
>
> -d
>
>
> On Oct 5, 2011, at 1:34 PM, Josh Hursey wrote:
>
>> Currently we state:
>> ----------------------
>> 17.6 Point-to-Point Communication
>> ...
>> When a process detects a new process failure, the ability to perform
>> wildcard receives (i.e., receives where MPI_ANY_SOURCE has been
>> specified for the source parameter) will be disabled on all
>> communicators that contain the failed process. When wildcard receives
>> are disabled on a communicator, all pending wildcard receive
>> operations on that communicator are completed and an error with class
>> MPI_ERR_PROC_FAIL_STOP will be returned for those operations. Any new
>> wildcard receive operations posted to a communicator with disabled
>> wildcard receives will be immediately completed and return an error
>> code of the class MPI_ERR_PROC_FAIL_STOP.
>> Wildcard receives can be re-enabled with the
>> MPI_COMM_REENABLE_ANY_SOURCE function described below.
>> ----------------------
>>
>> The problem is that by completing the pending wildcard receives with
>> an error, we may cause unintended matching of concurrent receives. For
>> example:
>> -------------------
>> Proc 0                Proc 1
>> Irecv(ANY,TAGX); [A]
>> Irecv(1, TAGX); [B]
>> Waitall()
>> /******* Proc 2 fails *********/
>>                      Send(0, TAGX) [C]
>>                      Send(0, TAGX) [D]
>> -------------------
>> The intention is that Irecv[A] will match Send[C] and Irecv[B] will
>> match Send[D]. But if another process fails (proc 2 in this example),
>> then Irecv[A] will complete in error. Then Irecv[B] will match
>> Send[C], and Send[D] will remain unmatched.
>>
>>
>> We need a mechanism that provides notification to the application
>> waiting on an ANY_SOURCE receive that some process failed in this
>> communicator while still providing the necessary matching guarantees.
>> The user can then choose if they want to cancel the operation or
>> continue waiting.
>>
>>
>> On the teleconference today, we were thinking through a warning based
>> concept. Instead of completing the pending nonblocking ANY_SOURCE
>> receives, we would return a special warning code (say
>> MPI_WARN_PROC_FAIL_STOP). The message would -not- be completed, but
>> return to the user to decide if they wish to continue waiting or
>> cancel the offending receive operation.
>>
>> For blocking ANY_SOURCE receives, the user would be returned an error
>> (MPI_ERR_PROC_FAIL_STOP), and the operation would be completed in
>> error. Since we do not have a request handle, there is no way to keep
>> this receive active while the situation is resolved. So we complete
>> the receive with an error, in a sense canceling the receive operation.
>>
>> For nonblocking ANY_SOURCE receives, the user would be returned a
>> warning (MPI_WARN_PROC_FAIL_STOP), and the request would remain
>> active. The application would then be able to either re-enable
>> ANY_SOURCE (via the API call) or cancel the offending receive
>> operation. The user is in control over the matching guarantees since
>> the request remains active, but cannot complete without action from
>> the user.
>>
>>
>> In the example above, Proc 0 would return MPI_ERR_IN_STATUS from
>> MPI_WAITALL. The status of the active requests would be
>> MPI_ERR_PENDING (per section 3.7.3), except for the offending
>> ANY_SOURCE operations which will have the error field set to
>> MPI_WARN_PROC_FAIL_STOP. So Irecv[A] would be MPI_ERR_PENDING, and
>> Irecv[B] would be MPI_WARN_PROC_FAIL_STOP. The user can then associate
>> the error with a specific request. Once a request is identified the
>> user can either cancel it, or re-enable ANY_SOURCE on the communicator
>> and continue waiting.
>>
>>
>> So I think this fixes the matching issue with the interface, but I'm
>> not sure if it addresses the concerns of the hardware matching folks.
>> Additionally, this proposal adds a new state to the request concept
>> 'active but not completable without action by the user' and we need to
>> think if this requires semantic qualification elsewhere in the
>> standard.
>>
>>
>> What to folks think about this solution?
>>
>> -- Josh
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




More information about the mpiwg-ft mailing list