[Mpi3-ft] New FT API

Mon Aug 15 13:54:07 CDT 2011

I snipped stuff that I think we pretty much agree on.

On Aug 12, 2011, at 9:15 AM, Josh Hursey wrote:

>>> 
>>> - Thread safety: There is another question of what happens if:
>>>  ThreadA: MPI_Recv(comm, MPI_ANY_SOURCE)
>>>  --- Rank X fails ---
>>>  ThreadB: Notice a failure of Rank X
>>>  ThreadB: MPI_Comm_recognize(comm, {rankX})
>>> There is a race between when the error of Rank X failure is reported
>>> to ThreadA, and when ThreadB recognizes the failure. If ThreadB
>>> recognizes the failure before ThreadA is put on the run queue, should
>>> ThreadA return an error? or should it keep processing? I think it
>>> should return an error, and we should discourage the users from such
>>> constructs, but I could be convinced otherwise.
>> 
>> I think failure detection should be atomic wrt threads of a process.  As soon as the failure is detected, all anysource requests should be completed with an error, regardless of which thread detected it or which thread is waiting on the request.  Now it's possible that ThreadB recognizes the new failure before ThreadA returns from its receive.  But ThreadA can still check for new failures by doing a comm_group_failure then a group_difference with a group of failed processes it requested earlier.
> 
> So this is a slight modification on the scenario that you had above.
> Thread A                      Thread B
> ========                      ========
> Recognize(comm, groupA)
> Recv(comm, anysource)
> ----------- PROC X FAILS ----------------
>                             Recognize(comm, groupB) // groupB contains proc X
> Recv() returns ??
> 
> I think that the Recv() should fail since when it was posted Proc X
> was not recognized yet. So to avoid unintentionally blocking ThreadA,
> it should return an error. The question becomes how to implement this
> semantic.
> 
> One way I thought that we could implement it is with a counter on the
> request and communicator. When Recognize() is called it stores the
> number of known alive processes (P_A) on the communicator, and sets
> the boolean (is_any_source_enabled) to true. When a Recv(ANY_SOURCE)
> is posted then that number (P_A) is copied onto the request, and
> allowed to be posted since (is_any_source_enabled) is true. When a new
> process fails in the communicator the boolean (is_any_source_enabled)
> is set to false. (is_any_source_enabled) should be used to force the
> Recv request to return an error, but if Thread B calls Recognize()
> before that check on the request then the check could return thinking
> that all is well. But since when ThreadB calls Recognize() it updates
> the counter (P_A), then in the request check we check not only the
> (is_any_source_enabled) value but also the value of (P_A) stored on
> the request against the value of (P_A) stored on the communicator. If
> they are different then the request should return in error.
> 
> So that's a long way of saying, I think that the Recv() in ThreadA
> should return an error, and I think there is a straightforward way of
> providing this semantic without much overhead.

Yeah, I think that's kind of how we handle anysources, by incrementing a counter whenever a failure is detected to avoid checking for anysource-inactive communicators when nothing's failed since the last check.

>>> - 'notion of thread-specific state in the MPI standard?' From what I
>>> could find, I do not think there is a notion of thread specific state
>>> in the MPI standard. There is a concept of the 'main thread', but I
>>> think that is as far as the standard goes in this regard.
>> 
>> Yeah, the main thread was the only instance we could come up with here.
>> 
>> I think what we really need to make this thread safe is another object that keeps track of whether the communicator is anysource enabled or of which processes have been recognized. This object would then need to be passed in to all receive operations.  Each thread can manage its own object and enable/disable anysource receives as necessary.  Of course this means adding a new parameter to receive operations which I don't think would make the forum happy.
>> 
>> Instead, by using thread-local storage, we make this object implicit.  If we want to make it implicit to avoid adding parameters to receive, without using thread-local storage, we run into the thread safety issues.
>> 
>> So the options I see are:
>>    1. Change the API and add a parameter to receive function
>>    2. Make the state implicit but don't use thread-local storage and have thread safety
>>        problems
>>    3. Make the state implicit and use thread-local storage
>> 
>> I think 3 is the least evil of the three.
> 
> I think all three are evil for different reasons, but I agree that we
> must have a solution that allows the user to write a thread safe
> program.
> 
> So I think (1) is something we should try to avoid. Adding a new Recv
> function and a new object seems hack'ish. We got a fair amount of push
> back from the Forum for adding implicit state on the communicator, so
> (2) and (3) seem to bring that back a little more strongly than we
> have so far in this thread. Potentially having any_source enabled in
> one thread but not another on the same communicator is a weird
> semantic to introduce, so we should avoid trending that way.
> 
> The (4) option would be to make this the user's problem. I think it
> would be easier on all sides if we just said that the example you
> cited was erroneous/dangerous since two threads are mutating the state
> (is_any_source_enabled) of a shared object (comm), and the user is
> responsible for coordinating that action.
> 
> The first approach I thought of is below, though I admit that it is
> off-the-cuff so might be rough or wrong.
> 
> If the threads use a reader/writer like construct to make sure that
> Recognize (writer) is only called when there are no outstanding
> Recv(ANY_SOURCE)'s posted (readers) then I think the user can avoid
> this problem.
> 
> The complexity comes in (as it does in all FT programs) when we think
> of when Recognize() is expected to be called. I would expect it to be
> called in response to a detected process failure.
> 
> So the program might look like (need different tags to provide thread context):
> Thread A                      Thread B
> ========                      ========
> while(...) {                  while(...) {
> if(!Recv(comm, any, tagA))    if(!Recv(comm, any, tagB))
>  Recognize(comm, groupA)        Recognize(comm, groupB)
> }                             }
> 
> Then with a reader/writer like construct we would force the user to
> complete all outstanding Recv(comm, any_source) operations before
> calling Recognize() on the communicator. Since Recv(any_soruce) is
> guaranteed to complete if there is a new error then there is no
> deadlock in waiting for the Recv(any_source) to complete.
> 
> So adding the reader/write locks we would have:
> Thread A                      Thread B
> ========                      ========
> while(...) {                  while(...) {
> reader_enter()                reader_enter()
> if(!Recv(comm, any, tagA))    if(!Recv(comm, any, tagB))
>  reader_leave()                 reader_leave()
>  writer_enter()                 writer_enter()
>  Recognize(comm, groupA)        Recognize(comm, groupB)
>  writer_leave()                 writer_leave()
> else                          else
>  reader_leave()                 reader_leave()
> }                             }
> 
> And they might interleave like:
> 
> Thread A                      Thread B
> ========                      ========
> reader_enter()                reader_enter()
> Recv(comm, any, tagA)         Recv(comm, any, tagB)
> ----------- PROC X FAILS ----------------
> // Recv fails
> reader_leave()
> writer_enter() // waits
>                              // Recv fails
>                              reader_leave()
>                              writer_enter() // no-wait
>                              Recognize(comm, groupB)//{X}
>                              writer_leave()
>                              reader_enter() // waits
> ----------- PROC Y FAILS ----------------
> Recognize(comm, groupA)//{X,Y}
> writer_leave()
>                              Recv(comm, any, tagB)
> 
> Which leaves us with the same problem :/
> 
> The solution is to coordinate the action of the threads during the
> Recognize() (write). If you think about it, it is a little odd that
> both threads are calling Recognize(), so elect just one. And we need a
> thread_barrier() to make sure that ThreadB does not starve ThreadA by
> looping through numerous failed Recv(comm, any, tagB) while not
> allowing ThreadA to Recognize the failures.
> 
> Thread A                      Thread B
> ========                      ========
> while(...) {                  while(...) {
> reader_enter()                reader_enter()
> if(!Recv(comm, any, tagA))    if(!Recv(comm, any, tagB))
>  reader_leave()                 reader_leave()
>  writer_enter()                 writer_enter()
>  Recognize(comm, groupA)
>  thread_barrier()               thread_barrier()
>  writer_leave()                 writer_leave()
> else                          else
>  reader_leave()                 reader_leave()
> }                             }
> 
> 
> And they might interleave like:
> 
> Thread A                      Thread B
> ========                      ========
> reader_enter()                reader_enter()
> Recv(comm, any, tagA)         Recv(comm, any, tagB)
> ----------- PROC X FAILS ----------------
> // Recv fails
> reader_leave()
> writer_enter() // waits
>                              // Recv fails
>                              reader_leave()
>                              writer_enter() // no-wait
>                              thread_barrier // wait for threadA
> ----------- PROC Y FAILS ----------------
> Recognize(comm, groupA)//{X,Y}
> thread_barrier() //no-wait
> writer_leave()
>                              writer_leave()
>                              reader_enter()
>                              Recv(comm, any, tagB)
> reader_enter()
> Recv(comm, any, tagA)
> 
> 
> That seems to solve things (to me anyway), but we need a termination
> detection fix to make sure this eventually stops in a uniform manner.
> Otherwise we might get:
> 
> Thread A                      Thread B
> ========                      ========
> reader_enter()                reader_enter()
> Recv(comm, any, tagA)         Recv(comm, any, tagB)
>                              // Success
>                              reader_leave()
>                              // Exit loop and continue
> ----------- PROC X FAILS ----------------
> // Recv fails
> reader_leave()
> writer_enter() // no-wait
> ----------- PROC Y FAILS ----------------
> Recognize(comm, groupA)//{X,Y}
> thread_barrier() // wait for ThreadB to join... but it won't
> // Deadlock
> 
> 
> I think that the easiest way to do this (there are others that we
> could devise to reduce the synchronization overhead between threads)
> would be to not react to the error each time, but process them inline
> even if there are no failures.
> 
> Thread A                      Thread B
> ========                      ========
> while(...) {                  while(...) {
> reader_enter()                reader_enter()
> Recv(comm, any, tagA)         Recv(comm, any, tagB))
> reader_leave()                reader_leave()
> 
> writer_enter()                writer_enter()
> Recognize(comm, groupA)
> thread_barrier()              thread_barrier()
> writer_leave()                writer_leave()
> }                             }
> 
> So we never get the odd mismatch between one thread exiting the loop,
> and the other waiting for it to rendezvous in the error condition.
> That is while the two loops are synchronized.
> 
> Is it cumbersome, yea. But it does seem to allow the user to manage it
> at the application level. Some additional complexity is required when
> thinking about accounting for processes failures in an application
> since the timing of the error associated with the process failure can
> occur at any time so the program logic must be able to adapt to such
> events. I think this threading example highlights a race condition
> when managing the perceived state of a process group after an error.

Ah yes, reader/writer locks is the way to go here.  I think the following does the same thing, but doesn't require the threads to go through the loop the same number of times.  Like you said, cumbersome, but it's possible for the user to do use Recognize() safely.  (Note, I didn't check safety/correctness rigorously.)

int recognize_cnt = 0 //global

int my_cnt = recognize_cnt - 1 //local to thread or block
                               // - 1 to force a check in first loop
while(...) {
    reader_enter()
    if (my_cnt != recognize_cnt) {
        //check failed processes and decide if ok to continue
        if (!ok_to_continue) {
            reader_leave()
            break
        }
        my_cnt == recognize_cnt
    }
    if (!Recv(...any...)) {
        reader_leave()
        writer_enter()
        if (my_cnt != recognize_cnt) {
            //other thread recognized
            writer_leave()
            continue
        }
        Recognize()
        ++recognize_cnt
        writer_leave()
        continue
    }
    //process received message
}

Note that in the "check failed processes and decide if ok to continue" part, since we're holding the reader lock, it can be OK to access the failed-process group returned by Recognize, and avoid calling Comm_group_failed().

> It should be noted that the synchronization problem highlighted is not
> exclusive to MPI_ANY_SOURCE, but since we have the ability to NULLIFY
> a process we have the same problem there.
> 
> Thread A                      Thread B
> ========                      ========
> while(...) {                  while(...) {
> if(!Recv(comm, procA, tagA))  if(!Recv(comm, procA, tagB))
>  Nullify(comm, procA)          Nullify(comm, procA)
> }                             }
> 
> So the two threads would need to coordinate the Nullify (writer)
> between the Recv (reader) in the two threads.

Right, but here, I think we wouldn't need the extra variables.  I think all we would need is Nullify() and Is_nullified().

Are we OK with requiring the user to do this in order to use Recognize safely?

-d