[Mpi-forum] Progress Question

Jim Dinan james.dinan at gmail.com
Sun Oct 11 16:42:22 CDT 2020


I think the question boils down to whether the MPI standard allows that
deadlock to happen. For example, in the transmission failure scenario with
software recovery, would MPI require processes to wait for successful
delivery before they can return from the barrier call?
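
To make the scenario concrete, here is a toy sketch in Python. It is not real
MPI: ToyRank, barrier_exchange, and has_left_barrier are hypothetical names
invented for illustration. It assumes a two-rank barrier implemented as a token
exchange, retransmission done only in software inside library calls, and (the
very point in question) that a rank may leave the barrier before its own token
is confirmed delivered.

```python
from collections import deque


class ToyRank:
    """One process in a toy model where progress happens only inside library calls."""

    def __init__(self, rank):
        self.rank = rank
        self.retry_queue = deque()  # failed sends awaiting software retransmission
        self.inbox = []             # messages delivered to this rank

    def send(self, dest, msg, link_ok):
        # A transient link failure leaves the message queued for a later retry.
        if link_ok:
            dest.inbox.append(msg)
        else:
            self.retry_queue.append((dest, msg))

    def progress(self):
        # Stands in for entering the progress engine during a later library call.
        while self.retry_queue:
            dest, msg = self.retry_queue.popleft()
            dest.inbox.append(msg)


def barrier_exchange(r0, r1, drop_r0_token):
    # Hypothetical two-rank barrier: each rank sends a token to the other.
    r0.send(r1, "token", link_ok=not drop_r0_token)
    r1.send(r0, "token", link_ok=True)


def has_left_barrier(rank):
    # Assumption: a rank leaves as soon as it has received the peer's token.
    return "token" in rank.inbox


r0, r1 = ToyRank(0), ToyRank(1)
barrier_exchange(r0, r1, drop_r0_token=True)

# Rank 0 has left the barrier; rank 1 is still waiting for the dropped token.
left_before = (has_left_barrier(r0), has_left_barrier(r1))

# Only a later library call on rank 0 (e.g. MPI_Finalize) retries the send.
r0.progress()
left_after = (has_left_barrier(r0), has_left_barrier(r1))
```

In this model, if rank 0 only polls for the file and never calls progress()
again, rank 1 never leaves the barrier and never creates the file.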

 ~Jim.

On Sun, Oct 11, 2020 at 3:41 PM George Bosilca <bosilca at icl.utk.edu> wrote:

> In the scenario described by Jim, where a send must be reposted due to
> transient network failures, where the retransmission is handled in
> software, and where MPI provides no progress outside MPI calls, it seems
> plausible that an unexpected outcome will be reached (also assuming no
> timeouts or other corrective measures are taken by the network or software
> stack).
> - Rajeev’s example will eventually complete, because when the process
> impacted by the network issue reaches MPI_Finalize, the pending internal
> communications will be reissued and all processes will complete the
> MPI_Barrier.
> - In Jim’s original example a deadlock looks more likely, as there is no
> ensuing MPI call in which the message could be retransmitted.
>
> I don’t think we need transient network errors for such outcomes; it is
> enough to use a buffered send without follow-up MPI calls to reach the same
> delayed execution scenario.
>
> George.
>
> On Sun, Oct 11, 2020 at 14:28 Skjellum, Anthony via mpi-forum <
> mpi-forum at lists.mpi-forum.org> wrote:
>
>> Rajeev, No, I don't think so.  Did you all disagree with my reasoning?
>> Tony
>>
>>
>> Anthony Skjellum, PhD
>>
>> Professor of Computer Science and Chair of Excellence
>>
>> Director, SimCenter
>>
>> University of Tennessee at Chattanooga (UTC)
>>
>> tony-skjellum at utc.edu  [or skjellum at gmail.com]
>>
>> cell: 205-807-4968
>>
>>
>> ------------------------------
>> *From:* mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of
>> Thakur, Rajeev via mpi-forum <mpi-forum at lists.mpi-forum.org>
>> *Sent:* Sunday, October 11, 2020 2:23 PM
>> *To:* Jim Dinan <james.dinan at gmail.com>
>> *Cc:* Thakur, Rajeev <thakur at anl.gov>; Main MPI Forum mailing list <
>> mpi-forum at lists.mpi-forum.org>
>>
>> *Subject:* Re: [Mpi-forum] Progress Question
>>
>>
>> Does it mean that in the following program, although all processes have
>> called barrier, some process may not exit the barrier for 100 days?
>>
>>
>>
>> MPI_Init
>> MPI_Barrier
>> sleep(100 days)
>> MPI_Finalize
>>
>>
>>
>> Rajeev
>>
>>
>>
>>
>>
>> *From: *Jim Dinan <james.dinan at gmail.com>
>> *Date: *Sunday, October 11, 2020 at 10:31 AM
>> *To: *"Thakur, Rajeev" <thakur at anl.gov>
>> *Cc: *Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
>> *Subject: *Re: [Mpi-forum] Progress Question
>>
>>
>>
>> Hi Rajeev,
>>
>>
>>
>> Yes, that's the question and my initial answer was the same as yours.
>> However, we then started talking about the implementation of the barrier,
>> which led to the second example. For example, consider a situation where
>> there is an error in transmission and the implementation needs to enter the
>> progress engine to retry a send operation in software.
>>
>>
>>
>>  ~Jim.
>>
>>
>>
>> On Sat, Oct 10, 2020 at 5:10 PM Thakur, Rajeev <thakur at anl.gov> wrote:
>>
>> Jim,
>>
>>       I don’t fully understand your question. Is it “If all processes
>> reach MPI_Barrier, are they guaranteed to exit the barrier without the need
>> for any other MPI function to be called on any process?” I would say yes.
>>
>>
>>
>> Rajeev
>>
>>
>>
>>
>>
>> *From: *mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of
>> Jim Dinan via mpi-forum <mpi-forum at lists.mpi-forum.org>
>> *Reply-To: *Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
>> *Date: *Saturday, October 10, 2020 at 12:31 PM
>> *To: *Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
>> *Cc: *Jim Dinan <james.dinan at gmail.com>
>> *Subject: *[Mpi-forum] Progress Question
>>
>>
>>
>> Hi All,
>>
>>
>>
>> A colleague recently asked a question that I wasn't able to answer
>> definitively. Is the following code guaranteed to make progress?
>>
>>
>>
>> MPI_Barrier();
>> if rank == 1
>>     create_file("test")
>> if rank == 0
>>     while not_exists("test")
>>         sleep(1);
>>
>>
>>
>> That is, can rank 1 require rank 0 to make MPI calls after its return
>> from the barrier, in order for rank 1 to complete the barrier? If the code
>> were written as follows:
>>
>>
>>
>> isend(..., other_rank, &req[0])
>> irecv(..., other_rank, &req[1])
>> waitall(2, req)
>> if rank == 1
>>     create_file("test")
>> if rank == 0
>>     while not_exists("test")
>>         sleep(1);
>>
>>
>>
>> I think it would clearly not guarantee progress since the send data can
>> be buffered. Is the same true for barrier?
>>
>>
>>
>> Cheers,
>>
>>  ~Jim.
>>
>