[Mpi-forum] [EXT]: Progress Question

Jim Dinan james.dinan at gmail.com
Wed Oct 14 13:56:06 CDT 2020

The question does essentially boil down to whether a full fan-out of
nonblocking send/recv pairs followed by wait-all is a valid implementation
of MPI_Barrier. Reviewing the text that Dan cited for MPI 4.0:

> (§5.14, p234 MPI-2019-draft): “A correct, portable program must invoke
collective communications so that deadlock will not occur”

There isn't any convenient way the user can find out about remote
completion of a barrier (short of building their own barrier with
synchronous send). So we can either interpret the above statement as placing
a strong completion requirement on collectives (bad for performance), or
interpret it to mean that there is really no safe time at which a user
can call into a blocking external interface. The RMA progress passage that
Martin referenced seems to support this latter interpretation with the
sockets example given in the rationale.
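For concreteness, the fan-out construct in question might look like the following sketch (`fanout_barrier` is a hypothetical name; error handling omitted). Note that completing the receives does guarantee every other rank has entered the barrier, while the sends may complete locally out of a buffer, so no rank learns when the others have left:

```c
#include <stdlib.h>
#include <mpi.h>

/* Hypothetical fan-out "barrier": each rank posts a nonblocking
 * send/recv pair to every other rank, then waits on all of them. */
int fanout_barrier(MPI_Comm comm)
{
    int rank, size, n = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Request *req = malloc(2 * (size - 1) * sizeof(MPI_Request));
    for (int i = 0; i < size; i++) {
        if (i == rank) continue;
        MPI_Isend(NULL, 0, MPI_BYTE, i, 0, comm, &req[n++]);
        MPI_Irecv(NULL, 0, MPI_BYTE, i, 0, comm, &req[n++]);
    }

    /* Each receive completes only after the matching send was posted,
     * so returning implies all ranks have entered the barrier. The
     * sends, however, may be buffered: their local completion here
     * says nothing about remote completion. */
    MPI_Waitall(n, req, MPI_STATUSES_IGNORE);

    free(req);
    return MPI_SUCCESS;
}
```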


On Mon, Oct 12, 2020 at 11:17 AM HOLMES Daniel <d.holmes at epcc.ed.ac.uk>

> Hi Jim, et al,
> Unless the point-to-point pseudo-code given is proven to be a valid
> implementation of MPI_Barrier, reasoning about MPI_Barrier using it as
> a basis is unlikely to be edifying.
> I also have a (possibly flawed) implementation of MPI Barrier that
> exhibits some odd semantics/behaviours and I could use that to assert
> (likely incorrectly) that MPI Barrier is defined in a way that exhibits
> those semantics/behaviours. However, that serves no purpose, so I won’t
> dwell on it any further.
> I’m glad that someone responded with a reference to the MPI Standard,
> thanks Martin. In that vein, here’s my tuppence:
> The definition of MPI Barrier in the MPI Standard states (§5.3, p149 in
> MPI-2019-draft):
> “If comm is an intracommunicator, MPI_BARRIER blocks the caller until all
> group members have called it. The call returns at any process only after
> all group members have entered the call.”
> There is a happens-before between “all MPI processes have entered the MPI
> Barrier call” and “MPI processes are permitted to leave the call”. That’s
> it; that’s all MPI Barrier does/is required to ensure.
> There is no indication or requirement for alacrity. This appears to be a
> valid (although stupid) implementation:
> int MPI_Barrier(MPI_Comm comm) {
>    int ret = PMPI_Barrier(comm);
>    sleep(100 * 86400); /* 100 days */
>    return ret;
> }
> There is no indication or suggestion for how or when MPI processes become
> aware that the necessary pre-condition for returning control to the user
> has been satisfied. Some may become aware of this situation a significant
> amount of time before/after others. Local completion does not guarantee
> remote completion in MPI (except for passive-target RMA, e.g.
> MPI_Win_unlock).
> There is no indication or requirement that the necessary pre-condition is
> also a sufficient pre-condition, although we may wish to assume that and we
> may wish to clarify the wording of the MPI Standard to specify that
> explicitly. If the MPI Standard text were changed to “The call returns at
> any process immediately after all group members have entered the call”
> (replacing “only” with “immediately”) then (given the other usage of immediately in the MPI
> Standard) we could assume that the procedure becomes strong local
> (immediate) once the necessary pre-condition is met. Without the word
> “immediate” in the sentence, the return of the MPI procedure is permitted
> to require remote progress, i.e. after the necessary pre-condition is met,
> it becomes weak local (called local in the MPI Standard). Some MPI
> libraries (can, if configured in a particular way) provide strong progress;
> however, MPI only requires weak progress. Weak progress means it is
> permitted for remote progress to happen only during remote MPI procedure
> calls.
> So,
> If MPI required “returns immediately after...” (which it does not) then
> every MPI process would be required to ensure the remote completion of its
> “send” (as well as local completion of the “recv”) before it returns
> control to the user. This would mean that our intuitive feel for what
> MPI_Barrier should do would be correct and the suggested point-to-point
> code would be an incorrect implementation of MPI_Barrier.
> If MPI required strong progress (which it does not) then every MPI process
> would eventually become aware that it is permitted to return control to the
> user, without additional remote MPI procedure calls. This would mean that
> our intuitive feel for what MPI_Barrier should do would be correct and the
> suggested point-to-point code would be a correct implementation of
> MPI_Barrier.
> As it is, our intuitive feel for what MPI_Barrier should do is probably
> wrong (i.e. not what MPI actually specifies), or at least too optimistic
> because it depends on a high quality implementation that exceeds what is
> minimally specified by the MPI Standard as required.
> As it is, the MPI_Barrier in the original question does not guard against
> problems with the non-MPI file operations - indeed, adding it introduces a
> new possibility of a deadlock, which would not be present in the code
> without the MPI_Barrier operation.
> I would argue that the original code is therefore erroneous
> (incorrect/non-portable) because (§5.14, p234 MPI-2019-draft):
> “A correct, portable program must invoke collective communications so that
> deadlock will not occur”
> One correct program that achieves what the original looks like it might be
> trying to achieve (IMHO) is as follows:
> if (rank == 1)
>   create_file("test");
> MPI_Barrier();
> if (rank == 0)
>   while not_exists("test")
>     sleep(1);
> This program still assumes that the file creation actually creates the
> file and flushes it to a filesystem that makes it visible to the existence
> check, but that must already be true if the code without MPI is correct,
> i.e. adding MPI has not introduced a new problem to the code.
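A related sketch: since MPI-3, the nonblocking barrier lets a rank keep driving MPI progress while it waits, which sidesteps the weak-progress hazard discussed here. This is only an illustration (hypothetical helper, `MPI_COMM_WORLD` assumed, error handling omitted):

```c
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

/* Sketch: complete the barrier with MPI_Test so that rank 0 keeps
 * making MPI calls (and thus driving progress) before it starts
 * busy-waiting on the file. */
static void barrier_then_wait_for_file(int rank)
{
    MPI_Request req;
    int done = 0;

    if (rank == 1) {
        FILE *f = fopen("test", "w");   /* create_file("test") */
        if (f) fclose(f);
    }

    MPI_Ibarrier(MPI_COMM_WORLD, &req);

    if (rank == 0) {
        /* Poll MPI_Test until the barrier completes; each call is an
         * MPI call, so other ranks' barriers can make progress. */
        while (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        for (;;) {                      /* while not_exists("test") */
            FILE *f = fopen("test", "r");
            if (f) { fclose(f); break; }
            sleep(1);
        }
    } else {
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
}
```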
> Taking this reasoning about the minimal requirements of MPI Barrier (at
> least) one step too far, the only restriction on implementation of
> MPI_Barrier seems to be “do not return until <something happens>”, which
> suggests this is a valid (although very unhelpful) implementation:
> int MPI_Barrier(MPI_Comm comm) {
>    while (1); // do not return, ever
> }
> To guard against low-quality/malicious implementations of the MPI
> Standard, we could either clarify the wording of the text about MPI_Barrier
> (and probably the text about every other MPI procedure) to include the
> concept of becoming an “immediate” procedure once certain criteria are met
> (likely to be a lot of effort/angst for some), or mandate strong progress
> for all MPI libraries (likely to be very unpopular for some).
> Cheers,
> Dan.
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.holmes at epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh,
> EH8 9BT
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
> On 12 Oct 2020, at 10:04, Martin Schulz via mpi-forum <
> mpi-forum at lists.mpi-forum.org> wrote:
> Hi Jim, all,
> We had a similar discussion (in a smaller circle) during the terms
> discussions. To my understanding, all bets are off as soon as you add
> dependencies and wait conditions outside of MPI, as here with the file. A
> note to this point is in a rationale (Section 11.7, page 491 in the 2019
> draft); based on that, an MPI implementation is allowed to deadlock (or
> cause a deadlock). If all dependencies were in MPI calls, then “eventual”
> progress should be guaranteed, even if it comes after the 100 days in
> Rajeev’s example: that would, as far as I understand, still be correct
> behavior, as no MPI call is guaranteed to return in a fixed finite time
> (all calls are at best “weak local”).
> Martin
> --
> Prof. Dr. Martin Schulz, Chair of Computer Architecture and Parallel
> Systems
> Department of Informatics, TU-Munich, Boltzmannstraße 3, D-85748 Garching
> Member of the Board of Directors at the Leibniz Supercomputing Centre (LRZ)
> Email: schulzm at in.tum.de
> *From: *mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of
> Jim Dinan via mpi-forum <mpi-forum at lists.mpi-forum.org>
> *Reply-To: *Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
> *Date: *Sunday, 11. October 2020 at 23:41
> *To: *"Skjellum, Anthony" <Tony-Skjellum at utc.edu>
> *Cc: *Jim Dinan <james.dinan at gmail.com>, Main MPI Forum mailing list <
> mpi-forum at lists.mpi-forum.org>
> *Subject: *Re: [Mpi-forum] [EXT]: Progress Question
> You can have a situation where the isend/irecv pair completes at process 0
> before process 1 has called irecv or waitall. Since process 0 is then busy
> waiting on the file, it makes no further MPI calls that could progress
> rank 1’s operations, which can result in deadlock.
>  ~Jim.
> On Sat, Oct 10, 2020 at 2:17 PM Skjellum, Anthony <Tony-Skjellum at utc.edu>
> wrote:
> Jim, OK, my attempt at answering below.
> See if you agree with my annotations.
> -Tony
> Anthony Skjellum, PhD
> Professor of Computer Science and Chair of Excellence
> Director, SimCenter
> University of Tennessee at Chattanooga (UTC)
> tony-skjellum at utc.edu  [or skjellum at gmail.com]
> cell: 205-807-4968
> ------------------------------
> *From:* mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of
> Jim Dinan via mpi-forum <mpi-forum at lists.mpi-forum.org>
> *Sent:* Saturday, October 10, 2020 1:31 PM
> *To:* Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
> *Cc:* Jim Dinan <james.dinan at gmail.com>
> *Subject:* [EXT]: [Mpi-forum] Progress Question
> *External Email*
> Hi All,
> A colleague recently asked a question that I wasn't able to answer
> definitively. Is the following code guaranteed to make progress?
> MPI_Barrier();
> -- everything is uncertain to within one message, if layered on pt2pt;
> --- let's assume a power of 2, and recursive doubling (RD).
> --- At each stage, it posts an irecv and isend to its corresponding
> element in RD
> --- All stages must complete to get to the last stage.
> --- At the last stage, it appears like your example below for N/2
> independent process pairs, which appears always to complete.
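The recursive-doubling pairing described in the annotations above reduces to simple rank arithmetic: with a power-of-two number of ranks, stage k pairs each rank with the rank that differs from it in bit k (a sketch; `rd_partner` is a hypothetical helper):

```c
/* In a recursive-doubling barrier over a power-of-two number of
 * ranks, stage k pairs each rank with the rank that differs from
 * it in bit k; all pairs exchange at every stage. */
int rd_partner(int rank, int stage)
{
    return rank ^ (1 << stage);
}
/* e.g. with 8 ranks, rank 5 exchanges with ranks 4, 7, and 1
 * over the log2(8) = 3 stages. */
```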
> if rank == 1
>   create_file("test")
> if rank == 0
>    while not_exists("test")
>        sleep(1);
> That is, can rank 1 require rank 0 to make MPI calls after its return from
> the barrier, in order for rank 1 to complete the barrier? If the code were
> written as follows:
> isend(..., other_rank, &req[0])
> irecv(..., other_rank, &req[1])
> waitall(2, req)
> --- Assume both isends buffer on the send-side and return
> immediately--valid.
> --- Both irecvs are posted, but unmatched as yet.  Nothing has transferred
> on network.
> --- Waitall would mark the isends done at once, and work to complete the
> irecvs; in
>      that process, each would have to progress the isends across the
> network. On this comm
>      and all comms, incidentally.
> --- When waitall returns, the data has transferred to the receiver,
> otherwise the irecvs
>       aren't done.
> if rank == 1
>   create_file("test")
> if rank == 0
>    while not_exists("test")
>        sleep(1);
> I think it would clearly not guarantee progress since the send data can be
> buffered. Is the same true for barrier?
> Cheers,
>  ~Jim.
> _______________________________________________
> mpi-forum mailing list
> mpi-forum at lists.mpi-forum.org
> https://lists.mpi-forum.org/mailman/listinfo/mpi-forum