[Mpi-forum] [EXT]: Progress Question

Jim Dinan james.dinan at gmail.com
Wed Oct 14 13:56:06 CDT 2020


The question essentially boils down to whether a full fan-out of
nonblocking send/recv pairs followed by a wait-all is a valid implementation
of MPI_Barrier. Reviewing the text that Dan cited for MPI 4.0:

> (§5.14, p234 MPI-2019-draft): “A correct, portable program must invoke
collective communications so that deadlock will not occur”
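
For reference, the fan-out implementation in question would look something
like the following sketch (illustration only; assumes <mpi.h> and
<stdlib.h>):

int barrier_fanout(MPI_Comm comm) {
  int rank, size, n = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  MPI_Request *reqs = malloc(2 * (size - 1) * sizeof(MPI_Request));
  for (int i = 0; i < size; i++) {
    if (i == rank) continue;
    MPI_Isend(NULL, 0, MPI_BYTE, i, 0, comm, &reqs[n++]);
    MPI_Irecv(NULL, 0, MPI_BYTE, i, 0, comm, &reqs[n++]);
  }
  /* a rank returns as soon as its own requests complete locally,
     which says nothing about completion at the other ranks */
  int ret = MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
  free(reqs);
  return ret;
}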

There isn't any convenient way for the user to find out about remote
completion of a barrier (short of building their own barrier with
synchronous sends; see the sketch below). So we can either interpret the
above statement as placing a strong completion requirement on collectives
(bad for performance), or interpret it to mean that there is really no safe
time at which a user can call into a blocking external interface. The RMA
progress passage that Martin referenced seems to support the latter
interpretation, given the sockets example in its rationale.
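
By contrast, a user-built barrier with synchronous sends, as mentioned
above, does give each process some knowledge of remote progress, because
MPI_Ssend does not return until the matching receive has started. A
token-ring sketch, for illustration only:

void barrier_ssend_ring(MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  if (size == 1) return;
  int left = (rank + size - 1) % size, right = (rank + 1) % size;
  /* two sweeps of a zero-byte token around the ring: the first sweep
     proves every rank has entered, the second releases every rank */
  for (int sweep = 0; sweep < 2; sweep++) {
    if (rank == 0) {
      MPI_Ssend(NULL, 0, MPI_BYTE, right, sweep, comm);
      MPI_Recv(NULL, 0, MPI_BYTE, left, sweep, comm, MPI_STATUS_IGNORE);
    } else {
      MPI_Recv(NULL, 0, MPI_BYTE, left, sweep, comm, MPI_STATUS_IGNORE);
      MPI_Ssend(NULL, 0, MPI_BYTE, right, sweep, comm);
    }
  }
}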

 ~Jim.

On Mon, Oct 12, 2020 at 11:17 AM HOLMES Daniel <d.holmes at epcc.ed.ac.uk>
wrote:

> Hi Jim, et al,
>
> Unless the point-to-point pseudo-code given is proven to be a valid
> implementation of MPI Barrier, reasoning about MPI Barrier using it as a
> basis is unlikely to be edifying.
> I also have a (possibly flawed) implementation of MPI Barrier that
> exhibits some odd semantics/behaviours and I could use that to assert
> (likely incorrectly) that MPI Barrier is defined in a way that exhibits
> those semantics/behaviours. However, that serves no purpose, so I won’t
> dwell on it any further.
>
> I’m glad that someone responded with a reference to the MPI Standard,
> thanks Martin. In that vein, here’s my tuppence:
>
> The definition of MPI Barrier in the MPI Standard states (§5.3, p149 in
> MPI-2019-draft):
> “If comm is an intracommunicator, MPI_BARRIER blocks the caller until all
> group members have called it. The call returns at any process only after
> all group members have entered the call.”
>
> There is a happens-before between “all MPI processes have entered the MPI
> Barrier call” and “MPI processes are permitted to leave the call”. That’s
> it; that’s all MPI Barrier does/is required to ensure.
>
> There is no indication or requirement for alacrity. This appears to be a
> valid (although stupid) implementation:
> int MPI_Barrier(MPI_Comm comm) {
>    int ret = PMPI_Barrier(comm);
>    sleep(100 * 24 * 60 * 60); /* 100 days */
>    return ret;
> }
>
> There is no indication or suggestion for how or when MPI processes become
> aware that the necessary pre-condition for returning control to the user
> has been satisfied. Some may become aware of this situation a significant
> amount of time before/after others. Local completion does not guarantee
> remote completion in MPI (except for passive-target RMA, e.g.
> MPI_Win_unlock).
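>
> For example (a sketch; assumes a window win has already been created and
> target/value are defined):
>
> MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
> MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
> /* when MPI_Win_unlock returns, the put has completed at the target,
>    not merely locally at the origin */
> MPI_Win_unlock(target, win);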
>
> There is no indication or requirement that the necessary pre-condition is
> also a sufficient pre-condition, although we may wish to assume that and we
> may wish to clarify the wording of the MPI Standard to specify that
> explicitly. If the MPI Standard text were changed to “The call returns at
> any process immediately after all group members have entered the call.”
> (i.e. replacing “only” with “immediately”), then (given the other usage of
> immediately in the MPI Standard) we could assume that the procedure becomes
> strong local (immediate) once the necessary pre-condition is met. Without the word
> “immediate” in the sentence, the return of the MPI procedure is permitted
> to require remote progress, i.e. after the necessary pre-condition is met,
> it becomes weak local (called local in the MPI Standard). Some MPI
> libraries (can, if configured in a particular way) provide strong progress;
> however, MPI only requires weak progress. Weak progress means it is
> permitted for remote progress to happen only during remote MPI procedure
> calls.
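>
> Concretely, this means a process that blocks on a non-MPI condition may
> need to keep calling into MPI so that its outstanding operations can
> complete at its peers. A sketch, reusing the hypothetical not_exists
> helper from the original example:
>
> int flag;
> while (not_exists("test")) {
>   /* any cheap MPI call enters the progress engine */
>   MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
>              &flag, MPI_STATUS_IGNORE);
>   sleep(1);
> }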
>
> So,
>
> If MPI required “returns immediately after...” (which it does not) then
> every MPI process would be required to ensure the remote completion of its
> “send” (as well as local completion of the “recv”) before it returns
> control to the user. This would mean that our intuitive feel for what
> MPI_Barrier should do would be correct and the suggested point-to-point
> code would be an incorrect implementation of MPI_Barrier.
> If MPI required strong progress (which it does not) then every MPI process
> would eventually become aware that it is permitted to return control to the
> user, without additional remote MPI procedure calls. This would mean that
> our intuitive feel for what MPI_Barrier should do would be correct and the
> suggested point-to-point code would be a correct implementation of
> MPI_Barrier.
>
> As it is, our intuitive feel for what MPI_Barrier should do is probably
> wrong (i.e. not what MPI actually specifies), or at least too optimistic,
> because it depends on a high-quality implementation that exceeds what the
> MPI Standard minimally requires.
> As it is, the MPI_Barrier in the original question does not guard against
> problems with the non-MPI file operations - indeed, adding it introduces a
> new possibility of a deadlock, which would not be present in the code
> without the MPI_Barrier operation.
>
> I would argue that the original code is therefore erroneous
> (incorrect/non-portable) because (§5.14, p234 MPI-2019-draft):
> “A correct, portable program must invoke collective communications so that
> deadlock will not occur”
>
> One correct program that achieves what the original looks like it might be
> trying to achieve (IMHO) is as follows:
> if (rank == 1)
>   create_file("test");
> MPI_Barrier();
> if (rank == 0)
>   while not_exists("test")
>     sleep(1);
> This program still assumes that the file creation actually creates the
> file and flushes it to a filesystem that makes it visible to the existence
> check, but that must already be true if the code-without-MPI is correct,
> i.e. adding MPI has not introduced a new problem to the code.
>
> Taking this reasoning about the minimal requirements of MPI Barrier (at
> least) one step too far, the only restriction on implementation of
> MPI_Barrier seems to be “do not return until <something happens>”, which
> suggests this is a valid (although very unhelpful) implementation:
> int MPI_Barrier(MPI_Comm comm) {
>    while (1); // do not return, ever
> }
>
> To guard against low-quality/malicious implementations of the MPI
> Standard, we could either clarify the wording of the text about MPI_Barrier
> (and probably the text about every other MPI procedure) to include the
> concept of becoming an “immediate” procedure once certain criteria are met
> (likely to be a lot of effort/angst for some), or mandate strong progress
> for all MPI libraries (likely to be very unpopular for some).
>
> Cheers,
> Dan.
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.holmes at epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh,
> EH8 9BT
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
> On 12 Oct 2020, at 10:04, Martin Schulz via mpi-forum <
> mpi-forum at lists.mpi-forum.org> wrote:
>
> Hi Jim, all,
>
> We had a similar discussion (in a smaller circle) during the terms
> discussions – at least to my understanding, all bets are off as soon as you
> add dependencies and wait conditions outside of MPI, like here with the
> file. A note to this point is in a rationale (Section 11.7, page 491 in the
> 2019 draft) – based on that, an MPI implementation is allowed to deadlock
> (or cause a deadlock). If all dependencies were in MPI calls, then
> “eventual” progress should be guaranteed – even if it comes after the 100
> days in Rajeev’s example: that would, as far as I understand, still be
> correct behavior, as no MPI call is guaranteed to return in a fixed finite
> time (all calls are at best “weak local”).
>
> Martin
>
>
>
> --
> Prof. Dr. Martin Schulz, Chair of Computer Architecture and Parallel
> Systems
> Department of Informatics, TU-Munich, Boltzmannstraße 3, D-85748 Garching
> Member of the Board of Directors at the Leibniz Supercomputing Centre (LRZ)
> Email: schulzm at in.tum.de
>
>
>
> From: mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of
> Jim Dinan via mpi-forum <mpi-forum at lists.mpi-forum.org>
> Reply-To: Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
> Date: Sunday, 11 October 2020 at 23:41
> To: "Skjellum, Anthony" <Tony-Skjellum at utc.edu>
> Cc: Jim Dinan <james.dinan at gmail.com>, Main MPI Forum mailing list
> <mpi-forum at lists.mpi-forum.org>
> Subject: Re: [Mpi-forum] [EXT]: Progress Question
>
> You can have a situation where the isend/irecv pair completes at process 0
> before process 1 has called irecv or waitall. Since process 0 is then busy
> waiting on the file, it will not make progress on MPI calls, which can
> result in deadlock.
>
>  ~Jim.
>
> On Sat, Oct 10, 2020 at 2:17 PM Skjellum, Anthony <Tony-Skjellum at utc.edu>
> wrote:
>
> Jim, OK, my attempt at answering below.
>
> See if you agree with my annotations.
>
> -Tony
>
>
> Anthony Skjellum, PhD
> Professor of Computer Science and Chair of Excellence
> Director, SimCenter
> University of Tennessee at Chattanooga (UTC)
> tony-skjellum at utc.edu  [or skjellum at gmail.com]
> cell: 205-807-4968
>
>
> ------------------------------
> From: mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of
> Jim Dinan via mpi-forum <mpi-forum at lists.mpi-forum.org>
> Sent: Saturday, October 10, 2020 1:31 PM
> To: Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
> Cc: Jim Dinan <james.dinan at gmail.com>
> Subject: [EXT]: [Mpi-forum] Progress Question
>
> Hi All,
>
> A colleague recently asked a question that I wasn't able to answer
> definitively. Is the following code guaranteed to make progress?
>
>
> MPI_Barrier();
> -- everything is uncertain to within one message, if layered on pt2pt;
> --- let's assume a power of 2, and recursive doubling (RD).
> --- At each stage, it posts an irecv and isend to its corresponding
> element in RD
> --- All stages must complete to get to the last stage.
> --- At the last stage, it appears like your example below for N/2
> independent process pairs, which appears always to complete.
> if rank == 1
>   create_file("test")
> if rank == 0
>    while not_exists("test")
>        sleep(1);
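>
> (For reference, the recursive-doubling structure Tony describes would be
> something like this sketch; assumes the number of ranks is a power of
> two:)
>
> void barrier_rd(MPI_Comm comm) {
>   int rank, size;
>   MPI_Comm_rank(comm, &rank);
>   MPI_Comm_size(comm, &size);
>   for (int mask = 1; mask < size; mask <<= 1) {
>     int peer = rank ^ mask; /* partner at this stage */
>     MPI_Request req[2];
>     MPI_Isend(NULL, 0, MPI_BYTE, peer, mask, comm, &req[0]);
>     MPI_Irecv(NULL, 0, MPI_BYTE, peer, mask, comm, &req[1]);
>     MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
>   }
> }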
>
>
> That is, can rank 1 require rank 0 to make MPI calls after its return from
> the barrier, in order for rank 1 to complete the barrier? If the code were
> written as follows:
>
>
> isend(..., other_rank, &req[0])
> irecv(..., other_rank, &req[1])
> waitall(2, req)
> --- Assume both isends buffer on the send-side and return
> immediately--valid.
> --- Both irecvs are posted, but unmatched as yet.  Nothing has transferred
> on network.
> --- Waitall would mark the isends done at once, and work to complete the
> irecvs; in that process, each would have to progress the isends across
> the network. On this comm and all comms, incidentally.
> --- When waitall returns, the data has transferred to the receiver,
> otherwise the irecvs aren't done.
> if rank == 1
>   create_file("test")
> if rank == 0
>    while not_exists("test")
>        sleep(1);
>
>
> I think it would clearly not guarantee progress since the send data can be
> buffered. Is the same true for barrier?
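>
> For reference, the second example written out as a complete two-rank
> program (a sketch; create_file and not_exists are the hypothetical
> helpers from above):
>
> #include <mpi.h>
> #include <unistd.h>
>
> int main(int argc, char **argv) {
>   int rank;
>   MPI_Request req[2];
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   int other_rank = 1 - rank; /* assumes exactly two ranks */
>
>   MPI_Isend(NULL, 0, MPI_BYTE, other_rank, 0, MPI_COMM_WORLD, &req[0]);
>   MPI_Irecv(NULL, 0, MPI_BYTE, other_rank, 0, MPI_COMM_WORLD, &req[1]);
>   MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
>
>   if (rank == 1)
>     create_file("test");
>   if (rank == 0)
>     while (not_exists("test"))
>       sleep(1);
>
>   MPI_Finalize();
>   return 0;
> }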
>
> Cheers,
>  ~Jim.
>
>
>
>
>

