[Mpi-forum] [EXT]: Progress Question

HOLMES Daniel d.holmes at epcc.ed.ac.uk
Mon Oct 12 10:17:34 CDT 2020


Hi Jim, et al,

Unless the point-to-point pseudo-code given is proven to be a valid implementation of MPI_Barrier, reasoning about MPI_Barrier using it as a basis is unlikely to be edifying.
I also have a (possibly flawed) implementation of MPI Barrier that exhibits some odd semantics/behaviours and I could use that to assert (likely incorrectly) that MPI Barrier is defined in a way that exhibits those semantics/behaviours. However, that serves no purpose, so I won’t dwell on it any further.

I’m glad that someone responded with a reference to the MPI Standard, thanks Martin. In that vein, here’s my tuppence:

The definition of MPI Barrier in the MPI Standard states (§5.3, p149 in MPI-2019-draft):
“If comm is an intracommunicator, MPI_BARRIER blocks the caller until all group members have called it. The call returns at any process only after all group members have entered the call.”

There is a happens-before relationship between “all MPI processes have entered the MPI_Barrier call” and “MPI processes are permitted to leave the call”. That’s it; that’s all MPI_Barrier does/is required to ensure.

There is no indication or requirement for alacrity. This appears to be a valid (although stupid) implementation:
int MPI_Barrier(MPI_Comm comm) {
   int ret = PMPI_Barrier(comm);
   sleep(100 * 24 * 60 * 60); /* 100 days */
   return ret;
}

There is no indication or suggestion for how or when MPI processes become aware that the necessary pre-condition for returning control to the user has been satisfied. Some may become aware of this situation a significant amount of time before/after others. Local completion does not guarantee remote completion in MPI (except for passive-target RMA, e.g. MPI_Win_unlock).

There is no indication or requirement that the necessary pre-condition is also a sufficient pre-condition, although we may wish to assume that and we may wish to clarify the wording of the MPI Standard to specify that explicitly. If the MPI Standard text were changed to “The call returns at any process <strike>only</strike> immediately after all group members have entered the call.” then (given the other usage of immediately in the MPI Standard) we could assume that the procedure becomes strong local (immediate) once the necessary pre-condition is met.

Without the word “immediate” in the sentence, the return of the MPI procedure is permitted to require remote progress, i.e. after the necessary pre-condition is met, it becomes weak local (called local in the MPI Standard). Some MPI libraries (can, if configured in a particular way) provide strong progress; however, MPI only requires weak progress. Weak progress means it is permitted for remote progress to happen only during remote MPI procedure calls.

So,

If MPI required “returns immediately after...” (which it does not) then every MPI process would be required to ensure the remote completion of its “send” (as well as local completion of the “recv”) before it returns control to the user. This would mean that our intuitive feel for what MPI_Barrier should do would be correct and the suggested point-to-point code would be an incorrect implementation of MPI_Barrier.
If MPI required strong progress (which it does not) then every MPI process would eventually become aware that it is permitted to return control to the user, without additional remote MPI procedure calls. This would mean that our intuitive feel for what MPI_Barrier should do would be correct and the suggested point-to-point code would be a correct implementation of MPI_Barrier.

As it is, our intuitive feel for what MPI_Barrier should do is probably wrong (i.e. not what MPI actually specifies), or at least too optimistic, because it depends on a high-quality implementation that exceeds what the MPI Standard minimally requires.
As it is, the MPI_Barrier in the original question does not guard against problems with the non-MPI file operations - indeed, adding it introduces a new possibility of a deadlock, which would not be present in the code without the MPI_Barrier operation.

I would argue that the original code is therefore erroneous (incorrect/non-portable) because (§5.14, p234 MPI-2019-draft):
“A correct, portable program must invoke collective communications so that deadlock will not occur”

One correct program that achieves what the original looks like it might be trying to achieve (IMHO) is as follows:
if (rank == 1)
  create_file("test");
MPI_Barrier(MPI_COMM_WORLD);
if (rank == 0)
  while (not_exists("test"))
    sleep(1);
This program still assumes that the file creation actually creates the file and flushes it to a filesystem that makes it visible to the existence check, but that must be true if the code without MPI is correct, i.e. adding MPI has not introduced a new problem to the code.

Taking this reasoning about the minimal requirements of MPI Barrier (at least) one step too far, the only restriction on implementation of MPI_Barrier seems to be “do not return until <something happens>”, which suggests this is a valid (although very unhelpful) implementation:
int MPI_Barrier(MPI_Comm comm) {
   while (1); // do not return, ever
}

To guard against low-quality/malicious implementations of the MPI Standard, we could either clarify the wording of the text about MPI_Barrier (and probably the text about every other MPI procedure) to include the concept of becoming an “immediate” procedure once certain criteria are met (likely to be a lot of effort/angst for some), or mandate strong progress for all MPI libraries (likely to be very unpopular for some).

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Architect (HPC Research)
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 12 Oct 2020, at 10:04, Martin Schulz via mpi-forum <mpi-forum at lists.mpi-forum.org> wrote:

Hi Jim, all,

We had a similar discussion (in a smaller circle) during the terms discussions – at least to my understanding, all bets are off as soon as you add dependencies and wait conditions outside of MPI, like here with the file. A note to this point is in a rationale (Section 11.7, page 491 in the 2019 draft) – based on that, an MPI implementation is allowed to deadlock (or cause a deadlock). If all dependencies were in MPI calls, then “eventual” progress should be guaranteed – even if it is after the 100 days in Rajeev’s example: that would – as far as I understand – still be correct behavior, as no MPI call is guaranteed to return in a fixed finite time (all calls are at best “weak local”).

Martin



--
Prof. Dr. Martin Schulz, Chair of Computer Architecture and Parallel Systems
Department of Informatics, TU-Munich, Boltzmannstraße 3, D-85748 Garching
Member of the Board of Directors at the Leibniz Supercomputing Centre (LRZ)
Email: schulzm at in.tum.de



From: mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of Jim Dinan via mpi-forum <mpi-forum at lists.mpi-forum.org>
Reply-To: Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
Date: Sunday, 11. October 2020 at 23:41
To: "Skjellum, Anthony" <Tony-Skjellum at utc.edu>
Cc: Jim Dinan <james.dinan at gmail.com>, Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
Subject: Re: [Mpi-forum] [EXT]: Progress Question

You can have a situation where the isend/irecv pair completes at process 0 before process 1 has called irecv or waitall. Since process 0 is now busy waiting on the file, it will not make progress on MPI calls and can result in deadlock.

 ~Jim.

On Sat, Oct 10, 2020 at 2:17 PM Skjellum, Anthony <Tony-Skjellum at utc.edu> wrote:
Jim, OK, my attempt at answering below.

See if you agree with my annotations.

-Tony


Anthony Skjellum, PhD
Professor of Computer Science and Chair of Excellence
Director, SimCenter
University of Tennessee at Chattanooga (UTC)
tony-skjellum at utc.edu [or skjellum at gmail.com]
cell: 205-807-4968


________________________________
From: mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of Jim Dinan via mpi-forum <mpi-forum at lists.mpi-forum.org>
Sent: Saturday, October 10, 2020 1:31 PM
To: Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
Cc: Jim Dinan <james.dinan at gmail.com>
Subject: [EXT]: [Mpi-forum] Progress Question

Hi All,

A colleague recently asked a question that I wasn't able to answer definitively. Is the following code guaranteed to make progress?

MPI_Barrier();
-- everything is uncertain to within one message, if layered on pt2pt;
--- let's assume a power of 2, and recursive doubling (RD).
--- At each stage, it posts an irecv and isend to its corresponding element in RD
--- All stages must complete to get to the last stage.
--- At the last stage, it appears like your example below for N/2 independent process pairs, which appears always to complete.
if rank == 1
  create_file("test")
if rank == 0
   while not_exists("test")
       sleep(1);

That is, can rank 1 require rank 0 to make MPI calls after its return from the barrier, in order for rank 1 to complete the barrier? If the code were written as follows:

isend(..., other_rank, &req[0])
irecv(..., other_rank, &req[1])
waitall(2, req)
--- Assume both isends buffer on the send-side and return immediately--valid.
--- Both irecvs are posted, but unmatched as yet.  Nothing has transferred on network.
--- Waitall would mark the isends done at once, and work to complete the irecvs; in that process, each would have to progress the isends across the network. On this comm and all comms, incidentally.
--- When waitall returns, the data has transferred to the receiver, otherwise the irecvs aren't done.
if rank == 1
  create_file("test")
if rank == 0
   while not_exists("test")
       sleep(1);

I think it would clearly not guarantee progress since the send data can be buffered. Is the same true for barrier?

Cheers,
 ~Jim.

_______________________________________________
mpi-forum mailing list
mpi-forum at lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpi-forum
