[mpiwg-sessions] Cross-session progress
Rolf Rabenseifner
rabenseifner at hlrs.de
Sun Oct 31 14:45:30 CDT 2021
Dear Martin,
> consistent and progressing). In the particular example, it is bad for the
> library owning session A to return from a call assuming "someone else" makes
> progress.
No, exactly this is not the responsibility of the programmer.
It is just an effect of the freedom given to implementors of MPI libraries
and of timing - which call came first, the MPI_Bsend in the neighbor process
or the MPI_Recv in the other process.
This means that every normal halo data exchange that is implemented with
MPI_Bsend and MPI_Recv calls in software layer A (e.g. CFD),
followed by another halo exchange in software layer B,
may already cause such a deadlock, and the application programmer
and the library writer have no idea why the application causes
a deadlock in 10% of the runs.
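
A minimal sketch of that pattern (the halo_exchange routine, the neighbor
ranks, and the two communicators are illustrative, and a buffer is assumed
to have been attached with MPI_Buffer_attach):

  /* Each layer performs an ordinary halo exchange on its own communicator. */
  void halo_exchange(MPI_Comm comm, int left, int right, double *halo, int n)
  {
      /* MPI_Bsend is local: it copies into the attached buffer and returns. */
      MPI_Bsend(halo, n, MPI_DOUBLE, right, 0, comm);
      MPI_Recv (halo, n, MPI_DOUBLE, left,  0, comm, MPI_STATUS_IGNORE);
  }

  /* Called back to back by two independently written layers: */
  halo_exchange(comm_A, left, right, halo_A, n);  /* layer A (e.g. CFD) */
  halo_exchange(comm_B, left, right, halo_B, n);  /* layer B */

If the MPI_Recv in layer A on one process returns as soon as its incoming
message has arrived, before the data of its own MPI_Bsend has actually been
pushed out, and progress is then restricted to session B, the neighbor can
hang forever in layer A's MPI_Recv while this process hangs in layer B's
MPI_Recv - exactly the timing-dependent deadlock described above.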
> also not making it easier for the user. However, I just wanted to point out
One may interpret this sentence as saying that this spirit evolved
in the absence of a deeper knowledge of the progress rules of MPI.
(Which is not a surprise, because they are still hard to learn
from the very short wording in MPI-3.1 - and now a bit better
after the many discussions in the MPI Forum.)
I expect that this "spirit" must be changed.
Best regards
Rolf
----- Original Message -----
> From: "Martin Schulz" <schulzm at in.tum.de>
> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> Cc: "mpiwg-sessions" <mpiwg-sessions at lists.mpi-forum.org>
> Sent: Sunday, October 31, 2021 5:28:23 PM
> Subject: Re: [mpiwg-sessions] Cross-session progress
> Hi Rolf, all,
>
> Fully agree with all you said - what I meant with close connection is that the
> calls are in the same program in a strict order and hence connected. If those
> were in separate libraries, then one could also argue that these are badly
> designed libraries (each library should be designed so that it is in itself
> consistent and progressing). In the particular example, it is bad for the
> library owning session A to return from a call assuming "someone else" makes
> progress.
>
> I do agree, though, this certainly is a different meaning for progress and it
> is also not making it easier for the user. However, I just wanted to point out
> that this is against the spirit of sessions, where we wanted full resource
> isolation. With cross-session progress, each call to MPI (or a global
> progress engine) has to go across all sessions and push the state of MPI
> forward (with clear scalability implications). With that, the concept of fault
> isolation (a fault happens in one session and the others are not affected at
> all) is much harder to implement.
>
> I am also currently not arguing for one or the other - I see merits and problems
> on either side.
>
> Martin
>
>
> --
> Prof. Dr. Martin Schulz, Chair of Computer Architecture and Parallel Systems
> Department of Informatics, TU-Munich, Boltzmannstraße 3, D-85748 Garching
> Member of the Board of Directors at the Leibniz Supercomputing Centre (LRZ)
> Email: schulzm at in.tum.de
>
>
>
>On 31.10.21, 11:47, "Rolf Rabenseifner" <rabenseifner at hlrs.de> wrote:
>
> Dear Martin,
>
> > However, I do wonder if this is the right thing to do - cross-session progress
> > does imply that there is a quite close connection and sharing of resources
> > between sessions, which is exactly what we wanted to avoid.
>
> Please read my example very carefully.
>
> > Process 0:
> >
> > MPI_Bsend(…, dest=1, comm_A); // Call 0-A
> >
> > MPI_Recv(…, source=1, comm_B); // Call 0-B
> >
> > Process 1:
> >
> > MPI_Recv(…, source=0, comm_A); // Call 1-A
> >
> > MPI_Send(…, dest=0, comm_B); // Call 1-B
>
> The communication in Session A / comm_A and the communication in Session B /
> comm_B
> is absolutely without any connection to each other, i.e., this example
> is definitely the contrary of "a quite close connection".
>
> The problem arises from the fact that all local routines, or calls that
> must act locally - like the MPI_Recv in 1-A once the MPI_Bsend in 0-A
> has been called (which may be some time before the MPI_Recv is called) -
> can be implemented as "weakly local", a term that is not defined in MPI
> but which should mean that such a call may not return until a specific
> other process (here Process 0) calls an unspecific (i.e., not
> semantically related) MPI procedure.
>
> If an MPI library does not use this freedom, i.e., if it makes progress
> with one asynchronous thread for all sessions (a), or with a separate such
> thread for each session (b), then the problem does not exist.
>
> Then also no such info key is needed, and in case of (b), a very clear
> separation of sessions is given (whether several such threads are efficient
> is another question, but one about the quality of an MPI library
> implementation, not about the MPI standard).
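
As a user-level illustration of option (a) - not the library-internal
mechanism itself, and with all names invented for this sketch - one can
dedicate a thread that keeps entering the MPI library so that pending
operations are progressed while the main thread computes or blocks:

  #include <mpi.h>
  #include <pthread.h>
  #include <stdatomic.h>

  static atomic_int keep_progressing = 1;

  /* Run as a pthread; requires MPI_Init_thread with MPI_THREAD_MULTIPLE. */
  static void *progress_loop(void *arg)
  {
      MPI_Comm comm = *(MPI_Comm *)arg;
      int flag;
      while (atomic_load(&keep_progressing)) {
          /* MPI_Iprobe is local, but each call enters the MPI library,
           * which typically drives its progress engine. */
          MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag,
                     MPI_STATUS_IGNORE);
      }
      return NULL;
  }

Such a thread, started after initialization and stopped (atomic_store plus
pthread_join) before finalization, makes the timing-dependent blocking
described above disappear; whether one such thread serves all sessions (a)
or each session gets its own (b) is then purely a question of
implementation quality.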
>
> In my opinion, there is no reason for changing the progress rule,
> which would make MPI more complicated for the user, because there
> are perfect options for the implementors, and all of them are (and
> should be) invisible to the question of whether a given MPI
> application is correct.
>
> Kind regards
> Rolf
>
>
> ----- Original Message -----
> > From: "Martin Schulz" <schulzm at in.tum.de>
> > To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>, "mpiwg-sessions"
> > <mpiwg-sessions at lists.mpi-forum.org>
> > Sent: Sunday, October 31, 2021 10:49:18 AM
> > Subject: Re: [mpiwg-sessions] Cross-session progress
>
> > Hi Rolf, all,
> >
> > I agree that under the current rules, cross-session progress is required and
> > there is probably very little room to maneuver.
> >
> > However, I do wonder if this is the right thing to do - cross-session progress
> > does imply that there is a quite close connection and sharing of resources
> > between sessions, which is exactly what we wanted to avoid. Also, one could
> > claim that if you have a program that has such a close connection that blocking
> > in one session can harm progress in another, then you have written a bad
> > program, as clearly the two communication operations are logically connected
> > and hence should not be in two sessions.
> >
> > We also had the idea once that it may be useful that different MPI
> > implementations back different sessions, which would then mean that they cannot
> > be connected and also would not be able to know about the progress in the other
> > session.
> >
> > This would, of course, require new text that significantly changes progress
> > rules in MPI (not in the WPM, but in the Sessions Model) with a whole bunch of
> > consequences, but it would be matching the original idea of Sessions as
> > independent access points into the MPI library.
> >
> > Martin
> >
> >
> > --
> > Prof. Dr. Martin Schulz, Chair of Computer Architecture and Parallel Systems
> > Department of Informatics, TU-Munich, Boltzmannstraße 3, D-85748 Garching
> > Member of the Board of Directors at the Leibniz Supercomputing Centre (LRZ)
> > Email: schulzm at in.tum.de
> >
> >
> >
> >On 31.10.21, 10:00, "mpiwg-sessions on behalf of Rolf Rabenseifner via
> >mpiwg-sessions" <mpiwg-sessions-bounces at lists.mpi-forum.org on behalf of
> >mpiwg-sessions at lists.mpi-forum.org> wrote:
> >
> > Dear Dan and Joseph,
> >
> > I expect that such an info key makes no sense, because
> > the following example and the related statement show that
> > we always have to require cross-session progress:
> >
> > _______________
> > The definition of "local" in MPI and the related progress rules
> > always allow that a local MPI routine, or an MPI call that must
> > behave as local, does not return until, in another process, an
> > unspecific, i.e., semantically not related, MPI call happens
> > (which is always guaranteed, because at the latest an MPI
> > finalizing call must be invoked, and this one is allowed to
> > block until all necessary progress has happened).
> >
> > Let comm_A and comm_B be two communicators derived from
> > two different sessions or one of them being part of the
> > world model.
> > They may be used in two different software layers which are
> > independently programmed.
> > The following program would cause a deadlock if the MPI_RECV
> > that matches an MPI_BSEND may not return until such an
> > unspecific MPI call happens in the process that called
> > MPI_BSEND, and we would require that this unspecific
> > MPI call is made in the same session as the MPI_BSEND.
> >
> > Process 0:
> >
> > MPI_Bsend(…, dest=1, comm_A); // Call 0-A
> >
> > MPI_Recv(…, source=1, comm_B); // Call 0-B
> >
> > Process 1:
> >
> > MPI_Recv(…, source=0, comm_A); // Call 1-A
> >
> > MPI_Send(…, dest=0, comm_B); // Call 1-B
> >
> > As long as Call 1-A does not return, Call 1-B is not executed,
> > and therefore Call 0-B cannot return, and therefore Process 0
> > cannot issue any further MPI call. This implies that
> > Call 0-B must be the semantically not related MPI call
> > in Process 0 that provides the progress for Call 1-A.
> > This very simple example shows that cross-session progress is needed.
> > ___________________________
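
A minimal runnable sketch of this example, assuming an MPI-4.0 library
with the Sessions API (the string tags are illustrative; the process-wide
MPI_Buffer_attach is a simplification - MPI-4.1 also offers
MPI_Session_attach_buffer); run it with 2 processes:

  #include <mpi.h>
  #include <stdlib.h>

  /* Create a communicator over the mpi://WORLD process set of a fresh session. */
  static MPI_Comm comm_from_new_session(MPI_Session *session, const char *tag)
  {
      MPI_Group group;
      MPI_Comm comm;
      MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, session);
      MPI_Group_from_session_pset(*session, "mpi://WORLD", &group);
      MPI_Comm_create_from_group(group, tag, MPI_INFO_NULL,
                                 MPI_ERRORS_ARE_FATAL, &comm);
      MPI_Group_free(&group);
      return comm;
  }

  int main(void)
  {
      MPI_Session sess_A, sess_B;
      MPI_Comm comm_A = comm_from_new_session(&sess_A, "example.session_A");
      MPI_Comm comm_B = comm_from_new_session(&sess_B, "example.session_B");
      int rank, x = 42, y = 0;
      MPI_Comm_rank(comm_A, &rank);   /* same rank order in both comms */

      int size = MPI_BSEND_OVERHEAD + (int)sizeof(int);
      void *buf = malloc(size);
      MPI_Buffer_attach(buf, size);

      if (rank == 0) {
          MPI_Bsend(&x, 1, MPI_INT, 1, 0, comm_A);                     /* 0-A */
          MPI_Recv (&y, 1, MPI_INT, 1, 0, comm_B, MPI_STATUS_IGNORE);  /* 0-B */
      } else if (rank == 1) {
          MPI_Recv (&y, 1, MPI_INT, 0, 0, comm_A, MPI_STATUS_IGNORE);  /* 1-A */
          MPI_Send (&x, 1, MPI_INT, 0, 0, comm_B);                     /* 1-B */
      }
      /* If progress were restricted per session, Call 1-A could block until
       * Process 0 enters MPI again in session A - which it never does. */

      MPI_Buffer_detach(&buf, &size);
      free(buf);
      MPI_Comm_disconnect(&comm_A);
      MPI_Comm_disconnect(&comm_B);
      MPI_Session_finalize(&sess_A);
      MPI_Session_finalize(&sess_B);
      return 0;
  }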
> >
> > For all readers of this text who are not familiar with the
> > behavior of MPI_Bsend + MPI_Recv and the progress rule of MPI,
> > I recommend looking at Slide 589 in my MPI course.
> > You may also download the zip or tar file and test
> > MPI/tasks/C/Ch18/progress-test-bsend.c
> > by using
> > - a single threaded MPI library (i.e., with providing progress only
> > inside of MPI routines)
> > - and an MPI library that provides asynchronous progress.
> >
> > For the slides and examples (in C, Fortran and Python) please look at
> >
> > https://www.hlrs.de/training/par-prog-ws/MPI-course-material
> >
> > Kind regards
> > Rolf
> >
> > ----- Original Message -----
> > > From: "mpiwg-sessions" <mpiwg-sessions at lists.mpi-forum.org>
> > > To: "mpiwg-sessions" <mpiwg-sessions at lists.mpi-forum.org>
> > > Cc: "Joseph Schuchart" <schuchart at icl.utk.edu>
> > > Sent: Wednesday, October 27, 2021 6:37:27 PM
> > > Subject: Re: [mpiwg-sessions] Cross-session progress
> >
> > > Dan,
> > >
> > > I guess this info key would apply to a Session? I can imagine an
> > > assertion saying that you'll never block on communication, i.e., no
> > > blocking send/recv and no wait, unless you are sure it completes, to
> > > make sure you're not creating a blocking dependency. That is the scope
> > > you have control over.
> > >
> > > This would allow an implementation to take this particular session out
> > > of the global progress scope (progress on the WPM or other sessions). A
> > > wait/test with requests from that session would still require global
> > > progress, though, to resolve any dependencies from sessions that do not
> > > carry this assert or from the WPM. If all sessions carry this assert,
> > > then of course it's only WPM communication that has to be progressed (if
> > > any). Would that be useful?
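
A hypothetical sketch of such an assertion - no such info key has been
standardized, and the key name below is invented purely to illustrate the
proposal:

  MPI_Info info;
  MPI_Session session;
  MPI_Info_create(&info);
  /* Hypothetical assertion: requests from this session never create a
   * blocking dependency on another session, so MPI may progress this
   * session in isolation. */
  MPI_Info_set(info, "mpi_assert_no_cross_session_dependency", "true");
  MPI_Session_init(info, MPI_ERRORS_ARE_FATAL, &session);
  MPI_Info_free(&info);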
> > >
> > > Thanks
> > > Joseph
> > >
> > > On 10/27/21 12:16 PM, Dan Holmes via mpiwg-sessions wrote:
> > >> Hi all,
> > >>
> > >> During the HACC WG call today, we discussed whether progress can be
> > >> isolated by session. We devised this simple pseudo-code example
> > >> (below) that shows the answer is “no”. With current progress rules in
> > >> MPI-4.0 (unchanged from previous versions of MPI), the code must not
> > >> deadlock at the place(s) indicated by the comments, even with one
> > >> thread of execution, because the MPI_Recv procedure at process 0 must
> > >> progress the send operation from process 0, which means the MPI_Recv
> > >> procedure at process 1 is required to complete.
> > >>
> > >> If MPI is permitted to limit the scope of progress during the MPI_Recv
> > >> procedure to just the operations within a particular session, then it
> > >> is permitted to refuse to progress the send operation from process 0
> > >> and deadlock inevitably ensues, unless the two libraries use different
> > >> threads or MPI supports strong progress (both of which are optional).
> > >>
> > >> We suggested an INFO assertion that would give the user the
> > >> opportunity to assert that they would not code the application in a
> > >> way that resulted in this kind of deadlock. It might be hard for the
> > >> user to know for sure when it is safe to use such an INFO assertion,
> > >> especially in the general case and with opaque/closed-source
> > >> libraries. However, if the INFO assertion was supplied, MPI could be
> > >> implemented with separated/isolated progress. The scope of progress is
> > >> global (whole MPI process) at the moment — and that would have to be
> > >> the default scope/value for the INFO assertion. Smaller scopes could
> > >> be session, communicator/window/file, and even operation.
> > >>
> > >> Process 0:
> > >>
> > >> library_A.begin_call -> {MPI_Issend(…, comm_A); }
> > >>
> > >> library_B.begin_call -> {MPI_Recv(…, comm_B); } // deadlock ?
> > >>
> > >> library_A.end_call -> {MPI_Wait(…, comm_A); }
> > >>
> > >> library_B.end_call -> { }
> > >>
> > >> Process 1:
> > >>
> > >> library_A.begin_call -> {MPI_Recv(…, comm_A); } // deadlock ?
> > >>
> > >> library_B.begin_call -> {MPI_Issend(…, comm_B); }
> > >>
> > >> library_A.end_call -> { }
> > >>
> > >> library_B.end_call -> {MPI_Wait(…, comm_B); }
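
A compact rendering of this pseudo-code in plain C (the library_* wrappers
are elided; comm_A and comm_B are assumed to be communicators from two
different sessions, created e.g. as in the sketch further up):

  MPI_Request req;
  int x = 1, y = 0, rank;
  MPI_Comm_rank(comm_A, &rank);
  if (rank == 0) {
      MPI_Issend(&x, 1, MPI_INT, 1, 0, comm_A, &req);              /* A.begin */
      MPI_Recv  (&y, 1, MPI_INT, 1, 0, comm_B, MPI_STATUS_IGNORE); /* B.begin */
      MPI_Wait  (&req, MPI_STATUS_IGNORE);                         /* A.end   */
  } else if (rank == 1) {
      MPI_Recv  (&y, 1, MPI_INT, 0, 0, comm_A, MPI_STATUS_IGNORE); /* A.begin */
      MPI_Issend(&x, 1, MPI_INT, 0, 0, comm_B, &req);              /* B.begin */
      MPI_Wait  (&req, MPI_STATUS_IGNORE);                         /* B.end   */
  }

Under the MPI-4.0 progress rules, the MPI_Recv at process 0 must progress
the pending MPI_Issend on comm_A so that the MPI_Recv at process 1 can
complete; progress limited to a single session could deadlock at the
points marked "deadlock ?" above.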
> > >>
> > >>
> > >> Cheers,
> > >> Dan.
> > >> —
> > >> Dr Daniel Holmes PhD
> > >> Executive Director
> > >> Chief Technology Officer
> > >> CHI Ltd
> > >> danholmes at chi.scot <mailto:danholmes at chi.scot>
> > >>
> > >>
> > >>
> > >>
> > >
> >
> > --
> > Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
> > High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
> > Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
> > Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .
>
> --
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .
--
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .