[mpiwg-sessions] Cross-session progress

Rolf Rabenseifner rabenseifner at hlrs.de
Sun Oct 31 14:45:30 CDT 2021


Dear Martin,

> consistent and progressing). In the particular example, it is bad for the
> library owning session A to return from a call assuming "someone else" makes
> progress.

No, this is exactly not the responsibility of the programmer.
It is just an effect of the freedom granted to MPI library
implementations, combined with timing - i.e., which call came first,
the MPI_Bsend in the neighbor or the MPI_Recv in the other process.

This means that any ordinary halo data exchange that is implemented
with MPI_Bsend and MPI_Recv calls in software layer A (e.g., CFD),
followed by another halo exchange in software layer B,
may already cause such a deadlock - and neither the application
programmer nor the library writer has any idea why the application
deadlocks in 10% of the runs.
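
To make this concrete, here is a minimal sketch of the pattern (the
layer names, tags, and buffer handling are illustrative only; comm_A
and comm_B are assumed to come from two unrelated sessions):

  #include <mpi.h>

  /* software layer A (e.g. CFD): halo exchange via buffered send + receive;
     assumes a sufficiently large buffer was attached with MPI_Buffer_attach */
  void layer_A_halo_exchange(MPI_Comm comm_A, int left, int right,
                             double *out, double *in, int n)
  {
      MPI_Bsend(out, n, MPI_DOUBLE, right, 0, comm_A);
      MPI_Recv(in, n, MPI_DOUBLE, left, 0, comm_A, MPI_STATUS_IGNORE);
  }

  /* software layer B: the same pattern on its own communicator */
  void layer_B_halo_exchange(MPI_Comm comm_B, int left, int right,
                             double *out, double *in, int n)
  {
      MPI_Bsend(out, n, MPI_DOUBLE, right, 0, comm_B);
      MPI_Recv(in, n, MPI_DOUBLE, left, 0, comm_B, MPI_STATUS_IGNORE);
  }

If MPI_Recv is implemented as weak local, the call that finally pushes
layer A's MPI_Bsend data out of the neighbor process may well be one of
layer B's calls on comm_B; if progress were confined to a single session,
exactly the sporadic deadlock described above could occur.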

> it is also not making it easier for the user. However, I just wanted to point out

One may interpret this sentence as saying that this spirit evolved
in the absence of a deeper knowledge of the progress rules of MPI.
(Which is not a surprise, because these rules are still hard to learn
from the very short wording in MPI-3.1 - and are now a bit better
understood after the many discussions in the MPI Forum.)

I expect that this "spirit" must be changed.

Best regards
Rolf

----- Original Message -----
> From: "Martin Schulz" <schulzm at in.tum.de>
> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> Cc: "mpiwg-sessions" <mpiwg-sessions at lists.mpi-forum.org>
> Sent: Sunday, October 31, 2021 5:28:23 PM
> Subject: Re: [mpiwg-sessions] Cross-session progress

> Hi Rolf, all,
> 
> Fully agree with all you said - what I meant by a close connection is that the
> calls are in the same program in a strict order and hence connected. If those
> were in separate libraries, then one could also argue that these are badly
> designed libraries (each library should be designed such that it is in itself
> consistent and progressing). In the particular example, it is bad for the
> library owning session A to return from a call assuming "someone else" makes
> progress.
> 
> I do agree, though, that this certainly is a different meaning for progress and
> it is also not making it easier for the user. However, I just wanted to point out
> that this is against the spirit of sessions, where we wanted full resource
> isolation. With cross-session progress, each call to MPI (or a global
> progress engine) has to go across all sessions and push the state of MPI
> forward (with clear scalability implications). With that, the concept of fault
> isolation (a fault happens in one session and the others are not affected at
> all) is much harder to implement.
> 
> I am also currently not arguing for one or the other - I see merits and problems
> on either side.
> 
> Martin
> 
> 
> --
> Prof. Dr. Martin Schulz, Chair of Computer Architecture and Parallel Systems
> Department of Informatics, TU-Munich, Boltzmannstraße 3, D-85748 Garching
> Member of the Board of Directors at the Leibniz Supercomputing Centre (LRZ)
> Email: schulzm at in.tum.de
> 
> 
> 
>On 31.10.21, 11:47, "Rolf Rabenseifner" <rabenseifner at hlrs.de> wrote:
> 
>    Dear Martin,
> 
>    > However, I do wonder if this is the right thing to do - cross-session progress
>    > does imply that there is a quite close connection and sharing of resources
>    > between sessions, which is exactly what we wanted to avoid.
> 
>    Please read my example very carefully.
> 
>    >    Process 0:
>    > 
>    >      MPI_Bsend(…, dest=1, comm_A); // Call 0-A
>    > 
>    >      MPI_Recv(…, source=1, comm_B); // Call 0-B
>    > 
>    >    Process 1:
>    > 
>    >      MPI_Recv(…, source=0, comm_A); // Call 1-A
>    > 
>    >      MPI_Send(…, dest=0, comm_B); // Call 1-B
> 
>    The communication in Session A / comm_A and the communication in
>    Session B / comm_B are absolutely without any connection to each other,
>    i.e., this example is definitely the contrary of "a quite close
>    connection".
> 
>    The problem arises because all local routines or locally acting calls,
>    like the MPI_Recv in 1-A that is called after the MPI_Bsend in 0-A
>    (which may have been called some time before the MPI_Recv),
>    can be implemented as "weak local" - a term that is not defined in MPI,
>    but which should mean that the call may not return until a specific
>    other process (here Process 0) makes an unspecific (i.e., semantically
>    unrelated) MPI procedure call.
> 
>    If an MPI library does not use this freedom, i.e., if it makes progress
>    with one asynchronous thread for all sessions (a) or with separate such
>    threads, one per session (b), then the problem does not exist.
> 
>    Then no such info key is needed either, and in case of (b), a very clear
>    separation of sessions is given (whether several such threads are
>    efficient is another question - but one about the quality of an MPI
>    library implementation, not about the MPI standard).
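> 
>    As an illustration only, variant (b) can even be approximated above the
>    MPI library by a helper thread per session that keeps calling into MPI.
>    A sketch, assuming MPI_THREAD_MULTIPLE (using MPI_Iprobe as a progress
>    kick is common practice, not something mandated by the standard):
> 
>      #include <mpi.h>
>      #include <pthread.h>
>      #include <stdatomic.h>
> 
>      static atomic_int stop_progress = 0;
> 
>      /* one such thread per session; pass a communicator derived from that
>         session, e.g. pthread_create(&tid, NULL, progress_thread, &comm) */
>      static void *progress_thread(void *arg)
>      {
>          MPI_Comm comm = *(MPI_Comm *)arg;
>          int flag;
>          while (!atomic_load(&stop_progress)) {
>              /* any MPI call on this session lets its progress engine run */
>              MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag,
>                         MPI_STATUS_IGNORE);
>          }
>          return NULL;
>      }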
> 
>    In my opinion, there is no reason for changing the progress rule,
>    which would make MPI more complicated for the user, because there
>    are perfectly good options for the implementors, and all of them are
>    (and should be) invisible to the question of whether a given MPI
>    application is correct.
> 
>    Kind regards
>    Rolf
> 
> 
>    ----- Original Message -----
>    > From: "Martin Schulz" <schulzm at in.tum.de>
>    > To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>, "mpiwg-sessions"
>    > <mpiwg-sessions at lists.mpi-forum.org>
>    > Sent: Sunday, October 31, 2021 10:49:18 AM
>    > Subject: Re: [mpiwg-sessions] Cross-session progress
> 
>    > Hi Rolf, all,
>    > 
>    > I agree that under the current rules, cross-session progress is required and
>    > there is probably very little room to maneuver.
>    > 
>    > However, I do wonder if this is the right thing to do - cross-session progress
>    > does imply that there is a quite close connection and sharing of resources
>    > between sessions, which is exactly what we wanted to avoid. Also, one could
>    > claim that if you have a program that has such a close connection that blocking
>    > in one session can harm progress in another, then you have written a bad
>    > program, as clearly the two communication operations are logically connected
>    > and hence should not be in two sessions.
>    > 
>    > We also had the idea once that it may be useful to have different MPI
>    > implementations back different sessions, which would then mean that they
>    > cannot be connected and also would not be able to know about the progress
>    > in the other session.
>    > 
>    > This would, of course, require new text that significantly changes progress
>    > rules in MPI (not in the WPM, but in the Sessions Model) with a whole bunch of
>    > consequences, but it would be matching the original idea of Sessions as
>    > independent access points into the MPI library.
>    > 
>    > Martin
>    > 
>    > 
>    > --
>    > Prof. Dr. Martin Schulz, Chair of Computer Architecture and Parallel Systems
>    > Department of Informatics, TU-Munich, Boltzmannstraße 3, D-85748 Garching
>    > Member of the Board of Directors at the Leibniz Supercomputing Centre (LRZ)
>    > Email: schulzm at in.tum.de
>    > 
>    > 
>    > 
>    >On 31.10.21, 10:00, "mpiwg-sessions on behalf of Rolf Rabenseifner via
>    >mpiwg-sessions" <mpiwg-sessions-bounces at lists.mpi-forum.org on behalf of
>    >mpiwg-sessions at lists.mpi-forum.org> wrote:
>    > 
>    >    Dear Dan and Joseph,
>    > 
>    >    I expect that such an info key makes no sense, because
>    >    the following example and related statement shows that
>    >    the rule is that we always have to require cross-session progress:
>    > 
>    >    _______________
>    >    The definition of "local" in MPI and the related progress rules
>    >    always allow that a local MPI routine, or an MPI call that must
>    >    behave as local, still does not return until an unspecific, i.e.,
>    >    semantically unrelated, MPI call happens in another process
>    >    (which is always guaranteed, because at the latest an MPI
>    >    finalizing call must be invoked, and that call is allowed to
>    >    block until all necessary progress has happened).
>    > 
>    >    Let comm_A and comm_B be two communicators derived from
>    >    two different sessions or one of them being part of the
>    >    world model.
>    >    They may be used in two different software layers which are
>    >    independently programmed.
>    >    The following program would cause a deadlock if the
>    >    MPI_RECV that matches an MPI_BSEND may not return until
>    >    such an unspecific MPI call happens in the process that
>    >    called MPI_BSEND, and if we required that this unspecific
>    >    MPI call be made in the same session as the MPI_BSEND.
>    > 
>    >    Process 0:
>    > 
>    >      MPI_Bsend(…, dest=1, comm_A); // Call 0-A
>    > 
>    >      MPI_Recv(…, source=1, comm_B); // Call 0-B
>    > 
>    >    Process 1:
>    > 
>    >      MPI_Recv(…, source=0, comm_A); // Call 1-A
>    > 
>    >      MPI_Send(…, dest=0, comm_B); // Call 1-B
>    > 
>    >    As long as Call 1-A does not return, Call 1-B is not executed,
>    >    and therefore Call 0-B cannot return, and therefore Process 0
>    >    cannot issue any further MPI call. This implies that Call 0-B
>    >    must be the semantically unrelated MPI call in Process 0
>    >    that provides the progress for Call 1-A.
>    >    This very simple example shows that cross-session progress is needed.
>    >    ___________________________
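>    > 
>    >    For experimentation, a complete C version of this example using the
>    >    MPI-4.0 Sessions interface might look as follows (a sketch: error
>    >    handling and the check for at least two processes are omitted, the
>    >    string tags are arbitrary, and whether the process-wide
>    >    MPI_Buffer_attach fits the Sessions model is a fine point of its own):
>    > 
>    >      #include <mpi.h>
>    >      #include <stdlib.h>
>    > 
>    >      /* helper: build a communicator over mpi://WORLD from a fresh session */
>    >      static MPI_Comm comm_from_new_session(MPI_Session *s, const char *tag)
>    >      {
>    >          MPI_Group g;
>    >          MPI_Comm c;
>    >          MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, s);
>    >          MPI_Group_from_session_pset(*s, "mpi://WORLD", &g);
>    >          MPI_Comm_create_from_group(g, tag, MPI_INFO_NULL,
>    >                                     MPI_ERRORS_ARE_FATAL, &c);
>    >          MPI_Group_free(&g);
>    >          return c;
>    >      }
>    > 
>    >      int main(void)
>    >      {
>    >          MPI_Session sA, sB;
>    >          MPI_Comm comm_A = comm_from_new_session(&sA, "example.tag.A");
>    >          MPI_Comm comm_B = comm_from_new_session(&sB, "example.tag.B");
>    >          int rank, data = 42, bsize;
>    >          void *buf;
>    > 
>    >          MPI_Comm_rank(comm_A, &rank);
>    >          MPI_Pack_size(1, MPI_INT, comm_A, &bsize);
>    >          bsize += MPI_BSEND_OVERHEAD;
>    >          buf = malloc(bsize);
>    >          MPI_Buffer_attach(buf, bsize);
>    > 
>    >          if (rank == 0) {
>    >              MPI_Bsend(&data, 1, MPI_INT, 1, 0, comm_A);        /* Call 0-A */
>    >              MPI_Recv(&data, 1, MPI_INT, 1, 0, comm_B,
>    >                       MPI_STATUS_IGNORE);                       /* Call 0-B */
>    >          } else if (rank == 1) {
>    >              MPI_Recv(&data, 1, MPI_INT, 0, 0, comm_A,
>    >                       MPI_STATUS_IGNORE);                       /* Call 1-A */
>    >              MPI_Send(&data, 1, MPI_INT, 0, 0, comm_B);         /* Call 1-B */
>    >          }
>    > 
>    >          MPI_Buffer_detach(&buf, &bsize);
>    >          free(buf);
>    >          MPI_Comm_free(&comm_A);
>    >          MPI_Comm_free(&comm_B);
>    >          MPI_Session_finalize(&sA);
>    >          MPI_Session_finalize(&sB);
>    >          return 0;
>    >      }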
>    > 
>    >    For all readers of this text who are not familiar with the
>    >    behavior of MPI_Bsend + MPI_Recv and the progress rule of MPI,
>    >    I recommend looking at Slide 589 in my MPI course.
>    >    You may also download the zip or tar file and test
>    >      MPI/tasks/C/Ch18/progress-test-bsend.c
>    >    using
>    >    - a single-threaded MPI library (i.e., one that provides progress only
>    >      inside of MPI routines)
>    >    - and an MPI library that provides asynchronous progress.
>    > 
>    >    For the slides and examples (in C, Fortran and Python) please look at
>    > 
>    >      https://www.hlrs.de/training/par-prog-ws/MPI-course-material
>    > 
>    >    Kind regards
>    >    Rolf
>    > 
>    >    ----- Original Message -----
>    >    > From: "mpiwg-sessions" <mpiwg-sessions at lists.mpi-forum.org>
>    >    > To: "mpiwg-sessions" <mpiwg-sessions at lists.mpi-forum.org>
>    >    > Cc: "Joseph Schuchart" <schuchart at icl.utk.edu>
>    >    > Sent: Wednesday, October 27, 2021 6:37:27 PM
>    >    > Subject: Re: [mpiwg-sessions] Cross-session progress
>    > 
>    >    > Dan,
>    >    > 
>    >    > I guess this info key would apply to a Session? I can imagine an
>    >    > assertion saying that you'll never block on communication, i.e., no
>    >    > blocking send/recv and no wait, unless you are sure it completes, to
>    >    > make sure you're not creating a blocking dependency. That is the scope
>    >    > you have control over.
>    >    > 
>    >    > This would allow an implementation to take this particular session out
>    >    > of the global progress scope (progress on the WPM or other sessions). A
>    >    > wait or test with requests from that session would still require global
>    >    > progress, though, to resolve any dependencies from sessions that do not
>    >    > carry this assert or from the WPM. If all sessions carry this assert,
>    >    > then of course only WPM communication has to be progressed (if
>    >    > any). Would that be useful?
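>    >    > 
>    >    > Sketching what such an assertion could look like at session creation
>    >    > (the key name mpi_assert_no_cross_session_progress is purely
>    >    > hypothetical, not an existing MPI info key):
>    >    > 
>    >    >   MPI_Info info;
>    >    >   MPI_Session session;
>    >    >   MPI_Info_create(&info);
>    >    >   /* hypothetical assertion: this session never blocks on an operation
>    >    >      whose completion depends on progress made outside this session */
>    >    >   MPI_Info_set(info, "mpi_assert_no_cross_session_progress", "true");
>    >    >   MPI_Session_init(info, MPI_ERRORS_ARE_FATAL, &session);
>    >    >   MPI_Info_free(&info);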
>    >    > 
>    >    > Thanks
>    >    > Joseph
>    >    > 
>    >    > On 10/27/21 12:16 PM, Dan Holmes via mpiwg-sessions wrote:
>    >    >> Hi all,
>    >    >>
>    >    >> During the HACC WG call today, we discussed whether progress can be
>    >    >> isolated by session. We devised this simple pseudo-code example
>    >    >> (below) that shows the answer is “no”. With current progress rules in
>    >    >> MPI-4.0 (unchanged from previous versions of MPI), the code must not
>    >    >> deadlock at the place(s) indicated by the comments, even with one
>    >    >> thread of execution, because the MPI_Recv procedure at process 0 must
>    >    >> progress the send operation from process 0, which means the MPI_Recv
>    >    >> procedure at process 1 is required to complete.
>    >    >>
>    >    >> If MPI is permitted to limit the scope of progress during the MPI_Recv
>    >    >> procedure to just the operations within a particular session, then it
>    >    >> is permitted to refuse to progress the send operation from process 0
>    >    >> and deadlock inevitably ensues, unless the two libraries use different
>    >    >> threads or MPI supports strong progress (both of which are optional).
>    >    >>
>    >    >> We suggested an INFO assertion that would give the user the
>    >    >> opportunity to assert that they would not code the application in a
>    >    >> way that resulted in this kind of deadlock. It might be hard for the
>    >    >> user to know for sure when it is safe to use such an INFO assertion,
>    >    >> especially in the general case and with opaque/closed-source
>    >    >> libraries. However, if the INFO assertion were supplied, MPI could be
>    >    >> implemented with separated/isolated progress. The scope of progress is
>    >    >> global (whole MPI process) at the moment — and that would have to be
>    >    >> the default scope/value for the INFO assertion. Smaller scopes could
>    >    >> be session, communicator/window/file, and even operation.
>    >    >>
>    >    >> Process 0:
>    >    >>
>    >    >> library_A.begin_call -> {MPI_Issend(…, comm_A); }
>    >    >>
>    >    >> library_B.begin_call -> {MPI_Recv(…, comm_B); } // deadlock ?
>    >    >>
>    >    >> library_A.end_call -> {MPI_Wait(…, comm_A); }
>    >    >>
>    >    >> library_B.end_call -> { }
>    >    >>
>    >    >> Process 1:
>    >    >>
>    >    >> library_A.begin_call -> {MPI_Recv(…, comm_A); } // deadlock ?
>    >    >>
>    >    >> library_B.begin_call -> {MPI_Issend(…, comm_B); }
>    >    >>
>    >    >> library_A.end_call -> { }
>    >    >>
>    >    >> library_B.end_call -> {MPI_Wait(…, comm_B); }
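>    >    >>
>    >    >> In plain C, collapsing the library wrappers inline, the same pattern
>    >    >> reads as follows (a sketch; rank, the buffers, and the creation of
>    >    >> comm_A/comm_B from two sessions are elided):
>    >    >>
>    >    >>   int x = 0, y;
>    >    >>   MPI_Request req;
>    >    >>   if (rank == 0) {
>    >    >>       MPI_Issend(&x, 1, MPI_INT, 1, 0, comm_A, &req); /* A.begin_call */
>    >    >>       MPI_Recv(&y, 1, MPI_INT, 1, 0, comm_B,
>    >    >>                MPI_STATUS_IGNORE);                    /* B.begin_call: deadlock? */
>    >    >>       MPI_Wait(&req, MPI_STATUS_IGNORE);              /* A.end_call */
>    >    >>   } else if (rank == 1) {
>    >    >>       MPI_Recv(&y, 1, MPI_INT, 0, 0, comm_A,
>    >    >>                MPI_STATUS_IGNORE);                    /* A.begin_call: deadlock? */
>    >    >>       MPI_Issend(&x, 1, MPI_INT, 0, 0, comm_B, &req); /* B.begin_call */
>    >    >>       MPI_Wait(&req, MPI_STATUS_IGNORE);              /* B.end_call */
>    >    >>   }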
>    >    >>
>    >    >>
>    >    >> Cheers,
>    >    >> Dan.
>    >    >> —
>    >    >> Dr Daniel Holmes PhD
>    >    >> Executive Director
>    >    >> Chief Technology Officer
>    >    >> CHI Ltd
>    >    >> danholmes at chi.scot <mailto:danholmes at chi.scot>
>    >    >>
>    >    >>
>    >    >>
>    >    >>
>    > 
>    >    --
>    >    Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
>    >    High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
>    >    University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
>    >    Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
>    >    Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .
> 
>    --
>    Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
>    High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
>    University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
>    Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
>    Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .

