[mpiwg-sessions] PMIx allocate resources request

Dan Holmes danholmes at chi.scot
Tue Sep 20 14:25:11 CDT 2022


Hi all,

The choices Ralph outlines fit with how cancellation of point-to-point messages works in MPI. There’s a moment in time (implementation specific and invisible to the user) when it becomes too late to cancel the message operation. MPI mandates that either the cancellation succeeds or the message operation succeeds, but not both and not neither. As long as exactly one of the two processes A and B gets PMIX_SUCCESS and the other gets an error, it’ll work fine.
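As a toy illustration of that invariant (a sketch with invented names, not real PMIx or MPI code), the "exactly one wins" rule amounts to deciding the request's outcome atomically, exactly once:

```python
import threading

# Toy model: a request whose outcome is decided exactly once. Whichever of
# complete() or cancel() reaches the decision point first wins; the loser
# is told it failed, so exactly one side ever sees success -- never both,
# never neither.
class Request:
    def __init__(self):
        self._lock = threading.Lock()
        self._outcome = None  # None until decided, then "completed" or "cancelled"

    def _decide(self, outcome):
        with self._lock:
            if self._outcome is None:
                self._outcome = outcome
                return True   # this caller won the race
            return False      # too late -- the other outcome already stuck

    def complete(self):
        return self._decide("completed")

    def cancel(self):
        return self._decide("cancelled")

req = Request()
results = [req.complete(), req.cancel()]
# Exactly one of the two operations reports success.
assert results.count(True) == 1 and results.count(False) == 1
```

The lock only models the atomicity; in practice the scheduler is the single point that serializes the two requests.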

MPI has deprecated cancellation of send operations because it was too hard to implement given the existence of eager send protocols and the completion semantics of MPI, which are always “I’ve done my part”. Both the allocation request and the cancellation request require remote action and rely on a remote state change, so they look more like receiving in MPI than sending. There should therefore be no implementation concerns.

MPI programs have no way for process A to “tell process B about” an operation request handle returned to process A. Copying the bytes doesn’t work because the value is meaningless to any other process. What is the scope of meaningfulness for PMIx handles/IDs? What needs to be shared for this transfer to make sense: same process, same server, same host, same scheduler? Does that scope translate into a usable MPI semantic?
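For concreteness, the tie-break policy Ralph describes in the quoted message below reduces to a single decision, sketched here as a toy function (invented names; the PMIX_* strings merely stand in for the real status codes):

```python
# Toy model of the resolution policy discussed below -- not real PMIx code.
# The only fact that matters is whether the scheduler has already sent the
# allocation response to process A when B's cancellation arrives.
def resolve(response_already_sent_to_A):
    if response_already_sent_to_A:
        # Too late to cancel: A keeps the allocation, B is told the
        # cancellation failed, and the application sorts out the conflict
        # (e.g. by "returning" the allocation itself).
        return {"A": "PMIX_SUCCESS", "B": "PMIX_ERR_XXX"}  # XXX: some "too late" error
    # Response not yet sent: tear the allocation down, release A with a
    # cancellation error, and confirm the cancellation to B.
    return {"A": "PMIX_ERR_CANCELLED", "B": "PMIX_SUCCESS"}

# In both cases each waiting process is released with a definite answer.
assert resolve(True) == {"A": "PMIX_SUCCESS", "B": "PMIX_ERR_XXX"}
assert resolve(False) == {"A": "PMIX_ERR_CANCELLED", "B": "PMIX_SUCCESS"}
```

Either branch satisfies the "exactly one succeeds" requirement above; the policy choice is only about which side the scheduler favours at the boundary.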

Cheers,
Dan.

Sent from my iPhone

> On 20 Sep 2022, at 19:22, Ralph Castain via mpiwg-sessions <mpiwg-sessions at lists.mpi-forum.org> wrote:
> 
> Assuming that's okay with the folks on this list, it works for me. Here are some thoughts I sent to Dominik:
> 
> ...let me offer you a slightly different thought model.
> 
> First, there is nothing to say that the allocation request and cancellation must come from the same process. Indeed, in many cases we've been studying, the cancellation directive is likely not only to come from a different process, but also come multiple times from multiple sources. This is why I keep pushing people away from MPI as the example - MPI's bulk synchronous model leads people to think in a "linear" fashion, while dynamic environments are inherently non-linear.
> 
> Second, the internal operations of the scheduler should never be visible to the application. If one scheduler delays actual cancellation of the allocation request until the end of a schedule calculation while another does it immediately, that shouldn't impact the application, nor should it affect the methodology resulting from the thought model.
> 
> So in this case, let's look at the following model. Process A issues an allocation request - doesn't matter if it uses the blocking or non-blocking API, as the former just calls the latter with an internal callback function. The request flows to the client's local server, which relays it to the host and returns PMIX_SUCCESS and the assigned request ID to the client, thereby indicating that the request was successfully filed with the host (as opposed to PMIX_ERR_NOT_SUPPORTED because the host doesn't support that API). Process A then shares that ID with process B.
> 
> The host relays the request to the scheduler. In the case of a static scheduler, this eventually triggers a recomputation of the overall schedule - there could be a delay before that happens. In the case of a dynamic scheduler, the request is immediately added to the continual schedule computation. Either way, what happens here doesn't matter to us.
> 
> At some point, process B determines that the application doesn't require the operation specified in the request by process A. For example, it could be that B has indicators that the job will complete sooner and therefore the allocation extension isn't needed. Anyway, B now issues a cancellation order on the assigned request ID. This flows to its local server, which again relays it to the host and returns PMIX_SUCCESS to the client, thereby indicating that the request was successfully filed with the host - remember, the host could NOT_SUPPORT a cancellation request.
> 
> So we now have two processes waiting for a response from the host. Process A is sitting in a wait state looking for a response to its allocation request. Process B is sitting in a wait state looking for a response to the cancellation request.
> 
> We therefore want the scheduler to immediately return a PMIX_ERR_CANCELLED response to process A, thus releasing it from its wait state and telling it that the requested allocation operation is not going to be performed. We also want the scheduler to immediately return a PMIX_SUCCESS response to process B, thereby indicating that the cancellation request was successfully performed.
> 
> Note that we don't care what the scheduler has to do internally to accomplish these tasks. The static scheduler might well have to complete its operation before discarding the result, while the dynamic scheduler simply removes the requested allocation from its ongoing computations. Doesn't matter to us - one just needs to internally track that the allocation resulting from its computation is to be discarded, while the other just drops the request from consideration. Neither case impacts what we see.
> 
> This leaves open the issue of the race conditions between these two requests. Our goal must be to ensure that both processes are released (i.e., don't hang waiting for a response) - consistency between the processes is the responsibility of the application.
> 
> Let's take the extreme cases. First, assume that the scheduler completes the allocation procedure and sends a notification to process A. The cancellation request from process B then arrives just after the response was sent to A, but before that response reaches A - i.e., there is no time for process A to alert process B that the allocation has been granted.
> 
> We have a couple of options here. First, we could have the scheduler generate a PMIx event indicating that the allocation has been terminated by cancellation and take the allocation back, responding with PMIX_SUCCESS to process B. Thus, process A would receive PMIX_SUCCESS indicating that the allocation was granted, but almost immediately afterwards receive an event indicating that the allocation had been rescinded by cancellation.
> 
> Another option would be to return an error to the cancellation request on the grounds that it is too late - the allocation was already granted. The application can then internally resolve the conflict and "return" the allocation if required.
> 
> My approach is to use the second option as it provides the best consistency at the scheduler. The cancellation request is too late - let the application deal with it.
> 
> The second extreme case is the reverse. Let's assume that the scheduler has just finished setting up the allocation, and before it can notify A that the allocation is complete and ready, it gets a cancellation order from B. We could return PMIX_SUCCESS to A and give A the allocation, and return PMIX_ERR_XXX to B indicating that it was too late to cancel. Or we could return PMIX_ERR_CANCELLED to A indicating that we couldn't give it the allocation because it was cancelled, and PMIX_SUCCESS to B indicating that the cancellation request was honored.
> 
> My approach here is the second one - return the cancellation error to A and success to B, and then tear down whatever was done in support of the initial request. The reason is that we had not yet notified A that the allocation was ready, and therefore A could not have started acting on that allocation. Yes, it could take a while for the host to internally deal with the change - but again, that is invisible to the application.
> 
> Hope that helps.
> Ralph
> 
> 
>> On Sep 20, 2022, at 11:14 AM, Martin SCHREIBER <martin.schreiber at univ-grenoble-alpes.fr> wrote:
>> 
>> Dear Ralph,
>> 
>> I'd be highly interested in this and I'd assume that the others are as
>> well.
>> 
>> How about using the MPI Session WG meeting on October 3rd for this, if
>> nobody has other plans?
>> 
>> Thanks & all the best,
>> 
>> Martin
>> 
>> 
>> 
>>> On Wed, 2022-09-14 at 20:34 -0700, Ralph Castain via mpiwg-sessions
>>> wrote:
>>> Hi folks
>>> 
>>> I was in a meeting earlier today where it emerged that you folks
>>> might be having some discussions related to the
>>> PMIx_Allocate_resources API? It sounded like there might be some
>>> confusion over that operation and how it works, plus there have been
>>> some developments regarding it in other forums.
>>> 
>>> If you let me know when you might be discussing this topic, I'd be
>>> happy to drop in to answer questions and provide updates on it.
>>> Ralph
>>> 
>>> _______________________________________________
>>> mpiwg-sessions mailing list
>>> mpiwg-sessions at lists.mpi-forum.org
>>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions
>> 
>> -- 
>> Prof. Dr. Martin Schreiber
>> 
>> Applied Mathematics / High Performance Scientific Comp. for PDEs
>> Université Grenoble Alpes / Laboratoire Jean Kuntzmann, France
>> 
>> For Time-X Euro-HPC project:
>> Informatics / Computer Architecture and Parallel Systems
>> Technical University of Munich, Germany
> 


