[Mpi-forum] MPI_Request_free restrictions

HOLMES Daniel d.holmes at epcc.ed.ac.uk
Sat Aug 15 05:46:20 CDT 2020


Hi Jim,

> Consider a simple test that does an MPI_Isend that has no matching recv, frees the request, and then calls MPI_Finalize. Does the above text say this should work? Or not?

IMHO, this program is erroneous because the “receiver” process does not comply with the requirements of MPI_FINALIZE, i.e. it must initiate all MPI calls needed to complete its involvement in MPI communications - it must initiate a matching receive operation (or the sender must cancel their send).

The actual behaviour is undefined - MPI might raise an error (if it notices), it might hang in MPI_FINALIZE at the sender process (e.g. because it has a large send buffer that it is waiting for a receiver to drain before it releases it), or it may seem to complete successfully (e.g. if the message is small, was sent eagerly, and the “receiver” doesn’t look at its unexpected message queue because it has no reason to do so). This ambiguity doesn’t matter because the program is erroneous - it can do anything - be happy if it doesn’t set the data centre on fire.
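For concreteness, a minimal C sketch of the erroneous program under discussion (run with at least two ranks; names and values are arbitrary, error handling omitted):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, payload = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Request req;
        /* nonblocking send that no process will ever match */
        MPI_Isend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* free the active request; no MPI_Wait/MPI_Test follows */
        MPI_Request_free(&req);
    }
    /* rank 1 posts no receive, so the program is erroneous:
       MPI_Finalize may hang, raise an error, or appear to succeed */
    MPI_Finalize();
    return 0;
}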

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Architect (HPC Research)
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 14 Aug 2020, at 18:28, Jim Dinan <james.dinan at gmail.com> wrote:

Sorry, we seem to have lost the mailing list for the last couple messages below (my fault).

> The text on MPI_FINALIZE does not mandate “no pending communication”; it requires “all MPI calls needed to complete its involvement …”
> “Before an MPI process invokes MPI_FINALIZE, the process must perform all MPI calls needed to complete its involvement in MPI communications associated with the World Model. It must locally complete all MPI operations that it initiated and must execute matching calls needed to complete MPI communications initiated by other processes. For example, if the process executed a nonblocking send, it must eventually call MPI_WAIT, MPI_TEST, MPI_REQUEST_FREE, or any derived function” §10.2.2 in MPI-4.0

Consider a simple test that does an MPI_Isend that has no matching recv, frees the request, and then calls MPI_Finalize.

Does the above text say this should work? Or not?

 ~Jim.

On Fri, Aug 14, 2020 at 9:28 AM HOLMES Daniel <d.holmes at epcc.ed.ac.uk> wrote:
Hi Jim,

> If the user releases their reference, the MPI library will need to add this handle to some internal data structure. IIRC, never requiring MPI to do this was a design guideline for MPI 3.0.

This is, I guess, the design choice that supports the current prohibition in the RMA chapter, i.e. calling MPI_REQUEST_FREE for a request-based RMA operation is erroneous. It’s a small overhead, but there is no trade-off (AFAIK) that could mitigate/outweigh it.

> Freeing an active request seems like it would leak application memory. For example, if you free an active send/recv request, how can the user safely access the send/recv buffer?

This is the reason that freeing an active point-to-point request is discouraged in the MPI Standard (and should, IMHO, be prohibited).
“It is preferable, in general, to free requests when they are inactive.” §3.9

Arguments like “but I can discover remote completion of the operation” do not provide a guarantee of local completion and/or freeing of local resources. That issue is mentioned in the MPI Standard to justify the discouragement, but it could equally well justify a strict prohibition.
“Active receive requests should not be freed. Otherwise, it will not be possible to check that the receive has completed.” §3.9
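A short sketch of the receive-side problem (assuming a matching send is posted elsewhere): once the request is freed, no MPI call remains that can report when buf becomes valid.

int buf;
MPI_Request req;
MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
MPI_Request_free(&req);   /* discouraged: the handle is gone */
/* no MPI_Wait/MPI_Test is possible any more, so there is no point
   in the program at which reading buf is known to be safe */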

The MPI Forum is unlikely to vote for upgrading the discouragement to a prohibition for point-to-point (because back-compat, sigh).

> Is it effectively leaked (i.e. never returned back to the user by the MPI library)?

It is effectively leaked until MPI_FINALIZE returns.

> And how will the user meet the no-pending-communication requirement of MPI_Finalize?

The text on MPI_FINALIZE does not mandate “no pending communication”; it requires “all MPI calls needed to complete its involvement …”
“Before an MPI process invokes MPI_FINALIZE, the process must perform all MPI calls needed to complete its involvement in MPI communications associated with the World Model. It must locally complete all MPI operations that it initiated and must execute matching calls needed to complete MPI communications initiated by other processes. For example, if the process executed a nonblocking send, it must eventually call MPI_WAIT, MPI_TEST, MPI_REQUEST_FREE, or any derived function” §10.2.2 in MPI-4.0

The "execute matching calls needed to complete MPI communications initiated by other processes” bit is easy - just initiate (meaning MPI_Isend/MPI_Irecv or MPI_START) the matching point-to-point MPI procedure at the other MPI process. The progress rule in §3.5 guarantees that “If a pair of matching send and receives have been initiated then at least one of these two operations will complete, independently of other actions in the system” and “[each] will complete, unless the [other] is satisfied by another message.” So, in a correct MPI program, where all sends have a matching receive and vice versa, all those point-to-point communication operations will complete (eventually, possibly during MPI_FINALIZE).

The “locally complete” bit is what you’re really asking about. Of course, strictly, MPI_REQUEST_FREE does not “locally complete” and so it should not be relevant in this pre-finalise instruction; it is listed here precisely because of the historical exception permitting freeing of active point-to-point requests. Thus, “MPI_ISEND, MPI_REQUEST_FREE, MPI_FINALIZE” is an explicitly allowed exception, even though it would otherwise breach the rule.
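In sketch form, the allowed exception and the matching call that makes it correct (rank, data, and the tag are assumed; rank 0 must not modify data until it can infer completion by other means, e.g. a reply message):

if (rank == 0) {
    MPI_Request req;
    MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    MPI_Request_free(&req);   /* the historical point-to-point exception */
} else if (rank == 1) {
    /* the matching call needed to complete the sender's operation */
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_Finalize();   /* the §10.2.2 requirement is met at both processes */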

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Architect (HPC Research)
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 13 Aug 2020, at 20:17, Jim Dinan <james.dinan at gmail.com> wrote:

Sorry, I got my wires crossed there. Apply what I wrote to MPI_Request_free on an active request.

Assume that the MPI library allocates space on the heap for the internal request object and returns a handle (e.g. pointer) through the MPI_Request object. The user is required to hang onto this handle and wait/test on it in the future, so MPI doesn't need to hold a reference. If the user releases their reference, the MPI library will need to add this handle to some internal data structure. IIRC, never requiring MPI to do this was a design guideline for MPI 3.0.

But also, freeing an active request seems like it would leak application memory. For example, if you free an active send/recv request, how can the user safely access the send/recv buffer? Is it effectively leaked (i.e. never returned back to the user by the MPI library)? And how will the user meet the no-pending-communication requirement of MPI_Finalize?

 ~Jim.

On Thu, Aug 13, 2020 at 10:07 AM HOLMES Daniel <d.holmes at epcc.ed.ac.uk> wrote:
Hi Jim,

To be clear, I think that MPI_CANCEL is evil and should be removed from the MPI Standard entirely at the earliest convenience.

I am certainly not arguing that it be permitted for more MPI operations.

I thought the discussion was focused on MPI_REQUEST_FREE and whether or not it can/should be used on an active request.

If a particular MPI implementation does not keep a reference to the request between MPI_RPUT and MPI_REQUEST_FREE, but needs that reference to process the completion event, then that MPI implementation would be required to keep a reference to the request from MPI_REQUEST_FREE until that important task had been done, perhaps until the close epoch call. This requires no new memory because the user is giving up their reference to the request, so MPI can safely use the request it is passed in MPI_REQUEST_FREE without copying it. As you say, MPI takes over the responsibility for processing the completion event.

Your question about why the implementation should be required to take on this complexity is a good one. That, I guess, is why freeing any active request is a bad idea. MPI is required to differentiate completion of individual operations (so it can implement MPI_WAIT) but that means something must process completion at some point for each individual operation. In RMA, that responsibility can be discharged earlier than in other parts of the MPI interface, but the real question is “why should MPI offer to take on this responsibility in the first place?”
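Purely as an illustration - not how any particular implementation works - the bookkeeping might look like a deferred-destroy flag on a hypothetical heap-allocated internal request object (all names invented):

#include <stdlib.h>

typedef struct internal_req {
    int complete;    /* set when the completion event is processed */
    int user_freed;  /* set by MPI_Request_free on an active request */
    /* ... operation state ... */
} internal_req_t;

/* called from MPI_Request_free: MPI takes over the last reference */
static void req_user_free(internal_req_t *r) {
    r->user_freed = 1;
}

/* called when the completion event is processed, e.g. by the
   progress engine or during the close-epoch call */
static void req_on_completion(internal_req_t *r) {
    r->complete = 1;
    if (r->user_freed)
        free(r);     /* destruction was deferred until this point */
}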

Thanks, that helps (me at least).

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Architect (HPC Research)
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 13 Aug 2020, at 14:43, Jim Dinan <james.dinan at gmail.com> wrote:

The two cases you mentioned would have the same behavior at an application level. However, there may be important differences in the implementation of each operation. For example, an MPI_Put operation may be configured to not generate a completion event, whereas an MPI_Rput would. The library may be relying on the user to make a call on the request to process the event and clean up resources. The implementation can take over this responsibility if the user cancels the request, but why should we ask implementers to take on this complexity and overhead?

My $0.02 is that MPI_Cancel is subtle and complicated, and we should be very careful about where we allow it. I don't see the benefit to the programming model outweighing the complexity and overhead in the MPI runtime for the case of MPI_Rput. I also don't know that we were careful enough in specifying the RMA memory model that a canceled request-based RMA operation will still have well-defined behavior. My understanding is that MPI_Cancel is required primarily for canceling receive requests to meet MPI's quiescent shutdown requirement.

 ~Jim.

On Thu, Aug 13, 2020 at 8:11 AM HOLMES Daniel via mpi-forum <mpi-forum at lists.mpi-forum.org> wrote:
Hi all,

To increase my own understanding of RMA, what is the difference (if any) between a request-based RMA operation, where the request is freed without being completed and before the epoch is closed, and a “normal” RMA operation?

MPI_WIN_LOCK() ! or any other "open epoch at origin" procedure call
doUserWorkBefore()
MPI_RPUT(&req)
MPI_REQUEST_FREE(&req)
doUserWorkAfter()
MPI_WIN_UNLOCK() ! or the matching "close epoch at origin" procedure call

vs:

MPI_WIN_LOCK() ! or any other "open epoch at origin" procedure call
doUserWorkBefore()
MPI_PUT()
doUserWorkAfter()
MPI_WIN_UNLOCK() ! or the matching "close epoch at origin" procedure call

Is this a source-to-source translation that is always safe in either direction?

In RMA, in contrast to the rest of MPI, there are two opportunities for MPI to “block” and do non-local work to complete an RMA operation: 1) during MPI_WAIT on the request (if any - the user may not be given a request, may free the request without calling MPI_WAIT, or may only ever call the nonblocking MPI_TEST) and 2) during the close-epoch procedure, which is always permitted to be sufficiently non-local to guarantee that the RMA operation is complete and its freeing stage has been done. It seems that a request-based RMA operation becomes identical to a “normal” RMA operation if the user calls MPI_REQUEST_FREE on the request. This is like “freeing” the request of a nonblocking point-to-point operation, except that point-to-point offers no guarantee of a later synchronisation procedure that can actually complete the operation and actually do its freeing stage.
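For concreteness, a C sketch of the request-based variant with the two opportunities marked (window setup omitted; target, origin_buf, and win are assumed; note that MPI-3.1 currently makes this MPI_REQUEST_FREE call erroneous - it is exactly the hypothetical under discussion):

MPI_Request req;
MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
MPI_Rput(origin_buf, 1, MPI_INT, target, 0, 1, MPI_INT, win, &req);
/* opportunity 1: MPI_Wait(&req, MPI_STATUS_IGNORE) could complete
   the operation here, but the user frees the request instead */
MPI_Request_free(&req);
MPI_Win_unlock(target, win);
/* opportunity 2: the close-epoch call is permitted to be non-local
   enough to guarantee the operation is complete and freed */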

In collectives, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for collectives.
In point-to-point, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for point-to-point.
In file operations, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for file operations. (There is MPI_FILE_SYNC but it is optional so MPI cannot rely on it being called.)
In these cases, the only non-local procedure that is guaranteed to happen is MPI_FINALIZE, hence all outstanding non-local work needed by the “freed” operation might be delayed until that procedure is called.

The issue with copying parameters is also moot because all of them are passed by value (implicitly copied) or are data buffers covered by the “conflicting accesses” RMA rules.

Thus, it seems to me that RMA is a very special case - it could support different semantics, but that does not provide a good basis for claiming that the rest of the MPI Standard can support those different semantics - unless we introduce an epoch concept into the rest of the MPI Standard. This is not unreasonable: the notifications in GASPI, for example, guarantee completion of not just the operation they are attached to but *all* operations issued in the “queue” they represent since the last notification. Their queue concept serves the purpose of an epoch. I’m sure there are other examples in other APIs. It seems likely to me that the proposal for MPI_PSYNC for partitioned communication operations is moving in the direction of an epoch, although limited to remote completion of all the partitions in a single operation, which accidentally guarantees that the operation can be freed locally using a local procedure.

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Architect (HPC Research)
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 13 Aug 2020, at 01:40, Skjellum, Anthony via mpi-forum <mpi-forum at lists.mpi-forum.org> wrote:

FYI, one argument (also used to force us to add the restriction that MPI persistent collective initialization be blocking)... MPI_Request_free on an NBC poses a problem for the cases where array arguments are passed (e.g., Alltoallv/w)... It will not be knowable to the application whether the vectors are still in use by MPI after the free on an active request. We do *not* currently mandate that the MPI implementation copy such arrays, so they are effectively "held as unfreeable" by the MPI implementation till MPI_Finalize. The user cannot deallocate them in a correct program till after MPI_Finalize.

Another effect of releasing an active NBC request, IMHO, is that you don't know when send or receive buffers are free to be deallocated... since you don't know when the transfer is complete OR when the buffers are no longer used by MPI (till after MPI_Finalize).
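A sketch of the Alltoallv case (fragment; nprocs, sendbuf, recvbuf, and the usual headers are assumed): after the free, nothing can report when MPI stops reading the four vectors, so a correct program must keep them allocated until after MPI_Finalize.

int *sendcounts = malloc(nprocs * sizeof(int));
int *sdispls    = malloc(nprocs * sizeof(int));
int *recvcounts = malloc(nprocs * sizeof(int));
int *rdispls    = malloc(nprocs * sizeof(int));
/* ... fill in the count and displacement vectors ... */
MPI_Request req;
MPI_Ialltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
               recvbuf, recvcounts, rdispls, MPI_INT,
               MPI_COMM_WORLD, &req);
MPI_Request_free(&req);   /* erroneous for NBC under MPI-3.1 */
/* MPI is not required to copy the vectors, so none of the four
   arrays (nor the buffers) may be freed before MPI_Finalize */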

Tony




Anthony Skjellum, PhD
Professor of Computer Science and Chair of Excellence
Director, SimCenter
University of Tennessee at Chattanooga (UTC)
tony-skjellum at utc.edu [or skjellum at gmail.com]
cell: 205-807-4968

________________________________
From: mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of Jeff Hammond via mpi-forum <mpi-forum at lists.mpi-forum.org>
Sent: Saturday, August 8, 2020 12:07 PM
To: Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
Cc: Jeff Hammond <jeff.science at gmail.com>
Subject: Re: [Mpi-forum] MPI_Request_free restrictions

We should fix the RMA chapter with an erratum. I care less about NBC but share your ignorance of why it was done that way.

Sent from my iPhone

On Aug 8, 2020, at 6:51 AM, Balaji, Pavan via mpi-forum <mpi-forum at lists.mpi-forum.org> wrote:

Folks,

Does someone remember why we disallowed users from calling MPI_Request_free on nonblocking collective requests?  I remember the reasoning for not allowing cancel (i.e., the operation might have completed on some processes, but not all), but not for Request_free.  AFAICT, allowing users to free the request doesn’t make any difference to the MPI library.  The MPI library would simply maintain its own refcount to the request and continue forward till the operation completes.  One of our users would like to free NBC requests so they don’t have to wait for the operation to complete in some situations.

Unfortunately, when I added the Rput/Rget operations in the RMA chapter, I copy-pasted that text into RMA as well without thinking too hard about it.  My bad!  Either the RMA committee missed it too, or they thought of a reason that I can’t think of now.

Can someone clarify or remind me what the reason was?

Regards,

  — Pavan

MPI-3.1 standard, page 197, lines 26-27:

“It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request associated with a nonblocking collective operation.”



