[Mpi-forum] MPI_Request_free restrictions
Quincey Koziol
koziol at lbl.gov
Fri Aug 21 10:19:41 CDT 2020
Hi Dan,
Ah, very useful to know, thanks! Is there a nonblocking version of MPI_COMM_DISCONNECT? (I’ve searched the web for MPI_COMM_IDISCONNECT and it comes up empty, but that’s not canonical :-)
If not, can a capability like this be added to any “wish lists”? Ideally, calling something like MPI_COMM_IDISCONNECT and then having the request for that operation complete would mean that MPI_COMM_FREE would be guaranteed to be both nonblocking and complete locally. Thoughts?
Quincey
> On Aug 21, 2020, at 10:01 AM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk> wrote:
>
> Hi Quincey,
>
> Calling MPI_COMM_FREE when some requests representing nonblocking or persistent operations are still active is not prohibited by MPI and seems to work successfully in all the MPI libraries I’ve tested.
>
> The normative description for MPI_COMM_FREE in the MPI Standard specifically calls out that it will only mark the communicator for freeing later and may return to the user before pending/ongoing communication is complete. It does not require that the completion procedure has been called for active operations.
>
> We discussed in the Forum (as recently as the meeting this week) that this is a key difference between MPI_COMM_FREE and MPI_COMM_DISCONNECT - the latter states that the user is required to call the completion procedure(s) for all operations using a communicator before disconnecting it using MPI_COMM_DISCONNECT, which will wait for all pending communication to complete internally.
>
> OTOH, I’m not sure that doing this buys you as much as you think it might.
>
> MPI_COMM_FREE is a collective procedure, so it is permitted to wait until MPI_COMM_FREE has been called at all other MPI processes in the communicator, i.e. it can have blocking-barrier-like semantics. All collective operations must be initialised in the same order at all processes in the communicator. So a valid implementation could do all the pending work inside MPI_COMM_FREE but the Standard also permits an implementation that does nothing other than change a “ready-for-freeing” flag on the local communicator object.
>
>> Am I allowed to call MPI_COMM_FREE while I have an uncompleted request for a nonblocking collective operation (like MPI_IBARRIER) on the communicator?
>
> Yes.
>
>> Will MPI_COMM_FREE block for completion of the NBC op?
>
> No.
>
> Cheers,
> Dan.
> —
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
> —
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
> —
>
>> On 21 Aug 2020, at 15:26, Quincey Koziol via mpi-forum <mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>> wrote:
>>
>> Hi Dan,
>> I agree with you about MPI barriers, but that’s why I said it was simplified pseudocode. :-) We do have more mechanisms in place for handling the “fence-ness” of the operation, but barriers are a component and I’d like to move to a nonblocking version when possible.
>>
>> Having more, hopefully all, of the file operations get equivalent nonblocking versions would be _very_ nice, and I could simplify our internal code more if a nonblocking MPI_FILE_SYNC was available. A nonblocking version of MPI_FILE_SET_SIZE would also be high on my list.
>>
>> Yes, I grok the behavior of MPI_FILE_CLOSE, but don’t want to add a barrier on top of it. :-)
>>
>>
>> One new question: Am I allowed to call MPI_COMM_FREE while I have an uncompleted request for a nonblocking collective operation (like MPI_IBARRIER) on the communicator? Will MPI_COMM_FREE block for completion of the NBC op?
>>
>>
>> Thanks!
>> Quincey
>>
>>
>>
>>> On Aug 15, 2020, at 6:07 AM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>> wrote:
>>>
>>> Hi Quincey,
>>>
>>> The MPI barrier operation (whether blocking, nonblocking, or persistent) does not guarantee “memory fence” semantics (either for the content of memory or the content of files).
>>>
>>> Perhaps you are looking for MPI_FILE_SYNC?
>>>
>>> “If other processes have made updates to the storage device, then all such updates become visible to subsequent reads of fh by the calling process.” §13.6.1
>>>
>>> “MPI_FILE_SYNC is a collective operation.” §13.6.1
>>>
>>> Used correctly (the user must locally complete their I/O operations before calling it), this does provide a “fence”-like guarantee *for the file*, which is what your code appears to be attempting. That is, all writes to the file that were initiated at a remote process (and locally completed at that process) before the matching remote call to MPI_FILE_SYNC are guaranteed to be visible in the file to subsequent locally issued MPI read operations once the local call to MPI_FILE_SYNC completes locally.
>>>
>>> There is currently no nonblocking or persistent expression of this MPI procedure - watch this space: this is on the to-do list for MPI-Next.
>>>
>>> As Jim points out, the performance problem you note is most likely due to the implicit MPI_FILE_SYNC-like synchronisation done internally by MPI during the MPI_FILE_CLOSE procedure call. All enqueued file operations targeting the file will be flushed to the file during MPI_FILE_CLOSE. If file operations are not flushed to the file concurrently with the application stuff or the MPI communication operations, then they will still be enqueued when MPI_FILE_CLOSE is called.
>>>
>>> Cheers,
>>> Dan.
>>>
>>>> On 14 Aug 2020, at 17:32, Quincey Koziol via mpi-forum <mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>> wrote:
>>>>
>>>> Hi Dan,
>>>> I believe that Pavan was referring to my conversation with him about MPI_Request_free. Here’s my situation: I’d like to use MPI_Ibarrier as a form of “memory fence” between some of the metadata reads and writes in HDF5. Here’s some [very] simplified pseudocode for what I’d like to do:
>>>>
>>>> ===============================
>>>>
>>>> <open HDF5 file> // sets up a communicator for internal HDF5 communication about this file
>>>>
>>>> do {
>>>>     MPI_Ibarrier(<file’s communicator>, &request);
>>>>
>>>>     <application stuff>
>>>>
>>>>     // HDF5 operation:
>>>>     if (<operation is read or write>) {
>>>>         MPI_Wait(&request, MPI_STATUS_IGNORE);
>>>>         <perform read / write>
>>>>     }
>>>>     else { // operation is a file close
>>>>         MPI_Request_free(&request); // the call I would like to make; erroneous for NBC requests in MPI-3.1
>>>>         MPI_File_close(…);
>>>>         MPI_Comm_free(&<file’s communicator>);
>>>>     }
>>>> } while (<file is open>);
>>>>
>>>> ===============================
>>>>
>>>> What I am really trying to avoid is calling MPI_Wait at file close, since it is semantically unnecessary and only increases the latency from the application’s perspective. If I can’t call MPI_Request_free on a nonblocking collective operation’s request (and it looks like I can’t, right now), I will have to put the request and file’s communicator into a “cleanup” list that is polled periodically [on each rank] with MPI_Test and disposed of when the nonblocking barrier completes locally.
>>>>
>>>> So, I’d really like to be able to call MPI_Request_free on the nonblocking barrier’s request.
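The “cleanup” list fallback described above can be sketched in plain C. This is only an illustrative sketch, not HDF5 or MPI code: the names `cleanup_list`, `cleanup_defer`, `cleanup_poll`, and the `fake_op` stand-in are all hypothetical. In a real MPI program the `is_done` callback would presumably wrap MPI_Test on the barrier's request, and `dispose` would free the file's communicator; here a fake operation that completes after a fixed number of polls takes their place so the sketch has no MPI dependency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One deferred item: an opaque resource plus the callbacks that manage it.
 * Assumption: with MPI, `resource` would hold the request and communicator,
 * `is_done` would wrap MPI_Test, and `dispose` would call MPI_Comm_free. */
typedef struct cleanup_item {
    void *resource;
    bool (*is_done)(void *resource);   /* nonblocking completion check */
    void (*dispose)(void *resource);   /* release once complete */
    struct cleanup_item *next;
} cleanup_item;

typedef struct {
    cleanup_item *head;
} cleanup_list;

/* Queue a resource for later disposal; returns false on allocation failure. */
bool cleanup_defer(cleanup_list *list, void *resource,
                   bool (*is_done)(void *), void (*dispose)(void *))
{
    assert(list != NULL && is_done != NULL && dispose != NULL);
    cleanup_item *item = malloc(sizeof *item);
    if (item == NULL)
        return false;
    item->resource = resource;
    item->is_done = is_done;
    item->dispose = dispose;
    item->next = list->head;
    list->head = item;
    return true;
}

/* Poll every pending item once without blocking; dispose of and unlink the
 * ones that have completed. Returns the number of items still pending. */
size_t cleanup_poll(cleanup_list *list)
{
    assert(list != NULL);
    size_t pending = 0;
    cleanup_item **link = &list->head;
    while (*link != NULL) {
        cleanup_item *item = *link;
        if (item->is_done(item->resource)) {
            item->dispose(item->resource);
            *link = item->next;
            free(item);
        } else {
            pending++;
            link = &item->next;
        }
    }
    return pending;
}

/* Hypothetical stand-in for a nonblocking operation: it reports completion
 * only after `polls_left` unsuccessful checks. */
typedef struct {
    int polls_left;
    bool disposed;
} fake_op;

bool fake_is_done(void *resource)
{
    fake_op *op = resource;
    if (op->polls_left > 0) {
        op->polls_left--;
        return false;
    }
    return true;
}

void fake_dispose(void *resource)
{
    fake_op *op = resource;
    op->disposed = true;
}
```

At file close, a library following this approach would call `cleanup_defer` instead of waiting, and then call `cleanup_poll` from its regular entry points on each rank until the list drains. The cost is exactly the bookkeeping the message above hopes to avoid.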
>>>>
>>>> Thoughts?
>>>>
>>>> Quincey
>>>>
>>>>
>>>>> On Aug 13, 2020, at 9:07 AM, HOLMES Daniel via mpi-forum <mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>> wrote:
>>>>>
>>>>> Hi Jim,
>>>>>
>>>>> To be clear, I think that MPI_CANCEL is evil and should be removed from the MPI Standard entirely at the earliest convenience.
>>>>>
>>>>> I am certainly not arguing that it be permitted for more MPI operations.
>>>>>
>>>>> I thought the discussion was focused on MPI_REQUEST_FREE and whether or not it can/should be used on an active request.
>>>>>
>>>>> If a particular MPI implementation does not keep a reference to the request between MPI_RPUT and MPI_REQUEST_FREE, but needs that reference to process the completion event, then that MPI implementation would be required to keep a reference to the request from MPI_REQUEST_FREE until that important task had been done, perhaps until the close epoch call. This requires no new memory because the user is giving up their reference to the request, so MPI can safely use the request it is passed in MPI_REQUEST_FREE without copying it. As you say, MPI takes over the responsibility for processing the completion event.
>>>>>
>>>>> Your question about why the implementation should be required to take on this complexity is a good one. That, I guess, is why freeing any active request is a bad idea. MPI is required to differentiate completion of individual operations (so it can implement MPI_WAIT) but that means something must process completion at some point for each individual operation. In RMA, that responsibility can be discharged earlier than in other parts of the MPI interface, but the real question is “why should MPI offer to take on this responsibility in the first place?”
>>>>>
>>>>> Thanks, that helps (me at least).
>>>>>
>>>>> Cheers,
>>>>> Dan.
>>>>>
>>>>>> On 13 Aug 2020, at 14:43, Jim Dinan <james.dinan at gmail.com <mailto:james.dinan at gmail.com>> wrote:
>>>>>>
>>>>>> The two cases you mentioned would have the same behavior at an application level. However, there may be important differences in the implementation of each operation. For example, an MPI_Put operation may be configured to not generate a completion event, whereas an MPI_Rput would. The library may be relying on the user to make a call on the request to process the event and clean up resources. The implementation can take over this responsibility if the user cancels the request, but why should we ask implementers to take on this complexity and overhead?
>>>>>>
>>>>>> My $0.02 is that MPI_Cancel is subtle and complicated, and we should be very careful about where we allow it. I don't see the benefit to the programming model outweighing the complexity and overhead in the MPI runtime for the case of MPI_Rput. I also don't know that we were careful enough in specifying the RMA memory model that a canceled request-based RMA operation will still have well-defined behavior. My understanding is that MPI_Cancel is required primarily for canceling receive requests to meet MPI's quiescent shutdown requirement.
>>>>>>
>>>>>> ~Jim.
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 8:11 AM HOLMES Daniel via mpi-forum <mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> To increase my own understanding of RMA, what is the difference (if any) between a request-based RMA operation where the request is freed without being completed and before the epoch is closed and a “normal” RMA operation?
>>>>>>
>>>>>> MPI_WIN_LOCK() ! or any other "open epoch at origin" procedure call
>>>>>> doUserWorkBefore()
>>>>>> MPI_RPUT(&req)
>>>>>> MPI_REQUEST_FREE(&req)
>>>>>> doUserWorkAfter()
>>>>>> MPI_WIN_UNLOCK() ! or the matching "close epoch at origin" procedure call
>>>>>>
>>>>>> vs:
>>>>>>
>>>>>> MPI_WIN_LOCK() ! or any other "open epoch at origin" procedure call
>>>>>> doUserWorkBefore()
>>>>>> MPI_PUT()
>>>>>> doUserWorkAfter()
>>>>>> MPI_WIN_UNLOCK() ! or the matching "close epoch at origin" procedure call
>>>>>>
>>>>>> Is this a source-to-source translation that is always safe in either direction?
>>>>>>
>>>>>> In RMA, in contrast to the rest of MPI, there are two opportunities for MPI to “block” and do non-local work to complete an RMA operation: 1) during MPI_WAIT for the request (if any - the user may not be given a request or the user may choose to free the request without calling MPI_WAIT or the user might call nonblocking MPI_TEST) and 2) during the close epoch procedure, which is always permitted to be sufficiently non-local to guarantee that the RMA operation is complete and its freeing stage has been done. It seems that a request-based RMA operation becomes identical to a “normal” RMA operation if the user calls MPI_REQUEST_FREE on the request. This is like “freeing” the request from a nonblocking point-to-point operation but without the guarantee of a later synchronisation procedure that can actually complete the operation and actually do the freeing stage of the operation.
>>>>>>
>>>>>> In collectives, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for collectives.
>>>>>> In point-to-point, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for point-to-point.
>>>>>> In file operations, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for file operations. (There is MPI_FILE_SYNC but it is optional so MPI cannot rely on it being called.)
>>>>>> In these cases, the only non-local procedure that is guaranteed to happen is MPI_FINALIZE, hence all outstanding non-local work needed by the “freed” operation might be delayed until that procedure is called.
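The contrast drawn above, where an epoch gives MPI a guaranteed later point at which every outstanding operation can be completed, can be illustrated with a toy model in plain C. Everything here (`toy_epoch`, `toy_put`, `toy_wait`, `toy_epoch_close`) is hypothetical and models only the bookkeeping, not any real RMA implementation: waiting on a request completes one operation, while closing the epoch completes whatever is still outstanding, whether its request was freed or never existed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_OPS 8

/* Toy epoch: tracks the completion state of operations issued inside it. */
typedef struct {
    bool done[TOY_MAX_OPS];
    size_t issued;
} toy_epoch;

/* Analogous to an "open epoch at origin" call. */
void toy_epoch_open(toy_epoch *e)
{
    assert(e != NULL);
    e->issued = 0;
}

/* Issue an operation inside the epoch; returns its index, or -1 if full. */
int toy_put(toy_epoch *e)
{
    assert(e != NULL);
    if (e->issued == TOY_MAX_OPS)
        return -1;
    e->done[e->issued] = false;
    return (int)e->issued++;
}

/* Analogous to waiting on one request: completes only that operation. */
void toy_wait(toy_epoch *e, int op)
{
    assert(e != NULL && op >= 0 && (size_t)op < e->issued);
    e->done[op] = true;
}

/* Analogous to a "close epoch at origin" call: completes every operation
 * still outstanding and returns how many were completed at this point. */
size_t toy_epoch_close(toy_epoch *e)
{
    assert(e != NULL);
    size_t completed_here = 0;
    for (size_t i = 0; i < e->issued; i++) {
        if (!e->done[i]) {
            e->done[i] = true;
            completed_here++;
        }
    }
    e->issued = 0;
    return completed_here;
}
```

In this model an operation whose request was dropped is indistinguishable at close time from one that never had a request. Point-to-point, collective, and file operations have no counterpart to `toy_epoch_close` short of MPI_FINALIZE, which is the asymmetry the argument above rests on.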
>>>>>>
>>>>>> The issue with copying parameters is also moot because all of them are passed-by-value (implicitly copied) or are data-buffers and covered by “conflicting accesses” RMA rules.
>>>>>>
>>>>>> Thus, it seems to me that RMA is a very special case - it could support different semantics, but that does not provide a good basis for claiming that the rest of the MPI Standard can support those different semantics - unless we introduce an epoch concept into the rest of the MPI Standard. This is not unreasonable: the notifications in GASPI, for example, guarantee completion of not just the operation they are attached to but *all* operations issued in the “queue” they represent since the last notification. Their queue concept serves the purpose of an epoch. I’m sure there are other examples in other APIs. It seems to me likely that the proposal for MPI_PSYNC for partitioned communication operations is moving in the direction of an epoch, although limited to remote completion of all the partitions in a single operation, which accidentally guarantees that the operation can be freed locally using a local procedure.
>>>>>>
>>>>>> Cheers,
>>>>>> Dan.
>>>>>>
>>>>>>> On 13 Aug 2020, at 01:40, Skjellum, Anthony via mpi-forum <mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>> wrote:
>>>>>>>
>>>>>>> FYI, one argument (also used to force us to add restrictions on MPI persistent collective initialization to be blocking)... The MPI_Request_free on an NBC poses a problem for the cases where there are array types posed (e.g., Alltoallv/w)... It will not be knowable to the application if the vectors are in use by MPI still after the free on an active request. We do *not* mandate that the MPI implementation copy such arrays currently, so they are effectively "held as unfreeable" by the MPI implementation till MPI_Finalize. The user cannot deallocate them in a correct program till after MPI_Finalize.
>>>>>>>
>>>>>>> Another effect for NBC of releasing an active request, IMHO, is that you don't know when send buffers are free to be deallocated or receive buffers are free to be deallocated... since you don't know when the transfer is complete OR the buffers are no longer used by MPI (till after MPI_Finalize).
>>>>>>>
>>>>>>> Tony
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Anthony Skjellum, PhD
>>>>>>> Professor of Computer Science and Chair of Excellence
>>>>>>> Director, SimCenter
>>>>>>> University of Tennessee at Chattanooga (UTC)
>>>>>>> tony-skjellum at utc.edu <mailto:tony-skjellum at utc.edu> [or skjellum at gmail.com <mailto:skjellum at gmail.com>]
>>>>>>> cell: 205-807-4968
>>>>>>>
>>>>>>> From: mpi-forum <mpi-forum-bounces at lists.mpi-forum.org <mailto:mpi-forum-bounces at lists.mpi-forum.org>> on behalf of Jeff Hammond via mpi-forum <mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>>
>>>>>>> Sent: Saturday, August 8, 2020 12:07 PM
>>>>>>> To: Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>>
>>>>>>> Cc: Jeff Hammond <jeff.science at gmail.com <mailto:jeff.science at gmail.com>>
>>>>>>> Subject: Re: [Mpi-forum] MPI_Request_free restrictions
>>>>>>>
>>>>>>> We should fix the RMA chapter with an erratum. I care less about NBC but share your ignorance of why it was done that way.
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Aug 8, 2020, at 6:51 AM, Balaji, Pavan via mpi-forum <mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>> wrote:
>>>>>>>>
>>>>>>>> Folks,
>>>>>>>>
>>>>>>>> Does someone remember why we disallowed users from calling MPI_Request_free on nonblocking collective requests? I remember the reasoning for not allowing cancel (i.e., the operation might have completed on some processes, but not all), but not for Request_free. AFAICT, allowing the users to free the request doesn’t make any difference to the MPI library. The MPI library would simply maintain its own refcount to the request and continue forward till the operation completes. One of our users would like to free NBC requests so they don’t have to wait for the operation to complete in some situations.
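The reference-counting idea mentioned above can be sketched as a toy model in plain C. This is an assumption-laden illustration, not how any particular MPI library is implemented: `toy_request`, `toy_start`, `toy_request_free`, and `toy_library_complete` are hypothetical names. The user and the library each hold one reference; freeing the user's handle drops one, and the object is deallocated only when the library's progress engine also lets go after completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a library-internal request object (not a real MPI type). */
typedef struct {
    int refcount;     /* one reference for the user, one for the library */
    bool complete;
} toy_request;

static int toy_live_requests = 0;   /* objects allocated and not yet freed */

/* Drop one reference; deallocate when the last reference goes away. */
static void toy_unref(toy_request *req)
{
    assert(req != NULL && req->refcount > 0);
    if (--req->refcount == 0) {
        toy_live_requests--;
        free(req);
    }
}

/* Start an operation: the user and the library each hold a reference.
 * Returns NULL on allocation failure. */
toy_request *toy_start(void)
{
    toy_request *req = malloc(sizeof *req);
    if (req == NULL)
        return NULL;
    req->refcount = 2;
    req->complete = false;
    toy_live_requests++;
    return req;
}

/* User side, analogous to MPI_Request_free: give up the handle immediately,
 * without waiting for the operation to complete. */
void toy_request_free(toy_request **user_handle)
{
    assert(user_handle != NULL && *user_handle != NULL);
    toy_request *req = *user_handle;
    *user_handle = NULL;          /* plays the role of MPI_REQUEST_NULL */
    toy_unref(req);
}

/* Library side: the progress engine observes completion and drops the
 * library's own reference. */
void toy_library_complete(toy_request *internal)
{
    assert(internal != NULL);
    internal->complete = true;
    toy_unref(internal);
}
```

The model covers only the lifetime of the request object. It says nothing about when the user may reuse or deallocate buffers and argument arrays, which is the separate concern raised in the reply quoted earlier in this message.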
>>>>>>>>
>>>>>>>> Unfortunately, when I added the Rput/Rget operations in the RMA chapter, I copy-pasted that text into RMA as well without thinking too hard about it. My bad! Either the RMA committee missed it too, or they thought of a reason that I can’t think of now.
>>>>>>>>
>>>>>>>> Can someone clarify or remind me what the reason was?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> — Pavan
>>>>>>>>
>>>>>>>> MPI-3.1 standard, page 197, lines 26-27:
>>>>>>>>
>>>>>>>> “It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request associated with a nonblocking collective operation.”
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> mpi-forum mailing list
>>>>>>>> mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>
>>>>>>>> https://lists.mpi-forum.org/mailman/listinfo/mpi-forum <https://lists.mpi-forum.org/mailman/listinfo/mpi-forum>