[Mpi-forum] MPI_Request_free restrictions

Jim Dinan james.dinan at gmail.com
Fri Aug 14 12:40:10 CDT 2020


The overhead of cleanup doesn't go away; the MPI runtime would need to
create a similar cleanup list and process it. It looks to me like the
performance problem might actually be caused by the Ibarrier not making
asynchronous progress when application stuff is happening.

 ~Jim.

On Fri, Aug 14, 2020 at 12:33 PM Quincey Koziol via mpi-forum <
mpi-forum at lists.mpi-forum.org> wrote:

> Hi Dan,
> I believe that Pavan was referring to my conversation with him about
> MPI_Request_free.  Here’s my situation: I’d like to use MPI_Ibarrier as a
> form of “memory fence” between some of the metadata reads and writes in
> HDF5.   Here’s some [very] simplified pseudocode for what I’d like to do:
>
> ===============================
>
> <open HDF5 file>   // sets up a communicator for internal HDF5
> communication about this file
>
> do {
>     MPI_Ibarrier(<file’s communicator>, &request);
>
>     <application stuff>
>
>     // HDF5 operation:
>     if(<operation is read or write>) {
>         MPI_Wait(&request, MPI_STATUS_IGNORE);
>         <perform read / write>
>     }
>     else {  // operation is a file close
>         MPI_Request_free(&request);
>         MPI_File_close(…);
>         MPI_Comm_free(<file’s communicator>);
>     }
> } while (<file is open>);
>
> ===============================
>
> What I am really trying to avoid is calling MPI_Wait at file close, since
> it is semantically unnecessary and only increases the latency from the
> application’s perspective.   If I can’t call MPI_Request_free on a
> nonblocking collective operation’s request (and it looks like I can’t,
> right now), I will have to put the request and file’s communicator into a
> “cleanup” list that is polled periodically [on each rank] with MPI_Test and
> disposed of when the nonblocking barrier completes locally.
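>
> A minimal sketch of what I mean by that cleanup list (C bindings; the
> struct and function names here are just illustrative, not actual HDF5
> internals):
>
> ===============================
>
> #include <stdlib.h>
> #include <mpi.h>
>
> /* One entry per "freed" nonblocking barrier: the pending request plus the
>  * file's communicator, which must stay alive until the barrier completes. */
> typedef struct cleanup_entry {
>     MPI_Request           request;
>     MPI_Comm              comm;
>     struct cleanup_entry *next;
> } cleanup_entry_t;
>
> /* Polled periodically on each rank; releases entries whose barrier has
>  * completed locally. */
> void cleanup_poll(cleanup_entry_t **head) {
>     cleanup_entry_t **cur = head;
>     while (*cur) {
>         int done = 0;
>         MPI_Test(&(*cur)->request, &done, MPI_STATUS_IGNORE);
>         if (done) {
>             cleanup_entry_t *entry = *cur;
>             MPI_Comm_free(&entry->comm);
>             *cur = entry->next;
>             free(entry);
>         } else {
>             cur = &(*cur)->next;
>         }
>     }
> }
>
> ===============================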
>
> So, I’d really like to be able to call MPI_Request_free on the nonblocking
> barrier’s request.
>
> Thoughts?
>
> Quincey
>
>
> On Aug 13, 2020, at 9:07 AM, HOLMES Daniel via mpi-forum <
> mpi-forum at lists.mpi-forum.org> wrote:
>
> Hi Jim,
>
> To be clear, I think that MPI_CANCEL is evil and should be removed from
> the MPI Standard entirely at the earliest convenience.
>
> I am certainly not arguing that it be permitted for more MPI operations.
>
> I thought the discussion was focused on MPI_REQUEST_FREE and whether or
> not it can/should be used on an active request.
>
> If a particular MPI implementation does not keep a reference to the
> request between MPI_RPUT and MPI_REQUEST_FREE, but needs that reference to
> process the completion event, then that MPI implementation would be
> required to keep a reference to the request from MPI_REQUEST_FREE until
> that important task had been done, perhaps until the close epoch call. This
> requires no new memory because the user is giving up their reference to the
> request, so MPI can safely use the request it is passed in MPI_REQUEST_FREE
> without copying it. As you say, MPI takes over the responsibility for
> processing the completion event.
>
> Your question about why the implementation should be required to take on
> this complexity is a good one. That, I guess, is why freeing any active
> request is a bad idea. MPI is required to differentiate completion of
> individual operations (so it can implement MPI_WAIT) but that means
> something must process completion at some point for each individual
> operation. In RMA, that responsibility can be discharged earlier than in
> other parts of the MPI interface, but the real question is “why should MPI
> offer to take on this responsibility in the first place?”
>
> Thanks, that helps (me at least).
>
> Cheers,
> Dan.
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.holmes at epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh,
> EH8 9BT
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
> On 13 Aug 2020, at 14:43, Jim Dinan <james.dinan at gmail.com> wrote:
>
> The two cases you mentioned would have the same behavior at an application
> level. However, there may be important differences in the implementation of
> each operation. For example, an MPI_Put operation may be configured to not
> generate a completion event, whereas an MPI_Rput would. The library may be
> relying on the user to make a call on the request to process the event and
> clean up resources. The implementation can take over this responsibility if
> the user cancels the request, but why should we ask implementers to take on
> this complexity and overhead?
>
> My $0.02 is that MPI_Cancel is subtle and complicated, and we should be
> very careful about where we allow it. I don't see the benefit to the
> programming model outweighing the complexity and overhead in the MPI
> runtime for the case of MPI_Rput. I also don't know that we were careful
> enough in specifying the RMA memory model that a canceled request-based RMA
> operation will still have well-defined behavior. My understanding is that
> MPI_Cancel is required primarily for canceling receive requests to meet
> MPI's quiescent shutdown requirement.
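>
> The canonical case is a receive that will never be matched, e.g. (a sketch;
> TAG, comm, and the shutdown condition are placeholders):
>
> MPI_Request req;
> MPI_Status  status;
> int         cancelled;
> char        buf[64];
>
> MPI_Irecv(buf, sizeof(buf), MPI_CHAR, MPI_ANY_SOURCE, TAG, comm, &req);
> /* ... decide to shut down; no matching send will ever be posted ... */
> MPI_Cancel(&req);
> MPI_Wait(&req, &status);                  /* completes the cancelled request */
> MPI_Test_cancelled(&status, &cancelled);  /* nonzero: nothing was received */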
>
>  ~Jim.
>
> On Thu, Aug 13, 2020 at 8:11 AM HOLMES Daniel via mpi-forum <
> mpi-forum at lists.mpi-forum.org> wrote:
>
>> Hi all,
>>
>> To increase my own understanding of RMA, what is the difference (if any)
>> between a request-based RMA operation whose request is freed without being
>> completed and before the epoch is closed, and a “normal” RMA operation?
>>
>> MPI_LOCK() ! or any other "open epoch at origin" procedure call
>> doUserWorkBefore()
>> MPI_RPUT(&req)
>> MPI_REQUEST_FREE(&req)
>> doUserWorkAfter()
>> MPI_UNLOCK() ! or the matching “close epoch at origin" procedure call
>>
>> vs:
>>
>> MPI_LOCK() ! or any other "open epoch at origin" procedure call
>> doUserWorkBefore()
>> MPI_PUT()
>> doUserWorkAfter()
>> MPI_UNLOCK() ! or the matching “close epoch at origin" procedure call
>>
>> Is this a source-to-source translation that is always safe in either
>> direction?
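>>
>> Rendered in concrete C as a sketch (assuming passive-target synchronisation
>> to a single target rank, and hypothetically that the free were permitted;
>> buf, count, disp, target and win are placeholders):
>>
>> /* Request-based operation, request freed before the epoch is closed. */
>> MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
>> doUserWorkBefore();
>> MPI_Request req;
>> MPI_Rput(buf, count, MPI_DOUBLE, target, disp, count, MPI_DOUBLE, win, &req);
>> MPI_Request_free(&req);      /* give up the handle without completing it */
>> doUserWorkAfter();
>> MPI_Win_unlock(target, win); /* close epoch: completes the put anyway */
>>
>> /* "Normal" operation, no request at all. */
>> MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
>> doUserWorkBefore();
>> MPI_Put(buf, count, MPI_DOUBLE, target, disp, count, MPI_DOUBLE, win);
>> doUserWorkAfter();
>> MPI_Win_unlock(target, win);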
>>
>> In RMA, in contrast to the rest of MPI, there are two opportunities for
>> MPI to “block” and do non-local work to complete an RMA operation: 1)
>> during MPI_WAIT for the request (if any - the user may not be given a
>> request, may choose to free the request without calling MPI_WAIT, or may
>> only ever call the nonblocking MPI_TEST) and 2) during the close-epoch
>> procedure, which is always permitted to be sufficiently non-local to
>> guarantee that the RMA operation is complete and its freeing stage has been
>> done. It seems that a request-based RMA operation becomes identical to a
>> “normal” RMA operation if the user calls MPI_REQUEST_FREE on the request.
>> This is like “freeing” the request from a nonblocking point-to-point
>> operation, but without the guarantee of a later synchronisation procedure
>> that can actually complete the operation and do its freeing stage.
>>
>> In collectives, there is no “ensure all operations so far are now done”
>> procedure call because there is no concept of epoch for collectives.
>> In point-to-point, there is no “ensure all operations so far are now
>> done” procedure call because there is no concept of epoch
>> for point-to-point.
>> In file operations, there is no “ensure all operations so far are now
>> done” procedure call because there is no concept of epoch for file
>> operations. (There is MPI_FILE_SYNC but it is optional so MPI cannot rely
>> on it being called.)
>> In these cases, the only non-local procedure that is guaranteed to happen
>> is MPI_FINALIZE, hence all outstanding non-local work needed by the “freed”
>> operation might be delayed until that procedure is called.
>>
>> The issue with copying parameters is also moot because all of them are
>> passed-by-value (implicitly copied) or are data-buffers and covered by
>> “conflicting accesses” RMA rules.
>>
>> Thus, it seems to me that RMA is a very special case - it could
>> support different semantics, but that does not provide a good basis for
>> claiming that the rest of the MPI Standard can support those different
>> semantics - unless we introduce an epoch concept into the rest of the MPI
>> Standard. This is not unreasonable: the notifications in GASPI, for
>> example, guarantee completion of not just the operation they are attached
>> to but *all* operations issued in the “queue” they represent since the last
>> notification. Their queue concept serves the purpose of an epoch. I’m sure
>> there are other examples in other APIs. It seems to me likely that the
>> proposal for MPI_PSYNC for partitioned communication operations is moving
>> in the direction of an epoch, although limited to remote completion of all
>> the partitions in a single operation, which accidentally guarantees that
>> the operation can be freed locally using a local procedure.
>>
>> Cheers,
>> Dan.
>> Dr Daniel Holmes PhD
>> Architect (HPC Research)
>> d.holmes at epcc.ed.ac.uk
>> Phone: +44 (0) 131 651 3465
>> Mobile: +44 (0) 7940 524 088
>> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh,
>> EH8 9BT
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
>>
>> On 13 Aug 2020, at 01:40, Skjellum, Anthony via mpi-forum <
>> mpi-forum at lists.mpi-forum.org> wrote:
>>
>> FYI, one argument (also used to force us to add the restriction that MPI
>> persistent collective initialization be blocking)... MPI_Request_free on
>> an NBC poses a problem for the cases where array arguments are passed
>> (e.g., Alltoallv/w)... It will not be knowable to the application whether
>> those arrays are still in use by MPI after the free on an active request.
>> We do *not* currently mandate that the MPI implementation copy such
>> arrays, so they are effectively "held as unfreeable" by the MPI
>> implementation till MPI_Finalize.  The user cannot deallocate them in a
>> correct program till after MPI_Finalize.
>>
>> Another effect of releasing an active NBC request, IMHO, is that you
>> don't know when the send or receive buffers are free to be deallocated...
>> since you don't know when the transfer is complete OR when the buffers are
>> no longer used by MPI (till after MPI_Finalize).
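>>
>> As a concrete sketch of the hazard (C bindings; assume the usual headers
>> and that sendbuf, recvbuf, comm, and nprocs are set up earlier, and
>> hypothetically that the free were legal):
>>
>> int *sendcounts = malloc(nprocs * sizeof(int));
>> int *sdispls    = malloc(nprocs * sizeof(int));
>> int *recvcounts = malloc(nprocs * sizeof(int));
>> int *rdispls    = malloc(nprocs * sizeof(int));
>> /* ... fill in counts and displacements ... */
>>
>> MPI_Request req;
>> MPI_Ialltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
>>                recvbuf, recvcounts, rdispls, MPI_INT, comm, &req);
>> MPI_Request_free(&req);   /* hypothetical: erroneous in MPI-3.1 */
>>
>> /* Unsafe: MPI may still be reading the count/displacement arrays (and the
>>  * data buffers), and nothing tells the application when it has stopped
>>  * before MPI_Finalize. */
>> free(sendcounts);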
>>
>> Tony
>>
>>
>>
>>
>> Anthony Skjellum, PhD
>> Professor of Computer Science and Chair of Excellence
>> Director, SimCenter
>> University of Tennessee at Chattanooga (UTC)
>> tony-skjellum at utc.edu  [or skjellum at gmail.com]
>> cell: 205-807-4968
>>
>> ------------------------------
>> *From:* mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of
>> Jeff Hammond via mpi-forum <mpi-forum at lists.mpi-forum.org>
>> *Sent:* Saturday, August 8, 2020 12:07 PM
>> *To:* Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
>> *Cc:* Jeff Hammond <jeff.science at gmail.com>
>> *Subject:* Re: [Mpi-forum] MPI_Request_free restrictions
>>
>> We should fix the RMA chapter with an erratum. I care less about NBC but
>> share your ignorance of why it was done that way.
>>
>> Sent from my iPhone
>>
>> On Aug 8, 2020, at 6:51 AM, Balaji, Pavan via mpi-forum <
>> mpi-forum at lists.mpi-forum.org> wrote:
>>
>>  Folks,
>>
>> Does someone remember why we disallowed users from calling
>> MPI_Request_free on nonblocking collective requests?  I remember the
>> reasoning for not allowing cancel (i.e., the operation might have completed
>> on some processes, but not all), but not for Request_free.  AFAICT,
>> allowing the users to free the request doesn’t make any difference to the
>> MPI library.  The MPI library would simply maintain its own refcount to the
>> request and continue forward till the operation completes.  One of our
>> users would like to free NBC requests so they don’t have to wait for the
>> operation to complete in some situations.
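>>
>> Roughly what I have in mind by the refcount (a hypothetical internal
>> sketch, not any particular implementation; assume <stdlib.h>, and
>> release() here just frees the object):
>>
>> /* Internal NBC request object: refcount starts at 2, one reference for
>>  * the user handle and one for the progress engine. */
>> typedef struct nbc_request {
>>     int refcount;
>>     int complete;
>>     /* ... collective schedule, temporary buffers, etc. ... */
>> } nbc_request_t;
>>
>> static void release(nbc_request_t *req) { free(req); }
>>
>> /* User calls MPI_Request_free on an active NBC request: drop only the
>>  * user's reference; the operation keeps progressing. */
>> void user_request_free(nbc_request_t *req) {
>>     if (--req->refcount == 0)
>>         release(req);   /* only happens if the operation already completed */
>> }
>>
>> /* Progress engine, when the collective finishes: */
>> void nbc_completed(nbc_request_t *req) {
>>     req->complete = 1;
>>     if (--req->refcount == 0)
>>         release(req);   /* user had already freed their handle */
>> }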
>>
>> Unfortunately, when I added the Rput/Rget operations in the RMA chapter,
>> I copy-pasted that text into RMA as well without thinking too hard about
>> it.  My bad!  Either the RMA committee missed it too, or they thought of a
>> reason that I can’t think of now.
>>
>> Can someone clarify or remind me what the reason was?
>>
>> Regards,
>>
>>   — Pavan
>>
>> MPI-3.1 standard, page 197, lines 26-27:
>>
>> “It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request
>> associated with a nonblocking collective operation.”
>>