[Mpi-forum] MPI "Allocate receive" proposal

David Goodell (dgoodell) dgoodell at cisco.com
Mon Aug 26 14:45:00 CDT 2013

On Aug 26, 2013, at 2:25 PM, "Underwood, Keith D" <keith.d.underwood at intel.com> wrote:

>> As per the use case, at least ADLB and Charm++ will benefit from this.
>> They don't know what message size will come in.  But I'm sure there are
>> more.
> But, do they know *approximately* the size of the message that is expected?  Because, if they do, then most of the advantage isn't there.  I struggle a little bit with the idea of well-coded apps that have so little idea about the size of the message that they need this.

Remember that the proposed Arecv interface solves a few problems:

1) There is currently no good way to receive both totally unknown-size messages and known-size (or at least reasonably bounded-size) messages while still driving your messaging from a single MPI_Wait*/MPI_Test* call.  The Arecv model fits nicely with these existing completion routines.

2) Unexpected messages tend to introduce an extra copy of some or all of the data, depending on the message size.  This copy can be eliminated with a pre-posted Arecv.

3) For larger messages of unexpected size, we go all the way back up the stack to wait for the user to give us a buffer, then go back down to send a CTS back (possibly with buffer/key/etc. info) to the sender.  Arecv eliminates the need to rendezvous all the way up to the application.

The primary motivation for this proposal was some code I was writing with client-server semantics, where the requests from the client could not be neatly bounded.  Furthermore, I wanted to drive the application from an MPI_Waitsome loop instead of an {MPI_Testsome;MPI_Improbe} loop (i.e., problem #1).  The fact that #2 and #3 happen to be solved by this solution is pure gravy from my point of view.

>> My concern is not whether someone can use it, but rather how much
>> the MPI implementation can do without the sender knowing whether
>> the receiver is going to receive the message using RECV or ARECV.  But
>> maybe with Iarecv, it won't be a problem.
> I'm more concerned about calling malloc() in the library.  At the end of the day, even with the measly int-limited max message size, you will *have* to malloc in that path.
> I am also wondering if there is a better way to do this by exposing the traditional rendezvous?  Sure, if you are under the rendezvous threshold, toss the eager receive buffer up.  If you are above the threshold, toss it up to the library and provide a way to do a "pull" of the rest of the data.  You could even go crazy and push an opaque handle out at the receiver to use in the pull.  Given the implementation implications of this concept, my bias would be to not hide reality from the user.  We should really only abstract the things for which the MPI library can do optimizations on the user's behalf.  If we exposed an efficient way for the user to do rendezvous themselves, that may be better.

I welcome your specific proposal :)

Some sort of Mprobe/Improbe which can go into an existing MPI completion routine would be fine, but I couldn't come up with semantics that I liked for that.  There are also all sorts of active-message-like solutions that you could come up with, but I don't personally want to fight to standardize what you can or cannot do in such a callback just so that I can get the more modest functionality that I actually desire.

