[Mpi3-hybridpm] Fwd: MPI shared memory allocation issues

Dave Goodell goodell at mcs.anl.gov
Mon Mar 28 13:28:54 CDT 2011


Forwarding this mail to a broader audience, based on our discussion here at the March 2011 MPI Forum meeting.  There was additional correspondence on this thread that I can forward as needed, but this forwarded mail contains my core argument against the original (allocate+free+fence) proposal.

-Dave

Begin forwarded message:

> From: Dave Goodell <goodell at mcs.anl.gov>
> Date: February 24, 2011 11:18:20 AM CST
> To: Ron Brightwell <rbbrigh at sandia.gov>, Douglas Miller <dougmill at us.ibm.com>, "Bronis R. de Supinski" <bronis at llnl.gov>, Jim Dinan <dinan at mcs.anl.gov>, Pavan Balaji <balaji at mcs.anl.gov>, Marc Snir <snir at illinois.edu>
> Subject: MPI shared memory allocation issues
> 
> I voiced concerns at the last MPI Forum meeting about the proposed MPI extensions for allocating shared memory.  In particular, I was concerned about "MPI_Shm_fence".  Pavan asked me to write up a quick email to this group in order to help clarify my view in the discussion; this is that email.  Please widen the distribution list as appropriate; I just mailed the addresses that Pavan indicated to me.  FYI, I am not currently subscribed to the mpi3-hybrid list.
> 
> First, I consider multithreaded programming within an OS process and multiprocess programming using a shared memory region to be essentially the same problem.  There are probably alternative interpretations of the words "process" and "thread" that could muddy the picture here, but for the sake of clarity, let's use the conventional meanings for the moment.  Also, I am not interested in discussing distributed shared memory (DSM) here; I think that bringing it up just confuses the discussion further.  My primary objections to the proposal are valid entirely within a discussion of conventional shared memory, processes, and threads.
> 
> Given that preface, I believe that many, if not all, of the issues raised by Boehm's paper, "Threads Cannot Be Implemented As a Library" [1], apply here.  In particular, some variation on the example from section 4.3 is probably an issue, but the others seem to apply as well.  The performance example is also relevant here, but in an even more dramatic fashion given the dearth of synchronization primitives offered by the proposal.
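> 
> To make the flavor of the problem concrete, here is a rough C sketch of the register-promotion hazard (my paraphrase of the kind of example in section 4.3 of [1], not Boehm's exact code; the lock functions are hypothetical stand-ins for whatever the user would build on top of the shared region):
> 
>     /* All names here are illustrative. */
>     extern int mt;                /* might another process touch *x?      */
>     extern int *x;                /* points into the shared memory region */
>     extern void shm_lock(void);   /* user-built lock; opaque to compiler  */
>     extern void shm_unlock(void);
> 
>     void update(int n)
>     {
>         for (int i = 0; i < n; i++) {
>             if (mt) shm_lock();
>             *x += i;              /* shared update, meant to be protected */
>             if (mt) shm_unlock();
>         }
>     }
> 
>     /* A compiler that promotes *x to a register for the duration of the
>      * loop may spill and reload it around the (conditional) opaque calls,
>      * introducing loads and stores of *x at points where the lock is not
>      * held.  Nothing in C99 forbids that transformation, and nothing MPI
>      * can say about the allocation call forbids it either. */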
> 
> I do not believe that we can specify a way to program the provided shared memory that is robust and useful to the user, because C and Fortran do not give us enough of a specification to do so.  Without getting into the business of compiler writing, MPI has no way to give the user any meaningful guarantees.  Just as Boehm noted about pthreads, we can probably come up with an approach that will work most of the time.  But that's a pretty flimsy guarantee for a standard like MPI.
> 
> Even if you set aside the difficulty of specifying an interface that can actually be used correctly, another issue arises.  The only proposed synchronization mechanism virtually guarantees that, at best, the user can use the allocated shared memory region to share data that is written once and otherwise read-only.  Any other shared memory programming technique will either be non-portable (e.g., using pthread mutexes or calls/macros from some atomic operations library) or limited to potentially slow, dark-ages techniques such as Dekker's algorithm with excessive MPI_Shm_fence-ing.  So does this proposal really empower the user in any meaningful way?
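> 
> For concreteness, here is roughly what that dark-ages path would look like for one of the two processes.  I am assuming a no-argument signature for MPI_Shm_fence purely for illustration, and the variables are assumed to live in the allocated shared region:
> 
>     /* Dekker-style mutual exclusion for process 0; process 1 is symmetric,
>      * with the roles of want0/want1 swapped and the "turn" values flipped. */
>     volatile int *want0, *want1, *turn;   /* all in the shared region */
> 
>     void p0_lock(void)
>     {
>         *want0 = 1;
>         MPI_Shm_fence();                  /* publish intent before reading want1 */
>         while (*want1) {
>             if (*turn != 0) {
>                 *want0 = 0;               /* back off until it is our turn */
>                 MPI_Shm_fence();
>                 while (*turn != 0)
>                     MPI_Shm_fence();      /* spin, fencing on every re-read */
>                 *want0 = 1;
>                 MPI_Shm_fence();
>             }
>             MPI_Shm_fence();
>         }
>     }
> 
>     void p0_unlock(void)
>     {
>         *turn = 1;                        /* hand the turn to process 1 */
>         *want0 = 0;
>         MPI_Shm_fence();
>     }
> 
>     /* Every access to shared state needs a fence near it to have any hope
>      * of correctness, and even then the volatile qualifiers and the fence
>      * are doing work that C itself does not promise.  That is the
>      * "excessive MPI_Shm_fence-ing" I mean above. */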
> 
> I don't see a compelling advantage to putting this into MPI as opposed to providing it as a third-party library on top of MPI.  Sure, it's easy to implement the allocate/free calls inside MPI because the machinery is typically there.  But a third-party library could escape some of the extremely generic portability constraints of the MPI standard and could therefore provide a more robust interface to the user.  A discussion of DSM might make putting it into MPI more compelling, because access to the network hardware might be involved, but I'm not particularly interested in having that discussion right now.  I think that MPI-3 RMA would probably be more suitable for that use case.
> 
> -Dave
> 
> [1] http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf
> 
