[Mpi3-hybridpm] Fwd: MPI shared memory allocation issues
Torsten Hoefler
htor at illinois.edu
Sun Feb 12 10:01:24 CST 2012
Dave,
On Mon, Mar 28, 2011 at 01:28:48PM -0500, Dave Goodell wrote:
> Forwarding this mail to a broader audience, based on our discussion
> here at the March 2011 MPI Forum meeting. There was additional
> correspondence on this thread that I can forward as needed, but this
> forwarded mail contains my core argument against the original
> (allocate+free+fence) proposal.
Thanks! This is an important discussion. Let me recap below what we
discussed in the RMA group when moving towards the newer integrated
allocate_shared.
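For reference, a minimal sketch of the intended usage (the names assume
the allocate_shared/shared_query interface as currently drafted in the
RMA chapter; error handling omitted):

  #include <mpi.h>

  /* nodecomm: a communicator whose processes can share memory */
  void allocate_shared_example(MPI_Comm nodecomm)
  {
      MPI_Win   win;
      double   *mybase, *base0;
      MPI_Aint  segsize;
      int       dispunit;

      /* each process contributes one double to the shared segment */
      MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                              MPI_INFO_NULL, nodecomm, &mybase, &win);

      /* obtain a load/store-usable pointer to rank 0's portion */
      MPI_Win_shared_query(win, 0, &segsize, &dispunit, &base0);

      /* ... direct loads/stores through mybase and base0, under the
         window's synchronization rules ... */

      MPI_Win_free(&win);
  }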
> Begin forwarded message:
>
> > From: Dave Goodell <goodell at mcs.anl.gov>
> > Date: February 24, 2011 11:18:20 AM CST
> > To: Ron Brightwell <rbbrigh at sandia.gov>, Douglas Miller <dougmill at us.ibm.com>, "Bronis R. de Supinski" <bronis at llnl.gov>, Jim Dinan <dinan at mcs.anl.gov>, Pavan Balaji <balaji at mcs.anl.gov>, Marc Snir <snir at illinois.edu>
> > Subject: MPI shared memory allocation issues
> >
> > I voiced concerns at the last MPI forum meeting about the proposed
> > MPI extensions for allocating shared memory. In particular I was
> > concerned about "MPI_Shm_fence". Pavan asked me to write up a quick
> > email to this group in order to help clarify my view in the
> > discussion; this is that email. Please widen the distribution list
> > as appropriate; I just mailed the addresses that Pavan indicated to
> > me. FYI, I am not currently subscribed to the mpi3-hybrid list.
> >
> > First, I view multithreaded programming within an OS process and
> > multiprocess programming using a shared memory region to be
> > essentially the same problem. There are probably alternative
> > interpretations of the words "process" and "thread" that could muddy
> > the picture here, but for the sake of clarity, let's use the
> > conventional meanings for the moment. Also, I am not interested in
> > discussing distributed shared memory (DSM) here, I think that
> > bringing it up just confuses the discussion further. My primary
> > objections to the proposal are valid entirely within a discussion of
> > conventional shared memory, processes, and threads.
> >
> > Given that preface, I believe that many, if not all, of the issues
> > raised by Boehm's paper, "Threads Cannot Be Implemented As a
> > Library" [1], apply here. In particular, some variation on the
> > example from section 4.3 is probably an issue, but the others seem
> > to apply as well. The performance example is also relevant here,
> > but in an even more dramatic fashion given the dearth of
> > synchronization primitives offered by the proposal.
> >
> > I do not believe that we can specify a way to program the provided
> > shared memory in any way that is robust and useful to the user,
> > because C and Fortran do not give us enough of a specification in
> > order to do so. Without getting into the business of compiler
> > writing, MPI has no way to give the user any meaningful guarantees.
> > Just as Boehm noted about pthreads, we can probably come up with an
> > approach that will work most of the time. But that's a pretty
> > flimsy guarantee for a standard like MPI.
Yes, as Boehm points out, serial compiler optimizations can have very
bad effects on concurrently running code that accesses nearby memory.
However, as you point out, the situation is equivalent to what we have
today in pthreads, and the proposal does not claim anything more; it says
"The consistency of load/store accesses from/to the shared memory as
observed by the user program depends on the architecture." We can extend
this to include the compiler and maybe reference Boehm's paper (I would
see this as a ticket 0 change).
I agree with the general sentiment that it is impossible to implement
shared memory semantics in a language that doesn't even have a real
memory model. However, I want to remind us that the behavior of Fortran
was *never* 100% correct in MPI <= 2.2 (and we rely on the Fortran TR
for MPI-3.0). Nevertheless, Fortran/MPI programs are ubiquitous :-).
But let's discuss Boehm's identified correctness issues here:
* 4.1 Concurrent modification
This is only an issue if users rely on the consistency of the underlying
hardware; if they use Win_flush and friends (as advised), such a
reordering would be illegal (see the sketch after this list). One
downside is that this will only work in C; Fortran will probably have
all kinds of wacky problems with code movement, as usual (however,
Fortran users should be in a position to fix this with the new bindings
and the Fortran TR).
* 4.2 Rewriting of Adjacent Data
This applies to the unified window as well, where we simply specify the
byte granularity of updates (an architecture could work with larger
chunks (e.g., words) and cause the same trouble). So this issue is not
limited to the shared memory window, especially when fast remote memory
access hardware is used. Here we face the general trade-off between fast
hardware access and safety (securing it through a software layer). We
decided that byte-consistency is something we can expect from vendors.
Also, the vendor library is always free to return MPI_ERR_RMA_SHARED
(just as it can always choose not to offer the unified memory model).
* 4.3 Register Promotion
While this is certainly a problem with threads, it would not be one for
MPI windows, because accesses have to dereference the queried address of
the window memory, which prevents register promotion. Copying the data
into a faster memory region would also do no harm, because the remote
side has to query the addresses anyway. Again, restrictions may apply
for Fortran codes.
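To make the 4.1 and 4.3 arguments concrete, here is a sketch of the
usage we advise (the synchronization calls are taken from the merged RMA
proposal; treat this as an illustration under that assumption, not as
normative text):

  #include <mpi.h>

  /* shmem points into the process-local part of the shared window;
     peer is notified once the store is visible */
  void expose_value(MPI_Win win, double *shmem, int peer, MPI_Comm comm)
  {
      MPI_Win_lock_all(0, win);   /* open a passive-target access epoch */

      shmem[0] = 42.0;            /* direct store to shared memory */
      MPI_Win_sync(win);          /* memory barrier; the MPI call is also
                                     opaque to the compiler, so stores
                                     cannot be reordered across it and
                                     shmem[0] cannot stay in a register */

      /* notify the peer through MPI instead of spinning on a shared flag */
      MPI_Send(NULL, 0, MPI_BYTE, peer, 0, comm);

      MPI_Win_unlock_all(win);
  }

The receiver would post the matching MPI_Recv, call MPI_Win_sync inside
its own epoch, and then read the shared location directly.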
> > If you ignore the difficulty in specifying an interface that can
> > actually be used correctly, then another issue arises. The only
> > proposed synchronization mechanism virtually guarantees that the
> > user can at best utilize the allocated shared memory region to share
> > data that is written once and otherwise read-only. Any other shared
> > memory programming techniques are either going to be non-portable
> > (e.g., using pthread mutexes or calls/macros from some atomic
> > operations library), or they will be limited to potentially slow
> > dark-ages techniques such as Dekker's Algorithm with excessive
> > MPI_Shm_fence-ing. So does this proposal really empower the user in
> > any meaningful way?
I agree. This should be addressed by the merge into the RMA context,
which offers all of the required functionality (we avoided memory locks
on purpose because they are evil).
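For example, instead of Dekker-style flag spinning over MPI_Shm_fence,
the merged interface gives the user RMA atomics. A sketch (assuming
MPI_Fetch_and_op from the RMA proposal) of an atomic ticket counter kept
in the shared window:

  #include <mpi.h>

  /* return the next ticket from a long counter stored at displacement 0
     of owner_rank's window portion */
  long next_ticket(MPI_Win win, int owner_rank)
  {
      long one = 1, ticket;

      MPI_Win_lock(MPI_LOCK_SHARED, owner_rank, 0, win);
      MPI_Fetch_and_op(&one, &ticket, MPI_LONG, owner_rank, 0,
                       MPI_SUM, win);   /* atomic fetch-and-add */
      MPI_Win_unlock(owner_rank, win);

      return ticket;
  }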
> > I don't see a compelling advantage to putting this into MPI as
> > opposed to providing this as some third-party library on top of MPI.
> > Sure, it's easy to implement the allocate/free calls inside of MPI
> > because the machinery is typically there. But a third-party library
> > would be able to escape some of the extremely generic portability
> > constraints of the MPI standard and would therefore be able to
> > provide a more robust interface to the user. A discussion of DSM
> > might make putting it into MPI more compelling because access to the
> > network hardware might be involved, but I'm not particularly
> > interested in having that discussion right now. I think that MPI-3
> > RMA would probably be more suitable for that use case.
First, I am against DSM. Second, I believe that it may be very valuable
to have this kind of functionality in MPI because virtually all
large-scale codes have to become hybrid. The main gain is the associated
memory savings (on-node communication with MPI is often sufficiently
fast). I believe the current practice of mixing OpenMP and MPI to
achieve this simple goal may be suboptimal (OpenMP supports only the
"shared everything" (threaded) model and thus enables a whole new class
of bugs and races).
All the Best,
Torsten
--
bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
Torsten Hoefler | Performance Modeling and Simulation Lead
Blue Waters Directorate | University of Illinois (UIUC)
1205 W Clark Street | Urbana, IL, 61801
NCSA Building | +01 (217) 244-7736