[mpiwg-rma] Short question on the ccNUMA memory reality

Rolf Rabenseifner rabenseifner at hlrs.de
Tue Aug 5 10:30:17 CDT 2014


Bronis and all,

yes, given all that I now know, I expect that you are right.

My proposal for Bill G. was based on guarantees from rank 
to rank only with synchronizations in between.

Thanks for all the answers.
Rolf

----- Original Message -----
> From: "Bronis R. de Supinski" <bronis at llnl.gov>
> To: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
> Cc: "Bill Long" <longb at cray.com>
> Sent: Tuesday, August 5, 2014 5:00:15 PM
> Subject: Re: [mpiwg-rma] Short question on the ccNUMA memory reality
> 
> 
> Rolf:
> 
> Re:
> > thank you for your answer.
> > I was not precise enough with my question.
> > My notations should represent store and load instructions,
> > i.e., assembler level, i.e., what's going on hardware level.
> 
> The "level" is not really relevant. As Pavan noted, unless you use
> atomic instructions, you get no guarantees. With ordinary
> loads and stores, your snippets all contain race conditions.
> The exact result will depend on the system but as a rule,
> you can assume that a program with races provides NO guarantees.
> 
> Sure, you can eliminate the values coming from registers, but
> they can come from the cache, and you are not provided any
> general atomicity guarantees. Most hardware will provide
> bit-level atomicity on ordinary loads and stores but if
> you straddle a cache line, you lose those guarantees.
> 
> Overall, the issues that arise are more complex than you
> want to deal with. That's why we have compilers for threaded
> programs. Simply put, shared memory programming without compiler
> assistance is not something ordinary programmers should do.
> 
> Bronis
> 
> 
> > x should be part of an MPI shared memory allocated with
> > MPI_Win_allocate_shared, i.e., the assembler on rank 0
> > does two stores and the assembler on rank 1 does four loads.
> >
> > The code was
> >> rank 0     rank 1
> >> _______    print x
> >> x=val_1    print x
> >> x=val_2    print x
> >> _______    print x
> >
> > If I understand Bill Long correctly, then the four loads of x
> > can end up seeing
> >> old_val
> >> val_2   (because it was going through a very slow network path)
> >> val_1
> >> val_2
> > whereas three interleaved load instructions on rank 0
> > (i.e., in the same execution stream (thread) as the two stores)
> > will always see
> >  old_val
> >  val_1
> >  val_2
> > 
> > This is independent of compilers, because I only want to
> > look at the assembler level.
> >
> > Bill Long, are you sure? I would expect that all loads go through
> > the
> > 1st Level Cache and as soon as it sees val_2 it should not be
> > possible to see with a later issued instruction val_1.
> >
> > Best regards
> > Rolf
> >
> >
> > ----- Original Message -----
> >> From: "Bill Long" <longb at cray.com>
> >> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> >> Cc: "MPI WG Remote Memory Access working group"
> >> <mpiwg-rma at lists.mpi-forum.org>
> >> Sent: Tuesday, August 5, 2014 3:12:58 PM
> >> Subject: Re: Short question on the ccNUMA memory reality
> >> 
> >> Hi Rolf,
> >> 
> >> 
> >> I assume you are expecting answers from people like Pavan and Bill G.
> >> for the MPI RMA perspective, and for the Fortran rules from me.
> >> (For Fortran, map Rank 0 -> Image 1 and Rank 1 -> Image 2).
> >> 
> >> On Aug 5, 2014, at 2:33 AM, Rolf Rabenseifner
> >> <rabenseifner at hlrs.de>
> >> wrote:
> >> 
> >> > Dear expert on ccNUMA,
> >> > 
> >> > three questions, which hopefully may be trivial:
> >> > 
> >> > 1. Question (sequential consistency on one location):
> >> > ------------
> >> > 
> >> > Do I understand correctly that in the following pattern
> >> > on a shared memory or a ccNUMA shared memory
> >> 
> >> I assume you mean what I think of as distributed memory here.
> >> Otherwise, these are questions about OpenMP, and the rank 0 /
> >> rank 1 separation does not make sense.
> >> 
> >> > 
> >> > rank 0     rank 1
> >> >           print x
> >> > x=val_1    print x
> >> > x=val_2    print x
> >> >           print x
> >> 
> >> This depends on whether x is declared in a way that makes it
> >> accessible from a remote rank.  If not, then the code is illegal.
> >> So
> >> I’ll assume it is accessible.
> >> 
> >> Since there is no synchronization between rank 0 and rank 1,
> >> compiler
> >> elimination of the x = val_1 assignment is allowed (and likely
> >> expected).
> >> 
> >> Even if the assignment is not eliminated, the print x on rank 1
> >> involves a “get” of x from rank 0.  Assuming it is declared
> >> volatile, to eliminate the “snap to local temp” optimization on
> >> rank
> >> 1, it is still possible to get the values out of order.  For
> >> example, the print #2 could get routed through the network via
> >> some
> >> extra long path, while print #3 could get routed on the fastest
> >> path, resulting in the two “get” operations appearing at rank 0 in
> >> an unpredictable order.
> >> 
> >> [Note that by the Fortran rules, the code is not legal anyway.  If
> >> you define a variable on one rank and reference it (the print
> >> counts
> >> as a reference) on another rank the execution segments containing
> >> the definition and reference have to be ordered unless the
> >> definition and reference are both explicitly atomic.  Segment
> >> ordering is done with synchronization statements.]
> >> 
> >> 
> >> > 
> >> > the print statements can print only in the following
> >> > sequence
> >> > - some times the previous value
> >> > - some times val_1
> >> > - and after some time val_2 and it then stays to print val_2
> >> > 
> >> > and that it can never be that a sequence with val_2 before val_1
> >> > can be produced, i.e.,
> >> >  old_val
> >> >  val_2
> >> >  val_1
> >> >  val_2
> >> > is impossible.
> >> > 
> >> > Also other values are impossible, e.g., some bit or byte-mix
> >> > from val_1 and val_2.
> >> 
> >> 
> >> On most hardware, assuming the values are “normal” size, such as
> >> 64-bit, you will not get mixed bits.  However, if you want to
> >> ensure
> >> that, use the atomic get and put intrinsics.  If the accesses had
> >> been properly ordered by synchronizations, then you do not need
> >> the
> >> atomics to ensure whole values.
> >> 
> >> 
> >> > 
> >> > 2. Question:
> >> > -----------
> >> > What is the largest size that the memory operations are atomic,
> >> > i.e., that we do not see a bit or byte-mix from val_1 and val_2?
> >> > Is it 1, 4, 8, 16 bytes or can it be a total struct that fits
> >> > into a cacheline?
> >> 
> >> 
> >> Since the types in Fortran are parameterized, the processor supplies
> >> a KIND value for the integer and logical types for which atomic
> >> operations are supported.  Typically the kind corresponds to 64 bits,
> >> but could be 32 on some architectures.
> >> 
> >> 
> >> > 
> >> > 3. Question (about two updates):
> >> > -----------
> >> > 
> >> > rank 0       rank 1
> >> > x=xold
> >> > y=yold
> >> > ---- necessary synchronizations -----
> >> >             print x (which shows xold)
> >> >             print y (which shows yold)
> >> > ---- necessary synchronizations -----
> >> > x=xnew
> >> > y=ynew
> >> >             print x
> >> >             print y
> >> >             after some time
> >> >             print x
> >> >             print y
> >> > 
> >> > Possible results are
> >> > - xold,yold  xold,yold  xnew,ynew
> >> > - xold,yold  xnew,yold  xnew,ynew
> >> > - xold,yold  xold,ynew  xnew,ynew
> >> >   i.e., the y=ynew can arrive at another process
> >> >         faster than the x=xnew, although the storing
> >> >         process issues the stores in the sequence
> >> >         x=xnew, y=ynew.
> >> > - xold,yold  xnew,ynew  xnew,ynew
> >> > 
> >> > The assignments should represent the store instructions,
> >> > and not the source code (because the compiler may modify
> >> > sequence of instructions compared to the source code)
> >> > 
> >> > Do I understand correctly that the sequence of two
> >> > store instructions to two different locations in one process
> >> > may be visible at another process in a different sequence?
> >> 
> >> Correct.  As before, the “get” operations implied in the print
> >> statements could arrive out of order.  Also, while unlikely, the
> >> compiler could decide to reverse the order of the assignments
> >> since
> >> they are independent. (Assuming x and y cannot be aliased by, for
> >> example, being pointers.)
> >> 
> >> Cheers,
> >> Bill
> >> 
> >> > 
> >> > I ask all these questions to understand which memory model
> >> > can be defined for MPI shared memory windows.
> >> > 
> >> > Best regards
> >> > Rolf
> >> > 
> >> > --
> >> > Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> >> > High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> >> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> >> > Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> >> > Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)
> >> 
> >> Bill Long                                    longb at cray.com
> >> Fortran Technical Support  &                 voice: 651-605-9024
> >> Bioinformatics Software Development          fax:   651-605-9142
> >> Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101
> >> 
> >> 
> >
> >
> > ----- Original Message -----
> >> From: "William Gropp" <wgropp at illinois.edu>
> >> To: "MPI WG Remote Memory Access working group"
> >> <mpiwg-rma at lists.mpi-forum.org>
> >> Cc: "Bill Long" <longb at cray.com>
> >> Sent: Tuesday, August 5, 2014 2:38:07 PM
> >> Subject: Re: [mpiwg-rma] Short question on the ccNUMA memory
> >> reality
> >> 
> >> 
> >> Question 1 is hard to answer.  First, as it came out in my output,
> >> the first and last “print x” straddle the two ranks, so I’m not
> >> sure
> >> what that means.  But the following is valid for the compiler:
> >> 
> >> 
> >> On rank 0: since x = val_2 follows x = val_1 with no intervening use
> >> of x, a good optimizing compiler may eliminate the store to x with
> >> val_1 as unnecessary.
> >> 
> >> 
> >> On rank 1: since x is never assigned in this block, a compiler is
> >> permitted to make a copy in a register (which would have the value x
> >> had before this block was entered) and then print that value each
> >> time print is called (e.g., it could build the call stack in
> >> memory, and just pass that unchanged call stack to print for each
> >> call).
> >> 
> >> 
> >> And don’t forget - there’s a difference between when the print
> >> completes on a process/thread and when the output appears - the
> >> system could flush all print output from rank 1 before any print
> >> output from rank 0 appears.
> >> 
> >> 
> >> If x is volatile, then the compiler isn’t free to do the second
> >> (save
> >> x and never load it again).  
> >> 
> >> 
> >> Bill
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> William Gropp
> >> Director, Parallel Computing Institute Thomas M. Siebel Chair in
> >> Computer Science
> >> 
> >> 
> >> University of Illinois Urbana-Champaign
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> On Aug 5, 2014, at 2:53 AM, Balaji, Pavan < balaji at anl.gov >
> >> wrote:
> >> 
> >> 
> >> Huh?
> >> 
> >> 
> >> 
> >> 1. Question (sequential consistency on one location):
> >> ------------
> >> 
> >> Do I understand correctly that in the following pattern
> >> on a shared memory or a ccNUMA shared memory
> >> 
> >> rank 0     rank 1
> >>        print x
> >> x=val_1    print x
> >> x=val_2    print x
> >>        print x
> >> 
> >> the print statements can print only in the following
> >> sequence  
> >> - some times the previous value
> >> - some times val_1
> >> - and after some time val_2 and it then stays to print val_2
> >> 
> >> and that it can never be that a sequence with val_2 before val_1  
> >> can be produced, i.e.,
> >> old_val
> >> val_2
> >> val_1
> >> val_2
> >> is impossible.
> >> 
> >> 
> >> 
> >> _______________________________________________
> >> mpiwg-rma mailing list
> >> mpiwg-rma at lists.mpi-forum.org
> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma
> >

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)


