[mpiwg-rma] Short question on the ccNUMA memory reality

Tue Aug 5 09:48:10 CDT 2014

Bill & Bill,

thank you for your answer.
My question was not precise enough.
My notation is meant to represent store and load instructions,
i.e., the assembler level, i.e., what happens at the hardware level.
x is part of an MPI shared memory window allocated with
MPI_Win_allocate_shared, i.e., the assembler code on rank 0
does two stores and the assembler code on rank 1 does four loads.

The code was
> rank 0     rank 1
> _______    print x
> x=val_1    print x
> x=val_2    print x
> _______    print x

If I understand Bill Long correctly, the four load(x) instructions
on rank 1 can then return
> old_val
> val_2   (because it was going through a very slow network path)
> val_1
> val_2
whereas three interleaved load instructions on rank 0
(i.e., in the same execution stream (thread) as the two stores)
will always see
  old_val
  val_1
  val_2

This is independent of the compiler, because I only want to
look at the assembler level.

Bill Long, are you sure? I would expect all loads to go through the
first-level cache, so that once a load has seen val_2, a load
issued later can no longer see val_1.
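
To make the expectation concrete, here is a minimal sketch of the
pattern. It rests on assumptions: two POSIX threads stand in for
rank 0 and rank 1, and a C11 atomic accessed with relaxed ordering
stands in for the word x in the shared window. The names (x,
store_twice, load_four_times, count_coherence_violations) are
hypothetical. C11 guarantees per-location coherence for atomic
objects even with relaxed ordering, so the reader may never step
back from val_2 to val_1. Whether plain loads and stores to an MPI
shared memory window give the same guarantee is exactly the open
question.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical values, ordered so that "older" < "newer". */
enum { OLD_VAL = 0, VAL_1 = 1, VAL_2 = 2 };

/* Assumption: stands in for one word x of the shared window. */
static _Atomic int x;

/* "Rank 0": two stores to x in program order. */
static void *store_twice(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, VAL_1, memory_order_relaxed);
    atomic_store_explicit(&x, VAL_2, memory_order_relaxed);
    return NULL;
}

/* "Rank 1": four loads of x; flags any step back to an older value. */
static void *load_four_times(void *arg)
{
    int *went_backwards = arg;
    int prev = OLD_VAL;
    for (int i = 0; i < 4; ++i) {
        int cur = atomic_load_explicit(&x, memory_order_relaxed);
        if (cur < prev)
            *went_backwards = 1;   /* e.g. val_2 followed by val_1 */
        prev = cur;
    }
    return NULL;
}

/* Runs the pattern `rounds` times; returns how often a load went backwards. */
int count_coherence_violations(int rounds)
{
    int violations = 0;
    for (int r = 0; r < rounds; ++r) {
        pthread_t writer, reader;
        int went_backwards = 0;
        int rc;

        atomic_store(&x, OLD_VAL);   /* reset before both threads start */
        rc = pthread_create(&reader, NULL, load_four_times, &went_backwards);
        assert(rc == 0);
        rc = pthread_create(&writer, NULL, store_twice, NULL);
        assert(rc == 0);
        pthread_join(writer, NULL);
        pthread_join(reader, NULL);
        (void)rc;
        violations += went_backwards;
    }
    return violations;
}
```

Called from a main function, e.g. as count_coherence_violations(100000),
and built with something like "cc -std=c11 -pthread", this returns 0 on
a conforming implementation. That is a statement about C11 atomics, not
about what the MPI standard promises for shared memory windows.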

Best regards
Rolf

----- Original Message -----
> From: "Bill Long" <longb at cray.com>
> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> Cc: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
> Sent: Tuesday, August 5, 2014 3:12:58 PM
> Subject: Re: Short question on the ccNUMA memory reality
> 
> Hi Rolf,
> 
> 
> I assume you are expecting answers from people like Pavan and Bill G
> for the MPI RMA perspective, and for the Fortran rules from me. (For
> Fortran, map Rank 0 -> image 1 and Rank 1 -> Image 2).
> 
> On Aug 5, 2014, at 2:33 AM, Rolf Rabenseifner <rabenseifner at hlrs.de>
> wrote:
> 
> > Dear expert on ccNUMA,
> > 
> > three questions, which hopefully may be trivial:
> > 
> > 1. Question (sequential consistency on one location):
> > ------------
> > 
> > Do I understand correctly that in the following pattern
> > on a shared memory or a ccNUMA shared memory
> 
> I assume you mean what I think of as distributed memory here.
>  Otherwise, these are questions about OpenMP, and the rank 0 / rank
> 1 separation does not make sense.
> 
> > 
> > rank 0     rank 1
> >           print x
> > x=val_1    print x
> > x=val_2    print x
> >           print x
> 
> This depends on whether x is declared in a way that makes it
> accessible from a remote rank.  If not, then the code is illegal. So
> I’ll assume it is accessible.
> 
> Since there is no synchronization between rank 0 and rank 1, compiler
> elimination of the x = val_1 assignment is allowed (and likely
> expected).
> 
> Even if the assignment is not eliminated, the print x on rank 1
> involves a “get” of x from rank 0.  Assuming it is declared
> volatile, to eliminate the “snap to local temp” optimization on rank
> 1, it is still possible to get the values out of order.  For
> example, the print #2 could get routed through the network via some
> extra long path, while print #3 could get routed on the fastest
> path, resulting in the two “get” operations appearing at rank 0 in
> an unpredictable order.
> 
> [Note that by the Fortran rules, the code is not legal anyway.  If
> you define a variable on one rank and reference it (the print counts
> as a reference) on another rank the execution segments containing
> the definition and reference have to be ordered unless the
> definition and reference are both explicitly atomic.  Segment
> ordering is done with synchronization statements.]
> 
> 
> > 
> > the print statements can print only in the following
> > sequence
> > - sometimes the previous value
> > - sometimes val_1
> > - and after some time val_2, and from then on only val_2
> > 
> > and that it can never be that a sequence with val_2 before val_1
> > can be produced, i.e.,
> >  old_val
> >  val_2
> >  val_1
> >  val_2
> > is impossible.
> > 
> > Also other values are impossible, e.g., some bit or byte-mix
> > from val_1 and val_2.
> 
> 
> On most hardware, assuming the values are “normal” size, such as
> 64-bit, you will not get mixed bits.  However, if you want to ensure
> that, use the atomic get and put intrinsics.  If the accesses had
> been properly ordered by synchronizations, then you do not need the
> atomics to ensure whole values.
> 
> 
> > 
> > 2. Question:
> > -----------
> > What is the largest size that the memory operations are atomic,
> > i.e., that we do not see a bit or byte-mix from val_1 and val_2?
> > Is it 1, 4, 8, 16 bytes or can it be a total struct that fits
> > into a cacheline?
> 
> 
> Since the types in Fortran are parameterized, the processor supplies
> a KIND value for the integer and logical types for which atomic
> operations are supported.  Typically this kind corresponds to 64 bits,
> but it could be 32 on some architectures.
> 
> 
> > 
> > 3. Question (about two updates):
> > -----------
> > 
> > rank 0       rank 1
> > x=xold
> > y=yold
> > ---- necessary synchronizations -----
> >             print x (which shows xold)
> >             print y (which shows yold)
> > ---- necessary synchronizations -----
> > x=xnew
> > y=ynew
> >             print x
> >             print y
> >             after some time
> >             print x
> >             print y
> > 
> > Possible results are
> > - xold,yold  xold,yold  xnew,ynew
> > - xold,yold  xnew,yold  xnew,ynew
> > - xold,yold  xold,ynew  xnew,ynew
> >   i.e., the y=ynew can arrive at another process
> >         faster than the x=xnew, although the storing
> >         process issues the stores in the sequence
> >         x=xnew, y=ynew.
> > - xold,yold  xnew,ynew  xnew,ynew
> > 
> > The assignments should represent the store instructions,
> > and not the source code (because the compiler may modify
> > sequence of instructions compared to the source code)
> > 
> > Do I understand correctly that the sequence of two
> > store instructions to two different locations in one process
> > may be visible at another process in a different sequence?
> 
> Correct.  As before, the “get” operations implied in the print
> statements could arrive out of order.  Also, while unlikely, the
> compiler could decide to reverse the order of the assignments since
> they are independent. (Assuming x and y cannot be aliased by, for
> example, being pointers.)
> 
> Cheers,
> Bill
> 
> > 
> > I ask all these questions to understand which memory model
> > can be defined for MPI shared memory windows.
> > 
> > Best regards
> > Rolf
> > 
> > --
> > Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> > High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> > Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> > Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)
> 
> Bill Long                              longb at cray.com
> Fortran Technical Support &            voice: 651-605-9024
> Bioinformatics Software Development    fax:   651-605-9142
> Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101
> 
> 
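
For question 3, the two-location case can be sketched at the language
level as follows. This is only an illustration under assumptions:
POSIX threads stand in for the two ranks, C11 atomics stand in for x
and y, and all names are hypothetical. If both stores and both loads
were relaxed, C11 would allow the reader to see the new y together
with the old x. Making the store to y a release and the load of y an
acquire forbids that outcome. Note that the reader loads y before x,
the reverse of the print order in the question, because a reader that
loads x first can obtain (xold, ynew) purely through timing, without
any reordering of the stores.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { OLD = 0, NEW = 1 };

/* Assumption: stand in for two words x and y of the shared window. */
static _Atomic int x, y;

/* "Rank 0": store x, then publish with a release store to y. */
static void *store_x_then_y(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, NEW, memory_order_relaxed);
    atomic_store_explicit(&y, NEW, memory_order_release);
    return NULL;
}

/* "Rank 1": acquire-load y first, then load x. */
static void *load_y_then_x(void *arg)
{
    int *saw_y_without_x = arg;
    int seen_y = atomic_load_explicit(&y, memory_order_acquire);
    int seen_x = atomic_load_explicit(&x, memory_order_relaxed);
    if (seen_y == NEW && seen_x == OLD)
        *saw_y_without_x = 1;   /* stores observed out of order */
    return NULL;
}

/* Runs the pattern `rounds` times; counts out-of-order observations. */
int count_reordered_observations(int rounds)
{
    int reordered = 0;
    for (int r = 0; r < rounds; ++r) {
        pthread_t writer, reader;
        int saw_y_without_x = 0;
        int rc;

        atomic_store(&x, OLD);   /* reset before both threads start */
        atomic_store(&y, OLD);
        rc = pthread_create(&reader, NULL, load_y_then_x, &saw_y_without_x);
        assert(rc == 0);
        rc = pthread_create(&writer, NULL, store_x_then_y, NULL);
        assert(rc == 0);
        pthread_join(writer, NULL);
        pthread_join(reader, NULL);
        (void)rc;
        reordered += saw_y_without_x;
    }
    return reordered;
}
```

With the release/acquire pair in place, count_reordered_observations
returns 0 on a conforming implementation. Replacing both orderings on
y with memory_order_relaxed removes that guarantee, although on
strongly ordered hardware the reordering may still never show up in
practice.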

----- Original Message -----
> From: "William Gropp" <wgropp at illinois.edu>
> To: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
> Cc: "Bill Long" <longb at cray.com>
> Sent: Tuesday, August 5, 2014 2:38:07 PM
> Subject: Re: [mpiwg-rma] Short question on the ccNUMA memory reality
> 
> 
> Question 1 is hard to answer.  First, as it came out in my output,
> the first and last “print x” straddle the two ranks, so I’m not sure
> what that means.  But the following is valid for the compiler:
> 
> 
> On rank 0, since x = val_2 follows x = val_1 with no intervening use
> of x, a good optimizing compiler may eliminate the store to x with
> val_1 as unnecessary.
> 
> 
> On rank 1, since x is never assigned in this block, a compiler is
> permitted to make a copy in a register (which would have the value x
> had before this block was entered) and then print that value each
> time print is called (e.g., it could build the call stack in
> memory, and just pass that unchanged call stack to print for each
> call).
> 
> 
> And don’t forget - there’s a difference between when the print
> completes on a process/thread and when the output appears - the
> system could flush all print output from rank 1 before any print
> output from rank 0 appears.
> 
> 
> If x is volatile, then the compiler isn’t free to do the second (save
> x and never load it again).  
> 
> 
> Bill
> 
> William Gropp
> Director, Parallel Computing Institute
> Thomas M. Siebel Chair in Computer Science
> University of Illinois Urbana-Champaign
> 
> On Aug 5, 2014, at 2:53 AM, Balaji, Pavan < balaji at anl.gov > wrote:
> 
> 
> Huh?
> 
> 
> 
> 1. Question (sequential consistency on one location):
> ------------
> 
> Do I understand correctly that in the following pattern
> on a shared memory or a ccNUMA shared memory
> 
> rank 0     rank 1
>        print x
> x=val_1    print x
> x=val_2    print x
>        print x
> 
> the print statements can print only in the following
> sequence  
> - sometimes the previous value
> - sometimes val_1
> - and after some time val_2, and from then on only val_2
> 
> and that it can never be that a sequence with val_2 before val_1  
> can be produced, i.e.,
> old_val
> val_2
> val_1
> val_2
> is impossible.
> 
> 
> 
> _______________________________________________
> mpiwg-rma mailing list
> mpiwg-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)