[mpiwg-rma] Short question on the ccNUMA memory reality

Bronis R. de Supinski bronis at llnl.gov
Tue Aug 5 10:00:15 CDT 2014


Rolf:

Re:
> thank you for your answer.
> I was not precise enough with my question.
> My notations should represent store and load instructions,
> i.e., assembler level, i.e., what's going on at the hardware level.

The "level" is not really relevant. As Pavan noted, unless you use
atomic instructions, you get no guarantees. With ordinary
loads and stores, your snippets all contain race conditions.
The exact result will depend on the system but as a rule,
you can assume that a program with races provides NO guarantees.

Sure, you can eliminate the values coming from registers, but
they can come from the cache, and you do not get any
general atomicity guarantees. Most hardware will provide
word-level atomicity (no torn bits) for naturally aligned ordinary
loads and stores, but if a value straddles a cache line, you lose
even those guarantees.

Overall, the issues that arise are more complex than you
want to deal with. That's why we have compilers for threaded
programs. Simply put, shared memory programming without compiler
assistance is not something ordinary programmers should do.

Bronis


> x should be part of an MPI shared memory allocated with
> MPI_Win_allocate_shared, i.e., the assembler on rank 0 
> does two stores and the assembler on rank 1 does four loads.
>
> The code was
>> rank 0     rank 1
>> _______    print x
>> x=val_1    print x
>> x=val_2    print x
>> _______    print x
>
> When I understand Bill Long correctly, then the 4 load(x)
> end up in
>> old_val
>> val_2   (because this load's request took a very slow network path
>>          and sampled x only after both stores)
>> val_1
>> val_2
> whereas three interleaved load instructions on rank 0
> (i.e., in the same execution stream (thread) as the two stores)
> will always see
>  old_val
>  val_1
>  val_2
> 
> This is independent of compilers, because I only want to
> look at the assembler level.
>
> Bill Long, are you sure? I would expect that all loads go through the
> first-level cache, and as soon as it has seen val_2, it should not be
> possible for a later-issued load instruction to see val_1.
>
> Best regards
> Rolf
>
>
> ----- Original Message -----
>> From: "Bill Long" <longb at cray.com>
>> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
>> Cc: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
>> Sent: Tuesday, August 5, 2014 3:12:58 PM
>> Subject: Re: Short question on the ccNUMA memory reality
>> 
>> Hi Rolf,
>> 
>> 
>> I assume you are expecting answers from people like Pavan and Bill G
>> for the MPI RMA perspective, and for the Fortran rules from me. (For
>> Fortran, map Rank 0 -> image 1 and Rank 1 -> Image 2).
>> 
>> On Aug 5, 2014, at 2:33 AM, Rolf Rabenseifner <rabenseifner at hlrs.de>
>> wrote:
>> 
>> > Dear expert on ccNUMA,
>> > 
>> > three questions, which hopefully may be trivial:
>> > 
>> > 1. Question (sequential consistency on one location):
>> > ------------
>> > 
>> > Do I understand correctly that in the following pattern
>> > on a shared memory or a ccNUMA shared memory
>> 
>> I assume you mean what I think of as distributed memory here.
>> Otherwise, these are questions about OpenMP, and the rank 0 / rank
>> 1 separation does not make sense.
>> 
>> > 
>> > rank 0     rank 1
>> > _______    print x
>> > x=val_1    print x
>> > x=val_2    print x
>> > _______    print x
>> 
>> This depends on whether x is declared in a way that makes it
>> accessible from a remote rank.  If not, then the code is illegal. So
>> I’ll assume it is accessible.
>> 
>> Since there is no synchronization between rank 0 and rank 1, compiler
>> elimination of the x = val_1 assignment is allowed (and likely
>> expected).
>> 
>> Even if the assignment is not eliminated, the print x on rank 1
>> involves a “get” of x from rank 0.  Assuming it is declared
>> volatile, to eliminate the “snap to local temp” optimization on rank
>> 1, it is still possible to get the values out of order.  For
>> example, the print #2 could get routed through the network via some
>> extra long path, while print #3 could get routed on the fastest
>> path, resulting in the two “get” operations appearing at rank 0 in
>> an unpredictable order.
>> 
>> [Note that by the Fortran rules, the code is not legal anyway.  If
>> you define a variable on one rank and reference it (the print counts
>> as a reference) on another rank the execution segments containing
>> the definition and reference have to be ordered unless the
>> definition and reference are both explicitly atomic.  Segment
>> ordering is done with synchronization statements.]
>> 
>> 
>> > 
>> > the print statements can print only in the following
>> > sequence
>> > - sometimes the previous value
>> > - sometimes val_1
>> > - and after some time val_2, and from then on it keeps printing val_2
>> > 
>> > and that it can never be that a sequence with val_2 before val_1
>> > can be produced, i.e.,
>> >  old_val
>> >  val_2
>> >  val_1
>> >  val_2
>> > is impossible.
>> > 
>> > Also other values are impossible, e.g., some bit or byte-mix
>> > from val_1 and val_2.
>> 
>> 
>> On most hardware, assuming the values are “normal” size, such as
>> 64-bit, you will not get mixed bits.  However, if you want to ensure
>> that, use the atomic get and put intrinsics.  If the accesses had
>> been properly ordered by synchronizations, then you do not need the
>> atomics to ensure whole values.
>> 
>> 
>> > 
>> > 2. Question:
>> > -----------
>> > What is the largest size that the memory operations are atomic,
>> > i.e., that we do not see a bit or byte-mix from val_1 and val_2?
>> > Is it 1, 4, 8, 16 bytes or can it be a total struct that fits
>> > into a cacheline?
>> 
>> 
>> Since the types in Fortran are parameterized, the processor supplies
>> a KIND value for the integer and logical types for which atomic
>> operations are supported.  Typically kind corresponds to  64 bits,
>> but could be 32 on some architectures.
>> 
>> 
>> > 
>> > 3. Question (about two updates):
>> > -----------
>> > 
>> > rank 0       rank 1
>> > x=xold
>> > y=yold
>> > ---- necessary synchronizations -----
>> >             print x (which shows xold)
>> >             print y (which shows yold)
>> > ---- necessary synchronizations -----
>> > x=xnew
>> > y=ynew
>> >             print x
>> >             print y
>> >             after some time
>> >             print x
>> >             print y
>> > 
>> > Possible results are
>> > - xold,yold  xold,yold  xnew,ynew
>> > - xold,yold  xnew,yold  xnew,ynew
>> > - xold,yold  xold,ynew  xnew,ynew
>> >   i.e., the y=ynew can arrive at another process
>> >         faster than the x=xnew, although the storing
>> >         process issues the stores in the sequence
>> >         x=xnew, y=ynew.
>> > - xold,yold  xnew,ynew  xnew,ynew
>> > 
>> > The assignments should represent the store instructions,
>> > and not the source code (because the compiler may modify
>> > the sequence of instructions compared to the source code).
>> >
>> > Do I understand correctly that the sequence of two
>> > store instructions to two different locations in one process
>> > may be visible at another process in a different sequence?
>> 
>> Correct.  As before, the “get” operations implied in the print
>> statements could arrive out of order.  Also, while unlikely, the
>> compiler could decide to reverse the order of the assignments since
>> they are independent. (Assuming x and y cannot be aliased by, for
>> example, being pointers.)
>> 
>> Cheers,
>> Bill
>> 
>> > 
>> > I ask all these questions to understand which memory model
>> > can be defined for MPI shared memory windows.
>> > 
>> > Best regards
>> > Rolf
>> > 
>> > --
>> > Dr. Rolf Rabenseifner . . . . . . . . . .. email
>> > rabenseifner at hlrs.de
>> > High Performance Computing Center (HLRS) . phone
>> > ++49(0)711/685-65530
>> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 /
>> > 685-65832
>> > Head of Dpmt Parallel Computing . . .
>> > www.hlrs.de/people/rabenseifner
>> > Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room
>> > 1.307)
>> 
>> Bill Long                                  longb at cray.com
>> Fortran Technical Support &                voice: 651-605-9024
>> Bioinformatics Software Development        fax:   651-605-9142
>> Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101
>> 
>> 
>
>
> ----- Original Message -----
>> From: "William Gropp" <wgropp at illinois.edu>
>> To: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
>> Cc: "Bill Long" <longb at cray.com>
>> Sent: Tuesday, August 5, 2014 2:38:07 PM
>> Subject: Re: [mpiwg-rma] Short question on the ccNUMA memory reality
>> 
>> 
>> Question 1 is hard to answer.  First, as it came out in my output,
>> the first and last “print x” straddle the two ranks, so I’m not sure
>> what that means.  But the following is valid for the compiler:
>> 
>> 
>> On rank 0.  Since x = val_2 follows x = val_1 with no intervening use
>> of x, a good optimizing compiler may eliminate the store to x with
>> val_1 as unnecessary.
>> 
>> 
>> On rank 1, since x is never assigned in this block, a compiler is
>> permitted to make a copy in a register (which would have the value x
>> had before this block was entered) and then print that value each
>> time print is called (e.g., it could build the call stack in
>> memory, and just pass that unchanged call stack to print for each
>> call).
>> 
>> 
>> And don’t forget - there’s a difference between when the print
>> completes on a process/thread and when the output appears - the
>> system could flush all print output from rank 1 before any print
>> output from rank 0 appears.
>> 
>> 
>> If x is volatile, then the compiler isn’t free to do the second (save
>> x and never load it again).  
>> 
>> 
>> Bill
>> 
>> 
>> 
>> 
>> 
>> 
>> William Gropp
>> Director, Parallel Computing Institute Thomas M. Siebel Chair in
>> Computer Science
>> 
>> 
>> University of Illinois Urbana-Champaign
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Aug 5, 2014, at 2:53 AM, Balaji, Pavan <balaji at anl.gov> wrote:
>> 
>> 
>> Huh?
>> 
>> 
>> 
>> 1. Question (sequential consistency on one location):
>> ------------
>> 
>> Do I understand correctly that in the following pattern
>> on a shared memory or a ccNUMA shared memory
>> 
>> rank 0     rank 1
>> _______    print x
>> x=val_1    print x
>> x=val_2    print x
>> _______    print x
>> 
>> the print statements can print only in the following
>> sequence
>> - sometimes the previous value
>> - sometimes val_1
>> - and after some time val_2, and from then on it keeps printing val_2
>> 
>> and that it can never be that a sequence with val_2 before val_1  
>> can be produced, i.e.,
>> old_val
>> val_2
>> val_1
>> val_2
>> is impossible.
>> 
>> 
>> 
>> _______________________________________________
>> mpiwg-rma mailing list
>> mpiwg-rma at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma
>
> -- 
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)

