[mpiwg-rma] Short question on the ccNUMA memory reality
Rolf Rabenseifner
rabenseifner at hlrs.de
Tue Aug 5 09:48:10 CDT 2014
Bill & Bill,
thank you for your answer.
I was not precise enough with my question.
My notation is meant to represent store and load instructions,
i.e., the assembler level, i.e., what happens at the hardware level.
x is meant to be part of an MPI shared memory window allocated with
MPI_Win_allocate_shared, i.e., the assembler code on rank 0
performs two stores and the assembler code on rank 1 performs four loads.
The code was
> rank 0 rank 1
> _______ print x
> x=val_1 print x
> x=val_2 print x
> _______ print x
If I understand Bill Long correctly, the four load(x) instructions
on rank 1 can return
> old_val
> val_2 (because it was going through a very slow network path)
> val_1
> val_2
whereas three interleaved load instructions on rank 0
(i.e., in the same execution stream (thread) as the two stores)
will always see
old_val
val_1
val_2
This is independent of compilers, because I only want to
look at the assembler level.
Bill Long, are you sure? I would expect that all loads go through the
first-level cache and that, as soon as one load has seen val_2, a
later-issued load instruction cannot see val_1.
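
The ordering expectation above can be written down as a small model. The
following Python sketch is only an illustration under stated assumptions,
not a description of any real hardware: it assumes the three values are
distinct and that rank 0's stores define the order old_val, val_1, val_2,
and it calls a load sequence "coherent" if it never steps backwards in that
order. The names STORE_ORDER, is_coherent and coherent_sequences are
hypothetical.

```python
from itertools import product

# Assumed order in which x takes its values: the initial value,
# then the two stores issued by rank 0. Values must be distinct.
STORE_ORDER = ["old_val", "val_1", "val_2"]


def is_coherent(observed, store_order=STORE_ORDER):
    """Return True if successive loads never step backwards in store_order."""
    positions = [store_order.index(value) for value in observed]
    return all(a <= b for a, b in zip(positions, positions[1:]))


def coherent_sequences(num_loads=4, store_order=STORE_ORDER):
    """Enumerate every load sequence this model allows for num_loads loads."""
    return [seq for seq in product(store_order, repeat=num_loads)
            if is_coherent(seq, store_order)]
```

In this model, is_coherent(["old_val", "val_2", "val_1", "val_2"]) is False,
which is exactly the sequence in question. Whether a given machine or network
path actually enforces the model is the open point of this thread.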
Best regards
Rolf
----- Original Message -----
> From: "Bill Long" <longb at cray.com>
> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> Cc: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
> Sent: Tuesday, August 5, 2014 3:12:58 PM
> Subject: Re: Short question on the ccNUMA memory reality
>
> Hi Rolf,
>
>
> I assume you are expecting answers from people like Pavan and Bill G
> for the MPI RMA perspective, and for the Fortran rules from me. (For
> Fortran, map Rank 0 -> image 1 and Rank 1 -> Image 2).
>
> On Aug 5, 2014, at 2:33 AM, Rolf Rabenseifner <rabenseifner at hlrs.de>
> wrote:
>
> > Dear expert on ccNUMA,
> >
> > three questions, which hopefully may be trivial:
> >
> > 1. Question (sequential consistency on one location):
> > ------------
> >
> > Do I understand correctly that in the following pattern
> > on a shared memory or a ccNUMA shared memory
>
> I assume you mean what I think of as distributed memory here.
> Otherwise, these are questions about OpenMP, and the rank 0 / rank
> 1 separation does not make sense.
>
> >
> > rank 0 rank 1
> > print x
> > x=val_1 print x
> > x=val_2 print x
> > print x
>
> This depends on whether x is declared in a way that makes it
> accessible from a remote rank. If not, then the code is illegal. So
> I’ll assume it is accessible.
>
> Since there is no synchronization between rank 0 and rank 1, compiler
> elimination of the x = val_1 assignment is allowed (and likely
> expected).
>
> Even if the assignment is not eliminated, the print x on rank 1
> involves a “get” of x from rank 0. Assuming it is declared
> volatile, to eliminate the “snap to local temp” optimization on rank
> 1, it is still possible to get the values out of order. For
> example, the print #2 could get routed through the network via some
> extra long path, while print #3 could get routed on the fastest
> path, resulting in the two “get” operations appearing at rank 0 in
> an unpredictable order.
>
> [Note that by the Fortran rules, the code is not legal anyway. If
> you define a variable on one rank and reference it (the print counts
> as a reference) on another rank the execution segments containing
> the definition and reference have to be ordered unless the
> definition and reference are both explicitly atomic. Segment
> ordering is done with synchronization statements.]
>
>
> >
> > the print statements can print only in the following
> > sequence
> > - sometimes the previous value
> > - sometimes val_1
> > - and after some time val_2, which it then keeps printing
> >
> > and that it can never be that a sequence with val_2 before val_1
> > can be produced, i.e.,
> > old_val
> > val_2
> > val_1
> > val_2
> > is impossible.
> >
> > Also other values are impossible, e.g., some bit or byte-mix
> > from val_1 and val_2.
>
>
> On most hardware, assuming the values are “normal” size, such as
> 64-bit, you will not get mixed bits. However, if you want to ensure
> that, use the atomic get and put intrinsics. If the accesses had
> been properly ordered by synchronizations, then you do not need the
> atomics to ensure whole values.
>
>
> >
> > 2. Question:
> > -----------
> > What is the largest size that the memory operations are atomic,
> > i.e., that we do not see a bit or byte-mix from val_1 and val_2?
> > Is it 1, 4, 8, 16 bytes or can it be a total struct that fits
> > into a cacheline?
>
>
> Since the types in Fortran are parameterized, the processor supplies
> a KIND value for the integer and logical types for which atomic
> operations are supported. Typically kind corresponds to 64 bits,
> but could be 32 on some architectures.
>
>
> >
> > 3. Question (about two updates):
> > -----------
> >
> > rank 0 rank 1
> > x=xold
> > y=yold
> > ---- necessary synchronizations -----
> > print x (which shows xold)
> > print y (which shows yold)
> > ---- necessary synchronizations -----
> > x=xnew
> > y=ynew
> > print x
> > print y
> > after some time
> > print x
> > print y
> >
> > Possible results are
> > - xold,yold xold,yold xnew,ynew
> > - xold,yold xnew,yold xnew,ynew
> > - xold,yold xold,ynew xnew,ynew
> > i.e., the y=ynew can arrive at another process
> > faster than the x=xnew, although the storing
> > process issues the stores in the sequence
> > x=xnew, y=ynew.
> > - xold,yold xnew,ynew xnew,ynew
> >
> > The assignments should represent the store instructions,
> > and not the source code (because the compiler may modify
> > sequence of instructions compared to the source code)
> >
> > Do I understand correctly that two
> > store instructions to two different locations in one process
> > may become visible at another process in a different order?
>
> Correct. As before, the “get” operations implied in the print
> statements could arrive out of order. Also, while unlikely, the
> compiler could decide to reverse the order of the assignments since
> they are independent. (Assuming x and y cannot be aliased by, for
> example, being pointers.)
>
> Cheers,
> Bill
>
> >
> > I ask all these questions to understand which memory model
> > can be defined for MPI shared memory windows.
> >
> > Best regards
> > Rolf
> >
> > --
> > Dr. Rolf Rabenseifner . . . . . . . . . .. email
> > rabenseifner at hlrs.de
> > High Performance Computing Center (HLRS) . phone
> > ++49(0)711/685-65530
> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 /
> > 685-65832
> > Head of Dpmt Parallel Computing . . .
> > www.hlrs.de/people/rabenseifner
> > Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room
> > 1.307)
>
> Bill Long
> longb at cray.com
> Fortran Technical Suport & voice:
> 651-605-9024
> Bioinformatics Software Development fax:
> 651-605-9142
> Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101
>
>
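
Question 3 in the message above can also be explored with a toy model. The
Python sketch below enumerates all interleavings of rank 0's two stores with
two loads on rank 1 and collects the value pairs the reader can observe. The
assumptions are not from the thread: the reader loads y first and x second,
because with that load order the pair (ynew, xold) can appear only if the
two stores become visible out of order; the loads themselves are not
reordered; and outcomes and its parameter store_orders are hypothetical
names.

```python
from itertools import permutations


def outcomes(store_orders):
    """Collect every (y, x) pair the reader can observe in this toy model.

    store_orders lists the orders in which rank 0's two stores may become
    visible to rank 1, e.g. [("x", "y")] for in-order visibility only.
    Assumption: the reader always loads y first and x second.
    """
    seen = set()
    load_events = [("load", "y"), ("load", "x")]
    for stores in store_orders:
        store_events = [("store", var) for var in stores]
        for schedule in permutations(store_events + load_events):
            # Keep only schedules that preserve each side's own order.
            if [e for e in schedule if e[0] == "store"] != store_events:
                continue
            if [e for e in schedule if e[0] == "load"] != load_events:
                continue
            memory = {"x": "xold", "y": "yold"}
            result = {}
            for kind, var in schedule:
                if kind == "store":
                    memory[var] = var + "new"   # xold -> xnew, yold -> ynew
                else:
                    result[var] = memory[var]
            seen.add((result["y"], result["x"]))
    return seen
```

With outcomes([("x", "y")]) the pair ("ynew", "xold") never occurs; it shows
up once the reversed visibility order ("y", "x") is also permitted, which is
the situation the question asks about.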
----- Original Message -----
> From: "William Gropp" <wgropp at illinois.edu>
> To: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
> Cc: "Bill Long" <longb at cray.com>
> Sent: Tuesday, August 5, 2014 2:38:07 PM
> Subject: Re: [mpiwg-rma] Short question on the ccNUMA memory reality
>
>
> Question 1 is hard to answer. First, as it came out in my output,
> the first and last “print x” straddle the two ranks, so I’m not sure
> what that means. But the following is valid for the compiler:
>
>
> On rank 0. Since x = val_2 follows x = val_1 with no intervening use
> of x, a good optimizing compiler may eliminate the store to x with
> val_1 as unnecessary.
>
>
> On rank 1, Since x is never assigned in this block, a compiler is
> permitted to make a copy in a register (which would have the value x
> had before this block was entered) and then print that value each
> time print is called (e.g., it could build the call stack in
> memory, and just pass that unchanged call stack to print for each
> call).
>
>
> And don’t forget - there’s a difference between when the print
> completes on a process/thread and when the output appears - the
> system could flush all print output from rank 1 before any print
> output from rank 0 appears.
>
>
> If x is volatile, then the compiler isn’t free to do the second (save
> x and never load it again).
>
>
> Bill
>
>
>
>
>
>
> William Gropp
> Director, Parallel Computing Institute Thomas M. Siebel Chair in
> Computer Science
>
>
> University of Illinois Urbana-Champaign
>
>
>
>
>
>
>
> On Aug 5, 2014, at 2:53 AM, Balaji, Pavan < balaji at anl.gov > wrote:
>
>
> Huh?
>
>
>
> 1. Question (sequential consistency on one location):
> ------------
>
> Do I understand correctly that in the following pattern
> on a shared memory or a ccNUMA shared memory
>
> rank 0 rank 1
> print x
> x=val_1 print x
> x=val_2 print x
> print x
>
> the print statements can print only in the following
> sequence
> - sometimes the previous value
> - sometimes val_1
> - and after some time val_2, which it then keeps printing
>
> and that it can never be that a sequence with val_2 before val_1
> can be produced, i.e.,
> old_val
> val_2
> val_1
> val_2
> is impossible.
>
>
>
> _______________________________________________
> mpiwg-rma mailing list
> mpiwg-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma
--
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)