[Mpi-forum] Comment on Fortran WG5 ballot N1846

Sun Apr 3 12:53:38 CDT 2011

Bill, John, and all,

I want to explain the background of our comment from the MPI Forum:
why we consider ASYNCHRONOUS and VOLATILE to be quite different,
and why we want to extend ASYNCHRONOUS to MPI nonblocking accesses
***without*** giving ASYNCHRONOUS semantics similar to VOLATILE.

To that end, I have a few questions; I would like to check
whether the background of our comment is correct:

1. About the ASYNCHRONOUS attribute:
   I assume that the Fortran 2008 standard defines the outcome
   only if a user program is correct.
   If a program uses asynchronous Fortran I/O on a data structure
   buf (maybe an array or a derived type), and performs loads and
   stores on buf (as part of numerical operations) in a program
   unit where buf is declared as ASYNCHRONOUS, then the outcome
   of the program is defined only if the application is free
   of race conditions.
   Freedom from race conditions requires that stores are done
   only on those parts of buf that are not involved in any
   pending asynchronous Fortran I/O. Loads may be done only
   on those parts of buf that are not involved in pending
   asynchronous Fortran input.

   Is this correct?

2. A compiler can do optimizations based on the assumption
   that the program is free of race conditions.

   If the application violates this requirement, then the
   optimizations are still okay, because the outcome of the
   program is undefined anyway.

   Is this correct?

3. My third question is about such allowed optimizations:

   Let buf_read be the part of buf that is used for
   asynchronous Fortran input, buf_write the part that is used
   for asynchronous Fortran output, and buf_no_IO the part that
   is not used in any such I/O.

   The compiler can assume that all loads are in buf_no_IO
   and buf_write, and all stores only in buf_no_IO.

   All optimizations that modify the sequence of accesses are
   allowed as long as the numerical outcome is still
   correct. All intermediate caching in registers,
   local memories or memory caches, and reuse of the data
   from these caches, is allowed to minimize memory loads,
   e.g., blocked execution of loop nests to achieve
   optimal cache reuse.
   If the same location is written more than once,
   only the last value must be written to memory.
   None of these optimizations conflicts with
   the ongoing asynchronous I/O on buf, because they
   touch only buf_no_IO (with loads and stores) and
   buf_write (with loads only).

   Is this correct?
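
   To make the kind of optimization in 3. concrete, here is a minimal
   sketch of a blocked loop nest that touches only buf_no_IO. It is a
   hand-written illustration, not compiler output; the module name,
   the extent n, the block size bs, and the update formula are all
   hypothetical. Row buf(1,:) stands in for the part that is under
   asynchronous I/O, and buf(2:n,:) plays the role of buf_no_IO.

```fortran
module blocking_sketch
  implicit none
  integer, parameter :: n  = 8   ! hypothetical array extent
  integer, parameter :: bs = 4   ! hypothetical block size
contains

  ! Straightforward loop nest over the inner area buf(2:n,:).
  subroutine work_plain(buf)
    real, asynchronous, intent(inout) :: buf(n, n)
    integer :: i, j
    do j = 1, n
      do i = 2, n
        buf(i, j) = 2.0 * buf(i, j) + 1.0
      end do
    end do
  end subroutine work_plain

  ! Same computation with a reordered (blocked) access sequence.
  ! Every element of buf(2:n,:) is still updated exactly once, and
  ! buf(1,:) is never loaded or stored.
  subroutine work_blocked(buf)
    real, asynchronous, intent(inout) :: buf(n, n)
    integer :: i, j, ib, jb
    do jb = 1, n, bs
      do ib = 2, n, bs
        do j = jb, min(jb + bs - 1, n)
          do i = ib, min(ib + bs - 1, n)
            buf(i, j) = 2.0 * buf(i, j) + 1.0
          end do
        end do
      end do
    end do
  end subroutine work_blocked

end module blocking_sketch
```

   Both routines give the same values in buf(2:n,:) and leave buf(1,:)
   alone, which is why I believe such a reordering cannot conflict
   with pending asynchronous I/O on the boundary row.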

4. One type of optimization seems to be invalid:

   The optimization must not touch elements of buf that are not
   used in the numerical computation, because these elements may
   be part of buf_read or buf_write.
   For example, suppose some elements of buf would be temporarily
   overwritten to enable a loop fusion: the compiler copies the
   value of such an element into a scratch buffer, executes the
   fused loop, and then restores the element from the scratch
   value. Such an optimization is forbidden when buf is declared
   with the ASYNCHRONOUS attribute.

   Is this correct?

   (This is the example from our comment on your ballot,
   which we want to exclude through the use of the ASYNCHRONOUS
   attribute.)
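
   The following sketch emulates, in ordinary serial Fortran, what the
   save/restore transformation in 4. would do. It is only an
   illustration under stated assumptions: a real compiler would apply
   the transformation internally, and the asynchronous agent is
   replaced here by a plain subroutine call (the hypothetical
   incoming_data_arrives) placed in the middle of the loop. All names
   and the value 99.0 are made up for this example.

```fortran
module scratch_restore_sketch
  implicit none
  integer, parameter :: n = 4   ! hypothetical array extent
contains

  ! Stand-in for the asynchronous agent (e.g., data arriving for a
  ! pending receive): it deposits new data into the row buf(1,:).
  subroutine incoming_data_arrives(buf)
    real, intent(inout) :: buf(n, n)
    buf(1, :) = 99.0
  end subroutine incoming_data_arrives

  ! The program as written: work only on the inner area buf(2:n,:).
  subroutine work_as_written(buf)
    real, intent(inout) :: buf(n, n)
    integer :: i, j
    do j = 1, n
      do i = 2, n
        buf(i, j) = buf(i, j) + 1.0
      end do
      ! Arrival in the middle of the work is harmless here.
      if (j == n / 2) call incoming_data_arrives(buf)
    end do
  end subroutine work_as_written

  ! Emulation of the transformation: save buf(1,:) to a scratch
  ! buffer, run the loop over full columns, then restore buf(1,:).
  subroutine work_with_scratch(buf)
    real, intent(inout) :: buf(n, n)
    real :: scratch(n)
    integer :: i, j
    scratch = buf(1, :)            ! save the boundary row
    do j = 1, n
      do i = 1, n                  ! loop now also overwrites row 1
        buf(i, j) = buf(i, j) + 1.0
      end do
      if (j == n / 2) call incoming_data_arrives(buf)
    end do
    buf(1, :) = scratch            ! restore: received data is lost
  end subroutine work_with_scratch

end module scratch_restore_sketch
```

   After work_as_written, buf(1,:) holds the received data; after
   work_with_scratch, buf(1,:) holds the stale saved values, although
   the inner area is identical in both cases.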

5. If buf has neither the ASYNCHRONOUS nor the VOLATILE attribute,
   then the optimization in 4. would be valid as long as
   there are no other constraints.

   Correct? 

6. Now about VOLATILE, i.e., we declare buf as VOLATILE:
   All accesses to buf must be done directly from main memory.
   Usage of the registers or local memories inside of the CPU 
   or GPU for caching data and reusing it for several accesses
   is forbidden.

   Is this correct?

7. With VOLATILE, it is also not allowed to modify the sequence
   of accesses, because the phrase "most recent" in Fortran 2008,
   5.3.19 VOLATILE attribute, Note 5.25, also implies that
   the order of execution must not be changed.

   Is this correct?

8. The statements in 6. and 7. together mean that
   all the optimizations shown in 3. are forbidden.

   For me, these are the major differences between
   VOLATILE and ASYNCHRONOUS.

   Is this correct?
   Am I missing something important?

Now about our comment on ballot N1846 on TR 29113 (version N1845):

We need exactly the meaning of the ASYNCHRONOUS attribute, extended
to other forms of asynchronous input/output, e.g.,
asynchronous I/O in C (aio), nonblocking communication, one-sided
communication, and nonblocking parallel I/O in MPI.

We want this extension to come with the same possibilities (see 3.)
and restrictions (see 4.) for numerical optimizations that apply
when asynchronous Fortran I/O is actually implemented asynchronously.
This sentence defines our goal.

I suspect that our wording

>> 1 An entity with the ASYNCHRONOUS attribute is a variable that may be
>>     subject to asynchronous input/output or other asynchronous access
>>     by means other than Fortran.
>>
>>     Asynchronous input/output can be any asynchronous access to the
>>     variable or parts of it, potentially within the scope of this
>>     standard by Fortran asynchronous I/O or possibly via methods
>>     outside the scope of this standard, such as the libc/POSIX
>>     asynchronous IO (aio) or nonblocking message passing, one-sided
>>     communication, or nonblocking parallel I/O as part of the Message
>>     Passing Interface (MPI) standard.

is not enough. An additional clarification may be
necessary to make clear that the application still has to be free
of race conditions.

Last question:
Which wording would you recommend to achieve our goal?

Best regards
Rolf

----- Original Message -----
> From: "Richard L. Graham" <rlgraham at ornl.gov>
> To: longb at cray.com
> Cc: "reinhold bader" <reinhold.bader at lrz.de>, "John Reid" <John.Reid at stfc.ac.uk>, "Main MPI Forum mailing list"
> <mpi-forum at lists.mpi-forum.org>, "Craig Rasmussen" <rasmussn at lanl.gov>
> Sent: Thursday, March 31, 2011 7:18:38 PM
> Subject: Re: [Mpi-forum] Comment on Fortran WG5 ballot N1846
> Bill,
> The concern that was raised to me (I have not followed the standard,
> but rely on other folks in the forum for this) is that there is
> potential for incorrect behavior if a single data object is being
> used both for computation and for asynchronous communication
> (disjoint portions of the object are involved in each). The specific
> concern, if I understood things correctly - others, please correct me
> if I am wrong - is that compilers may respect the ASYNCHRONOUS
> attribute only with respect to other Fortran libraries, such as
> Fortran async I/O, and feel free to modify portions of the buffer
> being used for communications, only later to restore the parameters.
> If this is indeed what could happen, this would clearly be a
> correctness problem for MPI. Seems to me that if this is indeed an
> issue, MPI will not be the only thing impacted, but also code using
> other languages, async libraries, etc.
> It is possible that there is a misunderstanding here, but with the
> concern about potential correctness issues, I want to raise this
> earlier rather than later.
> Is there something to be concerned about here?
> 
> Thanks,
> Rich
> 
> On 3/30/11 11:24 PM, "Bill Long" <longb at cray.com> wrote:
> 
> >Hi Rich and others,
> >
> >
> >On 3/30/11 11:51 AM, Graham, Richard L. wrote:
> >> Dear member of the WG5 and J3 Fortran standardization committee,
> >
> >As I understand the core of this thread, the problem is that some
> >vendors don't actually implement anything for ASYNCHRONOUS because
> >their libraries do not support actual asynchronous I/O operations,
> >which makes the ASYNCHRONOUS semantics moot. If all of the vendors
> >supported the semantics as expected, there would be no need for any
> >changes.
> >
> >>
> >> We recommend the following change of the first two paragraphs in
> >> this section of the Fortran 2008 standard:
> >>
> >> 5.3.4 ASYNCHRONOUS attribute
> >>
> >> 1 An entity with the ASYNCHRONOUS attribute is a variable that may be
> >>     subject to asynchronous input/output or other asynchronous access
> >>     by means other than Fortran.
> >>
> >
> >The "or other asynchronous access by means other than Fortran" seems
> >semantically identical to "an object may be referenced, defined, or
> >become undefined, by means not specified by the program.". The second
> >phrase is, of course, the definition of the VOLATILE attribute. The
> >proposed change, then, seems to have the effect of making ASYNCHRONOUS
> >the same as VOLATILE. Some comments on that possibility:
> >
> >1) It is probably true that some vendors implement ASYNCHRONOUS that
> >way already.
> >
> >2) The vendors that are more precise in their ASYNCHRONOUS
> >implementation would not be in favor of this change.
> >
> >3) If this is the only solution, then using VOLATILE directly, and
> >forgetting about ASYNCHRONOUS, would be a solution for the MPI
> >non-blocking transfer routines. The implementation of VOLATILE is
> >likely more uniform and widespread.
> >
> >
> >> This should be treated only as a wording proposal. We provide the
> >> above text only as a suggestion; J3/WG5 are certainly free to adjust
> >> it as they feel necessary. The key idea is that we need ASYNCHRONOUS
> >> to also include potential memory modifications from agents outside
> >> the scope of the Fortran standard.
> >>
> >
> >Right. Sounds like VOLATILE to me.
> >
> >> The following snippet is correct code that allows the MPI library
> >> to internally use DMA, communications offload (e.g., to a NIC), or
> >> another method to asynchronously communicate parts of the array
> >> while the user application/main CPU simultaneously reads and writes
> >> ***another*** part of the same array. Specifically, this code
> >> snippet is free of race conditions.
> >>
> >> USE mpi_f08
> >> REAL, ASYNCHRONOUS :: buf(100,100)
> >> ! communicating a boundary of the array:
> >> CALL MPI_Irecv(buf(1,1:100),...req,...)
> >> DO j=1,100
> >>     DO i=2,100
> >>       buf(i,j)=.... ! work only on the inner area
> >>     END DO
> >> END DO
> >> CALL MPI_Wait(req,...)
> >>
> >
> >It is not clear to me how this program example would show any
> >difference in performance or semantics if ASYNCHRONOUS were replaced
> >by VOLATILE. Even with the current definitions of the attributes, it
> >is extremely unlikely that a compiler applies the asynchronous
> >attribute on an element by element basis for the array buf.
> >
> >
> >>
> >> With the current status of Fortran 2008 + TR 29113 (Version N1845),
> >> a compiler can ignore the ASYNCHRONOUS attribute. For example, a
> >> compiler may choose to not implement Fortran asynchronous I/O in an
> >> asynchronous way, meaning that there is no need for the compiler to
> >> make any restrictions on optimization.
> >>
> >> To be clear: with the current Fortran 2008, this optimization is
> >> still allowed. With our proposal, this optimization will be
> >> prohibited.
> >>
> >> We want to mention that the use of VOLATILE would solve the
> >> problem, but at a high performance price because it would also
> >> prohibit any optimization of the "work" part (e.g., automatic cache
> >> optimization, instruction reordering, etc.). Especially since
> >> Fortran is used with MPI in high-performance, numerically intensive
> >> applications, we feel that VOLATILE cannot be recommended as a
> >> solution.
> >>
> >
> >So, it appears that you are saying that the compiler cannot make the
> >element-by-element distinction for the VOLATILE attribute, but it
> >could for ASYNCHRONOUS (in order for your example to optimize the way
> >you hope). Seems like a weak point in the argument.
> >
> >I agree that VOLATILE has the issues you state. But I don't see how
> >the new definition of ASYNCHRONOUS avoids the same issues.
> >
> >The MPI standard can state what is required of compiler
> >implementations for them to be compatible with MPI. I think a more
> >direct solution would be to require that the compiler vendor supports
> >the ASYNCHRONOUS semantics if it is to be used with MPI.
> >
> >Cheers,
> >Bill
> >
> >
> >
> >--
> >Bill Long longb at cray.com
> >Fortran Technical Support & voice: 651-605-9024
> >Bioinformatics Software Development fax: 651-605-9142
> >Cray Inc./Cray Plaza, Suite 210/380 Jackson St./St. Paul, MN 55101
> >
> >
> 
> 

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)