[Mpi-forum] Comment on Fortran WG5 ballot N1846

Tue Apr 12 08:59:18 CDT 2011

Bill and John and all members of the Fortran and MPI committees,

Thank you both very much for your detailed answers.

It looks to me as if the Fortran and
MPI standardization committees have only a few options,
all of which have drawbacks:

Option 1:
---------

   Keep the meaning of ASYNCHRONOUS in the Fortran standard
   as it is, and clearly state in the MPI standard that
   overlapping of numerical code with nonblocking MPI
   communication (point-to-point, collective, or one-sided)
   or nonblocking parallel MPI-I/O requires that
   variables of which a part is associated with a
   storage unit in a pending nonblocking MPI operation
   must not be used in any Fortran numerical statement.

   This is mainly necessary to prevent such a variable
   from being copied into a local memory (e.g., on a GPU)
   before the numerical operation and back to main memory
   after the numerical operation is done.
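
   To make the restriction concrete, here is a minimal sketch of the
   pattern that Option 1 would declare invalid. The names (buf,
   halo_out, n) are only illustrative, and each rank exchanges a
   message with itself so that the sketch runs on any number of
   processes.

```fortran
program overlap_same_variable
  use mpi
  implicit none
  integer, parameter :: n = 4
  ! One variable: buf(1:n) is the receive area, buf(n+1:2*n) the numerics area.
  double precision :: buf(2*n)
  double precision :: halo_out(n)
  integer :: req(2), ierr, rank

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  halo_out = dble(rank)
  buf = 0.0d0

  ! Nonblocking receive into the first n elements of buf.
  call MPI_Irecv(buf, n, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(halo_out, n, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(2), ierr)

  ! Numerical statement on another part of the SAME variable while the
  ! receive is pending. Under Option 1 this use of buf is not allowed.
  buf(n+1:2*n) = buf(n+1:2*n) + 1.0d0

  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
  call MPI_Finalize(ierr)
end program overlap_same_variable
```

   The risk is that a compiler may copy all of buf to a local memory
   for the assignment and copy all of it back afterwards, which can
   overwrite data that MPI delivered into buf(1:n) in the meantime.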

   Additionally, the ASYNCHRONOUS attribute cannot be
   used to prevent the usual register optimization problems
   with MPI_Wait calls, because a compiler can fully ignore
   the ASYNCHRONOUS attribute as long as it has implemented
   Fortran asynchronous input/output as synchronous I/O.
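
   As a reminder of what the register optimization problem looks
   like, here is a small sketch. It uses the workaround that I know
   from the MPI-2.2 text (passing the buffer to a separately compiled
   routine such as MPI_Get_address after the wait). The variable
   names are illustrative, and the workaround assumes that the
   compiler cannot see into the called routine.

```fortran
program wait_register_hazard
  use mpi
  implicit none
  integer, parameter :: n = 4
  double precision :: buf(n), src(n), first
  integer :: req(2), ierr, rank
  integer(kind=MPI_ADDRESS_KIND) :: addr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  src = 42.0d0
  buf = 0.0d0

  ! Each rank exchanges with itself so the sketch runs on any process count.
  call MPI_Irecv(buf, n, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(src, n, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(2), ierr)

  ! buf is not an argument of MPI_Waitall, so the compiler may assume
  ! that the call does not change buf and may move a load of buf(1)
  ! to a point before the wait.
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)

  ! Workaround: pass buf to an external routine after the wait, so the
  ! compiler has to assume that buf may have been modified.
  call MPI_Get_address(buf, addr, ierr)

  first = buf(1)
  print *, 'rank', rank, 'first element:', first
  call MPI_Finalize(ierr)
end program wait_register_hazard
```

   Declaring buf with the ASYNCHRONOUS attribute would not replace
   the MPI_Get_address call, for the reason given above.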

   Drawbacks:

   - The application cannot use parts of a buffer in
     nonblocking communication and other parts in numerical code.

     If somebody has written such an application then
     it was always invalid according to the rules in the 
     Fortran standard. 
     With Option 1, we do not solve the problem.
     We only tell the application programmer that he/she
     has to modify his/her application. 

   - Formally, using two parts of the same buffer
     in two MPI_Irecv calls and issuing the second call while
     the first one is still pending means that,
     with the first call, the buffer becomes
     something like a pending input storage sequence affector,
     which would not allow the second MPI_Irecv call
     to be made with another part of the same buffer.
     I hope we can fully ignore this drawback
     based on the rules of TR 29113 for
       TYPE(*),DIMENSION(..) :: buffer

Option 2:
---------

   (Option 2 is not an option, see below) 

   The Fortran standardization committee modifies the Fortran
   standard as part of TR 29113 to guarantee that the
   ASYNCHRONOUS attribute is also valid for MPI nonblocking operations
   in addition to Fortran asynchronous input/output.

   This may be done by modifying 9.6.2.5 paragraph 6:

     A pending input/output storage sequence affector is a 
     variable of which any part is associated with a storage 
     unit in a pending input/output storage sequence.

   into:

     A pending input/output storage sequence affector is a 
     variable of which any part is associated with a storage 
     unit in a pending input/output storage sequence
     or used in a pending asynchronous operation by means
     other than Fortran, such as the libc/POSIX asynchronous IO (aio)
     or nonblocking message passing, one-sided communication,
     or nonblocking parallel I/O as part of the
     Message Passing Interface (MPI) standard.

   With this modification, the MPI standard can advise
   users to use the ASYNCHRONOUS attribute when overlapping
   computation and communication with different parts of the same
   variable (i.e., array or derived type).

   The clear drawbacks:

   - A compiler with a blocking implementation of asynchronous
     Fortran input/output can no longer ignore the ASYNCHRONOUS
     attribute and must implement it.
     As you mentioned, the "easy route" may be to just use
     the VOLATILE semantics, with significant performance
     drawbacks for Fortran asynchronous I/O and MPI nonblocking
     operations.

   - Compilers that try to implement ASYNCHRONOUS without this
     "VOLATILE easy route" may still switch off significant
     optimizations, because the copying of an array into a
     local memory and back (e.g., into the GPU local memory)
     can be done only for a whole array
     and not for exactly those elements that are
     really used within the numerics, i.e.,
     the whole pending input/output storage sequence affector
     is excluded from the optimization and not only the
     storage units in a pending input/output or nonblocking
     operation.

   - The wording for this modification of the Fortran
     standard may be difficult.

   And the major disadvantage:

   - It is a wrong solution, because Fortran 2008 states
     in 9.6.4.1, paragraphs 5 and 6:

      "For asynchronous output, a pending input/output 
      storage sequence affector (9.6.2.5) shall not be redefined,
      become undefined, or have its pointer association status 
      changed.
      For asynchronous input, a pending input/output storage 
      sequence affector shall not be referenced, become defined,
      become undefined, become associated with a dummy argument 
      that has the VALUE attribute, or have its pointer
      association status changed." 

     The ASYNCHRONOUS attribute always works for the whole
     affector, i.e., if only one word of an array is used in
     pending asynchronous input/output or nonblocking operations,
     the whole array must not be referenced/redefined,
     depending on the usage in the pending operation.
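
     The whole-affector rule can be seen with plain Fortran
     asynchronous input, without any MPI. The following is a minimal
     sketch, assuming a Fortran 2008 compiler (it uses NEWUNIT=);
     the array name a and the scratch file are only illustrative.

```fortran
program affector_demo
  implicit none
  integer, parameter :: n = 4
  real, asynchronous :: a(2*n)
  integer :: u, id

  a = 1.0
  open(newunit=u, status='scratch', form='unformatted', &
       asynchronous='yes')
  write(u) a(1:n)          ! synchronous write of one record
  rewind(u)

  ! Asynchronous read into the first half of a only.
  read(u, asynchronous='yes', id=id) a(1:n)

  ! a(1:n) is the pending input storage sequence, but the whole array a
  ! is the affector. By 9.6.4.1, even a(n+1) must not be referenced or
  ! defined between the READ and the WAIT.

  wait(u, id=id)

  ! Only after the WAIT may the other half be used in numerics.
  a(n+1:) = a(n+1:) * 2.0
  print *, a
  close(u)
end program affector_demo
```

     If nonblocking MPI operations were treated like the READ above,
     the same restriction on the second half of the array would apply
     between MPI_Irecv and MPI_Wait.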

   For the programmer, it may be better to
   clearly separate data structures that are used in
   nonblocking communication or asynchronous I/O
   from those used in numerical code.
   Then all optimizations can be done and
   VOLATILE is never used.
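
   A minimal sketch of such a separation, with illustrative names
   (halo_in, halo_out, interior) and again with each rank exchanging
   a message with itself so that it runs on any number of processes:

```fortran
program separate_buffers
  use mpi
  implicit none
  integer, parameter :: n = 4
  double precision :: halo_in(n), halo_out(n)  ! touched only by MPI while pending
  double precision :: interior(n)              ! touched only by the numerics
  integer :: req(2), ierr, rank

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  halo_in  = 0.0d0
  halo_out = dble(rank)
  interior = 1.0d0

  call MPI_Irecv(halo_in, n, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(halo_out, n, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(2), ierr)

  ! The numerics use a different variable, so no part of a variable in a
  ! pending nonblocking operation appears in a numerical statement.
  interior = interior * 2.0d0

  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
  ! From here on halo_in may be read, subject to the register
  ! optimization caveat at the wait.
  call MPI_Finalize(ierr)
end program separate_buffers
```

   This only removes the overlap restriction; it does not address the
   register optimization problem around MPI_Wait.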

Option 3:
---------

   For use with an MPI-3.0 library, the MPI-3.0 standard requires
   a compiler that has implemented the ASYNCHRONOUS
   attribute also for MPI nonblocking operations,
   maybe in a way that parts of the array can be used
   for computation and other parts for communication,
   together with the statement that a high-quality implementation
   does not implement the ASYNCHRONOUS attribute by the
   semantics of VOLATILE.

   The second, third and fourth drawbacks from Option 2 still apply.

   It may be bad practice to restrict MPI-3.0 to compilers
   whose implementation of a Fortran standard feature
   (here ASYNCHRONOUS) has a special quality.

Unclear behavior of the ASYNCHRONOUS attribute:
----------------------------------------------- 

   For me, the job of ASYNCHRONOUS is still unclear.
   If an ASYNCHRONOUS variable is used in a load,
   it can be accessed directly or through caches (local memories, ...)
   because it can only be part of a pending output operation.
   If it is part of a store, there are no restrictions,
   because it cannot be part of a currently pending
   input/output operation.
   When using local memories (e.g., on a GPU),
   the compiler must keep track of whether a store
   operation was really done, because only in this case
   is writing back from local memory to main memory
   allowed.
   But this problem is outside of the scope of the
   MPI standardization. 

Did I summarize the main options correctly?

Are there other options?

Who would vote for which option?

Provided that I have summarized your detailed
answers correctly, I would tend to go only with Option 1, i.e., to leave
the 16-year-old problem unsolved and only
describe it correctly.

John, I'm currently preparing the text for the MPI-3.0 standard.
I'm not sure whether it makes sense to copy parts of it into the
Fortran standard.

Thank you all for your patience with this discussion.
I still believe that it is important to have the problems
between MPI nonblocking and Fortran solved, although
the solution may be only a correct description of the problem
and work-arounds for the application programmers.

As soon as I have all the information together, I will have to start
an additional discussion on derived types and BIND(C).
For the moment, it would be helpful to have several replies
and opinions on this problem. 

I need to know whether Option 1 is the correct path for
both the Fortran and the MPI committees.

Best regards 
Rolf

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)