[Mpi-forum] Comment on Fortran WG5 ballot N1846
Rolf Rabenseifner
rabenseifner at hlrs.de
Tue Apr 12 08:59:18 CDT 2011
Bill and John and all members of the Fortran and MPI committees,
Thank you both very much for your detailed answers.
It looks to me as if the Fortran and
MPI standardization committees have only a few options,
all of which have drawbacks:
Option 1:
---------
Keep the meaning of ASYNCHRONOUS in the Fortran standard
as it is, and clearly state in the MPI standard that
overlapping of numerical code with nonblocking MPI
communication (point-to-point, collective, or one-sided)
or nonblocking parallel MPI-I/O requires that
variables of which a part is associated with a
storage unit in a pending nonblocking MPI operation
must not be used in any Fortran numerical statement.
This is mainly necessary to prevent such a
variable from being copied into a local memory (e.g., on a GPU)
before the numerical operation and back to main memory
after the numerical operation is done.
Additionally, the ASYNCHRONOUS attribute cannot be
used to prevent the usual register optimization problems
with MPI_Wait calls either, because a compiler can fully ignore
the ASYNCHRONOUS attribute as long as it has implemented
Fortran asynchronous input/output as synchronous I/O.
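To make the pattern concrete, here is a minimal sketch of the kind of
code that Option 1 would declare invalid. It is not taken from any
standard text. It assumes an MPI library that provides the "mpi" module
and a run with at least two processes; the program name, the array
name buf, and the size n are made up for illustration.

```fortran
! Sketch only: one variable (buf) is used partly in a pending
! nonblocking receive and partly in numerical code.
! Assumption: an MPI library with the "mpi" module, run with >= 2 processes.
program overlap_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1000          ! hypothetical size
  double precision :: buf(2*n)            ! hypothetical buffer
  integer :: rank, req, ierr, i
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  buf = 0.0d0

  if (rank == 0) then
     ! The first n elements of buf become part of a pending receive.
     call MPI_Irecv(buf, n, MPI_DOUBLE_PRECISION, 1, 0, &
                    MPI_COMM_WORLD, req, ierr)

     ! The last n elements of buf are used in numerical code while the
     ! receive is still pending. A compiler that copies all of buf to a
     ! local memory and back could overwrite the received data.
     do i = n + 1, 2*n
        buf(i) = buf(i) + 1.0d0
     end do

     call MPI_Wait(req, status, ierr)
     ! Register optimization problem: nothing tells the compiler that
     ! buf(1:n) may have changed during MPI_Wait.
     print *, buf(1), buf(2*n)
  else if (rank == 1) then
     buf(1:n) = 42.0d0
     call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 0, 0, &
                   MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program overlap_sketch
```

Under Option 1, the work-around would be to receive into a separate
array and keep buf out of the nonblocking operation entirely.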
Drawbacks:
- The application cannot use parts of a buffer in
nonblocking communication and other parts in numerical code.
If somebody has written such an application then
it was always invalid according to the rules in the
Fortran standard.
With Option 1, we do not solve the problem.
We only tell the application programmer that he/she
has to modify his/her application.
- Formally, using two parts of the same buffer
in two MPI_Irecv calls and issuing the second call while
the first one is still pending means that,
with the first call, the buffer starts to be
something like a pending input storage sequence affector,
which would not allow the second MPI_Irecv call to be made
with another part of the same buffer.
I hope we can fully ignore this drawback
based on the rules of TR 29113 for
TYPE(*),DIMENSION(..) :: buffer
Option 2:
---------
(Option 2 is not an option, see below)
The Fortran standardization committee modifies the Fortran
standard as part of TR 29113 to guarantee that the
ASYNCHRONOUS attribute is also valid for MPI nonblocking operations
in addition to Fortran asynchronous input/output.
This may be done by modifying 9.6.2.5 paragraph 6:
A pending input/output storage sequence affector is a
variable of which any part is associated with a storage
unit in a pending input/output storage sequence.
into:
A pending input/output storage sequence affector is a
variable of which any part is associated with a storage
unit in a pending input/output storage sequence
or used in a pending asynchronous operation by means
other than Fortran, such as the libc/POSIX asynchronous IO (aio)
or nonblocking message passing, one-sided communication,
or nonblocking parallel I/O as part of the
Message Passing Interface (MPI) standard.
With this modification, the MPI standard can give the advice
to users to use the ASYNCHRONOUS attribute when overlapping
computation and communication with different parts of the same
variable (i.e., array or derived type).
The clear drawbacks:
- A compiler with a blocking implementation of asynchronous
Fortran input/output must implement the ASYNCHRONOUS
attribute.
As you mentioned, the "easy route" may be to just use
the VOLATILE semantics, with significant performance
drawbacks for Fortran asynchronous IO and MPI nonblocking
operations.
- Compilers that try to implement ASYNCHRONOUS without this
"VOLATILE easy route" may still switch off significant
optimizations, because the copying of an array into a
local memory and back (e.g., into the GPU local memory)
can be done only on the basis of a whole array
and not on the basis of exactly such elements that are
really used within the numerics, i.e.,
the whole pending input/output storage sequence affector
is excluded from the optimization and not only the
storage units in a pending input/output or nonblocking
operation.
- The wording for this modification of the Fortran
standard may be difficult.
And the major disadvantage:
- It is the wrong solution, because Fortran 2008 states
in 9.6.4.1, paragraphs 5 and 6:
"For asynchronous output, a pending input/output
storage sequence affector (9.6.2.5) shall not be redefined,
become undefined, or have its pointer association status
changed.
For asynchronous input, a pending input/output storage
sequence affector shall not be referenced, become defined,
become undefined, become associated with a dummy argument
that has the VALUE attribute, or have its pointer
association status changed."
The ASYNCHRONOUS attribute always applies to the whole
affector, i.e., if only one word of an array is used in
pending asynchronous input/output or nonblocking operations,
the whole array must not be referenced or redefined,
depending on its usage in the pending operation.
For the programmer, it may be better to
clearly separate data structures that are used in
nonblocking communication or asynchronous IO
from those used in numerical code.
Then all optimizations can be done and
VOLATILE is never used.
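The whole-affector rule can be illustrated with plain Fortran
asynchronous input, without MPI. The following is a minimal sketch
under the reading of 9.6.2.5 and 9.6.4.1 given above; the program
name, the file name affector_demo.dat, and the array names and sizes
are made up for illustration.

```fortran
! Sketch only: only a(1:n) is read asynchronously, but under the
! reading above the whole array a is the affector.
program affector_sketch
  implicit none
  integer, parameter :: n = 8             ! hypothetical size
  real, asynchronous :: a(2*n)
  real :: b(n)
  integer :: u, id

  ! Write some data first so that there is something to read back.
  b = 1.0
  open(newunit=u, file='affector_demo.dat', form='unformatted', &
       access='stream', status='replace')
  write(u) b
  close(u)

  a = 0.0
  open(newunit=u, file='affector_demo.dat', form='unformatted', &
       access='stream', status='old', asynchronous='yes')

  ! Only a(1:n) is in the pending input storage sequence.
  read(u, asynchronous='yes', id=id) a(1:n)

  ! Under the whole-affector reading, a statement such as
  !     a(n+1:2*n) = a(n+1:2*n) + 1.0
  ! would not be permitted here, although it touches no storage
  ! unit of the pending read.

  wait(u, id=id)

  ! After the wait operation, all of a may be used again.
  a(n+1:2*n) = a(n+1:2*n) + 1.0
  print *, sum(a)
  close(u, status='delete')
end program affector_sketch
```

Reading into a separate array instead of a section of a would keep
the numerical data out of the affector, which is the separation
suggested above.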
Option 3:
---------
For use with an MPI-3.0 library, the MPI-3.0 standard requires
a compiler that has implemented the ASYNCHRONOUS
attribute also for MPI nonblocking operations,
maybe in a way that parts of the array can be used
for computation and other parts for communication,
together with the statement that a high-quality implementation
does not implement the ASYNCHRONOUS attribute by the
semantics of VOLATILE.
The second, third and fourth drawbacks from Option 2 are still valid.
It may be bad practice to restrict MPI-3.0 to compilers
that have a special quality of implementation of
a Fortran standard feature (here ASYNCHRONOUS).
Unclear behavior of the ASYNCHRONOUS attribute:
-----------------------------------------------
To me, the job of ASYNCHRONOUS is still unclear.
If an ASYNCHRONOUS variable is used in a load,
it can be accessed directly or through caches (local memories, ...),
because it can only be part of a pending output operation.
If it is part of a store, there are no restrictions,
because it cannot be part of a currently pending
input/output operation.
When using local memories (e.g., on a GPU),
the compiler must keep track of whether a store
operation was really done, because only in this case
is writing back from local memory to main memory
allowed.
But this problem is outside of the scope of the
MPI standardization.
Did I summarize the main options correctly?
Are there other options?
Who would vote for which option?
Provided that I have summarized correctly after your detailed
answers, I would tend to go only with Option 1, i.e., to leave
the 16-year-old problem unsolved and only
describe it correctly.
John, I'm currently preparing the text for the MPI-3.0 standard.
I'm not sure whether it makes sense to copy parts of it into the
Fortran standard.
Thank you all for your patience with this discussion.
I still believe that it is important to have the problems
between MPI nonblocking operations and Fortran solved, although
the solution may be only a correct description of the problem
and work-arounds for the application programmers.
As soon as I have all the information together, I will have to start
an additional discussion on derived types and BIND(C).
For the moment, it would be helpful to have several replies
and opinions on this problem.
I need to know whether Option 1 is the correct path for
both the Fortran and the MPI committees.
Best regards
Rolf
--
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)