Hello gents,
This email thread is about removing the send buffer access
restrictions from the MPI standard. I’ve attached the original proposal
(below) and the slides I presented in the Jan forum meeting.
Let me know if you think that we need to add or remove content.
I think this proposal is ready for the March forum meeting,
with a few things we still need to complete.
TODO list:
- Post this proposal on the wiki pages.
- Address any cons not addressed in this proposal. Do you know of any?
- Add examples from real scientific applications that would benefit from
  this proposal. If you have any example, please bring it forward.
- Please review the proposal against the MPI spec to verify if any
  additional modifications to the spec are required.
Please let me know if you think we need a conf call for this
proposal.
Thanks,
.Erez
From: owner-mpi-21@mpi-forum.org [mailto:owner-mpi-21@mpi-forum.org] On Behalf Of Erez Haba
Sent: Monday, December 17, 2007 10:14 AM
To: mpi-21@mpi-forum.org
Subject: [mpi-21] Proposal EH1: Send buffer access
This is the proposal to remove the access restriction on the user send
buffers. I think it makes sense to apply this to the 2.1 document; however,
I’m okay with this being discussed for 2.2.
Thanks,
.Erez
Background:
The MPI 1.1 standard prohibits users from reading their send buffer until
the send operation completes, whether the read comes from the same thread
in the case of an async send operation or from another thread in the case
of a blocking send operation. The rationale in the MPI 1.1 standard was to
allow better performance with a DMA engine that is not cache-coherent with
the main processor.
Proposal:
Remove the access restriction on the send buffers.
Rationale:
This restriction is counterintuitive for many programmers. It leads to
programs that are not compliant with the spec because their authors are
unaware of this limitation. The limitation prohibits a common usage of the
async send APIs and/or the blocking sends.
For example, the following code sequence is not compliant with the spec:
void* my_buffer;
…
MPI_Isend(my_buffer, 100, MPI_CHAR, 3, 0, MPI_COMM_WORLD, &request1);
MPI_Isend(my_buffer, 100, MPI_CHAR, 4, 0, MPI_COMM_WORLD, &request2);
The email thread discussing the limitation on non-cache-coherent machines
concluded that it is a non-issue for these machines. The application has to
resolve the multi-threaded non-cache-coherency issues anyhow when sending
data.
Cons:
This change will invalidate implementations that modify the user buffer in
place (for example, to byte-swap it).
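The following sketch illustrates why an in-place byte-swap conflicts with allowing reads of the send buffer, and what a staged alternative could look like. It is plain C with no MPI calls; the function names convert_in_place and convert_staged are hypothetical and do not describe any particular MPI implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t bswap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Hypothetical in-place conversion: the user's buffer itself is swapped
 * for the duration of the send, so any concurrent reader of user_buf
 * sees byte-swapped data. Calling it a second time restores the data. */
static void convert_in_place(uint32_t *user_buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        user_buf[i] = bswap32(user_buf[i]);
}

/* Hypothetical staged conversion: the user's buffer is only read, and
 * the swapped words go to a separate staging buffer. */
static void convert_staged(const uint32_t *user_buf, uint32_t *staging,
                           size_t n)
{
    for (size_t i = 0; i < n; i++)
        staging[i] = bswap32(user_buf[i]);
}
```

An implementation along the lines of convert_staged would stay valid under the proposed change, at the cost of an extra buffer and copy; one along the lines of convert_in_place would not.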
Proposed changes to the MPI document:
mpi1-report.pdf:
Section 3.4 Page 27 Line 17
Change:
The send call described in Section 3.2.1 is blocking: it does not return
until the message data and envelope have been safely stored away so that
the sender is free to access and overwrite the send buffer.
To:
(change “access and overwrite” to “modify”)
The send call described in Section 3.2.1 is blocking: it does not return
until the message data and envelope have been safely stored away so that
the sender is free to modify the send buffer.
mpi1-report.pdf:
Section 3.4 Page 29 Line 48
Change:
In a multi-threaded implementation of MPI, the system may de-schedule a
thread that is blocked on a send or receive operation, and schedule another
thread for execution in the same address space. In such a case it is the
user’s responsibility not to access or modify a communication buffer until
the communication completes. Otherwise, the outcome of the computation is
undefined.
To:
(remove “access or”)
In a multi-threaded implementation of MPI, the system may de-schedule a
thread that is blocked on a send or receive operation, and schedule another
thread for execution in the same address space. In such a case it is the
user’s responsibility not to modify a communication buffer until the
communication completes. Otherwise, the outcome of the computation is
undefined.
mpi1-report.pdf:
Section 3.4 Page 30 Lines 4-9
Remove:
Rationale. We prohibit read accesses to a send buffer while it is being
used, even though the send operation is not supposed to alter the content
of this buffer. This may seem more stringent than necessary, but the
additional restriction causes little loss of functionality and allows
better performance on some
systems — consider the
case where data transfer is done by a DMA engine that is not cache-coherent
with the main processor. (End of rationale.)
mpi1-report.pdf:
Section 3.9 Page 58 Line 11
Change:
If the request is for a send with ready mode, then a matching receive
should be posted before the call is made. The communication buffer should
not be accessed after the call, and until the operation completes.
To:
(change “accessed” to “modified”)
If the request is for a send with ready mode, then a matching receive
should be posted before the call is made. The communication buffer should
not be modified after the call, and until the operation completes.
mpi1-report.pdf:
Section 4.1 Page 94 Line 12
Change:
Collective routine calls can (but are not required to) return as soon as
their participation in the collective communication is complete. The
completion of a call indicates that the caller is now free to access
locations in the communication buffer.
To:
(change “access” to “modify”)
Collective routine calls can (but are not required to) return as soon as
their participation in the collective communication is complete. The
completion of a call indicates that the caller is now free to modify
locations in the communication buffer.
mpi2-report.pdf:
Section 6.3 Page 112 Lines 20-31
Remove:
Rationale. The rule above is more lenient than for message passing, where
we do not allow two concurrent sends, with overlapping send buffers. Here,
we allow two concurrent puts with overlapping send buffers. The reasons for
this relaxation are
1. Users do not like that restriction, which is not very natural (it
   prohibits concurrent reads).
2. Weakening the rule does not prevent efficient implementation, as far as
   we know.
3. Weakening the rule is important for performance of RMA: we want to
   associate one synchronization call with as many RMA operations as
   possible. If puts from overlapping buffers cannot be concurrent, then we
   need to needlessly add synchronization points in the code.
(End of rationale.)