[Mpi-22] please review - Send Buffer Access (ticket #45)

Wed Dec 10 10:50:33 CST 2008

Dear Brian,

Thanks. Now that I've already got my implementor's view out on the balance between the user friendliness and implementation freedom in the reply to your other letter, let me follow you on the way back to the proposal and counter-arguments.

I reply below using AS> as a prefix. Before I do this, let me sum up my position:

1) The send access proposal restricts freedom of implementation. This statement is based on historic precedent and several realistic, forward-looking scenarios mentioned earlier in this trail. The only argument in favor of the proposal is the desire to make an unknown number of incorrect MPI programs valid. I've never seen any data on how many applications are involved, by the way. Do you know? One? Two? Or more?

2) The const proposal depends on 1) and makes it irreversible thru a syntactic change. Since 1) is questionable, 2) is questionable by inference. Moreover, 2) has a syntactic weakness of its own due to the const semantics in the C language, again identified earlier in this trail.

Whether or not the unrelated MPI_INIT_ASSERTED proposal is going to help with 1) remains to be seen. Accepting 2) now will definitely close that door.

Even though a compromise (like accepting 1) now with the possibility to fix it later thru MPI_INIT_ASSERTED, and not accepting 2)) may look tempting, this may not be the "right" way if you ask me.

Breaking something only to fix it later, maybe, while encouraging dangerous programming practices for now, does not seem totally prudent to me. "MPI standard is changing all the time" is not the signal we want to send.

Finally, that the one-sided communication does not impose a restriction on the buffer access is a different matter. It is unclear to me whether this precedent should be used to justify the proposed change. It may itself be open to review - possibly, not now.

This is why I would propose to vote both proposals down when it comes to the vote that, as I've already agreed, is the correct procedure to follow.

Best regards.

Alexander

-----Original Message-----
From: mpi-22-bounces_at_[hidden] [mailto:mpi-22-bounces_at_[hidden]] On Behalf Of Barrett, Brian W
Sent: Wednesday, December 10, 2008 6:36 AM
To: MPI 2.2
Subject: Re: [Mpi-22] please review - Send Buffer Access (ticket #45)

Ok, now that I've gotten my user's point of view off my chest, back to my
view as an MPI implementer.  I've tried to respond to all Alexander's
objections.  If I missed one, I apologize.

> 1) The memory remapping scenario I brought up a couple of days ago

Unless someone's done something fun with operating system and architecture
design I'm not aware of, remapping has one problem that always must be
considered...  Send buffers are not required to be (and frequently aren't)
page aligned or a multiple of page size.  Therefore, completely removing the
send buffer from the user's process has the problem of also taking legal
addresses with it (which would violate the standard).  IBM's solution is
elegant in that it allows remapping without removing from the sender's
process space.  Sandia has a solution called SMARTMAP that is both not
patented and allows single copy transfers in shared memory environments.
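
The alignment point above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from any MPI implementation: the helper names `enclosing_pages` and `collateral_bytes` are hypothetical, and the page size is assumed to be a power of two. It computes how many bytes outside the send buffer would be affected if the pages enclosing the buffer were unmapped from the sender.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the page-aligned range that encloses a buffer. */
struct page_span {
    uintptr_t start;  /* first byte of the first page touched */
    uintptr_t end;    /* one past the last byte of the last page touched */
};

static struct page_span enclosing_pages(uintptr_t addr, size_t len,
                                        size_t page_size)
{
    /* Assumption: page_size is a nonzero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    struct page_span s;
    uintptr_t mask = (uintptr_t)page_size - 1;
    s.start = addr & ~mask;               /* round down to a page boundary */
    s.end = (addr + len + mask) & ~mask;  /* round up to a page boundary */
    return s;
}

/* Bytes inside the enclosing pages that do NOT belong to the send buffer.
 * Unmapping those pages would also remove these otherwise legal addresses. */
static size_t collateral_bytes(uintptr_t addr, size_t len, size_t page_size)
{
    struct page_span s = enclosing_pages(addr, len, page_size);
    return (size_t)(s.end - s.start) - len;
}
```

For a 100-byte buffer that starts 16 bytes into a 4096-byte page, the enclosing span is one full page, so 3996 bytes of unrelated data share the fate of the buffer. Only a buffer that is both page aligned and a multiple of the page size has no such collateral bytes.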

AS> See Hubert's reply. The point of this particular argument is that the possible restriction of the proposal upon the implementation freedom may have been overlooked. My bet is that we're getting but a corner of this matter raised - see other points.

Point number 2 was a procedural argument.  I believe others are in a better
position than I to comment on this.  My understanding, however, is that a
technical objection can cause the vote to fail, but is not grounds on
preventing a vote (particularly a second vote).  If it were, we'd never get
anything done.

AS> I agreed on the procedure after Erez' and Bill's clarifications. We're going to have whatever vote is due. We're preparing our position to voice it before the voting.

> 3) Imagine send buffers have to be pinned in memory. To avoid doing this too
> often, these registrations will normally be cached. If more than one send can
> be used for a buffer or, for that matter, overlapping portions of the same
> buffer, say by different threads, access to the lookup-and-pin will have to be
> made atomic. This will further complicate implementation and introduce a
> potentially costly mutual exclusion primitive into the critical path.
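
The lookup-and-pin concern in point 3 can be sketched as follows. This is only an illustration under stated assumptions: `reg_cache`, `reg_acquire`, and `reg_release` are hypothetical names, the cache is a small fixed array, and the actual pinning call is replaced by a counter. It shows the lookup and the registration happening under a single lock, so that two threads sending overlapping portions of the same buffer cannot both miss the cache and register twice.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define REG_CACHE_SLOTS 16

struct reg_entry {
    uintptr_t base;  /* start of the registered range */
    size_t len;      /* length of the registered range */
    int refs;        /* in-flight sends currently using this registration */
    int valid;       /* nonzero if the slot holds a registration */
};

struct reg_cache {
    pthread_mutex_t lock;
    struct reg_entry slot[REG_CACHE_SLOTS];
    int pin_calls;   /* how often the (expensive) pin path was taken */
};

static void reg_cache_init(struct reg_cache *c)
{
    pthread_mutex_init(&c->lock, NULL);
    for (int i = 0; i < REG_CACHE_SLOTS; i++)
        c->slot[i].valid = 0;
    c->pin_calls = 0;
}

/* Returns a slot index covering [buf, buf+len), or -1 if the cache is full.
 * Lookup and registration share one critical section. */
static int reg_acquire(struct reg_cache *c, const void *buf, size_t len)
{
    uintptr_t base = (uintptr_t)buf;
    int found = -1, free_slot = -1;

    pthread_mutex_lock(&c->lock);
    for (int i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (!e->valid) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* Hit: the requested range lies inside an existing registration. */
        if (base >= e->base && base + len <= e->base + e->len) {
            found = i;
            break;
        }
    }
    if (found < 0 && free_slot >= 0) {
        /* Miss: a real implementation would pin the memory here. */
        c->slot[free_slot] = (struct reg_entry){ base, len, 0, 1 };
        c->pin_calls++;
        found = free_slot;
    }
    if (found >= 0)
        c->slot[found].refs++;
    pthread_mutex_unlock(&c->lock);
    return found;
}

static void reg_release(struct reg_cache *c, int idx)
{
    pthread_mutex_lock(&c->lock);
    assert(idx >= 0 && idx < REG_CACHE_SLOTS && c->slot[idx].refs > 0);
    c->slot[idx].refs--;
    pthread_mutex_unlock(&c->lock);
}
```

The mutex sits on the path of every send, which is the cost the point refers to. Whether that cost is new or already present in existing implementations is exactly what the replies below debate.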

The caching problem already exists.  Consider a case where a large send is
completed, then multiple small sends occur within that base and bound after
the first is completed.  This situation is perfectly legal, happens in codes
in the wild, and must be dealt with by MPI implementations.  If that's not
enough, consider a case where the buffer is part of an active Window (which
is legal, as long as the buffers in use for communication don't overlap).
All these cases certainly should be handled by an MPI today.

AS> Let me try to understand this. You say above that the large send is completed. If I got this correctly, the buffer is no longer engaged. Where is the problem to discuss here? We're discussing sends that try to send the same buffer while it's already engaged. Please clarify.

> 4) I wonder what a const modifier will do for a buffer identified by
> MPI_BOTTOM and/or a derived data type, possibly with holes in it. How will
> this square up with the C language sequence association rules?

This sounds like an issue for the const proposal, which is different from
the send buffer access proposal.  I'm not sure I have enough data to form an
opinion on the const proposal, but I'm fairly sure we can discuss the send
buffer access proposal without considering this issue.

AS> The const proposal depends on the send buffer proposal. This argument revealed yet another possible complication in the const proposal.

> 5) Note also that if both #45 and #46 are introduced, there will be no way to
> retract this, even with the help of the MPI_INIT_ASSERTED, should we later
> decide to introduce an assertion like MPI_NO_SEND_BUFFER_READ_ACCESS. The const
> modifier from #46 will make that syntactically useless.

If both are passed, that might be true.  It could be argued the const
proposal depends on the access proposal.  However, it can not be rationally
argued that the access proposal in any way depends upon the const proposal.

The send buffer access proposal can certainly be passed and an assert added
later (at whatever point the init_assert proposal is integrated into the
standard) that allows MPI implementations to modify the send buffer.

You raise a good point about the const proposal.  But it has absolutely no
bearing on the send buffer access proposal.

AS> This comment was primarily directed at the const proposal, you're right.

> 6) Finally, what will happen in the Fortran interface? With the
> copy-in/copy-out possibly happening on the MPI subroutine boundary for array
> sections? If more than one send is allowed, the application can pretty easily
> exhaust any virtual memory with a couple of long enough vectors.

How does that change from today?  Today users send multiple buffers at the
same time, and seem to cope with memory exhaustion issues just fine.  So
soon they might be able to remove the data copy they've had to make at the
user level to work around the MPI access restriction, so there's actually
less virtual memory in use.  Seems like a win to me.

AS> I'm not sure we're talking about the same problem here. Fortran copy-in/copy-out is done by the compiler runtime to create contiguous data out of a possibly noncontiguous array section. If more than one Send is allowed on the same section, there may be as many hidden copies. If this goes on and on, the memory may get exhausted.

Sending different buffers is OK. The main argument of the proposal is, as far as I understand it, that the user may resend the same buffer over and over again. This buffer may be large; otherwise, why write an incorrect program? And a big buffer copied many times may be an extra issue.
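
The copy-in concern can be modeled in C, purely as an illustration. The names `copy_in_section`, `release_section`, and `live_temp_bytes` are hypothetical, and the sketch assumes that each hidden temporary stays alive until its send completes, which is the scenario described above rather than a statement about any particular Fortran compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Bytes currently held in hidden contiguous copies (model only). */
static size_t live_temp_bytes = 0;

/* Models copy-in: packs `count` doubles, taken every `stride` elements,
 * into a freshly allocated contiguous temporary. */
static double *copy_in_section(const double *base, size_t count, size_t stride)
{
    assert(stride >= 1);
    double *tmp = malloc(count * sizeof *tmp);
    if (tmp == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        tmp[i] = base[i * stride];
    live_temp_bytes += count * sizeof *tmp;
    return tmp;
}

/* Models the end of the hidden copy's lifetime. */
static void release_section(double *tmp, size_t count)
{
    live_temp_bytes -= count * sizeof *tmp;
    free(tmp);
}
```

Two outstanding sends of the same strided section hold two independent temporaries in this model, so the hidden memory grows with the number of concurrent sends of that section, not with the size of the user's array alone.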

> 7) In-place compression and/or encryption of the messages. Compression in
> particular can work wonders on monotonous messages, and cost less time in
> total than the transmission of so many giga-zeroes, for example. Again, having
> send buffer access allowed and const modifier attached will kill this huge
> optimization opportunity. Too bad.
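
To show why in-place transforms conflict with send buffer access, here is a toy sketch. It is not real compression or encryption: a byte-wise XOR stands in for the transform because it is trivially reversible, and `send_with_in_place_transform` is a hypothetical name. The `peek_index` parameter models a user read of the buffer while the send is in flight.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an in-place transform (NOT real encryption).
 * Applying it twice with the same key restores the original bytes. */
static void xor_in_place(unsigned char *buf, size_t len, unsigned char key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/* Hypothetical send path that borrows the user's buffer as scratch space.
 * Returns the byte a concurrent reader would observe at peek_index while
 * the message is "in flight". */
static unsigned char send_with_in_place_transform(unsigned char *buf,
                                                  size_t len,
                                                  unsigned char key,
                                                  size_t peek_index)
{
    assert(peek_index < len);
    xor_in_place(buf, len, key);           /* transform before transmission */
    unsigned char seen = buf[peek_index];  /* what a concurrent reader sees */
    xor_in_place(buf, len, key);           /* restore before completion */
    return seen;
}
```

The buffer is intact again once the send completes, but any read during the send sees transformed data. Note also that the buffer parameter cannot be `const` here, which is where the const proposal would rule this approach out at the interface level.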

While I hope you're joking about the giga-zeros, you do raise a valid
concern, in that there are a number of optimizations regarding compression,
encryption, and endian-swapping that may be eliminated by this proposal.  On
the flip side, as I argued in a previous e-mail, the user gains quite a bit
in usability.  We have to balance these two factors.  Since users know where
my office is, I tend to lean towards making their lives easier, particularly
when it doesn't cause extra work for me.  But I already sent an e-mail on
that point...

AS> I see and respect this point of view. I just beg to suggest that the balance in this case may have to be adjusted to allow more freedom of implementation, which would generally follow the spirit of many (most?) decisions MPI Forum made before.

Our experience with Open MPI was that the potential for performance in other
parts of the MPI (collectives, etc.) far outweighed any send-side tricks we
could think of (and you haven't brought up any we didn't think of).  So if
we wanted to do compression or encryption, it would be done with send-side
bounce buffers.  Since a software pipeline would practically be required to
get good performance, the bounce buffer would not have to scale with the
size of the communication buffer but instead with the properties of the
network pipeline.  Of course, my opinion would be that it would be much
simpler and much higher performance to support compression or encryption as
part of the NIC as the data is streamed to the network.  Otherwise, you're
burning memory bandwidth doing the extra copy (even in the modify the send
buffer case), and memory bandwidth is a precious resource for HPC
applications.
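
The bounce-buffer alternative described above can be sketched like this. It is a minimal model under stated assumptions: `struct wire` and `wire_put` are hypothetical stand-ins for the network, the XOR again stands in for whatever transform is applied, and the bounce buffer is kept deliberately tiny to make the chunking visible.

```c
#include <assert.h>
#include <stddef.h>

#define BOUNCE_BYTES 8  /* tiny on purpose; sized by the pipeline, not the message */

/* Hypothetical sink standing in for the network. */
struct wire {
    unsigned char data[256];
    size_t used;
};

static void wire_put(struct wire *w, const unsigned char *chunk, size_t n)
{
    assert(w->used + n <= sizeof w->data);
    for (size_t i = 0; i < n; i++)
        w->data[w->used + i] = chunk[i];
    w->used += n;
}

/* Transforms and sends the message chunk by chunk through a fixed-size
 * bounce buffer. The user's buffer is only read, so it can be const.
 * Returns the number of chunks put on the wire. */
static size_t send_via_bounce(struct wire *w, const unsigned char *buf,
                              size_t len, unsigned char key)
{
    unsigned char bounce[BOUNCE_BYTES];
    size_t chunks = 0;

    for (size_t off = 0; off < len; off += BOUNCE_BYTES) {
        size_t n = len - off < BOUNCE_BYTES ? len - off : BOUNCE_BYTES;
        for (size_t i = 0; i < n; i++)
            bounce[i] = buf[off + i] ^ key;  /* transform into the bounce buffer */
        wire_put(w, bounce, n);
        chunks++;
    }
    return chunks;
}
```

The scratch memory stays at `BOUNCE_BYTES` regardless of the message length, at the price of one extra pass over the data, which is the memory bandwidth cost mentioned above.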

AS> We're talking about possibly huge savings here. And the proposal seems to limit the ways in which these savings can be achieved. We're engineers and researchers here, we can work around any obstacle (death and taxes included). My point is that we may have better uses for our creativity than working around a limitation artificially imposed on the implementation in this particular place.

One other point to consider.  If I was a user, I'd expect that my one-sided
traffic also be compressed, encrypted, or endian-swapped.  The standard
already requires multiple accesses be legal for one-sided communication.  So
you're going to have a situation where some communication can use a
send-modify implementation and some can not.  I'm not familiar with how
Intel's MPI is architected, but Open MPI is architected such that decisions
such as compression, encryption, and endian-swapping would be made at a low
enough level that the code path is the same whether the message is a
point-to-point send or a one-sided put.  Since that's some of the most
complicated code in Open MPI, I can't foresee adding a second code path just
to get a (dubious) performance benefit.

AS> We're talking about big savings here (a factor of 2 or more, potentially). I'd not call that dubious, to start with.

Coming back to the argument that one-sided communication allows access to the engaged buffer, I ask myself whether that was the right decision and why it was made. It's not that I want to reverse that right now. It's about whether a questionable decision made once, for whatever reason, should predetermine our actions from then on. A precedent may be positive or negative. In this case, it's probably negative. It should not be used to justify the intention to do a wrong thing once again. It should be reviewed in the light of what is right.

Brian

--
   Brian W. Barrett
   Dept. 1422: Scalable Computer Architectures
   Sandia National Laboratories