[Mpi3-ft] RTS: Collectives and validate

Thu Jan 26 12:43:38 CST 2012

A bit of shorthand for the discussion:
 - Dense communicator/collectives: collectives valid only over dense
communicators (those with only alive processes)
 - Sparse communicator/collectives: collectives valid over sparse
communicators (those with alive and dead processes - dead processes
'recognized' via MPI_Comm_validate)

In many ways this is similar to the discussion of the SHRINK versus BLANK
communicator modes. I am still struggling to see the controversy over
standardizing the current 'option sparse' mode, so if someone can present
that point, I would greatly appreciate it. From what I heard in the
meeting, the problem was more with the shorthand of 'collectively
inactive/active' than with the semantic clarifications/modifications.

Option Dense:
-------------
The suggested proposal is that we only allow dense collectives, and that
sparse collectives can be built on top of the dense collectives.

Option Sparse:
--------------
Allows for collectives to be defined over sparse communicators. These
semantics are described in 17.7.2 of the current proposal (less than one
page of clarifications).

What is the impact of implementing 'option sparse' on top of 'option dense'?

A library that emulates 'option sparse' on top of 'option dense' would:
 - Need to provide one hidden buffer to translate the user-provided sparse
buffer to a dense buffer that the MPI library would accept. Then it would
need to copy the data back after the operation is complete. This would
effectively eliminate the benefit of MPI_IN_PLACE for these collectives.
It would also add significantly to the memory overhead and hurt the
performance of the collective operation. This change affects 4 collective
primitives {gather, gatherv, scatter, scatterv, <and nonblocking versions>}.
 - Need to provide two hidden buffers to translate between the sparse
buffer and dense buffer (as above). MPI_IN_PLACE provides no benefits. This
change affects 5 collective primitives {allgather, allgatherv, alltoall,
alltoallv, alltoallw, <and nonblocking versions>}.
 - No need to change 8 collective primitives {barrier, bcast, reduce,
allreduce, reduce_scatter, reduce_scatter_block, scan, exscan, <and
nonblocking versions>}, other than that the intermediate library will have
to virtualize ranks. Each of these only has one buffer that is not
relative to the numbering of processes in the communicator.
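
To make the gather case above concrete, here is a minimal sketch of the
translation such an intermediate library would have to do. It is plain
Python modelling the buffers as lists, not MPI code; the names
dense_rank_map and emulate_sparse_gather are hypothetical and do not come
from the proposal or from any MPI implementation, and the dense gather
itself is replaced by a simple loop.

```python
def dense_rank_map(alive):
    """Map each alive sparse rank to its dense rank (dead ranks get no entry)."""
    sparse_to_dense = {}
    for sparse_rank, is_alive in enumerate(alive):
        if is_alive:
            # Dense ranks are assigned in increasing sparse-rank order.
            sparse_to_dense[sparse_rank] = len(sparse_to_dense)
    return sparse_to_dense


def emulate_sparse_gather(alive, contributions, blank=None):
    """Model a gather over a sparse communicator built on a dense gather.

    alive[r] is True if sparse rank r is alive (as recognized by a validate
    step); contributions[r] is what rank r would send to the root.
    Returns (sparse_buf, dense_buf) so the hidden buffer is visible.
    """
    sparse_to_dense = dense_rank_map(alive)

    # Hidden dense buffer: one slot per alive process.
    dense_buf = [None] * len(sparse_to_dense)

    # Stand-in for the dense gather: each alive process fills its dense slot.
    for sparse_rank, dense_rank in sparse_to_dense.items():
        dense_buf[dense_rank] = contributions[sparse_rank]

    # Copy back into the user's sparse buffer, leaving blanks for dead ranks.
    sparse_buf = [blank] * len(alive)
    for sparse_rank, dense_rank in sparse_to_dense.items():
        sparse_buf[sparse_rank] = dense_buf[dense_rank]
    return sparse_buf, dense_buf
```

The extra dense_buf allocation and the copy-back loop are the overhead
described in the first bullet; dense_rank_map is the rank virtualization
that even the unchanged collectives would still need.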

Implementing 'option dense' on top of 'option sparse' is trivial. In both
options I believe that we need an MPI_Comm_validate operation, if only
for the fault tolerant agreement.

We have performed initial studies of the performance implications of
supporting 'option sparse' in a prototype in Open MPI (citations at
bottom). Since that time we have made further improvements to the already
good performance. The algorithmic modifications were trivial. In one
option (rerouting), we added a check to make sure the peer is alive before
sending to it, and route around it if it is dead. In another option, we
precompute the communication tree at validate time to avoid the lookup
altogether. There are still other implementation options that we are
planning on investigating. None of these require the memory overhead of the
intermediate library above, and most of them achieve good performance when
comparing 'dense' and 'sparse' communicators.
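
As a rough illustration of the "precompute at validate time" idea, the
sketch below builds a binomial broadcast tree over only the alive ranks of
a sparse communicator. It is plain Python, not the Open MPI prototype's
code; the binomial tree shape and the names build_bcast_tree and
reached_by_bcast are assumptions chosen for the example.

```python
def build_bcast_tree(alive, root):
    """Return {rank: [children]} for a binomial tree over alive ranks only.

    Intended to be computed once, when the set of failed processes is
    agreed upon, so later broadcasts need no per-send liveness lookup.
    """
    if not alive[root]:
        raise ValueError("root must be alive")
    n = len(alive)
    # Alive ranks in order starting from the root; position = virtual rank.
    order = [(root + i) % n for i in range(n) if alive[(root + i) % n]]
    size = len(order)
    children = {rank: [] for rank in order}
    for vrank in range(size):
        mask = 1
        while mask <= vrank:
            mask <<= 1
        # mask is now the smallest power of two greater than vrank;
        # vrank's children are vrank + mask, vrank + 2*mask, ... within size.
        while vrank + mask < size:
            children[order[vrank]].append(order[vrank + mask])
            mask <<= 1
    return children


def reached_by_bcast(tree, root):
    """Walk the tree from the root and return the set of ranks reached."""
    reached, frontier = {root}, [root]
    while frontier:
        rank = frontier.pop()
        for child in tree[rank]:
            reached.add(child)
            frontier.append(child)
    return reached
```

Dead ranks never appear in the tree, so no message is addressed to them
and the user's buffers keep their original sparse layout.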

Unlike the MPI_ANY_SOURCE issue, where there are problems with discussing
progress and some disagreement on the semantic changes, in the case of
collectives there is no problem with progress and there has not been any
disagreement on the semantic clarifications in section 17.7.2. So, since
the intermediate library is considerably heavyweight and there are use
cases that desire the 'sparse' representation, I see no motivation for
compromising the current standard language.

-- Josh 'team sparse' Hursey

---------------------------------------------
Hursey, J., Graham, R.
"Analyzing fault aware collective performance in a process fault tolerant
MPI"
2012
http://dx.doi.org/10.1016/j.parco.2011.10.010
---------------------------------------------
Hursey, J., Graham, R.
"Preserving Collective Performance across Process Failure for a Fault
Tolerant MPI"
2011
http://dx.doi.org/10.1109/IPDPS.2011.274
---------------------------------------------
Hursey, J., Naughton, T., Vallee, G., Graham, R.
"A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI"
2011
http://www.springerlink.com/content/850513k973553121/
---------------------------------------------

On Wed, Jan 25, 2012 at 5:38 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:

> It has been proposed that once a process fails in a communicator the
> posting of new collectives is disallowed, and the user must create a new,
> dense communicator if they want to have access to collective operations
> again. This is different than the current proposal where after calling
> MPI_Comm_validate the posting of new collective operations is allowed over
> the communicator - even though it has 'blanks' in it.
>
> We have strong use cases for being able to call collectives over
> communicators with "blanks" in them. So this functionality is required.
> However, it was mentioned that a third-party library might be able to 'fake
> it' by converting a sparse communicator (exposed to the user) into
> operations over the dense communicator (required by MPI). The point would
> be to introduce the 'validate/re-enable collectives' semantic as a separate
> ticket after the RTS proposal is voted in so as to eliminate the need for
> such a library.
>
> There are some problems with the interposition library solution since, for
> example, with vector collectives the library would need to do some double
> buffering to adjust the sparse data buffer provided by the user into a
> dense data buffer that MPI would require. There are other issues, but I
> wanted to think through this a bit more before elaborating more on this
> point.
>
> One thing I noted on the call was that (from my inspection of the
> codebase) FT-MPI uses a dense shadow communicator for many of their
> collective operations. In Open MPI, we experimented with a different method
> in which we either worked around the failed processes or created a
> rebalanced communication tree for collectives over a sparse communicator
> (we published the results of this initial study). The point is that by
> exposing the MPI library to the sparse communicator (collective over a
> communicator with 'blanks') the MPI library has more flexibility in how it
> manages the collective communication. And as part of that flexibility it
> does not require all the badness that the interposition library would have
> to go through to provide the same functionality.
>
> So since the additional semantics for collectives are not controversial
> (IMHO) and there is a strong use case then why would we not try to get it
> right from the beginning?
>
> In short, I have reservations about this proposal, but I wanted more time
> to work through the implications. I'll need to post more on that tomorrow
> (hopefully). But, as always, comments are welcome in the meantime.
>
> -- Josh
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey