<div>A bit of shorthand for the discussion:</div><div> - Dense communicator/collectives: collectives valid only over dense communicators (those with only alive processes)</div><div> - Sparse communicator/collectives: collectives valid over sparse communicators (those with alive and dead processes - dead processes 'recognized' via MPI_Comm_validate)</div>

<div><br></div><div>In many ways this is similar to the discussion of the SHRINK versus BLANK communicator modes. I am still struggling to see the controversy over standardizing the current 'option sparse' mode, so if someone can present that point, I would greatly appreciate it. From what I heard in the meeting, the problem was more with the shorthand of 'collectively inactive/active' than with the semantic clarifications/modifications.</div>

<div><br></div><div><br></div><div>Option Dense:</div><div>-------------</div><div>The suggested proposal is that we only allow dense collectives, and that sparse collectives can be built on top of the dense collectives.</div>

<div><br></div><div>Option Sparse:</div><div>-------------</div><div>Allows for collectives to be defined over sparse communicators. These semantics are described in 17.7.2 of the current proposal (less than one page of clarifications). </div>

<div><br></div><div><br></div><div>What is the impact of implementing 'option sparse' on top of 'option dense'?</div><div><br></div><div>A library that emulates 'option sparse' on top of 'option dense' would:</div>

<div> - Need to provide one hidden buffer to translate the user-provided sparse buffer into a dense buffer that the MPI library would accept, and then copy the data back after the operation completes. This would effectively eliminate the benefit of MPI_IN_PLACE for these collectives. It would also add significantly to the memory overhead and hurt the performance of the collective operation. This change affects 4 collective primitives {gather, gatherv, scatter, scatterv, <and nonblocking versions>}.</div>

<div> - Need to provide two hidden buffers to translate between the sparse buffer and dense buffer (as above). MPI_IN_PLACE provides no benefits. This change affects 5 collective primitives {allgather, allgatherv, alltoall, alltoallv, alltoallw, <and nonblocking versions>}.</div>

<div> - No need to change 8 collective primitives {barrier, bcast, reduce, allreduce, reduce_scatter, reduce_scatter_block, scan, exscan, <and nonblocking versions>}. Each of these only has one buffer that is not relative to the numbering of processes in the communicator. The only change is that the intermediate library will have to virtualize ranks.</div>
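<div><br></div><div>To make the hidden-buffer cost in the list above concrete, here is a rough sketch of the translation step such an intermediate library would need around a gather or scatter. It contains no MPI calls. The helper names (build_rank_maps, pack_sparse_to_dense, unpack_dense_to_sparse) and the alive-mask representation are assumptions made for illustration, not part of the proposal or of any existing library.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: build rank translation tables from an alive mask.
 * dense_of[s] is the dense rank of sparse rank s, or -1 if s has failed.
 * sparse_of[d] is the sparse rank behind dense rank d.
 * Returns the number of alive processes. */
int build_rank_maps(const int *alive, int nsparse, int *dense_of, int *sparse_of)
{
    int ndense = 0;
    assert(nsparse >= 0);
    for (int s = 0; s < nsparse; s++) {
        if (alive[s]) {
            dense_of[s] = ndense;
            sparse_of[ndense] = s;
            ndense++;
        } else {
            dense_of[s] = -1;
        }
    }
    return ndense;
}

/* Before a scatter at the root: copy the slots of alive ranks from the
 * user's sparse buffer into the hidden dense buffer. */
void pack_sparse_to_dense(const void *sparse, void *dense,
                          const int *sparse_of, int ndense, size_t slot_bytes)
{
    const char *src = sparse;
    char *dst = dense;
    for (int d = 0; d < ndense; d++) {
        memcpy(dst + (size_t)d * slot_bytes,
               src + (size_t)sparse_of[d] * slot_bytes, slot_bytes);
    }
}

/* After a gather at the root: copy results back into the user's sparse
 * buffer. Slots that belong to failed ranks are left untouched. */
void unpack_dense_to_sparse(const void *dense, void *sparse,
                            const int *sparse_of, int ndense, size_t slot_bytes)
{
    const char *src = dense;
    char *dst = sparse;
    for (int d = 0; d < ndense; d++) {
        memcpy(dst + (size_t)sparse_of[d] * slot_bytes,
               src + (size_t)d * slot_bytes, slot_bytes);
    }
}
```

<div>Every gather or scatter pays for one extra buffer of ndense slots plus a full copy, and the all-to-all style collectives would need this on both the send and receive side. The dense_of table is also what the library would use to virtualize ranks such as the root argument.</div>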

<div><br></div><div><br></div><div>Implementing 'option dense' with 'option sparse' is trivial. In both options I believe that we need an MPI_Comm_validate operation, if only for the fault tolerant agreement.</div>

<div><br></div><div>We have performed initial studies of the performance implications of supporting 'option sparse' in a prototype in Open MPI (citations at bottom). Since that time we have further improved the already good performance. The algorithmic modifications were trivial. In one option (rerouting), we added a check to make sure the peer is alive before sending to it, and route around it if it is dead. In another option, we precompute the communication tree at validate time to avoid the lookup altogether. There are still other implementation options that we are planning on investigating. None of them requires the memory overhead of the intermediate library above, and most of them achieve good performance when comparing 'dense' and 'sparse' communicators.</div>
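<div><br></div><div>As a rough illustration of the two options just described, the sketch below shows a liveness check that skips a dead peer and a parent lookup in a tree precomputed over the alive ranks. This is not the Open MPI prototype code. The function names, the ring-style skip used for rerouting, and the binomial tree rooted at the first alive rank are all simplifying assumptions.</div>

```c
#include <assert.h>

/* Rerouting sketch (assumed ring-style skip): starting from the intended
 * peer, return the first alive rank, wrapping around the communicator.
 * Returns -1 if no rank is alive. */
int reroute_peer(const int *alive, int nranks, int peer)
{
    assert(peer >= 0 && peer < nranks);
    for (int i = 0; i < nranks; i++) {
        int candidate = (peer + i) % nranks;
        if (alive[candidate])
            return candidate;
    }
    return -1;
}

/* Precompute sketch, intended to run once at validate time: collect the
 * sparse ranks of all alive processes in order. Returns how many there are. */
int collect_alive(const int *alive, int nranks, int *alive_ranks)
{
    int nalive = 0;
    for (int r = 0; r < nranks; r++) {
        if (alive[r])
            alive_ranks[nalive++] = r;
    }
    return nalive;
}

/* Parent in a binomial tree built over positions in alive_ranks and rooted
 * at position 0. Clearing the lowest set bit of a position gives the
 * position of its parent. Returns the sparse rank of the parent, or -1
 * for the root. */
int tree_parent(const int *alive_ranks, int nalive, int pos)
{
    assert(pos >= 0 && pos < nalive);
    if (pos == 0)
        return -1;
    return alive_ranks[pos & (pos - 1)];
}
```

<div>In the first function the liveness check sits on the send path of every collective. In the second approach the check is paid once at validate time and later collectives only index into alive_ranks. Neither needs a hidden copy of the user's data buffer.</div>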

<div><br></div><div><br></div><div>Unlike the MPI_ANY_SOURCE issue, where there are problems with discussing progress and some disagreement on the semantic changes, in the case of collectives there is no problem with progress and there has not been any disagreement on the semantic clarifications in section 17.7.2. Since the intermediate library is considerably heavyweight and there are use cases that want the 'sparse' representation, I see no motivation for compromising the current standard language.</div>

<div><br></div><div><br></div><div>-- Josh 'team sparse' Hursey</div><div><br></div><div><br></div><div>---------------------------------------------</div><div>Hursey, J., Graham, R.</div><div>"Analyzing fault aware collective performance in a process fault tolerant MPI"</div>

<div>2012</div><div><a href="http://dx.doi.org/10.1016/j.parco.2011.10.010">http://dx.doi.org/10.1016/j.parco.2011.10.010</a></div><div>---------------------------------------------</div><div>Hursey, J., Graham, R.</div><div>

"Preserving Collective Performance across Process Failure for a Fault Tolerant MPI"</div><div>2011</div><div><a href="http://dx.doi.org/10.1109/IPDPS.2011.274">http://dx.doi.org/10.1109/IPDPS.2011.274</a></div><div>

---------------------------------------------</div><div>Hursey, J., Naughton, T., Vallee, G., Graham, R.</div><div>"A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI"</div><div>2011</div>

<div><a href="http://www.springerlink.com/content/850513k973553121/">http://www.springerlink.com/content/850513k973553121/</a></div><div>---------------------------------------------</div><div><br></div><br><div class="gmail_quote">

On Wed, Jan 25, 2012 at 5:38 PM, Josh Hursey <span dir="ltr"><<a href="mailto:jjhursey@open-mpi.org">jjhursey@open-mpi.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

It has been proposed that once a process fails in a communicator the posting of new collectives is disallowed, and the user must create a new, dense communicator if they want to have access to collective operations again. This is different from the current proposal, where after calling MPI_Comm_validate the posting of new collective operations is allowed over the communicator, even though it has 'blanks' in it.<div>


<br></div><div>We have strong use cases for being able to call collectives over communicators with "blanks" in them. So this functionality is required. However, it was mentioned that a third-party library might be able to 'fake it' by converting a sparse communicator (exposed to the user) into operations over the dense communicator (required by MPI). The point would be to introduce the 'validate/re-enable collectives' semantic as a separate ticket after the RTS proposal is voted in so as to eliminate the need for such a library.</div>


<div><br></div><div>There are some problems with the interposition library solution since, for example, with vector collectives the library would need to do some double buffering to adjust the sparse data buffer provided by the user into a dense data buffer that MPI would require. There are other issues, but I wanted to think through this a bit more before elaborating more on this point.</div>


<div><br></div><div>One thing I noted on the call was that (from my inspection of the codebase) FT-MPI uses a dense shadow communicator for many of their collective operations. In Open MPI, we experimented with a different method in which we either worked around the failed processes or created a rebalanced communication tree for collectives over a sparse communicator (we published the results of this initial study). The point is that by exposing the MPI library to the sparse communicator (collective over a communicator with 'blanks') the MPI library has more flexibility in how it manages the collective communication. And as part of that flexibility it does not require all the badness that the interposition library would have to go through to provide the same functionality.</div>


<div><br></div><div>Since the additional semantics for collectives are not controversial (IMHO) and there is a strong use case, why would we not try to get it right from the beginning?</div><div><br></div><div>In short, I have reservations about this proposal, but I wanted more time to work through the implications. I'll need to post more on that tomorrow (hopefully). But, as always, comments are welcome in the meantime.</div>

<span class="HOEnZb"><font color="#888888">

<div><br></div><div>-- Josh<br clear="all"><div><br></div>-- <br>Joshua Hursey<br>Postdoctoral Research Associate<br>Oak Ridge National Laboratory<br><a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>


</div>

</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br>Joshua Hursey<br>Postdoctoral Research Associate<br>Oak Ridge National Laboratory<br><a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>