[mpi3-coll] Telecon to discuss DV-collectives (Alltoalldv)

Adam T. Moody moody20 at llnl.gov
Thu Oct 13 12:55:47 CDT 2011

A couple more things...

In Santorini, Rich brought up a couple of concerns that should be 
considered.  For one, he suggested that a slightly more general 
interface might be better in which you specify a base count for all 
processes, and then provide a list of the processes whose counts differ 
from it.  This could subsume the interface I listed below if you set 
the base count to be 0 and then list each non-zero item.  The nice thing 
about the base count approach is that it handles the "mostly 
regular" case well, in which nearly all procs have the same amount of data 
but only a few have a little more or a little less.  For example, this 
interface might look something like the following (added basecount and 
removed displacements which need some thought here):

  sendbuf, sbasecount, nsends, sendranks[], sendcounts[], sendtype,  /* O(sbasecount*P + k) list */
  recvbuf, rbasecount, nrecvs, recvranks[], recvcounts[], recvtype,  /* O(rbasecount*P + k) list */

In the above, you would send/receive basecount data items from all 
procs, except for a few ranks, whose counts are listed explicitly in 
O(k) lists.  Setting sbasecount/rbasecount=0 essentially reduces this 
interface to the one below (ignoring displacements).
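
To make the base count semantics concrete, here is a rough sketch in plain C 
(no MPI calls).  The helper names count_for_rank and total_count are made up 
for illustration and are not part of the proposal, and I am assuming each 
rank appears at most once in the exception list:

```c
#include <assert.h>

/* Hypothetical helper, not part of MPI or of the proposal: return the
 * count that applies to `rank`.  Ranks named in the O(k) exception list
 * use their explicit count; every other rank uses basecount. */
static int count_for_rank(int rank, int basecount, int nexcept,
                          const int ranks[], const int counts[])
{
    for (int i = 0; i < nexcept; i++) {
        if (ranks[i] == rank)
            return counts[i];
    }
    return basecount;
}

/* Hypothetical helper: total number of items exchanged with all nprocs
 * ranks, computed without building an O(P) array.
 * Assumption: each rank is listed at most once in the exception list. */
static long total_count(int nprocs, int basecount, int nexcept,
                        const int counts[])
{
    assert(nexcept >= 0 && nexcept <= nprocs);
    long total = (long)basecount * (nprocs - nexcept);
    for (int i = 0; i < nexcept; i++)
        total += counts[i];
    return total;
}
```

With basecount=0 the exception list is the entire exchange, which is the 
O(k) interface quoted below.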

The other thing Rich was concerned about was whether an interface could 
be specified in such a way to reduce communication of the distributed 
count values.  One extension that might help here is to allow the 
application to provide minimum and maximum count values that would apply 
globally across all processes.  For example, with a maximum count value 
in gatherdv, you could set up a tree expecting the maxcount from all 
children but just receive less during each step.  On the other hand, if 
the min and max values are far apart, the implementation might fall back 
to something more dynamic so it doesn't allocate a bunch of temporary 
memory that it'll never use.  If an application can't specify minimum or 
maximum values, it could always pass MPI_UNDEFINED for the min/max values.
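
As a rough sketch of the kind of decision an implementation might make with 
these hints (plain C, no MPI calls): everything named here is hypothetical, 
including the slack_factor threshold and the COUNT_UNDEFINED stand-in used 
in place of MPI_UNDEFINED so the example does not need an MPI header.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for MPI_UNDEFINED (assumption: the real value comes from mpi.h). */
#define COUNT_UNDEFINED (-1L)

enum gather_plan {
    PLAN_PREALLOC_MAX,  /* size every child's slot for maxcount up front */
    PLAN_DYNAMIC        /* learn the actual sizes at run time */
};

/* Hypothetical policy: preallocate only when the application supplied both
 * bounds and maxcount is within slack_factor times mincount, so that little
 * temporary memory is wasted.  slack_factor is an invented tuning knob. */
static enum gather_plan choose_plan(long mincount, long maxcount,
                                    long slack_factor)
{
    if (mincount == COUNT_UNDEFINED || maxcount == COUNT_UNDEFINED)
        return PLAN_DYNAMIC;
    assert(mincount >= 0 && mincount <= maxcount);
    if (maxcount <= slack_factor * mincount)
        return PLAN_PREALLOC_MAX;
    return PLAN_DYNAMIC;
}

/* Temporary bytes a tree node would reserve when expecting maxcount items
 * of size `extent` bytes from each of its nchildren children. */
static size_t prealloc_bytes(long maxcount, int nchildren, size_t extent)
{
    return (size_t)maxcount * (size_t)nchildren * extent;
}
```

For example, with mincount=90, maxcount=100 and a slack factor of 2 this 
picks the preallocated tree, while mincount=1, maxcount=1000 falls back to 
the dynamic path.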

Adam T. Moody wrote:

>Hi Torsten,
>Soon after we decided to request alltoallv to be added to the dv ticket, 
>I realized there is one important difference between this and the 
>dynamic sparse data exchange (DSDE) case.  With alltoallv, the receiver 
>knows which ranks it will receive data from, but it doesn't with DSDE.
>I think for alltoalldv, you just need each process to provide two lists: 
>a send list and receive list.  Where the current API looks like this:
>  sendbuf, sendcounts[], sdispls[], sendtype,  /* O(P) list */
>  recvbuf, recvcounts[], rdispls[], recvtype,  /* O(P) list */
>  comm
>Provide a new O(k) interface like so (have to add a count to each list 
>to give its length, and a list of ranks):
>  nsends, sendbuf, sendranks[], sendcounts[], sdispls[], sendtype,  /* O(k) list */
>  nrecvs, recvbuf, recvranks[], recvcounts[], rdispls[], recvtype,  /* O(k) list */
>  comm
>I think the interface you're pondering would solve the tougher problem 
>of DSDE.
>Torsten Hoefler wrote:
>>Hello Coll-WG,
>>At the last meeting, we decided to push the scalable (dv) collective
>>proposal further towards a reading. The present forum members were
>>rather clearly supporting the proposal by straw-vote.
>>We also decided to include alltoalldv in the ticket, a call where every
>>sender specifies the destinations it sends to as a list. We did not
>>discuss the specification of the receive buffer though. If we force this
>>to be of size P blocks (for P processes in the comm, and a block being
>>count*sizeof(extent datatype)), then we're back to non-scalable again. I
>>see the following alternatives:
>>1) MPI allocates memory for the received blocks and returns a list of
>>  nodes where it received from and the allocated buffer with the
>>  received data
>>2) the user allocates a buffer of size N (<=P) and provides it to the
>>  MPI library, the library fills the buffer and returns a list of
>>  source nodes. If a process received from more than N nodes, the call
>>  fails (MSG truncated).
>>3) the user specifies a callback function for each received block :-)
>>I prefer 3, however, this has the same issues as active messages and
>>other callbacks and will most likely be discussed to death. 2 seems thus
>>most reasonable. Does anybody have another proposal?
>>We may want to split the ticket into two parts (separating out
>>I think we should have a quick (~30 mins) telecon to discuss this
>>matter. Please indicate your availability in the following doodle before
>>Friday 10/7 if you're interested to participate in the discussion.
>>The ticket is https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/264 .
>>Thanks & Best,
>> Torsten Hoefler
