[mpi3-coll] Telecon to discuss DV-collectives (Alltoalldv)
Adam T. Moody
moody20 at llnl.gov
Thu Oct 13 12:55:47 CDT 2011
A couple more things...
In Santorini, Rich brought up a couple of concerns that should be
considered. For one, he suggested that a slightly more general
interface might be better, one in which you specify a base count that
applies to all processes and then provide a list only for the processes
whose counts differ from that base. This could subsume the interface I
listed below if you set the base count to 0 and then list each non-zero
item. The nice thing about the base count approach is that it naturally
handles the "mostly regular" case, in which nearly all procs have the
same amount of data but a few have a little more or a little less. For
example, this interface might look something like the following (I added
basecount and removed the displacements, which need some thought here):
MPI_Alltoalldv(
  sendbuf, sbasecount, nsends, sendranks[], sendcounts[], sendtype, /* O(sbasecount*P + k) list */
  recvbuf, rbasecount, nrecvs, recvranks[], recvcounts[], recvtype, /* O(rbasecount*P + k) list */
  comm
);
In the above, you would send/receive basecount data items from all
procs, except for a few ranks, whose counts are listed explicitly in
O(k) lists. Setting sbasecount/rbasecount=0 essentially reduces this
interface to the one below (ignoring displacements).
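To make the "mostly regular" case concrete, here is a rough sketch of how
an application might call this variant (MPI_Alltoalldv is of course still
only a proposal, so the exact argument order and types here are just my
assumption based on the signature above):

  /* exchange 4 doubles with every rank, except send 10 to rank 5
   * and expect 10 from rank 7 */
  int sendranks[]  = { 5 };   /* ranks whose send count differs */
  int sendcounts[] = { 10 };
  int recvranks[]  = { 7 };   /* ranks whose recv count differs */
  int recvcounts[] = { 10 };

  MPI_Alltoalldv(sendbuf, 4 /* sbasecount */, 1 /* nsends */,
                 sendranks, sendcounts, MPI_DOUBLE,
                 recvbuf, 4 /* rbasecount */, 1 /* nrecvs */,
                 recvranks, recvcounts, MPI_DOUBLE,
                 comm);
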
The other thing Rich was concerned about was whether the interface could
be specified in such a way as to reduce communication of the distributed
count values. One extension that might help here is to allow the
application to provide minimum and maximum count values that would apply
globally across all processes. For example, with a maximum count value
in gatherdv, you could set up a tree expecting the maxcount from all
children but just receive less during each step. On the other hand, if
the min and max values are far apart, the implementation might fall back
to something more dynamic so it doesn't allocate a bunch of temporary
memory that it'll never use. If an application can't specify minimum or
maximum values, it could always pass MPI_UNDEFINED for the min/max values.
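As a small illustration of why a global maxcount helps (this is just my
own implementation-side sketch, not part of the proposal): each step of a
gatherdv tree could pre-post a receive sized for maxcount elements and
then use MPI_Get_count to learn how much actually arrived, so no extra
exchange of count values is needed.

  #include <mpi.h>

  /* receive one child's contribution into tmpbuf, which has room for
   * maxcount elements; returns the number of elements actually received */
  static int recv_child(double *tmpbuf, int maxcount, int child,
                        int tag, MPI_Comm comm)
  {
      MPI_Status status;
      int actual;

      /* post the receive for the worst case... */
      MPI_Recv(tmpbuf, maxcount, MPI_DOUBLE, child, tag, comm, &status);

      /* ...but find out how much the child really sent */
      MPI_Get_count(&status, MPI_DOUBLE, &actual);
      return actual;
  }
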
-Adam
Adam T. Moody wrote:
>Hi Torsten,
>Soon after we decided to request alltoallv to be added to the dv ticket,
>I realized there is one important difference between this and the
>dynamic sparse data exchange (DSDE) case. With alltoallv, the receiver
>knows which ranks it will receive data from, but it doesn't with DSDE.
>
>I think for alltoalldv, you just need each process to provide two lists:
>a send list and a receive list. Whereas the current API looks like this:
>
>MPI_Alltoallv(
> sendbuf, sendcounts[], sdispls[], sendtype, /* O(P) list */
> recvbuf, recvcounts[], rdispls[], recvtype, /* O(P) list */
> comm
>);
>
>Provide a new O(k) interface like so (have to add a count to each list
>to give its length, and a list of ranks):
>
>MPI_Alltoalldv(
>  nsends, sendbuf, sendranks[], sendcounts[], sdispls[], sendtype, /* O(k) list */
>  nrecvs, recvbuf, recvranks[], recvcounts[], rdispls[], recvtype, /* O(k) list */
>  comm
>);
>
>I think the interface you're pondering would solve the tougher problem
>of DSDE.
>-Adam
>
>
>Torsten Hoefler wrote:
>
>
>
>>Hello Coll-WG,
>>
>>At the last meeting, we decided to push the scalable (dv) collective
>>proposal further towards a reading. The present forum members were
>>rather clearly supporting the proposal by straw-vote.
>>
>>We also decided to include alltoalldv in the ticket, a call where every
>>sender specifies the destinations it sends to as a list. We did not
>>discuss the specification of the receive buffer though. If we force this
>>to be of size P blocks (for P processes in the comm, and a block being
>>count*sizeof(extent datatype)), then we're back to non-scalable again. I
>>see the following alternatives:
>>
>>1) MPI allocates memory for the received blocks and returns a list of
>> nodes where it received from and the allocated buffer with the
>> received data
>>2) the user allocates a buffer of size N (<=P) and provides it to the
>> MPI library, the library fills the buffer and returns a list of
>> source nodes. If a process received from more than N nodes, the call
>> fails (MSG truncated).
>>3) the user specifies a callback function for each received block :-)
>>
>>I prefer 3; however, this has the same issues as active messages and
>>other callbacks and will most likely be discussed to death. 2 thus seems
>>most reasonable. Does anybody have another proposal?
>>
>>We may want to split the ticket into two parts (separating out
>>alltoalldv).
>>
>>I think we should have a quick (~30 mins) telecon to discuss this
>>matter. Please indicate your availability in the following doodle before
>>Friday 10/7 if you're interested in participating in the discussion.
>>
>>http://www.doodle.com/4wkqnsgi8nfhfdw3
>>
>>The ticket is https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/264 .
>>
>>Thanks & Best,
>> Torsten Hoefler
>>
>