[Mpi3-rma] FW: MPI 3 RMA Examples needed

Jeff Hammond jeff.science at gmail.com
Mon Feb 1 10:13:23 CST 2010


To add to Robert's point about non-scalable metadata (remote addresses
etc.), it will be prohibitively expensive to use Global Arrays (GA) on
our next machine without remote method invocation (RMI).  According to
the Internet [1], that machine will have ~0.75M cores.  In the most
minimal scenario (8 bytes per remote address), that's 6 MB to index
the whole machine, if one runs one process per core.  If one instead
assumes IB metadata with static all-to-all connectivity (stupid, of
course), that's 44 kB * 0.75M = 33 GB per process, which is
impossible.  While globally-scoped arrays are probably a poor idea to
start with, some applications will try to use them.  Even storing only
one copy of the metadata per node costs O(50K)*sizeof(metadata) =
50 MB per array if one is fairly stingy (~1 kB of metadata per node).
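
To make the arithmetic concrete, here is a back-of-the-envelope
sketch.  The 8-byte address, the 44 kB IB figure, and the ~1 kB
per-node metadata size are the assumptions stated above, not
measurements:

/* Back-of-the-envelope metadata footprints, using the figures above. */
#include <stdio.h>

int main(void)
{
    const double cores      = 0.75e6; /* ~0.75M cores, per [1]            */
    const double addr_bytes = 8.0;    /* one remote address per process   */
    const double ib_bytes   = 44e3;   /* assumed IB metadata / connection */
    const double nodes      = 50e3;   /* O(50K) nodes                     */
    const double node_meta  = 1e3;    /* ~1 kB of metadata per node       */

    printf("addresses only   : %4.0f MB\n", cores * addr_bytes / 1e6); /*  6 MB */
    printf("static IB conn.  : %4.0f GB\n", cores * ib_bytes  / 1e9);  /* 33 GB */
    printf("one copy per node: %4.0f MB per array\n",
           nodes * node_meta / 1e6);                                   /* 50 MB */
    return 0;
}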

A feature I hope to use to ameliorate this problem is to run RMA
through a metadata server.  One can do this right now with put/get,
but the latency will be worse than with RMI.  At present, one would
have to do two remote gets: a blocking one to the metadata server,
followed by the acquisition of the actual data.  With RMI, the
metadata operation need not block at the initiator, and the whole
transaction requires only two active messages plus a put from the
remote location back to the initiator.
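
Here is a minimal MPI-2 RMA sketch of the two-get pattern just
described.  The windows, block_meta_t, and all the names are invented
for illustration; both windows are assumed to have a displacement unit
of 1:

/* Two-get access through a metadata server (MPI-2 passive-target RMA). */
#include <mpi.h>

typedef struct {
    int      owner; /* rank that holds the block                */
    MPI_Aint disp;  /* byte displacement in the owner's window  */
} block_meta_t;

void get_block(int block_id, double *buf, int count,
               int meta_rank, MPI_Win meta_win, MPI_Win data_win)
{
    block_meta_t meta;

    /* Get #1: fetch the metadata record.  This must complete before
     * the data transfer can even start -- the blocking step above. */
    MPI_Win_lock(MPI_LOCK_SHARED, meta_rank, 0, meta_win);
    MPI_Get(&meta, (int)sizeof meta, MPI_BYTE, meta_rank,
            (MPI_Aint)block_id * (MPI_Aint)sizeof meta,
            (int)sizeof meta, MPI_BYTE, meta_win);
    MPI_Win_unlock(meta_rank, meta_win); /* forces completion */

    /* Get #2: fetch the actual data from wherever it lives. */
    MPI_Win_lock(MPI_LOCK_SHARED, meta.owner, 0, data_win);
    MPI_Get(buf, count, MPI_DOUBLE, meta.owner, meta.disp,
            count, MPI_DOUBLE, data_win);
    MPI_Win_unlock(meta.owner, data_win);
}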

Assuming we cross the 2M-node barrier in ~2015, the only practical way
to have globally-scoped data will be hierarchical metadata servers,
where the lack of RMI will make remote metadata acquisition even less
efficient.  It would not surprise me if the get(+get)^tiers method for
RMA were an order of magnitude slower than the (RMI)^tiers+put method,
given the likelihood of network contention across a hierarchical
network at this scale.
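
A crude way to see the gap is to count one-way network hops, charging
a blocking get a full round trip and a forwarded active message or put
a single hop.  The model and its costs are my own simplification, and
contention is not included:

/* Crude hop-count comparison of the two schemes above.  A blocking
 * get costs a round trip (2 one-way hops); a forwarded active message
 * or a put costs 1 one-way hop.  Contention is not modeled. */
#include <stdio.h>

int main(void)
{
    for (int tiers = 1; tiers <= 4; ++tiers) {
        int get_chain = 2 * (tiers + 1); /* tiers metadata gets + data get  */
        int rmi_chain = tiers + 1;       /* tiers forwarded AMs + final put */
        printf("tiers=%d: get-chain %d hops, rmi-chain %d hops\n",
               tiers, get_chain, rmi_chain);
    }
    return 0;
}

The hop count alone gives only a factor of two; it is the serialized
round trips being exposed to contention at every tier that pushes the
gap toward an order of magnitude.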

While we can debate the sanity of globally-scoped data at exascale,
there will always be science that is not amenable to simple domain
decomposition and thus requires some type of global view.  Whether the
data structures are simple a la GA or complex a la Madness, the only
sensible ways to implement them involve RMI.
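
Since MPI-2 has no RMI, the following sketch fakes the semantics with
two-sided messaging: the initiator ships the arguments of a DHT insert
to the owner, which runs the handler locally.  The tag, insert_args_t,
and the handler are all invented for illustration; a real
implementation would use a progress thread or hardware active
messages:

/* Minimal RMI-style insert faked over MPI point-to-point (2 ranks). */
#include <mpi.h>
#include <stdio.h>

#define TAG_RMI 7

typedef struct { int key; double val; } insert_args_t;

/* Toy "remote object": a tiny table local to each rank. */
static double table[16];

static void insert_handler(const insert_args_t *a)
{
    table[a->key % 16] = a->val; /* method runs at the data's owner */
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {             /* initiator: one one-way message */
        insert_args_t a = { .key = 3, .val = 42.0 };
        MPI_Send(&a, (int)sizeof a, MPI_BYTE, 1, TAG_RMI, MPI_COMM_WORLD);
    } else if (rank == 1) {      /* owner: receive and dispatch */
        insert_args_t a;
        MPI_Recv(&a, (int)sizeof a, MPI_BYTE, 0, TAG_RMI,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        insert_handler(&a);
        printf("rank 1: table[%d] = %g\n", a.key % 16, table[a.key % 16]);
    }

    MPI_Finalize();
    return 0;
}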

Jeff

On Sun, Jan 31, 2010 at 2:00 PM, Robert Harrison <...> wrote:
> Bill,
>
> A compelling and near universal example is maintaining
> and updating remote data structures.  Only arrays or simple static
> data structures can be treated efficiently with RMA, and even this assumes
> a non-scalable replication of addresses across all nodes.  More complex
> structures (trees, hash tables, ...) and scalable solutions even for arrays
> require remote method invocation to access state and to perform some
> computation.
>
> This example also motivates considerations of optional message ordering
> (for sequential consistency), optional atomicity (for correctness with multiple
> updaters), etc.  Even within a single application there is utility for different
> variants.
>
> Some remote operations will return a value and how to do so efficiently
> without running into network/NIC flow problems seems to be an issue.
>
> A concrete example is the sparse tree (in dimensions 1,2,..,6) in madness.
> These are stored in distributed hash tables.  Minimally, we must be able to insert,
> erase, replace, and read entries.  More generally the entries are objects and we
> wish to invoke their methods so that applications can be composed in the spirit
> of Charm++ ... messaging between objects addressed by their name in a
> namespace (container).
>
> Best wishes
>
>   Robert

-- 
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond



