[mpi3-coll] [Mpi-forum] additional keys for MPI_COMM_SPLIT_TYPE

Jeff Hammond jhammond at alcf.anl.gov
Mon Mar 25 12:38:30 CDT 2013


Hi Sayantan,

I don't want to standardize the key-value pairs (KVPs) that can be
used in an MPI_Info for this call, at least at this point, since they
may be extremely architecture- or implementation-specific.

I understand the motivation for e.g. all multi-rail InfiniBand
implementations to behave the same way, but this seems like a very
specific thing that is best left to the implementation+user
communities to sort out before standardizing anything.
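
For example (purely illustrative; the "rail" info key is made up here
and would be implementation-defined, while MPI_COMM_TYPE_DEVICE is the
key proposed in the ticket below):

    MPI_Info info;
    MPI_Comm railcomm;
    MPI_Info_create(&info);
    /* hypothetical, multi-rail-IB-specific key-value pair */
    MPI_Info_set(info, "rail", "0");
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_DEVICE, 0,
                        info, &railcomm);
    /* railcomm would group the processes sharing rail 0, or be
       MPI_COMM_NULL if the implementation cannot determine this */
    MPI_Info_free(&info);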

The other issue is that there are many KVPs that we could standardize,
and it's a lot of work to go down this path and I'm not sure it's
beneficial.  Do we need an MPI Side document to tell us what KVPs
should exist for nodes with multiple NVIDIA GPUs in them?  My guess is
that NVIDIA will rally their fanpeople and a consensus will emerge
such that no effort from the MPI Forum is required.

Best,

Jeff

On Mon, Mar 25, 2013 at 12:24 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:
> Does it make sense for some of these keys (esp. those that deal with devices, multi-rail, paths, etc.) to be part of the MPI Side documents?
>
> Sayantan
>
>
>> -----Original Message-----
>> From: mpi-forum-bounces at lists.mpi-forum.org [mailto:mpi-forum-
>> bounces at lists.mpi-forum.org] On Behalf Of Jim Dinan
>> Sent: Monday, March 25, 2013 7:16 AM
>> To: mpi-forum at lists.mpi-forum.org
>> Subject: Re: [Mpi-forum] additional keys for MPI_COMM_SPLIT_TYPE
>>
>> Hi Jeff,
>>
>> Please also include ticket #297 in the discussion (merge with your ticket, or
>> otherwise):
>>
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/297
>>
>> This ticket proposes MPI_COMM_TYPE_NEIGHBORHOOD, which looks similar
>> to MPI_COMM_TYPE_LOCALE.
>>
>>   ~Jim.
>>
>> On 3/24/13 2:36 PM, Jeff Hammond wrote:
>> > Martin encouraged me to socialize this with the Forum.  The idea here
>> > seems broader than just one working group so I'm sending to the entire
>> > Forum for feedback.
>> >
>> > See https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/372.  The text
>> > is included below in case it makes it easier to read on your
>> > phone...because I know this is that urgent :-)
>> >
>> > Jeff
>> >
>> >
>> > ---------- Forwarded message ----------
>> > From: MPI Forum <mpi-forum at lists.mpi-forum.org>
>> > Date: Sun, Mar 24, 2013 at 1:50 PM
>> > Subject: [MPI Forum] #372: additional keys for MPI_COMM_SPLIT_TYPE
>> > To:
>> >
>> >
>> > #372: additional keys for MPI_COMM_SPLIT_TYPE
>> > -----------------------------------------------------------------------
>> >   Reporter:   jhammond                    Owner:
>> >   Type:       Enhancements to standard    Status:     new
>> >   Priority:   Forum feedback requested    Milestone:  2013/03/11 Chicago, USA
>> >   Version:    MPI <next>                  Keywords:
>> >   Implementation status:  Waiting
>> >   Author votes (all 0):  Bill Gropp, Adam Moody, Dick Treumann,
>> >     George Bosilca, Bronis de Supinski, Jeff Squyres, Rich Graham,
>> >     Torsten Hoefler, Rolf Rabenseifner, Jesper Larsson Traeff,
>> >     David Solt, Rajeev Thakur, Alexander Supalov
>> > -----------------------------------------------------------------------
>> >   {{{MPI_COMM_SPLIT_TYPE}}} currently supports only one key,
>> >   {{{MPI_COMM_TYPE_SHARED}}}, which refers to shared-memory domains, as
>> >   supported by shared-memory windows ({{{MPI_WIN_ALLOCATE_SHARED}}}).
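>> >
>> >   For reference, typical use of the existing key together with a
>> >   shared-memory window looks like the following sketch (standard MPI-3
>> >   calls, shown only for context):
>> >   {{{
>> >       MPI_Comm shmcomm;
>> >       MPI_Win win;
>> >       double *base;
>> >       /* group the processes that share memory, then allocate a window
>> >          backed by that shared memory */
>> >       MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
>> >                            MPI_INFO_NULL, &shmcomm );
>> >       MPI_Win_allocate_shared( 1024*sizeof(double), sizeof(double),
>> >                                MPI_INFO_NULL, shmcomm, &base, &win );
>> >       /* ... load/store into the shared window ... */
>> >       MPI_Win_free( &win );
>> >       MPI_Comm_free( &shmcomm );
>> >   }}}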
>> >
>> >   This ticket proposes additional keys that appear useful enough to justify
>> >   standardization.
>> >
>> >   The first key addresses the need for users to have a portable way of
>> >   querying properties of the filesystem.  This key requires the user to
>> >   specify the file path of interest using an MPI_Info object.  The
>> >   communicator returned represents the set of processes that can write to a
>> >   single instance of that file path.  For a local disk, it is likely (but
>> >   not required) that this communicator is the same as the one returned by
>> >   {{{MPI_COMM_TYPE_SHARED}}}.  On the other hand, globally shared
>> >   filesystems will return a duplicate of {{{MPI_COMM_WORLD}}}.  When the
>> >   implementation cannot determine the answer, the resulting communicator is
>> >   {{{MPI_COMM_NULL}}} and users cannot assume any information about the
>> >   path.
>> >
>> >   To be perfectly honest, the choice of {{{key = MPI_COMM_TYPE_SHARED}}} to
>> >   be used only for shared memory is unfortunate, because we could have
>> >   instead used this key for anything that could be shared and let the
>> >   differences be enumerated by {{{MPI_Info}}}.  For example, shared-memory
>> >   windows could use {{{(key,value)=("memory","shared")}}} (granted, this is
>> >   a somewhat silly set of options, but it would naturally permit one to use
>> >   e.g. {{{(key,value)=("filepath","/home")}}} and
>> >   {{{(key,value)=("device","gpu0")}}}).  We could implement the new set of
>> >   options using the existing key if we standardize {{{MPI_INFO_NULL}}} to
>> >   mean "shared memory", but that isn't particularly appealing.
>> >
>> >   In addition to {{{MPI_COMM_TYPE_FILEPATH}}}, which is illustrated below, I
>> >   think that {{{MPI_COMM_TYPE_DEVICE}}} is useful to standardize as a key
>> >   without specifying the possible options that the {{{MPI_Info}}} can take,
>> >   with the exception of noting that, if this key is used with
>> >   {{{MPI_INFO_NULL}}}, the result is always {{{MPI_COMM_NULL}}}.  Devices
>> >   that implementations could define include coprocessors (aka accelerators
>> >   or GPUs), NICs (e.g. in the case of multi-rail IB systems), and any number
>> >   of other machine-specific cases.  The purpose of standardizing the key is
>> >   to encourage implementations to take the most portable route when trying
>> >   to implement this type of feature, thus confining all of the non-portable
>> >   aspects to the definition of the {{{(key,value)}}} pair in the info
>> >   object.
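>> >
>> >   A sketch of how this might look (the {{{("device","gpu0")}}} pair is
>> >   only an example of an implementation-defined value, and
>> >   {{{MPI_COMM_TYPE_DEVICE}}} is the key proposed here, not an existing
>> >   constant):
>> >   {{{
>> >       MPI_Comm gpucomm;
>> >       MPI_Info info;
>> >       MPI_Info_create( &info );
>> >       MPI_Info_set( info, (char*)"device", (char*)"gpu0" );
>> >       MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_DEVICE, 0,
>> >                            info, &gpucomm );
>> >       /* gpucomm contains the processes that share gpu0, or is
>> >          MPI_COMM_NULL if the implementation cannot determine this */
>> >       if ( gpucomm != MPI_COMM_NULL ) MPI_Comm_free( &gpucomm );
>> >       MPI_Info_free( &info );
>> >   }}}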
>> >
>> >   I have also considered {{{MPI_COMM_TYPE_LOCALE}}}, which would also allow
>> >   the implementation to define info keys to specify, e.g., subcommunicators
>> >   sharing an I/O node on Cray or Blue Gene, but this could just as easily be
>> >   implemented using the {{{MPI_COMM_TYPE_DEVICE}}} key.  Furthermore, since
>> >   devices can be specified as a filepath (they are often associated with a
>> >   {{{/dev/something}}} on Linux systems), there is no compelling reason to
>> >   add more than one key.  This is, of course, the reason why overloading
>> >   {{{MPI_COMM_TYPE_SHARED}}} via info objects seems like the most
>> >   straightforward choice, except with regard to backwards compatibility.
>> >
>> >   What do people think?  Can we augment {{{MPI_COMM_TYPE_SHARED}}} with info
>> >   keys and leave the case of {{{MPI_INFO_NULL}}} to mean shared memory, do
>> >   we break backwards compatibility, or do we add a new key that has the
>> >   desired catch-all properties?
>> >
>> >   Here is an example program illustrating the use of the proposed
>> >   functionality for {{{MPI_COMM_TYPE_FILEPATH}}}:
>> >   {{{
>> >
>> >   #include <stdio.h>
>> >   #include <stdlib.h>
>> >   #include <string.h>
>> >   #include <mpi.h>
>> >
>> >   int main( int argc, char *argv[] )
>> >   {
>> >       int rank;
>> >       MPI_Info i1, i2, i3, i4, i5;
>> >       MPI_Comm c0, c1, c2, c3, c4, c5;
>> >       int result, happy=0;
>> >
>> >       MPI_Init(&argc,&argv);
>> >       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >
>> >       MPI_Info_create( &i1 );
>> >       MPI_Info_create( &i2 );
>> >       MPI_Info_create( &i3 );
>> >       MPI_Info_create( &i4 );
>> >       MPI_Info_create( &i5 );
>> >
>> >       MPI_Info_set( i1, (char*)"path", (char*)"/global-shared-fs" );
>> >       MPI_Info_set( i2, (char*)"path", (char*)"/proc-local-fs" );
>> >       MPI_Info_set( i3, (char*)"path", (char*)"/node-local-fs" );
>> >       MPI_Info_set( i4, (char*)"path", (char*)"/proc-zero-fs" );
>> >       MPI_Info_set( i5, (char*)"path", (char*)"/dev/rand" );
>> >
>> >       MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
>> >                            MPI_INFO_NULL, &c0 );
>> >
>> >       MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_FILEPATH, 0, i1, &c1 );
>> >       MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_FILEPATH, 0, i2, &c2 );
>> >       MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_FILEPATH, 0, i3, &c3 );
>> >       MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_FILEPATH, 0, i4, &c4 );
>> >       MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_FILEPATH, 0, i5, &c5 );
>> >
>> >       /* a globally visible shared filesystem should result in a comm
>> >          that is equivalent to world */
>> >       MPI_Comm_compare( MPI_COMM_WORLD, c1, &result );
>> >       if ( result == MPI_CONGRUENT) happy++;
>> >
>> >       /* a process-local filesystem should result in MPI_COMM_SELF */
>> >       MPI_Comm_compare( MPI_COMM_SELF, c2, &result );
>> >       if ( result == MPI_CONGRUENT) happy++;
>> >
>> >       /* a filesystem shared within the node is likely to result in a
>> >          communicator equivalent to the one that supports shared memory,
>> >          provided shared memory is available */
>> >       MPI_Comm_compare( c0, c3, &result );
>> >       if ( result == MPI_CONGRUENT) happy++;
>> >
>> >       /* the /proc-zero-fs is only visible from rank 0 of world... */
>> >       if (rank==0) {
>> >          MPI_Comm_compare( MPI_COMM_SELF, c4, &result );
>> >          if ( result == MPI_CONGRUENT) happy++;
>> >       } else {
>> >          if ( c4 == MPI_COMM_NULL) happy++;
>> >       }
>> >
>> >       /* the sharable nature of /dev/rand is probably a meaningless
>> >          concept, so we expect the implementation to return MPI_COMM_NULL
>> >          for c5 */
>> >       if ( c5 == MPI_COMM_NULL) happy++;
>> >
>> >       MPI_Comm_free( &c0 );
>> >       MPI_Comm_free( &c1 );
>> >       MPI_Comm_free( &c2 );
>> >       MPI_Comm_free( &c3 );
>> >       /* c4 and c5 may be MPI_COMM_NULL, which must not be freed */
>> >       if ( c4 != MPI_COMM_NULL ) MPI_Comm_free( &c4 );
>> >       if ( c5 != MPI_COMM_NULL ) MPI_Comm_free( &c5 );
>> >
>> >       MPI_Info_free( &i1 );
>> >       MPI_Info_free( &i2 );
>> >       MPI_Info_free( &i3 );
>> >       MPI_Info_free( &i4 );
>> >       MPI_Info_free( &i5 );
>> >
>> >       printf( "rank %d: happy = %d of 5 checks\n", rank, happy );
>> >
>> >       MPI_Finalize( );
>> >       return 0;
>> >   }
>> >   }}}
>> >
>> > --
>> > Ticket URL: <https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/372>
>> > MPI Forum <https://svn.mpi-forum.org/> MPI Forum
>> >
>> >
>> _______________________________________________
>> mpi-forum mailing list
>> mpi-forum at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum
>
> _______________________________________________
> mpi-forum mailing list
> mpi-forum at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


