[Mpi3-ft] WG call on tuesday aug. 9, 3pm est

Tue Jul 16 15:09:53 CDT 2013

Hi George,

Thanks for the detailed response.  Let's set aside the
MPI_Comm_create_group discussion.  You raised a good point about the
difficulty in reaching consensus on the group argument that has not come up
before.  If there is some mechanism for consensus, then Comm_create_group
will just work.

I still have some remaining concerns about supporting a usage model where
the communicator has "holes" in it.

The group query routine that I was thinking of is a local query, which
would return information on which processes I know to have failed.  An
application won't be able to track failures that occur within libraries, so
this query is necessary to find out what is locally known.  I assume that
MPI will have to track this information anyway, in case you attempt to
communicate with a failed process.
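
To make the local query concrete, here is a rough sketch in plain Python. It is a toy model, not MPI code; the class name, method names, and the simulated library routine are all made up for illustration and do not come from the proposal. It only shows that a failure noticed inside a library call is still recorded locally, and that two processes can hold different answers.

```python
# Toy model only: plain Python, no MPI calls. All names are hypothetical.

class LocalFailureView:
    """What one process knows about failed peers in a communicator."""

    def __init__(self, my_rank, comm_size):
        self.my_rank = my_rank
        self.comm_size = comm_size
        self._known_failed = set()

    def record_failure(self, rank):
        # Called by the (simulated) MPI layer when communication with
        # `rank` reports a process failure, whether the call came from
        # the application or from inside a library.
        if not 0 <= rank < self.comm_size:
            raise ValueError("rank out of range")
        self._known_failed.add(rank)

    def known_failed(self):
        # Purely local query: no communication, so the answer may differ
        # from one process to the next.
        return sorted(self._known_failed)


def library_exchange(view, peer, dead_ranks):
    # Stand-in for a library routine that talks to `peer` internally.
    # The application never sees the error, but the local view records it.
    if peer in dead_ranks:
        view.record_failure(peer)
        return False
    return True


dead = {2}                      # ground truth, unknown to the processes
p0 = LocalFailureView(0, 4)
p1 = LocalFailureView(1, 4)

library_exchange(p0, 2, dead)   # rank 0 talks to rank 2 and notices
library_exchange(p1, 3, dead)   # rank 1 only talks to rank 3

print(p0.known_failed())        # [2]
print(p1.known_failed())        # []
```

Rank 0 and rank 1 disagree here, which is exactly why a local query alone is not enough to build a group that everyone agrees on.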

Consensus.  Is it true that, right now, the only ways to reach
consensus on the failed group are to revoke the communicator, shrink it, and
then compare groups, or to implement your own consensus protocol on top of
send/recv?  The concern I have is that the revoke method requires the
application to destroy the old communicator and renumber the processes,
which the programmer may not want to do.  The ranks could have some
important meaning to the application.  I am probably going to regret asking
this, but is it possible to include an MPI_Comm_resume() function that
re-activates a revoked communicator with holes in it?
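
A small sketch of the renumbering concern, again as a toy model in plain Python rather than MPI code. The function names are hypothetical, and the union-based "agreement" is only a stand-in for whatever consensus mechanism is actually used. It contrasts the dense renumbering a shrink-style operation produces with a model where survivors keep their original ranks.

```python
# Toy model only: plain Python, no MPI calls. All names are hypothetical.

def agree_on_failed(local_views):
    # Stand-in for a consensus step: every survivor ends up with the
    # union of what the survivors knew locally.
    agreed = set()
    for known in local_views.values():
        agreed |= known
    return agreed


def shrink(comm_size, failed):
    # Dense renumbering of survivors, as a shrink-style operation does.
    # Returns {old_rank: new_rank}.
    survivors = [r for r in range(comm_size) if r not in failed]
    return {old: new for new, old in enumerate(survivors)}


def keep_holes(comm_size, failed):
    # The "communicator with holes" model: survivors keep their ranks
    # and failed ranks map to None.
    return {r: (None if r in failed else r) for r in range(comm_size)}


# Local knowledge on the three survivors of a 4-process communicator
# in which rank 1 died; rank 2 has not noticed yet.
views = {0: {1}, 2: set(), 3: {1}}
failed = agree_on_failed(views)

print(sorted(failed))            # [1]
print(shrink(4, failed))         # {0: 0, 2: 1, 3: 2}
print(keep_holes(4, failed))     # {0: 0, 1: None, 2: 2, 3: 3}
```

After the shrink, old ranks 2 and 3 become 1 and 2, so any meaning the application attached to the rank numbers is lost; in the holes model they are unchanged.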

 ~Jim.

On Mon, Jul 15, 2013 at 9:28 AM, George Bosilca <bosilca at icl.utk.edu> wrote:

> Jim, let me try to articulate a comprehensible answer without going too
> deep into technicalities.
>
> On Jul 10, 2013, at 16:07 , Jim Dinan <james.dinan at gmail.com> wrote:
>
> Hi George,
>
> Re: comm_create and friends -- I'd be interested in hearing the result of
>> the F2F discussion.  I've been the person complaining about this, so if it
>> would be helpful for me to join you guys over the phone next Friday, let me
>> know.
>>
>>
>> I did not have the opportunity to hear your complaint about this. Can you
>> please summarize it here on the mailing list to make sure 1) that we are
>> all at the same level of understanding; 2) the complaint reaches the whole
>> interested audience; and 3) we have the opportunity to address it as
>> accurately as possible.
>>
>
> I'm concerned that the current spec is missing a few features that are
> needed for the usage model where users continue to use a communicator with
> "holes" in it.  In order for this model to work well, users must be able to
> query which processes are known to have failed in a given communicator.
>
>
> The proposed FT framework was designed to address a wide spectrum of
> possible approaches, by covering two extreme cases, on one side where
> global knowledge and global actions are required and on the other side
> where only local knowledge and local actions are required, while providing
> building blocks for all other intermediary approaches.
>
> Looking specifically at your concern it seems that you want to continue to
> use a communicator with holes while avoiding the revoke, but expect to need
> global knowledge about the failures. Thus you propose to have a special
> call GET_ALL_FAILED that provides you the list of failed processes.
> Interesting idea but unfortunately it has pitfalls.
>
> - In order to be able to use the list of failed processes to create new
> communicators using COMM_CREATE_GROUP, this list must be identical on
> all processes. Therefore the call to GET_ALL_FAILED is similar to an
> agreement (it is collective and must return a consensus result). Then the obvious
> question will be how do you synchronize all the living processes to call
> this blocking function together? This is why you need an atomic broadcast
> (guaranteed delivery despite failure) that can interrupt the normal
> behavior of the application. And this is typically the goal of the revoke
> functionality. You could implement your own protocol of failure knowledge
> propagation inside your communication scheme, but imagine how much your
> code will be impacted by this: at any point where you do a communication,
> you must be able to receive either a normal message of your application, or
> a notification message.
>
> Without the revoke functionality, knowledge about the fault is only
> propagated to processes that communicate directly with the dead nodes. Do
> you really intend to implement your own protocol of failure knowledge
> propagation integrated in the application?
>
>
>  Given the current interface, users are not able to query the set of
> failed processes without creating a new communicator and translating ranks,
> via  MPI_Comm_shrink.  This requires the user to first revoke the parent
> communicator, which is something we want to avoid.
>
>
> Again the revoke is an optional step. If you have a special way to reach
> consensus among all surviving processes about the list of dead processes,
> you just have to use it. However, for the sake of performance and
> portability we strongly believe that MPI should offer such functionality
> to users who, unlike you, don't want to implement their own consensus.
>
> Along these same lines, we should also ensure that MPI_Comm_create_group
> will work as expected when a communicator with holes is used, provided the
> output group excludes failed processes (i.e., the operation is not
> collective over any failed processes).
>
>
> I fail to see your concern here, as I can't see which part of the
> current proposal prevents you from using such an approach. So from a
> technical point of view this approach should work in the context of the
> current proposal.
>
> Now from a performance point of view the story is slightly different.
> While in the case of MPI_Comm_shrink highly optimized implementations have
> the opportunity to merge all the stages of the operation (consensus over
> the dead processes and creation of a new communicator) into a single step,
> your approach will have to validate (via an agreement) that the newly
> created communicator is indeed valid (i.e., no new failure was
> discovered during the communicator creation). And to do so, you will need
> a communicator without holes, because agree will not give you
> appropriate information on communicators with holes.
>
>  This might not require any text changes, unless we want to allow this
> operation on revoked communicators.
>
>
> At this point I get confused as I was under the impression that your goal
> was to avoid revoking the communicator?
>
> Anyway, we do not want the original meaning of COMM_CREATE_GROUP to be
> tainted by FT semantics. This is especially true because comm_create_group
> is supposed to be super scalable. See above for why it serves little
> purpose anyway, unless it becomes an expensive agreement operation.
>
>   George.
>
> Re: Roadmap -- Before a reading, it might be helpful to give a brief
>> presentation to the Forum again giving the high level ideas and
>> justifications for each new addition to the FT proposal.  I think it's been
>> long enough that people have forgotten the details and this might help them
>> feel more comfortable that the proposal is complete and self-consistent.
>>
>>
>> There was at least one [more or less] "brief" presentation of the FT
>> proposal at every meeting for the last two years. I would even emphasize
>> the fact that over the last year no major modification of the proposal has
>> been put forward, which might indicate a certain level of completeness
>> and self-consistency.
>>
>
> I'm just trying to convey the temperature in the room, as I felt it.  I
> think a 15 minute, very high level warm-up immediately before the reading
> on the usage models, big ideas, and conventions would go a long way toward
> prepping the audience.
>
>  ~Jim.
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft