[Mpi3-ft] WG call on tuesday aug. 9, 3pm est

George Bosilca bosilca at icl.utk.edu
Mon Jul 15 08:28:56 CDT 2013


Jim, let me try to articulate a comprehensible answer without going too deep into technicalities.

On Jul 10, 2013, at 16:07 , Jim Dinan <james.dinan at gmail.com> wrote:

> Hi George,
> 
>> Re: comm_create and friends -- I'd be interested in hearing the result of the F2F discussion.  I've been the person complaining about this, so if it would be helpful for me to join you guys over the phone next Friday, let me know.
> 
> I did not have the opportunity to hear your complaint about this. Can you please summarize it here on the mailing list to make sure that 1) we are all at the same level of understanding; 2) the complaint reaches the whole interested audience; and 3) we have the opportunity to address it as accurately as possible.
> 
> I'm concerned that the current spec is missing a few features that are needed for the usage model where users continue to use a communicator with "holes" in it.  In order for this model to work well, users must be able to query which processes are known to have failed in a given communicator.

The proposed FT framework was designed to address a wide spectrum of possible approaches by covering two extreme cases: on one side, where global knowledge and global actions are required, and on the other side, where only local knowledge and local actions are required. It also provides building blocks for all the intermediate approaches.

Looking specifically at your concern, it seems that you want to continue to use a communicator with holes while avoiding the revoke, but expect to need global knowledge about the failures. Thus you propose a special call, GET_ALL_FAILED, that provides the list of failed processes. It is an interesting idea, but unfortunately it has pitfalls.

- To be able to use the list of failed processes to create new communicators with COMM_CREATE_GROUP, this list must be identical on all processes. Therefore the call to GET_ALL_FAILED is similar to an agreement (it is collective and must carry a consensus meaning). The obvious question is then: how do you synchronize all the living processes so that they call this blocking function together? This is why you need an atomic broadcast (guaranteed delivery despite failures) that can interrupt the normal behavior of the application, and this is typically the goal of the revoke functionality. You could implement your own protocol of failure knowledge propagation inside your communication scheme, but imagine how much your code would be impacted by this: at any point where you communicate, you must be able to receive either a normal message of your application or a notification message.
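
To make the consistency requirement concrete, here is a small plain-Python simulation (no MPI involved; every name in it is hypothetical and chosen only for illustration). Each surviving rank holds only the failures it has observed locally. Building a group from those local views gives different memberships on different ranks, whereas an agreement step, modeled here simply as the union of all survivors' views, hands every rank the same list. It is a sketch of the argument, not of how an MPI library would implement it.

```python
def group_from_view(world_size, failed_view):
    """Ranks a process would put in its new group, given its own view of failures."""
    return tuple(r for r in range(world_size) if r not in failed_view)


def agree_on_failures(local_views):
    """Toy stand-in for an agreement: every survivor ends up with the union of all views."""
    agreed = set()
    for view in local_views.values():
        agreed |= view
    return {rank: set(agreed) for rank in local_views}


WORLD_SIZE = 6
# Hypothetical scenario: ranks 2 and 4 have died; each survivor only knows
# about the failures it ran into while communicating.
local_views = {0: {2}, 1: set(), 3: {2, 4}, 5: {4}}

groups_local = {rank: group_from_view(WORLD_SIZE, view)
                for rank, view in local_views.items()}
agreed_views = agree_on_failures(local_views)
groups_agreed = {rank: group_from_view(WORLD_SIZE, view)
                 for rank, view in agreed_views.items()}

print("distinct groups from local views:", len(set(groups_local.values())))
print("distinct groups after agreement: ", len(set(groups_agreed.values())))
```

With purely local knowledge the survivors disagree on who belongs to the new group, so a collective creation over that group could not match up; after the agreement step there is a single group shared by everyone.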

>> without the revoke functionality (the knowledge about the fault is only propagated to processes that directly communicate with the dead nodes)? Do you really intend to implement your own protocol of failure knowledge propagation integrated in the application?

>  Given the current interface, users are not able to query the set of failed processes without creating a new communicator and translating ranks, via  MPI_Comm_shrink.  This requires the user to first revoke the parent communicator, which is something we want to avoid.

Again, the revoke is an optional step. If you have a special way to reach a consensus among all the processes still alive about the list of dead processes, you just have to use it. However, for the sake of performance and portability, we strongly believe that MPI should provide such functionality for the users who, unlike you, do not want to implement their own consensus.

> Along these same lines, we should also ensure that MPI_Comm_create_group will work as expected when a communicator with holes is used, provided the output group excludes failed processes (i.e., the operation is not collective over any failed processes).

I fail to see your concern here, as I can't imagine which part of the current proposal prevents you from using such an approach. So from a technical point of view this approach should work in the context of the current proposal.

Now, from a performance point of view, the story is slightly different. In the case of MPI_Comm_shrink, highly optimized implementations have the opportunity to merge all the stages of the operation (consensus over the dead processes and creation of a new communicator) into a single step. Your approach, in contrast, will have to validate (via an agreement) that the newly created communicator is indeed valid (i.e., no new failure has been discovered during the communicator creation). And to do so, you will need a communicator without holes, because agree will not provide you with appropriate information on communicators with holes.
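
The extra validation cost can be illustrated with a toy model in plain Python (again not MPI code; the function name and the failure schedule are assumptions made up for this sketch). Each round builds a group from the currently known failures and then runs a validation step, standing in for the agreement, that checks whether any member died during the creation. A failure during creation forces another full round. The model does not capture the point about agree needing a communicator without holes.

```python
def create_then_validate(world_size, known_failed, failures_during_creation):
    """Repeat 'build group, then agree it is still valid' until a round is clean.

    failures_during_creation is a hypothetical schedule: entry i holds the
    ranks that die while round i is creating its communicator.
    Returns (members, number_of_rounds).
    """
    failed = set(known_failed)
    rounds = 0
    while True:
        members = [r for r in range(world_size) if r not in failed]
        if rounds < len(failures_during_creation):
            new_failures = set(failures_during_creation[rounds])
        else:
            new_failures = set()
        rounds += 1
        # Validation step, modeled as an agreement on
        # "did any member of the new group die in the meantime?"
        newly_dead = set(members) & new_failures
        if not newly_dead:
            return members, rounds
        failed |= newly_dead


# Rank 2 is already known dead; rank 4 dies during the first creation attempt.
members, rounds = create_then_validate(6, {2}, [{4}])
print(members, rounds)
```

In this model every attempt costs one creation plus one agreement, and each failure that lands during a creation adds another such pair, which is the overhead that a merged shrink-style operation has the opportunity to avoid.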

>  This might not require any text changes, unless we want to allow this operation on revoked communicators.

At this point I get confused as I was under the impression that your goal was to avoid revoking the communicator?

Anyway, we do not want the original meaning of COMM_CREATE_GROUP to be tainted by FT semantics, especially since comm_create_group is supposed to be super scalable. See above for why it serves little purpose anyway, unless it is to become an expensive agreement operation.

  George.

>> Re: Roadmap -- Before a reading, it might be helpful to give a brief presentation to the Forum again giving the high level ideas and justifications for each new addition to the FT proposal.  I think it's been long enough that people have forgotten the details and this might help them feel more comfortable that the proposal is complete and self-consistent.
> 
> There was at least one [more or less] "brief" presentation of the FT proposal at every meeting for the last two years. I would even emphasize the fact that over the last year no major modification of the proposal has been put forward, a fact that might indicate a certain level of completeness and self-consistency.
> 
> I'm just trying to convey the temperature in the room, as I felt it.  I think a 15 minute, very high level warm-up immediately before the reading on the usage models, big ideas, and conventions would go a long way toward prepping the audience.
> 
>  ~Jim.
