[Mpi3-ft] Re: WG call on tuesday aug. 9, 3pm est

Wesley Bland wbland at mcs.anl.gov
Wed Jul 17 12:59:50 CDT 2013


On July 16, 2013 at 6:25:46 PM, George Bosilca (bosilca at icl.utk.edu) wrote:

Jim,

On Jul 16, 2013, at 22:09 , Jim Dinan <james.dinan at gmail.com> wrote:

Hi George,

Thanks for the detailed response.  Let's set aside the MPI_Comm_create_group discussion.  You raised a good point about the difficulty in reaching consensus on the group argument that has not come up before.  If there is some mechanism for consensus, then Comm_create_group will just work.

We can hardly settle here. I was not trying to pinpoint the difficulty of using MPI_Comm_create_group to build a new communicator, I was gently trying to suggest the impossibility of doing so with only the MPI_Comm_create_group call.

Using MPI_Comm_agree to agree that the creation of the communicator based on the group (MPI_Comm_create_group) was successful requires you to have a __valid__ (potentially revoked) communicator to do so. You can't use the old communicator, as you will not be able to distinguish between the processes that failed before the MPI_Comm_create_group and those that failed after it. Similarly, you can't use the newly created communicator either, as it might be incomplete (you did not have an agreement on the group of dead processes in the first place).

However, as I mentioned before, there are scenarios where you can use this function if complemented with a shrink (that will in fact build both the consistent knowledge about the dead processes and a communicator to further agree on).

I still have some remaining concerns about supporting a usage model where the communicator has "holes" in it.

The group query routine that I was thinking of is a local query, which would return information on which processes I know to have failed.

Any particular reason the current get_acked() is not satisfying?

 An application won't be able to track failures that occur within libraries,

Within libraries?

so this query is necessary to find out what is locally known.  I assume that MPI will have to track this information anyway, in case you attempt to communicate with a failed process.

There is no such requirement from the MPI library in the current proposal. In many applications global knowledge is not required, so there is no reason to propagate such information and force the associated overhead on the application. In these applications the failure will eventually be discovered if a communication with the dead process is initiated.

Consensus.  Is it true that, right now, the only ways to reach consensus on the failed group are to revoke the communicator, shrink it, and then compare groups, or to implement your own consensus protocol on top of send/recv?  The concern I have is that the revoke method requires the application to destroy the old communicator and renumber the processes, which the programmer may not want to do.  The ranks could have some important meaning to the application.

As I hinted in my previous email, there are many other ways to work around this problem. You can use topologies, or you can use a shadow communicator that will deal with the revoke operations, while your "working" communicator (which is meant only for point-to-point communications) remains "unrevoked" (but eventually the failures will be acknowledged).

I am probably going to regret asking this, but is it possible to include an MPI_Comm_resume() function that re-activates a revoked communicator with holes in it?

For many reasons related to the complexities of distributed systems (lack of synchronization in the error detection, divergent views of the entire system from each process), this operation must have a consensus meaning. Thus, in terms of cost, it is similar to MPI_Comm_shrink (except for the reordering of the processes). It might provide some limited benefit for people who want to use this type of scenario. Now, if by doing the re-enable you expect that the communicator will behave as a fresh communicator and everything MPI-related (file, one-sided, collective) will just work on this communicator with holes … then we're talking about something so complex that I would not even dare consider it for inclusion in the standard.

I don't think that was the intent here. I think the intent was to essentially put the communicator into the same state as it was before the revoke, where the difference is that now all of the processes in the communicator know about the failure. This means that they can continue pt2pt, cannot use collectives, can use wildcards (if they've called FAILURE_ACK), and whatever else we say you can do after a failure. The use case we're trying to enable is to say that you can notify the other processes of a failure (global failure knowledge and the deadlock prevention that comes with it) without having to create a new communicator. Yes, it will be as expensive as revoking and shrinking the communicator, but the difference is that you don't have to shrink the communicator to make it work. If you want a communicator with holes, you use your existing communicator. If you want a communicator without holes, you shrink (or call COMM_CREATE_GROUP).
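The state described above can be written down as a small truth table. What follows is only a toy model in Python, not MPI code: `CommState`, its fields, and `resume` are hypothetical names invented for this sketch, and MPI_Comm_resume itself exists only as a suggestion in this thread. It assumes the rules are exactly the ones listed above (pt2pt continues, collectives do not, wildcards require an acknowledgement).

```python
from dataclasses import dataclass


@dataclass
class CommState:
    """Toy model of what one rank may do on a communicator (not an MPI API)."""
    has_failed_peers: bool = False
    revoked: bool = False
    failures_acked: bool = False

    def can_pt2pt(self) -> bool:
        # Point-to-point between live ranks works unless the communicator is revoked.
        return not self.revoked

    def can_collective(self) -> bool:
        # Collectives involve every member, so holes rule them out.
        return not self.revoked and not self.has_failed_peers

    def can_wildcard_recv(self) -> bool:
        # Wildcard receives are allowed again once the failures are acknowledged.
        return not self.revoked and (not self.has_failed_peers or self.failures_acked)


def resume(state: CommState) -> CommState:
    """Hypothetical resume: lift the revoke, keep the holes and the ack status."""
    return CommState(
        has_failed_peers=state.has_failed_peers,
        revoked=False,
        failures_acked=state.failures_acked,
    )


revoked_comm = CommState(has_failed_peers=True, revoked=True)
resumed_comm = resume(revoked_comm)
print(resumed_comm.can_pt2pt(), resumed_comm.can_collective(), resumed_comm.can_wildcard_recv())
```

In this model the resumed communicator keeps its holes, so it differs from a shrunk one only in that ranks are not renumbered and collectives stay unavailable.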



Wesley



  George.


 ~Jim.


On Mon, Jul 15, 2013 at 9:28 AM, George Bosilca <bosilca at icl.utk.edu> wrote:
Jim, let me try to articulate a comprehensible answer without going too deep into technicalities.

On Jul 10, 2013, at 16:07 , Jim Dinan <james.dinan at gmail.com> wrote:

Hi George,

Re: comm_create and friends -- I'd be interested in hearing the result of the F2F discussion.  I've been the person complaining about this, so if it would be helpful for me to join you guys over the phone next Friday, let me know.

I did not have the opportunity to hear your complaint about this. Can you please summarize it here on the mailing list to make sure 1) that we are all at the same level of understanding; 2) that the complaint reaches all the interested audience; and 3) that we have the opportunity to address it as accurately as possible.

I'm concerned that the current spec is missing a few features that are needed for the usage model where users continue to use a communicator with "holes" in it.  In order for this model to work well, users must be able to query which processes are known to have failed in a given communicator.

The proposed FT framework was designed to address a wide spectrum of possible approaches, by covering two extreme cases, on one side where global knowledge and global actions are required and on the other side where only local knowledge and local actions are required, while providing building blocks for all other intermediary approaches.

Looking specifically at your concern it seems that you want to continue to use a communicator with holes while avoiding the revoke, but expect to need global knowledge about the failures. Thus you propose to have a special call GET_ALL_FAILED that provides you the list of failed processes. Interesting idea but unfortunately it has pitfalls.

- In order to be able to use the list of failed processes to create new communicators using the COMM_CREATE_GROUP this list should be identical on all processes. Therefore the call to GET_ALL_FAILED is similar to an agreement (collective and must have a consensus meaning). Then the obvious question will be how do you synchronize all the living processes to call this blocking function together? This is why you need an atomic broadcast (guaranteed delivery despite failure) that can interrupt the normal behavior of the application. And this is typically the goal of the revoke functionality. You could implement your own protocol of failure knowledge propagation inside your communication scheme, but imagine how much your code will be impacted by this: at any point where you do a communication, you must be able to receive either a normal message of your application, or a notification message.

without the revoke functionality (the knowledge about the fault is only propagated to processes that directly communicate with the dead nodes)? Do you really intend to implement your own protocol of failure knowledge propagation integrated in the application?
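The point about the list needing to be identical everywhere can be illustrated with a toy simulation. This is plain Python, not MPI: `survivors_group` and `agree_on_failed` are hypothetical helpers made up for this sketch, and the "agreement" is modelled simply as the union of what every survivor knows, which is an assumption about what a consensus step would return.

```python
from typing import Dict, FrozenSet, Set


def survivors_group(world_size: int, known_failed: Set[int]) -> FrozenSet[int]:
    """Group a rank would hand to a create-group style call: everyone it believes alive."""
    return frozenset(r for r in range(world_size) if r not in known_failed)


def agree_on_failed(local_views: Dict[int, Set[int]]) -> Set[int]:
    """Stand-in for a consensus step: the union of all survivors' local knowledge."""
    agreed: Set[int] = set()
    for view in local_views.values():
        agreed |= view
    return agreed


WORLD_SIZE = 6
# Ranks 4 and 5 have failed; only ranks that tried to talk to them noticed.
local_views = {0: {4}, 1: set(), 2: {5}, 3: {4, 5}}

# Purely local queries give each survivor a different group.
local_groups = {rank: survivors_group(WORLD_SIZE, view) for rank, view in local_views.items()}
groups_consistent = len(set(local_groups.values())) == 1

# After an agreement, every survivor builds the same group.
agreed_failed = agree_on_failed(local_views)
agreed_groups = {rank: survivors_group(WORLD_SIZE, agreed_failed) for rank in local_views}
agreed_consistent = len(set(agreed_groups.values())) == 1

print("consistent from local views:", groups_consistent)
print("consistent after agreement:", agreed_consistent)
```

With only local knowledge, rank 1 would still include ranks 4 and 5 in its group, so the survivors would not even be calling the creation routine over the same set of processes.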


 Given the current interface, users are not able to query the set of failed processes without creating a new communicator and translating ranks, via  MPI_Comm_shrink.  This requires the user to first revoke the parent communicator, which is something we want to avoid.

Again, the revoke is an optional step. If you have a special way to reach a consensus among all the processes that are still alive about the list of dead processes, you just have to use it. However, for the sake of performance and portability, we strongly believe that MPI should provide such functionality for users who, unlike you, don't want to implement their own consensus.

Along these same lines, we should also ensure that MPI_Comm_create_group will work as expected when a communicator with holes is used, provided the output group excludes failed processes (i.e., the operation is not collective over any failed processes).

I fail to see your concern here, as I can't imagine which part of the current proposal prevents you from using such an approach. So, from a technical point of view, this approach should work in the context of the current proposal.

Now, from a performance point of view, the story is slightly different. While in the case of MPI_Comm_shrink highly optimized implementations have the opportunity to merge all the stages of the operation (consensus over the dead processes and creation of a new communicator) into a single step, your approach will have to validate (via an agreement) that the newly created communicator is indeed valid (i.e., no new failure has been discovered during the communicator creation). And to do so, you will have to have a communicator without holes, because agree will not provide you appropriate information on communicators with holes.
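The extra validation step can be sketched as follows. Again this is a toy model in Python rather than MPI code: `local_creation_ok`, `agree`, and `validated_create` are hypothetical names for this sketch, and the agreement is assumed to combine the survivors' flags with a logical AND.

```python
from typing import Dict, Set


def local_creation_ok(group: Set[int], newly_failed: Set[int]) -> bool:
    """One rank's local verdict: fine if no group member was seen failing meanwhile."""
    return group.isdisjoint(newly_failed)


def agree(flags: Dict[int, bool]) -> bool:
    """Stand-in for an agreement: logical AND over the flags of the reporting ranks."""
    return all(flags.values())


def validated_create(group: Set[int], observed: Dict[int, Set[int]]) -> bool:
    """observed maps each reporting rank to the new failures it noticed during creation."""
    flags = {rank: local_creation_ok(group, seen) for rank, seen in observed.items()}
    return agree(flags)


group = {0, 1, 2, 3}

# Nobody failed while the communicator was being built.
clean_run = validated_create(group, {0: set(), 1: set(), 2: set(), 3: set()})

# Rank 3 died during creation and only rank 1 noticed; the AND makes
# every survivor reach the same "creation failed" verdict.
faulty_run = validated_create(group, {0: set(), 1: {3}, 2: set()})

print("clean run valid:", clean_run)
print("faulty run valid:", faulty_run)
```

The sketch hides the hard part raised above: `agree` itself needs something reliable to run over, which in this discussion means a communicator without holes.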

 This might not require any text changes, unless we want to allow this operation on revoked communicators.

At this point I get confused as I was under the impression that your goal was to avoid revoking the communicator?

Anyway, we do not want the original meaning of COMM_CREATE_GROUP to be tainted by FT semantics, especially since it is supposed to be super scalable. See above for why it serves little purpose anyway, unless it becomes an expensive agreement operation.

  George.

Re: Roadmap -- Before a reading, it might be helpful to give a brief presentation to the Forum again giving the high level ideas and justifications for each new addition to the FT proposal.  I think it's been long enough that people have forgotten the details and this might help them feel more comfortable that the proposal is complete and self-consistent.

There was at least one [more or less] "brief" presentation of the FT proposal at every meeting for the last two years. I would even emphasize the fact that over the last year no major modification of the proposal has been put forward, a fact that might indicate a certain level of completeness and self-consistency.

I'm just trying to convey the temperature in the room, as I felt it.  I think a 15 minute, very high level warm-up immediately before the reading on the usage models, big ideas, and conventions would go a long way toward prepping the audience.

 ~Jim.
_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

