[Mpi3-ft] mpi3-ft Digest, Vol 21, Issue 4

Barrett, Richard F. rbarrett at ornl.gov
Tue Oct 13 11:06:26 CDT 2009


Folks,
I am furiously working the text :) and would appreciate it if anyone else wanting to do so either wait for my commit or directly coordinate with me in order to avoid potential conflicts in the repository. My goal is to commit sometime Friday, regardless of how well I have been able to address the list of issues thus far discussed. After committing I will send out a summary and listing of my changes.
Richard


On 10/12/09 4:35 PM, "mpi3-ft-request at lists.mpi-forum.org" <mpi3-ft-request at lists.mpi-forum.org> wrote:


Today's Topics:

   1. Re: mpi3-ft Digest, Vol 20, Issue 2 (Solt, David George)
   2. Re: mpi3-ft Digest, Vol 20, Issue 2 (Greg Bronevetsky)


----------------------------------------------------------------------

Message: 1
Date: Mon, 12 Oct 2009 16:14:42 +0000
From: "Solt, David George" <david.solt at hp.com>
Subject: Re: [Mpi3-ft] mpi3-ft Digest, Vol 20, Issue 2
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
        Group"  <mpi3-ft at lists.mpi-forum.org>
Message-ID:
        <F14CDB5DAAD62742A2D7F497F9DA7E05318D28301B at GVW0547EXC.americas.hpqcorp.net>

Content-Type: text/plain; charset="us-ascii"

Hi all,

I'm terribly sorry I was not at the last meeting or responding to e-mail. I was on vacation and forgot to turn on my "out of office"; I guess I was a bit too anxious for my holiday. Anyhow, I'm excited to look through your responses and I'm glad there was some value to my feedback. I will take a look at this during the week.

Thanks,
Dave

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Barrett, Richard F.
Sent: Friday, October 09, 2009 10:13 AM
To: mpi3-ft at lists.mpi-forum.org
Subject: Re: [Mpi3-ft] mpi3-ft Digest, Vol 20, Issue 2

Folks,
   A while back David Solt sent some questions concerning the FT API to the working group mailing list, and I've interleaved my thoughts on the topics. It seems I have more questions than answers, so I would appreciate your feedback prior to sending anything out to the list. David, thanks greatly for your careful reading of the draft. This is invaluable feedback, so much so that it's taken a long time to address, though still incompletely.

General:

 1.  Discussion occurred during the recent telecon, though I don't believe I fully captured that discussion.
 2.  I will attempt a stronger discussion of rejoin, restore, recover (aka R^3). It may be wordier than a spec calls for, so it may be pulled out later and included in a more discussion-like paper, but we will still need a crisp discussion in the spec.
 3.  I will attempt to address some of the issues in the draft spec, esp. the "Advice to" parts discussed below. However, your input is greatly needed, now or once I put out a strawman.
 4.  I've also mentioned some specific places I will modify the spec.
 5.  I've also asked for anyone's input in many areas. And of course welcome it in any area.
 6.  And I've also asked for clarification from David.

Please keep in mind that I am from the user side, so please understand ignorant statements/questions regarding obvious system (and other) issues. I hope this is only to the extent that it encourages you to correct rather than ignore :) And finally, I've been up 72 hours with about 6 dedicated to sleep, and since I am not nearly as resilient as Rich Graham, I am likely incoherent. (Obviously I lean heavily on excuses:)

Richard
> -----Original Message-----
> From: Solt, David George
> Sent: Tuesday, September 01, 2009 10:36 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] Questions about the the Proposed API document
>
> Hi all,
>
> Admittedly, I have missed a lot of discussion during recent months, so feel
> free to ignore questions that have already been answered.
>
> Thanks,
> Dave
>
>
> 0)  MPI_ERROR_REPORTING_FN
>
>         a)  How is this different from using MPI_ERRORHANDLER_CREATE and
> MPI_COMM_ERRHANDLER_SET

There appears to be (close-to-) consensus that the current error handling functionality does, or should be able to, provide the mechanisms we are after. It seems the intent may not be specific to our work, but perhaps could be "modified" to allow for the FT functionality we've been discussing. For example, the "Advice to Implementers" in the MPI error handling section (p277) talks about FT: "A good quality implementation will, to the greatest possible extent, circumscribe the impact of an error, so that normal processing can continue after an error handler was invoked."

However, as currently written the MPI error handler seems designed to eliminate user participation in the process: "The purpose of these error handlers is to allow a user to issue user-defined error messages and to take action unrelated to MPI (such as flushing buffers) before a program exits." (p277). OTOH, there is text that seems to muddle that notion. Anyway, the apparent goal of what we've written thus far is to allow the user to supplement error messages and handling internal to an FT MPI implementation. That is, among other things, the user has already attached FT attributes to communicator(s).

>
> 1)  MPI_Comm_validate.
>
>         a) Is it required that all callers return the same value for
> failed_process_count and failed_ranks?

This is collective, so all participants should receive the same values. That said, this points to a particular issue we need to address/discuss, i.e., additional failures that occur after one process has returned from the call while other processes have not. Anyone have a paragraph or so for inclusion in an advice to users? And probably implementers as well?
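
The issue raised here can be sketched with a toy model. This is plain Python, not an MPI binding; `ToyComm`, `fail`, and `validate` are hypothetical names, and the "snapshot at the start of the call" behaviour is an assumption taken from the discussion in question 1c rather than from the draft spec. It only illustrates that a result shared by all callers can already be stale when a later failure occurs.

```python
class ToyComm:
    """Toy stand-in for a communicator; not an MPI API."""

    def __init__(self, size):
        self.size = size
        self.failed = set()

    def fail(self, rank):
        # Record that a rank has failed.
        self.failed.add(rank)

    def validate(self):
        # Hypothetical collective: all callers of one validate round
        # receive the same snapshot of the failures known at that point.
        snapshot = sorted(self.failed)
        return len(snapshot), snapshot


comm = ToyComm(size=4)
comm.fail(2)
first_round = comm.validate()   # what every survivor sees in this round
comm.fail(3)                    # failure after some rank already returned
second_round = comm.validate()  # only a later round reports rank 3
```

In this model every caller in the first round agrees on one failed rank, yet rank 3 is already dead by the time some of them act on that answer.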

>         b) If ranks are partitioned into two groups A and B such that all
> ranks in A can communicate and all ranks in B can communicate, but a rank in A
> cannot communicate with a rank in B, what should failed_ranks return for a
> rank in A?  a rank in B?

Does partition mean subcommunicators? If so, we're covered. If not, we're still covered :) If a rank cannot communicate with another, there is a failure, and it should be returned as such.

>         c) I was told that MPI_Comm_validate uses a phased system such that
> the result of the call is based on the callers' states prior to the call or at
> the start of the call but with the understanding that the results are not
> guaranteed to be accurate at the return of the call.   Is this accurate?  If
> so, can you show an example of where this call would either simplify an
> application code or allow for a recovery case that would not be possible
> without it?

I believe this is correct, but need an example. Anyone?

> 2) MPI_Comm_Ivalidate.
>
>         a) Is there a way to wait on the resulting request in such a way that
> you can access the failed_process_count and failed_ranks data?

Oversight on our part - I will add the arrays so that the blocking and non-blocking APIs are analogous to, for example, MPI_Recv and MPI_Irecv.
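
A minimal sketch of the blocking/non-blocking pairing, again as a toy Python model rather than real MPI. The functions `validate` and `ivalidate` are hypothetical, and a `concurrent.futures.Future` stands in for an MPI request. The point is only that waiting on the request yields the same count and rank list that the blocking call returns directly.

```python
from concurrent.futures import ThreadPoolExecutor


def validate(failed):
    # Blocking form: returns (failed_process_count, failed_ranks).
    ranks = sorted(failed)
    return len(ranks), ranks


pool = ThreadPoolExecutor(max_workers=1)


def ivalidate(failed):
    # Non-blocking form: returns a Future that plays the role of the
    # request; the output values become available once it completes.
    return pool.submit(validate, set(failed))


request = ivalidate({1, 3})
# ... the caller could do unrelated work here ...
count, ranks = request.result()   # plays the role of MPI_Wait
pool.shutdown()
```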

> 3) MPI_Comm_restorable.
>
>         a) Does this return count=0 for a rank that is already a member of the
> original application launched MPI_Comm_world?
>         The following assumes that the answer to the above question is yes:
> In order for this to have the data necessary, a "replacement" process must be
> created through MPI_Comm_restore (i.e. users can't bring their own singletons
> into existence through a scheduler, etc.)

Sounds like a good idea. But are we missing something? Can you please elaborate?

> 4) MPI_Comm_rejoin.
>
>         a) Is this intended only to be used by a process that was not
> previously a member of comm_names and the caller replaces an exited rank that
> was a member of comm_names?

Yes.

>         b) Must MPI_Comm_rejoin and MPI_Comm_restore be used in matching way
> between existing ranks and newly created ranks?  If ranks A and B call
> MPI_Comm_restore, which creates a new replacement rank C, will the call to
> MPI_Comm_restore hang until MPI_Comm_rejoin is called by C?

Greg: Comm_rejoin called by any rank process - too generic. If I have a process created by comm_restore, can rejoin. Restore is about existence, rejoin about communicators. By default, restore gets a process into mpi_comm_world and comm_self. Otherwise, need to rejoin other comms. I'll try to make this clear in the text.

> 5) MPI_Comm_restore.
>
>         a) Does this create processes (I have assumed so in Q#4b above)?   If
> so, I suggest that we learn from the problem with MPI_Comm_spawn from MPI-2
> that interaction with a scheduler should be considered as we develop the API.

We need to better understand the comm_spawn issue. Current schedulers won't let you change size of resource allocated. Need "Advice to Users" in MPI_Comm_spawn section as well as FT in order to "alert" the user to the potential issues.

> 6) MPI_Comm_proc_gen/MPI_Comm_gen
>
>         a) The name MPI_Comm_proc_gen seems like it should be MPI_Proc_gen.
> I see that all other routines are prefixed with MPI_Comm_, but I think that
> they all genuinely involve aspects of a communicator except for this one.

Agreed, it is not as much about communicators as other routines. So where should this then go? "Process Creation and Mgmt" seems appropriate, but that section seems ignored by most, and besides, the statement is that  "If restored by MPI fault tolerance capabilities, the process generation is incremented by one. The initial generation is zero." To me then this would be useful for tracking the fault and resilience history of a particular run. From that it then makes sense to me to include it in the FT chapter since it is really a "utility" for that functionality. Comments?
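
The idea of using the generation to track fault history can be sketched as follows. This is a toy bookkeeping model, not a proposed API; `generation`, `history`, and `restore` are hypothetical names. The only rule taken from the draft text is that the initial generation is zero and each restoration increments it by one.

```python
from collections import defaultdict

generation = defaultdict(int)   # rank -> generation, zero until restored
history = []                    # (rank, new_generation) per restoration


def restore(rank):
    # Each restoration of a rank bumps its generation by one.
    generation[rank] += 1
    history.append((rank, generation[rank]))


restore(2)
restore(2)
restore(5)
never_failed = generation[0]
```

After this run the history reads as a record of which ranks failed and how often, which is the "utility" role described above.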

>         b) Ranks 0,1,2 are part of a restorable communicator C.  Rank 2 dies.
> Rank 0 calls MPI_Comm_restore.   Is rank 1 obligated to make any calls before
> using communicator C successfully?   What will MPI_Comm_proc_gen return for
> rank 0? 1?  2?   What will MPI_Comm_gen return for rank 0? 1? 2?

I believe the discussion immediately above addresses the latter two questions. But it does point to the need to strengthen the text (which Rainer calls for in the discussion text), which I'll try to do. Perhaps enough to say that this is a local call? Regarding the first question, this is part of the, should I say, "race condition" issue discussed above? The communicator is "broken", which rank 1 will be informed of when it makes a call to the communicator. So my question then: MPI_Comm_restore is collective, and therefore all processes in the communicator must call it before any can return. What if a rank rarely (legal) if ever (bad code) actually contacts that rank? But wait, that's certainly the situation for MPI_COMM_WORLD.
>
> 7) General question:
>
>         a) If rank x fails to communicate using point-to-point communication
> to rank y over communicator C, is it guaranteed that any collective call made
> by rank x or y on communicator C will immediately fail (even if the path
> between x and y is not used for the collective)?  (or is it up to the
> implementation)

Sounds like an undefined situation (i.e., a rank is a member of a communicator across which it doesn't communicate). Implementation dependent behavior, then? Which calls for an "Advice to U/I".

> Some more questions for us to think about.  It is quite possible that I have
> some fundamental flaws in my thinking that make some, many or all of these
> questions invalid.  So, I ask that if anyone sees a basic fallacy in my view
> of how these calls are intended to work that you point that out to me first
> and I can review my questions and see if there are still issues that do not
> make sense to me.

My view is that any confusion you might have is certainly from an educated perspective, so must be addressed/clarified.

>
> 8) If MPI_Comm_rejoin passes MPI_COMM_WORLD, then does it really change its
> size of MPI_COMM_WORLD?

My understanding is that we don't want to "replace" ranks unless and until the user makes it so. So new ranks may append to the set of ranks, which included failed ones. But MPI_COMM_SIZE will return the number of active ranks. Did I say this correctly?

> 9) Why does MPI_Comm_restore take an array of ranks to restore?   Shouldn't
> the restored ranks be based on MPI_PROC_RESTORE_POLICY? Or maybe a better way
> to ask: "What call does MPI_PROC_RESTORE_POLICY influence?"

Calls for a "Rationale" or "Advice" text imo. I don't believe we want an API that takes an array in one case and a scalar in another. Analogous to transmission of data. But again, am I understanding the question?

> 10) The API document says this regarding MPI_Comm_Irestore: "It is local in
> scope, and thus restores local communications (point-to-point, one-sided,
> data-type creation, etc.), but not collective communications."  If this is
> true, then how do you restore collective communications?   Can you then go on
> to collectively call MPI_Comm_restore_all?   If you do, would every rank need
> to specify that no new ranks are to be created since they have already been
> created by the earlier call to MPI_Comm_restore?  Also, I don't think it is
> *really* local in scope, if it was, there would be no reason to have a
> non-blocking version.

I expect much of this will be clarified with a stronger discussion of R^3 (mentioned above). But the non-blocking version, to my understanding, is designed to allow execution to continue while restoration takes place. For example, a Monte Carlo code can simply say, "that process is not there so I will process without its data. But I would like 'it' back if possible, so I'll check later."
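
The Monte Carlo case can be sketched with a toy calculation. The numbers and the `estimate` helper are made up for illustration, and nothing here calls MPI. The application forms an estimate from the ranks that are present and folds in the missing rank's samples if it comes back later.

```python
def estimate(partials):
    # partials maps rank -> (hits, samples) reported by that rank.
    hits = sum(h for h, _ in partials.values())
    samples = sum(n for _, n in partials.values())
    return hits / samples


# Rank 2 has failed, so only ranks 0, 1 and 3 have contributed.
partials = {0: (78, 100), 1: (80, 100), 3: (79, 100)}
early = estimate(partials)        # proceed without rank 2's data

# Later the application checks again and finds rank 2 restored.
partials[2] = (77, 100)
late = estimate(partials)         # now includes the restored rank
```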

>
> 11) MPI_COMM_REJOIN - It seems like the resulting communicator should be
> collective-capable if the calling process was created through a call to
> MPI_Comm_restore_all and not collective-capable if created through a call to
> MPI_Comm_restore?  If we go with that, there should be a way for the caller of
> MPI_Comm_rejoin to know the status of the communicator with respect to
> collectives.

Another that will become clear with a stronger discussion of R^3 (he says optimistically, or is that lackadaisically? :)

> 12) MPI_Comm_restore specifies ranks of communicators that should be restored.
> I assume it will block until the restored ranks call MPI_Comm_rejoin on those
> communicators?   (I say that because of the line "[MPI_Comm_restore ] is local
> in scope, and thus restores local communications...".  Restoring local
> communications to who?   I assume to the process created by MPI_Comm_restore?
> If it does not block, how does it know when it can safely talk to the restored
> ranks using any of the communicators they are in?   So, I assume it blocks.
> That seems to imply that the restored rank MUST call MPI_Comm_rejoin on the
> communicator referenced in its creation.   If rank 0 calls MPI_Comm_restore()
> and passes in Rank 3 of communicator FOO, then the restored process must call
> MPI_Comm_rejoin on communicator FOO.  But when the restored rank 3 calls
> MPI_Comm_recoverable, it could return several communicators and rank 3 has no
> idea that it MUST call MPI_Comm_rejoin on some, but is not required to call
> MPI_Comm_rejoin on others?

If it blocks, the process can't call MPI_Comm_rejoin. So do we have a contradiction in the spec? Here's to hoping that clarification of R^3 makes this question go away :)

> 13) What does MPI_COMM_WORLD look like before the new process calls
> MPI_COMM_REJOIN.  If the process was created through a call to
> MPI_Comm_restore that specified multiple ranks to be restored, are all of
> those ranks together in an MPI_COMM_WORLD until they call MPI_Comm_rejoin?  Is
> the MPI_Comm_rejoin call collective across all of those newly created
> processes or can they all call one at a time at their leisure?

A "broken" communicator will have invalid ranks, revealed to the calling process when it attempts to use the rank, and in the manner specified by the FT configuration (default or user specified). But again, the R^3 text will have to clear this up. Also, see the "Discussion" text in the spec for an additional issue. And regarding MPI_COMM_WORLD (and MPI_COMM_SELF), these are by default intended to be restored (from a local view). Hmmm, that statement confuses even me. So he punts back to R^3.

> 14) Is there anything we are proposing with MPI_Comm_rejoin/restore that
> cannot be accomplished with MPI_Comm_spawn, MPI_Comm_merge?  The only thing I
> can think of is that MPI_COMM_WORLD cannot be "fixed" using
> MPI_Comm_spawn/merge, but only because it is a constant.

My understanding is that we are attempting to bridge the gap between an "invisibly" fault tolerant implementation and a fully user controlled scheme, where that gap may be small (or non-existent?) to large.

> 15) ranks_to_restore struct is not defined in the version of API I have.

I've confused myself with my attempt to be clever. Is this the "Who's on first and what's on second" discussion (note Abbott and Costello reference), meaning I didn't understand either so gave placeholder names? Does your copy include this discussion?

> 16) MPI_Comm_restore seems to be based on the idea that some ranks have
> exited.   What if rank A cannot talk to rank B, but rank B still exists and
> can talk to rank C?  What does it mean to restore a rank in this case?  None
> of the ranks are gone, they are just having communication problems.   It seems
> like there should be some way to come up with a failure free set of ranks such
> that all the ranks in the set can communicate across all process pairs.

My understanding is that a process in a communicator that cannot communicate with all processes in that communicator indicates a fault. But who is at fault may be the appropriate question that you are asking. R^3 discussion? Your last sentence, however, seems to me to point to a new communicator the user would have to create.

> 17) Ranks 0, 1, & 2 are in Comm FOO. Rank 2 dies.   Rank 0 calls
> MPI_Comm_restore({FOO,2}) and can now communicate with 2 once again using
> point-to-point calls?   Is there a way that 1 can ever restore communication
> to the new rank 2?   I believe the only way is that all ranks (including the
> new rank 2) collectively call MPI_Comm_restore({})?  I'm not sure that is a
> problem, but I wanted to check my understanding of how these calls work.

The first answer that popped into my head contradicts something I said above (i.e., the communicator is broken). So R^3?

Just my pass at addressing a broad set of issues. Please, please, please, don't try to spare my feelings, just view this as the start of what should be a strong discussion.

Richard
--
  Richard Barrett
  Application Performance Tools group
  Computer Science and Mathematics Division
  Oak Ridge National Laboratory

  http://users.nccs.gov/~rbarrett

_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft



------------------------------

Message: 2
Date: Mon, 12 Oct 2009 12:30:10 -0700
From: Greg Bronevetsky <bronevetsky1 at llnl.gov>
Subject: Re: [Mpi3-ft] mpi3-ft Digest, Vol 20, Issue 2
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
        Group"  <mpi3-ft at lists.mpi-forum.org>,  "mpi3-ft at lists.mpi-forum.org"
        <mpi3-ft at lists.mpi-forum.org>
Message-ID: <806j7o$1mpg9g at nspiron-2.llnl.gov>
Content-Type: text/plain; charset="us-ascii"; format=flowed


> >         b) Ranks 0,1,2 are part of a restorable communicator C.  Rank 2 dies.
> > Rank 0 calls MPI_Comm_restore.   Is rank 1 obligated to make any calls before
> > using communicator C successfully?   What will MPI_Comm_proc_gen return for
> > rank 0? 1?  2?   What will MPI_Comm_gen return for rank 0? 1? 2?
>
>I believe the immediate above discussion addresses the latter two
>questions. But does point to the need to strengthen the text (which
>Rainer calls for in the discussion text), which I'll try to do.
>Perhaps enough to say that this is a local call? Regarding the first
>question, this is part of the, should I say, "race condition" issue
>discussed above? The communicator is "broken", which rank 1 will be
>informed of when it makes a call to the communicator. So my question
>then: MPI_Comm_restore is collective, and therefore all processes in
>the communicator must call it before any can return. What if a rank
>rarely (legal) if ever (bad code) actually contacts that rank? But
>wait, that's certainly the situation for MPI_COMM_WORLD.

The way I was thinking about this situation is that MPI_Comm_restore
is not collective and once a process knows that another has called
it, it does not have to call it on its own. However, at least one
process must call MPI_Comm_restore in order for it to be restored and
MPI_Comm_restore can be called any number of times by any number of
processes with no ill effects.

The motivation behind this API is that we don't want to force all
members of a communicator to be involved in recovery (definitely a
bad idea for MPI_COMM_WORLD) but we need a way to control the fact
that multiple processes may try to call MPI_Comm_restore
simultaneously. The intuition is that process restoration is an
idempotent operation for each failure that can be issued any number
of times with the same effect. However, if the rank suffers from
multiple failures, this API may cause application processes to become
confused about the restored process' generation. Fortunately, the
application has two ways to avoid getting confused:
- coordinate the calls to MPI_Comm_restore on its own
- have the restored process query its own generation and send it to
others when this information is needed
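
A minimal sketch of the semantics described above, as a toy Python model and not the proposed API. `ToyRuntime` and its methods are hypothetical. It assumes that a restore request acts only when the target rank is currently failed, which is one way to get the "idempotent per failure" behaviour, and it shows how a second failure leaves another rank holding a stale generation.

```python
class ToyRuntime:
    """Toy model of per-failure idempotent restoration."""

    def __init__(self, size):
        self.alive = [True] * size
        self.generation = [0] * size

    def fail(self, rank):
        self.alive[rank] = False

    def restore(self, rank):
        # Only the first request after a failure respawns the rank;
        # repeated requests for the same failure change nothing.
        if not self.alive[rank]:
            self.alive[rank] = True
            self.generation[rank] += 1


rt = ToyRuntime(3)
rt.fail(2)
rt.restore(2)                        # issued by rank 0
rt.restore(2)                        # issued by rank 1: no further effect
seen_by_rank0 = rt.generation[2]     # rank 0 notes the generation it saw

rt.fail(2)                           # rank 2 fails a second time
rt.restore(2)                        # rank 1 reacts; rank 0 is unaware
reported_by_rank2 = rt.generation[2] # what rank 2 would send on request
```

The mismatch between `seen_by_rank0` and `reported_by_rank2` is the confusion mentioned above, and having the restored process report its own generation is the second of the two remedies.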

> > 8) If MPI_Comm_rejoin passes MPI_COMM_WORLD, then does it really change its
> > size of MPI_COMM_WORLD?
>
>My understanding is that we don't want to "replace" ranks unless and
>until the user makes it so. So new ranks may append to the set of
>ranks, which included failed ones. But MPI_COMM_SIZE will return the
>number of active ranks. Did I say this correctly?
I'm confused. I thought MPI_Comm_size would return the number of
active and inactive ranks. It would be confusing otherwise since you
could get 5 active ranks but a rank space that ranges from 0 to 9.
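
The confusion can be made concrete with a small toy example. This is plain Python with a made-up liveness table, not MPI code. It shows what goes wrong if an application loops over `range(size)` when size counts only the active ranks.

```python
# Made-up liveness table for a communicator with rank space 0..9.
alive = [True, False, True, False, True, False, True, False, False, True]

rank_space = len(alive)                              # active + inactive
active = [r for r, up in enumerate(alive) if up]     # the 5 live ranks

# If size meant "number of active ranks", a loop over range(size)
# would visit ranks 0..4 only.
visited = list(range(len(active)))
wrongly_contacted = [r for r in visited if not alive[r]]
never_contacted = [r for r in active if r not in visited]
```

With size defined as the full rank space, the same loop would at least reach every live rank, and failed ranks would be reported through the normal failure path.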

> > 10) The API document says this regarding MPI_Comm_Irestore: "It is local in
> > scope, and thus restores local communications (point-to-point, one-sided,
> > data-type creation, etc.), but not collective communications."  If this is
> > true, then how do you restore collective communications?   Can you then go on
> > to collectively call MPI_Comm_restore_all?   If you do, would every rank need
> > to specify that no new ranks are to be created since they have already been
> > created by the earlier call to MPI_Comm_restore?  Also, I don't think it is
> > *really* local in scope, if it was, there would be no reason to have a
> > non-blocking version.
>
>I expect much of this will be clarified with a stronger discussion
>of R^3 (mentioned above). But the non-blocking version, to my
>understanding, is designed to allow execution to continue while
>restoration takes place. For example, a Monte Carlo code can simply
>say, "that process is not there so I will process without its data.
>But I would like 'it' back if possible, so I'll check later."
Isn't this what MPI_Comm_recover_collective() is for?


> > 12) MPI_Comm_restore specifies ranks of communicators that should be restored.
> > I assume it will block until the restored ranks call MPI_Comm_rejoin on those
> > communicators?   (I say that because of the line "[MPI_Comm_restore ] is local
> > in scope, and thus restores local communications...".  Restoring local
> > communications to who?   I assume to the process created by MPI_Comm_restore?
> > If it does not block, how does it know when it can safely talk to the restored
> > ranks using any of the communicators they are in?   So, I assume it blocks.
> > That seems to imply that the restored rank MUST call MPI_Comm_rejoin on the
> > communicator referenced in its creation.   If rank 0 calls MPI_Comm_restore()
> > and passes in Rank 3 of communicator FOO, then the restored process must call
> > MPI_Comm_rejoin on communicator FOO.  But when the restored rank 3 calls
> > MPI_Comm_recoverable, it could return several communicators and rank 3 has no
> > idea that it MUST call MPI_Comm_rejoin on some, but is not required to call
> > MPI_Comm_rejoin on others?
>
>If it blocks, the process can't call MPI_Comm_rejoin. So do we have
>a contradiction in the spec? Here's to hoping that clarification of
>R^3 makes this question go away :)
I think that MPI_Comm_restore should return at the point where
subsequent messages sent to the failed process will not cause a
failure unless that process fails again. In other words, it should be
like MPI_Send which doesn't guarantee delivery but does guarantee
that the sender's job is finished. Similarly, MPI_Comm_restore should
submit a restoration request to the MPI runtime, which takes care of
everything else. Messages to the process being restored are buffered
until it is actually restored. If the application sends messages to
the restored process but it does not rejoin the communicator or
receive these messages, then the application will quickly run out of
buffer space. The only issue left is that the application has no way
to control the amount of time it takes to restore a process and thus,
the amount of buffer space. I suggest that if a process tries to send
a message to a process that is being restored and runs out of buffer
space, it should hang until the destination process is ready to
receive the messages.
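
The suggested behaviour can be sketched as a toy Python model. `RestoringPeer` is a hypothetical class, not part of any MPI implementation. To keep the example single-threaded, a `False` return from `send` marks the point where a real sender would block instead of returning.

```python
from collections import deque


class RestoringPeer:
    """Toy model of a rank that is being restored."""

    def __init__(self, capacity):
        self.capacity = capacity    # buffer space available to senders
        self.pending = deque()      # messages held until the rank is back
        self.ready = False
        self.delivered = []

    def send(self, msg):
        # True: message accepted. False: a real sender would block here.
        if self.ready:
            self.delivered.append(msg)
            return True
        if len(self.pending) < self.capacity:
            self.pending.append(msg)
            return True
        return False

    def rejoin(self):
        # The restored process is ready; flush buffered messages in order.
        self.ready = True
        while self.pending:
            self.delivered.append(self.pending.popleft())


peer = RestoringPeer(capacity=2)
results = [peer.send(m) for m in ("a", "b", "c")]   # third send finds no space
peer.rejoin()
retry = peer.send("c")                              # succeeds once restored
```

The sketch also shows the open issue noted above: the sender has no control over how long it stays in the "would block" state, since that depends on when the restored process becomes ready.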

> > 13) What does MPI_COMM_WORLD look like before the new process calls
> > MPI_COMM_REJOIN.  If the process was created through a call to
> > MPI_Comm_restore that specified multiple ranks to be restored, are all of
> > those ranks together in an MPI_COMM_WORLD until they call MPI_Comm_rejoin?  Is
> > the MPI_Comm_rejoin call collective across all of those newly created
> > processes or can they all call one at a time at their leisure?
>
>A "broken" communicator will have invalid ranks, revealed to the
>calling process when it attempts to use the rank, and in the manner
>specified by the FT configuration (default or user specified). But
>again, the R^3 text will have to clear this up. Also, see the
>"Discussion" text in the spec for an additional issue. And regarding
>MPI_COMM_WORLD (and MPI_COMM_SELF), these are by default intended to
>be restored (from a local view). Hmmm, that statement confuses even
>me. So he punts back to R^3.
That makes perfect sense. MPI calls MPI_Rejoin on MPI_COMM_WORLD and
MPI_COMM_SELF automatically and leaves the rest to the restored
process. If some members of these communicators have failed,
communication to them will also fail like normal.


> > 14) Is there anything we are proposing with MPI_Comm_rejoin/restore that
> > cannot be accomplished with MPI_Comm_spawn, MPI_Comm_merge?  The only thing I
> > can think of is that MPI_COMM_WORLD cannot be "fixed" using
> > MPI_Comm_spawn/merge, but only because it is a constant.
>
>My understanding is that we are attempting to bridge the gap between
>an "invisibly" fault tolerant implementation and a fully user
>controlled scheme, where that gap may be small (or non-existent?) to large.
I think the biggest improvement is the process respawning and
communicator rejoining functionality, which is much easier to use and
requires less synchronization than if the user did it on their own
using existing APIs.


> > 16) MPI_Comm_restore seems to be based on the idea that some ranks have
> > exited.   What if rank A cannot talk to rank B, but rank B still exists and
> > can talk to rank C?  What does it mean to restore a rank in this case?  None
> > of the ranks are gone, they are just having communication problems.   It seems
> > like there should be some way to come up with a failure free set of ranks such
> > that all the ranks in the set can communicate across all process pairs.
>
>My understanding is that a process in a communicator that cannot
>communicate with all processes in that communicator indicates a
>fault. But who is at fault may be the appropriate question that you
>are asking. R^3 discussion? Your last sentence, however, seems to me
>to point to a new communicator the user would have to create.
If some communication fault causes MPI on one process to conclude
that another has died and to inform the application of this fact, it
is responsible for maintaining this illusion by killing somebody off
when communication is restored.

> > 17) Ranks 0, 1, & 2 are in Comm FOO. Rank 2 dies.   Rank 0 calls
> > MPI_Comm_restore({FOO,2}) and can now communicate with 2 once again using
> > point-to-point calls?   Is there a way that 1 can ever restore communication
> > to the new rank 2?   I believe the only way is that all ranks (including the
> > new rank 2) collectively call MPI_Comm_restore({})?  I'm not sure that is a
> > problem, but I wanted to check my understanding of how these calls work.
>
>The first answer that popped into my head contradicts something
>I said above (i.e., the communicator is broken). So R^3?
>
>Just my pass at addressing a broad set of issues. Please, please,
>please, don't try to spare my feelings, just view this as the start
>of what should be a strong discussion.

I think my description of MPI_Comm_restore covers this case. If rank
1 is sure that rank 0 has already called MPI_Comm_restore(2), then it
can safely communicate. If not, then it has to call
MPI_Comm_restore(2) on its own.

Greg Bronevetsky
Computer Scientist
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com



------------------------------



End of mpi3-ft Digest, Vol 21, Issue 4
**************************************





