[Mpi3-ft] Communicator Virtualization as a step forward
Greg Bronevetsky
bronevetsky1 at llnl.gov
Wed Feb 18 11:20:30 CST 2009
My point is that each MPI implementation will have to negotiate with
each checkpointer what this call will mean on each platform, meaning
that it is up to the implementation-specific call to do all that
magic. It doesn't save the application any real effort to call this
routine by its implementation-specific name rather than a unified
name given that the other two major problems are not solved:
- Providing a uniform way for checkpointers and MPI implementations
to interact (THE major reason for having this proposal in the first
place) and
- Support for more than the synch-and-stop checkpointing protocol,
which scales poorly with rising failure rates.
My point here is that covering one of the three major issues is not
good enough to modify the MPI specification and I don't think that
we'll ever be able to do more than that. We're pretty much agreed in
the working group that the first point above cannot be done and I
don't think that the second can be done without providing callbacks
to be called for each incoming/outgoing message.
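
To make the callback idea concrete, here is a rough sketch of what
per-message hooks might look like. Nothing in it is part of MPI or of
any existing checkpointer: ft_hooks, ft_register_hooks, ft_send and
ft_recv_deliver are hypothetical names, and the two ft_* message
functions are stand-ins for an implementation's internal send and
receive paths. The example checkpointer simply counts messages and
received bytes, which is the sort of bookkeeping a message-logging
protocol would need and synch-and-stop would not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: called once per message that crosses the
 * library boundary. Not part of MPI or any real checkpointer. */
typedef void (*ft_msg_hook)(int peer, int tag, const void *buf,
                            size_t len, void *state);

/* Hypothetical hook table a checkpointer would hand to the library. */
typedef struct {
    ft_msg_hook on_send;   /* called before a message leaves */
    ft_msg_hook on_recv;   /* called when a message is delivered */
    void *state;           /* checkpointer-private data */
} ft_hooks;

static ft_hooks g_hooks;   /* zero-initialized: no hooks installed */

void ft_register_hooks(ft_hooks h) { g_hooks = h; }

/* Stand-in for the library's outgoing path (assumption: a real
 * implementation would hand the buffer to its transport here). */
void ft_send(int dest, int tag, const void *buf, size_t len) {
    if (g_hooks.on_send)
        g_hooks.on_send(dest, tag, buf, len, g_hooks.state);
}

/* Stand-in for the library's incoming path. */
void ft_recv_deliver(int src, int tag, const void *buf, size_t len) {
    if (g_hooks.on_recv)
        g_hooks.on_recv(src, tag, buf, len, g_hooks.state);
}

/* Example checkpointer state for a message-logging style protocol. */
typedef struct {
    unsigned long sent;
    unsigned long received;
    size_t bytes_logged;
} log_state;

void log_send(int peer, int tag, const void *buf, size_t len,
              void *state) {
    log_state *s = state;
    (void)peer; (void)tag; (void)buf; (void)len;
    assert(s != NULL);
    s->sent++;                 /* sender only needs a count here */
}

void log_recv(int peer, int tag, const void *buf, size_t len,
              void *state) {
    log_state *s = state;
    (void)peer; (void)tag; (void)buf;
    assert(s != NULL);
    s->received++;
    s->bytes_logged += len;    /* a real logger would copy buf */
}
```

A checkpointer would fill in an ft_hooks with log_send, log_recv and a
pointer to its own log_state, then pass it to ft_register_hooks once at
startup. The cost is one indirect call on every message, which is part
of why this is a hard thing to put into the standard.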
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov
http://greg.bronevetsky.com
At 08:40 AM 2/18/2009, Supalov, Alexander wrote:
>Thanks. How will the checkpointer-specific call talk to the MPI?
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
>Sent: Wednesday, February 18, 2009 5:29 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
>Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
>
> >Thanks. How will you let the MPI know the checkpoint is coming, to
> >give it a fair chance to prepare for this and then recover after the
> >checkpoint? This is akin to the MPI_Finalize/MPI_Init in some sense,
> >midway thru the job, hence the analogy.
>
>Just use the checkpointer-specific call. The call is going to have
>checkpointer-specific semantics, so why not give it a
>checkpointer-specific name? I understand that there is some use to
>allowing applications to use the same name across all checkpointers
>but the bar should be higher than that for adding something to the
>standard. Also, right now the whole approach inherently only supports
>one checkpointing protocol: synch-and-stop. If we can work out a more
>generic API that supports other protocols I think that it may have
>enough value to be included in the spec. Right now it still hasn't
>passed the bar.
>
>Greg Bronevetsky
>Post-Doctoral Researcher
>1028 Building 451
>Lawrence Livermore National Lab
>(925) 424-5756
>bronevetsky1 at llnl.gov
>http://greg.bronevetsky.com
>
>_______________________________________________
>mpi3-ft mailing list
>mpi3-ft at lists.mpi-forum.org
>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft