[Mpi3-ft] FTWG conference call today

Schulz, Martin schulzm at llnl.gov
Wed Jan 23 11:44:21 CST 2013


Here is the promised reference:

Martin Schulz, Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, and Paul Stodghill. 2004. Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs. In Proceedings of the 2004 ACM/IEEE conference on Supercomputing (SC '04). IEEE Computer Society, Washington, DC, USA, 38-. DOI=10.1109/SC.2004.29 http://dx.doi.org/10.1109/SC.2004.29

It's on the ACM DL at:

http://dl.acm.org/citation.cfm?id=1049982

Let me know if you have any questions or comments,

Martin



On Jan 23, 2013, at 9:02 AM, Sur, Sayantan wrote:

> Yup. It looks like our emails crossed.
> 
> Thanks,
> Sayantan
> 
> 
>> -----Original Message-----
>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org]
>> On Behalf Of Aurélien Bouteiller
>> Sent: Wednesday, January 23, 2013 8:34 AM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: Re: [Mpi3-ft] FTWG conference call today
>> 
>> Yes,
>> 
>> Have you not received the email yet?
>> 
>> Aurelien
>> 
>> On Jan 23, 2013, at 11:10, "Sur, Sayantan" <sayantan.sur at intel.com> wrote:
>> 
>>> Hi,
>>> 
>>> Is there a meeting today?
>>> 
>>> Thanks,
>>> Sayantan
>>> 
>>>> -----Original Message-----
>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org]
>>>> On Behalf Of Aurélien Bouteiller
>>>> Sent: Wednesday, January 09, 2013 8:10 AM
>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>> Subject: Re: [Mpi3-ft] FTWG conference call today
>>>> 
>>>> Dear WG members,
>>>> 
>>>> This is a reminder that, as planned, we have our regular phone
>>>> meeting today.
>>>> 
>>>> Agenda:
>>>> - Follow-up on object state discussions
>>>> 
>>>> 
>>>> Date: Jan. 9, 2013
>>>> Time: Noon EST/New York
>>>> Dial-in information: 218-339-4600
>>>> Code: 623998#
>>>> 
>>>> 
>>>> Next Meeting:
>>>> * Jan. 23, 2013
>>>> 
>>>> On Dec 12, 2012, at 13:31, "Sur, Sayantan" <sayantan.sur at intel.com> wrote:
>>>> 
>>>>> Hello WG members,
>>>>> 
>>>>> Josh, Darius and I were on the call. We discussed our assignment to
>>>>> define what happens to objects upon failure. Specifically, what
>>>>> happens to objects that are created locally (i.e. do not require any
>>>>> remote processes to call MPI), but which the MPI implementation can
>>>>> store in a distributed fashion.
>>>>> 
>>>>> We had a short brainstorming session. The thoughts that were
>>>>> discussed were:
>>>>> 
>>>>> - We could require of the implementation that, after a failure, any
>>>>>   access to such an object returns either SUCCESS or FAILURE, i.e.
>>>>>   there are no corrupted or partially available objects.
>>>>> - It could be that some live ranks can read their objects, whereas
>>>>>   others cannot.
>>>>> - The app could use MPI_Comm_agree to reach consensus on whether all
>>>>>   required objects can be read on the ranks that are alive.
>>>>> - For some objects, such as Datatype, there are no accessor functions;
>>>>>   the object is only touched when it is used (e.g. Send/Recv). An MPI
>>>>>   implementation could return an error when the app uses a datatype
>>>>>   whose internal representation is no longer available to the
>>>>>   implementation. However, this is not very useful, as the app then
>>>>>   needs a way to discern why a send failed.
>>>>> - Would it make sense to add *_Check functions for objects, to see
>>>>>   whether they are still available (after a failure)?
>>>>> 
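To make the SUCCESS/FAILURE and MPI_Comm_agree ideas in the notes above concrete, here is a minimal sketch in Python. It does not call MPI at all; it only simulates the pattern inside a single process. The names `access_object` and `agree_all_readable` and the layout of `store` are hypothetical, and modelling the agreement as a logical AND over the surviving ranks is an assumption about how an app would combine the local results.

```python
from typing import Dict, List, Optional, Tuple

SUCCESS, FAILURE = 0, 1  # illustrative return codes, not real MPI constants


def access_object(store: Dict[int, Dict[str, str]], rank: int,
                  name: str) -> Tuple[int, Optional[str]]:
    """Hypothetical all-or-nothing accessor.

    Returns (SUCCESS, value) if the object is fully readable on this rank,
    otherwise (FAILURE, None). It never hands back a partial object.
    """
    value = store.get(rank, {}).get(name)
    if value is None:
        return FAILURE, None
    return SUCCESS, value


def agree_all_readable(store: Dict[int, Dict[str, str]],
                       alive_ranks: List[int],
                       required: List[str]) -> Tuple[bool, Dict[int, bool]]:
    """Simulated agreement step (assumption: logical AND over live ranks).

    Each live rank first checks locally whether it can read every required
    object; the agreed result is True only if all live ranks succeeded.
    """
    local_ok = {
        rank: all(access_object(store, rank, name)[0] == SUCCESS
                  for name in required)
        for rank in alive_ranks
    }
    return all(local_ok.values()), local_ok


# Rank 1 has failed; rank 2 survived but can no longer read "dtype".
store = {0: {"dtype": "vec3"}, 2: {}}
agreed, local_ok = agree_all_readable(store, alive_ranks=[0, 2],
                                      required=["dtype"])
print(agreed, local_ok)
```

In this example rank 0 can still read its object and rank 2 cannot, so the agreed answer is False on every surviving rank. That is the point of the agreement step: no rank proceeds on the strength of its own local success alone.
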
>>>>> Please let me know if I missed something in the notes.
>>>>> 
>>>>> Sayantan
>>>>> 
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org]
>>>>>> On Behalf Of Aurélien Bouteiller
>>>>>> Sent: Wednesday, December 12, 2012 6:33 AM
>>>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>>> Subject: [Mpi3-ft] FTWG conference call today
>>>>>> 
>>>>>> Dear working group members,
>>>>>> 
>>>>>> We have our usual biweekly conference call planned for today.
>>>>>> Unfortunately, nobody from UT is available to attend, but the
>>>>>> conference call will be set up and available to the group anyway.
>>>>>> 
>>>>>> We would appreciate it if somebody could keep a summary of the
>>>>>> discussions.
>>>>>> 
>>>>>> 
>>>>>> Agenda:
>>>>>> - Follow-up items from the meeting
>>>>>> 
>>>>>> 
>>>>>> Date: Dec 12, 2012
>>>>>> Time: Noon EST/New York
>>>>>> Dial-in information: 218-339-4600
>>>>>> Code: 623998#
>>>>>> 
>>>>>> 
>>>>>> Next Meeting:
>>>>>> * Jan. 9, 2013
>>>>>> 
>>>>>> 
>>>>>> Please note: the Dec. 26 call has been cancelled.
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> * Dr. Aurélien Bouteiller
>>>>>> * Researcher at Innovative Computing Laboratory
>>>>>> * University of Tennessee
>>>>>> * 1122 Volunteer Boulevard, suite 309b
>>>>>> * Knoxville, TN 37996
>>>>>> * 865 974 9375
>>>>>> 
>>>>>> _______________________________________________
>>>>>> mpi3-ft mailing list
>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

________________________________________________________________________
Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
CASC @ Lawrence Livermore National Laboratory, Livermore, USA






