[Mpi3-ft] [MPI3-IO] Fault Tolerance & I/O Discussion

Josh Hursey jjhursey at open-mpi.org
Mon Feb 6 13:40:09 CST 2012

We had a good discussion today. Attached are some notes that I took from
the call.

There were a few questions that we were discussing at the end of the call.
As a result, we are going to try to set up another teleconf.

Below is a doodle poll to pick a date/time:

If you are interested in attending this teleconf, please fill out the poll
by 2 pm Eastern on Wednesday, Feb. 8.

In the meantime, let us keep discussing these issues on the FT and I/O
mailing lists. My preference would be to just discuss it on the FT mailing
list as there are more FT folks over there, and then we would not distract
the discussions about the other I/O tickets.


On Mon, Feb 6, 2012 at 10:55 AM, Josh Hursey <jjhursey at open-mpi.org> wrote:

> Just a reminder that we are meeting today at Noon Eastern. Enclosed are
> the call-in details and a link to the FT stabilization proposal.
> Thanks,
> Josh
> On Thu, Feb 2, 2012 at 2:46 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
>> We will meet Monday, Feb. 6 from 12-1 pm EST/New York to discuss I/O in
>> the context of the fault tolerance proposal (or the Super Bowl if we get
>> bored).
>> We can use the following teleconf information:
>>   US Toll Free number: 877-801-8130
>>   Toll number: 1-203-692-8690
>>   Access Code: 1044056
>> The Run-Through Stabilization proposal can be found attached to the
>> ticket:
>>   https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/276
>> https://svn.mpi-forum.org/trac/mpi-forum-web/attachment/ticket/276/FTWG-Process-FT-Draft-2011-12-20.pdf
>> We will be primarily focusing on section 17.12 of that document. I will
>> try to send out a reminder the day of.
>> Thanks,
>> Josh
>> On Tue, Jan 31, 2012 at 5:04 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
>>> Try the new link (I had to close the other poll due to a timezone
>>> problem):
>>>   http://www.doodle.com/yifhpi5emyyzrspa
>>> -- Josh
>>> On Tue, Jan 31, 2012 at 4:41 PM, Mohamad Chaarawi <chaarawi at hdfgroup.org> wrote:
>>>>  Hi Josh,
>>>> When I click the link, it says the poll has been deleted.
>>>> Thanks,
>>>> Mohamad
>>>> On 01/31/2012 11:52 AM, Josh Hursey wrote:
>>>> (Cross posted to both the I/O and FT MPI-3 listservs)
>>>>  During the FT plenary session at the Jan. MPI Forum meeting there
>>>> were some concerns about fault tolerance semantics in the I/O chapter. We
>>>> did not have much time to fully discuss the additional semantics during the
>>>> meeting. To make sure that we push towards a complete set of semantics
>>>> useful for the I/O community in the next draft I would like to have a
>>>> teleconf to discuss the I/O chapter of the FT proposal. Preferably in
>>>> the next week and a half (so we have time to fine tune things before the
>>>> next forum meeting).
>>>>  Below is a link to a doodle poll to find a good time for a teleconf.
>>>> If you are interested in participating in this discussion, please fill this
>>>> poll out by 2 PM Eastern on Thurs. Feb. 2 so we can set the date/time.
>>>>    http://www.doodle.com/s3hz9daeh8pn483m
>>>>  Thanks,
>>>> Josh
>>>>  --
>>>> Joshua Hursey
>>>> Postdoctoral Research Associate
>>>> Oak Ridge National Laboratory
>>>> http://users.nccs.gov/~jjhursey
>>>> _______________________________________________
>>>> MPI3-IO mailing list
>>>> MPI3-IO at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-io

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
-------------- next part --------------
Notes from I/O & FT discussion on Feb. 6, 2012
- Should collective I/O operations be uniformly returning?
  - That is, should they be required to return either success everywhere or failure everywhere, and never a mix?
  - Non-uniform return codes: allow for looser synchronization in the collective I/O operation, but the user must use an MPI_File_validate call to determine if a new failure might have caused one of the previous set of I/O operations to fail.
  - Uniform return codes: force all I/O collectives to synchronize on their return code from the operation. This tells the user if some process failed before or during the operation. It does not tell the caller who failed, but they can jump to a recovery routine and call MPI_File_validate to get this information.
  - If a uniformly returning collective operation fails due to process failure, what do we say with regard to the data on disk and the file pointer?
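The difference between the two return-code options can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not use MPI, the error code name is hypothetical, and `max` stands in for whatever agreement step an implementation would actually need (which is harder when processes fail during the agreement itself).

```python
from typing import List

SUCCESS = 0
ERR_PROC_FAILED = 1  # hypothetical error code, for illustration only


def nonuniform_return(local_codes: List[int]) -> List[int]:
    """Each rank returns what it observed locally; no extra synchronization."""
    return list(local_codes)


def uniform_return(local_codes: List[int]) -> List[int]:
    """All ranks agree on one code: failure everywhere if any rank saw one.

    max() models the agreement step (an assumption, not a real protocol).
    """
    agreed = max(local_codes)
    return [agreed] * len(local_codes)


# Rank 1 noticed a process failure during the collective; the others did not.
local = [SUCCESS, ERR_PROC_FAILED, SUCCESS, SUCCESS]
print(nonuniform_return(local))
print(uniform_return(local))
```

In the non-uniform case, ranks 0, 2, and 3 would continue as if the operation succeeded and only learn otherwise from a later validate call. In the uniform case every rank can jump to recovery immediately, at the cost of the extra synchronization.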

- Should MPI_File_sync be strongly synchronizing?
  - Should the return from MPI_File_sync be uniform (success everywhere or error everywhere)?
  - Given the 'sync-barrier-sync' advice in Section 13.6.10, imposing such uniformity might add to the overhead of the 'sync' operation.
  - Though related to collective I/O routines, it might be best to think of MPI_File_sync semantics separately.

- MPI_File_validate and MPI_File_validate_sync
  - MPI_File_validate should be a strong synchronization for processes, but imply nothing about I/O operations. This is useful when the user wants to synchronize to find out about emerging failures without the overhead of synchronizing the file to disk.
  - MPI_File_validate_sync would add the 'sync-barrier-sync' semantic to the MPI_File_validate.
  - Question: do we need the MPI_File_validate_sync combination at all? Instead we could:
    - Make MPI_File_sync uniformly returning.
    - Consider an advice that 'sync-barrier-sync' in the presence of failure should be more like 'sync-validate-sync'.

- Behavior of file operations with a local file pointer
  - Local file operations are not affected by emerging process failure. Kind of like point-to-point, but the remote side is the disk.
  - So even if the group associated with the file handle contains a failed process, it is still valid to post new read/write operations using a local file pointer.

- Behavior of file operations with a shared file pointer
  - Independent operations are not affected by emerging process failure.
    - This implies that if the shared file pointer is maintained on a single process, then the MPI implementation must take sufficient action to protect that pointer from process failure (e.g., replication).
    - For independent operations there is no explicit synchronization, so without some other communication operation the application cannot tell whether it is allowed to read/write, nor whether a remote peer has failed or is merely slow.
  - Collective operations: the operation should fail.
    - Should the operation uniformly fail? See the point on uniform return codes above.
  - What is the state of the shared file pointer after a failure?
    - Are there sufficient mechanisms available to reasonably recover from the process loss, reason about the state of the file on the disk, and continue with the file handle?

- Collective file operations on a shared file pointer
  - After MPI_File_validate, how are processes going to participate in the collective?
  - Partially specified in Sections 17.12.3 and 17.12.4, but need to review to make sure that the semantics are clear and complete.

- MPI_File_close semantics seemed good
  - May need a do/while loop with MPI_File_validate to close the file, but MPI_File_close should not 'work around' emerging failures in order to close the file. Instead it should keep the file open, return an error, and let the application decide what to do (e.g., try to close it again, write more data, ...).
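The close-and-validate loop mentioned above might look roughly like the following toy model. It is a sketch only: `ToyFile`, `close_with_retry`, and `max_attempts` are hypothetical names, no real MPI binding is used, and retrying the close is just one of the choices the application could make.

```python
class ToyFile:
    """Hypothetical stand-in for an MPI file handle; not a real MPI binding."""

    def __init__(self, failures_before_close: int):
        self.pending_failures = failures_before_close
        self.is_open = True
        self.validate_calls = 0

    def close(self) -> bool:
        """Return True on success; on an unacknowledged failure, stay open."""
        if self.pending_failures > 0:
            return False  # error reported, file handle remains usable
        self.is_open = False
        return True

    def validate(self) -> None:
        """Acknowledge one emerging failure (models MPI_File_validate)."""
        self.validate_calls += 1
        if self.pending_failures > 0:
            self.pending_failures -= 1


def close_with_retry(f: ToyFile, max_attempts: int = 10) -> bool:
    """Try to close; after each failed close, validate and try again."""
    for _ in range(max_attempts):
        if f.close():
            return True
        f.validate()
    return False  # still open; the application decides what to do next


f = ToyFile(failures_before_close=2)
print(close_with_retry(f), f.is_open, f.validate_calls)
```

The bound on attempts is an assumption added so the sketch always terminates; the notes do not say how many retries are reasonable.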

- MPI_File_open
  - Could we add an info argument that would set the collectives to be uniformly returning?
  - This would allow the user to decide which level of strictness they need, and MPI might/should be able to optimize the collectives appropriately.
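As a rough illustration of the info-argument idea, the toy model below selects the return behavior from a string-valued key supplied at open time. The key name `ft_uniform_collectives` is invented for this sketch and was not proposed on the call; the dictionary merely stands in for an MPI_Info object.

```python
from typing import Dict, List


def collective_return_codes(info: Dict[str, str],
                            local_codes: List[int]) -> List[int]:
    """Return per-rank codes for one collective, honoring the info hint.

    'ft_uniform_collectives' is a hypothetical key used only for illustration.
    """
    if info.get("ft_uniform_collectives", "false") == "true":
        # Uniform: every rank reports the worst code seen by any rank.
        return [max(local_codes)] * len(local_codes)
    # Default: each rank reports only what it observed locally.
    return list(local_codes)


codes = [0, 0, 1]  # rank 2 observed a failure
print(collective_return_codes({}, codes))
print(collective_return_codes({"ft_uniform_collectives": "true"}, codes))
```

Making the looser behavior the default in this sketch is an arbitrary choice; which default is appropriate is part of the open question.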

- What about non-process failures, such as transient I/O failures with the disk?
  - This is out-of-scope for this proposal, but is something that the working group would like to return to at some point.
