[Mpi3-ft] MPI Forum Plenary Notes - Jan 2012

Josh Hursey jjhursey at open-mpi.org
Thu Jan 19 15:02:40 CST 2012

Attached are the notes that I was able to accumulate from the FT working
group plenary session during the Jan. 2012 MPI Forum meeting.

Thanks to everyone that sent me notes to contribute to this effort.

-- Josh

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
Notes from the MPI Forum Plenary Session(s) during the Jan. 2012 meeting
 - Page numbers and sections refer to the Dec. 20, 2011 draft of the RTS proposal.
 - I tried to group topics by section, but some topics relate to multiple sections.
 - I also tried not to editorialize these notes, but to represent them as faithfully as I can from the notes available to me from that meeting.
 - These notes are a combination of my notes and those sent to me from various individuals [Darius Buntinas, Quincey Koziol, Jeff Squyres, Fab Tillier]. Thanks a bunch for the help!

mmm General Plenary
- Introductory slides:
  - Wanted a few more details on specific feedback and requirements from applications

- Would it be useful to consider a transactional section (critical section with transactional semantics for the message channel or communicator)?

- Some trouble with the wording "uniformly return" for operations like validate and communicator creation that return uniformly across all participating processes. Would prefer something with more meaning to the academic community e.g., transactional.

- Double check that we're not using the word "channel" in the standard

- General:  Do we want to support only the transactional model?

- Should we have a reading tomorrow during the scheduled session?
   Yes      3
   No      17
   Abstain  2

mmm Other locations in the text
- Page 300 at 41,43: replace "communication handle" with "communicator"
  -->[JJH] "communication handle" is correct since error handlers can be associated with windows/files that no longer have an explicit communicator.

- Page 318 at 39: "process 0" --> "process 0 in MPI_COMM_WORLD"

- Page 572 at 38-30: C++ is deprecated, should not add new error values.
- Page 573 at 40: C++ is deprecated, should not add new types.
- Page 583 at 12-13: C++ is deprecated, should not add new constants.

mmm 17.1 Introduction
- Was not reviewed.

- Page 539 at 18: "MPI features that support the development" --> "MPI features that enable the development"

mmm 17.2 MPI Terms and Conventions
- 'collectively active/inactive' are problematic phrases.

mmm 17.3 Process Fault Detection
- This section indicated to some that the proposal -requires- progress in MPI, which is deemed highly problematic.
  - Suggested that we remove this section, or most of it.

- Do we need eventual notification in the failure detector?

- Do we need eventual notification of _every_ failed process, or just of the processes we're communicating with?

- p540 at 9: "MPI provides ... connected processes" --> "MPI implementations that support process fault tolerance guarantee that all alive processes will eventually detect all failures of connected processes."
  -->[JJH] I'll have to think about this a bit more.

- p540 at 20: drop "something like"
  -->[JJH] This gives us theoretical wiggle room :)

- p540 at 23: "Eventually every process that crashes is permanently..." --> "Every process that crashes is eventually permanently..."
  -->[JJH] I don't want to change the quote, since it is well established.

mmm 17.4 Querying for Failed Processes
- Page 541 at 37: What happens if MPI_COMM_REMOTE_GROUP_FAILED is called on an intracommunicator?
  -->[JJH] It is undefined for MPI_COMM_REMOTE_GROUP. I would assume that this is erroneous and would return MPI_ERR_COMM. We probably should leave it unspecified, to match the standard.

mmm 17.5 MPI Environmental Management
  - in MPI, *_NULL is not a valid handle

- Errhandlers
  - Propose a ticket to clarify that "operations allowed inside an error handler are implementation specific".

- Failhandler and threads
  - p545 line 1: Advice to implementors is incorrect. Remove this advice, and add the following text into the preceding paragraph.
  - "Error handlers and process failure handlers may only be called in the context of an MPI operation."

- MPI_Failhandler_set_mode()
  - p545 line 16-20: This paragraph is worded incorrectly, and needs to be revised.
  - The way it is worded disallows connecting two MPI_COMM_WORLDs (e.g., spawn, connect/accept) because of the casual use of the word 'connected' in the paragraph. The intention is that "... must specify the same mode of operation in order to set a process failure handler on the communicator."
  - Need to clarify 'collective over what?' MPI_COMM_WORLD, MPI universe? Should be 'world'.
  - So clarify that it is possible to have two connected sets of process groups that have different modes, but they cannot register a failure handler on that communicator if the modes mismatch.
  - Spawn example
Parent:            Child:
spawn()   -------> Init()
                   set_failhandler() (implies default mode, so this will fail)
                   mode_set() (is this legal? since the default was implied above)

- Rolf has a proposal to set the mode in MPI_Init_thread() as part of the required/provided argument.
  - Make the required parameter of MPI_Init() a bitmask of THREAD level and FT level.
  - FT level can be 0=NONE 1=FT with _ALL mode 2=FT with _SUBSET mode.
  - This is being discussed on the list
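
Rolf's proposal above could look something like the following sketch. All names and bit positions here are illustrative assumptions (nothing like this appears in any draft): the thread level occupies the low bits of the 'required' argument, and the FT level is shifted above it.

```python
# Hypothetical encoding for Rolf's proposal: pack the MPI thread level and an
# FT level into the single 'required' argument of MPI_Init_thread.
# All names and bit positions are illustrative assumptions, not draft text.

THREAD_MASK = 0x0F                      # low 4 bits: thread level (0..3)
FT_SHIFT = 4                            # FT level lives above the thread bits
FT_NONE, FT_ALL, FT_SUBSET = 0, 1, 2    # 0=NONE, 1=FT with _ALL, 2=FT with _SUBSET

def pack_required(thread_level, ft_level):
    """Combine a thread level and an FT level into one bitmask."""
    return (ft_level << FT_SHIFT) | (thread_level & THREAD_MASK)

def unpack_required(required):
    """Split the bitmask back into (thread_level, ft_level)."""
    return required & THREAD_MASK, required >> FT_SHIFT

req = pack_required(3, FT_SUBSET)       # e.g., MPI_THREAD_MULTIPLE + subset FT
assert unpack_required(req) == (3, FT_SUBSET)
```

One open design question with any such packing is how the 'provided' output should degrade when the implementation can satisfy the thread level but not the FT level, or vice versa.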

- 17.5.1: The MPI_FAILHANDLER_MODE_ALL operating mode seemed problematic to some. They wanted a strong use case for this mode (over that of SUBSET).

- It was suggested that if it is possible to implement process failure handlers in a separate library then why specify them in the MPI standard?

- Failure handlers:
  - Do we want to require them to be called from every MPI function, e.g., MPI_Comm_rank or MPI_Initialized?
  - Maybe we should just say in "communication functions" (e.g., probe, send, recv)?

- Tools:
  - Don't call failure handlers from within certain MPIT functions that would be called from within an interrupt.

- Accessor function for failure handler mode.
  - Need a query interface to determine if the mode has been set or not. For tool support, and libraries that wish to check this value to determine if they may set it.
    - Something like: MPI_Failhandler_query_mode(); MPI_Allreduce(suggested_mode); MPI_Failhandler_set_mode();
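
The query/agree/set pattern suggested above might behave like this single-process simulation. The function and mode names mirror the suggested (hypothetical) MPI calls, and the MIN-reduction rule for agreeing on a mode is purely an assumption for illustration.

```python
# Sketch of the suggested negotiation pattern for the failure-handler mode:
# query whether a mode is already set, agree across processes (the
# MPI_Allreduce step), then set it. Single-process simulation; all names
# and the MIN agreement rule are assumptions for illustration only.

MODE_UNSET, MODE_ALL, MODE_SUBSET = 0, 1, 2

def negotiate_mode(current_mode, suggestions):
    """Return the mode the group would settle on.

    current_mode: result of the hypothetical MPI_Failhandler_query_mode()
    suggestions:  per-process preferred modes fed to a simulated Allreduce
    """
    if current_mode != MODE_UNSET:
        return current_mode          # already set (e.g., by a library); keep it
    # Simulated MPI_Allreduce with MIN: the lowest-numbered suggestion wins.
    return min(suggestions)

# A library finds the mode already set and must respect it:
assert negotiate_mode(MODE_ALL, [MODE_SUBSET, MODE_SUBSET]) == MODE_ALL
# Otherwise the processes agree on a common mode before calling set_mode():
assert negotiate_mode(MODE_UNSET, [MODE_ALL, MODE_SUBSET]) == MODE_ALL
```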

- In _ALL mode, if your handler calls validate, you need to call validate every time.
  - Anything that matches across process failures.
  - E.g., drain, comm_create(?) etc.

- (Martin) Create new failure handler prototypes to get rid of varargs.
  - Instead of reusing the Error Handler object

- Consider an advice to users: strongly advised to only use collectives over the communicator passed to the failure handler.

- Page 543 at 25: "are to be postponed" --> "should be postponed" or, maybe even must?
  -->[JJH] 'must be postponed'

- Page 544 at 46: "with the exception of the list below" --> "with the exception of those listed below"

- Page 547 at 37: The error classes are already listed in 8.4, so technically aren't being added here.  "The following error classes, also listed in Section 8.4, are specific to process fault tolerance:"

mmm 17.6 Point-to-Point
- Clarify that a process failure 'disables the posting of new MPI_ANY_SOURCE messages'.

- Keith had an issue with interface design decisions for reenable_anysource.
  - Should it return the 'alive' group or the 'failed' group?

- MPI_Comm_drain
  - Suspected race condition that needs to be highlighted and addressed.
    - If Process A posts a send to Process B and it succeeds, and A then calls Drain in a failhandler: if the message was cached by A and Process B then posts the recv, should it be delivered or should the recv fail? (I think it should succeed if posted, since we are really 'flushing' the channel.)
Process A                Process B
 -> Success
 -> Failhandler
    -> Drain
                           -> Success or Error?
                           -> Failhandler
                              -> Drain
  - Can one thread do a send during a comm_drain?  What does that mean? Are they serialized?
  - What happens when a drain is posted on one process (A), and another process (B) posts a point-to-point message with A (but not yet the drain)? Should B's message to A be canceled because A posted a drain?
  - Draining may result in a send that completes successfully (e.g., small message that's buffered), but the matching receive will complete with an error.
  - What if the drain called in a failure handler matches a drain posted during the normal execution path? Is this a problem that the user can realistically address on their own? (Same question can be posed for other operations like validate - we need to clarify this)
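
The 'flush' interpretation favored in the note above can be modeled as follows. This is a toy single-process model, not the proposal's semantics (which leave the question open): a send that already completed at A is considered part of the channel, so draining delivers it to a matching receive at B rather than failing it.

```python
# Toy model of the 'flush' interpretation of MPI_Comm_drain: a message whose
# send already completed at the sender is in the channel, so a drain matches
# it against posted receives instead of failing them. Purely illustrative.

class Channel:
    def __init__(self):
        self.in_flight = []              # sends that completed at the sender

    def send(self, msg):
        self.in_flight.append(msg)       # e.g., a small buffered message
        return "Success"

    def drain(self, posted_recvs):
        """Flush the channel: match cached messages to posted receives."""
        delivered = []
        while self.in_flight and posted_recvs:
            delivered.append((posted_recvs.pop(0), self.in_flight.pop(0)))
        return delivered

ch = Channel()
assert ch.send("m0") == "Success"                 # A's send completes pre-drain
assert ch.drain(["recv0"]) == [("recv0", "m0")]   # B's posted recv is satisfied
```

Under the opposite interpretation, the drain would cancel "m0" and B's receive would complete with an error, which is exactly the race the group wants highlighted in the text.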

- MPI_ERR_ANY_SOURCE_DISABLED is a bad name.  Some suggestions:
  - ...?

- Behavior of Probe/Iprobe/Mprobe/Improbe?
  - not mentioned anywhere, but should probably behave like Recv/Irecv.

- Page 548 at 3: What about blocking receive calls?
  -->[JJH] Clarified in the forward referenced section.

- Page 548 at 19: link to example 8.7 actually goes to section 8.7.

- Page 550 at 42: for MPI_Comm_any_source_enabled() other functions use "flag" as parameter name, rather than "enabled".

- Page 551 at 23: "operations operations"

- Page 553 at 7: "one or both of the processes": there are 3 processes, the one performing the sendrecv, the remote process from which data is received, and the remote process to which data is sent.  Would be nice to clarify that "one or both" refers to *remote* processes.

mmm 17.7 Collectives
- "Collective inactive/active" is bad wording and should be rephrased as something like "when a process fails then the posting of collectives is henceforth disabled on this communicator until the group is renegotiated using a validation operation" 

- Typo '\%' symbol in examples

- Accessor for collectively inactive flag for tools

- Register functions that would be called when a communicator is validated, etc.  To allow cached info to be updated.

- Page 554 at 24: "appropriately" --> "accordingly"

- Page 555 at 1,37: should comm be INOUT?

- Page 555 at 26: "...is used by the alive process." --> "... is used by the alive processes."

- Page 556 at 7, 10, 29, 31: Should the communicators be INOUT?

- Page 556 at 15, 38: MPI_Comm *subcomms --> MPI_Comm subcomms[].  Should it be const?

- Page 557 at 30-32: didn't parse, consider rewording

mmm 17.8 Groups, Contexts, Communicators, and Caching
- Need a cleaner justification for why MPI_Comm_{create,dup,split} must match across failure
  - I said that I have an example that we agreed showed that without these semantics it was dangerous. I'll try to dig up that example again and post to the list.
  - "Why do we require transactional semantics for just certain operations like MPI_Comm_dup?"

- Consider a program that uses MPI_Comm_dup in the main program and in a failure handler. (Also consider if you use validate in this way.) There are issues with matching occurring across those logical boundaries.
Process A                Process B
                          -> Failhandler
                             -> Validate()
 -> Failhandler
    -> Validate()
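
The matching hazard in the diagram above can be made concrete with a toy model. The premise here (an assumption, though consistent with how MPI collectives usually match) is that calls like a hypothetical MPI_Comm_validate match purely by call order on the communicator, with no awareness of whether the call came from the main program or a failure handler.

```python
# Toy illustration of cross-context matching: if collectives match by call
# order on the communicator, a validate issued inside a failure handler on
# one process can match a validate issued from the main program on another.
# The order-based matching rule is an assumption for illustration.

def call_sequence(events):
    """Tag each collective call with its ordinal position on the communicator."""
    return [(i, ctx) for i, ctx in enumerate(events)]

# Process A reaches its failure handler late; Process B enters it first.
proc_a = call_sequence(["main:validate", "failhandler:validate"])
proc_b = call_sequence(["failhandler:validate", "main:validate"])

# The 0th validate on A (main program) matches the 0th on B (failure handler):
matches = list(zip(proc_a, proc_b))
assert matches[0] == ((0, "main:validate"), (0, "failhandler:validate"))
```

This is the kind of example that motivates requiring operations like MPI_Comm_dup and validate to match across failures, since the user cannot control which logical context a peer's matching call comes from.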

- Page 561 at 45-46: "In the MPI_COMM_SPLIT operation, failed processes in the associated communicator effectively supplies the color MPI_UNDEFINED." --> "In the MPI_COMM_SPLIT operation, failed processes in the associated communicator are treated as if they supplied the color MPI_UNDEFINED."

- Page 562 at 8: "..., and  in the presence..." --> "..., as well as in the presence..."

- Page 562 at 15-16: "All participating communicator(s) must be collectively active before calling any intercommunicator construction operation." --> "The input communicator to intercommunicator construction operations must be collectively active on all participating processes."

- Page 562 at 30-35: Clarify statement about MPI_COMM_DUP matching across process failures.
  - Consider providing rationale.

mmm 17.9 Process Topologies
- Page 563 at 13, similar wordsmithing as for page 562 at 15-16

mmm 17.10 Process Creation and Management
- Spawn:
  - Fail at init rather than just not connect them.
  - No children exist after a failure

mmm 17.11 One-Sided Communication
- p564 line 45: "Additionally, the memory associated with the window during MPI_WIN_CREATE is undefined"
  - This is problematic considering that in the new RMA proposal the whole process address space could be exposed during MPI_Win_create.
  - Some members of the RMA group would like something more restrictive.
  - Need to coordinate with the RMA group to define exactly what this should be.
  - Suggested something like "Only the memory targeted by an operation that modified it is undefined after process failure"
  - Example suggested was one of walking a large graph, laying down and modifying markers. If only those modified are undefined, then the remaining graph can be used to repair the operation.
  - In a failure: only values "targeted" by an RMA operation that modify memory are undefined.  Talk to RMA folks to make sure that's OK.

mmm 17.12 I/O
- Is it allowed to have outstanding I/O operations across a MPI_File_validate?

- Need to expand/strengthen the language in section 17.12 about how non-blocking [collective/independent] I/O is impacted by the call to File_validate.

- When checking to see if there's an error (with MPI_File_validate), don't necessarily want to have the full sync-barrier-sync semantics.
  - it would be really valuable to have that behavior be optional, or
  - be separated out into a different routine.

- How does the implied sync-barrier-sync semantic impact the file when atomic mode is enabled?
