[mpiwg-sessions] Debuggers, sessions, and PMIx
rhc at pmix.org
Tue Jan 31 03:42:40 CST 2023
Sorry I missed the discussion about debuggers and MPI sessions. I wanted to pass along the state of PMIx support for this interaction along with some proposals to extend it.
First, we already have support for the following:
1. A debugger/tool can query for proctable information using the PMIx group name associated with the session - i.e., to get the proctable for all processes in the session, you would query for the proctable, passing the group name in place of the namespace and PMIX_RANK_WILDCARD for the rank. If you want the proctable info for a single process in the group, then you would use the group name for the namespace and the group rank of the process.
2. If you are spawning processes to form a group, then you can use the existing "stop at exec", "stop in PMIx init", and "stop in app" attributes (the last is usually used to stop in MPI_Init, though it can be used to stop anywhere in the app) to pause the new processes until debugger attach/release. I suspect you might also want to pause the parent process doing the spawn so the debugger can attach to it as well before releasing the new processes. We don't have anything to help here, so I propose to add a new "stop in spawn" attribute. When passed to PMIx_Spawn, it will hold the parent process just prior to returning from the blocking form of PMIx_Spawn, or just prior to executing the callback function for the non-blocking form of PMIx_Spawn.
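To make the addressing in item 1 concrete, here is a minimal Python model of the lookup. It is only an illustration of the semantics, not PMIx code: `ProcEntry`, `query_proctable`, the `groups` dictionary, and the `RANK_WILDCARD` stand-in are all hypothetical names, and I am assuming that a member's position in the group's ordered list is its group rank.

```python
from dataclasses import dataclass

# Hypothetical stand-in for PMIX_RANK_WILDCARD; the real constant comes
# from the PMIx headers and is not modeled here.
RANK_WILDCARD = -1


@dataclass(frozen=True)
class ProcEntry:
    """One proctable row (hypothetical layout, for illustration only)."""
    nspace: str  # namespace the process was originally launched in
    rank: int    # rank within that original namespace
    host: str
    pid: int


def query_proctable(groups, group_name, rank=RANK_WILDCARD):
    """Look up proctable rows using the group name in place of a namespace.

    `groups` maps a group name to its ordered member list; the list index
    is assumed to be the group rank.
    """
    members = groups[group_name]
    if rank == RANK_WILDCARD:
        # Wildcard rank: return every process in the group (i.e., the session).
        return list(members)
    if not 0 <= rank < len(members):
        raise IndexError(f"group rank {rank} is not in group {group_name!r}")
    # Specific group rank: return just that one process.
    return [members[rank]]
```

With a group "session-A" built from processes in two different namespaces, `query_proctable(groups, "session-A")` returns both rows, while `query_proctable(groups, "session-A", 1)` returns only the second member, whatever its original namespace and rank were.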
This leaves the question of how to handle construction of MPI sessions across existing processes. For this purpose, I propose to:
a. Add a "stop in group construct" attribute that would pause just before returning from the blocking form of group construct, or just before executing the callback function for the non-blocking form. In the case of async group formation (using PMIx_Group_invite/join), the pause would occur just before delivering the "group complete" event to each participant. Debugger notification occurs when the PMIx library indicates that all participants have reached the pause point.
b. Utilize the "stop in app" support by passing that attribute to the PMIx group construct operation. This would generate an event notifying a debugger that the app has paused during MPI "sessions init" or some other designated location after completing group construction. Remember, "stop in app" takes a string "tag" so the MPI library can not only look for the attribute, but also check the "tag" to know where to stop. The attribute/tag is included in the debugger notification as well as returned to each group member in the "group complete" info array. Debugger notification takes place once all processes indicate that they have reached the assigned location (mechanism identical to "stop in app" when spawning processes).
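The notification rule shared by (a) and (b), where the debugger hears about the pause only once every participant has reached the designated point, can be sketched as a small Python model. This is a toy illustration under my own assumptions, not the PMIx implementation: `PauseTracker`, `reached`, and the `notify` callback are hypothetical names, and the tag check mirrors the "stop in app" string tag described above.

```python
class PauseTracker:
    """Toy model: notify the debugger once all participants have paused."""

    def __init__(self, participants, tag, notify):
        self._waiting = set(participants)  # procs that have not yet paused
        self._tag = tag                    # designated stop location
        self._notify = notify              # hypothetical debugger callback
        self._notified = False

    def reached(self, proc, tag):
        """Record that `proc` paused at `tag`; return True if the tag matched."""
        if tag != self._tag:
            # Not the designated location, so this does not count as a pause.
            return False
        self._waiting.discard(proc)
        if not self._waiting and not self._notified:
            # The last participant has arrived: deliver a single notification
            # that carries the tag, as the debugger event would.
            self._notified = True
            self._notify(self._tag)
        return True
```

In this model the callback fires exactly once, after the final participant reports in, and repeated or mismatched reports never trigger a second notification.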
Implementation of this support is pretty simple and can be made available in PMIx v5, due out this spring. Please feel free to provide comments/suggestions - none of this is set in concrete!