[mpiwg-tools] Debugger spawn

Tue Jul 19 08:43:58 CDT 2016

Hi Chris

Responses inline below
Ralph

> On Jul 19, 2016, at 1:27 AM, Chris January <chris.january at allinea.com> wrote:
> 
> Hello Ralph,
> 
> On 18/07/16 20:00, Ralph Castain wrote:
>> We’ve been chatting in the meetings about how to possibly use PMIx for obtaining proctable info and having the resource manager (or mpirun) launch debugger daemons. I have prototyped some code for PMIx that supports these operations (will commit it to PMIx for the 2.0 release, and it will be in OMPI master shortly), and written a sample debugger startup tool (see attached) that illustrates how it would be used.
>> 
>> I think you will find it relatively simple. We can add/subtract/modify the returned proctable data as required.
> 
> Thank you for sending over the sample code. I have a couple of questions that concern the case where the tool is actually starting the job itself (e.g. running mpiexec -n ...):
> 1. How can the tool ensure that the job does not start executing (beyond, say, MPI_Init) before the tool has attached?
> In MPIR, if MPIR_being_debugged is set in the starter process, the MPI processes wait at a barrier before or inside MPI_Init until the starter process returns from the MPIR_Breakpoint function.

I provided the “PMIX_DEBUGGER_DAEMONS” info key to alert the launcher that a debugger is being run against the application, so it can ensure the job is “paused” as needed. I would propose adding another info key (“PMIX_DEBUG_TARGET”) whose value is the nspace (aka jobid) of the target application.
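
To make the intent concrete, here is a minimal sketch of how a launcher might decide whether to hold a job, given those two keys. It does not use the real PMIx headers: info_kv is a stand-in for a PMIx info entry, the key strings are simply the names quoted above (the actual definitions may differ), and must_pause_job is a hypothetical helper. It also assumes that the debugger key without a target means "pause whatever is being launched", which is one possible reading rather than settled behavior.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a PMIx info entry; this is NOT the real pmix_info_t. */
typedef struct {
    const char *key;
    const char *value;   /* NULL for flag-style keys */
} info_kv;

/* Key names as quoted in this message; the real definitions may differ. */
#define KEY_DEBUGGER_DAEMONS "PMIX_DEBUGGER_DAEMONS"
#define KEY_DEBUG_TARGET     "PMIX_DEBUG_TARGET"

/* Hypothetical launcher-side check: should the job in `nspace` be held
 * until the debugger has attached?
 * Assumption: the debugger key alone pauses the job; if a target is also
 * given, only the job whose nspace matches the target is paused. */
bool must_pause_job(const info_kv *info, size_t ninfo, const char *nspace)
{
    bool debugger = false;
    const char *target = NULL;

    for (size_t i = 0; i < ninfo; i++) {
        if (strcmp(info[i].key, KEY_DEBUGGER_DAEMONS) == 0) {
            debugger = true;
        } else if (strcmp(info[i].key, KEY_DEBUG_TARGET) == 0) {
            target = info[i].value;
        }
    }
    if (!debugger) {
        return false;            /* no debugger involved: run normally */
    }
    return target == NULL || strcmp(target, nspace) == 0;
}
```

With this shape, a launcher handling several jobs at once would hold only the one named by the target key and let the others run.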

> 2. Let's say the resource manager has a command like SLURM's srun that can both make an allocation, and also start a job running. In this case, if the tool starts the job itself by running srun ..., it will be outside the resource manager's allocation. How, in that case, will the PMIX_tool interface know which job the tool wants to work with? How will the tool find the server PID it needs to pass?

The PMIx_tool_init function automatically finds the local PMIx server on the node where the tool is being executed, and then connects to it. This is done via a rendezvous file (a well-known filename created by the PMIx server). If multiple rendezvous files are found, PMIx_tool_init will complain and gracefully exit - at that point, you can run the tool again and simply provide the pid of the desired target PMIx server. I’m open to alternative rendezvous protocols - this was just something simple. We are implementing a plugin system and so it can be tailored as desired - the plugin system will also support proprietary binary plugins for those desiring them.

Note that PMIx and the tool have no ability to start the job on their own. Instead, the tool would package up the application directives and ask that the RM spawn both the application _and_ the associated debugger daemons at the same time. I have updated the code example to show how that might be done.
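
The attached debugger.c is the reference for how this is actually done. Purely as a rough sketch of the idea, the fragment below packages an application and its debugger daemons into a single request, with the daemon entry tagged by the debugger key. Everything here is a stand-in: info_kv, app_desc, spawn_request, and make_debug_spawn are hypothetical names, not PMIx types or functions, and the key string is just the name quoted earlier in this message.

```c
#include <stddef.h>

/* Stand-ins for illustration only; these are NOT PMIx types. */
typedef struct {
    const char *key;
    const char *value;          /* NULL for flag-style keys */
} info_kv;

typedef struct {
    const char    *cmd;         /* executable to launch */
    int            nprocs;      /* number of processes requested */
    const info_kv *info;        /* per-app directives, may be NULL */
    size_t         ninfo;
} app_desc;

typedef struct {
    app_desc apps[2];           /* [0] application, [1] debugger daemons */
    size_t   napps;
} spawn_request;

/* Build one request asking the RM to start the application and the
 * debugger daemons together, so the launcher sees both at once and can
 * hold the application until the daemons are in place. */
spawn_request make_debug_spawn(const char *app_cmd, int app_procs,
                               const char *daemon_cmd, int daemon_procs)
{
    /* Key name as quoted in this message; real definition may differ. */
    static const info_kv daemon_info[] = {
        { "PMIX_DEBUGGER_DAEMONS", NULL }
    };

    spawn_request req;
    req.apps[0].cmd    = app_cmd;
    req.apps[0].nprocs = app_procs;
    req.apps[0].info   = NULL;
    req.apps[0].ninfo  = 0;

    req.apps[1].cmd    = daemon_cmd;
    req.apps[1].nprocs = daemon_procs;
    req.apps[1].info   = daemon_info;
    req.apps[1].ninfo  = sizeof daemon_info / sizeof daemon_info[0];

    req.napps = 2;
    return req;
}
```

The point of the single request is that the tool never launches anything itself; it only describes what it wants and leaves placement and startup ordering to the RM.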

HTH
Ralph

-------------- next part --------------
A non-text attachment was scrubbed...
Name: debugger.c
Type: application/octet-stream
Size: 7715 bytes
Desc: not available
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-tools/attachments/20160719/598bf576/attachment-0001.obj>
-------------- next part --------------

> 
> Yours,
> Chris January - VP Engineering - Allinea Software Ltd.