[Mpi3-tools] Next MPI-3 Tools WG meeting Monday 8/26 + Schedule for next meetings

Dong H. Ahn ahn1 at llnl.gov
Mon Aug 9 13:54:21 CDT 2010


  On 8/9/2010 11:19 AM, John DelSignore wrote:
> Dong H. Ahn wrote:
>>    Hi John,
>>
>> I just looked at the revision and I think this addressed most of the
>> issues I've raised except for the MPIR_attach_fifo support.
> Sorry, I forgot about MPIR_attach_fifo. I had to scrounge around in my email to find the description. Here's what I have, from an email you sent on 7/9/10:
>
> =============================================
> ... I would also hope to
> cover a recently implemented MPIR extension on OpenRTE/OpenMPI:
> MPIR_attach_fifo support.
>
> -------
>
> "
> char MPIR_attach_fifo[256];
>
> Definition is not required.
>
> Definition is contained within the address space of the starter process.
>
> Variable is written by the tool, and read by the starter process.
> MPIR_attach_fifo  is a null-terminated character string that is written by the tool into the
> address space of the starter process. The string is the path name of a FIFO (named pipe). Writing a byte
> into this FIFO will cause the starter process to start to monitor MPIR_being_debugged in attach mode.
> "
> -------
>
> As part of LaunchMON/STAT OpenRTE/OpenMPI port, Ralph Castain reproduced
> BlueGene's MPIR colocation extension but there was a performance
> requirement of orterun that kept orterun from polling on the value
> changes of MPIR_being_debugged for attach case by default. So we
> addressed it by making orterun open up a FIFO and its comm thread block
> on the FIFO along with other TCP channels; when a debugger attaches to
> orterun and writes a byte into it, the starter process looks at
> MPIR_being_debugged and performs the co-location service.
> =============================================
>
>> If it is
>> missing from this revision because of a lack of technical details, Ralph
>> and I should be able to provide them.
> Given that I know zilch about MPIR_attach_fifo, it would make sense for someone else (not me) to write up the description and incorporate it into the document. Since it is an extension, I think it should be added in sections 7 and 9. Sections 7.3 and 9.19 seem appropriate places for the additions.

Ralph and I will try to draft it before the next call.

I think Ralph is more qualified to provide answers below but here is my 
version.

> Finally, I'd like a little clarification on this extension...
>
> 1) Who creates and owns the FIFO? Is it the starter process?

Yes.
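
To make the mechanism concrete, here is a minimal C sketch of the
starter side, assuming POSIX mkfifo() and poll(). It is not orterun's
code: the helper names attach_fifo_open, attach_fifo_wait and
attach_requested are hypothetical, and a real starter would add the
FIFO descriptor to the poll set it already uses for its TCP channels
rather than polling it alone. How the FIFO path is exchanged through
MPIR_attach_fifo is left out.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* MPIR flag; the tool sets it in the starter's address space. */
volatile int MPIR_being_debugged = 0;

/* Hypothetical helper: create the FIFO at `path` and open its read end.
 * O_NONBLOCK keeps open() from blocking until a writer shows up.
 * Returns the descriptor, or -1 on failure. */
int attach_fifo_open(const char *path)
{
    if (mkfifo(path, 0600) != 0)
        return -1;
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        unlink(path);   /* don't leave the FIFO behind on failure */
    return fd;
}

/* Hypothetical helper: wait up to timeout_ms for one byte on the FIFO.
 * Returns 1 if a byte arrived, 0 on timeout, -1 on a poll() error. */
int attach_fifo_wait(int fifo_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fifo_fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;
    if (pfd.revents & POLLIN) {
        char byte;
        if (read(fifo_fd, &byte, 1) == 1)
            return 1;
    }
    return 0;
}

/* Only after being woken does the starter look at MPIR_being_debugged. */
int attach_requested(int fifo_fd, int timeout_ms)
{
    return attach_fifo_wait(fifo_fd, timeout_ms) == 1
        && MPIR_being_debugged != 0;
}
```

The point of this arrangement is that the starter reads
MPIR_being_debugged only after it has been woken through the FIFO, so
it never has to poll the variable.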

> On what node is the FIFO created? My concern is this: If the starter process is being debugged remotely (e.g.: totalview -r remotehost mpiexec) does the debugger have to open the FIFO on the remotehost? I'd assume so, which means that the debugger has to extend its remote debugging protocol to execute this "write to FIFO" operation.
This wasn't an issue with STAT/LaunchMON because the tool co-locates 
the LaunchMON engine with the starter process, and the engine knows how 
to deal with files. BTW, is "write-to-file" not supported in TotalView's 
remote debugging protocol?
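
For reference, the "write to FIFO" operation itself is small. Below is
a minimal C sketch of what a tool-side agent would run on the node
where the FIFO lives, assuming POSIX open() and write(). The helper
name attach_fifo_poke is hypothetical and the value of the byte is
arbitrary, since the description only says that a byte is written.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical tool-side helper: write one byte into the FIFO at `path`.
 * O_NONBLOCK makes open() fail with ENXIO instead of hanging when no
 * process has the FIFO open for reading.
 * Returns 0 on success, -1 on failure with errno set. */
int attach_fifo_poke(const char *path)
{
    int fd = open(path, O_WRONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    char byte = 1;                  /* the value itself is not significant */
    ssize_t n = write(fd, &byte, 1);

    int saved_errno = errno;        /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;

    return n == 1 ? 0 : -1;
}
```

Because this has to execute on the starter's node, a debugger that
drives the starter remotely needs either an agent there or an
equivalent operation in its remote protocol.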

> 2) Who is responsible for deleting the FIFO? What happens if the debugger forcibly kills the starter process? Is the FIFO leaked?

The starter process is responsible for deleting it.

As to your second question, I think this kind of cleanup requirement 
applies not only to this FIFO but also all other resources that the RM 
holds. The RM should be able to release all of its resources when its 
starter process is forcibly killed by the controlling debugger. In any 
case, I was told that orterun creates a per-session directory where it 
puts all of the file resources, which was also used for managing this 
FIFO resource. I suspect that orterun can perform the FIFO clean-up in 
the same fashion it cleans up other files in this directory when 
forcibly killed by the debugger.

> 3) Why didn't you use a socket or a signal instead of a FIFO? With a socket or signal, you don't have lifetime issues. With a signal, you don't have remote debugging issues.

We thought about those options, but neither is well suited to orterun. 
Most signals are already taken by this resource manager; a socket could 
be used, but then one would have to worry about authentication, etc.

Best,
Dong

> Cheers, John D.
>
>
>
>> Best,
>> Dong
>>
>> On 8/6/2010 11:58 AM, John DelSignore wrote:
>>> Hi,
>>>
>>> Attached is the most recent draft (8/6/2010) of the MPIR document for discussion during the 8/9 meeting.
>>>
>>> I updated it to reflect the comments I received in email on the 6/11/2010 draft that I felt warranted changing the document.
>>>
>>> I also copied and pasted Jeff's mpimsgq_dll_locations description from the LaTeX file and the MPI website; this is intended as a placeholder until someone (not me) has time to clean up the text to fit into the document.
>>>
>>> The "Diffs" document shows the changes relative to the 6/11/2010 draft.
>>>
>>> Cheers, John D.
>>>
>>>
>>> Jeff Squyres wrote:
>>>> On Jul 25, 2010, at 5:23 PM, Martin Schulz wrote:
>>>>
>>>>> Here is the proposed schedule for the upcoming meetings after tomorrow:
>>>>>
>>>>> 8/9 - Final discussion of the MPIR document
>>>>> 8/23 - Discussion of the completed/integrated MPIT document
>>>>> 9/6 - Labor day - no meeting
>>>>> 9/16-9/18 - MPI forum meeting in Stuttgart
>>>>> 	First reading of the MPIR document
>>>>> 9/20 - Feedback from the MPI forum, MPIT discussions
>>>> Webex links for these meetings are now up on the wiki.
>>>>
>> _______________________________________________
>> Mpi3-tools mailing list
>> Mpi3-tools at lists.mpi-forum.org
>> http://*lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-tools
>>



