[mpiwg-sessions] will be on a plane today - and some observations

Ralph H Castain rhc at open-mpi.org
Mon Aug 20 19:28:15 CDT 2018


IIRC, there may be a default range setting that blocks the lookup - I’ll try to take a look later. Please understand that this isn’t an operation we spend a lot of effort supporting as virtually nobody has used it.
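
If you want to experiment in the meantime: if memory serves, Open MPI has info keys for widening the publish/lookup scope, so something along these lines might get around a too-restrictive default range (untested, key name from memory of the MPI_Publish_name man page, service name just an example):

    char port[MPI_MAX_PORT_NAME];           /* filled in by MPI_Open_port on the publishing side */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");
    MPI_Publish_name("sandbox-test", info, port);   /* publishing job */
    /* ...and on the other job... */
    MPI_Lookup_name("sandbox-test", info, port);
    MPI_Info_free(&info);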


> On Aug 20, 2018, at 4:26 PM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk> wrote:
> 
> Hi Ralph,
> 
> Unfortunately, the fact that “test3” is passing the “portname” on the command line is a red herring - connect/accept also fails when the “portname” is advertised by the job that opens the port using MPI_Publish_name and discovered by the other job using MPI_Lookup_name. That is the test case the sandbox code relies on. I also modified the “test3” example to use Publish/Lookup, to see whether there was any difference in the internal handling (inside Open MPI and/or inside PMIx) between the two situations. There is no difference in the final outcome or in the code path from dpm_connect_accept to the deadlock.
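> 
> For reference, the Publish/Lookup pattern the sandbox code relies on looks roughly like this (a simplified sketch with error checking omitted and an example service name, not the exact test code):
> 
>   /* publisher.c - the job that opens and advertises the port */
>   #include <mpi.h>
>   int main(int argc, char **argv) {
>       char port[MPI_MAX_PORT_NAME];
>       MPI_Comm inter;
>       MPI_Init(&argc, &argv);
>       MPI_Open_port(MPI_INFO_NULL, port);
>       MPI_Publish_name("sandbox-test", MPI_INFO_NULL, port);
>       MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>       MPI_Unpublish_name("sandbox-test", MPI_INFO_NULL, port);
>       MPI_Close_port(port);
>       MPI_Finalize();
>       return 0;
>   }
> 
>   /* looker.c - the job that discovers the port and connects */
>   #include <mpi.h>
>   int main(int argc, char **argv) {
>       char port[MPI_MAX_PORT_NAME];
>       MPI_Comm inter;
>       MPI_Init(&argc, &argv);
>       MPI_Lookup_name("sandbox-test", MPI_INFO_NULL, port);
>       MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>       MPI_Finalize();
>       return 0;
>   }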
> 
> In all cases, both processes get as far as ompi/dpm/dpm.c:398 (using git commit 5768336) and call into PMIx_Connect. Both then get as far as opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_connect.c:102 (same git hash), i.e. PMIX_WAIT_THREAD(&cb->lock);
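> 
> As far as I can tell, the blocking call is just a thin wrapper around the non-blocking one - roughly the following (paraphrased from memory of the pmix3x source, not a verbatim copy):
> 
>   pmix_cb_t *cb = PMIX_NEW(pmix_cb_t);
>   /* non-blocking connect; the completion callback sets cb->status and releases cb->lock */
>   rc = PMIx_Connect_nb(procs, nprocs, info, ninfo, op_cbfunc, cb);
>   PMIX_WAIT_THREAD(&cb->lock);   /* <- line 102, where both clients sit forever */
>   rc = cb->status;
>   PMIX_RELEASE(cb);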
> 
> What happens next is an infinite loop that (at least) emits a stream of “HASH:STORE rank -2 key pmix.<various>” output messages, burns a couple of CPUs at 100%, and shifts data on the local network at maximum bandwidth.
> 
> The macOS Activity Monitor shows two orterun processes at 50% CPU each and one orte-server process at 100%. Network usage statistics are in roughly the same proportion.
> 
> It’s going to be hard to make any further progress without knowing a starting point for the code doing the HASH:STORE operations. My guess is that it’s the PMIx progress threads trying to complete the PMIX_PTL_SEND_RECV operations pushed onto the event queues in PMIx_Connect_nb, but that isn’t helping me all that much.
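> 
> One option might be to attach a debugger to the orte-server process and break wherever that output comes from (pmix_hash_store in src/util/hash.c, if I’m reading the source right) to see who is driving the loop, e.g.:
> 
>   gdb -p <pid of orte-server>
>   (gdb) break pmix_hash_store
>   (gdb) continue
>   (gdb) backtrace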
> 
> Cheers,
> Dan.
> Dr Daniel Holmes PhD
> Applications Consultant in HPC Research
> d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 3415, JCMB, The King’s Buildings, Edinburgh, EH9 3FD
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>> 
>> On 20 Aug 2018, at 16:24, Ralph H Castain <rhc at open-mpi.org <mailto:rhc at open-mpi.org>> wrote:
>> 
>> Passing a port on the cmd line for accept/connect was never implemented as I don’t think anyone really cared. Given how OMPI uses PMIx for that operation, it shouldn’t be all that difficult to do. 
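>> 
>> (By “passing a port on the cmd line” I mean the pattern where the connecting side does something like the following with a port string copied from the other job’s output - sketch only:
>> 
>>   MPI_Comm inter;
>>   MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>> 
>> rather than going through Publish/Lookup.)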
>> 
>> As noted in the referenced issue, there was a problem last year with cross-mpirun connections. Not sure when I’ll have time to look at it.
>> 
>> Canceling the meeting today is fine with me - I got pulled away and didn’t get the PMIx Groups implementation done (sigh).
>> 
>> 
>>> On Aug 20, 2018, at 8:10 AM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>> wrote:
>>> 
>>> Hi Howard,
>>> 
>>> Thanks for the update. Sounds promising.
>>> 
>>> I'm trying to fix the test3.zip example from:
>>> https://github.com/open-mpi/ompi/issues/3458#issuecomment-322951227 <https://github.com/open-mpi/ompi/issues/3458#issuecomment-322951227>
>>> 
>>> If successful, this would extend the testing opportunities for the sandbox code to situations that involve more than one mpirun. The issue is definitely some sort of deadlock in PMIx but I’ve not figured it out completely yet.
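>>> 
>>> For reference, I’m launching it the way the issue describes - roughly as follows (option spellings from memory, binary names just placeholders):
>>> 
>>>   ompi-server --report-uri uri.txt
>>>   mpirun -np 1 --ompi-server file:uri.txt ./server &
>>>   mpirun -np 1 --ompi-server file:uri.txt ./client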
>>> 
>>> I’m cancelling the meeting today, unless anyone objects in the next 50 minutes.
>>> 
>>> Cheers,
>>> Dan.
>>> Dr Daniel Holmes PhD
>>> Applications Consultant in HPC Research
>>> d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>
>>> Phone: +44 (0) 131 651 3465
>>> Mobile: +44 (0) 7940 524 088
>>> Address: Room 3415, JCMB, The King’s Buildings, Edinburgh, EH9 3FD
>>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>>>> 
>>>> On 20 Aug 2018, at 15:55, Pritchard Jr., Howard <howardp at lanl.gov <mailto:howardp at lanl.gov>> wrote:
>>>> 
>>>> Hi Folks,
>>>> 
>>>> I’ll be on a plane at 11 AM MDT today so will not be able to call in.
>>>> 
>>>> I tried running the tests Dan had added/modified and observed the
>>>> same thing he did: one can’t have more than one outstanding
>>>> accept/connect in flight at a time or Open MPI’s ORTE gets confused.
>>>> I reduced this down to a simpler test which hangs with only 3 ranks
>>>> and am narrowing down what the issue is.
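>>>> 
>>>> To give a flavour of the shape of the reproducer (not the exact code,
>>>> just the pattern): three ranks, each using MPI_COMM_SELF, where rank
>>>> 1’s accept is still outstanding while rank 0 and rank 2 complete a
>>>> separate accept/connect pair:
>>>> 
>>>>   char port[MPI_MAX_PORT_NAME], portA[MPI_MAX_PORT_NAME], portB[MPI_MAX_PORT_NAME];
>>>>   MPI_Comm inter, interA, interB;
>>>>   /* rank 0: open port A, send it to rank 2, accept on it            */
>>>>   /* rank 1: open port B, send it to rank 2, accept on it            */
>>>>   /* rank 2: connect to A first (B's accept is outstanding here),    */
>>>>   /*         then connect to B                                       */
>>>>   if (rank == 0 || rank == 1) {
>>>>       MPI_Open_port(MPI_INFO_NULL, port);
>>>>       MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, 2, rank, MPI_COMM_WORLD);
>>>>       MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>>>>   } else {
>>>>       MPI_Recv(portA, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>       MPI_Recv(portB, MPI_MAX_PORT_NAME, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>       MPI_Comm_connect(portA, MPI_INFO_NULL, 0, MPI_COMM_SELF, &interA);
>>>>       MPI_Comm_connect(portB, MPI_INFO_NULL, 0, MPI_COMM_SELF, &interB);
>>>>   }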
>>>> 
>>>> I’ll be opening a PR with changes to chapter 8 of the standard and a
>>>> replacement for MPI_Get_Set_Names later this week.
>>>> 
>>>> Howard
>>>> 
>>>> -- 
>>>> Howard Pritchard
>>>> B Schedule
>>>> HPC-ENV
>>>> Office 9, 2nd floor Research Park
>>>> TA-03, Building 4200, Room 203
>>>> Los Alamos National Laboratory
>>>> 
>>> 
>> 
> 
