[mpiwg-sessions] will be on a plane today - and some observations

HOLMES Daniel d.holmes at epcc.ed.ac.uk
Mon Aug 20 18:26:22 CDT 2018


Hi Ralph,

Unfortunately, the fact that “test3” passes the “portname” on the command line is a red herring: connect/accept also fails when the “portname” is advertised by the job that opens the port using MPI_Publish_name and discovered by the other job using MPI_Lookup_name. This is the test case that the sandbox code relies on. I also modified the “test3” example to use Publish/Lookup, to figure out whether there was any difference in the internal handling (inside Open MPI and/or inside PMIx) between these situations. There is no difference in the final outcome or in the code path from dpm_connect_accept to the deadlock.

In all cases, both processes get as far as ompi/dpm/dpm.c:398 (using git commit 5768336) and call into PMIx_Connect. They both then get as far as opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_connect.c:102 (same git hash), i.e. PMIX_WAIT_THREAD(&cb->lock);

What happens next is an infinite loop that (at least) emits a whole bunch of “HASH:STORE rank -2 key pmix.<various>” output messages, burns a couple of CPUs at 100%, and shifts data on the local network at max bandwidth.

The macOS Activity Monitor shows two orterun processes at 50% CPU each and one orte-server process at 100%. Network usage statistics are roughly in the same proportion.

It’s going to be hard to progress any further without knowing a starting point for the code doing the HASH:STORE operations. I guess it’s the PMIx progress threads trying to complete the PMIX_PTL_SEND_RECV operations pushed onto the event queues in PMIx_Connect_nb, but that isn’t helping me all that much.

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Applications Consultant in HPC Research
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 3415, JCMB, The King’s Buildings, Edinburgh, EH9 3FD
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 20 Aug 2018, at 16:24, Ralph H Castain <rhc at open-mpi.org> wrote:

Passing a port on the cmd line for accept/connect was never implemented as I don’t think anyone really cared. Given how OMPI uses PMIx for that operation, it shouldn’t be all that difficult to do.

As noted in the referenced issue, there was a problem last year with cross-mpirun connections. Not sure when I’ll have time to look at it.

Canceling the meeting today is fine with me - I got pulled away and didn’t get the PMIx Groups implementation done (sigh).


On Aug 20, 2018, at 8:10 AM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk> wrote:

Hi Howard,

Thanks for the update. Sounds promising.

I'm trying to fix the test3.zip example from:
https://github.com/open-mpi/ompi/issues/3458#issuecomment-322951227

If successful, this would extend the testing opportunities for the sandbox code to situations that involve more than one mpirun. The issue is definitely some sort of deadlock in PMIx but I’ve not figured it out completely yet.

I’m cancelling the meeting today, unless anyone objects in the next 50 minutes.

Cheers,
Dan.

On 20 Aug 2018, at 15:55, Pritchard Jr., Howard <howardp at lanl.gov> wrote:

Hi Folks,

I’ll be on a plane at 11 AM MDT today so will not be able to call in.

I tried running the tests Dan had added/modified and observed
what he did: one can’t allow more than one outstanding
accept/connect at a time or Open MPI’s ORTE gets confused.
I reduced this down to a simpler test, which hangs with only 3 ranks,
and am narrowing down what the issue is.

I’ll be opening a PR with changes to chapter 8 of the standard and
a replacement for MPI_Get_Set_Names later this week.

Howard

--
Howard Pritchard
B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203
Los Alamos National Laboratory

_______________________________________________
mpiwg-sessions mailing list
mpiwg-sessions at lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions

