[mpiwg-sessions] MPICH/hydra happier with Dan's test cases (kind of)

Ralph H Castain rhc at open-mpi.org
Wed Sep 19 10:15:01 CDT 2018


Dan/Howard: can you provide me with a simple reproducer of the problem for OMPI? I may have a little time to track it down, and I have an idea of the source of the trouble from Dan’s off-list note.

Ralph


> On Sep 17, 2018, at 8:42 AM, HOLMES Daniel via mpiwg-sessions <mpiwg-sessions at lists.mpi-forum.org> wrote:
> 
> Hi Howard, et al,
> 
> Howard: thanks for checking this on MPICH. Looks like we need a cross-mpirun example code - we could work on that this week.
> 
> All: *reminder* no telecon today - we’ll meet up during the MPI Forum in a couple of days.
> 
> Cheers,
> Dan.
> Dr Daniel Holmes PhD
> Applications Consultant in HPC Research
> d.holmes at epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>> 
>> On 17 Sep 2018, at 05:39, Pritchard Jr., Howard via mpiwg-sessions <mpiwg-sessions at lists.mpi-forum.org> wrote:
>> 
>> Hi Folks,
>> 
>> I made some minor corrections to the two problem test cases, and they now work
>> with MPICH. I opened a PR and assigned Dan as the reviewer.
>> 
>> Howard
>> 
>> -- 
>> Howard Pritchard
>> B Schedule
>> HPC-ENV
>> Office 9, 2nd floor Research Park
>> TA-03, Building 4200, Room 203
>> Los Alamos National Laboratory
>> 
>> 
>> From: mpiwg-sessions <mpiwg-sessions-bounces at lists.mpi-forum.org> on behalf of MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
>> Reply-To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
>> Date: Sunday, September 16, 2018 at 10:10 PM
>> To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
>> Cc: Howard Pritchard <howardp at lanl.gov>
>> Subject: [mpiwg-sessions] MPICH/hydra happier with Dan's test cases (kind of)
>> 
>>> Hi Folks,
>>> 
>>> MPICH/hydra is happy with libCFG_noMCW, even with all the sleeps
>>> removed. I ran up to 36 ranks with hydra/mpich and didn’t see a problem.
>>> 
>>> It's not so happy with the other tests; I think they are buggy.
>>> 
>>> Here’s what I get for libCFG_noMCW_multiport:
>>> 
>>> hpp at sn-fey1:/usr/projects/hpctools/hpp/mpi_sessions_code_sandbox>mpiexec -n 8 ./libCFG_noMCW_multiport
>>> process 1 (MPI_COMM_WORLD) now calling barrier on MPI_COMM_WORLD
>>> process 3 (MPI_COMM_WORLD) now calling barrier on MPI_COMM_WORLD
>>> process 4 (MPI_COMM_WORLD) calling from group thingy
>>> rank 4 non-trivial use-case: target group size 4, localGroup size 1
>>> rank 4 opening port
>>> process 5 (MPI_COMM_WORLD) now calling barrier on MPI_COMM_WORLD
>>> process 6 (MPI_COMM_WORLD) calling from group thingy
>>> rank 6 non-trivial use-case: target group size 4, localGroup size 1
>>> rank 6 opening port
>>> process 7 (MPI_COMM_WORLD) now calling barrier on MPI_COMM_WORLD
>>> process 0 (MPI_COMM_WORLD) calling from group thingy
>>> rank 0 non-trivial use-case: target group size 4, localGroup size 1
>>> process 2 (MPI_COMM_WORLD) calling from group thingy
>>> rank 2 non-trivial use-case: target group size 4, localGroup size 1
>>> rank 2 opening port
>>> rank 2 opened port tag#0$description#sn-fey1.lanl.gov$port#55566$ifname#128.165.227.181$
>>> rank 2 publishing port tag#0$description#sn-fey1.lanl.gov$port#55566$ifname#128.165.227.181$ using name foobar10 round 1
>>> rank 4 opened port tag#0$description#sn-fey1.lanl.gov$port#45508$ifname#128.165.227.181$
>>> rank 4 publishing port tag#0$description#sn-fey1.lanl.gov$port#45508$ifname#128.165.227.181$ using name foobar10 round 2
>>> rank 6 opened port tag#0$description#sn-fey1.lanl.gov$port#52388$ifname#128.165.227.181$
>>> rank 6 publishing port tag#0$description#sn-fey1.lanl.gov$port#52388$ifname#128.165.227.181$ using name foobar10 round 3
>>> rank 2 published port tag#0$description#sn-fey1.lanl.gov$port#55566$ifname#128.165.227.181$ using name foobar10 round 1
>>> rank 2 accepting on port tag#0$description#sn-fey1.lanl.gov$port#55566$ifname#128.165.227.181$ (localSize 1)
>>> Fatal error in PMPI_Publish_name: Invalid service name (see MPI_Publish_name), error stack:
>>> PMPI_Publish_name(134): MPI_Publish_name(service="foobar10 round 2", MPI_INFO_NULL, port="tag#0$description#sn-fey1.lanl.gov$port#45508$ifname#128.165.227.181$") failed
>>> MPID_NS_Publish(67)...: Lookup failed for service name foobar10 round 2
>>> Fatal error in PMPI_Publish_name: Invalid service name (see MPI_Publish_name), error stack:
>>> PMPI_Publish_name(134): MPI_Publish_name(service="foobar10 round 3", MPI_INFO_NULL, port="tag#0$description#sn-fey1.lanl.gov$port#52388$ifname#128.165.227.181$") failed
>>> MPID_NS_Publish(67)...: Lookup failed for service name foobar10 round 3
>>> 
>>> 
>>> If I have a chance I’ll play with this test and see if I can get it to work. Hmm, maybe hydra doesn’t like the whitespace in the publish names.
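
One way to test the whitespace theory would be to build the service names without any spaces before publishing them. The thread does not confirm that whitespace is the cause, so treat this as a rough sketch: make_service_name is a hypothetical helper (not part of the test suite) that produces names such as foobar10_round_2 in place of "foobar10 round 2".

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not taken from the sessions test cases.
 * Builds "<base>_round_<n>" in out and replaces any whitespace that
 * came in through base with '_'.
 * Returns 0 on success, -1 if out is too small to hold the name. */
static int make_service_name(char *out, size_t out_len,
                             const char *base, int round)
{
    int n = snprintf(out, out_len, "%s_round_%d", base, round);
    if (n < 0 || (size_t)n >= out_len) {
        return -1;  /* formatting error or truncated name */
    }
    for (char *p = out; *p != '\0'; ++p) {
        if (isspace((unsigned char)*p)) {
            *p = '_';
        }
    }
    return 0;
}
```

The resulting buffer would then be passed as the service_name argument to MPI_Publish_name (and to the matching MPI_Lookup_name on the connecting side), so that both sides agree on the whitespace-free name.
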
>>> 
>>> Howard
>>> 
>>> 
>>> -- 
>>> Howard Pritchard
>>> B Schedule
>>> HPC-ENV
>>> Office 9, 2nd floor Research Park
>>> TA-03, Building 4200, Room 203
>>> Los Alamos National Laboratory
>>> 
>> _______________________________________________
>> mpiwg-sessions mailing list
>> mpiwg-sessions at lists.mpi-forum.org
>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions
> 
