[mpiwg-sessions] MPICH/hydra happier with Dan's test cases (kind of)

HOLMES Daniel d.holmes at epcc.ed.ac.uk
Mon Sep 17 10:42:33 CDT 2018


Hi Howard, et al,

Howard: thanks for checking this on MPICH. Looks like we need a cross-mpirun example code - we could work on that this week.

All: *reminder* no telecon today - we’ll meet up during the MPI Forum in a couple of days.

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Applications Consultant in HPC Research
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 17 Sep 2018, at 05:39, Pritchard Jr., Howard via mpiwg-sessions <mpiwg-sessions at lists.mpi-forum.org> wrote:

Hi Folks,

I made some minor corrections to the two problem test cases and they now work
with mpich.  I opened a PR and assigned Dan as the reviewer.

Howard

--
Howard Pritchard
B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203
Los Alamos National Laboratory


From: mpiwg-sessions <mpiwg-sessions-bounces at lists.mpi-forum.org> on behalf of MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
Reply-To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
Date: Sunday, September 16, 2018 at 10:10 PM
To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
Cc: Howard Pritchard <howardp at lanl.gov>
Subject: [mpiwg-sessions] MPICH/hydra happier with Dan's test cases (kind of)

Hi Folks,

MPICH/hydra is happy with libCFG_noMCW, even with all the sleeps
removed. I ran up to 36 ranks with hydra/mpich and didn’t see a problem.

It's not so happy with the other tests; I think they are buggy.

Here’s what I get for libCFG_noMCW_multiport

hpp at sn-fey1:/usr/projects/hpctools/hpp/mpi_sessions_code_sandbox>mpiexec -n 8 ./libCFG_noMCW_multiport
process 1 (MPI_COMM_WORLD) now calling barrier on MPI_COMM_WORLD
process 3 (MPI_COMM_WORLD) now calling barrier on MPI_COMM_WORLD
process 4 (MPI_COMM_WORLD) calling from group thingy
rank 4 non-trivial use-case: target group size 4, localGroup size 1
rank 4 opening port
process 5 (MPI_COMM_WORLD) now calling barrier on MPI_COMM_WORLD
process 6 (MPI_COMM_WORLD) calling from group thingy
rank 6 non-trivial use-case: target group size 4, localGroup size 1
rank 6 opening port
process 7 (MPI_COMM_WORLD) now calling barrier on MPI_COMM_WORLD
process 0 (MPI_COMM_WORLD) calling from group thingy
rank 0 non-trivial use-case: target group size 4, localGroup size 1
process 2 (MPI_COMM_WORLD) calling from group thingy
rank 2 non-trivial use-case: target group size 4, localGroup size 1
rank 2 opening port
rank 2 opened port tag#0$description#sn-fey1.lanl.gov$port#55566$ifname#128.165.227.181$
rank 2 publishing port tag#0$description#sn-fey1.lanl.gov$port#55566$ifname#128.165.227.181$ using name foobar10 round 1
rank 4 opened port tag#0$description#sn-fey1.lanl.gov$port#45508$ifname#128.165.227.181$
rank 4 publishing port tag#0$description#sn-fey1.lanl.gov$port#45508$ifname#128.165.227.181$ using name foobar10 round 2
rank 6 opened port tag#0$description#sn-fey1.lanl.gov$port#52388$ifname#128.165.227.181$
rank 6 publishing port tag#0$description#sn-fey1.lanl.gov$port#52388$ifname#128.165.227.181$ using name foobar10 round 3
rank 2 published port tag#0$description#sn-fey1.lanl.gov$port#55566$ifname#128.165.227.181$ using name foobar10 round 1
rank 2 accepting on port tag#0$description#sn-fey1.lanl.gov$port#55566$ifname#128.165.227.181$ (localSize 1)
Fatal error in PMPI_Publish_name: Invalid service name (see MPI_Publish_name), error stack:
PMPI_Publish_name(134): MPI_Publish_name(service="foobar10 round 2", MPI_INFO_NULL, port="tag#0$description#sn-fey1.lanl.gov$port#45508$ifname#128.165.227.181$") failed
MPID_NS_Publish(67)...: Lookup failed for service name foobar10 round 2
Fatal error in PMPI_Publish_name: Invalid service name (see MPI_Publish_name), error stack:
PMPI_Publish_name(134): MPI_Publish_name(service="foobar10 round 3", MPI_INFO_NULL, port="tag#0$description#sn-fey1.lanl.gov$port#52388$ifname#128.165.227.181$") failed
MPID_NS_Publish(67)...: Lookup failed for service name foobar10 round 3


If I have a chance I'll play with this test and see if I can get it to work. Hmm... maybe hydra doesn't like whitespace in the published service names.

Howard


--
Howard Pritchard
B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203
Los Alamos National Laboratory

_______________________________________________
mpiwg-sessions mailing list
mpiwg-sessions at lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions
