<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Pavan is correct; the program is buggy. Here’s an example<div class=""><br class=""></div><div class="">process 1 process 2</div><div class="">start(recv) /* something causes a delay at process 2 */</div><div class="">start(rsend)</div><div class="">wait(all)</div><div class=""><br class=""></div><div class=""> start(recv)</div><div class=""> ….</div><div class=""><br class=""></div><div class="">In this case, the rsend on process 1 occurs before the recv is started on process 2, and the MPI program is incorrect. Without some synchronization, either explicit or implicit (e.g., an allreduce for time step control), the use of Rsend in any form is unlikely to be correct.</div><div class=""><br class=""></div><div class="">Bill</div><div class=""><br class=""><div class=""><br class="webkit-block-placeholder"></div><div class=""><br class="webkit-block-placeholder"></div><div class="">
<div style="color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div style="color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div style="color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;">William Gropp<br class="">Director and Chief Scientist, NCSA<br class="">Thomas M. Siebel Chair in Computer Science<br class="">University of Illinois Urbana-Champaign</div><br class="Apple-interchange-newline"></div></div><br class="Apple-interchange-newline"></div><br class="Apple-interchange-newline"></div><br class="Apple-interchange-newline"><br class="Apple-interchange-newline">
</div>
<br class=""><div><blockquote type="cite" class=""><div class="">On Nov 7, 2018, at 12:14 PM, Balaji, Pavan via mpi-forum <<a href="mailto:mpi-forum@lists.mpi-forum.org" class="">mpi-forum@lists.mpi-forum.org</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Brian,</span><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br class=""></div><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">Assuming all processes are doing the same code as below, I think the user program is incorrect and you were just getting lucky with the other implementations.</div><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br class=""></div><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">Specifically, there’s nothing stopping the rsend from a process to reach the other process before it posted the corresponding recv. For example, it might still be in the second wait all from the previous iteration.</div><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br class=""></div><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""> — Pavan<br class=""><br class=""><div dir="ltr" class="">Sent from my iPhone</div><div dir="ltr" class=""><br class="">On Nov 7, 2018, at 12:09 PM, Smith, Brian E. via mpi-forum <<a href="mailto:mpi-forum@lists.mpi-forum.org" style="color: rgb(149, 79, 114); text-decoration: underline;" class="">mpi-forum@lists.mpi-forum.org</a>> wrote:<br class=""><br class=""></div><blockquote type="cite" class=""><div dir="ltr" class=""><div class="WordSection1" style="page: WordSection1;"><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">Hi all,</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> <o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="" class="">(trying again; I thought this address was subscribed to the list but maybe not. Sorry if this is a duplicate)<o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="" class=""><o:p class=""> </o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">I have a user-provided code that uses persistent ready sends. (Don’t ask. I don’t have an answer to “why?”. Maybe it actually helped on some machine sometime in the past?)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">Anyway, the app fails on Titan fairly consistently (95+% failure) but works on most other platforms (BGQ, Summit, generic OMPI cluster, generic Intel MPI cluster).</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">Note – I haven’t tried as many times on the other platforms as on Titan so maybe it might fail on one of them occasionally. I saw zero failures in my testing however.</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">The code is basically this:</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Recv_init()</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Rsend_init()</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">While(condition)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">{</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> MPI_Start(recv_request)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> MPI_Start(rsend_request)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> MPI_Waitall(both requests)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> Twiddle_sendbuf_slightly();</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">}</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Request_free(recv_request)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Request_free(rsend_request)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Cart_shift(rotate source/dest different direction now)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Recv_init() // sending the other direction now, basically</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Rsend_init()</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">While(condition)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">{</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> MPI_Start(recv_request)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> MPI_Start(rsend_request)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> MPI_Waitall(both requests)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> Twiddle_sendbuf_slightly();</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">}</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Request_free(recv_request)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">MPI_Request_free(rsend_request)</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">Is this considered a “correct program”? There’s only a couple paragraphs on persistent sends in 800+ pages of standard, and not much more for nonblocking ready sends (which is essentially what this becomes). It’s pretty vague territory.<span class="Apple-converted-space"> </span></span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">I tried splitting the Waitall() into 2 Wait()s, explicitly waiting on the Recv request first, then the Rsend request. However, this still fails and suggests the requests are not happening in order:</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif; background-color: white;" class=""><span style="font-size: 8.5pt; font-family: Menlo;" class="">Rank 2 [Wed Nov 7 08:26:12 2018] [c5-0c0s3n1] Fatal error in PMPI_Wait: Other MPI error, error stack:<o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif; background-color: white;" class=""><span style="font-size: 8.5pt; font-family: Menlo;" class="">PMPI_Wait(207).....................: MPI_Wait(request=0x7fffffff5698, status=0x7fffffff5630) failed<o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif; background-color: white;" class=""><span style="font-size: 8.5pt; font-family: Menlo;" class="">MPIR_Wait_impl(100)................: <o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif; background-color: white;" class=""><span style="font-size: 8.5pt; font-family: Menlo;" class="">MPIDI_CH3_PktHandler_ReadySend(829): Ready send from source 1 and with tag 1 had no matching receive<o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif; background-color: white;" class=""><span style="font-size: 8.5pt; font-family: Menlo;" class=""><o:p class=""> </o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">It strongly looks like the send is not always posted before the receive, or at least the waitall completes the send sometimes before the recv. I suspect that means an implementation bug. Cray might actually be doing something for optimizing either persistent communications or ready sends (or both) that we never did on BGQ (so it’s not necessarily an MPICH vs OMPI difference at least) <o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="font-size: 8.5pt; font-family: Menlo;" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">Thoughts?</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">I’ll open a bug with them at some point but wanted to verify semantics first.</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> </span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class="">Thanks</span><span style="" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""> <o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="" class="">Brian Smith<o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="" class="">Oak Ridge Leadership Computing Facility<o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 11pt;" class=""><o:p class=""> </o:p></span></div></div></div></blockquote><blockquote type="cite" class=""><div dir="ltr" class=""><span class="">_______________________________________________</span><br class=""><span class="">mpi-forum mailing list</span><br class=""><span class=""><a href="mailto:mpi-forum@lists.mpi-forum.org" style="color: rgb(149, 79, 114); text-decoration: underline;" class="">mpi-forum@lists.mpi-forum.org</a></span><br class=""><span class=""><a href="https://lists.mpi-forum.org/mailman/listinfo/mpi-forum" style="color: rgb(149, 79, 114); text-decoration: underline;" class="">https://lists.mpi-forum.org/mailman/listinfo/mpi-forum</a></span><br class=""></div></blockquote></div><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">_______________________________________________</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">mpi-forum mailing list</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><a href="mailto:mpi-forum@lists.mpi-forum.org" style="color: rgb(149, 79, 114); text-decoration: underline; font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">mpi-forum@lists.mpi-forum.org</a><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><a href="https://lists.mpi-forum.org/mailman/listinfo/mpi-forum" style="color: rgb(149, 79, 114); text-decoration: underline; font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">https://lists.mpi-forum.org/mailman/listinfo/mpi-forum</a></div></blockquote></div><br class=""></div></body></html>