[mpiwg-p2p] [mpiwg-tools] Message matching for tools

Marc-Andre Hermanns hermanns at jara.rwth-aachen.de
Fri Dec 18 04:18:31 CST 2015

Hi Kento,

>> We analyze on the same scale as the measurement, thus we have one
>> thread per thread-local trace. Each thread processes its own 
>> thread-local trace. When encountering a communication event, it 
>> re-enacts this communication using the recorded communication 
>> parameters (rank, tag, comm). A send event leads to an issued 
>> send, a receive event leads to an issued receive.
> (1) Replaying receive events
> Papers about “parallel replay” (or “record-and-replay”) use
> (rank, tag, comm) for correct replay of message receive orders.
> Unfortunately, (rank, tag, comm) cannot replay message receive
> orders *in general*, even in MPI_THREAD_SINGLE (of course, it may
> work in particular cases). You need to record
> (rank, message_id_number); (tag, comm) does not work for this
> purpose. The details are described in Section 3.1 of this paper
> ( http://dl.acm.org/citation.cfm?id=2807642 ).

I think we (Scalasca) do not have requirements as strict as the ones
outlined in the paper. We need only to ensure that the same
send/receive pairs also exchange data during the analysis (ideal case)
or that we can at least detect a mismatch (in case of logically
concurrent messages) and fix it locally, by exchanging the mixed up
message payloads.
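
To make the "detect and fix locally" idea concrete, here is a minimal
sketch in plain Python (no MPI, and not Scalasca code). It assumes
that every message carries some tool-level id and that the trace
recorded which id each receive event matched. Both the id and the
function name are hypothetical and only meant as illustration.

```python
def reassign_payloads(expected_ids, received):
    """Detect mismatched receives and hand out payloads in trace order.

    expected_ids: ids recorded in the trace, one per receive event
                  (hypothetical trace field).
    received:     (msg_id, payload) pairs in the order the replay
                  receives actually matched them.
    Assumes both sides contain the same set of ids.
    """
    by_id = {msg_id: payload for msg_id, payload in received}
    # Positions where the replay matched a different message than the trace.
    mismatched = [
        i for i, (want, (got, _)) in enumerate(zip(expected_ids, received))
        if want != got
    ]
    # Local fix: give each receive event the payload it originally got.
    fixed = [by_id[want] for want in expected_ids]
    return fixed, mismatched
```

For two logically concurrent messages that arrive swapped, both
receive positions are reported as mismatched and the payloads are
returned in the order the trace expects.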

If I understand Section 3.1 correctly, the out-of-order receives
(Figure 3) do not pose a problem in our case, as we only care that
msg1 is matched by req1 and msg2 is matched by req2, both during
measurement and during replay. MPI ordering semantics should take
care of that.
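
As a toy model of the ordering rule I am relying on (plain Python,
hypothetical names, no wildcards, and all messages assumed to be
pending before the receives are posted): each receive, in post order,
matches the earliest-sent pending message with the same
(source, tag, comm) envelope.

```python
def match(messages, receives):
    """Match receives to messages under a simplified non-overtaking rule.

    messages: (name, source, tag, comm) tuples in send order.
    receives: (name, source, tag, comm) tuples in post order.
    Returns a dict mapping receive name -> matched message name.
    """
    pending = list(messages)
    matched = {}
    for rname, rsrc, rtag, rcomm in receives:
        for i, (mname, msrc, mtag, mcomm) in enumerate(pending):
            # The earliest pending message with an identical envelope wins.
            if (msrc, mtag, mcomm) == (rsrc, rtag, rcomm):
                matched[rname] = mname
                del pending[i]
                break
    return matched
```

With two messages from the same sender on the same tag and
communicator, req1 always gets msg1 and req2 always gets msg2. With
distinct tags, the pairing is fixed by the tags even if the receives
are posted in the opposite order.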

> (2) Replaying send events
> Sorry if you’re already aware of this. In some applications, the
> usage of send calls, the destination, and the message payload can
> change across different runs. So you also need to record ALL
> non-deterministic operations that affect MPI send behavior. One
> example is the seed value passed to void srand(unsigned int seed)
> for random numbers. Time-related functions such as gettimeofday(),
> and thereby MPI_Wtime(), are another example.

I think for serial applications we are fine with what MPI
provides/guarantees in message ordering semantics as outlined above.
It is just in situations with logically concurrent messages that our
level of replay may break down.

>> b) Creating a derived datatype on the fly to add tool-level data 
>> to the original payload may induce a large overhead in 
>> practically _every_ send & receive operation and perturb the 
>> measurement.
> Yes, if it’s for performance analysis, it’ll somehow perturb the 
> measurement. This paper will help you to see if the piggybacking 
> overhead is acceptable or not. 
> http://greg.bronevetsky.com/papers/2008EuroPVM.pdf .

Thanks for the pointer. I was aware of the paper, but it has been
quite a while since I read it. As you pointed out, separate messages
are out of the question. For large buffers, I'd think packing would
also be quite disadvantageous, as I'd need a malloc before the pack,
right? So the only option left would be the datatype (struct), as
also suggested by Jeff in a different reply.

My concern here was not only that the construction is costly (which
Jeff showed it need not be), but also that I turn _any_ contiguous
datatype into a non-contiguous datatype, and some NICs may perform
significantly differently with those. The paper seems to at least
suggest this with its use of ad-hoc datatypes.
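
To illustrate the trade-off between the two remaining options, here
is a rough sketch in plain Python rather than MPI. The 8-byte header
and both function names are assumptions made up for this example.
The first function mimics packing (an extra allocation plus a copy of
the payload), the second mimics a struct-like description (two
separate memory regions, no payload copy, but no longer one
contiguous buffer).

```python
import struct

# Hypothetical 8-byte tool header, e.g. a message id.
HEADER = struct.Struct("<Q")

def pack_copy(msg_id, payload):
    """Packing variant: new buffer, header and payload copied into it."""
    buf = bytearray(HEADER.size + len(payload))  # the extra allocation
    HEADER.pack_into(buf, 0, msg_id)
    buf[HEADER.size:] = payload                  # the extra payload copy
    return buf

def as_segments(msg_id, payload):
    """Struct-like variant: two regions; the payload is only referenced."""
    return [memoryview(HEADER.pack(msg_id)), memoryview(payload)]
```

The packed variant stays contiguous at the price of a buffer as large
as the payload. The segment variant avoids the copy but hands the
transport two disjoint regions, which is exactly the property I am
worried about on some NICs.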

Marc-Andre Hermanns
Jülich Aachen Research Alliance,
High Performance Computing (JARA-HPC)
Jülich Supercomputing Centre (JSC)

Schinkelstrasse 2
52062 Aachen

Phone: +49 2461 61 2509 | +49 241 80 24381
Fax: +49 2461 80 6 99753
email: hermanns at jara.rwth-aachen.de
