[mpiwg-tools] PMPI for a complete MPI wrapper

Max Sagebaum max.sagebaum at scicomp.uni-kl.de
Fri Oct 13 02:33:40 CDT 2017


Thank you for the confirmation.

If that is the case, I could also change the logic: if the request is already finished after the command is executed, I could perform the cleanup steps immediately and would not need to store any additional data for the request.
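A minimal sketch of that changed logic, using stand-in types only (no MPI calls). The names `Cleanup` and `finishOrDefer` are hypothetical, and the `alreadyComplete` flag is an assumption: in a real wrapper it could come from a query such as MPI_Request_get_status on the freshly created request.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical name for the cleanup/postprocessing step of one operation.
using Cleanup = std::function<void()>;

// Runs the cleanup right away when the operation has already finished.
// Otherwise the cleanup is queued so that a Wait/Test wrapper can run it later.
// Returns true if the cleanup ran immediately, false if it was deferred.
bool finishOrDefer(bool alreadyComplete, Cleanup cleanup,
                   std::vector<Cleanup>& deferred) {
  assert(cleanup && "cleanup must be callable");
  if (alreadyComplete) {
    cleanup();  // nothing has to be stored for this request
    return true;
  }
  deferred.push_back(std::move(cleanup));
  return false;
}
```

In the immediate case nothing is recorded for the request, so it does not matter which handle the library returned for it.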

Maybe I will have time to do a test implementation on the weekend or next week and test it with some other MPI implementations.

Cheers

Max

On Thu, 2017-10-12 at 17:01 -0400, George Bosilca wrote:


On Wed, Oct 11, 2017 at 7:57 AM, Jean-Baptiste BESNARD <jbbesnard at paratools.fr> wrote:
Using the request Fortran ID is quite original indeed, I did not think of that!
I’m unsure why Open MPI yields ‘1’; maybe because it buffered the data directly, which allows it to free the request immediately? That optimization is not possible with MPI_Issend.
You may try with larger messages; that could give you the request handle.


I wanted to confirm this. The only valid information in the status of an isend is whether the request completed or was cancelled. So once the user buffer can safely be modified, it is legal to return a predefined status that corresponds to a completed request (it is too late to cancel at this point). Thus, Open MPI returns an internal request that matches the empty status. This explains why you consistently get the f2c index 1.

  George.


You may also hash the request to get an ID?

To summarize, it appears that the only possibility is complete MPI wrapping (of the functions where requests are involved) with a custom request type, but this requires you to recompile the target program with your own request type.

We have AFAIK the following alternatives for request wrapping:

- Extended Generalized Requests: not standard (only saves Wait*/Test* wrapping, avoids custom requests)
- Generalized Requests: not practical on the polling side, since you need to spawn a thread or populate a list to progress MPI handles, which means THREAD_MULTIPLE (only saves Wait*/Test* wrapping, avoids custom requests)
- Using C2F functions / a request hash. However, you observed that MPI functions sometimes return (completed?) requests, which prevents matching, for example, an Isend with its Wait (needs complete wrapping, avoids custom requests)
- Could you use MPI-T variables bound to MPI_Request objects? I don’t think runtimes offer such variables yet (needs complete wrapping, avoids custom requests)
- Complete wrapping and a custom MPI_Request object (should work, needs recompilation)

Others may have good solutions I missed, but for me at this point only the last model seems reliable enough. Recompilation could be avoided with a reliable way of « tagging » a standard request.

Cheers,

Jean-Baptiste.

On 10 Oct 2017, at 12:27, Max Sagebaum <max.sagebaum at scicomp.uni-kl.de> wrote:

Hello Jean-Baptiste,

Sorry, I forgot to answer you yesterday. I had a look at the example project and it looks OK. My problem is that the wrapper will be a general one, released as open source. Therefore I cannot use techniques that are not part of the MPI library. It is even problematic if the basic technique that I use in the wrapper is not supported by older versions of the standard. So unfortunately I cannot use these extended generic requests.

While writing this message I got another idea for how to associate data with a request. The function MPI_Request_c2f converts a C handle into an integer, which needs to be unique in some way so that the implementation can convert it back. I could then use this index to store the additional information in an array or a hashmap.

I tested this with the Open MPI implementation on my local machine and it looked promising. The only problem was that the handle from MPI_Isend was always 1, and therefore I could not associate the data. Since the standard leaves it to the implementation how Send and Isend are handled, I tried switching to Issend, and there it worked.
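The index-based association can be sketched as a small side table. This is only an illustration under assumptions: `RequestTable`, `attach` and `complete` are hypothetical names, and the integer key stands in for the value that MPI_Request_c2f would return for a request. The sketch contains no MPI calls.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical side table: maps the integer obtained from MPI_Request_c2f
// to the postprocessing step that must run once the request completes.
class RequestTable {
public:
  using Callback = std::function<void()>;

  // Stores the step under the given key. Returns false if the key is already
  // in use, e.g. when the same handle value is reported for several operations.
  bool attach(int fortranHandle, Callback postprocess) {
    assert(postprocess && "postprocess must be callable");
    return pending_.emplace(fortranHandle, std::move(postprocess)).second;
  }

  // Runs and removes the stored step. Returns false for unknown keys.
  bool complete(int fortranHandle) {
    auto it = pending_.find(fortranHandle);
    if (it == pending_.end()) return false;
    Callback postprocess = std::move(it->second);
    pending_.erase(it);
    postprocess();
    return true;
  }

  std::size_t size() const { return pending_.size(); }

private:
  std::unordered_map<int, Callback> pending_;
};
```

The `false` result of `attach` is where the observation with MPI_Isend would show up: two outstanding operations that both report index 1 cannot be told apart by key alone.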

Could this be a possible avenue to associate some data with a request object?

Cheers

Max

On Fri, 2017-10-06 at 17:34 +0200, Jean-Baptiste BESNARD wrote:
Hi Max,

My idea was that you would use Extended Generic Requests, which provide a « poll_fn »; this function avoids the need for progress threads, which indeed have a lot of drawbacks.
In practice the MPI runtime will call the « poll_fn » when progressing the request (only in Wait & Test). This makes the use of this request abstraction much simpler.
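To illustrate only the control flow of such a poll function, here is a sketch with stand-in types. It is not the extended generalized request API: `PolledRequest`, `testPolled` and `waitPolled` are hypothetical names, and the two functions merely imitate what a runtime's Test and Wait would do on the caller's thread.

```cpp
#include <cassert>
#include <functional>

// Stand-in for a request whose progress is driven by a user callback.
struct PolledRequest {
  std::function<bool()> poll;        // returns true once the inner operation is done
  std::function<void()> onComplete;  // postprocessing, run exactly once
  bool done = false;
};

// Imitates Test: poll once, and finish the request if the poll succeeded.
bool testPolled(PolledRequest& r) {
  assert(r.poll && "poll callback must be set");
  if (!r.done && r.poll()) {
    r.done = true;
    if (r.onComplete) r.onComplete();
  }
  return r.done;
}

// Imitates Wait: keep polling on the calling thread until the request is done.
void waitPolled(PolledRequest& r) {
  while (!testPolled(r)) {
    // a real runtime would progress its other communication here
  }
}
```

Because the polling happens inside Wait/Test, no helper thread exists and the postprocessing runs on the thread that called Wait.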

I’ve taken a few minutes to make a (quick) example of their usage for request wrapping, as I think they are mostly used inside ROMIO for now:
https://github.com/besnardjb/egreq_example

As extended generic requests are not standard, you cannot be sure of finding them in all MPI implementations. Here are those I know of:

- MPICH has them (I checked in 3.2 and probably OK in all of its derivatives)
- MPC has them since 2.5.2 (as I implemented them :-))
- I did not find them in OpenMPI

Even between these two implementations you'll find differences: I followed the aforementioned paper, whereas MPICH, for example, takes the wait_fn as an additional parameter of the _start call.

Cheers,

Jean-Baptiste.

On 6 Oct 2017, at 12:06, Max Sagebaum <max.sagebaum at scicomp.uni-kl.de> wrote:

Hello Jean-Baptiste,

thanks for the pointer to the generalized requests. I had not yet had a look at them.

I can carry my additional data in this data structure, but I am not quite sure about the overhead. In order to implement this, I need to tell MPI via grequest_complete that the request is completed. Since I want to wrap the usual asynchronous requests, this would yield the following layout:

1. Start the grequest
2. Start a thread that executes the grequest
3. In the thread:
   I. Start the regular request
   II. Wait until the regular request is finished
   III. Signal MPI that the grequest is complete
4. User calls Wait/Test etc. on the grequest
5. The free call performs the postprocessing (this needs to be done in the main thread, otherwise race conditions have to be handled; it might also be possible in the extra thread, but then I would need to lock my main data structure, which introduces overhead for the whole application)

With this implementation I would create a thread for each asynchronous MPI call. The threads would be idle. Can this have an impact on overall performance?
I could imagine that it is also possible to have one busy thread that tests all wrapped asynchronous requests. But the busy thread could slow down a cluster node quite strongly.

What's your opinion on this matter?

Cheers

Max

On Thu, 2017-10-05 at 15:46 +0200, Jean-Baptiste BESNARD wrote:
Hi Max,

Thank you very much for these details, I must admit that I need a bit more time to fully understand the scenario :)

However, considering that you want to « wrap » requests, could the Generalized Request interface (Section 12.2 of the standard), or the extended one (also widespread because of its use by ROMIO), be of any use to create request objects which then point to your own requests through the extra state parameter?

For extended generalized requests see : http://www.mcs.anl.gov/uploads/cels/papers/P1417.pdf

Thanks,

Jean-Baptiste.

On 5 Oct 2017, at 15:25, Max Sagebaum <max.sagebaum at scicomp.uni-kl.de> wrote:

Hi @ all,

thanks for all the input. From what I gather from the discussion, a 'classic' wrapper - as mentioned by Marc (wrap functions only, leave types intact) - is no problem to generate. I agree on that.
For a complete wrapper (wrap functions and redefine types), a new ABI needs to be defined.

If I aim for the new ABI, I will have a look at the wi4mpi project since they have already done this. I could link to their "interface" mode. I will post a message to Marc and Marc-Andre and the wi4mpi list if I have any problems here.

But first I would like to tell you a little more about the why, as Marc-Andre has rightfully asked.

We are doing Algorithmic Differentiation which has three consequences for MPI communication:
- We need to store data for each MPI communication such as MPI_Send, MPI_Recv, etc.
- Buffers need to be pre- and postprocessed
- For each MPI communication there is a reverse communication.

The pre- and postprocessing part is the problematic bit. We need to do it, since we are using new structures to represent the floating point types.
This can be for example:
struct AReal {
  double value;
  int index;
};

In this example the postprocessing requires the index to be adapted to the new machine, since the index is a kind of pointer for AD. So after the buffer is received, the index of every AReal entry needs to be renewed.
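A minimal sketch of such a renewal step, under a loudly stated assumption: here every received entry simply gets a fresh identifier from a counter owned by the receiver. The function name `renewIndices` and the counter are hypothetical; a real AD tool may manage its indices quite differently.

```cpp
#include <cassert>
#include <cstddef>

struct AReal {
  double value;
  int index;
};

// Hypothetical postprocessing: the indices that arrived with the buffer belong
// to the sender, so each entry is given a new identifier from the receiver's
// own counter. The values themselves are left untouched.
void renewIndices(AReal* buffer, std::size_t count, int& nextLocalIndex) {
  assert(buffer != nullptr || count == 0);
  for (std::size_t i = 0; i < count; ++i) {
    buffer[i].index = nextLocalIndex++;
  }
}
```

For a blocking receive this could be called directly after the receive returns; for a non-blocking one it has to wait until the request has completed.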
If I do this for a MPI_Recv, there are no problems since I can do everything inside of the routine.
If a MPI_Irecv is called, I can only modify the buffer after the Request is finished (e.g. in the Wait call). My design is now to define a new request:
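typedef void (*Func)(void* data); // assumed signature, inferred from its use in AMPI_Wait
struct AMPI_Request {
  void* data;          // data stored for this communication
  Func func;           // postprocessing callback
  MPI_Request request; // the original request
};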

My implementation of wait would then be:
int AMPI_Wait(AMPI_Request* request) {
  // PMPI_Wait expects a pointer to the request and a status argument
  int r = PMPI_Wait(&request->request, MPI_STATUS_IGNORE);

  request->func(request->data); // perform the postprocessing
  return r;
}

Because of the structure AMPI_Request I need to include mpi.h to have the original MPI_Request available, and I need to modify all functions where MPI_Request is used.
The same technique is used so far for MPI_Op and MPI_Datatype.

I hope this explains, why I would like to have PMPI definitions for MPI_Request, MPI_Datatype, etc.

I can still change the design of my implementation, so I am also open for pointers how to avoid the redefinition of MPI_Request.

Cheers

Max

On Wed, 2017-10-04 at 16:22 +0000, Marc.PERACHE at CEA.FR wrote:

Hi Max,

As Marc-André said, wi4mpi was designed to avoid the recompilation of large applications that is required each time you change the underlying MPI implementation. Basically, in "preload" mode wi4mpi allows changing the internal representation of all MPI types declared in mpi.h without recompiling the application. wi4mpi also provides its own MPI interface and translates types to the underlying MPI implementation in "interface" mode. In "interface" mode, you'll have to recompile your application. Currently, wi4mpi supports bi-directional ABI conversion for OpenMPI, MPICH, IntelMPI, Spectrum MPI and the wi4mpi ABI. By the end of the year we will add the MPC ABI.

If I understand correctly what you want to do, wi4mpi can provide the glue between your API (i.e. enriched MPI types) used by the application and the underlying MPI implementation. In this case, you'll have to recompile your application but it doesn't require code modification in the application. If you want to avoid application recompilation you'll need to modify wi4mpi internals. If you have questions on wi4mpi, we should take this off the list and keep everyone else in CC.

Regards,
Marc

-----Original Message-----
From: Marc-Andre Hermanns [mailto:hermanns at jara.rwth-aachen.de]
Sent: Wednesday, 4 October 2017 15:51
To: mpiwg-tools at lists.mpi-forum.org; Max Sagebaum
Cc: PERACHE Marc 600952
Subject: Re: [mpiwg-tools] PMPI for a complete MPI wrapper

Hi Max,



thanks for the fast answer. With the pmpi.h I mean a file like mpi.h
but only containing the PMPI_ interface. As you suggested, I might try
to create a full wrapper myself. I took a look at the wi4mpi project,
and their approach seems to be to create their own interface, i.e. their
own mpi.h, and then wrap this to the Intel MPI or Open MPI
implementation. Due to this approach, they know the data types and can
generate the interface.



For the wi4mpi project, Jean-Baptiste and people from CEA may be the
right people to talk to.

wi4mpi is a bit of a special project, as it provides a software
'glue' to make MPI implementations interchangeable. It is a way to
overcome the missing ABI (i.e., a specification of types, etc.).

Usually, the users will have to recompile their application every time
they choose a different MPI (potentially also when using a different
version of the same MPI), as values and types in the mpi.h may have
changed. For large simulation codes, this can take a long time. When
you have a translating 'glue' like wi4mpi in between, you can swap MPI
implementations via LD_PRELOAD at the start time of the application.

I don't know enough about wi4mpi to really know what their goal is:
Have a mixed MPI run (e.g., couple two codes compiled against different
MPIs)? Use a library compiled for one MPI together with an application
compiled against another? Just make it easier for users to link
against the right MPI? All of the above?

@Marc? Any comments on what the design goal of wi4mpi is? Does it
support other MPI implementations beside Intel-MPI and Open-MPI?

(If this discussion drifts more towards 'wi4mpi' specifics, we should
take this off the list and keep everyone else in CC)



In my library I wanted to use a lightweight wrapper. That is, I wanted
to use the original data types. With this approach I currently have
structures like:

struct AMPI_Comm {
  // my own data;
  MPI_Comm comm; // the original object
};

I can then simply call the pmpi functions with the stored original
object.
If I have a wrapper such that there is a PMPI_Comm object available, I
could do the following:
struct MPI_Comm {
  // my own data;
  PMPI_Comm comm; // the original object
};

If the wrapper should use the same types from a general mpi.h, then I
do not know the types and would need to declare something like:

hidden_mpi.cpp (decltype requires C++11)
#include <mpi.h>
decltype(MPI_COMM_WORLD) PMPI_COMM_WORLD = MPI_COMM_WORLD;

and then I need to use PMPI_COMM_WORLD in my library and I can
generate a hmpi.h which contains lines like:
#define MPI_COMM_WORLD PMPI_COMM_WORLD

Which could be included by the user. But in order to use PMPI_* in my
library, I need to specify the symbol in a header file for which I
need the type. In order to get the type I need to include mpi.h which
will define MPI_COMM_WORLD and I have a name clash.



mpi.h only _declares_ the prototype. It does not define anything
(apart from CPP macros, etc.).

If you provide your own types, then you will need to declare your own
prototypes, which I would generate (see below).



So unfortunately I see no way of providing a wrapper without writing a
complete MPI interface, which I would like to avoid. I might be able
to use the wi4mpi project and use their interface as a base for my
implementation, which would add a dependency to my project.



For just a 'classic' wrapper, you just need to provide the definition
(implementation) of the function you want to replace, adhering to the
declared (in mpi.h) function prototype.



A third and very ugly option would be, that I define all my types as
void* in the interface for the user. But this disables type checking
and I still would need to wrap from void* to references of my types.

So I might just stay in my AMPI namespace and provide a macro for the
user to either call regular mpi functions or my wrapper functions.



If you need 'classic' wrappers for your project, you might consider
generating that code with a generator like 'wrap' [2].

As I mentioned in my other mail, writing a 'classic' wrapper is
straightforward.

Cheers,
Marc-Andre


[2] https://github.com/LLNL/wrap



On Wed, 2017-10-04 at 11:00 +0200, Jean-Baptiste BESNARD wrote:


Dear Max,

I’m not sure I completely understand what you mean by a « pmpi.h »
however I may have some initial elements below.

The PMPI interface is currently targeting MPI functions only and
indeed some of the values you’ll find in your executable will be
compile time constants.
In fact, most MPI types/Constants are implementation dependent,
there is no unified ABI.

Nonetheless, you might be able to interpret them in your wrapper
library in order to have them « rerouted » to your target
implementation.
I mean, knowing the value of MPI_COMM_WORLD you could rewrite it to
be MPI_COMM_WORLD2.
And for sure you won’t find a PMPI_COMM_WORLD.

I can help on writing a wrapper for the whole PMPI interface. See my
repo here: https://github.com/besnardjb/mpi-snippets
There is a simple python script generating VIM snippets for MPI from
JSON specs, it can easily be converted to a script generating the
whole MPI interface.

Finally, an approach close to what you want to do might
be https://github.com/cea-hpc/wi4mpi, which performs this systematic
handle conversion between MPI flavors, but this clearly involves
some rewriting.

Hope this helps.

Regards,

Jean-Baptiste.



On 4 Oct 2017, at 10:35, Max Sagebaum
<max.sagebaum at scicomp.uni-kl.de> wrote:
Hello @ all,

my question is concerning the PMPI specification. I hope the list
is the correct place to ask.

I want to write a complete wrapper for MPI. That is, every define,
typedef and function will be wrapped and might be completely
changed. Currently I have prefixed everything with AMPI_ such that no
name clashes exist. But the user would need to rename every
occurrence of MPI_ to AMPI_.

I would now like to use the PMPI definition of MPI to define my
wrappers as the MPI version which then use the PMPI definitions.
Unfortunately I could not find tutorials for a complete wrapper.

As an example, take MPI_COMM_WORLD. I ran grep on the Open MPI
installation on my Linux machine for PMPI_COMM_WORLD, but the result
was empty. The definition of MPI_COMM_WORLD was
#define MPI_COMM_WORLD OMPI_PREDEFINED_GLOBAL( MPI_Comm,
ompi_mpi_comm_world)
without any chance to switch to PMPI_COMM_WORLD as a predefined macro.

I also checked the newest source tarball of openmpi and I could not
find anything for PMPI_COMM_WORLD there.

In the MPI 3.0 standard, on page 555 in section 14.2.1, the
requirements are listed for functions only. Was the definition of
PMPI_ counterparts for defines, types etc. never discussed?

I would have expected, that I can just include a pmpi.h and then I
would have all the PMPI_ symbols without the MPI symbols available.

Do you know of any way I could make my idea work?

Cheers

Max




--

Max Sagebaum Chair for Scientific Computing, TU Kaiserslautern, Bldg/Geb 34, Paul-Ehrlich-Strasse, 67663 Kaiserslautern, Germany Phone: +49 (0)631 205 5638 Fax: +49 (0)631 205 3056 Email: max.sagebaum at scicomp.uni-kl.de URL: www.scicomp.uni-kl.de