[Mpi-forum] MPI-IO for mesh based simulations beyond 2 billion elements

Jeff Hammond jhammond at alcf.anl.gov
Sun Feb 19 08:02:27 CST 2012


On Blue Gene/P, MPI_Aint is a 64-bit integer specifically to address
this issue, even though addresses are only 32-bit integers.

For background and details, please see
http://lists.alcf.anl.gov/pipermail/dcmf/2008-February/000060.html and
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-July/005315.html.
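
A quick way to check what a given installation provides (untested
sketch using the mpi module; the kind parameters are compile-time
constants, so nothing needs to run under mpirun):

  program check_mpi_kinds
  use mpi
  implicit none
  ! MPI_ADDRESS_KIND and MPI_OFFSET_KIND are the Fortran kinds behind
  ! MPI_Aint and MPI_Offset; on Blue Gene/P both should report 8 bytes.
  write(*,*) 'MPI_Aint   bytes:', bit_size(0_MPI_ADDRESS_KIND)/8
  write(*,*) 'MPI_Offset bytes:', bit_size(0_MPI_OFFSET_KIND)/8
  end program check_mpi_kinds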

Jeff

On Sun, Feb 19, 2012 at 6:21 AM, Rolf Rabenseifner <rabenseifner at hlrs.de> wrote:
> Doug and all other implementors,
>
> is there any platform and MPI library, at IBM or elsewhere, that has
> an MPI_Aint and INTEGER(KIND=MPI_ADDRESS_KIND) smaller than 64 bits?
>
> Harald, further details below.
>
> Best regards
> Rolf
>
> ----- Original Message -----
>> From: "Harald Klimach" <h.klimach at grs-sim.de>
>> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
>> Cc: "Main MPI Forum mailing list" <mpi-forum at lists.mpi-forum.org>
>> Sent: Sunday, February 19, 2012 11:55:35 AM
>> Subject: Re: MPI-IO for mesh based simulations beyond 2 billion elements
>> Dear Rolf,
>>
>> thanks a lot for your quick response and guidance.
>> MPI_Type_create_resized defines LB and extent as MPI_ADDRESS_KIND.
>> What is this going to be on 32-bit systems like the Blue Gene? As far
>> as I understand it, my chances of getting larger numbers are better
>> with MPI_OFFSET_KIND?
>> Thus, to me it looks like it is not possible to use the MPI type
>> facilities to describe the I/O layout of one-dimensional data for
>> large data sets. Instead I have to use the disp parameter in
>> MPI_File_set_view. Is there any disadvantage attached to this,
>> especially when doing collective I/O? Could MPI implementations
>> exploit the data provided in the types better than an individual disp
>> in the view?
>
> With the resized datatype, the MPI library can detect that the
> filetypes collectively provided in MPI_FILE_SET_VIEW are disjoint!
> This may allow optimizations that are not possible when the chunks
> are expressed only through different disp values in MPI_FILE_SET_VIEW:
> in that case the first process's view spans the whole file, and every
> other rank's view spans the whole file minus the portion in front of
> its own chunk, i.e., the part actually accessed by the ranks before it.
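>
> For illustration, the two alternatives look roughly like this
> (untested sketch; vectype_extent is assumed to hold the byte extent
> of vectype, and disp and elemOff must be 64-bit integers):
>
>   integer(kind=MPI_OFFSET_KIND) :: disp
>
>   ! (a) per-rank displacement, filetype = etype = vectype:
>   !     each rank's view runs from its disp to the end of the file,
>   !     so the views of different ranks overlap.
>   disp = elemOff * vectype_extent
>   call MPI_File_set_view( fhandle, disp, vectype, vectype, &
>     & "native", MPI_INFO_NULL, iError )
>
>   ! (b) disp = 0 on all ranks, per-rank resized filetype ftype:
>   !     each tile of ftype marks exactly this rank's chunk within an
>   !     extent covering the whole global array, so the collectively
>   !     provided filetypes are disjoint.
>   call MPI_File_set_view( fhandle, 0_MPI_OFFSET_KIND, vectype, &
>     & ftype, "native", MPI_INFO_NULL, iError )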
>
> Rolf
>
>> Thanks,
>> Harald
>>
>> On 19.02.2012 at 11:23, Rolf Rabenseifner wrote:
>>
>> > Dear Harald, dear I/O and datatype committee members,
>> >
>> > Indeed, you need a workaround. Our LARGE COUNT #265 is not enough.
>> > The workaround (instead of MPI_TYPE_CREATE_SUBARRAY) is:
>> >
>> > 1. Call MPI_FILE_SET_VIEW(fh, 0, vectype, vectype, datarep,
>> > MPI_INFO_NULL, iError)
>> > 2. Call MPI_FILE_GET_TYPE_EXTENT(fh, vectype, vectype_byte_extent,
>> > iError)
>> > 3. Byte_elemOffset = elemOff (as in your source code) *
>> > vectype_byte_extent
>> >    Byte_GlobalSize = globElems (as in your source code) *
>> >    vectype_byte_extent
>> > 4. Call MPI_Type_contiguous(locElems, vectype, chunktype, iError)
>> > 5. Call MPI_TYPE_CREATE_RESIZED(chunktype, -Byte_elemOffset,
>> >                                 Byte_GlobalSize, ftype, iError)
>> >    (I hope that I understood this routine correctly
>> >     and that a negative lb is needed here.)
>> >
>> > 6. Call MPI_File_set_view( fh, 0_MPI_OFFSET_KIND, &
>> >         & vectype, ftype, "native", &
>> >         & MPI_INFO_NULL, iError )
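>> >
>> > Put together as one (untested) code fragment - note that the lb and
>> > extent arguments of MPI_TYPE_CREATE_RESIZED are address-sized
>> > integers, and that elemOff and globElems themselves must already be
>> > stored in 64-bit integers once they exceed 2**31:
>> >
>> >   integer(kind=MPI_ADDRESS_KIND) :: vectype_byte_extent
>> >   integer(kind=MPI_ADDRESS_KIND) :: Byte_elemOffset, Byte_GlobalSize
>> >   integer :: chunktype, ftype
>> >
>> >   ! Steps 1+2: let the library report the byte extent of vectype
>> >   !            in the file representation.
>> >   call MPI_File_set_view( fh, 0_MPI_OFFSET_KIND, vectype, vectype, &
>> >     & "native", MPI_INFO_NULL, iError )
>> >   call MPI_File_get_type_extent( fh, vectype, vectype_byte_extent, &
>> >     & iError )
>> >
>> >   ! Step 3: offsets and sizes in bytes, computed in 64-bit integers.
>> >   Byte_elemOffset = elemOff   * vectype_byte_extent
>> >   Byte_GlobalSize = globElems * vectype_byte_extent
>> >
>> >   ! Steps 4+5: contiguous chunk, resized so that one tile spans the
>> >   !            whole global array (sign of lb as in step 5 above).
>> >   call MPI_Type_contiguous( locElems, vectype, chunktype, iError )
>> >   call MPI_Type_create_resized( chunktype, -Byte_elemOffset, &
>> >     & Byte_GlobalSize, ftype, iError )
>> >   call MPI_Type_commit( ftype, iError )
>> >
>> >   ! Step 6: final view with displacement 0.
>> >   call MPI_File_set_view( fh, 0_MPI_OFFSET_KIND, vectype, ftype, &
>> >     & "native", MPI_INFO_NULL, iError )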
>> >
>> > Any comments by specialists?
>> >
>> > Best regards
>> > Rolf
>> >
>> > ----- Original Message -----
>> >> From: "Harald Klimach" <h.klimach at grs-sim.de>
>> >> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
>> >> Sent: Sunday, February 19, 2012 10:05:40 AM
>> >> Subject: MPI-IO for mesh based simulations beyond 2 billion
>> >> elements
>> >>
>> >> Dear Rolf,
>> >>
>> >> Mesh based simulations are an important backbone of CFD. The mesh
>> >> sizes we are able to compute on new HPC systems steadily increase,
>> >> even for complex geometrical layouts.
>> >> For simple geometries, domains much larger than 2 billion elements
>> >> were already computed 10 years ago; see
>> >> M. Yokokawa, K. Itakura, A. Uno, T. Ishihara and Y. Kaneda,
>> >> "16.4-TFlops Direct Numerical Simulation of Turbulence by a Fourier
>> >> Spectral Method on the Earth Simulator", Proceedings of the 2002
>> >> ACM/IEEE Conference on Supercomputing, Baltimore MD (2002).
>> >> To describe the spatial domain around complex geometries, usually an
>> >> unstructured approach has to be followed to allow a sparse
>> >> representation. This results in a one-dimensional list of all
>> >> elements, which is distributed across all processes in a parallel
>> >> simulation. The natural representation of all elements in each
>> >> partition is therefore perfectly described by a one-dimensional MPI
>> >> subarray datatype.
>> >> This would also be the most natural choice to describe the MPI file
>> >> view for each partition, to enable dumping and reading of results
>> >> and restart data.
>> >> However, due to the interface definition of MPI_Type_create_subarray,
>> >> the global problem size would be limited by MPI to roughly 2 billion
>> >> elements.
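>> >>
>> >> (For reference, the limit comes from the Fortran binding, which
>> >> declares all sizes as default INTEGERs:
>> >>
>> >>   MPI_TYPE_CREATE_SUBARRAY(NDIMS, ARRAY_OF_SIZES, ARRAY_OF_SUBSIZES,
>> >>                       ARRAY_OF_STARTS, ORDER, OLDTYPE, NEWTYPE, IERROR)
>> >>     INTEGER NDIMS, ARRAY_OF_SIZES(*), ARRAY_OF_SUBSIZES(*)
>> >>     INTEGER ARRAY_OF_STARTS(*), ORDER, OLDTYPE, NEWTYPE, IERROR
>> >>
>> >> With a 4-byte default INTEGER, globElems is therefore capped at
>> >> huge(0) = 2147483647.)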
>> >>
>> >> Here is the actual implementation we are facing in our code:
>> >> We use a Lattice-Boltzmann representation, where each element is
>> >> described by
>> >> 19 MPI_DOUBLE_PRECISION values:
>> >>
>> >> call MPI_Type_contiguous(19, MPI_DOUBLE_PRECISION, vectype, iError)
>> >> call MPI_Type_commit( vectype, iError )
>> >>
>> >> This type is then used to build up the subarray description of each
>> >> partition:
>> >>
>> >> call MPI_Type_create_subarray( 1, [ globElems ], [ locElems ], &
>> >> & [ elemOff ], MPI_ORDER_FORTRAN, vectype, ftype, iError)
>> >> call MPI_Type_commit( ftype, iError )
>> >>
>> >> where globElems is the global number of elements, locElems the
>> >> number of elements in the process-local partition, and elemOff the
>> >> offset of the process-local partition in terms of elements. This
>> >> type is then used to set the view for the local process on a file
>> >> opened for MPI-IO:
>> >>
>> >> call MPI_File_open( MPI_COMM_WORLD, &
>> >> & 'restart.dat', &
>> >> & MPI_MODE_WRONLY+MPI_MODE_CREATE, &
>> >> & MPI_INFO_NULL, fhandle, &
>> >> & iError )
>> >>
>> >> call MPI_File_set_view( fhandle, 0_MPI_OFFSET_KIND, &
>> >> & vectype, ftype, "native", &
>> >> & MPI_INFO_NULL, iError )
>> >>
>> >> Finally, the data is written to disk with a collective write:
>> >>
>> >> call MPI_File_write_all(fhandle, data, locElems, vectype, iostatus,
>> >> iError)
>> >>
>> >> The global mesh is sorted according to a space-filling curve, and
>> >> to provide a mostly equal distribution across all processes, the
>> >> local number of elements is computed with the following integer
>> >> arithmetic:
>> >>
>> >> locElems = globElems / nProcs
>> >> if (myrank < mod(globElems, nProcs)) locElems = locElems + 1
>> >>
>> >> where nProcs is the number of processes in the communicator and
>> >> myRank the rank of the local partition in the communicator.
>> >> Note that for differently sized elements the workload varies, so for
>> >> load balancing the elements should not be distributed equally, and
>> >> with different partitioning strategies other distributions might
>> >> arise. In any case, it is generally not possible to find a common
>> >> divisor greater than one across all partition sizes (for example,
>> >> with globElems = 2**32 and nProcs = 3 the partition sizes are
>> >> 1431655766 and 1431655765, which share no divisor greater than one).
>> >> Thus it is not easily possible to use a larger etype that spans more
>> >> than a single element.
>> >>
>> >> What is the recommended implementation for efficient parallel IO
>> >> within MPI
>> >> in such a scenario?
>> >>
>> >> Thanks a lot,
>> >> Harald
>> >>
>> >
>>
>> --
>> Dipl.-Ing. Harald Klimach
>>
>> German Research School for
>> Simulation Sciences GmbH
>> Schinkelstr. 2a
>> 52062 Aachen | Germany
>>
>> Tel +49 241 / 80-99758
>> Fax +49 241 / 806-99758
>> Web www.grs-sim.de
>> JID: haraldkl at jabber.ccc.de
>>
>> Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
>> Registered in the commercial register of the local court of
>> Düren (Amtsgericht Düren) under registration number HRB 5268
>> Registered office: Jülich
>> Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes
>
> --
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)
>
> _______________________________________________
> mpi-forum mailing list
> mpi-forum at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/old/index.php/User:Jhammond



