[Mpi-forum] MPI-IO for mesh based simulations beyond 2 billion elements

Sun Feb 19 05:21:53 CST 2012

Doug and all other implementors,

is there any platform and MPI library at IBM and elsewhere that has
an MPI_Aint and INTEGER(KIND=MPI_ADDRESS_KIND) less than 64 bit?

Harald, further details below.

Best regards
Rolf

----- Original Message -----
> From: "Harald Klimach" <h.klimach at grs-sim.de>
> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> Cc: "Main MPI Forum mailing list" <mpi-forum at lists.mpi-forum.org>
> Sent: Sunday, February 19, 2012 11:55:35 AM
> Subject: Re: MPI-IO for mesh based simulations beyond 2 billion elements
> Dear Rolf,
> 
> thanks a lot for your quick response and guidance.
> MPI_Type_create_resized defines LB and Extent as MPI_ADDRESS_KIND.
> What is this going to be on 32-Bit systems like the BlueGene? As far
> as I
> understand it, my chances for larger numbers are better with
> MPI_OFFSET_KIND?
> Thus, to me it looks like it is not possible to use the MPI Type
> facilities for
> large data sets in order to describe the IO layout for one-dimensional
> data.
> Instead I have to use the disp parameter in MPI_File_set_view. Is
> there any
> disadvantage attached to this, especially when doing collective IO?
> Could MPI-Implementations exploit the data provided in the types
> better than
> individual disp in the view?

With the resized datatype, the MPI library can detect that all
filetypes collectively provided in MPI_SET_FILE_VIEW are disjunct!
This may allow optimizations that are not possible with
defining different disp in MPI_SET_FILE_VIEW; here, the first process
will see the whole file, the all other ranks the whole file minus
the portion really accessed by the processes with smaller ranks.

Rolf

> Thanks,
> Harald
> 
> Am 19.02.2012 um 11:23 schrieb Rolf Rabenseifner:
> 
> > Dear Harald, dear I/O and datatype committee members,
> >
> > Indeed, you need a workaround. Our LARGE COUNT #265 is not enough.
> > The workaround (instead of MPI_TYPE_CREATE_SUBARRAY) is:
> >
> > 1. Call MPI_FILE_SET_VIEW(fh, 0, vectype, vectype, datarep,
> > MPI_INFO_NULL, iError)
> > 2. Call MPI_FILE_GET_TYPE_EXTENT(fh, vectype, vectype_byte_extent,
> > iError)
> > 3. Byte_elemOffset = elemOff (as in your source code) *
> > vectype_byte_extent
> >    Byte_GlobalSize = globElems (as in your source code) *
> >    vectype_byte_extent
> > 4. Call MPI_Type_contiguous(locElems, vectype, chunktype, iError)
> > 5. Call MPI_TYPE_CREATE_RESIZED(chunktype,-Byte_elemOffset,
> >                                 Byte_GlobalSize, ftype)
> >    (I hope that I understood thisroutine correctly
> >     and here, a negativ lb is needed.)
> >
> > 6. call MPI_File_set_view( fh, 0_MPI_OFFSET_KIND, &
> >> & vectype, ftype, "native", &
> >> & MPI_INFO_NULL, iError )
> >
> > Any comments by specialists?
> >
> > Best regards
> > Rolf
> >
> > ----- Original Message -----
> >> From: "Harald Klimach" <h.klimach at grs-sim.de>
> >> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> >> Sent: Sunday, February 19, 2012 10:05:40 AM
> >> Subject: MPI-IO for mesh based simulations beyond 2 billion
> >> elements
> >>
> >> Dear Rolf,
> >>
> >> Mesh based simulations are an important backbone of CFD. The mesh
> >> sizes we are capable
> >> to compute on new HPC systems steadily increases even for complex
> >> geometrical layouts.
> >> For simple geometries, much larger domains then 2 billion elements
> >> where already done
> >> 10 years ago, see
> >> M. Yokokawa, K. Itakura, A. Uno, T. Ishihara and Y. Kaneda,
> >> "16.4-TFlops Direct Numerical Simulation of Turbulence by a Fourier
> >> Spectral Method on the Earth Simulator", Proceedings of the 2002
> >> ACM/IEEE Conference on Supercomputing, Baltimore MD (2002).
> >> To describe the spatial domain around complex geometries usually
> >> some
> >> unstructured
> >> approach has to be followed to allow a sparse representation. This
> >> results in a one-dimensional
> >> list of all elements which is distributed across all processes in a
> >> parallel simulation.
> >> Therefore the natural representation of all elements in each
> >> partition
> >> is perfectly
> >> described by an one-dimensional MPI subarray datatype.
> >> This would also be the most natural choice to describe the MPI File
> >> view for each
> >> partition to enable dumping and reading of results and restart
> >> data.
> >> However, due to the interface definition of
> >> MPI_Type_create_subarray,
> >> the global
> >> problem size would be limited to roughly 2 billion elements by MPI.
> >>
> >> Here is the actual implementation we are facing in our Code:
> >> We use a Lattice-Boltzmann representation, where each element is
> >> described by
> >> 19 MPI_DOUBLE_PRECISION values:
> >>
> >> call MPI_Type_contiguous(19, MPI_DOUBLE_PRECISION, vectype, iError)
> >> call MPI_Type_commit( vectype, iError )
> >>
> >> This type is then used to build up the subarray description of each
> >> partition:
> >>
> >> call MPI_Type_create_subarray( 1, [ globElems ], [ locElems ], [
> >> elemOff ], &
> >> & MPI_ORDER_FORTRAN, vectype, ftype, iError)
> >> call MPI_Type_commit( ftype, iError )
> >>
> >> Where globElems is the global number of Elements, locElems the
> >> number
> >> of
> >> Elements in the process local partition, and elemOff the Offset of
> >> the
> >> process
> >> local partition in terms of elements. This type is then used to set
> >> the view
> >> for the local process on a file opened for MPI-IO:
> >>
> >> call MPI_File_open( MPI_COMM_WORLD, &
> >> & 'restart.dat', &
> >> & MPI_MODE_WRONLY+MPI_MODE_CREATE, &
> >> & MPI_INFO_NULL, fhandle, &
> >> & iError )
> >>
> >> call MPI_File_set_view( fhandle, 0_MPI_OFFSET_KIND, &
> >> & vectype, ftype, "native", &
> >> & MPI_INFO_NULL, iError )
> >>
> >> Finally the data is written to disk with collective write:
> >>
> >> call MPI_File_write_all(fhandle, data, locElems, vectype, iostatus,
> >> iError)
> >>
> >> The global mesh is sorted according to a space-filling curve and to
> >> provide
> >> mostly equal distribution on all processes, the local number of
> >> elements is
> >> computed by the following integer arithmetic:
> >>
> >> locElems = globElems / nProcs
> >> if (myrank < mod(globElems, nProcs) locElems = locElems + 1
> >>
> >> Where nProcs is the number of processes in the communicator and
> >> myRank
> >> the rank of the local partition in the communicator.
> >> Note, that for differently sized elements, the workload varies and
> >> the
> >> elements
> >> shouldn't be distributed equally for work load balance. And with
> >> different
> >> partitioning strategies other distributions might arise. However in
> >> any case it
> >> is not generally possible to find common discriminators greater
> >> than
> >> one
> >> across all partitions.
> >> Thus it is not easily possible to find a larger etype, which spans
> >> more than a
> >> single element.
> >>
> >> What is the recommended implementation for efficient parallel IO
> >> within MPI
> >> in such a scenario?
> >>
> >> Thanks a lot,
> >> Harald
> >>
> >> --
> >> Dipl.-Ing. Harald Klimach
> >>
> >> German Research School for
> >> Simulation Sciences GmbH
> >> Schinkelstr. 2a
> >> 52062 Aachen | Germany
> >>
> >> Tel +49 241 / 80-99758
> >> Fax +49 241 / 806-99758
> >> Web www.grs-sim.de
> >> JID: haraldkl at jabber.ccc.de
> >>
> >> Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
> >> Registered in the commercial register of the local court of
> >> Düren (Amtsgericht Düren) under registration number HRB 5268
> >> Registered office: Jülich
> >> Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes
> >
> > --
> > Dr. Rolf Rabenseifner . . . . . . . . . .. email
> > rabenseifner at hlrs.de
> > High Performance Computing Center (HLRS) . phone
> > ++49(0)711/685-65530
> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 /
> > 685-65832
> > Head of Dpmt Parallel Computing . . .
> > www.hlrs.de/people/rabenseifner
> > Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)
> 
> --
> Dipl.-Ing. Harald Klimach
> 
> German Research School for
> Simulation Sciences GmbH
> Schinkelstr. 2a
> 52062 Aachen | Germany
> 
> Tel +49 241 / 80-99758
> Fax +49 241 / 806-99758
> Web www.grs-sim.de
> JID: haraldkl at jabber.ccc.de
> 
> Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
> Registered in the commercial register of the local court of
> Düren (Amtsgericht Düren) under registration number HRB 5268
> Registered office: Jülich
> Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)