[Mpi-forum] MPI-IO for mesh based simulations beyond 2 billion elements

Rolf Rabenseifner rabenseifner at hlrs.de
Sun Feb 19 04:23:53 CST 2012


Dear Harald, dear I/O and datatype committee members,

Indeed, you need a workaround. Our LARGE COUNT ticket #265 is not enough.
The workaround (instead of MPI_TYPE_CREATE_SUBARRAY) is the following;
a consolidated sketch is given after the steps:

 1. Call MPI_FILE_SET_VIEW(fh, 0_MPI_OFFSET_KIND, vectype, vectype, datarep, MPI_INFO_NULL, iError)
 2. Call MPI_FILE_GET_TYPE_EXTENT(fh, vectype, vectype_byte_extent, iError)
 3. Byte_elemOffset = elemOff (as in your source code) * vectype_byte_extent
    Byte_GlobalSize = globElems (as in your source code) * vectype_byte_extent
 4. Call MPI_Type_contiguous(locElems, vectype, chunktype, iError)
 5. Call MPI_TYPE_CREATE_RESIZED(chunktype, -Byte_elemOffset,
                                 Byte_GlobalSize, ftype, iError)
    (I hope that I understood this routine correctly
     and that a negative lb is needed here.)

 6. Call MPI_File_set_view( fh, 0_MPI_OFFSET_KIND, &
      &                     vectype, ftype, "native", &
      &                     MPI_INFO_NULL, iError )

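Put together, the steps might look like the sketch below. The names locElems,
globElems, elemOff, vectype and the file handle are taken from your code
(vectype committed, data representation "native"); the subroutine name and the
64-bit kinds of the arguments are only illustrative. Because I am not sure
about the negative lb in step 5, the sketch keeps lb = 0 and instead passes
Byte_elemOffset as the 64-bit disp argument of MPI_FILE_SET_VIEW in step 6;
the resulting view should be the same, and it avoids the negative-lb question:

  subroutine set_large_view(fh, vectype, locElems, globElems, elemOff, iError)
    use mpi
    implicit none
    integer, intent(in)  :: fh, vectype, locElems   ! locElems must fit in INTEGER
    integer(kind=MPI_OFFSET_KIND), intent(in) :: globElems, elemOff
    integer, intent(out) :: iError
    integer :: chunktype, ftype
    integer(kind=MPI_ADDRESS_KIND) :: vectype_byte_extent, lb, Byte_GlobalSize
    integer(kind=MPI_OFFSET_KIND)  :: Byte_elemOffset

    ! Steps 1+2: extent of one element in the file's data representation
    call MPI_File_set_view(fh, 0_MPI_OFFSET_KIND, vectype, vectype, &
      &                    "native", MPI_INFO_NULL, iError)
    call MPI_File_get_type_extent(fh, vectype, vectype_byte_extent, iError)

    ! Step 3: byte offset and global byte size in 64-bit integers
    ! (these products overflow a default INTEGER for large meshes)
    Byte_elemOffset = elemOff   * vectype_byte_extent
    Byte_GlobalSize = globElems * vectype_byte_extent

    ! Step 4: the process-local chunk as one contiguous block of elements
    call MPI_Type_contiguous(locElems, vectype, chunktype, iError)

    ! Step 5 (variant): keep lb = 0 and only stretch the extent to the
    ! global record size, so that the filetype tiles the file correctly
    lb = 0
    call MPI_Type_create_resized(chunktype, lb, Byte_GlobalSize, ftype, iError)
    call MPI_Type_commit(ftype, iError)

    ! Step 6 (variant): the per-process byte offset goes into the 64-bit
    ! disp argument, so no default-INTEGER count limits the global size
    call MPI_File_set_view(fh, Byte_elemOffset, vectype, ftype, "native", &
      &                    MPI_INFO_NULL, iError)
  end subroutine set_large_view

The write call itself then stays exactly as in your code, with locElems
elements of type vectype per process:

  call MPI_File_write_all(fh, data, locElems, vectype, iostatus, iError)
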
Any comments by specialists?

Best regards
Rolf

----- Original Message -----
> From: "Harald Klimach" <h.klimach at grs-sim.de>
> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> Sent: Sunday, February 19, 2012 10:05:40 AM
> Subject: MPI-IO for mesh based simulations beyond 2 billion elements
>
> Dear Rolf,
> 
> Mesh-based simulations are an important backbone of CFD. The mesh
> sizes we are able to compute on new HPC systems steadily increase,
> even for complex geometrical layouts.
> For simple geometries, domains much larger than 2 billion elements
> were already computed 10 years ago, see
> M. Yokokawa, K. Itakura, A. Uno, T. Ishihara and Y. Kaneda,
> "16.4-TFlops Direct Numerical Simulation of Turbulence by a Fourier
> Spectral Method on the Earth Simulator", Proceedings of the 2002
> ACM/IEEE Conference on Supercomputing, Baltimore MD (2002).
> To describe the spatial domain around complex geometries, usually an
> unstructured approach has to be followed to allow a sparse
> representation. This results in a one-dimensional list of all
> elements, which is distributed across all processes in a parallel
> simulation. The natural representation of all elements in each
> partition is therefore perfectly described by a one-dimensional MPI
> subarray datatype.
> This would also be the most natural choice to describe the MPI file
> view of each partition, to enable dumping and reading of results and
> restart data. However, due to the interface definition of
> MPI_Type_create_subarray, the global problem size would be limited by
> MPI to roughly 2 billion elements.
> 
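For reference, the Fortran binding of this routine takes all sizes, subsizes,
and starts as default INTEGER arguments, so a global size beyond
HUGE(0) = 2147483647 cannot even be passed to it:

  MPI_TYPE_CREATE_SUBARRAY(NDIMS, ARRAY_OF_SIZES, ARRAY_OF_SUBSIZES,
                           ARRAY_OF_STARTS, ORDER, OLDTYPE, NEWTYPE, IERROR)
      INTEGER NDIMS, ARRAY_OF_SIZES(*), ARRAY_OF_SUBSIZES(*),
              ARRAY_OF_STARTS(*), ORDER, OLDTYPE, NEWTYPE, IERROR
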
> Here is the actual implementation we are facing in our code:
> We use a Lattice-Boltzmann representation, where each element is
> described by 19 MPI_DOUBLE_PRECISION values:
> 
> call MPI_Type_contiguous(19, MPI_DOUBLE_PRECISION, vectype, iError)
> call MPI_Type_commit( vectype, iError )
> 
> This type is then used to build up the subarray description of each
> partition:
> 
> call MPI_Type_create_subarray( 1, [globElems], [locElems], [elemOff], &
>   &                            MPI_ORDER_FORTRAN, vectype, ftype, iError )
> call MPI_Type_commit( ftype, iError )
> 
> Here globElems is the global number of elements, locElems the number
> of elements in the process-local partition, and elemOff the offset of
> the process-local partition in terms of elements. This type is then
> used to set the view for the local process on a file opened for
> MPI-IO:
> 
> call MPI_File_open( MPI_COMM_WORLD, &
> & 'restart.dat', &
> & MPI_MODE_WRONLY+MPI_MODE_CREATE, &
> & MPI_INFO_NULL, fhandle, &
> & iError )
> 
> call MPI_File_set_view( fhandle, 0_MPI_OFFSET_KIND, &
> & vectype, ftype, "native", &
> & MPI_INFO_NULL, iError )
> 
> Finally the data is written to disk with collective write:
> 
> call MPI_File_write_all( fhandle, data, locElems, vectype, iostatus, iError )
> 
> The global mesh is sorted according to a space-filling curve, and to
> provide a mostly equal distribution across all processes, the local
> number of elements is computed by the following integer arithmetic:
> 
> locElems = globElems / nProcs
> if (myRank < mod(globElems, nProcs)) locElems = locElems + 1
> 
> Here nProcs is the number of processes in the communicator and myRank
> the rank of the local partition in the communicator.
> Note that for differently sized elements the workload varies, and the
> elements should then not be distributed equally, for the sake of load
> balance. With different partitioning strategies other distributions
> might arise as well. However, in any case it is generally not possible
> to find a common divisor greater than one across all partition sizes.
> Thus it is not easily possible to find a larger etype which spans more
> than a single element.
> 
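A side note on the integer arithmetic above: once globElems exceeds
HUGE(0) = 2147483647, globElems and the element offset elemOff can no longer
be default INTEGERs, while the per-process count locElems still can. A
64-bit-safe sketch of this distribution, including the element offset needed
for the file view (the offset formula is only one reading of the distribution
described above, and the helper names base and rest are illustrative), could
look like:

  ! kinds taken from the mpi module (or mpif.h)
  integer(kind=MPI_OFFSET_KIND) :: globElems, elemOff, base
  integer                       :: locElems, nProcs, myRank, rest

  base = globElems / nProcs                  ! elements per process (rounded down)
  rest = int(mod(globElems, int(nProcs, MPI_OFFSET_KIND)))
  locElems = int(base)                       ! per-process count, assumed < 2**31
  if (myRank < rest) locElems = locElems + 1

  ! element offset of this rank = sum of the counts of all lower ranks
  elemOff = int(myRank, MPI_OFFSET_KIND) * base + min(myRank, rest)
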
> What is the recommended implementation for efficient parallel IO
> within MPI
> in such a scenario?
> 
> Thanks a lot,
> Harald
> 
> --
> Dipl.-Ing. Harald Klimach
> 
> German Research School for
> Simulation Sciences GmbH
> Schinkelstr. 2a
> 52062 Aachen | Germany
> 
> Tel +49 241 / 80-99758
> Fax +49 241 / 806-99758
> Web www.grs-sim.de
> JID: haraldkl at jabber.ccc.de
> 
> Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
> Registered in the commercial register of the local court of
> Düren (Amtsgericht Düren) under registration number HRB 5268
> Registered office: Jülich
> Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)
