[Mpi-forum] MPI-IO for mesh based simulations beyond 2 billion elements

Rajeev Thakur thakur at mcs.anl.gov
Sun Feb 19 10:51:42 CST 2012


Harald,
            If all you have is a one-dimensional array, and each process reads or writes a large contiguous chunk of it, you don't need MPI_Type_create_subarray or even collective I/O. You can just have each process call MPI_File_read_at or MPI_File_write_at with the right offset. The subarray type is more useful if you have a multidimensional array and need to read a subsection of it that is not contiguous in the file.
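
A minimal sketch of that approach (untested), reusing the variable names from Harald's code below and keeping the collective flavor he already uses; the independent MPI_File_write_at works the same way:

    ! assumes "use mpi" and that elemOff, locElems and data are defined
    ! as in Harald's code below
    integer :: fhandle, vectype, vecsize, iError
    integer :: iostatus(MPI_STATUS_SIZE)
    integer(kind=MPI_OFFSET_KIND) :: byteOff

    ! one element = 19 doubles
    call MPI_Type_contiguous(19, MPI_DOUBLE_PRECISION, vectype, iError)
    call MPI_Type_commit(vectype, iError)
    call MPI_Type_size(vectype, vecsize, iError)      ! 19 * 8 = 152 bytes

    call MPI_File_open(MPI_COMM_WORLD, 'restart.dat',          &
      &                MPI_MODE_WRONLY + MPI_MODE_CREATE,      &
      &                MPI_INFO_NULL, fhandle, iError)

    ! With the default view (etype = filetype = MPI_BYTE) the explicit offset
    ! is a plain byte offset of kind MPI_OFFSET_KIND, so it is not limited to
    ! 2 billion elements.
    byteOff = int(elemOff, MPI_OFFSET_KIND) * vecsize

    call MPI_File_write_at_all(fhandle, byteOff, data, locElems, vectype, &
      &                        iostatus, iError)
    call MPI_File_close(fhandle, iError)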

Rajeev


On Feb 19, 2012, at 4:23 AM, Rolf Rabenseifner wrote:

> Dear Harald, dear I/O and datatype committee members,
> 
> Indeed, you need a workaround. Our LARGE COUNT #265 is not enough.
> The workaround (instead of MPI_TYPE_CREATE_SUBARRAY) is:
> 
> 1. Call MPI_FILE_SET_VIEW(fh, 0, vectype, vectype, datarep, MPI_INFO_NULL, iError)
> 2. Call MPI_FILE_GET_TYPE_EXTENT(fh, vectype, vectype_byte_extent, iError)
> 3. Byte_elemOffset = elemOff (as in your source code) * vectype_byte_extent
>    Byte_GlobalSize = globElems (as in your source code) * vectype_byte_extent
> 4. Call MPI_Type_contiguous(locElems, vectype, chunktype, iError)
> 5. Call MPI_TYPE_CREATE_RESIZED(chunktype, -Byte_elemOffset,
>                                 Byte_GlobalSize, ftype)
>    (I hope that I understood this routine correctly
>     and that a negative lb is needed here.)
> 
> 6. Call MPI_File_set_view( fh, 0_MPI_OFFSET_KIND, &
>      &                     vectype, ftype, "native", &
>      &                     MPI_INFO_NULL, iError )
> 
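> A variant that sidesteps the question of the negative lb would be to place
> the chunk at a positive byte displacement with MPI_TYPE_CREATE_HINDEXED and
> only then resize the extent; this reproduces exactly the typemap that
> MPI_TYPE_CREATE_SUBARRAY would build. Untested sketch, assuming fh is the
> open file handle, the remaining variables are as in your code, and
> MPI_ADDRESS_KIND is 8 bytes:
> 
>   integer :: chunktype, ftype
>   integer(kind=MPI_ADDRESS_KIND) :: vec_ext, byte_elemOff, byte_globSize
> 
>   ! the data representation defaults to "native", so the extent in the
>   ! file can be queried right away (step 1 would set it explicitly)
>   call MPI_File_get_type_extent(fh, vectype, vec_ext, iError)
>   byte_elemOff  = int(elemOff,   MPI_ADDRESS_KIND) * vec_ext
>   byte_globSize = int(globElems, MPI_ADDRESS_KIND) * vec_ext
> 
>   ! one block of locElems elements at byte offset byte_elemOff ...
>   call MPI_Type_create_hindexed(1, [ locElems ], [ byte_elemOff ], &
>     &                           vectype, chunktype, iError)
>   ! ... embedded in a frame of globElems elements
>   call MPI_Type_create_resized(chunktype, 0_MPI_ADDRESS_KIND, &
>     &                          byte_globSize, ftype, iError)
>   call MPI_Type_commit(ftype, iError)
> 
>   call MPI_File_set_view(fh, 0_MPI_OFFSET_KIND, vectype, ftype, &
>     &                    "native", MPI_INFO_NULL, iError)
>   call MPI_File_write_all(fh, data, locElems, vectype, iostatus, iError)
> 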
> Any comments by specialists?
> 
> Best regards
> Rolf
> 
> ----- Original Message -----
>> From: "Harald Klimach" <h.klimach at grs-sim.de>
>> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
>> Sent: Sunday, February 19, 2012 10:05:40 AM
>> Subject: MPI-IO for mesh based simulations beyond 2 billion elements
>> 
>> Dear Rolf,
>> 
>> Mesh-based simulations are an important backbone of CFD. The mesh sizes
>> we are able to compute on new HPC systems steadily increase, even for
>> complex geometrical layouts. For simple geometries, domains much larger
>> than 2 billion elements were already computed 10 years ago, see
>> M. Yokokawa, K. Itakura, A. Uno, T. Ishihara and Y. Kaneda,
>> "16.4-TFlops Direct Numerical Simulation of Turbulence by a Fourier
>> Spectral Method on the Earth Simulator", Proceedings of the 2002
>> ACM/IEEE Conference on Supercomputing, Baltimore MD (2002).
>> To describe the spatial domain around complex geometries, an unstructured
>> approach usually has to be followed to allow a sparse representation. This
>> results in a one-dimensional list of all elements, which is distributed
>> across all processes in a parallel simulation. The elements of each
>> partition are therefore naturally described by a one-dimensional MPI
>> subarray datatype. This would also be the most natural choice for the MPI
>> file view of each partition, to enable dumping and reading of results and
>> restart data. However, due to the interface definition of
>> MPI_Type_create_subarray, the global problem size would be limited by MPI
>> to roughly 2 billion elements.
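>> For reference, the Fortran binding in the MPI standard declares all of the
>> size, subsize and start arguments as default INTEGER, so with the usual
>> 4-byte default integer the global element count is capped at
>> huge(1) = 2147483647:
>> 
>> MPI_TYPE_CREATE_SUBARRAY(NDIMS, ARRAY_OF_SIZES, ARRAY_OF_SUBSIZES,
>>                          ARRAY_OF_STARTS, ORDER, OLDTYPE, NEWTYPE, IERROR)
>>     INTEGER NDIMS, ARRAY_OF_SIZES(*), ARRAY_OF_SUBSIZES(*),
>>             ARRAY_OF_STARTS(*), ORDER, OLDTYPE, NEWTYPE, IERROR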
>> 
>> Here is the actual implementation in our code: we use a Lattice-Boltzmann
>> representation, where each element is described by 19 MPI_DOUBLE_PRECISION
>> values:
>> 
>> call MPI_Type_contiguous(19, MPI_DOUBLE_PRECISION, vectype, iError)
>> call MPI_Type_commit( vectype, iError )
>> 
>> This type is then used to build up the subarray description of each
>> partition:
>> 
>> call MPI_Type_create_subarray( 1, [ globElems ], [ locElems ], [ elemOff ], &
>>   &                             MPI_ORDER_FORTRAN, vectype, ftype, iError )
>> call MPI_Type_commit( ftype, iError )
>> 
>> Here globElems is the global number of elements, locElems the number of
>> elements in the process-local partition, and elemOff the offset of the
>> process-local partition in terms of elements. This type is then used to set
>> the view of the local process on a file opened for MPI-IO:
>> 
>> call MPI_File_open( MPI_COMM_WORLD, &
>> & 'restart.dat', &
>> & MPI_MODE_WRONLY+MPI_MODE_CREATE, &
>> & MPI_INFO_NULL, fhandle, &
>> & iError )
>> 
>> call MPI_File_set_view( fhandle, 0_MPI_OFFSET_KIND, &
>> & vectype, ftype, "native", &
>> & MPI_INFO_NULL, iError )
>> 
>> Finally the data is written to disk with a collective write:
>> 
>> call MPI_File_write_all(fhandle, data, locElems, vectype, iostatus, iError)
>> 
>> The global mesh is sorted along a space-filling curve and, to provide a
>> mostly equal distribution across all processes, the local number of
>> elements is computed by the following integer arithmetic:
>> 
>> locElems = globElems / nProcs
>> if (myrank < mod(globElems, nProcs)) locElems = locElems + 1
>> 
>> Here nProcs is the number of processes in the communicator and myRank the
>> rank of the local partition in the communicator.
>> Note that for differently sized elements the workload varies, so for load
>> balancing the elements should not be distributed equally, and other
>> partitioning strategies might yield yet other distributions. In any case it
>> is generally not possible to find a common divisor greater than one across
>> all partition sizes (e.g. partitions of 3, 5 and 7 elements share no common
>> divisor greater than 1). Thus it is not easily possible to define a larger
>> etype that spans more than a single element.
>> 
>> What is the recommended way to implement efficient parallel I/O with MPI
>> in such a scenario?
>> 
>> Thanks a lot,
>> Harald
>> 
>> --
>> Dipl.-Ing. Harald Klimach
>> 
>> German Research School for
>> Simulation Sciences GmbH
>> Schinkelstr. 2a
>> 52062 Aachen | Germany
>> 
>> Tel +49 241 / 80-99758
>> Fax +49 241 / 806-99758
>> Web www.grs-sim.de
>> JID: haraldkl at jabber.ccc.de
>> 
>> Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
>> Registered in the commercial register of the local court of
>> Düren (Amtsgericht Düren) under registration number HRB 5268
>> Registered office: Jülich
>> Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes
> 
> -- 
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)
> 
> _______________________________________________
> mpi-forum mailing list
> mpi-forum at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum




