[Mpi-forum] MPI-IO for mesh based simulations beyond 2 billion elements

Rob Latham robl at mcs.anl.gov
Mon Feb 20 09:26:28 CST 2012


On Sun, Feb 19, 2012 at 10:51:42AM -0600, Rajeev Thakur wrote:
> Harald,
>             If all you have is a one-dimensional array, and each process reads or writes a large contiguous chunk of it, you don't need to use Type_create_subarray or even use collective I/O. You can just have each process call MPI_File_read_at or write_at with the right offset. The subarray type is more useful if you have a multidimensional array and need to read a subsection of it that is not contiguously located in the file.

Hi Rajeev: that's not entirely true any longer.  On BlueGene,
collective I/O will carry out aggregation and alignment to block
boundaries.  On Lustre, collective I/O can conjure up a workload such
that all operations from a particular client go to one server.  

So we should try to encourage anything that allows collective I/O,
even for large requests.

64 bit MPI_Aint on 32 bit platforms isn't perfect (I think there's
some problem in the Fortran cases?)  but it's allowed us to actually
get work done on BlueGene.

==rob

> 
> Rajeev
> 
> 
> On Feb 19, 2012, at 4:23 AM, Rolf Rabenseifner wrote:
> 
> > Dear Harald, dear I/O and datatype committee members,
> > 
> > Indeed, you need a workaround. Our LARGE COUNT #265 is not enough.
> > The workaround (instead of MPI_TYPE_CREATE_SUBARRAY) is:
> > 
> > 1. Call MPI_FILE_SET_VIEW(fh, 0, vectype, vectype, datarep, MPI_INFO_NULL, iError)
> > 2. Call MPI_FILE_GET_TYPE_EXTENT(fh, vectype, vectype_byte_extent, iError)
> > 3. Byte_elemOffset = elemOff (as in your source code) * vectype_byte_extent
> >    Byte_GlobalSize = globElems (as in your source code) * vectype_byte_extent
> > 4. Call MPI_Type_contiguous(locElems, vectype, chunktype, iError)
> > 5. Call MPI_TYPE_CREATE_RESIZED(chunktype, -Byte_elemOffset,
> >                                 Byte_GlobalSize, ftype, iError)
> >    (I hope that I understood this routine correctly
> >     and that a negative lb is needed here.)
> > 
> > 6. call MPI_File_set_view( fh, 0_MPI_OFFSET_KIND, &
> >    & vectype, ftype, "native", &
> >    & MPI_INFO_NULL, iError )
> > 
> > Any comments by specialists?
> > 
> > Best regards
> > Rolf
> > 
> > ----- Original Message -----
> >> From: "Harald Klimach" <h.klimach at grs-sim.de>
> >> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> >> Sent: Sunday, February 19, 2012 10:05:40 AM
> >> Subject: MPI-IO for mesh based simulations beyond 2 billion elements
> >> 
> >> Dear Rolf,
> >> 
> >> Mesh based simulations are an important backbone of CFD. The mesh
> >> sizes we are able to compute on new HPC systems steadily increase,
> >> even for complex geometrical layouts.
> >> For simple geometries, domains much larger than 2 billion elements
> >> were already computed 10 years ago, see
> >> M. Yokokawa, K. Itakura, A. Uno, T. Ishihara and Y. Kaneda,
> >> "16.4-TFlops Direct Numerical Simulation of Turbulence by a Fourier
> >> Spectral Method on the Earth Simulator", Proceedings of the 2002
> >> ACM/IEEE Conference on Supercomputing, Baltimore MD (2002).
> >> To describe the spatial domain around complex geometries usually some
> >> unstructured
> >> approach has to be followed to allow a sparse representation. This
> >> results in a one-dimensional
> >> list of all elements which is distributed across all processes in a
> >> parallel simulation.
> >> Therefore the natural representation of all elements in each partition
> >> is perfectly described by a one-dimensional MPI subarray datatype.
> >> This would also be the most natural choice to describe the MPI File
> >> view for each
> >> partition to enable dumping and reading of results and restart data.
> >> However, due to the interface definition of MPI_Type_create_subarray,
> >> the global
> >> problem size would be limited to roughly 2 billion elements by MPI.
> >> 
> >> Here is the actual implementation we are facing in our Code:
> >> We use a Lattice-Boltzmann representation, where each element is
> >> described by
> >> 19 MPI_DOUBLE_PRECISION values:
> >> 
> >> call MPI_Type_contiguous(19, MPI_DOUBLE_PRECISION, vectype, iError)
> >> call MPI_Type_commit( vectype, iError )
> >> 
> >> This type is then used to build up the subarray description of each
> >> partition:
> >> 
> >> call MPI_Type_create_subarray( 1, [ globElems ], [ locElems ], &
> >> & [ elemOff ], MPI_ORDER_FORTRAN, vectype, ftype, iError)
> >> call MPI_Type_commit( ftype, iError )
> >> 
> >> Where globElems is the global number of Elements, locElems the number
> >> of
> >> Elements in the process local partition, and elemOff the Offset of the
> >> process
> >> local partition in terms of elements. This type is then used to set
> >> the view
> >> for the local process on a file opened for MPI-IO:
> >> 
> >> call MPI_File_open( MPI_COMM_WORLD, &
> >> & 'restart.dat', &
> >> & MPI_MODE_WRONLY+MPI_MODE_CREATE, &
> >> & MPI_INFO_NULL, fhandle, &
> >> & iError )
> >> 
> >> call MPI_File_set_view( fhandle, 0_MPI_OFFSET_KIND, &
> >> & vectype, ftype, "native", &
> >> & MPI_INFO_NULL, iError )
> >> 
> >> Finally the data is written to disk with collective write:
> >> 
> >> call MPI_File_write_all(fhandle, data, locElems, vectype, iostatus, &
> >> & iError)
> >> 
> >> The global mesh is sorted according to a space-filling curve and to
> >> provide
> >> mostly equal distribution on all processes, the local number of
> >> elements is
> >> computed by the following integer arithmetic:
> >> 
> >> locElems = globElems / nProcs
> >> if (myrank < mod(globElems, nProcs)) locElems = locElems + 1
> >> 
> >> Where nProcs is the number of processes in the communicator and myRank
> >> the rank of the local partition in the communicator.
> >> Note that for differently sized elements the workload varies, and the
> >> elements shouldn't be distributed equally if the work load is to be
> >> balanced. With different partitioning strategies other distributions
> >> might arise. However, in any case it is not generally possible to find
> >> a common divisor greater than one across all partition sizes.
> >> Thus it is not easily possible to find a larger etype, which spans
> >> more than a single element.
> >> 
> >> What is the recommended implementation for efficient parallel IO
> >> within MPI
> >> in such a scenario?
> >> 
> >> Thanks a lot,
> >> Harald
> >> 
> >> --
> >> Dipl.-Ing. Harald Klimach
> >> 
> >> German Research School for
> >> Simulation Sciences GmbH
> >> Schinkelstr. 2a
> >> 52062 Aachen | Germany
> >> 
> >> Tel +49 241 / 80-99758
> >> Fax +49 241 / 806-99758
> >> Web www.grs-sim.de
> >> JID: haraldkl at jabber.ccc.de
> >> 
> >> Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
> >> Registered in the commercial register of the local court of
> >> Düren (Amtsgericht Düren) under registration number HRB 5268
> >> Registered office: Jülich
> >> Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes
> > 
> > -- 
> > Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> > High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> > Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> > Nobelstr. 19, D-70550 Stuttgart, Germany . (Office: Allmandring 30)
> > 
> > _______________________________________________
> > mpi-forum mailing list
> > mpi-forum at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum
> 

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


