[MPI3-IO] shared file pointer

Dries Kimpe dkimpe at mcs.anl.gov
Tue Jan 31 15:50:27 CST 2012


My opinion:

1) Using non-blocking collectives in the implementation of the
non-blocking I/O functions is pretty straightforward. Yes, it is a bit
harder (it basically means implementing a state machine), but that is a
problem that has already been solved in libnbc (see the sketch after
point 2).

2) I'd have to take a closer look, but what do the I/O consistency rules
say here? What if the next operation is a write instead of a read?
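
Concretely, here is a minimal sketch of the state machine from point 1)
-- my own illustration, not libnbc or ROMIO code; the ordered_op_t
structure and helper names are invented. An MPI_Iscan first computes
this rank's offset, and only once it completes is the file read started:

#include <mpi.h>

typedef enum { OP_SCANNING, OP_READING, OP_DONE } op_state_t;

typedef struct {
    op_state_t   state;
    MPI_File     fh;
    void        *buf;
    int          count;
    MPI_Datatype dtype;
    MPI_Offset   base;    /* shared pointer value when issued    */
    MPI_Offset   nbytes;  /* bytes this rank will read           */
    MPI_Offset   my_end;  /* inclusive-scan result over nbytes   */
    MPI_Request  req;     /* the scan request, then the read req */
} ordered_op_t;

/* Start: kick off the inclusive scan of per-rank byte counts. */
void op_start(ordered_op_t *op, MPI_File fh, MPI_Offset base,
              void *buf, int count, MPI_Datatype dtype,
              MPI_Offset nbytes, MPI_Comm comm)
{
    op->state = OP_SCANNING;
    op->fh = fh;  op->buf = buf;  op->count = count;
    op->dtype = dtype;  op->base = base;  op->nbytes = nbytes;
    op->my_end = nbytes;   /* in-place scan input */
    MPI_Iscan(MPI_IN_PLACE, &op->my_end, 1, MPI_OFFSET,
              MPI_SUM, comm, &op->req);
}

/* Progress the state machine; returns 1 once the read is done.
 * (Advancing the shared pointer itself is omitted here.) */
int op_test(ordered_op_t *op)
{
    int flag = 0;
    MPI_Test(&op->req, &flag, MPI_STATUS_IGNORE);
    if (!flag)
        return 0;
    if (op->state == OP_SCANNING) {
        /* my region is [base + my_end - nbytes, base + my_end) */
        MPI_File_iread_at(op->fh, op->base + op->my_end - op->nbytes,
                          op->buf, op->count, op->dtype, &op->req);
        op->state = OP_READING;
        return 0;
    }
    op->state = OP_DONE;
    return 1;
}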

Also, in the case of the split collectives, doesn't the standard state
that the call is considered complete only when the end call returns?
Does it really state that MPI_File_read_ordered_begin() needs to
update the shared file pointer?

It says on page 422 that no collective I/O routines are permitted on a
file handle concurrently with a split collective. This still allows for:

MPI_File_read_ordered_begin()
MPI_File_read_shared()       [but not read_ordered()]
MPI_File_read_ordered_end()

However, it says that an implementation is free to implement any split
collective data access routine using the corresponding blocking collective
routine when either the begin or end call is issued.

In other words, the example above has -- according to the standard -- two
outcomes: either the collective call logically executes before the
independent one, or the independent one logically executes before the
collective.
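
To make that concrete, here is a minimal program exercising the sequence
(the file name is just a placeholder): depending on which of the two
legal orderings the implementation picks, buf2 holds bytes from before
or from after the region read by the collective.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File   fh;
    char       buf1[4], buf2[4];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "data.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    MPI_File_read_ordered_begin(fh, buf1, 4, MPI_BYTE);
    MPI_File_read_shared(fh, buf2, 4, MPI_BYTE, &status);  /* independent */
    MPI_File_read_ordered_end(fh, buf1, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}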

I'm not sure whether issuing 2 iread_ordered calls would be legal or --
if legal -- whether they would be forced to act as if the shared file
pointer was updated before the read_shared() call (or before the 2nd
iread_ordered, for that matter).

As executing 2 non-blocking MPI_File_iread_ordered() calls would
translate into 2 MPI_Iscan operations, what would happen when using
non-blocking collectives in the following scenario?

MPI_Iscan(comm)
MPI_Iscan(comm)
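
(For reference, MPI-3 does permit several non-blocking collectives to be
outstanding on the same communicator, provided every process issues them
in the same order. Spelled out as a sketch, with per-rank byte counts as
the scan inputs:)

#include <mpi.h>

void two_pointer_updates(MPI_Comm comm, MPI_Offset nbytes1,
                         MPI_Offset nbytes2)
{
    MPI_Request req[2];
    MPI_Offset  end1 = nbytes1, end2 = nbytes2;  /* in-place inputs */

    MPI_Iscan(MPI_IN_PLACE, &end1, 1, MPI_OFFSET, MPI_SUM, comm, &req[0]);
    MPI_Iscan(MPI_IN_PLACE, &end2, 1, MPI_OFFSET, MPI_SUM, comm, &req[1]);

    /* Both scans are in flight at once; they are matched across
     * processes purely by the order in which they were issued. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}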

  Dries


* Moody, Adam T. <moody20 at llnl.gov> [2012-01-27 10:29:32]:

> Using MPI_Iallreduce certainly helps, and I think the progress rules of MPI would allow this to work.  The downside is that you may need to track a long chain of dependencies.

> MPI_File_iread_ordered()
> MPI_File_iread_ordered()
> MPI_File_read_shared()
> MPI_Waitall()

> In this case, fortunately, the second iread_ordered() op does not need to wait on the iallreduce of the first to start its own iallreduce.  It only needs to wait on the iallreduce of the first before it can actually start to read data.  This is good, because otherwise it would force the second iread_ordered to block, which is something we should (probably must) avoid.

> However, the read_shared needs to wait on the iallreduce for both outstanding iread_ordered ops before it can complete.  Thus, MPI ends up tracking all of these dependencies.

> Maybe that's not a big deal?  Or maybe there's another way to implement this?  It'd be good to have the ROMIO folks help out with this one.
> -Adam
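
(On the tracking question: a rough sketch of one possible shape for that
bookkeeping, with hypothetical internals that are nothing like actual
ROMIO code -- each iread_ordered queues its pending pointer-update
request, and the blocking read_shared drains the queue before touching
the shared pointer.)

#include <mpi.h>

#define MAX_PENDING 32

static MPI_Request pending[MAX_PENDING];  /* outstanding updates */
static int         npending = 0;

/* Called by each iread_ordered after starting its MPI_Iscan. */
static void queue_pointer_update(MPI_Request req)
{
    pending[npending++] = req;
}

/* Called by read_shared before it reads: every earlier
 * shared-pointer update must have taken effect first. */
static void drain_pointer_updates(void)
{
    MPI_Waitall(npending, pending, MPI_STATUSES_IGNORE);
    npending = 0;
}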

> ________________________________________
> From: Mohamad Chaarawi [chaarawi at hdfgroup.org]
> Sent: Friday, January 27, 2012 9:25 AM
> To: Quincey Koziol
> Cc: Moody, Adam T.; Dries Kimpe
> Subject: Re: shared file pointer

> Hi All,

> >> As we discussed, I think our current non-blocking IO proposal may compound the problem of shared file pointers.  Consider that you have a program like the following:

> >> MPI_File_read_ordered_begin()
> >> MPI_File_read_shared()
> >> MPI_File_read_ordered_end()

> >> Even though the collective may not return data until the end call, it does update the shared file pointer so that the individual read call in the middle uses the updated pointer value.  The standard currently allows the begin call to be blocking (see MPI 2.2, page 422 lines 18-23).  Given this, a scalable way to update the shared file pointer in the begin call is to execute an MPI_Allreduce() to update the pointer after tallying up the amount from each process.

> >> However, with non-blocking collectives, we don't currently allow the initiation calls to be blocking.  If we follow that convention, then MPI_File_iread_ordered() cannot block, which means we can't update the pointer using the Allreduce method like we can in the split versions.  I think we should hash this out to be sure that it's possible to update the pointer in a scalable way.  Or perhaps we need to say that MPI_File_iread_ordered() can block, but I don't think this would go over well.
> >       Maybe there's a possible "out": could MPI_File_iread_ordered() be collective, but non synchronizing/blocking and use a nonblocking Allreduce operation?  That way the implementation could record (internally) that the Allreduce was pending and wait for it to complete before the MPI_File_read_shared() proceeded (which wouldn't necessarily force the MPI_File_iread_ordered() to complete, just the Allreduce component, so that the shared file pointer address could be updated).  Would that work?

> I'm thinking along the same lines as Quincey mentioned. A non-blocking
> allreduce would work, where MPI would record internally that this
> allreduce needs to complete before starting the blocking read_shared.
> The other option is to allow the iread_ordered to block for the
> allreduce so that the file pointer could be adjusted before returning.

> Thanks,
> Mohamad


> >       I've CC'd Mohamad and Dries, to bring them into the conversation, maybe they'll have more ideas.

> >       Quincey

