[MPI3-IO] shared file pointer

Quincey Koziol koziol at hdfgroup.org
Wed Feb 1 13:38:32 CST 2012


Hi Dries,

On Jan 31, 2012, at 3:50 PM, Dries Kimpe wrote:

> My opinion:
> 
> 1) Using non-blocking collectives in the implementation of the
> non-blocking I/O functions is pretty straightforward. Yes it is a bit
> harder (basically implementing a state machine) but that is a problem that
> has been solved already in libnbc.

	OK, good.
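
	A rough sketch of what I imagine that state machine looks like, driven from the progress engine.  Everything here (struct ordered_req, op_progress, the two-state split) is invented for illustration; it's not libnbc or ROMIO code:

#include <mpi.h>

enum io_state { IO_SCAN_PENDING, IO_READ_PENDING, IO_DONE };

struct ordered_req {
    enum io_state state;
    MPI_Request   scan_req;  /* MPI_Iscan computing this rank's offset */
    MPI_Request   read_req;  /* independent read once the offset is known */
    MPI_File      fh;
    void         *buf;
    int           count;
    MPI_Datatype  type;
    long          scan_end;  /* inclusive prefix sum of byte counts */
    long          my_bytes;  /* this rank's contribution */
};

/* Called whenever the implementation makes progress (e.g. from
 * MPI_Test/MPI_Wait).  Returns 1 once the whole operation is done. */
static int op_progress(struct ordered_req *r)
{
    int flag = 0;
    if (r->state == IO_SCAN_PENDING) {
        MPI_Test(&r->scan_req, &flag, MPI_STATUS_IGNORE);
        if (!flag) return 0;
        /* Exclusive offset = inclusive scan result minus our own
         * contribution (plus the shared pointer's current value,
         * omitted here). */
        MPI_Offset off = (MPI_Offset)(r->scan_end - r->my_bytes);
        MPI_File_iread_at(r->fh, off, r->buf, r->count, r->type,
                          &r->read_req);
        r->state = IO_READ_PENDING;
    }
    if (r->state == IO_READ_PENDING) {
        MPI_Test(&r->read_req, &flag, MPI_STATUS_IGNORE);
        if (!flag) return 0;
        r->state = IO_DONE;
    }
    return r->state == IO_DONE;
}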

> 2) I'd have to take a closer look, but what do the I/O consistency rules
> say here? What if the next one is write instead of read?

	I don't think that the read vs. write matters...

> Also, in case of the split collectives, doesn't the standard state that
> the call is considered complete when the end call returns?
> Does it really state that the MPI_File_read_ordered_begin() needs to
> update the shared file pointer?
> 
> It says on page 422 that no collective I/O routines are permitted on a
> file handle concurrently with a split collective. This still allows for
> 
> MPI_File_read_ordered_begin()
> MPI_File_read_shared()      [but not read_ordered()]
> MPI_File_read_ordered_end()
> 
> However, it says that an implementation is free to implement any split
> collective data access routine using the corresponding blocking collective
> routine when either the begin or end call is issued.
> 
> In other words, the example above has -- according to the standard -- two
> outcomes: either the collective call logically executes before the
> independent, or the independent logically executes before the collective.

	Is there a well-defined behavior in the standard for this?
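
	To make the two outcomes concrete: with two processes each reading 100 bytes and the shared pointer at zero, the sequence would look like this (just a sketch; fh is assumed open on both processes):

char a[100], b[100];

MPI_File_read_ordered_begin(fh, a, 100, MPI_BYTE);
/* If the begin call already advanced the shared pointer, this
 * independent read starts at offset 200, past both ordered reads;
 * if the update is deferred to the end call, it starts at offset 0.
 * (The two processes' read_shared calls are also serialized in an
 * unspecified order relative to each other.) */
MPI_File_read_shared(fh, b, 100, MPI_BYTE, MPI_STATUS_IGNORE);
MPI_File_read_ordered_end(fh, a, MPI_STATUS_IGNORE);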

> I'm not sure that doing 2 iread_ordered calls would be legal or -- if
> legal -- forced to act as if the shared file pointer was updated before
> the read_shared() call (or the 2nd iread_ordered for that matter).

	Hmm, maybe nonblocking (both independent & collective) I/O on a shared file pointer doesn't make sense at all then?

	Quincey

> As executing 2 non-blocking MPI_File_iread_ordered() calls would translate
> to 2 MPI_Iscan operations, what would happen using non-blocking
> collectives in the following scenario?
> 
> MPI_Iscan (comm)
> MPI_Iscan (comm)
> 
>  Dries
> 
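
	(For reference: as I understand the nonblocking collectives proposal, both scans can be outstanding at once, provided every process starts them in the same order; they are matched by initiation order.  A minimal example, assuming comm is a valid communicator:)

long in1 = 100, in2 = 200, out1, out2;
MPI_Request req[2];

/* Two outstanding nonblocking scans on the same communicator,
 * matched by the order in which each process initiated them. */
MPI_Iscan(&in1, &out1, 1, MPI_LONG, MPI_SUM, comm, &req[0]);
MPI_Iscan(&in2, &out2, 1, MPI_LONG, MPI_SUM, comm, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);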
> 
> * Moody, Adam T. <moody20 at llnl.gov> [2012-01-27 10:29:32]:
> 
>> Using MPI_Iallreduce certainly helps, and I think the progress rules of MPI would allow this to work.  The downside is that you may need to track a long chain of dependencies.
> 
>> MPI_File_iread_ordered()
>> MPI_File_iread_ordered()
>> MPI_File_read_shared()
>> MPI_Waitall()
> 
>> In this case, fortunately, the second iread_ordered() op does not need to wait on the iallreduce of the first to start its own iallreduce.  It only needs to wait on the iallreduce of the first before it can actually start to read data.  This is good, because otherwise it would force the second iread_ordered to block, which is something we should (probably must) avoid.
> 
>> However, the read_shared needs to wait on the iallreduce for both outstanding iread_ordered ops before it can complete.  Thus, MPI ends up tracking all of these dependencies.
> 
>> Maybe that's not a big deal?  Or maybe there's another way to implement this?  It'd be good to have the ROMIO folks help out with this one.
>> -Adam
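
	One way to picture the bookkeeping Adam describes (all names here are invented): the file handle keeps a list of shared-file-pointer updates that have been started but not completed.  A new iread_ordered appends its own iallreduce immediately, without blocking; anything that actually needs the pointer drains the list first:

#include <mpi.h>
#include <stdlib.h>

struct pending_update {
    MPI_Request            req;   /* an outstanding MPI_Iallreduce */
    struct pending_update *next;
};

/* A blocking read_shared, or the data phase of a later
 * iread_ordered, completes every earlier pointer update first. */
static void drain_updates(struct pending_update **head)
{
    while (*head) {
        struct pending_update *done = *head;
        MPI_Wait(&done->req, MPI_STATUS_IGNORE);
        *head = done->next;
        free(done);
    }
}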
> 
>> ________________________________________
>> From: Mohamad Chaarawi [chaarawi at hdfgroup.org]
>> Sent: Friday, January 27, 2012 9:25 AM
>> To: Quincey Koziol
>> Cc: Moody, Adam T.; Dries Kimpe
>> Subject: Re: shared file pointer
> 
>> Hi All,
> 
>>>> As we discussed, I think our current non-blocking IO proposal may compound the problem of shared file pointers.  Consider that you have a program like the following:
> 
>>>> MPI_File_read_ordered_begin()
>>>> MPI_File_read_shared()
>>>> MPI_File_read_ordered_end()
> 
>>>> Even though the collective may not return data until the end call, it does update the shared file pointer so that the individual read call in the middle uses the updated pointer value.  The standard currently allows the begin call to be blocking (see MPI 2.2, page 422, lines 18-23).  Given this, a scalable way to update the shared file pointer in the begin call is to tally up the amount read by each process and advance the pointer with an MPI_Allreduce().
> 
>>>> However, with non-blocking collectives, we don't currently allow the initiation calls to be blocking.  If we follow that convention, then MPI_File_iread_ordered() cannot block, which means we can't update the pointer using the Allreduce method like we can in the split versions.  I think we should hash this out to be sure that it's possible to update the pointer in a scalable way.  Or perhaps we need to say that MPI_File_iread_ordered() can block, but I don't think this would go over well.
>>>      Maybe there's a possible "out": could MPI_File_iread_ordered() be collective, but non-synchronizing and non-blocking, and use a nonblocking Allreduce operation?  That way the implementation could record (internally) that the Allreduce was pending and wait for it to complete before the MPI_File_read_shared() proceeded (which wouldn't force the MPI_File_iread_ordered() to complete, just its Allreduce component, so that the shared file pointer value could be updated).  Would that work?
> 
>> I'm thinking along the same lines as Quincey. A non-blocking
>> allreduce would work, where MPI would record internally that the
>> allreduce needs to complete before the blocking read_shared starts.
>> The other option is to allow the iread_ordered to block on the
>> allreduce so that the file pointer can be adjusted before returning.
> 
>> Thanks,
>> Mohamad
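
	A sketch of what that recorded update could look like inside the initiation call.  The fh fields (my_bytes, pending_total, pending_req, shared_fp) and etype_size are hypothetical, just to show the shape:

/* Post the pointer update without waiting for it. */
fh->my_bytes = (long)count * etype_size;   /* this rank's I/O amount */
MPI_Iallreduce(&fh->my_bytes, &fh->pending_total, 1, MPI_LONG,
               MPI_SUM, comm, &fh->pending_req);
/* Return immediately.  Whoever next needs the shared pointer does
 * MPI_Wait(&fh->pending_req, MPI_STATUS_IGNORE) and then
 * fh->shared_fp += fh->pending_total before using it. */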
> 
> 
>>>      I've CC'd Mohamad and Dries, to bring them into the conversation, maybe they'll have more ideas.
> 
>>>      Quincey
> 
> 




