[Mpi3-ft] Communicator Virtualization as a step forward

Greg Bronevetsky bronevetsky1 at llnl.gov
Fri Feb 13 10:47:26 CST 2009


>Such statements are way too broad to be true. In fact it depends on
>what recovery mode was used. Please read the document I sent a few
>emails ago, to see all the capabilities that FT-MPI provided.
I make this statement because the FT-MPI model requires all processes 
to participate in recovery. At the very least, they need to 
participate in recreating communicators. This could scale poorly if 
we're using millions of processes. The problem I have is that this is 
a property of the model, not of one implementation of it.

>Again this is not true. First, one will need a kind of database to
>store this information (distributed or centralized) that comes with its
>own scalability and cost problems. In addition, in the context of

Good point. The scalability and cost problems will absolutely need to 
be studied. However, as Thomas also pointed out, layering the current 
API on top of FT-MPI will be complex and will have very different 
performance properties, making it useless for such a study.

>recovery the new processes will have to retrieve this information and
>let everybody else know not only their new contact information but the
>fact that they are back in the specified communicator. Unfortunately,
>this is [again] _NOT_ a local operation. The fact that you seem to
>plan to delegate these problems to the runtime environment does not
>make it local or more scalable.

You're right, any implementation would need to perform additional 
communication in order to support the "local rejoin" API, so it is 
not truly local. However, the key difference is that in FT-MPI the 
recovery must employ a series of global collective operations that 
require all processes to synchronize. In contrast, runtime support 
for the local rejoin option is much less tightly coupled. When a 
process sends a message to another process, it attaches the 
receiver's expected rank in MPI_COMM_WORLD to the message. If the 
receiver exists and has the right rank, the message is delivered 
normally. If not, the sender gets an error and needs to ask the 
runtime environment for the correct physical address for the given 
receiver rank. This scheme probably involves as much overall 
communication as the everybody-synchronize approach, but since the 
communication is decoupled, it should have a much smaller impact on 
performance. One thing I don't know is how to do the above for puts 
and gets. Is it possible for RDMA hardware to do any verification of 
the destination process?
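
To make the send path above concrete, here is a rough sketch in Python. 
It is a toy single-process simulation, not MPI code, and every name in 
it (Runtime, Sender, StaleAddress, place, lookup, deliver) is made up 
for illustration. It assumes the runtime keeps a rank-to-address 
directory and that each sender caches addresses locally, which is one 
possible way to do it, not a settled design.

```python
from dataclasses import dataclass, field


class StaleAddress(Exception):
    """The process at the target address does not hold the expected rank."""


@dataclass
class Runtime:
    """Hypothetical runtime environment (an assumption, not an MPI API)."""
    directory: dict = field(default_factory=dict)  # rank -> current address
    live: dict = field(default_factory=dict)       # address -> rank living there
    lookups: int = 0                               # directory queries served

    def place(self, rank, address):
        """Start (or restart) `rank` at `address`; its old address goes dead."""
        self.live = {a: r for a, r in self.live.items() if r != rank}
        self.live[address] = rank
        self.directory[rank] = address

    def lookup(self, rank):
        """Return the current physical address of `rank`."""
        self.lookups += 1
        return self.directory[rank]

    def deliver(self, address, expected_rank, payload):
        """Receiver-side check: accept only if the rank at `address` matches."""
        if self.live.get(address) != expected_rank:
            raise StaleAddress(f"rank {expected_rank} is not at {address}")
        return (expected_rank, payload)


@dataclass
class Sender:
    runtime: Runtime
    cache: dict = field(default_factory=dict)  # local rank -> address cache

    def send(self, rank, payload):
        """Send with the expected rank attached; re-resolve once on error."""
        if rank not in self.cache:
            self.cache[rank] = self.runtime.lookup(rank)
        try:
            return self.runtime.deliver(self.cache[rank], rank, payload)
        except StaleAddress:
            # Only this sender talks to the runtime; no global synchronization.
            self.cache[rank] = self.runtime.lookup(rank)
            return self.runtime.deliver(self.cache[rank], rank, payload)


if __name__ == "__main__":
    rt = Runtime()
    rt.place(3, "nodeA:1")
    s = Sender(rt)
    print(s.send(3, "hello"))        # delivered via the cached address
    rt.place(3, "nodeB:7")           # rank 3 fails and rejoins elsewhere
    print(s.send(3, "hello again"))  # stale address detected, then re-resolved
    print(rt.lookups)
```

The point of the sketch is that only senders that actually contact the 
restarted rank pay for a directory lookup, and they pay it at the time 
of their next send rather than in a collective recovery step. It says 
nothing about puts and gets, where there is no receiver-side code to 
run the rank check.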

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov  



