[Mpi-22] reConfigurable MPI

Richard Treumann treumann at [hidden]
Tue Jul 6 08:39:16 CDT 2010


Adding something this complex to the MPI Standard would require that a
handful of people who think it is important join the MPI Forum, work
out a proposed addition to the standard, and provide a "proof of concept"
implementation. The MPI Standard is developed by people with regular jobs
who consider it worthwhile to donate some of their time to the MPI Forum.
Obviously, most work for organizations that are also willing to fund modest
support for the MPI Forum. There is no paid staff that can be asked to work
out something new. The people who want something new need to do the work.

There is a Fault Tolerance subcommittee that may be doing some things that
overlap what you want, but my personal feeling is that the primary goals of
the Fault Tolerance subcommittee are already so challenging that being
asked to add even more to their current set of goals would not go over
well.

The MPI Standard, at its core, is designed around the following ideas
(incomplete list obviously):

1) Applications do not need to check for errors because the library will do
what the application asks if it can and issue a fatal error if success is
not possible.
2) Communicators do not change membership (allows collective operations to
avoid overheads and unpredictability from possible membership changes).
3) Communicator creations are always collective and deterministic.
Everything that needs to be done can be done via messages among
participants (no interaction with a supervisor daemon needed)
4) Tasks or processes that make up a job run independently except when an
explicit application call forces them to interact.  (Send/Recv forces 2
tasks to interact, Barrier on a communicator forces all tasks of that
communicator to interact. Tasks that are not part of the interaction
specified by the application are semantically unaffected.)

These characteristics make both Fault Tolerance and the kind of extensions
you envision very challenging.
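
As a rough illustration of why fixed membership (point 2) matters for the
"loop divided among 5 machines" case raised in the quoted message, here is
a minimal sketch in plain Python rather than MPI. The function name
block_range and the loop length of 100 are hypothetical, chosen only for
illustration. It assumes a static block decomposition in which each task
derives its share of the loop from its rank and the task count alone:

```python
def block_range(rank, size, n):
    """Return the half-open range [lo, hi) of loop iterations owned by `rank`.

    Hypothetical helper: a static block decomposition of n iterations
    over `size` tasks, computed locally with no communication.
    """
    base, extra = divmod(n, size)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


n = 100  # assumed loop length, for illustration only

# Shares with 5 tasks, then with a 6th task added.
five = [block_range(r, 5, n) for r in range(5)]
six = [block_range(r, 6, n) for r in range(6)]

for r in range(5):
    print(f"rank {r}: {five[r]} with 5 tasks, {six[r]} with 6 tasks")
print(f"rank 5: {six[5]} (new task)")
```

Under these assumptions, every one of the original five tasks ends up with
a different range once the sixth is added, so admitting a task mid-run
means agreeing on a new decomposition and moving partially completed work,
not just handing the newcomer a slice.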

I do not know of any help anyone can offer you within the existing MPI
Standard.  Participation in the MPI Forum is generally open to anyone who
can find the time and can attend the meetings.

(BTW - I think most MPI applications are run on managed clusters that are
large enough to run several jobs at once. These clusters do not have
resources come and go often. A node that needs routine repair or upgrade will
be removed from the resource pool when it finishes a job, and when it is
added back to the pool, some job that is waiting in the work queue will get
it. The theory that it is better for a running job to claim it on the fly
than for the next job waiting in the queue to be assigned the node is probably


Dick Treumann  -  MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363

siavash ghiasvand <siavash.ghiyasvand_at_[hidden]> wrote on 07/01/2010
03:26:39 PM:

> > You seem to be interested in the situation where resources are added
> > to a cluster (or maybe freed up by other jobs completing) and having
> > a running MPI job get notified asynchronously that there are newly
> > available resources it can make a bid for.
>
> Absolutely, this is what I want to do. But first of all I tried to
> find out why MPI, as the leader of the HPC cluster world, didn't include
> this concept in its standard (maybe it's totally against the HPC
> world!). With guidance from Mr. Solt, Mr. Gropp, and you, I now know
> the answer to that "Why?"
>
> > This idea of the cluster manager pushing resources to a job without
> > regard to where the job is in its execution would bring lots of new
> > issues. I am not aware of anybody having made a serious attempt to
> > even define what would be needed inside the MPI standard to let
> > applications catch and act on an asynchronous notification like this.
>
> I have heard (though I am not sure) of something like this in the
> "MPI/GAMMA Project" [1], which pushes additional resources to a running
> MPI cluster; when the running program reaches an mpi_barrier point,
> those new resources get involved (completely asynchronously).
>
> > My first guess is that pushing an offer of additional resource would
> > not be very hard to design into a resource manager but the MPI API
> > side of how to react asynchronously to that offer would be very complex.
>
> You are right, handling this automatically is really breathtaking.
> For example, if we divided a loop among 5 machines and the cluster is
> now running, how can we involve a new (6th) machine without restarting
> the entire cluster?!
>
> > Running job decides to try for more resource vs resource manager
> > tries to volunteer more resource to running job
>
> In PVM we have two functions, pvm_addhosts and pvm_delhosts [2], which
> can more or less handle the first type ("Running job decides to try for
> more resource"), but the great issue is with the second one, "resource
> manager tries to volunteer more resource to running job", which means
> jobs are not aware of those new resources.
>
> Any help or idea on this concept would be greatly appreciated.
>
> [1]  http://www.disi.unige.it/project/gamma/mpigamma/
> [2]
> Regards,
> Siavash,
