<html><body>

<p>Siavash <br>

<br>

Adding something this complex to the MPI Standard would require that a handful of people who think it is important join the MPI Forum, work out a proposed addition to the standard, and provide a "proof of concept" implementation. The MPI Standard is developed by people with regular jobs who consider it worthwhile to donate some of their time to the MPI Forum. Obviously, most work for organizations that are also willing to fund modest support for the MPI Forum. There is no paid staff that can be asked to work out something new. The people who want something new need to do the work.<br>

<br>

There is a Fault Tolerance subcommittee that may be doing some things that overlap with what you want, but my personal feeling is that the primary goals of the Fault Tolerance subcommittee are already so challenging that being asked to add even more to their current set of goals would not go over well.<br>

<br>

The MPI Standard, at its core, is designed around the following ideas (incomplete list obviously):<br>

<br>

1) Applications do not need to check for errors, because the library will do what the application asks if it can and will issue a fatal error if success is not possible.<br>

2) Communicators do not change membership (this lets collective operations avoid the overhead and unpredictability of possible membership additions/subtractions).<br>

3) Communicator creations are always collective and deterministic. Everything that needs to be done can be done via messages among participants (no interaction with a supervisor daemon needed)<br>

4) Tasks or processes that make up a job run independently except when an explicit application call forces them to interact.  (Send/Recv forces 2 tasks to interact, Barrier on a communicator forces all tasks of that communicator to interact. Tasks that are not part of the interaction specified by the application are semantically unaffected.)<br>
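A toy sketch of ideas 2 and 3, in plain Python rather than real MPI calls: once every member of a fixed group has contributed a (color, key) pair, each member can work out the new group on its own, and all members of one color arrive at the same answer with no supervisor involved. The rule used here (same color groups together, order by key, ties broken by old rank) follows the one MPI_Comm_split specifies; the function name <tt>split_membership</tt> and the sample values are hypothetical and chosen only for illustration.<br>

```python
from typing import List, Tuple


def split_membership(contributions: List[Tuple[int, int]], my_rank: int) -> List[int]:
    """Return the old ranks that form my new group, listed in new-rank order.

    contributions[r] is the (color, key) pair supplied by old rank r, as if
    the pairs had been exchanged among the members by messages alone.
    """
    my_color = contributions[my_rank][0]
    # Keep only members with my color; sort by key, ties broken by old rank.
    members = sorted(
        (key, rank)
        for rank, (color, key) in enumerate(contributions)
        if color == my_color
    )
    return [rank for _key, rank in members]


# Membership of the parent group is fixed: five ranks, five contributions.
contributions = [(0, 5), (1, 0), (0, 1), (1, 0), (0, 5)]

# Every rank computes its own view independently of the others.
views = [split_membership(contributions, r) for r in range(len(contributions))]

for rank, view in enumerate(views):
    print(f"old rank {rank} sees new group {view}")
```

Because the parent membership cannot change while the ranks are computing, every rank of a given color is guaranteed to derive the same list. A sixth rank arriving midway would break exactly that guarantee.<br>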

<br>

These characteristics make both Fault Tolerance and the kind of extensions you envision very challenging.<br>

<br>

I do not know of any help anyone can offer you within the existing MPI Standard. Participation in the MPI Forum is generally open to anyone who can find the time and can attend the meetings.<br>

<br>

(BTW - I think most MPI applications are run on managed clusters that are large enough to run several jobs at once. These clusters do not have resources come and go often. A node that needs routine repair or upgrade will be deleted from the resource pool when it finishes a job, and when it is added back to the pool, some job that is waiting in the work queue will get it. The case where it is better for a running job to claim the node on the fly than for the next job waiting in the queue to be assigned it is probably rare.)<br>

<br>

        Dick <br>

<br>

<br>

Dick Treumann  -  MPI Team           <br>

IBM Systems & Technology Group<br>

Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601<br>

Tele (845) 433-7846         Fax (845) 433-8363<br>

<br>

<br>

<tt>siavash ghiasvand &lt;siavash.ghiyasvand@gmail.com&gt; wrote on 07/01/2010 03:26:39 PM:<br>

<br>

</tt><br>

<tt>> <br>

> Re: [Mpi-22] reConfigurable MPI</tt><br>

<tt>> <br>

> siavash ghiasvand </tt><br>

<tt>> <br>

> to:</tt><br>

<tt>> <br>

> MPI 2.2</tt><br>

<tt>> <br>

> 07/01/2010 03:27 PM</tt><br>

<tt>> <br>

> Cc:</tt><br>

<tt>> <br>

> Richard Treumann</tt><br>

<tt>> <br>

> You seem to be interested in the situation where resources are added<br>

> to a cluster (or maybe freed up by other jobs completing) and having<br>

> a running MPI job get notified asynchronously that there are newly <br>

> available resources it can make a bid for.</tt><br>

<tt>>  </tt><br>

<tt>> Absolutely, this is what I want to do. But first of all I tried to <br>

> know why MPI as the leader of HPC clusters world didn't include this<br>

> concept in its standard (May be it's totally against the HPC <br>

> world!). With Mr.Solt, Mr.Gropp and your guides now I know the <br>

> answer of that "Why?"</tt><br>

<tt>>  </tt><br>

<tt>>  </tt><br>

<tt>> This idea of the cluster manager pushing resources to a job without <br>

> regard to where the job is in its execution would bring lots of new <br>

> issues. I am not aware of anybody having made a serious attempt to <br>

> even define what would be needed inside the MPI standard to let <br>

> applications catch and act on an asynchronous notification like this.</tt><br>

<tt>>  </tt><br>

<tt>> I heard about (I'm not sure) something like this in "MPI/GAMMA Project" [1],<br>

> which pushes additional resources to a running MPI cluster and <br>

> when the running program reaches mpi_barrier point those new <br>

> resources are getting involved (completely asynchronous).</tt><br>

<tt>>  </tt><br>

<tt>>  </tt><br>

<tt>> My first guess is that pushing an offer of additional resource would<br>

> not be very hard to design into a resource manager but the MPI API <br>

> side of how to react asynchronously to that offer would be very complex.</tt><br>

<tt>>  </tt><br>

<tt>> You are right, the automatic way for handling this, is really <br>

> breathtaking. for example, if we divided a loop for 5 machines and <br>

> the cluster is running now, how we can involve a new (6th) machine <br>

> without restarting the entire cluster?!</tt><br>

<tt>>  </tt><br>

<tt>>  </tt><br>

<tt>> Running job decides to try for more resource vs resource manager <br>

> tries to volunteer more resource to running job</tt><br>

<tt>>  </tt><br>

<tt>> In PVM we have two functions pvm_addhosts and pvm_delhosts [2] and <br>

> they can more or less handle the first type ("Running job decides to<br>

> try for more resource") but the great issue is with the second one: <br>

> "resource manager tries to volunteer more resource to running job" <br>

> which means jobs are not aware about those new resources.</tt><br>

<tt>>  </tt><br>

<tt>> Any help or Idea in this concept would be greatly appreciated.</tt><br>

<tt>>  </tt><br>

<tt>>  </tt><br>

<tt>> [1]  <a href="http://www.disi.unige.it/project/gamma/mpigamma/">http://www.disi.unige.it/project/gamma/mpigamma/</a></tt><br>

<tt>> [2]  <a href="http://docs.cray.com/books/004-3686-001/html-004-3686-001/vemjlb.html">http://docs.cray.com/books/004-3686-001/html-004-3686-001/vemjlb.html</a></tt><br>

<tt>>  </tt><br>

<tt>>  </tt><br>

<tt>>  </tt><br>

<tt>> Regards,</tt><br>

<tt>> Siavash</tt></body></html>