From treumann at [hidden] Thu Jul 1 10:32:04 2010
From: treumann at [hidden] (Richard Treumann)
Date: Thu, 1 Jul 2010 11:32:04 -0400
Subject: [Mpi-22] reConfigurable MPI
In-Reply-To:
Message-ID:

Most discussion I have heard relates to how a running job that decides, at a well-defined point, to grow itself can do that. The MPI standard currently tries to cover the aspects of this that fit within a library model and leaves the resource manager outside the standard.

This does not seem to be your focus. Tell me if I have misunderstood. You seem to be interested in the situation where resources are added to a cluster (or maybe freed up by other jobs completing) and a running MPI job is notified asynchronously that there are newly available resources it can make a bid for.

This idea of the cluster manager pushing resources to a job without regard to where the job is in its execution would bring many new issues. I am not aware of anybody having made a serious attempt even to define what would be needed inside the MPI standard to let applications catch and act on an asynchronous notification like this. My first guess is that pushing an offer of additional resources would not be very hard to design into a resource manager, but the MPI API side of how to react asynchronously to that offer would be very complex.

Either way (running job decides to try for more resources vs. resource manager tries to volunteer more resources to a running job), I think the MPI standard cannot try to specify very much about how the resource manager does its part.

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

> As you said, that paragraph is in the context of process creation,
> and yes! My question was about runtime reconfiguration of an MPI
> cluster (as a whole) based on dynamic application requirements.
> But I think that would be impossible without having flexible resource
> control. When a new node joins a cluster, some preparatory work must be
> done to enable the new node to act as a member of that cluster (at run
> time); also, someone must be informed about the new resources that have
> just been added to the cluster by the entrance of the new node, and I
> think all these preparations need that missing flexible resource manager.
>
> Regards,
> Siavash
> _______________________________________________
> mpi-22 mailing list
> mpi-22_at_[hidden]
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-22

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From siavash.ghiyasvand at [hidden] Thu Jul 1 14:26:39 2010
From: siavash.ghiyasvand at [hidden] (siavash ghiasvand)
Date: Thu, 1 Jul 2010 23:56:39 +0430
Subject: [Mpi-22] reConfigurable MPI
In-Reply-To:
Message-ID:

> You seem to be interested in the situation where resources are added to a
> cluster (or maybe freed up by other jobs completing) and having a running
> MPI job get notified asynchronously that there are newly available resources
> it can make a bid for.

Absolutely, this is what I want to do. But first of all I tried to learn why MPI, as the leader of the HPC cluster world, didn't include this concept in its standard (maybe it's totally against the HPC world!). With Mr. Solt's, Mr. Gropp's, and your guidance I now know the answer to that "why?".

> This idea of the cluster manager pushing resources to a job without regard
> to where the job is in its execution would bring lots of new issues. I am
> not aware of anybody having made a serious attempt to even define what would
> be needed inside the MPI standard to let applications catch and act on an
> asynchronous notification like this.
I heard about (I'm not sure) something like this in the "MPI/GAMMA Project" [1], which pushes additional resources to a running MPI cluster; when the running program reaches an MPI_Barrier point, the new resources get involved (completely asynchronously).

> My first guess is that pushing an offer of additional resource would not be
> very hard to design into a resource manager but the MPI API side of how to
> react asynchronously to that offer would be very complex.

You are right, the automatic way of handling this is really daunting. For example, if we have divided a loop across 5 machines and the cluster is running now, how can we involve a new (6th) machine without restarting the entire cluster?!

> Running job decides to try for more resource vs resource manager tries to
> volunteer more resource to running job

In PVM we have two functions, pvm_addhosts and pvm_delhosts [2], and they can more or less handle the first case ("running job decides to try for more resource"), but the great issue is with the second one, "resource manager tries to volunteer more resource to running job", which means jobs are not aware of those new resources.

Any help or idea on this concept would be greatly appreciated.

[1] http://www.disi.unige.it/project/gamma/mpigamma/
[2] http://docs.cray.com/books/004-3686-001/html-004-3686-001/vemjlb.html

Regards,
Siavash

From treumann at [hidden] Tue Jul 6 08:39:16 2010
From: treumann at [hidden] (Richard Treumann)
Date: Tue, 6 Jul 2010 09:39:16 -0400
Subject: [Mpi-22] reConfigurable MPI
In-Reply-To:
Message-ID:

Siavash,

Adding something this complex to the MPI Standard would require that a handful of people who think it is important join the MPI Forum, work out a proposed addition to the standard, and provide a "proof of concept" implementation.
The MPI Standard is developed by people with regular jobs who consider it worthwhile to donate some of their time to the MPI Forum. Obviously, most work for organizations that are also willing to fund modest support for the MPI Forum. There is no paid staff that can be asked to work out something new. The people who want something new need to do the work.

There is a Fault Tolerance subcommittee that may be doing some things that overlap what you want, but my personal feeling is that the primary goals of the fault tolerance subcommittee are already so challenging that being asked to add even more to their current set of goals would not go over well.

The MPI Standard, at its core, is designed around the following ideas (an incomplete list, obviously):

1) Applications do not need to check for errors, because the library will do what the application asks if it can and issue a fatal error if success is not possible.

2) Communicators do not change membership (this allows collective operations to avoid the overheads and unpredictability of possible membership additions/subtractions).

3) Communicator creations are always collective and deterministic. Everything that needs to be done can be done via messages among participants (no interaction with a supervisor daemon needed).

4) Tasks or processes that make up a job run independently except when an explicit application call forces them to interact. (Send/Recv forces 2 tasks to interact; Barrier on a communicator forces all tasks of that communicator to interact. Tasks that are not part of the interaction specified by the application are semantically unaffected.)

These characteristics make both fault tolerance and the kind of extensions you envision very challenging. I do not know of any help anyone can offer you within the existing MPI Standard. Participation in the MPI Forum is generally open to anyone who can find the time and can attend the meetings.
(BTW - I think most MPI applications are run on managed clusters that are large enough to run several jobs at once. These clusters do not have resources come and go often. A node that needs routine repair or upgrade will be deleted from the resource pool when it finishes a job, and when it is added back to the pool, some job waiting in the work queue will get it. The scenario in which it is better for a running job to claim the node on the fly than for the next job waiting in the queue to be assigned it is probably rare.)

Dick

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

siavash ghiasvand wrote on 07/01/2010 03:26:39 PM:

> > You seem to be interested in the situation where resources are added
> > to a cluster (or maybe freed up by other jobs completing) and having
> > a running MPI job get notified asynchronously that there are newly
> > available resources it can make a bid for.
>
> Absolutely, this is what I want to do. But first of all I tried to
> learn why MPI, as the leader of the HPC cluster world, didn't include
> this concept in its standard (maybe it's totally against the HPC
> world!). With Mr. Solt's, Mr. Gropp's, and your guidance I now know
> the answer to that "why?".
>
> > This idea of the cluster manager pushing resources to a job without
> > regard to where the job is in its execution would bring lots of new
> > issues. I am not aware of anybody having made a serious attempt to
> > even define what would be needed inside the MPI standard to let
> > applications catch and act on an asynchronous notification like this.
> I heard about (I'm not sure) something like this in the "MPI/GAMMA
> Project" [1], which pushes additional resources to a running MPI
> cluster; when the running program reaches an MPI_Barrier point, the
> new resources get involved (completely asynchronously).
>
> > My first guess is that pushing an offer of additional resource would
> > not be very hard to design into a resource manager but the MPI API
> > side of how to react asynchronously to that offer would be very complex.
>
> You are right, the automatic way of handling this is really daunting.
> For example, if we have divided a loop across 5 machines and the
> cluster is running now, how can we involve a new (6th) machine
> without restarting the entire cluster?!
>
> > Running job decides to try for more resource vs resource manager
> > tries to volunteer more resource to running job
>
> In PVM we have two functions, pvm_addhosts and pvm_delhosts [2], and
> they can more or less handle the first case ("running job decides to
> try for more resource"), but the great issue is with the second one,
> "resource manager tries to volunteer more resource to running job",
> which means jobs are not aware of those new resources.
>
> Any help or idea on this concept would be greatly appreciated.
>
> [1] http://www.disi.unige.it/project/gamma/mpigamma/
> [2] http://docs.cray.com/books/004-3686-001/html-004-3686-001/vemjlb.html
>
> Regards,
> Siavash

From cl5882 at [hidden] Wed Jul 7 02:51:56 2010
From: cl5882 at [hidden] (Christos Lamprakis)
Date: Wed, 07 Jul 2010 10:51:56 +0300
Subject: [Mpi-22] MPI_Scatter structure's field
Message-ID: <20100707105156.77571942cmjckv18@webmail.duth.gr>

Hello everybody,

In the code below I don't know how to scatter the structure's field genes. I would like to scatter them in chunks of 2.
So for N old_chrome and M genes I will have:

at node 0: old_chrome[0-N].genes[0-1]
at node 1: old_chrome[0-N].genes[2-3]
...
at node x: old_chrome[0-N].genes[(M-2)-(M-1)]

Any help on how to scatter? Many thanks.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <mpi.h>
#include "mylib.h"

#define NUM_OF_GENERATIONS 10000
#define NUM_OF_GENES 3
#define NUM_OF_CHROMES 5
#define NUM_OF_FORCES 3

struct chromosome {
    double fitness;
    double genes[NUM_OF_GENES];
    double up_limit[NUM_OF_GENES];
    double low_limit[NUM_OF_GENES];
};

struct chromosome old_chrome[NUM_OF_CHROMES], new_chrome[NUM_OF_CHROMES],
                  temp_chrome[NUM_OF_CHROMES];
double forces[NUM_OF_FORCES];
double matrix[NUM_OF_FORCES][NUM_OF_GENES];
double matrix_local[NUM_OF_FORCES][NUM_OF_GENES];
double forces_local[NUM_OF_FORCES];
int CROSS_CHROME[NUM_OF_CHROMES];

int main(int argc, char* argv[])
{
    int size, rank, i, j, n;
    int blockCounts[4] = {1, NUM_OF_GENES, NUM_OF_GENES, NUM_OF_GENES};
    MPI_Datatype types[4];
    MPI_Aint displace[4];
    MPI_Datatype myStructType;
    MPI_Aint base;   /* must be MPI_Aint, not int, to hold an address */

    srand48(time(NULL));
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Address(old_chrome, displace);
    MPI_Address(old_chrome[0].genes, displace + 1);
    MPI_Address(old_chrome[0].up_limit, displace + 2);
    MPI_Address(old_chrome[0].low_limit, displace + 3);

    types[0] = MPI_DOUBLE;
    types[1] = MPI_DOUBLE;
    types[2] = MPI_DOUBLE;
    types[3] = MPI_DOUBLE;

    base = displace[0];
    for (i = 0; i < 4; i++) {
        displace[i] -= base;
    }

    MPI_Type_struct(4, blockCounts, displace, types, &myStructType);
    MPI_Type_commit(&myStructType);

    if (rank == 0) {
        for (i = 0; i < 4; i++) {
            for (j = 0; j < 2; j++) {
                old_chrome[i].genes[j] = drand48();
            }
        }
    }
    for (i = 0; i

From siavash.ghiyasvand at [hidden] (siavash ghiasvand)
Subject: Re: [Mpi-22] reConfigurable MPI
Message-ID:

Dear Richard,

Thank you very very much, I got the point.

Regards,
Siavash

On Tue, Jul 6, 2010 at 6:09 PM, Richard Treumann wrote:

> Siavash
>
> Adding something this complex to the MPI Standard would require that a
> handful of people [...]
> Dick