[Mpi3-ft] MPI_Recv + MPI_Comm_failure_ack

Aurélien Bouteiller bouteill at icl.utk.edu
Fri Mar 15 13:06:52 CDT 2013


The intent was to return MPI_ERR_PROC_FAILED (nothing is pending), and the acknowledgement should also suppress that MPI_ERR_PROC_FAILED return for blocking ANY_SOURCE receives. 

Good catch Dave! 

Aurelien

On 15 March 2013 at 14:01, Wesley Bland <wbland at icl.utk.edu> wrote:

> You're right. A blocking call shouldn't return MPI_ERR_PENDING when there is no request to be pending. I did think we'd covered this some other way. It's definitely the intent for both versions of receive to be able to ignore acknowledged failures.
> 
> 
> On Fri, Mar 15, 2013 at 1:53 PM, David Solt <dsolt at us.ibm.com> wrote:
> I'm pretty sure the intent was that MPI_Recv should NOT return MPI_ERR_PENDING, as there is no request on which the error can be pending, but I don't know how much thought was given to allowing MPI_Recv to ignore acknowledged ranks. 
> Dave 
> 
> 
> 
> From:        Wesley Bland <wbland at icl.utk.edu> 
> To:        "MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" <mpi3-ft at lists.mpi-forum.org>, 
> Date:        03/15/2013 12:45 PM 
> Subject:        Re: [Mpi3-ft] MPI_Recv + MPI_Comm_failure_ack 
> Sent by:        mpi3-ft-bounces at lists.mpi-forum.org 
> 
> 
> 
> I think you are correct in your evaluation, though I also think that wasn't our intent. The intent (unless I'm forgetting a discussion) was to allow MPI_ERR_PENDING to be returned by MPI_RECV and let MPI_COMM_FAILURE_ACK cover both cases. Can anyone else confirm that this was the goal? 
> 
> If that's the case, it's something we'll need to fix in the text. 
> 
> Thanks, 
> Wesley 
> 
> 
> On Fri, Mar 15, 2013 at 12:32 PM, David Solt <dsolt at us.ibm.com> wrote: 
> Based on the proposal: 
> 
> MPI_Comm_failure_ack(blah, blah) 
> 
> This local operation gives the users a way to acknowledge all locally noticed failures on 
> comm. After the call, unmatched MPI_ANY_SOURCE receptions that would have raised an 
> error code MPI_ERR_PENDING due to process failure (see Section 17.2.2) proceed without 
> further reporting of errors due to those acknowledged failures. 
> 
> I think this clearly indicates that MPI_Recv is uninfluenced by calls to MPI_Comm_failure_ack. Therefore, there is no way to call MPI_Recv(MPI_ANY_SOURCE) and ignore failures that have been acknowledged with MPI_Comm_failure_ack. 
> 
> I believe the following code will NOT work (i.e. after the first failure, the MPI_Recv will continuously fail): 
> 
> 
> MPI_Comm_size(intercomm, &size); 
> while (failures < size) { 
>         err = MPI_Recv(blah, blah, MPI_ANY_SOURCE, MPI_ANY_TAG, intercomm, &status); 
>         if (err == MPI_ERR_PROC_FAILED) { 
>                 MPI_Comm_failure_ack(intercomm); 
>                 MPI_Comm_failure_get_acked(intercomm, &group); 
>                 MPI_Group_size(group, &failures); 
>         } else { 
>                 /* process received data */ 
>         } 
> } 
> 
> and has to be written as: 
> 
> MPI_Comm_size(intercomm, &size); 
> MPI_Request request = MPI_REQUEST_NULL; /* no receive outstanding yet */ 
> while (failures < size) { 
> 
>         if (request == MPI_REQUEST_NULL) { 
>                 err = MPI_Irecv(blah, blah, MPI_ANY_SOURCE, MPI_ANY_TAG, intercomm, &request); 
>         } 
>         err = MPI_Wait(&request, &status); 
> 
>         if (err == MPI_ERR_PENDING) { 
>                 MPI_Comm_failure_ack(intercomm); 
>                 MPI_Comm_failure_get_acked(intercomm, &group); 
>                 MPI_Group_size(group, &failures); 
>         } else { 
>                 /* process received data */ 
>         } 
> } 
> 
> Am I correct in my thinking? 
> If so, was there a reason why MPI_Recv could not also "obey" MPI_Comm_failure_ack calls? 
> 
> Thanks, 
> Dave
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft 

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375
