[Mpi3-ft] MPI_Recv + MPI_Comm_failure_ack
Aurélien Bouteiller
bouteill at icl.utk.edu
Fri Mar 15 13:06:52 CDT 2013
The intent was for a blocking receive to return MPI_ERR_PROC_FAILED (nothing is pending there), and for MPI_Comm_failure_ack to also stop MPI_ERR_PROC_FAILED from being returned by blocking MPI_ANY_SOURCE receives.
Good catch Dave!
Aurelien
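
The intended behavior described above can be modeled without MPI as a small C sketch. Everything prefixed with toy_ or TOY_ is hypothetical and exists only for illustration; none of it is part of the proposal or of any MPI library. The model assumes a blocking MPI_ANY_SOURCE receive reports a process failure only while some locally noticed failure is still unacknowledged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical return codes for this toy model (not MPI constants). */
enum { TOY_SUCCESS = 0, TOY_ERR_PROC_FAILED = 1, TOY_WOULD_BLOCK = 2 };

#define TOY_MAX_PEERS 8

/* Toy stand-in for the failure state a communicator tracks locally. */
typedef struct {
    int  npeers;                 /* number of peers, at most TOY_MAX_PEERS */
    bool failed[TOY_MAX_PEERS];  /* failures noticed locally */
    bool acked[TOY_MAX_PEERS];   /* failures the user has acknowledged */
    int  pending_msgs;           /* messages already queued for delivery */
} toy_comm;

/* Models MPI_Comm_failure_ack: acknowledge every locally noticed failure. */
static void toy_failure_ack(toy_comm *c) {
    assert(c->npeers <= TOY_MAX_PEERS);
    for (int i = 0; i < c->npeers; i++)
        if (c->failed[i])
            c->acked[i] = true;
}

/* Models MPI_Comm_failure_get_acked followed by MPI_Group_size. */
static int toy_acked_count(const toy_comm *c) {
    int n = 0;
    for (int i = 0; i < c->npeers; i++)
        if (c->acked[i])
            n++;
    return n;
}

/* Models a blocking any-source receive under the assumed semantics:
 * an error is reported only while an unacknowledged failure exists.
 * TOY_WOULD_BLOCK marks the point where a real receive would wait. */
static int toy_recv_any_source(toy_comm *c) {
    for (int i = 0; i < c->npeers; i++)
        if (c->failed[i] && !c->acked[i])
            return TOY_ERR_PROC_FAILED;
    if (c->pending_msgs == 0)
        return TOY_WOULD_BLOCK;
    c->pending_msgs--;
    return TOY_SUCCESS;
}
```

In this model, a receive keeps failing after a peer dies until toy_failure_ack is called, then succeeds again, and a later failure of a different peer is reported anew. That is the behavior Dave's first loop relies on.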
On 15 March 2013, at 14:01, Wesley Bland <wbland at icl.utk.edu> wrote:
> You're right. A blocking call shouldn't return MPI_ERR_PENDING when there is no request to be pending. I did think we'd covered this some other way. It's definitely the intent for both versions of receive to be able to ignore acknowledged failures.
>
>
> On Fri, Mar 15, 2013 at 1:53 PM, David Solt <dsolt at us.ibm.com> wrote:
> I'm pretty sure the intent was that MPI_Recv should NOT return MPI_ERR_PENDING, as there is no request on which the error can be pending, but I don't know if much thought was given to allowing MPI_Recv to ignore acknowledged ranks.
> Dave
>
>
>
> From: Wesley Bland <wbland at icl.utk.edu>
> To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" <mpi3-ft at lists.mpi-forum.org>,
> Date: 03/15/2013 12:45 PM
> Subject: Re: [Mpi3-ft] MPI_Recv + MPI_Comm_failure_ack
> Sent by: mpi3-ft-bounces at lists.mpi-forum.org
>
>
>
> I think you are correct in your evaluation, though I also think that wasn't our intent. I think the intent (unless I'm forgetting a discussion) was to allow MPI_ERR_PENDING to be returned by MPI_RECV and let MPI_COMM_FAILURE_ACK cover both cases. Can anyone else confirm that this was the goal?
>
> If that's the case, it's something we'll need to fix in the text.
>
> Thanks,
> Wesley
>
>
> On Fri, Mar 15, 2013 at 12:32 PM, David Solt <dsolt at us.ibm.com> wrote:
> Based on the proposal:
>
> MPI_Comm_failure_ack(blah, blah)
>
> This local operation gives the users a way to acknowledge all locally noticed failures on
> comm. After the call, unmatched MPI_ANY_SOURCE receptions that would have raised an
> error code MPI_ERR_PENDING due to process failure (see Section 17.2.2) proceed without
> further reporting of errors due to those acknowledged failures.
>
> I think this clearly indicates that MPI_Recv is unaffected by calls to MPI_Comm_failure_ack. Therefore, there is no way to call MPI_Recv(MPI_ANY_SOURCE) and ignore failures that were acknowledged with MPI_Comm_failure_ack.
>
> I believe the following code will NOT work (i.e. after the first failure, the MPI_Recv will continuously fail):
>
>
> MPI_Comm_size(intercomm, &size);
> while (failures < size) {
>     err = MPI_Recv(blah, blah, MPI_ANY_SOURCE, intercomm, &status);
>     if (err == MPI_ERR_PROC_FAILED) {
>         /* acknowledge the failure and count the acknowledged ranks */
>         MPI_Comm_failure_ack(intercomm);
>         MPI_Comm_failure_get_acked(intercomm, &group);
>         MPI_Group_size(group, &failures);
>     } else {
>         /* process received data */
>     }
> }
>
> and has to be written as:
>
> /* request must start out as MPI_REQUEST_NULL */
> MPI_Comm_size(intercomm, &size);
> while (failures < size) {
>
>     if (request == MPI_REQUEST_NULL) {
>         err = MPI_Irecv(blah, blah, MPI_ANY_SOURCE, intercomm, &request);
>     }
>     err = MPI_Wait(&request, &status);
>
>     if (err == MPI_ERR_PENDING) {
>         /* the request stays active; acknowledge and wait on it again */
>         MPI_Comm_failure_ack(intercomm);
>         MPI_Comm_failure_get_acked(intercomm, &group);
>         MPI_Group_size(group, &failures);
>     } else {
>         /* process received data */
>     }
> }
>
> Am I correct in my thinking?
> If so, was there a reason why MPI_Recv could not also "obey" MPI_Comm_failure_ack calls?
>
> Thanks,
> Dave
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375