[mpiwg-tools] reset a stopped pvar

William Gropp wgropp at illinois.edu
Wed Sep 25 09:17:53 CDT 2013


The proliferation of error classes is why I question the decision to not use the code/class structure that has worked so well for the rest of MPI (and requiring support of enough routines to decode error codes, including more detailed error strings, would have been a small change to most or all implementations).  The error code system was designed to provide a mechanism for detailed error reporting without creating a zillion error classes.

Bill

William Gropp
Director, Parallel Computing Institute
Deputy Director for Research
Institute for Advanced Computing Applications and Technologies
Thomas M. Siebel Chair in Computer Science
University of Illinois Urbana-Champaign
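
As a rough illustration of the code/class idea described in this message, here is a minimal Python model, not MPI code. The numeric codes, the class name PVAR_NOT_STARTED, and the helpers error_class and error_string are all hypothetical and invented for this sketch; in MPI proper the corresponding roles are played by MPI_Error_class and MPI_Error_string. The point is only that many detailed codes can share one coarse class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    error_class: str  # coarse category that callers branch on
    detail: str       # specific, implementation-chosen explanation


# Hypothetical codes and class name, invented for this sketch only.
ERROR_TABLE = {
    1001: ErrorInfo("PVAR_NOT_STARTED", "watermark was reset but not restarted"),
    1002: ErrorInfo("PVAR_NOT_STARTED", "pvar was never started"),
}


def error_class(code: int) -> str:
    """Map a detailed error code to its coarse class."""
    return ERROR_TABLE[code].error_class


def error_string(code: int) -> str:
    """Return the detailed message attached to one error code."""
    return ERROR_TABLE[code].detail


for code in sorted(ERROR_TABLE):
    print(code, error_class(code), "-", error_string(code))
```

A tool would branch on the single class while still being able to report the specific cause, so no new class is needed for each new situation.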




On Sep 25, 2013, at 9:09 AM, Junchao Zhang wrote:

> I agree. 
> But I think a better error code name is MPI_T_ERR_PVAR_WATERMARK_NOTSTARTED.
> If you remember an earlier problem I reported, "read a never started continuous pvar", we should also have an MPI_T_ERR_PVAR_NEVERSTARTED.
> 
> --Junchao Zhang
> 
> 
> On Tue, Sep 24, 2013 at 6:50 PM, Martin Schulz <schulzm at llnl.gov> wrote:
> 
> On Sep 19, 2013, at 11:24 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> 
>> For a running (i.e., started) watermark, it is reasonable to return the starting value.
>> But for a stopped one, it is strange to do a read and return what is read. 
> 
> Yes, I agree - I think we are running into a strange case here where definition and intended use don't quite match.
> 
> Let's consider a watermark on a particular resource with values changing as follows:
> 
> 	30
> 	60
> RESET
> 	60
> 	20
> 	70
> 	20
> READ(1)
> 	20
> 	30
> START
> 	30
> 	40
> 	50
> 	35
> 	45
> READ(2)
> 	45
> 	40
> STOP
> 	40
> 	100
> READ(3)
> 	100
> 
> Intuitively, as Kathryn also described, you want the watermark inside the start/stop region, i.e., READ(2) should return 50. Even more important, READ(3) should return 50, since this was the watermark inside the start/stop region. This requires, though, that the starting value is applied at START. If we apply it at RESET, the value at READ(2) is 60, which doesn't make sense at all (in particular given the peak of 70 in between), or it would be 70 if we continue updating between RESET and START, which also doesn't make sense.
> 
> So what should READ(1) return if we keep it completely turned off until we reach START? Perhaps we need a new error code NOTSTARTED?
> 
> Martin
> 
> 
> 
> 
> 
>> 
>> --Junchao Zhang
>> 
>> 
>> On Thu, Sep 19, 2013 at 11:03 AM, Kathryn Mohror <kathryn at llnl.gov> wrote:
>> Hi Junchao,
>> 
>>> 
>>> Also, for a stopped pvar, after reset and before restarting, what does a pvar_read return?
>>> Returning zero sounds good for counters? What about watermarks? Old value, garbage value or MPI_T_ERROR_XXX? I would choose ERROR.
>>> The side effect is that it makes resetting pvars less elegant.
>> 
>> In my interpretation, it returns the starting value of the variable as defined according to the variable class. So, for watermarks, it would be the current value at the time of the reset. I can imagine a scenario where you want to know what the starting value of a variable is for some reason, so you wouldn't want it to be erroneous for a tool to read a non-started variable.
>> 
>> Do others agree with this?
>> 
>> Kathryn
>> 
>>> 
>>> --Junchao Zhang
>>> 
>>> On Thu, Sep 19, 2013 at 12:32 AM, Martin Schulz <schulzm at llnl.gov> wrote:
>>> Hmm, that is a good catch. I agree with Kathryn's interpretation - in particular the use case she is laying out. If one does:
>>> 
>>> Reset
>>> Start
>>> Stop
>>> 
>>> You want the watermark from that interval, i.e., the starting value as of the start call should be the right thing. This is something we definitely should clarify.
>>> 
>>> Thanks,
>>> 
>>> Martin
>>> 
>>> 
>>> 
>>> On Sep 18, 2013, at 8:33 PM, Kathryn Mohror <kathryn at llnl.gov>
>>>  wrote:
>>> 
>>>> Hi Junchao,
>>>> 
>>>>>   What is the right behavior when resetting a stopped pvar? The standard says setting to its starting value.
>>>>>   For counters, timers etc, setting them to zero sounds reasonable.
>>>>>   But for a watermark, setting it to "the current utilization level" looks weird. It implies that a value caught during the stopped period can affect its future value when the pvar is re-started.
>>>>>   Probably, we should reset a stopped watermark to a state as if it has never been started.
>>>>>   Any comments?  Thanks
>>>> 
>>>> Hmm. It makes sense to me, but I'll let others chime in if they disagree. I think that the moment you start the watermark variable, you want to know what the "mark" is, so it would be the value of current utilization. So even if a higher (or lower) value is caught during the stopped period (which it shouldn't be, because variables aren't supposed to be updated when stopped), it will be set to the current utilization value when started. I interpret this as being able to measure the watermark during different epochs of the program execution. Every time you start the variable, it's a fresh epoch and you want to know what the watermark was during that epoch.
>>>> 
>>>> However, I can see how this isn't as clear as it could be -- I'll try to see what we can do to clarify it in the text.
>>>> 
>>>> Thanks again for taking the time to give us this feedback.
>>>> 
>>>> Kathryn
>>>> 
>>>> 
>>>>> --Junchao Zhang
>>>>> _______________________________________________
>>>>> mpiwg-tools mailing list
>>>>> mpiwg-tools at lists.mpi-forum.org
>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-tools
>>>> 
>>>> ______________________________________________________________
>>>> Kathryn Mohror, kathryn at llnl.gov, http://people.llnl.gov/mohror1
>>>> CASC @ Lawrence Livermore National Laboratory, Livermore, CA, USA
>>> 
>>> ________________________________________________________________________
>>> Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
>>> CASC @ Lawrence Livermore National Laboratory, Livermore, USA
>> 
> 
> 

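The watermark trace in Martin's message quoted above can be replayed with a small Python sketch. This is a plain simulation, not MPI_T code: the function simulate and the policy names "at_start", "at_reset_frozen" and "at_reset_tracking" are hypothetical labels for the three interpretations discussed in the thread, and returning None from a read stands in for the proposed NOTSTARTED error.

```python
# Trace from the thread: integers are utilization samples, strings are tool calls.
TRACE = [30, 60, "RESET", 60, 20, 70, 20, "READ", 20, 30, "START",
         30, 40, 50, 35, 45, "READ", 45, 40, "STOP", 40, 100, "READ", 100]


def simulate(trace, policy):
    """Replay the trace and return what each READ would observe."""
    current = None    # latest utilization of the resource
    mark = None       # high watermark; None means no starting value yet
    tracking = False  # whether new samples may raise the watermark
    reads = []
    for event in trace:
        if isinstance(event, int):
            current = event
            if tracking and mark is not None:
                mark = max(mark, current)
        elif event == "RESET":
            if policy == "at_start":
                # Stopped pvar: forget the mark until the next START.
                mark = current if tracking else None
            else:
                # Starting value is taken at RESET time.
                mark = current
                if policy == "at_reset_tracking":
                    tracking = True
        elif event == "START":
            if policy == "at_start":
                mark = current  # fresh epoch begins at the current level
            tracking = True
        elif event == "STOP":
            tracking = False
        elif event == "READ":
            reads.append(mark)
    return reads


for policy in ("at_start", "at_reset_frozen", "at_reset_tracking"):
    print(policy, simulate(TRACE, policy))
```

Under these assumptions, "at_start" yields None, 50, 50 for READ(1) to READ(3), while the two reset-based policies yield 60 and 70 throughout, matching the values argued over in the thread.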