[mpiwg-tools] Using MPI_T

Schulz, Martin schulzm at llnl.gov
Fri Oct 25 00:53:20 CDT 2013


Hi Junchao, 

On Oct 24, 2013, at 9:15 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:

> I agree having a well-defined interface to access the info is good. But I was wondering how to take advantage of this interface and what kinds of users need it (e.g., system admins, tool developers, application programmers, MPI developers). I list my understanding of these users.
> *system admin: select proper parameters to build an optimized MPI library for their platform. However, for this purpose, the manual of an MPI implementation is enough.

The MPI_T interface has the nice side effect that it can be used to auto-generate such a manual for a particular MPI implementation - those settings are not always well documented, and this way we can query the MPI implementation directly. It also enables auto-tuning approaches, which a few people are already working on.
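
As a rough sketch of the basic building block such an auto-tuner (or an application) might use - locating a control variable by name and writing a new value - something like the following could work. The variable name "eager_threshold" is purely hypothetical (cvar names are implementation specific), and the code assumes MPI_T_init_thread() has already been called:

    #include <mpi.h>
    #include <string.h>

    /* Find a control variable by name; returns its index or -1.
       Assumes MPI_T_init_thread() has already been called. */
    static int find_cvar_index(const char *wanted)
    {
        int num, i;
        MPI_T_cvar_get_num(&num);
        for (i = 0; i < num; i++) {
            char name[256], desc[1024];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            if (strcmp(name, wanted) == 0)
                return i;
        }
        return -1;
    }

    /* Hypothetical auto-tuning step: raise an eager-threshold knob.
       "eager_threshold" is a made-up name. */
    void set_eager_threshold(int new_value)
    {
        MPI_T_cvar_handle handle;
        int idx, count;

        idx = find_cvar_index("eager_threshold");
        if (idx < 0)
            return;
        MPI_T_cvar_handle_alloc(idx, NULL, &handle, &count);
        MPI_T_cvar_write(handle, &new_value);
        MPI_T_cvar_handle_free(&handle);
    }

An implementation is free to refuse the write (e.g., with MPI_T_ERR_CVAR_SET_NOT_NOW), so a real auto-tuner would check the return code and fall back gracefully.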

> *tool developer: The T in MPI_T is for tools. But who is the user of such a tool?

End users / code developers

> Yes, a tool can give profiling data for applications. But how do we gain insights from the profiling data and then refine the execution environment, or even refactor the code? Can it be automatic?

This is not an MPI_T question, but a general question for tools. For this, think of MPI_T as hardware counters or PAPI for CPUs: lots of counters to measure, some of them CPU specific, but the data has been used over and over again to guide optimizations. MPI_T, for the first time, breaks open the black box of MPI and allows tools that are platform agnostic, i.e., we can give our users the same tool on Sequoia, Mira, Titan, Stampede, and Linux clusters.
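
As a minimal sketch of how a tool might sample such a "counter" through the performance-variable side of MPI_T, consider the following; the variable name is made up (real names are implementation specific and must be discovered by enumeration), and the code assumes the variable is a single unsigned long long value and that MPI_T_init_thread() has already been called:

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Locate a performance variable by name; returns its index or -1. */
    static int find_pvar_index(const char *wanted)
    {
        int num, i;
        MPI_T_pvar_get_num(&num);
        for (i = 0; i < num; i++) {
            char name[256], desc[1024];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, var_class, bind, readonly, continuous, atomic;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                                &dtype, &enumtype, desc, &desc_len, &bind,
                                &readonly, &continuous, &atomic);
            if (strcmp(name, wanted) == 0)
                return i;
        }
        return -1;
    }

    /* Sample a (hypothetical) unexpected-message-queue-length counter,
       much like a tool would read a hardware counter via PAPI. */
    void sample_queue_length(void)
    {
        MPI_T_pvar_session session;
        MPI_T_pvar_handle handle;
        int idx, count;
        unsigned long long value;   /* assumes MPI_UNSIGNED_LONG_LONG, count 1 */

        idx = find_pvar_index("unexpected_recvq_length");   /* made-up name */
        if (idx < 0)
            return;

        MPI_T_pvar_session_create(&session);
        MPI_T_pvar_handle_alloc(session, idx, NULL, &handle, &count);
        MPI_T_pvar_start(session, handle);

        MPI_T_pvar_read(session, handle, &value);
        printf("unexpected queue length: %llu\n", value);

        MPI_T_pvar_stop(session, handle);
        MPI_T_pvar_handle_free(session, &handle);
        MPI_T_pvar_session_free(&session);
    }

A real tool would of course sample repeatedly during the run rather than read once.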

> *application programmer: The example Anh Vo gives is good. But we need to justify the extra coding complexity.

Another point is pure documentation - users can simply query and print their environment and store it as part of the meta-data. OpenMP has just added a very similar feature at the request of its users.
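
A minimal sketch of such a dump might look like this - it prints each control variable's name and description and, for simple scalar integer variables, its current value, so the output can be stored alongside a job's results (error checking omitted):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, num, i;

        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_T_cvar_get_num(&num);

        for (i = 0; i < num; i++) {
            char name[256], desc[1024];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);

            if (dtype == MPI_INT && bind == MPI_T_BIND_NO_OBJECT) {
                /* Simple scalar integer setting: also print its value. */
                MPI_T_cvar_handle handle;
                int count, value;
                MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
                if (count == 1) {
                    MPI_T_cvar_read(handle, &value);
                    printf("%s = %d  (%s)\n", name, value, desc);
                } else {
                    printf("%s  (%s)\n", name, desc);
                }
                MPI_T_cvar_handle_free(&handle);
            } else {
                printf("%s  (%s)\n", name, desc);
            }
        }

        MPI_T_finalize();
        return 0;
    }

Running something like this once per job and archiving the output gives exactly the kind of meta-data mentioned above.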

> *MPI developer: (I have no idea)

Each MPI has plenty of internal "sensors" for performance tuning and debugging - having a standardized interface will allow developers to use existing (and often highly sophisticated) tools to optimize and debug the MPI implementation itself. Otherwise, ad-hoc tools must be written for each MPI, which is either a huge, unnecessary time sink (that stuff is complicated) or produces tools that won't scale or do the job.

Hope this helps,

Martin



> 
> --Junchao Zhang
> 
> 
> On Thu, Oct 24, 2013 at 5:39 PM, Todd Gamblin <tgamblin at llnl.gov> wrote:
> On Oct 24, 2013, at 3:31 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> 
>> I would call it "selective MPI parameter setting". 
>> But users need to know to what extent the size is "slightly larger", and be notified that doing so is indeed better.
>> BTW, I'm confused: if this case happens often, why not implement it in MPI instead of bothering with MPI_T?
> 
> Even if MPI uses this information for automatic tuning, tools still want to be able to measure this information.  Not all of us live in MPI, and having a well-defined interface for access to information like this from *outside* MPI was the whole point of the MPI_T interface.
> 
> -Todd
> 
> 
> 
>> 
>> --Junchao Zhang
>> 
>> 
>> On Thu, Oct 24, 2013 at 4:50 PM, William Gropp <wgropp at illinois.edu> wrote:
>> Here's a very simple one that applications sometimes do - if your messages are small but slightly larger than the eager threshold, you can often improve performance by increasing the eager threshold.  Some MPI implementations have provided environment variables to do just that; with MPI_T, this can be done within the program, and scoped to the parts of the application that need it.  And using MPI_T, you might be able to discover whether this is indeed a good idea.
>> 
>> Bill
>> 
>> William Gropp
>> Director, Parallel Computing Institute
>> Deputy Director for Research
>> Institute for Advanced Computing Applications and Technologies
>> Thomas M. Siebel Chair in Computer Science
>> University of Illinois Urbana-Champaign
>> 
>> 
>> 
>> 
>> On Oct 24, 2013, at 3:42 PM, Anh Vo wrote:
>> 
>>> I am also not aware of any applications/tools doing such things yet. But that’s an example of how MPI_T might benefit the developers of those applications/tools. Right now, though, I don’t know of any commercial users of MPI_T.
>>>  
>>> --Anh
>>>  
>>> From: mpiwg-tools [mailto:mpiwg-tools-bounces at lists.mpi-forum.org] On Behalf Of Junchao Zhang
>>> Sent: Thursday, October 24, 2013 1:41 PM
>>> To: <mpiwg-tools at lists.mpi-forum.org>
>>> Subject: Re: [mpiwg-tools] Using MPI_T
>>>  
>>> OK. I believe it is an advanced topic. I'm not aware of applications doing such cool things.
>>> If you happen to know of an application that would benefit from MPI_T, I would like to implement it. 
>>> 
>>> --Junchao Zhang
>>>  
>>> 
>>> On Thu, Oct 24, 2013 at 3:30 PM, Anh Vo <Anh.Vo at microsoft.com> wrote:
>>> I would say it depends on the situation. In most cases I would imagine the applications/tools would do the aggregation. And yes, in my example the processes need to communicate to know each other's message pressure.
>>>  
>>> --Anh
>>>  
>>> From: mpiwg-tools [mailto:mpiwg-tools-bounces at lists.mpi-forum.org] On Behalf Of Junchao Zhang
>>> Sent: Thursday, October 24, 2013 1:27 PM
>>> To: <mpiwg-tools at lists.mpi-forum.org>
>>> Subject: Re: [mpiwg-tools] Using MPI_T
>>>  
>>> Hi, Anh,
>>>   I think your example uses feedback to do throttling.
>>>   A further question is: should we do it at the application level (since you mentioned aggregation) or in the MPI runtime?
>>>   The example also implies that processes need to communicate to know each other's pressure.
>>>   Thanks.
>>> 
>>> --Junchao Zhang
>>>  
>>> 
>>> On Thu, Oct 24, 2013 at 2:40 PM, Anh Vo <Anh.Vo at microsoft.com> wrote:
>>> Hi Junchao,
>>> One example is monitoring the length of the unexpected message queue. Basically, when an MPI process receives an incoming message from another MPI process and it has not yet posted a receive for that message, the message is typically copied into an unexpected receive queue. When the process later posts a receive, it loops through the unexpected queue and checks whether any of the messages in the queue matches this receive. If the unexpected queue is too long, you spend a lot of time looping through the queue. Extra memcpy operations are also needed for unexpected receives (vs. the case where the message arrives and there’s already a posted receive for it).
>>>  
>>> By monitoring the length of the unexpected receive queue, the user can adjust the rate of message flow. For example, if the other side processes messages fast enough, you can keep sending lots of small data (such as heartbeats or piggybacked data), but if the other side is slow at processing messages (and thus ends up with a high depth for the unexpected queue), it might be more beneficial to compress or aggregate the messages before sending.
>>>  
>>> --Anh
>>>  
>>> From: mpiwg-tools [mailto:mpiwg-tools-bounces at lists.mpi-forum.org] On Behalf Of Junchao Zhang
>>> Sent: Thursday, October 24, 2013 12:31 PM
>>> To: <mpiwg-tools at lists.mpi-forum.org>
>>> Subject: [mpiwg-tools] Using MPI_T
>>>  
>>> Hello,
>>>   The standard talks about the motivation of MPI_T as "MPI implementations often use internal variables to control their operation and performance. Understanding and manipulating these variables can provide a more efficient execution environment or improve performance for many applications."
>>>   I could imagine that through performance variables, users can learn about MPI internal states during application execution. But how can that be used to improve performance? What EXTRA advantages does MPI_T bring? I don't get the idea.
>>>   Can someone shed light on that?
>>>   Thank you.
>>> --Junchao Zhang
>>> 
>> 
>> 
> 
> ______________________________________________________________________
> Todd Gamblin, tgamblin at llnl.gov, http://people.llnl.gov/gamblin2
> CASC @ Lawrence Livermore National Laboratory, Livermore, CA, USA
> 
> 

________________________________________________________________________
Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
CASC @ Lawrence Livermore National Laboratory, Livermore, USA
