[Mpi-forum] Question about MPI-4.1's MPI_Get_hw_resources_info()

Jeff Squyres (jsquyres) jsquyres at cisco.com
Mon Nov 27 15:53:13 CST 2023


Guillaume and I have been conversing off list about this issue today.

There's ongoing discussion about this topic in the Topologies WG.

________________________________
From: Jeff Squyres (jsquyres) <jsquyres at cisco.com>
Sent: Monday, November 27, 2023 1:58 PM
To: Guillaume Mercier <guillaume.mercier at u-bordeaux.fr>
Cc: mpi-forum at lists.mpi-forum.org <mpi-forum at lists.mpi-forum.org>
Subject: Re: [Mpi-forum] Question about MPI-4.1's MPI_Get_hw_resources_info()

Ping; still waiting for some answers here.

My issue is that the text in MPI-4.1 is so broad as to be meaningless.  I.e., both implementors and application developers can interpret it any way they want, and therefore the level and type of information delivered by different implementations could (will) be wildly different in both their intent and execution.  "Flexible" is one thing, but "so ambiguous as to be meaningless" is another.
________________________________
From: Jeff Squyres (jsquyres) <jsquyres at cisco.com>
Sent: Thursday, November 16, 2023 11:50 AM
To: Guillaume Mercier <guillaume.mercier at u-bordeaux.fr>; Jeff Squyres (jsquyres) <jsquyres at cisco.com>
Cc: mpi-forum at lists.mpi-forum.org <mpi-forum at lists.mpi-forum.org>
Subject: Re: [Mpi-forum] Question about MPI-4.1's MPI_Get_hw_resources_info()

Amusingly, my email client didn't show me any of your inline replies -- it only showed me your top-post reply.  I didn't see your inline replies until after I sent my mail repeating/summarizing my original questions.  Thanks, Outlook!  🙁

Ok, let me reply to your inline comments:

  *   If we're not supposed to list software or virtual instances, that's... problematic.  Example: the Cisco NIC exports a whole bunch of PCI virtual functions that are discovered by Linux, but it's really one PCI or Mezzanine card with tons of resources on it (each set of resources is created dynamically and addressed separately).  This gets very grey very fast.
     *   My overall point: the lines between hardware, firmware, and software are very, very fuzzy these days.  They're not likely to get any clearer, either.
  *   For my question about what the value of the "true" and "false" values is: it's not that users may request that resource allocations change, it's that the MPI implementation, OS, driver, firmware, or even hardware may choose to change resource allocations.  HPC/MPI applications tend to do this far less than non-HPC applications, but it still does happen.
  *   You're calling it "flexible" -- but if different MPI implementations and/or different vendors do things very differently, then what is the value of this mechanism for the application?
  *   Per your comment about the provider part of the URI intending to show where the information came from, are you saying that the same resources can/should potentially be listed multiple times?  E.g., openmpi://usnic-blah, cisco://usnic-blah, usnic://blah, pmix://nic/cisco/usnic/blah, hwloc://pci/235234/23424214, linux://net/usnic/32452345, ... ?
  *   I don't think you replied to some of my questions:
     *   What is the precise distinction between the "true" and "false" values of the info keys?
     *   What is the precise definition of when an implementation is required to provide the same info keys/values between processes?

________________________________
From: mpi-forum <mpi-forum-bounces at lists.mpi-forum.org> on behalf of Jeff Squyres (jsquyres) via mpi-forum <mpi-forum at lists.mpi-forum.org>
Sent: Thursday, November 16, 2023 11:35 AM
To: Guillaume Mercier <guillaume.mercier at u-bordeaux.fr>
Cc: mpi-forum at lists.mpi-forum.org <mpi-forum at lists.mpi-forum.org>
Subject: Re: [Mpi-forum] Question about MPI-4.1's MPI_Get_hw_resources_info()

Ok, that's a fine intent.  But all my specific questions remain -- e.g.,

  *   What is the precise distinction between the "true" and "false" values of the info keys?
  *   What is the technical benefit of providing the "true" and "false" values to the user/application in the info keys?
  *   What items can be listed in these hardware info keys?  (e.g., what about virtual or software-only devices)
  *   What is the relationship between the software models listed as examples for the URI prefixes and the hardware that they are supposed to represent?
  *   What is the precise definition of when an implementation is required to provide the same info keys/values between processes?

________________________________
From: Guillaume Mercier
Sent: Thursday, November 16, 2023 3:02 AM
To: Jeff Squyres (jsquyres)
Cc: mpi-forum at lists.mpi-forum.org
Subject: Re: [Mpi-forum] Question about MPI-4.1's MPI_Get_hw_resources_info()

Hi Jeff,

Let me revise my first answer and be more specific on a couple
of points you raise in your message.

Remember that until MPI 4.1, there was no standard way to
provide a value to the "mpi_hw_resource_type" info key that can guide
the splitting of communicators on a hardware basis
(i.e., with a call to MPI_Comm_split_type with MPI_COMM_TYPE_HW_GUIDED
as the input split_type value). MPI_Get_hw_resource_info fills
this gap and makes applications that rely on this mechanism more
portable than previously.

On 16/11/2023 03:09, Jeff Squyres (jsquyres) via mpi-forum wrote:

>      2. For example, my company makes a piece of hardware that can have
>         thousands of virtual NICs on it, and those virtual NICs might
>         even migrate around to different pieces of hardware (e.g., they
>         can migrate between different fiber optic outputs on the same
>         NIC).  MPI processes are assigned to a virtual NIC, not a
>         hardware NIC.  Am I allowed to include a reference to these
>         virtual NICs in the keys/values that are returned (since the
>         Linux device name refers to a virtual entity, not necessarily a
>         specific set of hardware)?  If so, how do I determine the
>         true/false value to assign?

On second thoughts, since these virtual NICs are software "instances"
(for the lack of a better word), I'm not sure that they should be
listed as keys in the resulting MPI_Info object. I'd like to discuss
this more with you.

>      3. The text states that the info keys/values are specific to the
>         point of time when the call is made.  p446:11-12 even explicitly
>         states that the process and/or its hardware restrictions may
>         change over time.  So even if I grokked what "restricted to a
>         single instance of a hardware resource of that type" is intended
>         to mean, if things can change -- and they can -- what is the
>         point of giving a true or false value to the user?

Things can change, but not systematically. I don't think that current
applications modify the binding of their MPI process that often.
Therefore, in the majority of cases, the information you get after
the first call to the procedure is likely to remain valid until
the application's end.
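
One way an application could check whether an earlier result is still valid is to query again later and compare the two results. The sketch below is only an illustration of that idea in plain C: `struct hw_entry`, `snapshot_lookup`, and `snapshot_changes` are hypothetical names, the arrays stand in for the key/value pairs an application would have copied out of the returned info object, and no MPI call is made.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one key/value pair copied out of the info
 * object at a given point in time. */
struct hw_entry {
    const char *key;
    const char *value;   /* "true" or "false" */
};

/* Return the value stored for `key` in the snapshot, or NULL if absent. */
const char *snapshot_lookup(const struct hw_entry *snap, size_t n,
                            const char *key)
{
    assert(key != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(snap[i].key, key) == 0)
            return snap[i].value;
    }
    return NULL;
}

/* Count the differences between two snapshots: keys that disappeared,
 * keys whose value changed, and keys that are new. */
size_t snapshot_changes(const struct hw_entry *old_snap, size_t n_old,
                        const struct hw_entry *new_snap, size_t n_new)
{
    size_t changes = 0;

    /* Keys removed or changed since the old snapshot. */
    for (size_t i = 0; i < n_old; i++) {
        const char *v = snapshot_lookup(new_snap, n_new, old_snap[i].key);
        if (v == NULL || strcmp(v, old_snap[i].value) != 0)
            changes++;
    }
    /* Keys that only exist in the new snapshot. */
    for (size_t i = 0; i < n_new; i++) {
        if (snapshot_lookup(old_snap, n_old, new_snap[i].key) == NULL)
            changes++;
    }
    return changes;
}
```

A result of zero would mean the earlier information can still be used as-is; any other result tells the application that its view of the bindings is stale.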


>      4.
>         Is the intent that keys will include a specific, unique
>         reference to an instance of "hardware" (e.g., a PCI address)?
>         If so, then the value of "true" and "false" becomes even more
>         nebulous (or meaningless).  E.g., if I list a key containing
>         "cisco-nic-12bc83fde9" to indicate a specific NIC, what is the
>         exact "hardware resource of that type", and/or how would an
>         application know that "cisco-nic-12bc83fde9" and
>         "cisco-nic-bbbbbbbbb" are of the same "hardware resource type"?

In this case, both guillaume://cisco-nic set to "true" AND
jeffS://cisco-nic-12bc83fde9 set to "true" seem acceptable to me.
It falls to the user to pick a provider and thus to decide which
information should actually be used.
MPI_Get_hw_resource_info "only" fills the gap
between the application and the lower-level mechanisms that can be used
to retrieve this kind of information, without resorting to calling these
lower-level mechanisms directly in the application.
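
To make "picking a provider" concrete, here is a minimal sketch in plain C. It assumes the keys have the form provider://resource, as in the examples above; `struct hw_entry`, `resource_for_provider`, and `count_true_for_provider` are hypothetical helpers, and the array is a stand-in for key/value pairs copied out of the returned info object. The MPI-4.1 text does not prescribe this filtering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one key/value pair of the info object. */
struct hw_entry {
    const char *key;     /* e.g. "jeffS://cisco-nic-12bc83fde9" */
    const char *value;   /* "true" or "false" */
};

/* If `key` starts with "<provider>://", return a pointer to the resource
 * part that follows; otherwise return NULL. */
const char *resource_for_provider(const char *key, const char *provider)
{
    assert(key != NULL && provider != NULL);
    const char *sep = strstr(key, "://");
    if (sep == NULL)
        return NULL;                      /* no provider prefix at all */
    size_t plen = (size_t)(sep - key);
    if (plen != strlen(provider) || strncmp(key, provider, plen) != 0)
        return NULL;                      /* reported by another provider */
    return sep + 3;                       /* skip the "://" separator */
}

/* Count the entries reported by `provider` whose value is "true". */
size_t count_true_for_provider(const struct hw_entry *entries, size_t n,
                               const char *provider)
{
    assert(entries != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (resource_for_provider(entries[i].key, provider) != NULL &&
            strcmp(entries[i].value, "true") == 0)
            count++;
    }
    return count;
}
```

With the two keys from the example above, an application that trusts only the "jeffS" provider would see the single resource "cisco-nic-12bc83fde9" and ignore the coarser "guillaume://cisco-nic" entry.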

>      5. I can imagine that there could be many different scenarios here;
>         can someone provide some guidance on what exactly an
>         implementation is supposed to do here?  This text seems to be...
>         ambiguous.

What you call ambiguous, I would call flexible ;)
But joking aside, the text can surely be improved and I'd be more
than happy to take your input into account and come up with an
even better version for MPI 4.2 or 5.0.

>  2. The AtoI in p445:42-46 says that we should use URIs with a type of
>     "openmpi://" or "hwloc://" or "pmix://" or "openmpi://" or
>     "slurm://" or ...
>      1. All of these are software models (although hwloc's data refers
>         to either hardware or to software devices that correspond to
>         some form of hardware -- although that's not always clear, either).

The provider only indicates where the information comes from, as two
different sources might report slightly different things. I'll take your
previous "cisco-nic" example: hwloc might choose to report only a
"cisco-nic" type while Cisco's tool might report more precise
information. I think it would be detrimental to the user not to report
all available information. Then about software models, I would surely
qualify "openmp" as a software model but not the others.


>      2. The use of software models in the text is confusing, because the
>         routine has "hw" in its name, strongly implying that there's
>         supposed to be a direct tie-in to hardware.

Software models are only used as potential providers, nothing more.
Fundamentally, I don't see the difference between what hwloc does
(information reporting) and what this function does. Or maybe I didn't
understand your comment correctly?

Cheers,
Guillaume