[Mpiwg-large-counts] Fwd: Re: Large Count - the principles for counts, sizes, and byte and nonbyte displacements

Rolf Rabenseifner rabenseifner at hlrs.de
Wed Oct 9 12:44:20 CDT 2019


----- Forwarded Message -----
From: Rolf Rabenseifner <rabenseifner at hlrs.de>
To: HOLMES Daniel <d.holmes at epcc.ed.ac.uk>
Cc: Purushotham V. Bangalore <puri at uab.edu>, Martin Ruefenacht <m.a.ruefenacht at gmail.com>, Claudia Blaas-Schenner <claudia.blaas-schenner at tuwien.ac.at>, Jeff Squyres <jsquyres at cisco.com>, Anthony Skjellum <tony-skjellum at utc.edu>
Sent: Tue, 08 Oct 2019 20:12:18 +0200 (CEST)
Subject: Re: [Mpiwg-large-counts] Large Count - the principles for counts, sizes, and byte and nonbyte displacements

Hi all,

(Dan and Puri, on Sep. 23 I also sent a copy of my email to 
 Martin. It looks like he had no complaints.)

All remaining answers are inline below.

Best regards
Rolf


----- Original Message -----
> From: "HOLMES Daniel" <d.holmes at epcc.ed.ac.uk>
> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> Cc: "Purushotham V. Bangalore" <puri at uab.edu>, "Martin Ruefenacht" <m.a.ruefenacht at gmail.com>, "Claudia Blaas-Schenner"
> <claudia.blaas-schenner at tuwien.ac.at>, "Jeff Squyres" <jsquyres at cisco.com>, "Anthony Skjellum" <tony-skjellum at utc.edu>
> Sent: Monday, October 7, 2019 8:25:03 PM
> Subject: Re: [Mpiwg-large-counts] Large Count - the principles for counts, sizes, and byte and nonbyte displacements

> Hi Rolf,
> 
> Apologies - I have no record of this email arriving before today. This probably
> means that I am not a member of the mpiwg-large-counts email list, which seems
> somewhat sub-optimal.
>> 
>> To understand how big and large counts should be implemented in MPI-4,
>> it is important to understand the count, displacement, and size model of
>> MPI-3.1.
>> 
>> As long as we do not have a common understanding of MPI-3.1,
>> we will have problems defining MPI-4.
> 
> Agreed.
> 
>> Therefore my clear question: do we agree on the rules above?
> 
> In short, no.
> In addition, these rules are incomplete.
> Also, there are breaches in MPI-3.1 for most of these rules.
> 
>> 
> Below this point in this email, I attempt to cover some of the disagreements
> with your rules and I include a missing rule. However, this is not the sum
> total of the knowledge/experience gained in/by the WG. Some of the other WG
> members may wish to disagree and I expect that some points will generate
> further discussion.
> 
>> - an index into such an array, i.e., the number of an element.
>> Argument name / descriptions [Routine]:
>>  -- array_of_displacements / displacement ..., in multiples of oldtype extent
>>  (... integer) [MPI_TYPE_INDEXED]
>>  -- sdispls / integer array (of length group size). Entry j specifies the
>>  displacement ... [MPI_ALLTOALLV]
>> C-type in MPI-3.1: int
>> C-type in _l: MPI_Count
>
> The two examples you give are not the same. The array_of_displacements is a
> number, correct type MPI_Count (in C), as you state, because it is a “multiple
> of” <something>. 

It looks like you agreed for MPI_TYPE_INDEXED.

> However, each element in the sdispls array is a displacement
> relative to a memory location, sendbuf. 
> The correct type is MPI_Aint (in C).
> Note that it is incorrect in MPI-3.1 for this parameter to be int[] (in C)
> because int is *not* a smaller version of MPI_Aint. A displacement must not be
> described as a number of bytes because of segmented address spaces. The phrase
> “displacement in bytes” is nonsense.


MPI-3.1 p171:7 clearly writes: sendbuf+sdispls[i]*extent(sendtype)

This means that if sendbuf were declared as 
  type_X sendbuf[sendcount[i]];
then sendbuf[sdispls[i]] would be exactly the same as we have in
MPI_TYPE_INDEXED: sdispls[i] is an index into an array of elements
(see the sketch after the comparison below).

To compare it again:

- MPI_TYPE_INDEXED MPI-3.1 p89:9-10 and 15-16

   IN array_of_displacements displacement for each block, 
                             in multiples of oldtype extent (array of integer)

   int MPI_Type_indexed(..., const int array_of_displacements[],

- MPI_ALLTOALLV MPI-3.1, p170:23-24 and p171:7

   int MPI_Alltoallv(..., const int sdispls[],

   sendbuf+sdispls[i]*extent(sendtype)
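
For illustration, a minimal C sketch (the buffer type double and the names
nprocs and sdispls are assumptions for illustration, not prescribed by the
standard) of why sdispls[i] in MPI_ALLTOALLV is an element index, just like
array_of_displacements[i] in MPI_TYPE_INDEXED:

   #include <mpi.h>

   /* MPI-3.1 p171:7: the data sent to rank i starts at
    *   sendbuf + sdispls[i] * extent(sendtype)                              */
   void show_element_index(const double *sendbuf, const int *sdispls, int nprocs)
   {
       MPI_Aint lb, extent;
       MPI_Type_get_extent(MPI_DOUBLE, &lb, &extent);   /* extent(sendtype) */

       for (int i = 0; i < nprocs; i++) {
           /* The byte offset sdispls[i]*extent from sendbuf is the same
            * location as the array element sendbuf[sdispls[i]], i.e.
            * sdispls[i] is an index into an array of elements, exactly as
            * in MPI_TYPE_INDEXED.                                          */
           const double *start_i = sendbuf + sdispls[i];
           (void)start_i;
       }
       (void)lb;
   }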


Second reason: When defining MPI_NEIGHBOR_ALLTOALL|V|W, we
clearly revisited MPI_ALLTOALL|V|W and decided to correct any wrong types.
For MPI(_NEIGHBOR)_ALLTOALL and MPI(_NEIGHBOR)_ALLTOALLV, we decided
that everything was correct, whereas for MPI(_NEIGHBOR)_ALLTOALLW, we decided
that the correct type for the displs is MPI_Aint (and not int).

Therefore, I cannot see any difference, and the MPI Forum also did not
see this difference when it standardized MPI_NEIGHBOR_ALLTOALLV.


>> - number of bytes
>> Argument name / descriptions [Routine]:
>>  -- size / size of window in bytes (non-negative integer) [MPI_WIN_CREATE]
>>  -- outsize / output buffer size, in bytes (integer) [MPI_PACK_EXTERNAL]
>> C-type in MPI-3.1: MPI_Aint
>> C-type in _l:      MPI_Aint
>> (Wrong) C-types in MPI-3.1: int [MPI_PACK, MPI_TYPE_SIZE]
>> C-type (corrected) in _l:   MPI_Aint [MPI_PACK, MPI_TYPE_SIZE]
>
> A number of <something> is a number, not a displacement. The correct “large”
> type is MPI_Count (in C). There are places in MPI-3.1 where this is, correctly,
> int or MPI_Count but there are other places in MPI-3.1 where it is,
> incorrectly, MPI_Aint.

I never said that a number is a displacement.
MPI_Aint is clearly used for 
- addresses relative to a buffer begin,
- absolute addresses (returned by MPI_GET_ADDRESS), and
- numbers of bytes.

MPI-3.1 says
2.5.6 Absolute Addresses and Relative Address Displacements
Some MPI procedures use address arguments that represent an absolute address in the calling
program, or relative displacement arguments that represent differences of two absolute
addresses. The datatype of such arguments is MPI_Aint in C and INTEGER (KIND=
MPI_ADDRESS_KIND) in Fortran. 

Relative or absolute addresses ==> MPI_Aint.
It is not written that MPI_Aint ==> relative or absolute addresses.

Examples for number of bytes:

MPI_TYPE_CREATE_HVECTOR, result of MPI_AINT_DIFF, MPI_TYPE_GET_EXTENT,
MPI_WIN_CREATE, MPI_WIN_ALLOCATE, MPI_WIN_ALLOCATE_SHARED, 
MPI_WIN_SHARED_QUERY, MPI_WIN_ATTACH.

This is a feature, not a bug.

Example for byte-displacements:

MPI_TYPE_CREATE_HINDEXED, MPI_TYPE_CREATE_HINDEXED_BLOCK, 
MPI_TYPE_CREATE_STRUCT, result of MPI_AINT_DIFF. 

A real exception is MPI_PUT, ..., with MPI_Aint target_disp and
target_addr = window_base + target_disp * disp_unit.
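
As a minimal C sketch (the struct layout and variable names are assumptions
for illustration) of these two MPI_Aint roles, a pure number of bytes and a
relative byte displacement:

   #include <mpi.h>

   void aint_as_size_and_byte_displacement(void)
   {
       struct particle { int id; double pos[3]; } p;

       /* number of bytes: the extent of a datatype is returned as MPI_Aint */
       MPI_Aint lb, extent;
       MPI_Type_get_extent(MPI_DOUBLE, &lb, &extent);
       (void)lb; (void)extent;

       /* byte displacements: differences of two absolute addresses, also
        * MPI_Aint, as used for array_of_displacements in
        * MPI_TYPE_CREATE_STRUCT                                            */
       MPI_Aint base, addr_id, addr_pos;
       MPI_Get_address(&p, &base);
       MPI_Get_address(&p.id, &addr_id);
       MPI_Get_address(&p.pos, &addr_pos);

       MPI_Aint     displs[2]    = { MPI_Aint_diff(addr_id, base),
                                     MPI_Aint_diff(addr_pos, base) };
       int          blocklens[2] = { 1, 3 };
       MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };
       MPI_Datatype particle_type;
       MPI_Type_create_struct(2, blocklens, displs, types, &particle_type);
       MPI_Type_commit(&particle_type);
       MPI_Type_free(&particle_type);
   }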


>> - smaller number in bytes
>> Some argument names: e.g., disp_unit
>> Description: local unit size for displacements, in bytes (positive integer)
>> C-type in MPI-3.1: int
>> C-type in _l: still int? or MPI_Count?
>
> Why must disp_unit be smaller, and is it really a number of bytes? The premise
> for disp_unit is that a window might consist of a sequence of locations in
> memory that are used to store values of a particular type (typically
> represented by an MPI datatype). The disp_unit gives an indication of the
> extent of the datatype, a quantity by which all offset values for the window
> will be scaled, in order that address arithmetic adding (offset*disp_unit) to
> base_address produces a new address that is one of the sequence of locations in
> memory used to store values, i.e. the location of the beginning of one of the
> datatypes. This indicates to me that the correct type for disp_unit is MPI_Aint
> (in C). Note that, if true, this means that using int (as is done in MPI-3.1)
> is incorrect, even prior to the large count changes because int is *not* a
> smaller version of MPI_Aint.
 
Yes. disp_unit is an extent of a type, which is a number of bytes, which
should have been MPI_Aint (as in MPI_TYPE_GET_EXTENT).

Therefore 
   C-type in _l: should be MPI_Aint.
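
A minimal C sketch (the window length nelems and the local names base and win
are assumptions for illustration) of disp_unit as a datatype extent: the
target scales target_disp by disp_unit, so
target_addr = window_base + target_disp * disp_unit.

   #include <mpi.h>
   #include <stdlib.h>

   void disp_unit_as_extent(MPI_Comm comm, int target_rank)
   {
       const int nelems = 1024;
       MPI_Aint lb, extent;
       MPI_Type_get_extent(MPI_DOUBLE, &lb, &extent);

       double *base = malloc(nelems * (size_t)extent);
       MPI_Win win;
       /* disp_unit = extent(MPI_DOUBLE); in MPI-3.1 it must be passed as int */
       MPI_Win_create(base, nelems * extent, (int)extent,
                      MPI_INFO_NULL, comm, &win);

       double value = 42.0;
       MPI_Win_fence(0, win);
       /* target_disp = 10 addresses element 10 of the target window,
        * i.e. byte offset 10 * disp_unit from the window base             */
       MPI_Put(&value, 1, MPI_DOUBLE, target_rank,
               (MPI_Aint)10, 1, MPI_DOUBLE, win);
       MPI_Win_fence(0, win);

       MPI_Win_free(&win);
       free(base);
   }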


> A missing rule concerns "length arguments” (see section 2.5.2 in MPI-3.1 for a
> definition).
> The length of the array_of_requests for the MPI_Waitany function is given by a
> parameter called count, which is of type int (in C). This is a different class
> of type than any discussed so far. The WG has flip-flopped with regards to
> whether this class of type should be enlarged or not. Our current position is
> (I believe) to leave this class of types alone, i.e. they will remain int (in
> C).

I have to think about it.
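
(For reference, a minimal C sketch, with the assumed names reqs and n, of the
"length argument" class you mention: the count of array_of_requests in
MPI_WAITANY is the length of a local array of handles and would stay int
under the WG's current position.)

   #include <mpi.h>

   void wait_for_one(MPI_Request reqs[], int n)
   {
       int index;            /* index of the completed request */
       MPI_Status status;
       /* count (here n) is a length argument of type int, not MPI_Count */
       MPI_Waitany(n, reqs, &index, &status);
       (void)status;
   }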

Best regards
Rolf


> 
> Cheers,
> Dan.
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.holmes at epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
> 
> The University of Edinburgh is a charitable body, registered in Scotland, with
> registration number SC005336.
> 
> On 7 Oct 2019, at 18:09, Rolf Rabenseifner <rabenseifner at hlrs.de> wrote:
> 
> Dear all,
> 
> I've never seen any answer to my email.
> Maybe there isn't any answer, maybe it never reached me.
> 
> Please give me an advice.
> 
> Best regards
> Rolf
> 
> ----- Forwarded Message -----
> From: Rolf Rabenseifner via mpiwg-large-counts
> <mpiwg-large-counts at lists.mpi-forum.org>
> To: mpiwg-large-counts at lists.mpi-forum.org
> Cc: Rolf Rabenseifner <rabenseifner at hlrs.de>
> Sent: Mon, 23 Sep 2019 17:15:09 +0200 (CEST)
> Subject: [Mpiwg-large-counts] Large Count - the principles for counts, sizes,
> and byte and nonbyte displacements
> 
> Dear all,
> 
> To understand how big and large counts should be implemented in MPI-4,
> it is important to understand the count, displacement, and size model of
> MPI-3.1.
> 
> As long as we do not have a common understanding of MPI-3.1,
> we will have problems defining MPI-4.
> 
> Therefore, here is my understanding of MPI-3.1. We have:
> 
> - number of array elements with elements of a given type (typically represented
> by an MPI datatype handle).
> Usual argument name: count
> Usual description: number of elements in ... buffer (non-negative integer)
> C-type in MPI-3.1: int
> C-type in _l: MPI_Count
> 
> - an index into such an array, i.e., the number of an element.
> Argument name / descriptions [Routine]:
>  -- array_of_displacements / displacement ..., in multiples of oldtype extent
>  (... integer) [MPI_TYPE_INDEXED]
>  -- sdispls / integer array (of length group size). Entry j specifies the
>  displacement ... [MPI_ALLTOALLV]
> C-type in MPI-3.1: int
> C-type in _l: MPI_Count
> 
> - number of bytes
> Argument name / descriptions [Routine]:
>  -- size / size of window in bytes (non-negative integer) [MPI_WIN_CREATE]
>  -- outsize / output buffer size, in bytes (integer) [MPI_PACK_EXTERNAL]
> C-type in MPI-3.1: MPI_Aint
> C-type in _l:      MPI_Aint
> (Wrong) C-types in MPI-3.1: int [MPI_PACK, MPI_TYPE_SIZE]
> C-type (corrected) in _l:   MPI_Aint [MPI_PACK, MPI_TYPE_SIZE]
> 
> - smaller number in bytes
> Some argument names: e.g., disp_unit
> Description: local unit size for displacements, in bytes (positive integer)
> C-type in MPI-3.1: int
> C-type in _l: still int? or MPI_Count?
> 
> - Position or relative byte displacement within an array of bytes.
> Such values can be calculated as any sum and product of int, long, long long,
> and MPI_Aint, as long as the MPI_Aint value contains a pure integer size value,
> i.e., an (integer) difference of two absolute addresses within one sequential
> storage, see MPI-3.1 page 115 line 31, or an MPI datatype extent, retrieved,
> e.g., with MPI_TYPE_GET_EXTENT.
> Argument names / description:
>  -- position / current position in buffer, in bytes (integer) [MPI_PACK_EXTERNAL]
>  -- array_of_displacements / byte displacement of each block (array of integer)
>  [MPI_TYPE_CREATE_STRUCT]
> C-type in MPI-3.1: MPI_Aint
> C-type in _l:      MPI_Aint
> (Wrong) C-types in MPI-3.1: int [MPI_PACK, MPI_ALLTOALLW]
> C-type (corrected) in _l:   MPI_Aint [MPI_PACK, MPI_ALLTOALLW]
> 
> - Absolute address values for byte displacements.
> These values are also valid for all byte displacements in datatype routines
> and in MPI_NEIGHBOR_ALLTOALLW, provided that they are used in combination
> with buffer=MPI_BOTTOM.
> They cannot be used in MPI_ALLTOALLW.
> With "C-type (corrected) in _l: MPI_Aint [MPI_ALLTOALLW]",
> they are also usable with MPI_ALLTOALLW.
> 
> 
> I already looked at the Large/Big Count pdf and saw that in the datatype chapter
> these rules were broken, for example for the ...PACK/UNPACK... routines.
> 
> 
> Therefore my clear question: do we agree on the rules above?
> 
> 
> Bugs already detected in the version from Sep. 13, 2019:
> 
> - page 127:
>  No idea why you changed the name from MPI_GET_ELEMENTS to MPI_TYPE_GET_ELEMENTS.
>  Should be reverted.
> 
> - MPI_PACK:
>  outsize and position should be handled identically to those in MPI_PACK_EXTERNAL,
>  i.e., both are MPI_Aint...
> 
> - MPI_PACK_SIZE:
>  size should be handled identically to that in MPI_PACK_EXTERNAL_SIZE, i.e.,
>  MPI_Aint...
> 
> - MPI_UNPACK:
>  insize and position should be handled identically to those in MPI_UNPACK_EXTERNAL,
>  i.e., both are MPI_Aint...
> 
> - MPI_Type_contiguous: the large count _l version is missing
> 
> - MPI_Type_create_darray
>   -- array_of_distribs must be INTEGER, because it holds enumeration values
>      and nothing else.
>   -- array_of_dargs requires significant explanation because it can
>      hold an enumeration value (INTEGER) and also large count values, which
>      can cause hard-to-understand compiler error reports in Fortran:
>      using MPI_COUNT_KIND array_of_gsizes values together with an INTEGER
>      enumeration constant, here MPI_DISTRIBUTE_DFLT_DARG, would cause a
>      compiler message like "no matching interface found".
> 
>      Two possible text/interface solutions:
>      - If using the mpi_f08 module and MPI_DISTRIBUTE_DFLT_DARG together
>        with the large count version of this procedure, i.e., with
>        INTEGER(KIND=MPI_COUNT_KIND) array_of_gsizes and array_of_dargs
>        arguments, then one should use
>           INT(MPI_DISTRIBUTE_DFLT_DARG, MPI_COUNT_KIND)
>        instead of
>           MPI_DISTRIBUTE_DFLT_DARG.
>      - Overloading with two versions, (long,normal) and (long,long);
>        but I would recommend the first solution because it does
>        not require additional MPI library implementation overhead.
> 
> 
> Before I can continue reviewing, the principles above must be
> clarified/discussed/agreed/...
> 
> Best regards
> Rolf
> 
> 
> 
> --
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .
> _______________________________________________
> mpiwg-large-counts mailing list
> mpiwg-large-counts at lists.mpi-forum.org
> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts
> 
> --
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .


