[Mpiwg-large-counts] Large Count - the principles for counts, sizes, and byte and nonbyte displacements

Rolf Rabenseifner rabenseifner at hlrs.de
Sun Oct 27 14:28:06 CDT 2019


Dear Jeff H.,

Comments are inline. It is really a complex topic.
Thank you very much for pointing me to the 57-bit addresses.
I tried to understand them by reading
https://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details
and the surrounding article.

All the rest below.

----- Original Message -----
> From: "Jeff Hammond" <jeff.science at gmail.com>
> To: "Rolf Rabenseifner" <rabenseifner at hlrs.de>
> Cc: "mpiwg-large-counts" <mpiwg-large-counts at lists.mpi-forum.org>, "Jeff Squyres" <jsquyres at cisco.com>, "James Dinan"
> <james.dinan at intel.com>
> Sent: Friday, October 25, 2019 8:35:47 PM
> Subject: Re: [Mpiwg-large-counts] Large Count - the principles for counts, sizes, and byte and nonbyte displacements

> On Fri, Oct 25, 2019 at 5:20 AM Rolf Rabenseifner <rabenseifner at hlrs.de>
> wrote:
> 
>> Dear Jeff,
>>
>> > If we no longer care about segmented addressing, that makes a whole
>> bunch of
>> > BigCount stuff a LOT easier. E.g., MPI_Aint can basically be a
>> > non-segment-supporting address integer.
>>
>> > AINT_DIFF and AINT_SUM can go away, too.
>>
>> Both statements are -- in my opinion -- incorrect.
>> And the real problem is really ugly, see below.
>>
>> After we seem to agree that MPI_Aint is used as it is used, i.e.,
>> to currently store
>>  - absolute addresses (which means the bits of a 64-bit-unsigned address
>>    interpreted as a signed twos-complement 64 bit integer
>>    i.e., values between -2**63 and + 2**63-1
>>    (and only here is the discussion about whether some higher bits may
>>     be used to address segments)
>>  - relative addresses between -2**63 and + 2**63-1
>>
> 
> C and C++ do not allow one to do pointer arithmetic outside of a single
> array so I'm not sure how one would generate relative addresses this large,
> particularly on an x86_64 machine where the underlying memory addresses are
> 48 or 57 bits.

"C and C++ do not allow one to do pointer arithmetic outside of a single array"
perfectly fits to the MPI Definition on "sequential storage" and the
rule that a diff is only allowed within such sequential storage.

Regarding
>>  - relative addresses between -2**63 and + 2**63-1

I describe only what MPI allows.
One should keep in mind that relative addresses are also used as displacements
in derived datatypes.
Those can be larger than the address range of a real memory system
if a file is larger than this range AND the derived datatype is, for example,
used to define a filetype, i.e., to describe which portions of a globally
accessed file are accessible to an individual MPI process.
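
As an illustration (only a sketch; the MPI_File handle, the block length, and
the 3 TiB offset are arbitrarily chosen and assume a 64-bit MPI_Aint), such a
displacement may legitimately point several terabytes into a shared file,
far beyond any local address range:

  #include <mpi.h>

  /* Sketch: an MPI_Aint displacement inside a filetype may point far
     beyond the local memory range, e.g. 3 TiB into a large shared file. */
  void filetype_example(MPI_File fh)
  {
      int          blocklengths[1]  = { 1024 };  /* 1024 bytes */
      MPI_Aint     displacements[1] = { (MPI_Aint)3 * 1024 * 1024 * 1024 * 1024 }; /* 3 TiB */
      MPI_Datatype filetype;

      MPI_Type_create_hindexed(1, blocklengths, displacements, MPI_BYTE, &filetype);
      MPI_Type_commit(&filetype);
      MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);
      MPI_Type_free(&filetype);
  }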


>>  - byte counts between 0 and 2**63-1
>>
> 
> All such uses should use functions with MPI_Count, not MPI_Aint, once those
> functions are defined in MPI 4.0.

Exactly for this proposal I detected a ***severe*** problem,
described in the lines below with the two addresses addr1 and addr2,
using an 8-bit MPI_Aint and a 12-bit MPI_Count as an example.


>> And that for two absolute addresses within the same "sequential storage"
>> (defined in MPI-3.1 Sect. 4.1.12 page 115 lines 17-19), it is allowed
>> to use a minus operator (as Long as integer overflow detection is
>> switched off) or MPI_Aint_diff.
>>
> 
> Again, pointer arithmetic has to be within a single array.  It's pretty
> hard to generate an overflow in this context.

I'm pretty sure that you may not be right, at least for the future:

MPI_Aint_diff was defined against the background that, at least in the future,
virtual addresses may span from the lower to the higher half of the virtual
64-bit address space, as shown in the right diagram of the figure
in https://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details
and https://en.wikipedia.org/wiki/X86-64#/media/File:AMD64-canonical--64-bit.svg

If an array spans from the lower half to the higher half, then we have the problem
described below with the two absolute addresses 127 and 129 in
my example with only an 8-bit address space.

As long as only addresses in the lower half of the virtual address space
exist in user mode, we do not have this problem.

If I understand correctly, we want to specify MPI-4.0 to be future-proof
for the case that true 64-bit addressing arrives at some point.
(This may include systems that allow full direct addressability
of the whole cluster memory.)

Am I right with this assumption, or should MPI-4.0 only be usable
for 56/57-bit addresses?


>> In principle, the MPI standard is not fully consistent with that:
>>
>> MPI-3.1 page 102 lines 45-46 tell:
>>  "To ensure portability, arithmetic on MPI addresses
>>   must
>>   be performed using the MPI_AINT_ADD and MPI_AINT_DIFF functions."
>> and
>>
> 
> Yes, it also says for portability, RMA needs to use MPI_Alloc_mem and not
> malloc or similar arrays.  That never matters in practice.

Yes, I agree; nearly all applications may still use the minus operator
for MPI_Aint instead of MPI_Aint_diff.
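
Nevertheless, to show the difference, here is a minimal sketch of the two
variants within one sequential storage (buffer and variable names are only
illustrative):

  #include <mpi.h>

  /* Both variants compute the displacement between two elements of the
     same sequential storage. MPI_Aint_diff is the portable form; the
     plain "-" works on today's flat address spaces but relies on MPI_Aint
     behaving like an ordinary signed integer. */
  void displacement_example(void)
  {
      double   buf[1024];
      MPI_Aint addr1, addr2, disp_portable, disp_plain;

      MPI_Get_address(&buf[0],   &addr1);
      MPI_Get_address(&buf[100], &addr2);

      disp_portable = MPI_Aint_diff(addr2, addr1);  /* MPI-3.1 portable way */
      disp_plain    = addr2 - addr1;                /* what most codes do   */

      (void)disp_portable; (void)disp_plain;        /* silence unused warnings */
  }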

 
>> > > ... MPI-3.1 2.5.6 "Absolute
>> > > Addresses and Relative Address Displacements" p16:39-43:
>> > >
>> > > "For retrieving absolute addresses or any calculation with absolute
>> addresses, one
>> > > should
>> > > use the routines and functions provided in Section 4.1.5. Section
>> > > 4.1.12 provides additional rules for the correct use of absolute
>> addresses. For
>> > > expressions with relative displacements or other usage without absolute
>> > > addresses, intrinsic operators (e.g., +, -, *) can be used."
>>
>> And now about large counts, especially if we want to extent routines
>> that currently use MPI_Aint to something larger, i.e., MPI_Aint_x or
>> MPI_Count.
>>
> 
> MPI_Aint_x will not exist.  MPI_Count is already a thing and is at least as
> large as MPI_Aint and MPI_Offset.

I only wanted to be general. Of course, I also expect that we should read
below
      MPI_Aint_x (or MPI_Count) addr_x;
always as
      MPI_Count addr_x;

>> Here, the major problem is the automatic cast within an assignment
>>
>>   MPI_Aint addr;
>>   MPI_Aint_x (or MPI_Count) addr_x;
>>
>>   MPI_Get_address(...., &addr);
>>   addr_x = addr;  // ***this Statement is the problem****
>>
>>
> No.  This statement is not a problem.  MPI_Count is required to hold the
> full range of MPI_Aint.  You can assign MPI_Aint to MPI_Count without
> truncation.

You completely misunderstood my hint. The problem is the sign extension
and not a removal of high-order bits.
In my example, MPI_Count has 12 bits and MPI_Aint 8 bits, and an array may
span addresses from the lower half to the higher half of the 8-bit address space.

 
>> let's take my example from a previous email (using an 8-bit MPI_Aint)
>>
>>   addr1 01111111 = 127 (signed int) = 127 (unsigned int)
>>   addr2 10000001 = -127  (signed int) = 129 (unsigned int)
>>
>> Internally the addresses are viewed by the hardware and OS as unsigned.
>> MPI_Aint is interpreting the same bits as signed int.
>>
>> addr2-addr1 = 129 -127 = 2 (as unsigned int)
>> but in a real application code with "-" operator:
>>             = -127 -127 = -254
>>   --> signed int Overflow because 8 bit can express only -128 .. +127
>>   --> detected or automatically corrected with +256 --> -254+256 = 2
>>
>> And now with 12 bit MPI_Aint_x
>>
>>   addr1_x := addr1  results in (by sign propagation)
>>   addr1_x = 000001111111 = 127 (signed int) = 127 (unsigned int)
>>
>>   addr2_x := addr2  results in (by sign propagation)
>>   addr2_x = 111110000001 = -127  (signed int) = 129 (unsigned int)
>>
>> and then
>>   addr2_x - addr1_x = -127 - 127 = -254
>> which is a normal integer within 12bit,
>> and therefore ***NO*** overflow correction!!!!!!
>>
>> And therefore a completely ***wrong*** result.
>>
>> Using two different types for absolute addresses seems to be a
>> real problem in my opinion.
>>
>>
>> And of course signed 64bit MPI_Aint does allow to specify only
>> 2**63-1 bytes, which is about 8*1000**6 Bytes,
>> which is only 8 Exabyte.
>>
>> On systems with less than 8 Exabyte per MPI process, this is not
>> a problem for message passing, but it is a problem for I/O,
>>
> 
> That is why MPI_Count is potentially larger than MPI_Aint, because it also
> has to hold MPI_Offset for IO purposes.

As you can see, the problem arises if MPI_Count is really larger than MPI_Aint.
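
If one wants to see this on a normal machine, here is a small C model of
my 8-bit/12-bit example above, with int8_t and int16_t as stand-ins
(C has no 12-bit type, but the effect is the same):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      int8_t addr1 = 127;    /* bit pattern 01111111 = unsigned address 127 */
      int8_t addr2 = -127;   /* bit pattern 10000001 = unsigned address 129 */

      /* Narrow arithmetic: -127 - 127 = -254 does not fit into 8 bits; the
         narrowing cast wraps modulo 256 on the usual twos-complement
         platforms and yields 2, the correct unsigned difference 129-127. */
      int8_t diff_narrow = (int8_t)(addr2 - addr1);

      /* Widen first (this is the "addr_x = addr" assignment): sign
         extension preserves the values 127 and -127, so the difference is
         -254 and no wrap-around "repairs" it -- the wrong result. */
      int16_t addr1_x = addr1;
      int16_t addr2_x = addr2;
      int16_t diff_wide = (int16_t)(addr2_x - addr1_x);

      printf("narrow difference: %d\n", diff_narrow);  /* 2    */
      printf("wide   difference: %d\n", diff_wide);    /* -254 */
      return 0;
  }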

 
>> and therefore for derived datatypes.
>> And derived datatypes use MPI_Aint at several locations,
>> and some of them with the possibility of providing absolute addresses.
>>
> 
> You are welcome to create a ticket for large-count datatype functions that
> use MPI_Count if one does not already exist.

I expect that the large count working group also wants to extend all derived
datatype routines to large counts, to be able to support large offsets in parallel I/O.

And for this purpose, the large count working group must decide whether,
for example, the large-count version of MPI_TYPE_CREATE_STRUCT will have
not only
    MPI_Count (instead of int)      array_of_blocklengths[],
but also  
    MPI_Count (instead of MPI_Aint) array_of_displacements[].
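
To make the second option concrete, here is a purely hypothetical sketch of
such a binding; the name MPIX_Type_create_struct_x and the parameter types
are only my illustration, not agreed MPI text:

  /* Hypothetical large-count binding -- illustration only. The open
     question is whether array_of_displacements stays MPI_Aint (sufficient
     for memory) or becomes MPI_Count (also covering file offsets beyond
     the address range). */
  int MPIX_Type_create_struct_x(MPI_Count          count,
                                const MPI_Count    array_of_blocklengths[],
                                const MPI_Count    array_of_displacements[],
                                const MPI_Datatype array_of_types[],
                                MPI_Datatype      *newtype);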

> Jeff
>>
>> A solution of this problem seems to be not trivial, or is there one?

Yes, there is a solution:
If a system provides a 64-bit unsigned integer address space that allows
an array or structure to cross the middle address 8000...0000,
then MPI_Get_address must really map
the contiguous **unsigned** address space
    from 0000...0000 until FFFF...FFFF
to the also contiguous **signed** address space
    from 8000...0000 until 7FFF...FFFF
by simply subtracting 8000...0000 (i.e., 2**63) from each unsigned address
to obtain the corresponding **signed** address.

This should be mentioned in an advice to implementors.
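
A minimal sketch of this mapping, assuming a 64-bit MPI_Aint and the usual
twos-complement behaviour (the function name is only illustrative):

  #include <stdint.h>
  #include <mpi.h>

  /* Shift every unsigned address by 2**63 so that the full unsigned range
     0000...0000 .. FFFF...FFFF maps onto the contiguous signed range
     8000...0000 .. 7FFF...FFFF, with no discontinuity at the midpoint. */
  static MPI_Aint flatten_address(const void *location)
  {
      uint64_t u       = (uint64_t)(uintptr_t)location;
      uint64_t shifted = u - UINT64_C(0x8000000000000000);  /* wraps mod 2**64 */
      return (MPI_Aint)(int64_t)shifted;                    /* reinterpret as signed */
  }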

Rolf


>> And always doing MPI_Aint with more than 8 bytes is also a no-option,
>> based on the ABI discussion, and is also a waste of memory.
>>
>>
>> Best regards
>> Rolf
>>
>>
>> ----- Original Message -----
>> > From: "mpiwg-large-counts" <mpiwg-large-counts at lists.mpi-forum.org>
>> > To: "Jeff Squyres" <jsquyres at cisco.com>
>> > Cc: "Jeff Hammond" <jeff.science at gmail.com>, "James Dinan" <
>> james.dinan at intel.com>, "mpiwg-large-counts"
>> > <mpiwg-large-counts at lists.mpi-forum.org>
>> > Sent: Friday, October 25, 2019 1:02:35 AM
>> > Subject: Re: [Mpiwg-large-counts] Large Count - the principles for
>> counts, sizes, and byte and nonbyte displacements
>>
>> > Jim (cc) suffered the most in MPI 3.0 days because of AINT_DIFF and
>> AINT_SUM, so
>> > maybe he wants to create this ticket.
>> >
>> > Jeff
>>
>>
>> > On Thu, Oct 24, 2019 at 2:41 PM Jeff Squyres (jsquyres) < [
>> > mailto:jsquyres at cisco.com | jsquyres at cisco.com ] > wrote:
>> >
>> >
>> > Not opposed to ditching segmented addressing at all. We'd need a ticket
>> for this
>> > ASAP, though.
>> >
>> > This whole conversation is predicated on:
>> >
>> > - MPI supposedly supports segmented addressing
>> > - MPI_Aint is not sufficient for modern segmented addressing (i.e.,
>> representing
>> > an address that may not be in main RAM and is not mapped in to the
>> current
>> > process' linear address space)
>> >
>> > If we no longer care about segmented addressing, that makes a whole
>> bunch of
>> > BigCount stuff a LOT easier. E.g., MPI_Aint can basically be a
>> > non-segment-supporting address integer. AINT_DIFF and AINT_SUM can go
>> away,
>> > too.
>>
>>
>> > On Oct 24, 2019, at 5:35 PM, Jeff Hammond via mpiwg-large-counts < [
>> > mailto:mpiwg-large-counts at lists.mpi-forum.org |
>> > mpiwg-large-counts at lists.mpi-forum.org ] > wrote:
>> >
>> > Rolf:
>> >
>> > Before anybody spends any time analyzing how we handle segmented
>> addressing, I
>> > want you to provide an example of a platform where this is relevant. What
>> > system can you boot today that needs this and what MPI libraries have
>> expressed
>> > an interest in supporting it?
>> >
>> > For anyone who didn't hear, ISO C and C++ have finally committed to
>> > twos-complement integers ( [
>> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r1.html |
>> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r1.html ]
>> , [
>> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm |
>> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm ] ) because
>> modern
>> > programmers should not be limited by hardware designs from the 1960s. We
>> should
>> > similarly not waste our time on obsolete features like segmentation.
>> >
>> > Jeff
>> >
>> > On Thu, Oct 24, 2019 at 10:13 AM Rolf Rabenseifner via
>> mpiwg-large-counts < [
>> > mailto:mpiwg-large-counts at lists.mpi-forum.org |
>> > mpiwg-large-counts at lists.mpi-forum.org ] > wrote:
>> >
>> >
>> >> I think that changes the conversation entirely, right?
>> >
>> > Not the first part, the state-of-current-MPI.
>> >
>> > It may change something for the future, or a new interface may be needed.
>> >
>> > Please, can you describe how MPI_Get_address can work with the
>> > different variables from different memory segments.
>> >
>> > Or whether a completely new function or a set of functions is needed.
>> >
>> > If we can still express variables from all memory segments as
>> > input to MPI_Get_address, there may be still a way to flatten
>> > the result of some internal address-iquiry into a flattened
>> > signed integer with the same behavior as MPI_Aint today.
>> >
>> > If this is impossible, then new way of thinking and solution
>> > may be needed.
>> >
>> > I really want to see examples for all current stuff as you
>> > mentioned in your last email.
>> >
>> > Best regards
>> > Rolf
>>
>>
>> ----- Original Message -----
>> > From: "HOLMES Daniel" <d.holmes at epcc.ed.ac.uk>
>> > To: "mpiwg-large-counts" <mpiwg-large-counts at lists.mpi-forum.org>
>> > Cc: "Rolf Rabenseifner" <rabenseifner at hlrs.de>, "Jeff Squyres" <
>> jsquyres at cisco.com>
>> > Sent: Thursday, October 24, 2019 6:41:34 PM
>> > Subject: Re: [Mpiwg-large-counts] Large Count - the principles for
>> counts, sizes, and byte and nonbyte displacements
>>
>> > Hi Rolf & Jeff,
>> >
>> > I think this wiki article is instructive on this topic also:
>> > https://en.wikipedia.org/wiki/X86_memory_segmentation
>> >
>> > This seems like a crazy memory addressing system to me personally, but
>> it is a
>> > (historic) example of a segmented addressing approach that MPI_Aint can
>> > support.
>> >
>> > The “strange properties” for arithmetic are strange indeed, depending on
>> what
>> > the MPI_Aint stores and how.
>> >
>> > If MPI_Aint was 20 bits long and stores only the address, then it cannot
>> be used
>> > to determine uniquely which segment is being used or what the offset is
>> within
>> > that segment (there are 4096 possible answers). Does MPI need that more
>> > detailed information? Probably - because segments were a way of
>> implementing
>> > memory protection, i.e. accessing a segment you did not have permission
>> to
>> > access led to a “segmentation fault” error. I do not know enough about
>> these
>> > old architectures to say whether an attempt to access the *same byte*
>> using two
>> > different segment:offset pairs that produce the *same* address could
>> result in
>> > different behaviour. That is, if I have access permissions for segment 3
>> but
>> > not for segment 4, I can access {seg=3,offset=2^16-16} but can I access
>> > {segment=4,offset=2^16-32}, which is the same byte? If not, then MPI
>> needs to
>> > store segment and offset inside MPI_Aint to be able to check and to set
>> > registers correctly.
>> >
>> > If MPI_Aint is 32 bits long and stores the segment in the first 16 bits
>> and the
>> > offset in the last 16 bits, then the 20 bit address can be computed in a
>> single
>> > simple instruction and both segment and offset are immediately
>> retrievable.
>> > However, doing ordinary arithmetic with this bitwise representation is
>> unwise
>> > because it is a compound structure type. Let us subtract 1 from an
>> MPI_Aint of
>> > this layout which stores offset of 0 and some non-zero segment. We get
>> offset
>> > (2^16-1) in segment (s-1), which is not 1 byte before the previous
>> MPI_Aint
>> > because segments overlap. The same happens when adding and overflowing
>> the
>> > offset portion - it changes the segment in an incorrect way. Segment++
>> moves
>> > the address forward only 16 bytes, not 2^16 bytes.
>> >
>> > The wrap-around from the end of the address space back to the beginning
>> is also
>> > a source of strange properties for arithmetic.
>> >
>> > One of the key statements from that wiki page is this:
>> >
>> > The root of the problem is that no appropriate address-arithmetic
>> instructions
>> > suitable for flat addressing of the entire memory range are
>> available.[citation
>> > needed] Flat addressing is possible by applying multiple instructions,
>> which
>> > however leads to slower programs.
>> >
>> > Cheers,
>> > Dan.
>> > —
>> > Dr Daniel Holmes PhD
>> > Architect (HPC Research)
>> > d.holmes at epcc.ed.ac.uk<mailto:d.holmes at epcc.ed.ac.uk>
>> > Phone: +44 (0) 131 651 3465
>> > Mobile: +44 (0) 7940 524 088
>> > Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh,
>> EH8 9BT
>> > —
>> > The University of Edinburgh is a charitable body, registered in
>> Scotland, with
>> > registration number SC005336.
>> > —
>>
>>
>> > ----- Original Message -----
>> >> From: "Jeff Squyres" < [ mailto:jsquyres at cisco.com | jsquyres at cisco.com
>> ] >
>> >> To: "Rolf Rabenseifner" < [ mailto:rabenseifner at hlrs.de |
>> rabenseifner at hlrs.de ]
>> >> >
>> >> Cc: "mpiwg-large-counts" < [ mailto:
>> mpiwg-large-counts at lists.mpi-forum.org |
>> >> mpiwg-large-counts at lists.mpi-forum.org ] >
>> >> Sent: Thursday, October 24, 2019 5:27:31 PM
>> >> Subject: Re: [Mpiwg-large-counts] Large Count - the principles for
>> counts,
>> >> sizes, and byte and nonbyte displacements
>> >
>> >> On Oct 24, 2019, at 11:15 AM, Rolf Rabenseifner
>> >> < [ mailto:rabenseifner at hlrs.de | rabenseifner at hlrs.de ] <mailto: [
>> >> mailto:rabenseifner at hlrs.de | rabenseifner at hlrs.de ] >> wrote:
>> >>
>> >> For me, it looked like that there was some misunderstanding
>> >> of the concept that absolute and relative addresses
>> >> and number of bytes that can be stored in MPI_Aint.
>> >>
>> >> ...with the caveat that MPI_Aint -- as it is right now -- does not
>> support
>> >> modern segmented memory systems (i.e., where you need more than a small
>> number
>> >> of bits to indicate the segment where the memory lives).
>> >>
>> >> I think that changes the conversation entirely, right?
>> >>
>> >> --
>> >> Jeff Squyres
>> >> [ mailto:jsquyres at cisco.com | jsquyres at cisco.com ] <mailto: [
>> >> mailto:jsquyres at cisco.com | jsquyres at cisco.com ] >
>> >
>> > --
>> > Dr. Rolf Rabenseifner . . . . . . . . . .. email [ mailto:
>> rabenseifner at hlrs.de |
>> > rabenseifner at hlrs.de ] .
>> > High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
>> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
>> > Head of Dpmt Parallel Computing . . . [
>> http://www.hlrs.de/people/rabenseifner |
>> > www.hlrs.de/people/rabenseifner ] .
>> > Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .
>> > _______________________________________________
>> > mpiwg-large-counts mailing list
>> > [ mailto:mpiwg-large-counts at lists.mpi-forum.org |
>> > mpiwg-large-counts at lists.mpi-forum.org ]
>> > [ https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts |
>> > https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts ]
>> >
>> >
>> > --
>> > Jeff Hammond
>> > [ mailto:jeff.science at gmail.com | jeff.science at gmail.com ]
>> > [ http://jeffhammond.github.io/ | http://jeffhammond.github.io/ ]
>> > _______________________________________________
>> > mpiwg-large-counts mailing list
>> > [ mailto:mpiwg-large-counts at lists.mpi-forum.org |
>> > mpiwg-large-counts at lists.mpi-forum.org ]
>> > [ https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts |
>> > https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts ]
>> >
>> >
>> > --
>> > Jeff Squyres
>> > [ mailto:jsquyres at cisco.com | jsquyres at cisco.com ]
>> >
>> >
>> >
>> > --
>> > Jeff Hammond
>> > [ mailto:jeff.science at gmail.com | jeff.science at gmail.com ]
>> > [ http://jeffhammond.github.io/ | http://jeffhammond.github.io/ ]
>> >
>> > _______________________________________________
>> > mpiwg-large-counts mailing list
>> > mpiwg-large-counts at lists.mpi-forum.org
>> > https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts
>>
>> --
>> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
>> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
>> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
>> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
>> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .
>>
> 
> 
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .

