[Mpiwg-large-counts] Large Count - the principles for counts, sizes, and byte and nonbyte displacements

Jeff Hammond jeff.science at gmail.com
Fri Oct 25 13:35:47 CDT 2019


On Fri, Oct 25, 2019 at 5:20 AM Rolf Rabenseifner <rabenseifner at hlrs.de>
wrote:

> Dear Jeff,
>
> > If we no longer care about segmented addressing, that makes a whole
> > bunch of BigCount stuff a LOT easier. E.g., MPI_Aint can basically be a
> > non-segment-supporting address integer.
> > non-segment-supporting address integer.
>
> > AINT_DIFF and AINT_SUM can go away, too.
>
> Both statements are -- in my opinion -- incorrect.
> And the real problem is really ugly, see below.
>
> After we seem to agree that MPI_Aint is used as it is used, i.e.,
> to currently store
>  - absolute addresses (which means the bits of a 64-bit unsigned address
>    interpreted as a signed twos-complement 64-bit integer,
>    i.e., values between -2**63 and +2**63-1
>    (and only here is the discussion about whether some higher bits may
>     be used to address segments)
>  - relative addresses between -2**63 and +2**63-1
>

C and C++ do not allow pointer arithmetic outside of a single array, so I'm
not sure how one would generate relative addresses this large, particularly
on an x86_64 machine, where the underlying virtual addresses are 48 or 57
bits.
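
To make the intended usage concrete, here is a minimal sketch of how a
relative displacement inside one contiguous buffer is normally obtained
(untested, standard MPI-3 calls only; the buffer and indices are just
placeholders):

  #include <mpi.h>

  /* sketch: relative displacement within one contiguous buffer */
  MPI_Aint displacement_in_buffer(void)
  {
      double buf[1000];
      MPI_Aint base, elem;

      MPI_Get_address(&buf[0],   &base);   /* absolute address of the start   */
      MPI_Get_address(&buf[500], &elem);   /* absolute address of one element */
      return MPI_Aint_diff(elem, base);    /* relative displacement:
                                              500*sizeof(double)              */
  }

Everything there fits comfortably in a signed 64-bit MPI_Aint.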


>  - byte counts between 0 and 2**63-1
>

Byte counts in that range should go through functions that take MPI_Count,
not MPI_Aint, once those functions are defined in MPI 4.0.
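
MPI-3 already has MPI_Count-based size/extent queries for datatypes; a small
sketch of the existing calls (nothing new is assumed here, the contiguous
type is just an example):

  #include <mpi.h>

  void query_type_size(void)
  {
      MPI_Datatype type;
      MPI_Count size, lb, extent;

      /* a datatype describing 2^20 doubles, purely for illustration */
      MPI_Type_contiguous(1 << 20, MPI_DOUBLE, &type);
      MPI_Type_commit(&type);

      MPI_Type_size_x(type, &size);              /* total bytes, as MPI_Count */
      MPI_Type_get_extent_x(type, &lb, &extent); /* lb/extent, as MPI_Count   */

      MPI_Type_free(&type);
  }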


>
> And that for two absolute addresses within the same "sequential storage"
> (defined in MPI-3.1 Sect. 4.1.12 page 115 lines 17-19), it is allowed
> to use a minus operator (as long as integer overflow detection is
> switched off) or MPI_Aint_diff.
>

Again, pointer arithmetic has to be within a single array.  It's pretty
hard to generate an overflow in this context.


> In principle, the MPI standard is not fully consistent with that:
>
> MPI-3.1 page 102 lines 45-46 say:
>  "To ensure portability, arithmetic on MPI addresses
>   must
>   be performed using the MPI_AINT_ADD and MPI_AINT_DIFF functions."
> and
>

Yes, it also says that, for portability, RMA memory needs to come from
MPI_Alloc_mem rather than malloc or the like.  That never matters in
practice.
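
For reference, the "portable" arithmetic the standard asks for, next to what
most real codes actually write, looks roughly like this (a sketch, not a
recommendation either way):

  #include <mpi.h>
  #include <stdlib.h>

  void address_arithmetic(void)
  {
      char *buf = malloc(1024);
      MPI_Aint base, target;

      MPI_Get_address(buf, &base);

      /* what MPI-3.1 asks for ("portable") */
      target = MPI_Aint_add(base, 512);

      /* what most codes write, and what works everywhere in practice */
      target = base + 512;

      (void)target;   /* silence unused-variable warnings in this sketch */
      free(buf);
  }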


> > > ... MPI-3.1 2.5.6 "Absolute
> > > Addresses and Relative Address Displacements" p16:39-43:
> > >
> > > "For retrieving absolute addresses or any calculation with absolute
> addresses, one
> > > should
> > > use the routines and functions provided in Section 4.1.5. Section
> > > 4.1.12 provides additional rules for the correct use of absolute
> addresses. For
> > > expressions with relative displacements or other usage without absolute
> > > addresses, intrinsic operators (e.g., +, -, *) can be used."
>
> And now about large counts, especially if we want to extend routines
> that currently use MPI_Aint to something larger, i.e., MPI_Aint_x or
> MPI_Count.
>

MPI_Aint_x will not exist.  MPI_Count is already a thing and is at least as
large as MPI_Aint and MPI_Offset.
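
If anyone wants to convince themselves of that size relationship on a given
implementation, a compile-time check is enough (C11, purely illustrative):

  #include <mpi.h>

  _Static_assert(sizeof(MPI_Count) >= sizeof(MPI_Aint),
                 "MPI_Count must be at least as large as MPI_Aint");
  _Static_assert(sizeof(MPI_Count) >= sizeof(MPI_Offset),
                 "MPI_Count must be at least as large as MPI_Offset");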


> Here, the major problem is the automatic cast within an assignment
>
>   MPI_Aint addr;
>   MPI_Aint_x (or MPI_Count) addr_x;
>
>   MPI_Get_address(...., &addr);
>   addr_x = addr;  // ***this statement is the problem***
>
>
No.  This statement is not a problem.  MPI_Count is required to hold the
full range of MPI_Aint.  You can assign MPI_Aint to MPI_Count without
truncation.
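
Concretely, the assignment in question is an ordinary widening conversion
between signed integer types; a minimal sketch:

  #include <mpi.h>
  #include <stdio.h>

  void widen_address(void)
  {
      double x;
      MPI_Aint  addr;
      MPI_Count addr_c;

      MPI_Get_address(&x, &addr);
      addr_c = addr;   /* value-preserving: every MPI_Aint value fits in MPI_Count */

      printf("%lld\n", (long long)addr_c);
  }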


> let's take my example from a previous email (using an 8-bit MPI_Aint)
>
>   addr1 01111111 = 127 (signed int) = 127 (unsigned int)
>   addr2 10000001 = -127 (signed int) = 129 (unsigned int)
>
> Internally the addresses are viewed by the hardware and OS as unsigned.
> MPI_Aint interprets the same bits as a signed int.
>
> addr2-addr1 = 129 - 127 = 2 (as unsigned int)
> but in a real application code with the "-" operator:
>             = -127 - 127 = -254
>   --> signed int overflow, because 8 bits can express only -128 .. +127
>   --> detected or automatically corrected with +256 --> -254+256 = 2
>
> And now with a 12-bit MPI_Aint_x
>
>   addr1_x := addr1  results in (by sign propagation)
>   addr1_x = 000001111111 = 127 (signed int) = 127 (unsigned int)
>
>   addr2_x := addr2  results in (by sign propagation)
>   addr2_x = 111110000001 = -127 (signed int)
>           = 3969 (unsigned int), i.e., no longer the original address 129
>
> and then
>   addr2_x - addr1_x = -127 - 127 = -254
> which is a normal integer within 12 bits,
> and therefore ***NO*** overflow correction!
>
> And therefore a completely ***wrong*** result.
>
> Using two different types for absolute addresses seems to be a
> real problem in my opinion.
>
>
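
For what it's worth, the effect described above can be modeled on any machine
with small integer types standing in for the 8-bit and 12-bit MPI_Aint
variants; a toy sketch only, not a statement about real MPI_Aint values:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      int8_t addr1 = 127;    /* bits 01111111: unsigned address 127 */
      int8_t addr2 = -127;   /* bits 10000001: unsigned address 129 */

      /* narrow type: the difference wraps modulo 256 back to the
         correct distance of 2 (on two's-complement hardware)       */
      int8_t  d_narrow = (int8_t)(addr2 - addr1);

      /* widened type: sign extension keeps -254, nothing wraps */
      int16_t d_wide = (int16_t)addr2 - (int16_t)addr1;

      printf("%d %d\n", d_narrow, d_wide);   /* prints: 2 -254 */
      return 0;
  }
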
> And of course a signed 64-bit MPI_Aint allows one to specify only
> 2**63-1 bytes, which is about 8*1024**6 bytes,
> i.e., roughly 9.2 exabytes.
>
> On systems with less than that much memory per MPI process, this is not
> a problem for message passing, but it is a problem for I/O,
>

That is why MPI_Count is potentially larger than MPI_Aint, because it also
has to hold MPI_Offset for IO purposes.
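
As a concrete example of why that matters, file sizes already come back as
MPI_Offset, which MPI_Count can always hold; a hedged sketch (the file name
is just a placeholder):

  #include <mpi.h>

  void file_size_as_count(void)
  {
      MPI_File   fh;
      MPI_Offset fsize;
      MPI_Count  nbytes;

      MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY,
                    MPI_INFO_NULL, &fh);
      MPI_File_get_size(fh, &fsize);   /* file sizes are MPI_Offset, not MPI_Aint */
      nbytes = fsize;                  /* safe: MPI_Count >= MPI_Offset            */
      MPI_File_close(&fh);

      (void)nbytes;
  }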


> and therefore for derived datatypes.
> And derived datatypes use MPI_Aint at several locations,
> some of them with the possibility of providing absolute addresses.
>

You are welcome to create a ticket for large-count datatype functions that
use MPI_Count if one does not already exist.

Jeff


>
> A solution to this problem does not seem trivial, or is there one?
>
> And always making MPI_Aint larger than 8 bytes is not an option either,
> based on the ABI discussion, and it would also waste memory.
>
>
> Best regards
> Rolf
>
>
> ----- Original Message -----
> > From: "mpiwg-large-counts" <mpiwg-large-counts at lists.mpi-forum.org>
> > To: "Jeff Squyres" <jsquyres at cisco.com>
> > Cc: "Jeff Hammond" <jeff.science at gmail.com>, "James Dinan" <
> james.dinan at intel.com>, "mpiwg-large-counts"
> > <mpiwg-large-counts at lists.mpi-forum.org>
> > Sent: Friday, October 25, 2019 1:02:35 AM
> > Subject: Re: [Mpiwg-large-counts] Large Count - the principles for
> counts, sizes, and byte and nonbyte displacements
>
> > Jim (cc) suffered the most in MPI 3.0 days because of AINT_DIFF and
> > AINT_SUM, so maybe he wants to create this ticket.
> >
> > Jeff
>
>
> > On Thu, Oct 24, 2019 at 2:41 PM Jeff Squyres (jsquyres)
> > <jsquyres at cisco.com> wrote:
> >
> >
> > Not opposed to ditching segmented addressing at all. We'd need a ticket
> > for this ASAP, though.
> >
> > This whole conversation is predicated on:
> >
> > - MPI supposedly supports segmented addressing
> > - MPI_Aint is not sufficient for modern segmented addressing (i.e.,
> >   representing an address that may not be in main RAM and is not mapped
> >   into the current process' linear address space)
> >
> > If we no longer care about segmented addressing, that makes a whole
> > bunch of BigCount stuff a LOT easier. E.g., MPI_Aint can basically be a
> > non-segment-supporting address integer. AINT_DIFF and AINT_SUM can go
> > away, too.
>
>
> > On Oct 24, 2019, at 5:35 PM, Jeff Hammond via mpiwg-large-counts
> > <mpiwg-large-counts at lists.mpi-forum.org> wrote:
> >
> > Rolf:
> >
> > Before anybody spends any time analyzing how we handle segmented
> > addressing, I want you to provide an example of a platform where this is
> > relevant. What system can you boot today that needs this and what MPI
> > libraries have expressed an interest in supporting it?
> >
> > For anyone who didn't hear, ISO C and C++ have finally committed to
> > twos-complement integers
> > (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r1.html,
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm) because modern
> > programmers should not be limited by hardware designs from the 1960s. We
> > should similarly not waste our time on obsolete features like segmentation.
> >
> > Jeff
> >
> > On Thu, Oct 24, 2019 at 10:13 AM Rolf Rabenseifner via mpiwg-large-counts
> > <mpiwg-large-counts at lists.mpi-forum.org> wrote:
> >
> >
> >> I think that changes the conversation entirely, right?
> >
> > Not the first part, the state-of-current-MPI.
> >
> > It may change something for the future, or a new interface may be needed.
> >
> > Please, can you describe how MPI_Get_address can work with the
> > different variables from different memory segments?
> >
> > Or whether a completely new function or a set of functions is needed.
> >
> > If we can still express variables from all memory segments as
> > input to MPI_Get_address, there may still be a way to flatten
> > the result of some internal address inquiry into a
> > signed integer with the same behavior as MPI_Aint today.
> >
> > If this is impossible, then a new way of thinking and a new solution
> > may be needed.
> >
> > I really want to see examples for all current stuff as you
> > mentioned in your last email.
> >
> > Best regards
> > Rolf
>
>
> ----- Original Message -----
> > From: "HOLMES Daniel" <d.holmes at epcc.ed.ac.uk>
> > To: "mpiwg-large-counts" <mpiwg-large-counts at lists.mpi-forum.org>
> > Cc: "Rolf Rabenseifner" <rabenseifner at hlrs.de>, "Jeff Squyres" <
> jsquyres at cisco.com>
> > Sent: Thursday, October 24, 2019 6:41:34 PM
> > Subject: Re: [Mpiwg-large-counts] Large Count - the principles for
> counts, sizes, and byte and nonbyte displacements
>
> > Hi Rolf & Jeff,
> >
> > I think this wiki article is instructive on this topic also:
> > https://en.wikipedia.org/wiki/X86_memory_segmentation
> >
> > This seems like a crazy memory addressing system to me personally, but
> > it is a (historic) example of a segmented addressing approach that
> > MPI_Aint can support.
> >
> > The “strange properties” for arithmetic are strange indeed, depending
> > on what the MPI_Aint stores and how.
> >
> > If MPI_Aint was 20 bits long and stores only the address, then it cannot
> > be used to determine uniquely which segment is being used or what the
> > offset is within that segment (there are 4096 possible answers). Does
> > MPI need that more detailed information? Probably - because segments
> > were a way of implementing memory protection, i.e. accessing a segment
> > you did not have permission to access led to a “segmentation fault”
> > error. I do not know enough about these old architectures to say whether
> > an attempt to access the *same byte* using two different segment:offset
> > pairs that produce the *same* address could result in different
> > behaviour. That is, if I have access permissions for segment 3 but not
> > for segment 4, I can access {seg=3,offset=2^16-16} but can I access
> > {segment=4,offset=2^16-32}, which is the same byte? If not, then MPI
> > needs to store segment and offset inside MPI_Aint to be able to check
> > and to set registers correctly.
> >
> > If MPI_Aint is 32 bits long and stores the segment in the first 16 bits
> > and the offset in the last 16 bits, then the 20 bit address can be
> > computed in a single simple instruction and both segment and offset are
> > immediately retrievable. However, doing ordinary arithmetic with this
> > bitwise representation is unwise because it is a compound structure
> > type. Let us subtract 1 from an MPI_Aint of this layout which stores
> > offset of 0 and some non-zero segment. We get offset (2^16-1) in segment
> > (s-1), which is not 1 byte before the previous MPI_Aint because segments
> > overlap. The same happens when adding and overflowing the offset portion
> > - it changes the segment in an incorrect way. Segment++ moves the
> > address forward only 16 bytes, not 2^16 bytes.
> >
> > The wrap-around from the end of the address space back to the beginning
> > is also a source of strange properties for arithmetic.
> >
> > One of the key statements from that wiki page is this:
> >
> > "The root of the problem is that no appropriate address-arithmetic
> > instructions suitable for flat addressing of the entire memory range are
> > available.[citation needed] Flat addressing is possible by applying
> > multiple instructions, which however leads to slower programs."
> >
> > Cheers,
> > Dan.
> > —
> > Dr Daniel Holmes PhD
> > Architect (HPC Research)
> > d.holmes at epcc.ed.ac.uk
> > Phone: +44 (0) 131 651 3465
> > Mobile: +44 (0) 7940 524 088
> > Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh,
> EH8 9BT
> > —
> > The University of Edinburgh is a charitable body, registered in
> Scotland, with
> > registration number SC005336.
> > —
>
>
> > ----- Original Message -----
> >> From: "Jeff Squyres" < [ mailto:jsquyres at cisco.com | jsquyres at cisco.com
> ] >
> >> To: "Rolf Rabenseifner" < [ mailto:rabenseifner at hlrs.de |
> rabenseifner at hlrs.de ]
> >> >
> >> Cc: "mpiwg-large-counts" < [ mailto:
> mpiwg-large-counts at lists.mpi-forum.org |
> >> mpiwg-large-counts at lists.mpi-forum.org ] >
> >> Sent: Thursday, October 24, 2019 5:27:31 PM
> >> Subject: Re: [Mpiwg-large-counts] Large Count - the principles for
> counts,
> >> sizes, and byte and nonbyte displacements
> >
> >> On Oct 24, 2019, at 11:15 AM, Rolf Rabenseifner
> >> < [ mailto:rabenseifner at hlrs.de | rabenseifner at hlrs.de ] <mailto: [
> >> mailto:rabenseifner at hlrs.de | rabenseifner at hlrs.de ] >> wrote:
> >>
> >> For me, it looked like that there was some misunderstanding
> >> of the concept that absolute and relative addresses
> >> and number of bytes that can be stored in MPI_Aint.
> >>
> >> ...with the caveat that MPI_Aint -- as it is right now -- does not
> >> support modern segmented memory systems (i.e., where you need more than
> >> a small number of bits to indicate the segment where the memory lives).
> >>
> >> I think that changes the conversation entirely, right?
> >>
> >> --
> >> Jeff Squyres
> >> jsquyres at cisco.com
> >
> > --
> > Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
> > High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
> > University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
> > Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
> > Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .
> > _______________________________________________
> > mpiwg-large-counts mailing list
> > mpiwg-large-counts at lists.mpi-forum.org
> > https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts
> >
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
> > http://jeffhammond.github.io/
> > _______________________________________________
> > mpiwg-large-counts mailing list
> > mpiwg-large-counts at lists.mpi-forum.org
> > https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts
> >
> >
> > --
> > Jeff Squyres
> > jsquyres at cisco.com
> >
> >
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
> > http://jeffhammond.github.io/
> >
> > _______________________________________________
> > mpiwg-large-counts mailing list
> > mpiwg-large-counts at lists.mpi-forum.org
> > https://lists.mpi-forum.org/mailman/listinfo/mpiwg-large-counts
>
> --
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de .
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 .
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832 .
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner .
> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307) .
>


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/

