[MLton] Re: [MLton-user] FFI, MacOS PPC, and 16-bit ints

Matthew Fluet fluet at tti-c.org
Mon Dec 10 08:26:11 PST 2007


[Moving from mlton-user to mlton.]

On Tue, 27 Nov 2007, Vesa Karvonen wrote:
> On Nov 27, 2007 10:02 PM, Matthew Fluet <fluet at tti-c.org> wrote:
> [...]
>> It is a bug in MLton, failing to account for a big-endian target.  All
>> objects in the heap are padded for alignment purposes, so an Int16.int ref
>> is represented as a pointer to a 32-bit heap object, 16 bits of which
>> correspond to the Int16.int and 16 bits of which are (ignored) junk.
>> MLton does this packing/padding in a manner that is independent of the
>> endianness of the target platform.  This has the unfortunate consequence
>> that on a big endian platform, the int16_t* that C sees points to the
>> 16-bits of junk, not the 16-bits of Int16.int.
> [...]
>
> Browsing through the code, it would seem that the layout is decided in
> the make function starting at line 979 in packed-representation.fun.

The layout of a tuple is decided by the make function at 
packed-representation.fun:979.

> It would also seem (and seems plausible to me) that a ref is compiled
> to a tuple object with a single mutable element (unless the ref gets
> flattened to a bigger tuple).

Correct.

> Thinking about this (haven't yet read the whole make function), I
> don't quite understand why endianness should matter.  I would assume
> that the pointer passed to C is the same pointer value as used in the
> ML world to point to the ref cell.

The pointer passed to C is the same pointer value as used in the ML world 
to point to the ref cell.

> It would seem to me that MLton
> currently creates a different layout depending on endianness.

It does not.

> On a
> little-endian arch, the padding is added after the used bits, while on
> a big-endian arch, the padding is added before the used bits.

This isn't correct when considering the bits, but one could view it as 
correct when considering the bytes.

Consider the layout that MLton would produce for the type:
   type t = (Int15.t * Int7.t)

MLton will greedily pack the components of the tuple into 32-bit bins. 
So, this yields the representation type
   [Word15, Word7, Bits10]
which we interpret in the following manner: given a Word32 value of this 
type, bits 0 - 14 correspond to the Word15, bits 15 - 21 correspond to 
the Word7, and bits 22 - 31 correspond to the Bits10 (junk).  Hence, we 
extract the individual components as follows:

   z : [Word15,Word7,Bits10]
   y : [Word15,Bits1] = WordU32_extdToWord16 (z)
   x : [Word7,Bits25] = Word32_rshift (z, 15)
   w : [Word7,Bits1] = WordU32_extdToWord8 (x)

That is, to access each component, we right shift to get the component of 
interest at bit 0 and then take the appropriate low bits (bouncing up to 
the smallest primitive type that contains the component).  Note that these 
operations (rshift, lowbits, and also andb and orb masking to write 
mutable components back into a tuple) are all independent of the 
endianness of the target.  The low 8 bits or 16 bits of a word32 value 
mean the same thing on big-endian and little-endian targets.

If we have an object pointer to that tuple, then we first need to extract 
the Word32 value and then continue as above to access each field:
   opt_X = Normal {hasIdentity = true, ty = [Word15,Word7,Bits10]}
   a : opt_X
   z : [Word15,Word7,Bits10] = OW32(a, 0)
   ... as above ...
where OW32(a, 0) means to extract a 32-bit value from the object pointer a 
at byte offset 0.
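
To make the endian-independence concrete, here is a rough C analogue of 
these selects (the helper names and the use of C are mine, purely for 
illustration; MLton emits the corresponding operations in its 
intermediate language, not C):

   #include <stdint.h>

   /* The heap object is a single 32-bit word with the Word15 in bits
      0 - 14 and the Word7 in bits 15 - 21; the high 10 bits are junk. */

   static uint16_t get_word15 (const uint32_t *a) {
      uint32_t z = a[0];                  /* OW32(a, 0)            */
      return (uint16_t)(z & 0x7FFF);      /* low 15 bits           */
   }

   static uint8_t get_word7 (const uint32_t *a) {
      uint32_t z = a[0];                  /* OW32(a, 0)            */
      uint32_t x = z >> 15;               /* Word32_rshift (z, 15) */
      return (uint8_t)(x & 0x7F);         /* low 7 bits            */
   }

   /* Writing a mutable component back is the same word-level andb/orb
      masking, again independent of endianness. */
   static void set_word7 (uint32_t *a, uint8_t w) {
      a[0] = (a[0] & ~(0x7Fu << 15)) | (((uint32_t)(w & 0x7F)) << 15);
   }

None of these operations care whether the target is little or big endian, 
because they only ever manipulate the 32-bit value, never its individual 
bytes.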


Now, consider the layout that MLton would produce for the type:
   type t = (Int16.t * Int8.t)
This yields the representation type
   [Word16, Word8, Bits8]
which we interpret in the following manner: given a Word32 value of this 
type, bits 0 - 15 correspond to the Word16, bits 16 - 23 correspond to 
the Word8, and bits 24 - 31 correspond to the Bits8 (junk).  As before, we 
could extract individual components as follows:

   z : [Word16,Word8,Bits8]
   y : Word16 = WordU32_extdToWord16 (z)
   x : [Word8,Bits24] = Word32_rshift (z, 16)
   w : Word8 = WordU32_extdToWord8 (x)

Now, if we have an object pointer to that tuple, then we have choices.  We 
could extract the Word32 value and then continue as above to access each 
field:
   opt_X = Normal {hasIdentity = true, ty = [Word16,Word8,Bits8]}
   a : opt_X
   z : [Word16,Word8,Bits8] = OW32(a, 0)
   ... as above ...
But, because the components are of primitive sizes, we can be slightly 
more efficient.  We can use OW16(a,?) and OW8(a,?) to directly access the 
appropriate components.  Now we need to be aware of the endianness of the 
target.  To access the Word16 component we need to know the *byte* offset 
of bits 0 - 15 in a Word32 value (relative to the pointer to the Word32 
value).  On a little endian target, the byte offset is zero, but on a big 
endian target, the byte offset is two.  Similarly, to access the Word8 
component, we need to know the *byte* offset of bits 16 - 23 in a Word32 
value.  On a little endian target, the byte offset is 2, but on a big 
endian target, the byte offset is 1.  So, on a little endian target, we 
extract individual components like:
   opt_X = Normal {hasIdentity = true, ty = [Word16,Word8,Bits8]}
   a : opt_X
   y : Word16 = OW16(a, 0)
   w : Word8 = OW8(a, 2)
and on a big endian target, we extract individual components like:
   opt_X = Normal {hasIdentity = true, ty = [Word16,Word8,Bits8]}
   a : opt_X
   y : Word16 = OW16(a, 2)
   w : Word8 = OW8(a, 1)
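
As a C sketch of these endianness-dependent selects (again, the names and 
the compile-time endianness switch are only illustrative assumptions):

   #include <stdint.h>
   #include <string.h>

   /* Byte offsets of the Word16 (bits 0 - 15) and Word8 (bits 16 - 23)
      components within the packed 32-bit word, i.e. the '?' in
      OW16(a, ?) and OW8(a, ?). */
   #ifdef ASSUME_BIG_ENDIAN
   enum { INT16_OFFSET = 2, INT8_OFFSET = 1 };
   #else
   enum { INT16_OFFSET = 0, INT8_OFFSET = 2 };
   #endif

   static int16_t get_int16 (const void *a) {
      int16_t y;
      memcpy (&y, (const char *)a + INT16_OFFSET, sizeof y);
      return y;
   }

   static int8_t get_int8 (const void *a) {
      int8_t w;
      memcpy (&w, (const char *)a + INT8_OFFSET, sizeof w);
      return w;
   }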

Actually, MLton currently performs this optimization only for Word8 
components, but there is no reason not to perform it for Word16 
components as well.

This optimized indirect access to (well-aligned) primitive sized 
components is handled by the 'getByteOffset' function 
(packed-representation.fun:1084) and the conditional at 
packed-representation.fun:1102.
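
A sketch of the byte-offset computation itself (this is not MLton's 
getByteOffset, which is SML code in packed-representation.fun; the C 
below just illustrates the arithmetic):

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Byte offset of a well-aligned, primitive-sized component that
      occupies 'component_bytes' bytes starting at bit 'bit_offset'
      within a 32-bit word. */
   static size_t byte_offset (size_t bit_offset, size_t component_bytes,
                              bool big_endian) {
      size_t le_offset = bit_offset / 8;
      return big_endian
         ? sizeof (uint32_t) - component_bytes - le_offset
         : le_offset;
   }

For the [Word16,Word8,Bits8] example this gives offsets 0 and 2 on a 
little endian target and 2 and 1 on a big endian target, as above.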

> I noted earlier that the computed layout, as described in the comment
> before the function, is troublesome on some architectures (not
> currently supported by MLton).  Perhaps now would be a good time to
> change it to compute a smarter layout.

As explained above, many sub-word32 components are accessed by first 
accessing a word32 component and then using bitops.  So, in those cases, 
there isn't much advantage to putting the sub-word32 components first.  On 
the other hand, when we can use the optimized indirect access, then (on 
some architectures) it may be more efficient to have the smaller byte 
offsets by putting the sub-word32 components first.  In any case, I 
believe that this would be accomplished by simply moving the lines:
             val (offset, components) =
                simple (!word64s, Bytes.inWord64, offset, components)
             val (offset, components) =
                simple (!word32s, Bytes.inWord32, offset, components)
after the lines:
             val (offset, components) =
                makeWords (Array.length a - 1, offset, components)
Because makeWords is greedy, this will still mean that word16 components 
come before word8 components.

As for the original FFI bug, I'm thinking it might be best to give special 
treatment to tuples whose components are all of primitive type (probably a 
common case, and a superset of the types appearing in the FFI).  When all 
components are of primitive type, we can lay them out in an 
endian-respecting manner (always using indirect selects).
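
For completeness, a small C sketch of the symptom as originally reported 
(the value and variable names are made up; 'cell' stands in for the padded 
32-bit heap object behind an Int16.int ref, whose address is what the FFI 
hands to C as an int16_t*):

   #include <stdint.h>
   #include <stdio.h>

   int main (void) {
      /* Low 16 bits hold the Int16.int (0x1234); high 16 bits are junk. */
      uint32_t cell = 0xDEAD0000u | 0x1234u;
      /* The FFI reinterprets the object pointer as an int16_t*. */
      int16_t *p = (int16_t *)&cell;
      /* Little endian: prints 0x1234.  Big endian: prints the junk
         (0xdead), which is the bug described at the top of this thread. */
      printf ("%#x\n", (unsigned)(uint16_t)*p);
      return 0;
   }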


