[MLton] The bool array/vector performance bug

Tue Nov 4 19:35:09 PST 2008

On Tue, 4 Nov 2008, Vesa Karvonen wrote:
> On Mon, Jun 2, 2008 at 1:30 AM, Matthew Fluet <fluet at tti-c.org> wrote:
>> On Sat, 31 May 2008, Vesa Karvonen wrote:
>
> Indeed.  I would bet that accessing 32-bit elements on a modern 32-bit
> processor is only faster when it hits the L1 cache and even then the
> difference is likely to be minimal (if any).  I would not be surprised
> to see that it would actually improve overall performance to
> essentially always pack booleans as 1 bit elements in memory
> representations (not just in arrays).  It doesn't cost many extra
> instructions to test/set a bit (when properly optimized) rather than a
> byte/word and the savings in cache space (and even in time spent doing
> GC) could be significant.

I'm not convinced that 1 bit booleans in non-array objects would save that 
much on GC time (by saving object space).  All objects need to be at least 
4byte aligned, so one would need to have an object with more than 4 
booleans to be more space efficient with 1 bit booleans than with 8bit 
booleans.

>>> - don't allow ML bool in the FFI, but do
>>> - allow C Bool ("C_Bool.t") in the FFI.
>>
>> The C_Bool structure is in the Basis Library implementation, but not
>> currently exported by c-types.mlb.  The C_Bool structure is equal to (the
>> application of the WordToBool functor to) a WordN structure, where N
>> corresponds to the target platform's size of the C-type _Bool.
>
> I think that for the purpose of writing portable code it would make
> sense to export C_Bool regardless of whether the ML bool
> representation is changed.

Sure, though I don't believe that the C _Bool type is used much in 
standard headers.

> Given a set of alternatives ways to improve the ML bool representation
> (the goal), it makes sense to pick an alternative that also allows (or
> encourages) one to write portable code.  Clearly divorcing ML bool
> from C bool does that.  The choice of ML bool representation could
> then be done freely.  It could theoretically even be a compiler switch
> (-default-type {bool1, bool8, bool16, bool32, bool64}).

Agreed, that it would be good to allow/encourage portable code.

>> One other item is the handling of primitives that return boolean values.
>> Some of these primitives are not implemented by the codegens (e.g., the x86
>> codegen doesn't implement Word64_{equal,lt}), and so are turned into C
>> functions (in <src>/mlton/codegen/ssa-to-rssa.fun).  (This is why they are
>> declared in <src>/runtime/gen/basis-ffi.def.)  If the C functions do not
>> return values in the same representation as that of an ML bool, then you
>> will need to insert some coercion, either a word-size coercion or an
>> explicit test against 0.
>
> Hmm... What would be the best way to fix this?  These are somewhat
> special in that they are really a part of the runtime/compiler and can
> be changed at will if/when the bool representation is changed and I
> wouldn't want to add extra overhead to them.

I don't think there is any way to "fix" this other than to make sure that 
the C functions that implement conditional primitives agree with the 
compiler on the representation of booleans, or insert coercions.

Here are some other thoughts.  There is a long-standing tension/debate on 
whether conditionals should be value-producing or control-flow-producing. 
MLton currently takes a value-producing view of (most) conditionals (the 
exception being the checked-arithmetic primitives); that is, the Word32 
equality primitive is used like:

     x_5: bool = Word32_equal (x_2, global_1)

Under the control-flow-producing view, it would be used like:

     if Word32_equal (x_2, global_1) then L1 () else L2 ()

where the primitive is a transfer (the "if-then-else" being 
pretty-printing sugar).  Note that the two views are equivalent, in the 
sense that one can always convert a value to control-flow:

     x_5: bool = Word32_equal (x_2, global_1)
     case x_5 of true => L1 () | false => L2 ()

and control-flow to a value:

   datatype boolz = truez | falsez

     if Word32_equal (x_2, global_1) then L1 () else L2 ()
   L1 ()
     x_1: boolz = truez
     L3 (x_1)
   L2 ()
     x_2: boolz = falsez
     L3 (x_2)
   L3 (x_5: boolz)
     ...

Note that the boolz type is not special in any way; indeed, the compiler 
is free to use any representation it sees fit -- currently, MLton would 
represent it as 1bit (padded to 8bits when used in an array).

I point this out because one could introduce an pass that separates the 
bool type (as returned by conditional primitives) from the type of the 
conditional value stored in data structures.  That is, introduce the boolz 
datatype and transform:

     x_5: bool = Word32_equal (x_2, global_1)
     ...

to

     z: bool = Word32_equal (x_2, global_1)
     case z of true => L1 () | false => L2 ()
   L1 ()
     zt: boolz = truez
     L3 (zt: boolz)
   L2 ()
     zf: boolz = falsez
     L3 (zf: boolz)
   L3 (x_5: boolz)
     ...

for fresh z, L1, zt, L2, zf, L3 and transform

     case y of true => Lt () | false => Lf ()

to

     case y of truez => Lt () | falsez => Lf ()

and otherwise transform the "bool" type to "boolz" type.  One could 
optimize the transformation; it doesn't make much sense to transform

     x_5: bool = Word32_equal (x_2, global_1)
     case x_5 of true => L1 () | false => L2 ()

when x_5 is otherwise not used (though, the knownCase optimization should 
clean up after the un-optimized transformation in this case).  Note, that 
you could apply a similar transformation to foreign functions with bool 
arguments or return; one would require a much more complicated 
transformation to handle foreign functions with "bool array", "bool 
vector", or "bool ref" arguments or results.  (One would need to do a 
dataflow analysis to determine which objects need to maintain a 
representation compatible with C.)

So, there is a way of changing the space usage of booleans without needing 
to change the implementation of the primitives.  There is a little 
overhead here when a 32-bit integer is returned from a C function 
(implementing a primitive) and then explicitly tested before doing an 
explicit assignment of 0 or 1 (as an 8bit integer).