[MLton] The bool array/vector performance bug

Tue Nov 4 01:49:44 PST 2008

On Mon, Jun 2, 2008 at 1:30 AM, Matthew Fluet <fluet at tti-c.org> wrote:
> On Sat, 31 May 2008, Vesa Karvonen wrote:
>>
>> The bool array/vector performance bug (bool array/vector uses 4-bytes per
>> element) has been discussed on the list a number of times (you can find
>> them by googling for "bool representation", for example).
>
> It is not clear to me that using 4-bytes per bool (rather than 1-bit or
> 1-byte) is best characterized as a "performance bug".

(I took that phrase from some of the old e-mails on this subject.)

>  I suspect that on
> most 32-bit systems, accessing a 4-byte element is more efficient than
> accessing a 1-byte element.

Modern CPUs tend to have instructions like mozx/movsx on the x86 and
for a purpose.  Accessing 8-bit data is not rare.  Such an instruction
may be slightly slower than a regular non-extending move instruction,
but not by much.

>  On the other hand, it is certainly the case
> that more 1-byte elements will fit in a cache line.

Indeed.  I would bet that accessing 32-bit elements on a modern 32-bit
processor is only faster when it hits the L1 cache and even then the
difference is likely to be minimal (if any).  I would not be surprised
to see that it would actually improve overall performance to
essentially always pack booleans as 1 bit elements in memory
representations (not just in arrays).  It doesn't cost many extra
instructions to test/set a bit (when properly optimized) rather than a
byte/word and the savings in cache space (and even in time spent doing
GC) could be significant.

>> - don't allow ML bool in the FFI, but do
>> - allow C Bool ("C_Bool.t") in the FFI.
>
> The C_Bool structure is in the Basis Library implementation, but not
> currently exported by c-types.mlb.  The C_Bool structure is equal to (the
> application of the WordToBool functor to) a WordN structure, where N
> corresponds to the target platform's size of the C-type _Bool.

I think that for the purpose of writing portable code it would make
sense to export C_Bool regardless of whether the ML bool
representation is changed.

>> I think that this is the pretty much the only sane alternative, because
>> the resulting code will be portable (the size of C bool is platform
>> dependent) and it allows maximal flexibility on the ML side.
>
> Portability is a different concern than performance.

Given a set of alternatives ways to improve the ML bool representation
(the goal), it makes sense to pick an alternative that also allows (or
encourages) one to write portable code.  Clearly divorcing ML bool
from C bool does that.  The choice of ML bool representation could
then be done freely.  It could theoretically even be a compiler switch
(-default-type {bool1, bool8, bool16, bool32, bool64}).

>> The current approach, using 4-byte bools, is just plain wrong, IMO,
>> because a C bool is not specified to be 4-bytes.
>
> Fair enough, but we don't currently specify that an ML bool is equivalent to
> a C _Bool.  Rather, we specify that an ML bool is equivalent to a C int32_t
> (see http://mlton.org/ForeignFunctionInterfaceTypes).  We don't currently
> specify that any ML type is equivalent to a C _Bool.

That is a good point.

> We actually go through some hoops in the elaborator
> (<src>/mlton/elaborate/elaborate-core.fun) to compare values coming to/from
> C symbols against 0 when they are being used as ML bool.  (This could be
> simplified by excluding ML bool from the FFI.)  Though, this doesn't help
> when the bool is coming from an _import or _export and it doesn't help if
> the FFI is used with an indirect type.  That is, all bets are off if you
> have:
>
> z.c:
>  void f(int32_t *a, int32_t i) { a[i] = 7; }
>
> z.sml:
>  val f = _import "f": bool array -> unit;
>  val a = Array.tabulate (10, fn () => true);
>  val () = ... f 3 ...  Array.sub (a, 3) ...

Indeed.  Those are good reasons to exclude (ML) bool from the FFI.

>> Using 4-bytes per bool on the ML side is also highly inefficient and
>> (which doesn't make it wrong but) makes MLton look bad on some toy
>> benchmarks
>> (http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=all).
>
> Changing the benchmark to use Word8.word rather than bool did show a speedup
> (which I attribute to packing more elements into a cache line):
[...]

Indeed.  Also note that there is a version of the benchmark using bit
arrays (1-bit per element).  It runs even faster than the 8-bits per
element version.  Compare the times for N=9:

http://shootout.alioth.debian.org/gp4/fulldata.php?test=nsievebits&p1=mlton-2&p2=mlton-2&p3=mlton-2&p4=mlton-2
http://shootout.alioth.debian.org/gp4/fulldata.php?test=nsieve&p1=mlton-2&p2=mlton-2&p3=mlton-2&p4=mlton-2

> One other item is the handling of primitives that return boolean values.
> Some of these primitives are not implemented by the codegens (e.g., the x86
> codegen doesn't implement Word64_{equal,lt}), and so are turned into C
> functions (in <src>/mlton/codegen/ssa-to-rssa.fun).  (This is why they are
> declared in <src>/runtime/gen/basis-ffi.def.)  If the C functions do not
> return values in the same representation as that of an ML bool, then you
> will need to insert some coercion, either a word-size coercion or an
> explicit test against 0.

Hmm... What would be the best way to fix this?  These are somewhat
special in that they are really a part of the runtime/compiler and can
be changed at will if/when the bool representation is changed and I
wouldn't want to add extra overhead to them.

> Contrary to some of the comments I made in the threads cited above, getting
> hold of platform-dependent types in the compiler proper has proven
> expedient.  So, if it were necessary to have the size/representation of C
> _Bool known by the compiler, then it should be possible to do so; see the
> implementation of Control.Target.Size.cint and CType.cint and
> <src>/runtime/gen/gen-sizes.c.  Thus, another alternative is to simply
> establish that ML bool is represented as C _Bool, by making WordSize.bool
> pull from Control.Target.Size.cbool.
>
> Nonetheless, I think that divorcing ML bool from C _Bool is the right
> decision.

I agree.  One can always use C_Bool.t if interfacing with C is the
overriding concern.  It also allows the widest latitude in choosing
the ML bool representation and avoids nasty cases like mutating an ML
bool array in C.

-Vesa Karvonen