[MLton] The bool array/vector performance bug

Sun Jun 1 16:30:40 PDT 2008

On Sat, 31 May 2008, Vesa Karvonen wrote:
> The bool array/vector performance bug (bool array/vector uses 4-bytes per
> element) has been discussed on the list a number of times (you can find
> them by googling for "bool representation", for example).

It is not clear to me that using 4-bytes per bool (rather than 1-bit or 
1-byte) is best characterized as a "performance bug".  I suspect that on 
most 32-bit systems, accessing a 4-byte element is more efficient than 
accessing a 1-byte element.  On the other hand, it is certainly the case 
that more 1-byte elements will fit in a cache line.

> - don't allow ML bool in the FFI, but do
> - allow C Bool ("C_Bool.t") in the FFI.

The C_Bool structure is in the Basis Library implementation, but not 
currently exported by c-types.mlb.  The C_Bool structure is equal to (the 
application of the WordToBool functor to) a WordN structure, where N 
corresponds to the target platform's size of the C-type _Bool.

> I think that this is the pretty much the only sane alternative, because
> the resulting code will be portable (the size of C bool is platform
> dependent) and it allows maximal flexibility on the ML side.

Portability is a different concern than performance.

> The current approach, using 4-byte bools, is just plain wrong, IMO,
> because a C bool is not specified to be 4-bytes.

Fair enough, but we don't currently specify that an ML bool is equivalent 
to a C _Bool.  Rather, we specify that an ML bool is equivalent to a C 
int32_t (see http://mlton.org/ForeignFunctionInterfaceTypes).  We don't 
currently specify that any ML type is equivalent to a C _Bool.

We actually go through some hoops in the elaborator 
(<src>/mlton/elaborate/elaborate-core.fun) to compare values coming 
to/from C symbols against 0 when they are being used as ML bool.  (This 
could be simplified by excluding ML bool from the FFI.)  Though, this 
doesn't help when the bool is coming from an _import or _export and it 
doesn't help if the FFI is used with an indirect type.  That is, all bets 
are off if you have:

z.c:
  void f(int32_t *a, int32_t i) { a[i] = 7; }

z.sml:
  val f = _import "f": bool array -> unit;
  val a = Array.tabulate (10, fn () => true);
  val () = ... f 3 ...  Array.sub (a, 3) ...

> Using 4-bytes per bool on the ML side is also highly inefficient and 
> (which doesn't make it wrong but) makes MLton look bad on some toy 
> benchmarks 
> (http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=all).

Changing the benchmark to use Word8.word rather than bool did show a 
speedup (which I attribute to packing more elements into a cache line):

[fluet at wolfpup tmp]$ for ((i=0;i<5;i++)); do /usr/bin/time ./nsieve-word8 12 > /dev/null ; done
16.43user 0.22system 0:17.58elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+18172minor)pagefaults 0swaps
16.38user 0.24system 0:17.67elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+18172minor)pagefaults 0swaps
16.42user 0.24system 0:17.59elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+18172minor)pagefaults 0swaps
16.37user 0.23system 0:17.55elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+18170minor)pagefaults 0swaps
16.44user 0.21system 0:17.68elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+18172minor)pagefaults 0swaps

[fluet at wolfpup tmp]$ for ((i=0;i<5;i++)); do /usr/bin/time ./nsieve-bool 12 > /dev/null ; done
22.88user 0.61system 0:25.62elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+60672minor)pagefaults 0swaps
22.92user 0.51system 0:25.01elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
24inputs+0outputs (0major+60672minor)pagefaults 0swaps
22.90user 0.53system 0:24.76elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+60671minor)pagefaults 0swaps
22.89user 0.54system 0:24.75elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+60671minor)pagefaults 0swaps
22.88user 0.53system 0:24.80elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+60672minor)pagefaults 0swaps

> So, what is the status of the bool FFI thingy?  And what places in the
> compiler/runtime/basis lib need to be changed to:
> - eliminate ML bool from the FFI,

Dropping '("Bool", CType.bool, Tycon.bool)' from the list of nullary FFI 
types at <src>/mlton/elaborate/elaborate-core.fun:755 will effectively 
remove ML bool from the FFI.  As noted above, you could also clean up the 
code paths in mkFetch, mkStore, and mkSymbol that arise from the isBool 
tests.

There are no essential uses of ML bool as an FFI type in the Basis Library 
implementation, but there are some uses in <src>/runtime/gen/basis-ffi.def 
where we declare C functions that are used as the implementation of 
comparison primitives that return bool.

> - add C_Bool to FFI (this might already be available?), and

It is implicitly available, as we have:
   structure C_Bool = WordToBool (type t = Word8.word val zero: t = 0wx0 val one: t = 0wx1)
in <src>/basis-library/config/c/$(TARGET_ARCH)-$(TARGET_OS)/c-types.sml 
(though Word8 might be Word32 on some platforms), and Word8.word is an FFI 
type.

> - change the representation of ML bool to use 8-bits (or 1-bit) (in arrays
>  and vectors at least)?

The representation is established in 
<src>/mlton/backend/packed-representation.fun; see lines 1407-1417.  The 
underlying word size is set by WordSize.bool 
(<src>/mlton/ast/word-size.fun).

I would look carefully at all the uses of WordSize.bool in the compiler,
and also all the uses of CType.bool.

> Is that all?  I would guess that changing the representation is about as
> difficult as knowing all the places that need to be changed.  If more
> programming is needed, I'd be happy to help to fix this.

One other item is the handling of primitives that return boolean values. 
Some of these primitives are not implemented by the codegens (e.g., the 
x86 codegen doesn't implement Word64_{equal,lt}), and so are turned into C 
functions (in <src>/mlton/codegen/ssa-to-rssa.fun).  (This is why they are 
declared in <src>/runtime/gen/basis-ffi.def.)  If the C functions do not 
return values in the same representation as that of an ML bool, then you 
will need to insert some coercion, either a word-size coercion or an 
explicit test against 0.

Contrary to some of the comments I made in the threads cited above, 
getting hold of platform-dependent types in the compiler proper has proven 
expedient.  So, if it were necessary to have the size/representation of C 
_Bool known by the compiler, then it should be possible to do so; see the 
implementation of Control.Target.Size.cint and CType.cint and 
<src>/runtime/gen/gen-sizes.c.  Thus, another alternative is to simply 
establish that ML bool is represented as C _Bool, by making WordSize.bool 
pull from Control.Target.Size.cbool.

Nonetheless, I think that divorcing ML bool from C _Bool is the right 
decision.