[MLton] benchmarks with -representation {normal,packed}

Stephen Weeks MLton@mlton.org
Fri, 23 Apr 2004 12:48:09 -0700


> Where's the speedup coming from?  Avoiding GCs?  Less memory traffic in
> allocating heap objects?  

My guess is less memory traffic.

Another possibility is a change in how case statements are
implemented for sum types like

	datatype t = A | B | C of int

The old way was to test whether the bottom two bits are zero, and
then branch either to a switch on the value-carrying variants or to
a switch on the non-value-carrying variants.

The new way is to do a single switch on all the tag bits, jumping
immediately to the correct non-value-carrying variant, or on to a
switch on the value-carrying variants.

> I could imagine a slowdown due to extra bit-fiddling to manipulate
> packed objects.

I think it's pretty unlikely.  A couple of bit ops are cheaper than
a memory op.

> I would expect packed representations to pay off in programs that are
> using a lot of memory and doing frequent GCs (so, I'm interested to hear
> if there is a self-compile speedup), but I didn't think that applied to
> many of the benchmarks.

Yeah, I agree that the benchmarks aren't the best demonstration of
this.  But I should at least cook one up to show what's possible :-).

Here's the time and allocation for a self compile with MLton compiled
-representation normal and -representation packed (each compiling
MLton -representation packed).

                  normal           packed           ratio
                  ---------------  ---------------  -----
time (sec)        415.13 + 197.84  391.39 + 205.09  1.028
allocated (bytes)  32,025,718,116   31,150,575,268  1.028

So, we get about a 3% decrease in both allocation and run time by
packing.
 
> Again, I would have thought the bit fiddling of packed representations
> would have increased code size, rather than decrease.

The bit fiddling isn't that messy, and each bit fiddle replaces a
memory fiddle.  To take an extreme example, consider allocating a
tuple of type "bool * bool * bool * bool".  In the packed world, this
type is represented as a byte (with the high four bits zero).  So,
in Rssa, building the tuple looks like:

    x_864  = Word32_toWord8 (b3_1)
    x_862  = Word32_toWord8 (b2_1)
    x_863  = Word8_lshift (x_864, 0wx1)
    x_861  = Word8_orb (x_863, x_862)
    x_859  = Word32_toWord8 (b1_1)
    x_860  = Word8_lshift (x_861, 0wx1)
    x_858  = Word8_orb (x_860, x_859)
    x_856  = Word32_toWord8 (x_416)
    x_857  = Word8_lshift (x_858, 0wx1)
    x_855  = Word8_orb (x_857, x_856)
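
In C terms, that sequence is just the following (a sketch; the
function name is made up):

    #include <stdint.h>

    /* Pack four bools into one byte, mirroring the shifts and ors
       above: ((b3 << 1 | b2) << 1 | b1) << 1 | b0. */
    static uint8_t pack4 (uint32_t b3, uint32_t b2,
                          uint32_t b1, uint32_t b0) {
      uint8_t x = (uint8_t) b3;
      x = (uint8_t) ((x << 1) | (uint8_t) b2);
      x = (uint8_t) ((x << 1) | (uint8_t) b1);
      x = (uint8_t) ((x << 1) | (uint8_t) b0);
      return x;                /* high four bits are zero */
    }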

On the other hand, in the unpacked world, this type is represented as
a five-word object, with one word for the header and one for each
bool.  So, in Rssa, building the tuple looks like:

    x_534
    = Object {header = 0x1D,
	      size = 20,
	      stores = ({offset = 0, value = b3_1},
			{offset = 4, value = b2_1},
			{offset = 8, value = b1_1},
			{offset = 12, value = x_416})}
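
In C terms, the layout is roughly this (field names are made up; as
the assembly below shows, the object pointer points just past the
header):

    #include <stdint.h>

    /* Five words: one header plus a full word per bool (20 bytes,
       matching header = 0x1D, size = 20 above). */
    struct tuple4_unpacked {
      uint32_t header;            /* 0x1D */
      uint32_t b3, b2, b1, b0;    /* at offsets 0, 4, 8, 12 from the
                                     object pointer, which points
                                     just past the header */
    };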

If we get down to the assembly, the code size is pretty similar.
Here's the packed tuple:

	movb %cl,%ah
	movb %dl,%al
	shlb $0x1,%ah
	orb %ah,%al
	movb %bl,%dh
	shlb $0x1,%al
	orb %al,%dh
	movl %esi,%eax
	movb %al,%dl
	shlb $0x1,%dh
	orb %dh,%dl

and the unpacked one:

	movl $0x1D,(0+(0*4))(%esp)
	leal (0+(4*1))(%esp),%ecx
	movl %esi,(0+((4*1)+(0*1)))(%ecx)
	movl %edi,(0+((8*1)+(0*1)))(%ecx)
	movl %edx,(0+((0*1)+(0*1)))(%ecx)
	movl (0+((28*1)+(0*1)))(%ebp),%esi
	movl %esi,(0+((12*1)+(0*1)))(%ecx)
	addl $20,%esp