[MLton] The PackWord performance regression on the 64-bit branch

Matthew Fluet fluet at cs.cornell.edu
Mon Nov 27 16:00:10 PST 2006


> I spent some time looking at / thinking about the PackWord performance
> regression on the 64-bit branch.  To be honest, I doubt that it is really
> important (the PackWord operations probably aren't very critical).

They are useful, and the regression does show up in the checksum and 
md5 benchmarks.

> Are the PackWord operations supposed to work on arbitrarily (mis)aligned
> accesses?

In order to support the PackWord<N>{Big,Little} structures, we do not need 
arbitrarily aligned accesses.  On the other hand, what constitutes proper 
alignment is somewhat subtle.  For any given PackWord<N>{Big,Little} 
structure, we will only access N-bit elements at offsets that are 0 
mod (N/8) relative to the origin of the Word8.word sequence.  We only 
guarantee that the origin of the Word8.word sequence is 0 mod 4 (with 
-align 4) or 0 mod 8 (with -align 8).  So, with -align 4, a 
PackWord64{Big,Little} access is not necessarily 0 mod 8.  For smaller 
values of N, the accesses are always properly aligned.

> AFAICT, they are always being passed an SML array/vector, so the SML side
> should be able to guarantee proper alignment.

Right, but the SML array/vector is only guaranteed to be 0 mod 4 
aligned, which doesn't necessarily suffice for 64-bit operations.

> Basically, if the fetch/store endianness is the same as the machine 
> endianness, then just fetch/store the data as a single operation of the 
> correct size.

Right, this is what the old primitives (that are on trunk) do.

> Otherwise perform little<->big endianness conversion before or after the 
> operation (using bitwise ops {<<, >>, andb, orb}).  It is much faster to 
> perform the endianness conversion in registers than go through memory 
> (which causes stalls such as partial memory stalls).

I'm not surprised, and I'd be happy to move endian conversion into SML.
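
For example, a 32-bit byte swap needs only those ops.  A minimal sketch 
(the function name swap32 is mine, not taken from the draft quoted 
below):

   fun swap32 (w: Word32.word): Word32.word =
      let
         open Word32
      in
         orb (orb (>> (w, 0w24),
                   andb (>> (w, 0w8), 0wxFF00)),
              orb (andb (<< (w, 0w8), 0wxFF0000),
                   << (w, 0w24)))
      end

This stays entirely in registers, with no round trips through memory.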

> OTOH, is there some reason why the operations need to be implemented in C?

No.  They were only moved to C because I found it easier to implement 
the C versions (copying the PackReal functions) than to generalize the 
primitives.  It's probably only a day or two of work to restore the 
primitives and push the changes through the compiler.

> Isn't it possible to implement them on the SML side using MLton.Pointer?
>
> AFAICT, implementing the operations on the SML side should, at most,
> require some simple magic to convert an array/vector to an
> MLton.Pointer.t.

That's dangerous magic to invoke.  When you (unsafely) cast an 
array/vector to an MLton.Pointer.t, the resulting value is just a 
Word32.word -- meaning that the garbage collector neither traces the 
pointer nor updates it if the corresponding array/vector is moved.
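
To make the hazard concrete, here is a sketch with a hypothetical 
unsafeCast (no such sanctioned operation exists):

   val v: Word8Vector.vector = Word8Vector.tabulate (16, Word8.fromInt)
   val p: MLton.Pointer.t = unsafeCast v  (* hypothetical *)
   (* any allocation here can trigger a GC that moves v, *)
   (* but p is not updated to follow it ...              *)
   val b = MLton.Pointer.getWord8 (p, 0)  (* ... so p may dangle *)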

The primitives essentially served this purpose, because they could operate 
directly on the SML array/vector.

> structure P = MLton.Pointer
>
> val aPtr = _address "dummy" : P.t ;
>
> local
>   fun mk (wordSize, fromInt, <<, orb) (p, offs) = let
>      fun get p = fromInt (Word8.toInt (P.getWord8 (p, 0)))
>      (*                   ^^^^^^^^^^^ discussed below   *)
>      fun fst p = (p, get p)
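>      (* nxt: advance one byte and or it into the low end of the
>         accumulated word, shifting the earlier bytes up *)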
>      fun nxt (p, w) = let
>         val p = P.add (p, 0w1)
>      in
>         (p, orb (<< (w, 0w8), get p))
>      end
>   in
>      #2 ((case wordSize of
>              8 => fst
>            | 16 => nxt o fst
>            | 32 => nxt o nxt o nxt o fst
>            | 64 => nxt o nxt o nxt o nxt o nxt o nxt o nxt o fst
>            | _  => raise Fail "impossible")
>             (P.add (p, Word.fromInt offs * Word.fromInt (wordSize div 8))))
>   end
> in
> val getWord8Little  = mk let open Word8  in (wordSize, fromInt, <<, orb) end
> val getWord16Little = mk let open Word16 in (wordSize, fromInt, <<, orb) end
> val getWord32Little = mk let open Word32 in (wordSize, fromInt, <<, orb) end
> val getWord64Little = mk let open Word64 in (wordSize, fromInt, <<, orb) end
> end
>
> val offset = valOf (Int.fromString (hd (CommandLine.arguments ())))
> val () = print (Word32.toString (getWord32Little (aPtr, offset))^"\n")
>
>
> On the trunk MLton, the getWord32Little part of above is compiled to
>
> 	sall $2,%esi
> 	addl (globalWord32+((3*4)+(0*1))),%esi
> 	movl %esi,%edi
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
> 	incl %edi
> 	shll $0x8,%edx
> 	movl %edi,%esi
> 	movzbl (0+((0x0*1)+(0*1)))(%edi),%ecx
> 	orl %edx,%ecx
> 	incl %esi
> 	shll $0x8,%ecx
> 	movl %esi,%edi
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
> 	orl %edx,%ecx
> 	incl %edi
> 	shll $0x8,%ecx
> 	movzbl (0+((0x0*1)+(0*1)))(%edi),%esi
> 	orl %ecx,%esi
> 	movl %esi,(0+((60*1)+(0*1)))(%ebp)
> 	movl $0x1,(0+((56*1)+(0*1)))(%ebp)
> 	addl $56,%ebp
> 	movl $L_880,(0+(-1*4))(%ebp)
>
> which is not too bad.  The above is much faster than what gcc generates
> from the current C code on the 64-bit branch:
>
> PackWord32_subVec:
> 	pushl	%ebx
> 	movl	$1, %edx
> 	subl	$16, %esp
> 	movl	28(%esp), %ecx
> 	leal	12(%esp), %ebx
> 	movl	24(%esp), %eax
> 	addl	%eax, %ecx
> 	.p2align 1
> .L33:
> 	movzbl	-1(%edx,%ecx), %eax
> 	movb	%al, -1(%edx,%ebx)
> 	incl	%edx
> 	cmpl	$5, %edx
> 	jne	.L33
> 	movl	12(%esp), %eax
> 	addl	$16, %esp
> 	popl	%ebx
> 	ret

Right; the ML code above essentially unrolls the loops.

> On the 64-bit branch MLton, the generated machine code (from the previous
> SML code) is
>
> 	sall $2,%esi
> 	addl (globalWord32+((3*4)+(0*1))),%esi
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
> 	cmpl $0x0,%edi
> 	jl L_1255
> L_841:
> 	incl %esi
> 	shll $0x8,%edi
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
> 	cmpl $0x0,%edx
> 	jl L_1254
> L_840:
> 	orl %edi,%edx
> 	incl %esi
> 	shll $0x8,%edx
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
> 	cmpl $0x0,%edi
> 	jl L_1253
> L_839:
> 	orl %edi,%edx
> 	incl %esi
> 	shll $0x8,%edx
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
> 	cmpl $0x0,%edi
> 	jl L_1252
> L_838:
> 	orl %edx,%edi
> 	movl %edi,(0+((64*1)+(0*1)))(%ebp)
> 	movl $0x1,(0+((60*1)+(0*1)))(%ebp)
> 	addl $60,%ebp
> 	movl $L_837,(0+(-1*4))(%ebp)
>
> and it looks like a bug in the optimizer.

Indeed; I would expect you to get code just like that on trunk.

> However, the above kind of byte-by-byte fetching is only necessary if the
> alignment of the pointer can be arbitrary.  If proper alignment is
> guaranteed then it is much faster to just perform endianness swapping (if
> machine endianness <> operation endianness) after a single fetch of the
> result size.

Right, and this is what the primitives would give you.
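
Concretely, assuming a primitive

   subWord32: Word8Vector.vector * int -> Word32.word

that performs a single aligned 32-bit fetch (the name is mine), the 
whole operation reduces to something like:

   fun subVecLittle (v, i) =
      let
         val w = subWord32 (v, i)
      in
         (* swap32 is the byte swap sketched earlier; isBigEndian *)
         (* describes the host machine                            *)
         if isBigEndian then swap32 w else w
      end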

> Here is a draft of how to implement endianness swapping:

Nice.

Overall, I think the right thing to do is to extend the primitives as 
described in:
   http://mlton.org/pipermail/mlton-user/2004-November/000556.html
   http://mlton.org/pipermail/mlton/2004-November/026246.html
and to use the endian-swapping SML code to byte-swap words when 
necessary.

It would also be nice to jettison the C code for the PackReal 
operations in favor of primitives that coerce between words and reals, 
thereby building the PackReal structures on top of the PackWord 
structures.
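
For instance, assuming hypothetical coercion primitives

   castToWord: Real64.real -> Word64.word
   castFromWord: Word64.word -> Real64.real

that reinterpret the bits, the interesting parts of PackReal64Little 
reduce to (a partial sketch; toBytes, fromBytes, and subArr are 
omitted):

   structure PackReal64Little =
      struct
         type real = Real64.real
         val bytesPerElem = 8
         val isBigEndian = false
         fun subVec (v, i) =
            castFromWord (Word64.fromLarge (PackWord64Little.subVec (v, i)))
         fun update (a, i, r) =
            PackWord64Little.update (a, i, Word64.toLarge (castToWord r))
      end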


