[MLton] The PackWord performance regression on the 64-bit branch

Vesa Karvonen vesa.karvonen at cs.helsinki.fi
Thu Nov 23 16:42:46 PST 2006


I spent some time looking at / thinking about the PackWord performance
regression on the 64-bit branch.  To be honest, I doubt that it is really
important (the PackWord operations probably aren't very critical).
However, as I took the time to look at the code...

Are the PackWord operations supposed to work on arbitrarily (mis)aligned
accesses?

AFAICT, they are always being passed an SML array/vector, so the SML side
should be able to guarantee proper alignment.  This means that the
operations could be much much faster.  Basically, if the fetch/store
endianness is the same as the machine endianness, then just fetch/store
the data as a single operation of the correct size.  Otherwise perform
little<->big endianness conversion before or after the operation (using
bitwise ops {<<, >>, andb, orb}).  It is much faster to perform the
endianness conversion in registers than to go through memory, which can
cause pipeline stalls such as partial memory stalls
(http://fatphil.org/x86/pentopt/19.html).

OTOH, is there some reason why the operations need to be implemented in C?

Isn't it possible to implement them on the SML side using MLton.Pointer?

AFAICT, implementing the operations on the SML side should, at most,
require some simple magic to convert an array/vector to an
MLton.Pointer.t.

Anyway, here is a bit of test code that I wrote:


structure P = MLton.Pointer

val aPtr = _address "dummy" : P.t ;

local
   fun mk (wordSize, fromInt, <<, orb) (p, offs) = let
      fun get p = fromInt (Word8.toInt (P.getWord8 (p, 0)))
      (*                   ^^^^^^^^^^^ discussed below   *)
      fun fst p = (p, get p)
      fun nxt (p, w) = let
         val p = P.add (p, 0w1)
      in
         (p, orb (<< (w, 0w8), get p))
      end
   in
      #2 ((case wordSize of
              8 => fst
            | 16 => nxt o fst
            | 32 => nxt o nxt o nxt o fst
            | 64 => nxt o nxt o nxt o nxt o nxt o nxt o nxt o fst
            | _  => raise Fail "impossible")
             (P.add (p, Word.fromInt offs * Word.fromInt (wordSize div 8))))
   end
in
val getWord8Little  = mk let open Word8  in (wordSize, fromInt, <<, orb) end
val getWord16Little = mk let open Word16 in (wordSize, fromInt, <<, orb) end
val getWord32Little = mk let open Word32 in (wordSize, fromInt, <<, orb) end
val getWord64Little = mk let open Word64 in (wordSize, fromInt, <<, orb) end
end

val offset = valOf (Int.fromString (hd (CommandLine.arguments ())))
val () = print (Word32.toString (getWord32Little (aPtr, offset))^"\n")


On the trunk MLton, the getWord32Little part of above is compiled to

	sall $2,%esi
	addl (globalWord32+((3*4)+(0*1))),%esi
	movl %esi,%edi
	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
	incl %edi
	shll $0x8,%edx
	movl %edi,%esi
	movzbl (0+((0x0*1)+(0*1)))(%edi),%ecx
	orl %edx,%ecx
	incl %esi
	shll $0x8,%ecx
	movl %esi,%edi
	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
	orl %edx,%ecx
	incl %edi
	shll $0x8,%ecx
	movzbl (0+((0x0*1)+(0*1)))(%edi),%esi
	orl %ecx,%esi
	movl %esi,(0+((60*1)+(0*1)))(%ebp)
	movl $0x1,(0+((56*1)+(0*1)))(%ebp)
	addl $56,%ebp
	movl $L_880,(0+(-1*4))(%ebp)

which is not too bad.  The above is much faster than what gcc generates
from the current C code on the 64-bit branch:

PackWord32_subVec:
	pushl	%ebx
	movl	$1, %edx
	subl	$16, %esp
	movl	28(%esp), %ecx
	leal	12(%esp), %ebx
	movl	24(%esp), %eax
	addl	%eax, %ecx
	.p2align 1
.L33:
	movzbl	-1(%edx,%ecx), %eax
	movb	%al, -1(%edx,%ebx)
	incl	%edx
	cmpl	$5, %edx
	jne	.L33
	movl	12(%esp), %eax
	addl	$16, %esp
	popl	%ebx
	ret

On the 64-bit branch MLton, the generated machine code (from the previous
SML code) is

	sall $2,%esi
	addl (globalWord32+((3*4)+(0*1))),%esi
	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
	cmpl $0x0,%edi
	jl L_1255
L_841:
	incl %esi
	shll $0x8,%edi
	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
	cmpl $0x0,%edx
	jl L_1254
L_840:
	orl %edi,%edx
	incl %esi
	shll $0x8,%edx
	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
	cmpl $0x0,%edi
	jl L_1253
L_839:
	orl %edi,%edx
	incl %esi
	shll $0x8,%edx
	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
	cmpl $0x0,%edi
	jl L_1252
L_838:
	orl %edx,%edi
	movl %edi,(0+((64*1)+(0*1)))(%ebp)
	movl $0x1,(0+((60*1)+(0*1)))(%ebp)
	addl $60,%ebp
	movl $L_837,(0+(-1*4))(%ebp)

and it looks like a bug (or at least a missed optimization) in the
optimizer: the cmpl/jl sign tests after each zero-extending movzbl can
never be taken.  Interestingly, if I use Word8.toIntX (which is incorrect,
of course, since it sign-extends), the 64-bit branch MLton does the right
thing:

	sall $2,%esi
	addl (globalWord32+((3*4)+(0*1))),%esi
	movl %esi,%edi
	movsbl (0+((0x0*1)+(0*1)))(%esi),%edx
	incl %edi
	shll $0x8,%edx
	movl %edi,%esi
	movsbl (0+((0x0*1)+(0*1)))(%edi),%ecx
	orl %ecx,%edx
	incl %esi
	shll $0x8,%edx
	movl %esi,%edi
	movsbl (0+((0x0*1)+(0*1)))(%esi),%ecx
	orl %edx,%ecx
	incl %edi
	shll $0x8,%ecx
	movsbl (0+((0x0*1)+(0*1)))(%edi),%esi
	orl %esi,%ecx
	movl %ecx,(0+((64*1)+(0*1)))(%ebp)
	movl $0x1,(0+((60*1)+(0*1)))(%ebp)
	addl $60,%ebp
	movl $L_833,(0+(-1*4))(%ebp)

...

However, the above kind of byte-by-byte fetching is only necessary if the
alignment of the pointer can be arbitrary.  If proper alignment is
guaranteed then it is much faster to just perform endianness swapping (if
machine endianness <> operation endianness) after a single fetch of the
result size.  Here is a draft of how to implement endianness swapping:

local
   fun mk (wordSize, `, <<, >>, andb, xorb, op -) w = let
      fun st (w, msk, sft) = let
         val odd = andb (w, msk)
         val evn = xorb (w, odd)
      in
         (xorb (<< (odd, sft), >> (evn, sft)),
          xorb (msk, << (msk, Word.>> (sft, 0w1))),
          Word.>> (sft, 0w1))
      end
   in
      #1 ((case wordSize of
              8 => (fn x => x)
            | 16 => st
            | 32 => st o st
            | 64 => st o st o st
            | _ => raise Fail "impossible")
             (w,
              << (`1, Word.fromInt (wordSize div 2)) - `1,
              Word.fromInt (wordSize div 2)))
   end
in
val swapWord8  = mk let open Word8  in (wordSize, fromInt, <<, >>, andb, xorb, op -) end
val swapWord16 = mk let open Word16 in (wordSize, fromInt, <<, >>, andb, xorb, op -) end
val swapWord32 = mk let open Word32 in (wordSize, fromInt, <<, >>, andb, xorb, op -) end
val swapWord64 = mk let open Word64 in (wordSize, fromInt, <<, >>, andb, xorb, op -) end
end

Here is what is generated with the trunk MLton for swapWord32:

	movl %edx,%esi
	andl $0xFFFF,%esi
	movl %esi,%edi
	xorl %edx,%edi
	shll $0x10,%esi
	shrl $0x10,%edi
	xorl %edi,%esi
	movl %esi,%edi
	andl $0xFF00FF,%edi
	xorl %edi,%esi
	shll $0x8,%edi
	shrl $0x8,%esi
	xorl %edi,%esi

-Vesa Karvonen
