[MLton] The PackWord performance regression on the 64-bit branch

Matthew Fluet fluet at cs.cornell.edu
Mon Nov 27 16:00:10 PST 2006


> I spent some time looking at / thinking about the PackWord performance
> regression on the 64-bit branch.  To be honest, I doubt that it is really
> important (the PackWord operations probably aren't very critical).

They are useful, and the regression does show up in the checksum and 
md5 benchmarks.

> Are the PackWord operations supposed to work on arbitrarily (mis)aligned
> accesses?

In order to support the PackWord<N>{Big,Little} structures, we do not need 
arbitrarily aligned accesses.  On the other hand, what constitutes proper 
alignment is somewhat subtle.  For any given PackWord<N>{Big,Little} 
structure, we will only access N-bit elements at offsets that are 0 
mod (N/8) relative to the origin of the Word8.word sequence.  We only 
guarantee that the origin of the Word8.word sequence is 0 mod 4 (with 
-align 4) or 0 mod 8 (with -align 8).  So, with -align 4, a 
PackWord64{Big,Little} access is not necessarily 0 mod 8.  For smaller 
values of N, the accesses are always properly aligned.

> AFAICT, they are always being passed an SML array/vector, so the SML side
> should be able to guarantee proper alignment.

Right, but the SML array/vector is only guaranteed to be 0 mod 4 
aligned, which doesn't necessarily suffice for 64-bit operations.

> Basically, if the fetch/store endianness is the same as the machine 
> endianness, then just fetch/store the data as a single operation of the 
> correct size.

Right, this is what the old primitives (that are on trunk) do.

> Otherwise perform little<->big endianness conversion before or after the 
> operation (using bitwise ops {<<, >>, andb, orb}).  It is much faster to 
> perform the endianness conversion in registers than go through memory 
> (which causes stalls such as partial memory stalls).

I'm not surprised, and I'd be happy to move endian conversion into SML.
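
For example, a 32-bit byte swap needs only those ops.  A minimal sketch 
(the function name swap32 is mine, not taken from the draft quoted 
below):

   fun swap32 (w: Word32.word): Word32.word =
      let
         open Word32
      in
         orb (orb (>> (w, 0w24),
                   andb (>> (w, 0w8), 0wxFF00)),
              orb (andb (<< (w, 0w8), 0wxFF0000),
                   << (w, 0w24)))
      end

This stays entirely in registers, with no round trips through memory.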

> OTOH, is there some reason why the operations need to be implemented in C?

No.  They were only moved to C because I found it easier to implement 
the C versions (copying the PackReal functions) than to generalize the 
primitives.  It's probably only a day or two of work to restore the 
primitives and push the changes through the compiler.

> Isn't it possible to implement them on the SML side using MLton.Pointer?
>
> AFAICT, implementing the operations on the SML side should, at most,
> require some simple magic to convert an array/vector to an
> MLton.Pointer.t.

That's dangerous magic to invoke.  When you (unsafely) cast an 
array/vector to an MLton.Pointer.t, the resulting value is just a 
Word32.word -- meaning that the garbage collector neither traces the 
pointer nor updates it if the corresponding array/vector is moved.
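
To make the hazard concrete, here is a sketch with a hypothetical 
unsafeCast (no such sanctioned operation exists):

   val v: Word8Vector.vector = Word8Vector.tabulate (16, Word8.fromInt)
   val p: MLton.Pointer.t = unsafeCast v  (* hypothetical *)
   (* any allocation here can trigger a GC that moves v, *)
   (* but p is not updated to follow it ...              *)
   val b = MLton.Pointer.getWord8 (p, 0)  (* ... so p may dangle *)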

The primitives essentially served this purpose, because they could operate 
directly on the SML array/vector.

> structure P = MLton.Pointer
>
> val aPtr = _address "dummy" : P.t ;
>
> local
>   fun mk (wordSize, fromInt, <<, orb) (p, offs) = let
>      fun get p = fromInt (Word8.toInt (P.getWord8 (p, 0)))
>      (*                   ^^^^^^^^^^^ discussed below   *)
>      fun fst p = (p, get p)
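>      (* nxt: advance one byte and or it into the low end of the
>         accumulated word, shifting the earlier bytes up *)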
>      fun nxt (p, w) = let
>         val p = P.add (p, 0w1)
>      in
>         (p, orb (<< (w, 0w8), get p))
>      end
>   in
>      #2 ((case wordSize of
>              8 => fst
>            | 16 => nxt o fst
>            | 32 => nxt o nxt o nxt o fst
>            | 64 => nxt o nxt o nxt o nxt o nxt o nxt o nxt o fst
>            | _  => raise Fail "impossible")
>             (P.add (p, Word.fromInt offs * Word.fromInt (wordSize div 8))))
>   end
> in
> val getWord8Little  = mk let open Word8  in (wordSize, fromInt, <<, orb) end
> val getWord16Little = mk let open Word16 in (wordSize, fromInt, <<, orb) end
> val getWord32Little = mk let open Word32 in (wordSize, fromInt, <<, orb) end
> val getWord64Little = mk let open Word64 in (wordSize, fromInt, <<, orb) end
> end
>
> val offset = valOf (Int.fromString (hd (CommandLine.arguments ())))
> val () = print (Word32.toString (getWord32Little (aPtr, offset))^"\n")
>
>
> On the trunk MLton, the getWord32Little part of above is compiled to
>
> 	sall $2,%esi
> 	addl (globalWord32+((3*4)+(0*1))),%esi
> 	movl %esi,%edi
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
> 	incl %edi
> 	shll $0x8,%edx
> 	movl %edi,%esi
> 	movzbl (0+((0x0*1)+(0*1)))(%edi),%ecx
> 	orl %edx,%ecx
> 	incl %esi
> 	shll $0x8,%ecx
> 	movl %esi,%edi
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
> 	orl %edx,%ecx
> 	incl %edi
> 	shll $0x8,%ecx
> 	movzbl (0+((0x0*1)+(0*1)))(%edi),%esi
> 	orl %ecx,%esi
> 	movl %esi,(0+((60*1)+(0*1)))(%ebp)
> 	movl $0x1,(0+((56*1)+(0*1)))(%ebp)
> 	addl $56,%ebp
> 	movl $L_880,(0+(-1*4))(%ebp)
>
> which is not too bad.  The above is much faster than what gcc generates
> from the current C code on the 64-bit branch:
>
> PackWord32_subVec:
> 	pushl	%ebx
> 	movl	$1, %edx
> 	subl	$16, %esp
> 	movl	28(%esp), %ecx
> 	leal	12(%esp), %ebx
> 	movl	24(%esp), %eax
> 	addl	%eax, %ecx
> 	.p2align 1
> .L33:
> 	movzbl	-1(%edx,%ecx), %eax
> 	movb	%al, -1(%edx,%ebx)
> 	incl	%edx
> 	cmpl	$5, %edx
> 	jne	.L33
> 	movl	12(%esp), %eax
> 	addl	$16, %esp
> 	popl	%ebx
> 	ret

Right; the ML code above essentially unrolls the loops.

> On the 64-bit branch MLton, the generated machine code (from the previous
> SML code) is
>
> 	sall $2,%esi
> 	addl (globalWord32+((3*4)+(0*1))),%esi
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
> 	cmpl $0x0,%edi
> 	jl L_1255
> L_841:
> 	incl %esi
> 	shll $0x8,%edi
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
> 	cmpl $0x0,%edx
> 	jl L_1254
> L_840:
> 	orl %edi,%edx
> 	incl %esi
> 	shll $0x8,%edx
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
> 	cmpl $0x0,%edi
> 	jl L_1253
> L_839:
> 	orl %edi,%edx
> 	incl %esi
> 	shll $0x8,%edx
> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
> 	cmpl $0x0,%edi
> 	jl L_1252
> L_838:
> 	orl %edx,%edi
> 	movl %edi,(0+((64*1)+(0*1)))(%ebp)
> 	movl $0x1,(0+((60*1)+(0*1)))(%ebp)
> 	addl $60,%ebp
> 	movl $L_837,(0+(-1*4))(%ebp)
>
> and it looks like a bug in the optimizer.

Indeed; I would expect you to get code just like that on trunk.

> However, the above kind of byte-by-byte fetching is only necessary if the
> alignment of the pointer can be arbitrary.  If proper alignment is
> guaranteed then it is much faster to just perform endianness swapping (if
> machine endianness <> operation endianness) after a single fetch of the
> result size.

Right, and this is what the primitives would give you.
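
Concretely, assuming a primitive

   subWord32: Word8Vector.vector * int -> Word32.word

that performs a single aligned 32-bit fetch (the name is mine), the 
whole operation reduces to something like:

   fun subVecLittle (v, i) =
      let
         val w = subWord32 (v, i)
      in
         (* swap32 is the byte swap sketched earlier; isBigEndian *)
         (* describes the host machine                            *)
         if isBigEndian then swap32 w else w
      end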

> Here is a draft of how to implement endianness swapping:

Nice.

Overall, I think the right thing to do is to extend the primitives as 
described in:
   http://mlton.org/pipermail/mlton-user/2004-November/000556.html
   http://mlton.org/pipermail/mlton/2004-November/026246.html
and to use the endian-swapping SML code to byte-swap words when 
necessary.

It would also be nice to jettison the C code for the PackReal 
operations in favor of primitives that coerce between words and reals, 
thereby building the PackReal structures on top of the PackWord 
structures.
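
For instance, assuming hypothetical coercion primitives

   castToWord: Real64.real -> Word64.word
   castFromWord: Word64.word -> Real64.real

that reinterpret the bits, the interesting parts of PackReal64Little 
reduce to (a partial sketch; toBytes, fromBytes, and subArr are 
omitted):

   structure PackReal64Little =
      struct
         type real = Real64.real
         val bytesPerElem = 8
         val isBigEndian = false
         fun subVec (v, i) =
            castFromWord (Word64.fromLarge (PackWord64Little.subVec (v, i)))
         fun update (a, i, r) =
            PackWord64Little.update (a, i, Word64.toLarge (castToWord r))
      end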


