profiling go

Stephen Weeks
Sat, 9 Jun 2001 12:35:28 -0700

> Here is the loop that computes the checksum of a vector of bytes:
> fun loop_55 (x_495, x_494) =
>        if x_494 = x_493
> 	  then if x_489 = x_495
> 		  then SOME_1 x_491
> 		  else raise BadChecksum
> 	  else loop_55 (Word32.+ (Word32.tolargeWord (Vector.sub (x_491,
> 								  x_494)),
> 				  Word32.+ (0w63,
> 					    Word32.* (0wx1234567, x_495))),
> 			x_494 + 1 (overflow => raise Overflow))

I'm a bit confused.  Should the Word32.tolargeWord be Word8.toLargeWord?  Also,
it looks like you have rearranged the order of things since the x_494 + 1 is
computed first by the assembly code (which would imply right-to-left
evaluation).  Could you please run MLton with "-show-types true" and send the
unedited CPS code?

Assuming x_491 is a Word8.word vector, you might be able to speed stuff up by
using Pack32Little.subVec to read an entire word at a time instead of a byte.
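
To make the word-at-a-time idea concrete, here is a minimal sketch written in C
rather than SML. The names step, load_le32 and checksum_by_words are
hypothetical, and the initial accumulator is passed in because the CPS code does
not show its starting value. load_le32 stands in for what a little-endian
32-bit read such as Pack32Little.subVec would return; the four bytes are then
folded in the same order as the byte loop, so the result should be unchanged.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of the recurrence in the quoted loop:
   acc' = byte + (63 + 0x1234567 * acc), modulo 2^32. */
static uint32_t step(uint32_t acc, uint8_t b)
{
    return (uint32_t)(b + (UINT32_C(63) + UINT32_C(0x1234567) * acc));
}

/* Assemble a little-endian 32-bit word from four consecutive bytes.
   Illustrative stand-in for a single word-sized read. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical word-at-a-time variant: one read per four bytes,
   then the bytes are peeled off with shifts, lowest byte first. */
uint32_t checksum_by_words(const uint8_t *buf, size_t len, uint32_t acc)
{
    size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        uint32_t w = load_le32(buf + i);
        acc = step(acc, (uint8_t)(w & 0xff));
        acc = step(acc, (uint8_t)((w >> 8) & 0xff));
        acc = step(acc, (uint8_t)((w >> 16) & 0xff));
        acc = step(acc, (uint8_t)((w >> 24) & 0xff));
    }

    /* Leftover bytes when len is not a multiple of 4. */
    for (; i < len; i++)
        acc = step(acc, buf[i]);

    return acc;
}
```

Whether this is actually faster depends on how cheap the word read is compared
with four byte reads, so it would need to be measured.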

Here are my comments on the assembly.

					(36)(%edi) == x_491
					(40)(%edi) == x_493
					(44)(%edi) == x_495
					(48)(%edi) == x_494
	movl (48*1)(%edi),%eax		%eax = x_494
	cmpl (40*1)(%edi),%eax		if x_494 = x_493
	je L_423
	movl %eax,%ebx			%ebx = x_494
	incl %ebx			%ebx = x_494 + 1
	jo L_427
	movl (36*1)(%edi),%ecx		%ecx = x_491
	movb (%ecx,%eax,1),%dl		%dl = Vector.sub (x_491, x_494)
	movl %ebx,(48*1)(%edi)		x_494 = %ebx
	movzbl %dl,%eax			%eax = Word8.toLargeWord (%dl)
	movl %eax,localuint
	movl (44*1)(%edi),%eax		%eax = x_495
	movl $0x1234567,%ebx		%ebx = 0wx1234567
	xorl %edx,%edx
	mull %ebx			%eax = x_495 * 0wx1234567
	addl $0x63,%eax			%eax = Word32.+ (0w63, ...)
	addl localuint,%eax		%eax = Word32.+ (Word8.toLargeWord ...)
	movl %eax,(44*1)(%edi)		x_495 = %eax
	jmp loop_55

Shouldn't the $0x63 be $0x3F?

> I'm confused by the constant re-loading of %ecx (x_491 in the CPS code).

I'm betting it's because x_491 is a pointer and is live across a limit check,
and hence we won't let it live in a register.

> Also the storing of %eax in localuint.

I agree.  I don't know why we didn't use another register.

I don't understand the "xorl %edx, %edx".

I would think we could at least keep x_494 and x_495 in registers around the
loop.

All in all, pretty bad code, mostly due to the register allocator (both in the
backend and the codegen).

I guess if this is your only hot loop, you can FFI it to C for the time being?
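
As a rough sketch of what the C side of that could look like: the function
name and signature below are hypothetical, and the initial accumulator is a
parameter because the CPS code does not show where x_495 starts. The function
only computes the sum; the comparison against the expected value (x_489) and
the raise of BadChecksum would stay on the SML side.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C version of loop_55.
   buf/len correspond to the byte vector x_491 and its bound x_493,
   acc to the running checksum x_495.
   Unsigned 32-bit arithmetic wraps modulo 2^32, like Word32. */
uint32_t checksum_bytes(const uint8_t *buf, size_t len, uint32_t acc)
{
    for (size_t i = 0; i < len; i++) {
        /* 0w63 is decimal 63, i.e. 0x3F. */
        acc = (uint32_t)(buf[i] + (UINT32_C(63) + UINT32_C(0x1234567) * acc));
    }
    return acc;
}
```

For example, with a starting accumulator of 0 the single byte 1 gives
1 + 63 = 64.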