profiling go

Henry Cejtin henry@sourcelight.com
Sat, 9 Jun 2001 23:36:39 -0500


Sorry, yes, I had to convert the function from A-normal form so that I could
actually read it, and I clearly introduced a few typos along the way.  Here
is the actual CPS, printed with the `-show-types true' flag.

	fun loop_51 (x_449: word,
	             x_448: int) = 
	   let
	      val x_450: bool = MLton_eq(int) (x_448,
	                                       x_447)
	      fun L_377 () = 
	         let
	            val x_451: bool = MLton_eq(word) (x_443,
	                                              x_449)
	            fun L_379 () = 
	               let
	                  val x_452: option_1 = SOME_1 (x_445)
	               in
	                  x_452
	               end
	            fun L_380 () = 
	               raise (global_39)
	         in
	            case x_451 of
	              false => L_380
	            | true => L_379
	         end
	      fun L_378 () = 
	         let
	            fun L_381 () = 
	               raise (global_15)
	            val x_453: int = Int_addCheck (x_448,
	                                           global_0) Overflow L_381
	            val x_455: word8 = Vector_sub(word8) (x_445,
	                                                  x_448)
	            val x_456: word = Word8_toLargeWord (x_455)
	            val x_457: word = Word32_mul (global_70,
	                                          x_449)
	            val x_458: word = Word32_add (global_69,
	                                          x_457)
	            val x_454: word = Word32_add (x_456,
	                                          x_458)
	         in
	            loop_51 (x_454,
	                     x_453)
	         end
	   in
	      case x_450 of
	        false => L_378
	      | true => L_377
	   end
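
For anyone who would rather not read the CPS, here is a rough C rendering of
what loop_51 computes.  It is only a sketch and not the actual C version of
the program: the names hash_bytes, hash_matches, HASH_MUL and HASH_ADD are
made up, and the two constants are placeholders for global_70 and global_69,
whose real values do not appear in the dump.  The overflow check on the index
is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for global_70 and global_69; the real values
   are not shown in the dump. */
#define HASH_MUL 31u
#define HASH_ADD 7u

/* One pass over the bytes, mirroring L_378:
   h' = byte + (HASH_ADD + HASH_MUL * h), all modulo 2^32. */
uint32_t hash_bytes(const uint8_t *bytes, size_t len, uint32_t seed)
{
    uint32_t h = seed;
    for (size_t i = 0; i < len; i++)
        h = (uint32_t)bytes[i] + (HASH_ADD + HASH_MUL * h);
    return h;
}

/* Mirrors L_377: succeed only if the computed hash equals the expected
   one.  Returns 1 on a match and 0 otherwise, where the CPS returns
   SOME or raises an exception. */
int hash_matches(const uint8_t *bytes, size_t len,
                 uint32_t seed, uint32_t expected)
{
    return hash_bytes(bytes, len, seed) == expected;
}
```

The only multiply in the loop body is the one by HASH_MUL, which corresponds
to the Word32_mul in the CPS.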

No, the 0x63 really is 0x63, not 0x3F.  I don't think the reloading was
caused by a limit check.  As far as I can tell there are no allocations,
although some odd control flow might make one possible.

The instruction
    xorl    %edx, %edx
is the standard way to clear a register (%edx in this case).  It is
completely unneeded here, since the mull instruction writes the 64-bit
product of %eax and %ebx into the register pair %edx:%eax, overwriting
whatever %edx held before.  It looks as if Matthew thought that this register
had to be cleared before the multiply.
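
To spell out the multiply semantics, here is a small C model of an unsigned
32x32 multiply that yields a 64-bit result.  The type wide_product and the
function mul_32x32 are names invented for this sketch; the point is only that
the high half is an output computed from the two operands, so its previous
contents cannot matter.

```c
#include <stdint.h>

/* The two halves of the 64-bit product; on x86, mull leaves
   lo in %eax and hi in %edx. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} wide_product;

/* Model of an unsigned 32x32 -> 64 multiply.  Neither half of the
   result depends on anything except a and b. */
wide_product mul_32x32(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * (uint64_t)b;
    wide_product r;
    r.lo = (uint32_t)p;          /* low 32 bits, what Word32_mul keeps */
    r.hi = (uint32_t)(p >> 32);  /* high 32 bits, discarded by Word32_mul */
    return r;
}
```

A 32-bit word multiply only uses the lo field, so the high half is simply
thrown away afterwards.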

This isn't the only hot loop in the code.  Also, I don't mean to imply that
the current speed is unacceptable.  For instance, if I convert the array of
seek pointers from Position.int option to Position.int (with -1 standing in
for NONE), the code speeds up to the point where the C version is only 1.95
times faster.  I definitely do NOT intend to do this in the code.  It is just
too ugly.
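
As a rough C analogy for the two representations, consider the sketch below.
It assumes that an option value is stored as a pointer to a separate heap
cell (with a null pointer for NONE); that is a guess about the representation
and not something verified for this compiler version.  The names
lookup_boxed, lookup_sentinel and NO_POSITION are invented for the example.

```c
#include <stddef.h>

/* Sentinel meaning "no seek pointer", the equivalent of NONE. */
#define NO_POSITION (-1L)

/* Option-style array: each slot points to a separately stored value,
   or is NULL for NONE.  A lookup needs two dependent loads. */
int lookup_boxed(const long *const *slots, size_t i, long *out)
{
    const long *cell = slots[i];   /* first load: the pointer */
    if (cell == NULL)
        return 0;                  /* NONE */
    *out = *cell;                  /* second load: the value */
    return 1;
}

/* Sentinel-style array: the value lives in the slot itself, so a
   lookup is one load and one compare. */
int lookup_sentinel(const long *slots, size_t i, long *out)
{
    long v = slots[i];
    if (v == NO_POSITION)
        return 0;                  /* NONE */
    *out = v;
    return 1;
}
```

The sentinel version saves a memory access per lookup and any allocation when
a slot is filled, at the price of giving one integer value a special meaning,
which is exactly the ugliness mentioned above.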

The  idea  is  just that this is a good opportunity to see places where MLton
should generate better code.