[MLton-user] more optimization questions

Stephen Weeks MLton-user@mlton.org
Tue, 20 Dec 2005 15:22:57 -0800


> I'm running triply-nested loops with lots of FP and I'm seeing about  
> a 3:1 slow-down over an equivalent C program.  I'm pretty sure that  
> this number is real, because if I do a rough estimate of flops I'm  
> way off using ML.

Hi Brian.  I've taken a look at

  http://mlton.org/pages/TemporaryUpload/attachments/fd2.tar.gz

and I now understand what's going on.  Quick summary: the C codegen is
using procedure calls for Real.abs.  Eliminating those gets the run
time to well under 2X that of C.  

Now for the long explanation.  First, here are the timings (all in
seconds) I see on my machines.

On my x86:

fd2	C		 2.35
fd2_1x	MLton-native	 3.45
fd2_2x	MLton-native	 3.87
fd2_1x	MLton-C		11.82
fd2_2x	MLton-C 	11.71

On my PowerPC:

fd2	C	 	 4.40
fd2_1x	MLton-C		13.35
fd2_2x	MLton-C		13.48

Profiling showed that all the time is spent in the inner loop (line 59
of fd2.sml and line 43 of fd2_1.sml).  The fact that the C codegen is
slow on x86 as well as on PowerPC made me think that the problem was
in the codegen rather than anything specific to PowerPC.  So, I
compiled with '-keep g' and looked at the generated code.  Fortunately
the hot loop was only two basic blocks of assembly.

--------------------------------------------------------------------------------
loop_19:
	cmpl $0x190,%edi
	jnl L_1868
L_1361:
	movl %edi,(localWord32+((1*4)+(0*1)))
	fildl (localWord32+((1*4)+(0*1)))
	fldL (0+((40*1)+(0*1)))(%ebp)
	fmul %st, %st(1)
	fldL (globalReal64+((5*8)+(0*1)))
	fmulL (0+((32*1)+(0*1)))(%ebp)
	fmul %st(4), %st
	fxch %st(2)
	fsubL (globalReal64+((6*8)+(0*1)))
	fmulp %st, %st(2)
	movl %esi,%edi
	incl %edi
	imull $0x191,%edi
	movl (localWord32+((1*4)+(0*1))),%edx
	addl %edx,%edi
	movl (0+((24*1)+(0*1)))(%ebp),%ecx
	fldL (0+(0*1))(%ecx,%edi,8)
	movl %esi,%edi
	decl %edi
	imull $0x191,%edi
	addl %edx,%edi
	movl %edx,%ebx
	movl %edx,%eax
	faddL (0+(0*1))(%ecx,%edi,8)
	incl %ebx
	movl %ebx,%edi
	movl %ebx,(localWord32+((1*4)+(0*1)))
	movl %esi,%ebx
	imull $0x191,%ebx
	addl %ebx,%edi
	faddL (0+(0*1))(%ecx,%edi,8)
	decl %eax
	addl %ebx,%eax
	faddL (0+(0*1))(%ecx,%eax,8)
	addl %ebx,%edx
	fldL (0+(0*1))(%ecx,%edx,8)
	fld %st
	fmulL (globalReal64+((4*8)+(0*1)))
	fsubrp %st, %st(2)
	fxch %st(3)
	fmul %st(2), %st
	fmulp %st, %st(2)
	fsubp %st, %st(1)
	fmulL (0+((16*1)+(0*1)))(%ebp)
	fld %st
	fabs
	faddp %st, %st(3)
	faddp %st, %st(1)
	fstpL (0+(0*1))(%ecx,%edx,8)
	movl (localWord32+((1*4)+(0*1))),%edi
	jmp loop_19
--------------------------------------------------------------------------------

On the other hand, the hot loop in the generated C was much more
verbose. 

--------------------------------------------------------------------------------
loop_19:
	W32_1 = WordS32_lt (W32_0, 0x190);
	BNZ (W32_1, L_1384);
	W32_0 = Word32_add (0x1, S(Word32, 36));
	R64_0 = S(Real64, 64);
	S(Word32, 36) = W32_0;
	goto loop_18;
L_1384:
	R64_0 = WordS32_toReal64 (W32_0);
	R64_1 = Real64_mul (S(Real64, 48), R64_0);
	R64_2 = Real64_mul (-0.36E2, S(Real64, 40));
	R64_3 = Real64_mul (R64_2, S(Real64, 56));
	R64_4 = Real64_sub (R64_1, 0.1E1);
	R64_5 = Real64_mul (R64_4, R64_3);
	W32_1 = Word32_add (0x1, S(Word32, 36));
	W32_2 = WordS32_mul (W32_1, 0x191);
	W32_3 = Word32_add (W32_2, W32_0);
	R64_6 = X(Real64, S(Pointer, 32), W32_3, 8, 0);
	W32_4 = Word32_sub (S(Word32, 36), 0x1);
	W32_5 = WordS32_mul (W32_4, 0x191);
	W32_6 = Word32_add (W32_0, W32_5);
	R64_7 = X(Real64, S(Pointer, 32), W32_6, 8, 0);
	R64_8 = Real64_add (R64_7, R64_6);
	S(Word32, 72) = Word32_add (W32_0, 0x1);
	W32_7 = WordS32_mul (0x191, S(Word32, 36));
	W32_8 = Word32_add (W32_7, S(Word32, 72));
	R64_9 = X(Real64, S(Pointer, 32), W32_8, 8, 0);
	R64_10 = Real64_add (R64_8, R64_9);
	W32_9 = Word32_sub (W32_0, 0x1);
	W32_10 = Word32_add (W32_7, W32_9);
	R64_11 = X(Real64, S(Pointer, 32), W32_10, 8, 0);
	R64_12 = Real64_add (R64_11, R64_10);
	S(Word32, 76) = Word32_add (W32_7, W32_0);
	R64_13 = X(Real64, S(Pointer, 32), S(Word32, 76), 8, 0);
	R64_14 = Real64_mul (R64_13, 0.4E1);
	R64_15 = Real64_sub (R64_12, R64_14);
	R64_16 = Real64_mul (R64_5, S(Real64, 48));
	R64_17 = Real64_mul (S(Real64, 48), R64_16);
	R64_18 = Real64_sub (R64_15, R64_17);
	S(Real64, 80) = Real64_mul (S(Real64, 24), R64_18);
	{
	Real64 tmp0 = S(Real64, 80);
	S(Word32, 88) = 143;
	Push (92);
	FlushFrontier();
	FlushStackTop();
	CReturnW32 = Real64_class (tmp0);
	}
	CacheFrontier();
	CacheStackTop();
L_1383:
	Push (-92);
	W32_0 = CReturnW32;
	P_0 = G(Pointer, 15);
	P_1 = G(Pointer, 7);
L_140:
	W32_1 = O(Word32, P_1, 0);
	W32_2 = Word32_equal (W32_0, W32_1);
	BNZ (W32_2, L_1379);
switch ((Word32)P_0) {
case 0x1:
	GPNR(0) = G(Pointer, 16);
	Raise ();
default:
	P_1 = O(Pointer, P_0, 4);
	P_2 = O(Pointer, P_0, 0);
	P_0 = P_1;
	P_1 = P_2;
	goto L_140;
}
L_1379:
	W8_0 = O(Word8, P_1, 4);
switch (W8_0) {
case (Word8)0x1:
	R64_0 = S(Real64, 80);
L_149:
	R64_1 = Real64_add (S(Real64, 64), R64_0);
	R64_2 = X(Real64, S(Pointer, 32), S(Word32, 76), 8, 0);
	R64_3 = Real64_add (S(Real64, 80), R64_2);
	X(Real64, S(Pointer, 32), S(Word32, 76), 8, 0) = R64_3;
	S(Real64, 64) = R64_1;
	W32_0 = S(Word32, 72);
	goto loop_19;
case (Word8)0x2:
	R64_0 = S(Real64, 16);
	goto L_149;
default:
	{
	Real64 tmp0 = S(Real64, 80);
	S(Word32, 88) = 142;
	Push (92);
	FlushFrontier();
	FlushStackTop();
	CReturnW32 = Real64_signBit (tmp0);
	}
	CacheFrontier();
	CacheStackTop();
L_1378:
	Push (-92);
	W32_0 = CReturnW32;
	W32_1 = Word32_equal (W32_0, 0x0);
	BNZ (W32_1, L_1376);
	R64_0 = Real64_neg (S(Real64, 80));
	goto L_149;
L_1376:
	R64_0 = S(Real64, 80);
	goto L_149;
}
--------------------------------------------------------------------------------

This points to the problem.  The fabs in the assembly has been
replaced in the C by a whole mess of code, including calls to
Real64_class and Real64_signBit.  A look at the implementation of
Real.abs in basis-library/real/real.fun shows what's going on.

--------------------------------------------------------------------------------
      val abs =
         if MLton.Codegen.isNative
            then abs
         else
            fn x =>
            case class x of
               INF => posInf
             | NAN => x
             | _ => if signBit x then ~x else x
--------------------------------------------------------------------------------

The SVN log shows that this change was made on 20030905, in r2311.
The log doesn't explain the reason for the change, but having run into
this situation a lot, I can say with near certainty that the reason
is that the C fabs did not produce results consistent with the SML
spec of Real.abs on at least one of our platforms.  So, we went with
the more obviously compliant, but slower, version.
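For illustration, here is a minimal C sketch of what such a
correctness wrapper computes, mirroring the SML fallback above (the
function name is mine, not MLton's): a NaN is returned unchanged and
the sign bit is tested explicitly, rather than trusting the platform's
fabs.

```c
#include <math.h>

/* Sketch of the real.fun fallback above.  NAN => x; the INF and
   ordinary cases both fall out of the sign-bit test, since negating a
   negative infinity yields posInf. */
double real64_abs_portable(double x) {
  if (isnan(x))
    return x;                   /* NAN => x  (sign bit preserved) */
  return signbit(x) ? -x : x;   /* if signBit x then ~x else x */
}
```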

There are a number of ways to improve the situation.

  1. _import and use a C wrapper around fabs
  2. tweak real.fun to eliminate the "correctness wrapper" around abs
  3. (2), and cause MLton's C codegen to emit a direct call to fabs

(1) is the easiest because it doesn't require any changes to MLton.
(2) and (3) both require modifications to MLton -- however, both can
be achieved by modifying files distributed with MLton and without
recompiling the compiler.  I tried them all out -- here are the
timings on my PowerPC.

	fd2_1x	fd2_2x
	------	------
orig	13.35	13.48
(1)	 7.64	 9.02
(2) 	 6.13	 7.23
(3)	 5.00	 5.91
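For concreteness, the C side of (1) need only be a thin wrapper that
the SML code _imports as real -> real (the exported name here is an
assumption, not from the original fd2 sources):

```c
#include <math.h>

/* Hypothetical C wrapper for option (1).  The SML side would declare
   something like:  val fabs = _import "fd2_fabs": real -> real;
   MLton treats this as an opaque FFI call, which is why (1) is slower
   than (2). */
double fd2_fabs(double x) {
  return fabs(x);
}
```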

The reason why (2) is faster than (1), which also bypasses the
correctness wrapper, is that with (2), MLton understands that Real.abs
is a primitive as opposed to an FFI call and so can generate more
efficient code.  In the C codegen, the primitive is still implemented
by a C call to a function that just calls fabs.  The win in going from
(2) to (3) is in eliminating the C wrapper around fabs.  If anyone
wants to repeat my experiment, I did (3) by adding a line to
lib/mlton/include/c-chunk.h:

  #define Real64_abs fabs

In any case, recall that the time for the hand-written C code was
4.40.  So, now we're talking about a 14% or 34% slowdown relative to
C.  That is well within the norm for what we see with MLton,
especially with the C codegen on PowerPC.  On my x86, the speedups
from (3) are also quite nice.

	fd2_1x	fd2_2x
	------	------
orig	11.82	11.71
(3)	 3.04	 4.16

Since the hand-written C runs in 2.35, we've got a 29% or 77% slowdown
over C.  Again, well within the norms.  Interestingly, for fd_1x, the C
codegen with (3) is even faster than the native codegen (which yielded
3.45).

One thing that would be worth understanding is why fd2_1x is faster
than fd2_2x. Perhaps there is some trick in fd2_1x that could apply to
MLton's implementation of Array2.
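For reference, the imull $0x191 instructions in the assembly suggest
that the grid is flattened into a single array of width 401 (0x191),
with element (i, j) at index i*401 + j.  A hypothetical C sketch of
that layout (the dimensions are an inference from the generated code,
not from the fd2 sources):

```c
/* Row-major flattening of a 401-wide grid into one array, as the
   generated code's index arithmetic (i*0x191 + j) suggests.  A single
   multiply-add per access, with no intermediate row objects, is the
   kind of trick a 2-D array abstraction like Array2 has to match. */
enum { WIDTH = 401 };

static double grid[WIDTH * WIDTH];

double *cell(int i, int j) {
  return &grid[i * WIDTH + j];
}
```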

In any case, hopefully this clears up any mysteries about the factor
of 3 slowdown over C.