[MLton-user] Why is recursion slower than C?

Matthew Fluet fluet at tti-c.org
Wed Apr 25 07:52:25 PDT 2007


> According to
> http://shootout.alioth.debian.org/debian/benchmark.php?test=recursive&lang=all 
> MLton is much slower than C. Indeed, if one looks at the simplest case 
> where it's slower (Fibonacci) the difference is over 50%.
> 
> In looking over the assembler generated for the SML function
>   fun fib n = if n < 2 then 1 else fib (n - 2) + fib (n - 1)
> as compared to the C function
>   int fib(int n) {
>     if (n < 2) {
>       return 1;
>     }
>     return fib(n - 2) + fib(n - 1);
>   }
> the main differences appear to be that:
>   MLton checks for stack and integer overflow (the gcc code doesn't)
>   MLton uses jmp instead of call/return
>   gcc uses leal to do the subtraction, while MLton uses subl.

> What I don't understand is why the MLton version is so much slower. The 
> overflow checks don't seem particularly costly. 

Right; we've not seen significant overheads due to overflow checking. 
Certainly with the native codegen, where the check is a single 
instruction, it doesn't have much effect.  (With the C codegen, we need 
to do comparisons to check for overflow; I think we tried inline 
assembly at one point, but it was brittle.)
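
For what it's worth, here is a rough sketch (mine, not MLton's actual
runtime code) of the kind of comparison-based check the C codegen has
to emit, whereas the native codegen just follows the subl/addl with a
single jo:

   #include <limits.h>

   /* Comparison-based signed-overflow check.  Signed overflow is
      undefined behavior in C, so the test has to happen before the
      arithmetic itself.  (Illustration only.) */
   static int add_would_overflow (int x, int y) {
     if (y > 0 && x > INT_MAX - y) return 1;
     if (y < 0 && x < INT_MIN - y) return 1;
     return 0;
   }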

> I'm assuming the problem
> is that MLton uses a lot of MOVs+JMP instead of the CALL instruction. Is 
> this correct? My assembler is really rusty, so I'm not sure what's 
> slower here.

I have no idea.  I would imagine that, even in a C program, JMP 
instructions (statically and dynamically) outnumber CALL/RET 
instructions, so Amdahl's law says that Intel/AMD should optimize JMP 
before CALL/RET.  On the other hand, there may be enough semantic 
information in a CALL instruction that it can be optimized somewhat in 
silicon.

> From a theoretical point of view, shouldn't MLton have a performance 
> advantage from not having to conform to the C calling convention? Most 
> of the function prelude in gcc's assembler seems superfluous.

In theory, yes.  In practice, we pass all function arguments and returns 
via the ML stack, never in registers.  So, we don't really win on that 
front.
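
To make that concrete, here is a hypothetical C illustration (the names
are mine, not anything MLton generates): a register-passed argument
versus one that lives in a slot on an explicit stack, which is roughly
what every MLton argument and result looks like:

   /* Register-passed: n can stay in a register the whole time. */
   static int succ_in_register (int n) {
     return n + 1;
   }

   /* Stack-slot-passed: load the argument from the frame and store
      the result back into the frame.  (Illustration only.) */
   static void succ_via_stack_slot (char *stackTop) {
     int n = *(int *)stackTop;
     *(int *)stackTop = n + 1;
   }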

> Also, it seems MLton is using %ebp for the stack pointer, instead of 
> %esp. Why is this? I gather this is why call/return are out.

In a MLton-compiled program, there are effectively two stacks: the 
system C stack and the ML stack (allocated in the ML heap, which in turn 
is allocated in the system C heap).  There are various reasons for using 
%ebp over %esp:
   http://mlton.org/pipermail/mlton/2002-May/021982.html

The principal reasons that CALL/RET doesn't work for the ML stack are that:
1) it grows 'up' rather than 'down' (so CALL/RET would adjust the stack 
pointer in the wrong direction), and
2) the return address isn't placed where CALL/RET would place it.
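
As a rough C sketch of what pushing and popping an ML frame amounts to
(my own illustration; the -4 mirrors the 32-bit listing below):

   typedef struct {
     char *top;                          /* plays the role of %ebp      */
   } MLStack;

   static void ml_push_frame (MLStack *s, int frameSize, void *retAddr) {
     s->top += frameSize;                /* grow UP, e.g. addl $8,%ebp  */
     *(void **)(s->top - 4) = retAddr;   /* return address below top    */
   }

   static void ml_pop_frame (MLStack *s, int frameSize) {
     s->top -= frameSize;                /* e.g. addl $-8,%ebp          */
   }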

> MLton:
> fib_0:
> L_534:
>         cmpl %ebp,((gcState+12)+(0*4)) // guard against stack overflow
>         jnb L_533
> L_525:
> // This seems to all be about growing the stack.

Correct.  This is a little bad; stack overflow is rare, so we would 
rather have the conditional jump go to the GC_collect block and fall 
through to the non-GC_collect blocks.  That would give better 
instruction cache behavior, since with this code layout we effectively 
get just one useful instruction (the cmpl %ebp,stackLimit) in the cache 
line.
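
In C you would express the layout we want with something like gcc's
__builtin_expect, so that the rare path gets moved out of line and the
common path falls through (a sketch of the idea only, not how our
codegen actually emits the check):

   extern void GC_collect_and_retry (void);   /* hypothetical helper */

   static void check_stack (char *stackTop, char *stackLimit) {
     /* Tell the compiler the overflow case is unlikely, so it lays
        the GC call out of line and lets the common case fall through
        into the function body. */
     if (__builtin_expect (stackTop > stackLimit, 0))
       GC_collect_and_retry ();
   }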

> L_533:
> // Method starts here
>         movl (0+((0*1)+(0*1)))(%ebp),%esi
>         cmpl $0x2,%esi
>         jl L_1284
> L_532:
>         subl $0x2,%esi
>         jo L_1283 // catch integer overflow
> L_530:
>         movl %esi,(0+((8*1)+(0*1)))(%ebp)
>         addl $8,%ebp
>         movl $L_529,(0+(-1*4))(%ebp)
>         jmp fib_0 // resumes control at L_529

Sort of; note that the addl $8,%ebp and the movl $L_529,stackTop[-4] 
grow the stack and set a return address for the (first) non-tail call 
of fib_0.  So, this is an inter-procedural jump, not an intra-procedural 
jump.
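
Using gcc's computed gotos, the whole non-tail-call convention looks
roughly like this (illustration only, not MLton output; the layout
loosely mirrors the listing):

   #include <stdio.h>

   #define FRAME 16    /* caller's frame size; it is 8 in the listing */

   static char mlStack[4096] __attribute__ ((aligned (16)));

   int main (void) {
     char *top = mlStack;                /* plays the role of %ebp     */
     *(int *)top = 10;                   /* a local of the caller      */

     /* "call": store the argument in the callee's frame, grow the
        stack, save the return label, and jump. */
     *(int *)(top + FRAME) = 42;         /* callee's argument slot     */
     top += FRAME;                       /* like addl $8,%ebp          */
     *(void **)(top - sizeof (void *)) = &&L_return;
     goto callee;

   callee:
     *(int *)top += 1;                   /* compute the result in place */
     {
       void *ret = *(void **)(top - sizeof (void *));
       goto *ret;                        /* "return": jmp *(-4)(%ebp)  */
     }

   L_return:
     top -= FRAME;                       /* pop the callee's frame     */
     printf ("local = %d, result = %d\n",
             *(int *)top, *(int *)(top + FRAME));
     return 0;
   }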

> .p2align 4
> L_1283:
> L_531:
>         movl $0x1,(globalPointerNonRoot+((0*4)+(0*1)))
>         movl ((gcState+1028)+(0*4)),%ebp
>         addl ((gcState+16)+(0*4)),%ebp
>         jmp *(0+(-1*4))(%ebp)

This is throwing the overflow exception.  The movl $0x1,exnResult 
instruction stores the Overflow tag in a global exception location.  The 
next two instructions roll back the stack to the exception handler's 
frame.  The indirect jump starts execution at the exception handler.
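
In C-ish terms, the raise amounts to something like this (a rough
sketch with my own names, not the actual runtime code):

   static int exnTag;              /* the global exception location    */
   static char *handlerFrame;      /* top of the handler's stack frame */

   /* Record which exception was raised, roll the ML stack top back
      to the innermost handler's frame, and hand back the code
      pointer saved just below it, where execution resumes. */
   static void *raise_overflow (char **stackTop) {
     exnTag = 1;                             /* Overflow's tag        */
     *stackTop = handlerFrame;               /* roll back the stack   */
     return *(void **)(*stackTop - sizeof (void *));
   }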

> .p2align 4
> L_1284:
> L_526:
>         movl $0x1,(0+((0*1)+(0*1)))(%ebp)
>         jmp *(0+(-1*4))(%ebp)

This is the return of 1 when n < 2.  This is probably one place where we 
lose compared to CALL/RET.  IIRC, modern processors keep a shadow 
return-address stack that records the return addresses of recent CALL 
instructions.  So, when a RET instruction comes down the pipeline, the 
processor already has a good guess at the target return address and can 
begin prefetching.  (Of course, code could still futz with the return 
address on the stack, so the processor can only speculatively prefetch. 
If the return address did get changed (e.g., by a buffer overflow 
attack), then the processor pays the hit of flushing the pipe and going 
to the specified address.)

In our case, we've got an indirect jump.  You would hope that branch 
prediction helps here, but fib is a strange beast.  It has two non-tail 
recursive calls, so there are at least three places this indirect jump 
can go: either of the non-tail recursive call return points, or any fib 
caller (one that called fib with n < 2).  Because of the recursion, the 
vast majority of these indirect jumps go back to the non-tail recursive 
call return points, but I'm pretty sure that if you look at the dynamic 
behavior of fib, this return bounces back and forth between the two 
non-tail recursive call return points -- which would utterly foil branch 
prediction.
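
A quick way to see this (a hypothetical instrumentation of my own, not
anything MLton emits): tag each call of fib with its call site and
print the tag as each activation returns, which approximates the
sequence of targets that single indirect jump sees:

   #include <stdio.h>

   static int fib (int n, char site) {
     int r;
     if (n < 2)
       r = 1;
     else {
       int a = fib (n - 2, 'A');       /* first non-tail call site    */
       int b = fib (n - 1, 'B');       /* second non-tail call site   */
       r = a + b;
     }
     putchar (site);                   /* the return point we jump to */
     return r;
   }

   int main (void) {
     fib (10, 'T');                    /* 'T' marks the outside caller */
     putchar ('\n');
     return 0;
   }

The printed sequence mixes A's and B's rather than settling on either
one, which is the kind of pattern a simple last-target predictor for
the indirect jump keeps getting wrong.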

> .p2align 4
> .long 48
> L_529:
>         addl $-8,%ebp
>         movl (0+((8*1)+(0*1)))(%ebp),%esi
>         movl %esi,(0+((4*1)+(0*1)))(%ebp)
>         movl (0+((0*1)+(0*1)))(%ebp),%esi
>         decl %esi
>         movl %esi,(0+((12*1)+(0*1)))(%ebp)
>         addl $12,%ebp
>         movl $L_528,(0+(-1*4))(%ebp)
>         jmp fib_0

Here is the second non-tail recursive call of fib.

> .p2align 4
> .long 4
> L_528:
>         addl $-12,%ebp
>         movl (0+((12*1)+(0*1)))(%ebp),%esi
>         movl %esi,(0+((8*1)+(0*1)))(%ebp)
>         movl (0+((4*1)+(0*1)))(%ebp),%edi
>         addl %esi,%edi
>         jo L_531
> L_527:
>         movl %edi,(0+((0*1)+(0*1)))(%ebp)
>         jmp *(0+(-1*4))(%ebp)

This is the return of fib (n - 2) + fib (n - 1) when n >= 2.  Again, 
there is an indirect jump, usually back to one of the non-tail recursive 
call return points, and sometimes back to the fib caller.  Again, I 
suspect that the return-address stack associated with CALL/RET handles 
this much better than branch prediction, since this return will also 
bounce back and forth between the two non-tail recursive call return 
points.



