[MLton-user] Some numbers on SML/NJ & MLton runs and also a few questions

Stephen Weeks sweeks@sweeks.com
Tue, 7 Oct 2003 11:29:06 -0700


> Wow, mlton is really fast.

:-)

> The MLton runs are compiled with
>     mlton -static true -verbose 2 -detect-overflow false -safe false  \
>       -inline 100000 weeks.cm 
> (But I don't think any of these fancy options really buy much.)

Right.  Except with the C codegen, where -detect-overflow false can
help a lot because overflow detection in C is slow.
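
To put the tradeoff concretely (a made-up fragment, not from your
benchmark): with overflow detection on, which is the default, SML
integer arithmetic raises Overflow when it overflows; with
-detect-overflow false the check is simply omitted.

    (* Raises Overflow by default; with -detect-overflow false the
       check is gone and the result is whatever the unchecked
       arithmetic produces. *)
    val big = valOf Int.maxInt
    val _ = big + 1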

> Times are in tenths of a second, except for the side-effecting algos
> dlm (side-effecting lists) and hp (vector heapsort), which are in
> tenths of a millisecond (moral: purity costs you three orders of
> magnitude. Ouch).

Wow.  That's a lot more than I would've guessed.

> Result: Mlton is way faster than sml/nj; just compare entries in the two
> matrices:

Yeah.  The factor of 2 or so we're used to.  But the destructive-list
(dlm) and heapsort (hp) speedups were especially large.  Perhaps the
dlm speedup can be partially explained by an expensive write barrier
in SML/NJ.  But the heapsort?  Hmmm... Oh well, we've always thought
our focus in MLton on getting to a simple IL with loops and
traditional optimizations was a good thing.

> Now, what's interesting is that the technique that is speediest depends
> on the compiler. SML/NJ likes the iterative-cons&reverse algo, while MLton
> likes the stack algo, and the differences are significant:
> SML/NJ
>     onm   1281    1217     1241     763    935    175    467
>     onm4   971     894      932     497    664     94    494
> 
> MLton
>     onm   686     607      652     322    454     75   301
>     onm4  960     855      908     429    614     97   588
> 
> I guess the issue is that SML/NJ's implementation of the "stack" algo
> doesn't actually use a stack -- it allocates the proc frames in the
> heap.

That's my explanation as well.  The real reason why the iterative
algorithm is better with SML/NJ is that they've slowed down the
stack.
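
I don't know exactly what your onm code looks like, but as a stand-in,
the two styles for a simple list map are roughly (a sketch, not your
code):

    (* "Stack" style: natural recursion, one frame per list element. *)
    fun mapRec f [] = []
      | mapRec f (x :: xs) = f x :: mapRec f xs

    (* Iterative cons & reverse: a tail-recursive loop plus a final rev. *)
    fun mapIter f xs =
       let
          fun loop ([], acc) = List.rev acc
            | loop (x :: xs, acc) = loop (xs, f x :: acc)
       in
          loop (xs, [])
       end

With SML/NJ, each frame of mapRec is heap allocated, which fits your
explanation below.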

Although, in MLton the tradeoff sometimes goes the other way.  For
example, List.@ is faster when written as an iterative double reverse
than when written recursively.
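
Concretely, the same two styles specialized to append look something
like this (a sketch, not the actual Basis Library code):

    (* Recursive: one stack frame per element of the first list. *)
    fun appendRec ([], ys) = ys
      | appendRec (x :: xs, ys) = x :: appendRec (xs, ys)

    (* Iterative double reverse: two tail-recursive passes over the
       first list, no stack growth. *)
    fun revOnto ([], ys) = ys
      | revOnto (x :: xs, ys) = revOnto (xs, x :: ys)
    fun appendIter (xs, ys) = revOnto (revOnto (xs, []), ys)

In MLton, the extra pass and the extra cons cells are apparently
cheaper than pushing and popping all those frames.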

> But I *don't* understand how MLton makes the recursive algo work so well.

In the absence of other information, I would guess it's more just a
case of SML/NJ being slow.

If you look at our performance page, you'll see that we don't do
anything all that special with recursion.  E.g., for fib and tak,
you'll see that Poly/ML does better than MLton, and that the ML Kit
and SML/NJ aren't far behind.
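
For reference, fib and tak are essentially just the usual naive
definitions, roughly:

    fun fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

    fun tak (x, y, z) =
       if y < x
          then tak (tak (x - 1, y, z), tak (y - 1, z, x), tak (z - 1, x, y))
       else z

so there's nothing special in the source for a compiler to exploit.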

If you want to try to understand better what MLton's doing, you can
use -keep g, -keep ssa, and -keep dot to look more at the ILs.  But of
course it's quite an investment to learn the ILs.
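
E.g., something like

    mlton -keep ssa -keep dot weeks.cm

should leave the SSA program and the dot files for its control-flow
graphs lying around after the compile.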

> I have tried to get a compile on a Sparc, but it bombs out, trying
> to compile with a -mno-epilogue flag. Gcc on my Sparc box has no
> such flag (but it does have an -mno-prologue-epilogue flag). Any
> hints?

Unfortunately, the -mno-epilogue flag is hardwired into the
executable.  Since the 20030716 release we have fixed that bad design
decision by moving the gcc flags into the mlton script.  So, your
options are to

1. Recompile MLton, taking out the bad switch.  It should be easy to
find in main.sml.

2. Build your own package or ask me to build an experimental one from
the MLton CVS.  Then you can modify the mlton script.

3. Switch to a gcc that supports -mno-epilogue.  I used
binutils-2.11.2-sol7-sparc-local and gcc-2.95.3-sol8-sparc-local from
sunfreeware.com.

> I also wonder a few things about the Sparc/C implementation (the usual
> compiling-through-C issues):
>   - Do ML proc frames get allocated on the C call stack, or are
>     they allocated with a different, explicit chunk of memory?

They are not allocated on the C call stack.  They are pushed and
popped on the SML stack, which is an object allocated in the SML heap.

>   - How efficient is heap allocation?

Both the native codegen and the C codegen start from the same
low-level IL (called Machine in the sources) that explicitly manages
the frontier, limit checks, object allocation, etc.  So, heap
allocation with the C codegen should be pretty similar to what the
native codegen does.  We do the usual
tricks of keeping the frontier in a C variable so it stays in a
register, coalescing limit checks, and keeping some slop at the end of
the nursery so that most limit checks can be done with a single
compare.

>   - How is general tail recursion managed?

There are only ever two frames on the C stack, one for main and one
for the currently executing code (a chunk in MLton terminology).  We
use a trampoline to switch between chunks.
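
If it helps to see the idea apart from the generated C, the general
shape of a trampoline, sketched in SML and much simplified, is:

    (* Each step either finishes with a result or hands back the next
       step to run. *)
    datatype 'a bounce = Done of 'a | More of (unit -> 'a bounce)

    (* The driver: runs one step at a time.  No step ever calls another
       step directly, so the stack stays at constant depth.  (In C this
       would just be a while loop.) *)
    fun drive (Done x) = x
      | drive (More f) = drive (f ())

    (* A million "tail calls" that never deepen the stack. *)
    fun countdown n =
       if n = 0 then Done n else More (fn () => countdown (n - 1))

    val zero = drive (countdown 1000000)

In the generated C, the steps are chunks and the driver loop lives in
main, which is why the C stack never gets deeper than those two
frames.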