[MLton-devel] Fwd: Re: pretty damn good

Suresh Jagannathan suresh@cs.purdue.edu
Mon, 4 Nov 2002 17:12:59 -0500


Hi Brad,

If you don't mind, I'm forwarding your message to the MLton mailing
list since it seems the questions you've raised may be of broad
interest to the group.

-- Suresh

----------  Forwarded Message  ----------

Subject: Re: pretty damn good
Date: Mon, 4 Nov 2002 17:06:39 -0500 (EST)
From: Brad Lucier <lucier@math.purdue.edu>
To: suresh@cs.purdue.edu
Cc: lucier@math.purdue.edu (Brad Lucier)

I've been playing with the C code generated by MLton and various
compiler optimizations.  This is about the best I can get at the
moment:

[lucier@dsl-207-066 mlton-20020923]$ gcc -I/usr/lib/mlton/self/include -O1
 -fomit-frame-pointer -fschedule-insns2 -fno-strict-aliasing -fno-math-errno
 nucleic.batch.c -o nucleic.batch.2 -L/usr/lib/mlton/self -lmlton -lm
 /usr/lib/libgmp.a -O2
[lucier@dsl-207-066 mlton-20020923]$ time ./nucleic.batch.2
17.730u 2.326s 0:20.45 98.0%    0+0k 0+0io 108pf+0w


So it seems that you're paying about a 6% penalty on this benchmark for
going through C.  That's not so bad if, in return, the C back end can be
made more portable.

I also looked at the C code generated.  You may (or may not) get a bit more
performance by using gcc's computed gotos for returns rather than going
through the dispatch table on the chunk switch.  (This will also increase
compile times significantly, and probably screw up gcse right now; some of
gcc's optimizations don't handle the large, interconnected call graphs
generated when using computed gotos.)  You also don't always
go through a trampoline, only for intermodule calls; we must have been
talking at cross purposes about trampolines.
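
For what it's worth, here is a tiny sketch of the two dispatch styles I
mean.  The block names and the "blocks" table are made up for
illustration; this is not MLton's actual generated code, just the shape
of the transformation, assuming gcc's "labels as values" extension:

    /* Style 1: switch-based dispatch at the top of the chunk.  A return
       sets an integer block id and loops back to the switch. */
    void chunk_switch (int start) {
      int next_block = start;
      for (;;)
        switch (next_block) {
        case 0: /* ... code for block 0 ... */ next_block = 1; break;
        case 1: /* ... code for block 1 ... */ return;
        }
    }

    /* Style 2: computed gotos.  Each block's address sits in a table of
       label pointers, and a return becomes an indirect goto, skipping
       the switch entirely. */
    void chunk_goto (int start) {
      static void *blocks[] = { &&block0, &&block1 };
      goto *blocks[start];
    block0:
      /* ... code for block 0 ... */
      goto *blocks[1];
    block1:
      /* ... code for block 1 ... */
      return;
    }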

About the above options: -fschedule-insns2 is redundant with -O2; I
believe, however, that you might need -fno-strict-aliasing for correct
compilation.
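
To make that last point concrete: the kind of thing -fno-strict-aliasing
protects is code that reinterprets the same heap bytes at several types.
The snippet below is a hypothetical illustration, not MLton's actual
output; under -O2, gcc's strict-aliasing assumption would permit it to
treat the double store and the int load as independent accesses and
reorder them:

    #include <stdio.h>

    static unsigned char heap[16];   /* stand-in for an untyped ML heap */

    int store_then_read (double x) {
      *(double *) heap = x;          /* write the slot as a double ... */
      return *(int *) heap;          /* ... read the same bytes as an int */
    }

    int main (void) {
      printf ("%d\n", store_then_read (1.0));
      return 0;
    }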

I'd like to see how this thing does on other benchmarks; how *do* you
run the benchmarks with various options?

Brad

> I downloaded mlton to my 350MHz PII linux box, finally figured out how
> to run the nucleic benchmark, and got the following timings:
>
> [lucier@dsl-207-066 mlton-20020923]$ time ./nucleic.batch
> 16.837u 2.330s 0:21.97 87.2%    0+0k 0+0io 109pf+0w
> [lucier@dsl-207-066 mlton-20020923]$ time ./nucleic.batch
> 16.939u 2.218s 0:19.30 99.1%    0+0k 0+0io 108pf+0w
>
> The time for Gambit-C on the same benchmark is
>
> [lucier@dsl-207-066 gambit]$ time ./nucleic -:m10000
> (time (run-bench name count run ok?))
>     2488 ms real time
>     2475 ms cpu time (2420 user, 55 system)
>     38 collections accounting for 97 ms real time (102 user, 0 system)
>     392602904 bytes allocated
>     2568 minor faults
>     22 major faults
> 2.443u 0.083s 0:03.58 70.3%     0+0k 0+0io 535pf+0w
> [lucier@dsl-207-066 gambit]$ time ./nucleic -:m10000
> (time (run-bench name count run ok?))
>     2478 ms real time
>     2464 ms cpu time (2400 user, 64 system)
>     38 collections accounting for 101 ms real time (98 user, 0 system)
>     392602904 bytes allocated
>     2568 minor faults
>     22 major faults
> 2.451u 0.074s 0:03.52 71.5%     0+0k 0+0io 535pf+0w
>
> (This is using Gambit's extensions of uniform double-precision vectors
> and float- and fixnum-specific arithmetic functions.  This is using the
> beta version of Gambit-C 4.0 and gcc 3.3 (experimental).)
>
> If I read your ML code correctly, it runs the loop 200 times; the gambit
> code runs it 10 times, so mlton's version is taking (16.939+2.218)/200=
> .0957850000 seconds, while gambit's version is taking (2.451+0.074)/10=
> .2525000000 seconds.
>
> Brad

-------------------------------------------------------

