[MLton-devel] nucleic benchmark times

Matthew Fluet fluet@CS.Cornell.EDU
Wed, 13 Nov 2002 12:35:26 -0500 (EST)


> Again, the other problem is that it isn't a short-term task to try and
> tune another backend.  Nor is it trivial to extract the pros and cons of
> individual decisions.  My belief is that a huge win in the native codegen
> is the ability to assert that stack and heap locations don't alias.
> (Actually, that would be fairly easy to check.)

In an attempt to put some hard numbers with these opinions, I ran
benchmarks with the following options:

MLton0 -- mlton.cvs.HEAD -native true -detect-overflow true -native-alias false
MLton1 -- mlton.cvs.HEAD -native true -detect-overflow true -native-alias true
MLton2 -- mlton.cvs.HEAD -native true -detect-overflow false -native-alias false
MLton3 -- mlton.cvs.HEAD -native true -detect-overflow false -native-alias true
MLton4 -- mlton.cvs.HEAD -native false -detect-overflow true
MLton5 -- mlton.cvs.HEAD -native false -detect-overflow true -DFAST_INT
MLton6 -- mlton.cvs.HEAD -native false -detect-overflow false

The new -native-alias {false,true} option regulates the mayAlias function
in the native codegen.  When it it false, distinct memory classes (i.e.,
heap, stack, locals, globals, etc.) are assumed not to alias; with it
true, then these memory classes may alias.  So, with -native-alias true,
the native codegen will be more like C in handling reads/writes to the
stack and heap (although I wouldn't claim that the alias analyses would be
the same).  Detecting overflow was an easy set of flags to check and seems
like it could slow down the C-codegen code.

Here are the partial orders I would expect.

MLton0 <= MLton1  -- aliasing the stack and heap hurts
MLton2 <= MLton3
MLton0 <= {MLton4,MLton5} -- and C-codegen hurts
MLton2 <= MLton6

MLton2 <= MLton0  -- detecting overflow hurts
MLton3 <= MLton1
MLton6 <= MLton5 <= MLton4

Here are the runtime ratio results:

run time ratio
benchmark         MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut          1.23   0.70   0.81   1.47   1.26   0.93
boyer               1.15   1.04   1.24   1.23   1.23   1.36
checksum            1.28   0.87   1.08  10.68   1.67   1.40
count-graphs        1.57   0.97   1.51   2.81   2.43   2.36
DLXSimulator        1.50   1.83   1.90   1.35   1.21   1.30
fft                 1.09   1.06   1.05   1.25   1.06   1.09
fib                 1.28   0.97   1.22   4.25   1.73   1.61
hamlet              1.30   1.04   1.23   2.02   1.81   1.83
imp-for             1.00   0.76   0.76  15.70   1.12   2.07
knuth-bendix        1.35   1.03   1.38   1.57   1.59   1.62
lexgen              1.32   1.00   1.30   3.09   2.33   2.60
life                1.45   0.93   1.45   2.69   2.73   2.74
logic               1.15   0.99   1.16   1.25   1.21   1.21
mandelbrot          1.00   0.92   0.92   5.15   2.20   2.13
matrix-multiply     1.43   0.89   1.45   3.42   2.33   3.03
md5                 1.21   0.96   1.16   3.78   1.83   1.57
merge               1.08   1.00   1.08   1.39   1.39   1.39
mlyacc              1.24   0.99   1.22   1.87   1.53   1.52
model-elimination   1.22   0.99   1.22   1.84   1.60   1.63
mpuz                1.48   0.93   1.39   5.88   2.28   2.09
nucleic             1.14   0.95   1.08   1.23   1.23   1.14
peek                1.25   0.63   0.63  22.00   3.75   4.58
psdes-random        1.26   1.06   1.28   4.13   1.89   1.92
ratio-regions       1.20   0.95   1.17   3.18   2.02   2.00
ray                 1.24   1.01   1.22   1.81   1.61   1.49
raytrace            1.19   1.00   1.19   1.66   1.72   1.59
simple              1.28   0.99   1.28   2.96   1.86   1.90
smith-normal-form   1.00   1.00   1.00   1.00   0.99   1.00
tailfib             1.00   0.87   0.87  15.03   1.21   1.02
tak                 1.15   0.95   1.14   2.94   1.88   1.83
tensor              1.00   0.41   0.41  23.71   1.73   0.84
tsp                 1.36   1.00   1.36   1.15   1.13   1.13
tyan                1.19   0.99   1.19   1.50   1.37   1.30
vector-concat       1.88   1.01   1.83   8.59   2.15   2.18
vector-rev          1.20   0.92   1.17   4.44   1.54   1.54
vliw                1.18   0.99   1.17   2.08   1.44   1.55
wc-input1           1.79   1.07   1.54   8.39   4.48   4.65
wc-scanStream       1.83   0.96   1.83   4.98   2.99   2.84
zebra               1.45   0.99   1.44   2.24   2.16   2.13
zern                1.22   0.97   1.22   4.60   2.08   1.98


Here's how my intuition stacks up.  The [...] are the violations of the
proposed ordering.  The f.ff is the average difference (including
violations).

MLton0 <= MLton1 -- [], 0.28
MLton2 <= MLton3 -- ["fft"], 0.25

Reasonable and shows that aliasing is a non-trivial cost.

MLton0 <= MLton4 -- [], 3.76
MLton0 <= MLton5 -- ["smith-normal-form"], 0.84
MLton2 <= MLton6 -- ["DLXSimulator"], 0.89

Again, reasonable.  The MLton0 <= MLton4 is skewed by some outliers.  But
MLton0 <= MLton5 and MLton2 <= MLton6 suggest quite an impressive gain by
just using the native codegen.  Certainly more than can just be attributed
to aliasing.  With that in mind, we look at

MLton1 <= MLton4 -- ["DLXSimulator","tsp"],3.47925
MLton1 <= MLton5 -- ["DLXSimulator","fft","smith-normal-form","tsp"],0.56575
MLton3 <= MLton6 -- ["DLXSimulator","tsp"],0.6385

"tsp" and "fft" violations suggest that floating-point could still be a
little better in the native codegen, but no violations in nucleic and
raytrace suggest we're not doing too badly.  I don't have
any intuition about "DLXSimulator"; there's some bit fiddling to decode
instructions but most of it should be list manipulations.  My only guess
with smith-normal-form is that since virtually everything will be FFI
calls to gmp, the C-codegen is just better at organizing around C-calls.
But, obviously the majority of time the native codgen is beating the
C-codgen by a non-trivial margin.  I'm not sure where to attribute the
speedup.  Steve doesn't think it comes from control-flow
(trampolining/integer returns).  I will note that all the C-codegen
benchmarks were  run with the default options, in particular -O1.  If -O2
-no-strict-aliasing is safe and correct, then that might be a better
comparision.  But I can't imagine it getting a 2X speedup.
A possibility is integer overflow; clearly the "slow" integer functions in
the C-codegen can be quite bad when integer ops dominate.  Even in the
others, it probably accounts for a bit.  But, even the "fast" integer
functions don't close the gap between the C and native codgens by all that
much.

Comparing -detect-overflow {true,false} yields more anomalies.

MLton2 <= MLton0 -- ["boyer","DLXSimulator","fft","hamlet",
                     "knuth-bendix","psdes-random","ray",
		     "vector-concat","wc-input1"], 0.04
MLton3 <= MLton1 -- ["boyer","DLXSimulator","knuth-bendix",
                     "logic","matrix-multiply",
		     "psdes-random"],0.06

DLXSim is again an unexplained outlier, and I think it skews the MLton2 <=
MLton0 average quite a bit.  Looking through the benchmarks above, the
violations (other than DLXSim) are all < 0.07 (which is still a little odd
that there are any violations).  And, there are some impressive speed-ups
in tensor, peek, and imp-for.  I think it would be worthwhile to implement
an SSA pass that eliminates integer overflow checks on code of the form:
	fun all (a, b, f) =
	    if a > b then
		true
	    else if f a then
		all (a+1, b, f)
	    else
		false
when the bounds are known (not necessarily constant, because we'd like to
use the fact that a vector length is always <= maxInt, hence an upper
bound of (Vector.length v - 1) wouldn't overflow).

MLton5 <= MLton4 -- ["knuth-bendix","life","raytrace"],2.9135
MLton6 <= MLton4 -- ["boyer","knuth-bendix","life"],2.9055
MLton6 <= MLton5 -- ["boyer","DLXSimulator","fft","hamlet",
                     "imp-for","knuth-bendix","lexgen",
		     "life","matrix-multiply","model-elimination",
		     "peek","psdes-random","simple","smith-normal-form",
		     "vector-concat","vliw","wc-input1"],~0.01

Very surprisingly, -detect-overflow true -DFAST_INT vs
-detect-overflow false has almost no effect in the C-codegen on average.
There are a lot of violations in either direction for MLton6 vs MLton5.



-------------------------------------------------------
This sf.net email is sponsored by: Are you worried about 
your web server security? Click here for a FREE Thawte 
Apache SSL Guide and answer your Apache SSL security 
needs: http://www.gothawte.com/rd523.html
_______________________________________________
MLton-devel mailing list
MLton-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mlton-devel