[MLton] cvs commit: Improvements to SSA{,2} shrinker in the presence of profiling

Matthew Fluet fluet@cs.cornell.edu
Sun, 12 Jun 2005 08:57:57 -0400 (EDT)


>   As stated above, this solves the performance problem with
>   wc-scanStream.  Unfortunately, it did not significantly affect any of
>   the other benchmarks.  The new outlier in the presence of profiling is
>   checksum.

In an attempt to understand where other performance problems lie with
profiling, I ran the benchmarks with -profile drop, but with added SSA,
SSA2, and RSSA passes that erase the profiling annotations at different
points in the pipeline.

MLton0 -- mlton -profile no
MLton1 -- mlton -profile drop /* drop profiling at start of SSA opts */
MLton2 -- mlton -profile drop /* drop profiling at end of SSA opts */
MLton3 -- mlton -profile drop /* drop profiling at start of SSA2 opts */
MLton4 -- mlton -profile drop /* drop profiling at end of SSA2 opts */
MLton5 -- mlton -profile drop /* drop profiling at start of RSSA opts */
MLton6 -- mlton -profile drop /* drop profiling at end of RSSA opts, */
                              /* before implementProfiling */
MLton7 -- mlton -profile drop /* don't drop profiling from ILs, */
                              /* but don't actually implement anything in */
                              /* implementProfiling */

run time ratio (relative to MLton0)
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7
barnes-hut          1.00   1.04   1.04   1.04   1.03   1.04   1.04   1.04
boyer               1.00   1.02   1.03   1.02   1.02   1.02   1.02   1.04
checksum            1.00   1.00   1.04   1.00   1.02   1.03   1.00   1.65
count-graphs        1.00   1.08   1.03   1.04   1.00   1.01   1.02   1.08
DLXSimulator        1.00   1.00   1.08   0.97   0.97   0.97   0.98   0.97
fft                 1.00   0.98   0.97   0.95   0.95   1.06   1.05   0.97
fib                 1.00   1.10   1.10   1.10   1.10   1.10   1.10   1.39
flat-array          1.00   1.25   1.05   0.96   1.04   1.08   0.96   0.96
hamlet              1.00   1.12   1.04   1.04   1.04   1.04   1.04   1.08
imp-for             1.00   1.00   0.99   0.99   1.03   0.99   0.99   0.99
knuth-bendix        1.00   1.10   1.10   1.10   1.10   1.10   1.10   1.24
lexgen              1.00   0.98   1.07   0.98   1.03   1.02   0.98   1.02
life                1.00   1.03   1.08   1.06   1.03   1.04   1.09   1.01
logic               1.00   1.04   0.96   0.96   0.96   0.96   0.96   1.00
mandelbrot          1.00   0.99   1.01   1.03   0.99   0.99   0.99   0.99
matrix-multiply     1.00   1.00   0.99   1.00   1.06   1.08   1.01   0.99
md5                 1.00   1.00   1.25   1.25   1.40   1.40   1.40   1.40
merge               1.00   1.00   1.00   1.00   1.00   1.00   1.02   1.16
mlyacc              1.00   1.11   1.14   1.05   1.07   1.01   1.01   1.02
model-elimination   1.00   0.92   0.91   0.91   0.92   0.99   0.91   1.03
mpuz                1.00   1.01   0.95   0.95   0.95   0.95   0.95   1.00
nucleic             1.00   0.92   0.92   0.92   0.92   0.92   0.92   0.92
output1             1.00   0.94   0.94   0.97   0.97   0.94   0.94   0.97
peek                1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.25
psdes-random        1.00   1.10   1.07   1.13   1.07   1.07   1.11   1.11
ratio-regions       1.00   1.04   1.09   1.09   1.09   1.09   1.09   1.18
ray                 1.00   1.00   1.01   1.07   1.01   1.01   1.01   1.03
raytrace            1.00   1.04   1.03   1.03   1.02   1.03   1.07   1.04
simple              1.00   1.06   1.07   1.00   0.99   0.99   0.99   1.15
smith-normal-form   1.00   0.97   1.07   1.02   0.97   0.97   0.97   0.97
tailfib             1.00   1.00   0.96   0.96   0.96   0.96   0.96   0.96
tak                 1.00   1.34   1.35   1.31   1.31   1.31   1.31   1.36
tensor              1.00   1.01   0.81   0.81   0.81   0.81   0.84   0.96
tsp                 1.00   1.00   1.00   1.03   1.04   1.00   1.00   1.01
tyan                1.00   1.06   1.19   1.13   1.06   1.10   1.18   1.08
vector-concat       1.00   1.02   1.01   1.00   1.00   1.00   1.00   0.99
vector-rev          1.00   1.09   0.99   1.11   0.99   0.99   0.99   0.99
vliw                1.00   0.98   0.99   1.07   1.01   0.99   0.98   1.03
wc-input1           1.00   1.02   1.02   1.02   1.02   1.05   1.06   0.99
wc-scanStream       1.00   1.02   1.01   1.01   1.01   1.01   1.01   0.95
zebra               1.00   0.98   0.99   1.02   0.98   0.99   0.98   0.97
zern                1.00   1.00   0.99   0.99   0.99   0.99   0.98   1.00

Among the benchmarks with a ratio >= 1.2 between keeping profiling all
the way through (MLton7) and no profiling (MLton0), there is no single
culprit:

checksum            1.00   1.00   1.04   1.00   1.02   1.03   1.00   1.65
fib                 1.00   1.10   1.10   1.10   1.10   1.10   1.10   1.39
knuth-bendix        1.00   1.10   1.10   1.10   1.10   1.10   1.10   1.24
md5                 1.00   1.00   1.25   1.25   1.40   1.40   1.40   1.40
peek                1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.25
tak                 1.00   1.34   1.35   1.31   1.31   1.31   1.31   1.36
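
For anyone who wants to redo the selection above, here is a minimal
sketch in Python. The ratios are copied from a few rows of the table;
the function name and the threshold parameter are only illustrative,
not part of any MLton tooling.

```python
# Run-time ratios copied from the table above (columns MLton0..MLton7).
# Only a handful of rows are included, to keep the sketch short.
RATIOS = {
    "checksum":      [1.00, 1.00, 1.04, 1.00, 1.02, 1.03, 1.00, 1.65],
    "fib":           [1.00, 1.10, 1.10, 1.10, 1.10, 1.10, 1.10, 1.39],
    "knuth-bendix":  [1.00, 1.10, 1.10, 1.10, 1.10, 1.10, 1.10, 1.24],
    "md5":           [1.00, 1.00, 1.25, 1.25, 1.40, 1.40, 1.40, 1.40],
    "merge":         [1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.02, 1.16],
    "peek":          [1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.25],
    "ratio-regions": [1.00, 1.04, 1.09, 1.09, 1.09, 1.09, 1.09, 1.18],
    "tak":           [1.00, 1.34, 1.35, 1.31, 1.31, 1.31, 1.31, 1.36],
}


def outliers(ratios, threshold=1.2):
    """Return the benchmarks whose MLton7 ratio (last column) is >= threshold."""
    return sorted(name for name, row in ratios.items() if row[-1] >= threshold)


if __name__ == "__main__":
    for name in outliers(RATIOS):
        print(name, RATIOS[name][-1])
```

merge (1.16) and ratio-regions (1.18) fall just under the cutoff, so
they are not selected.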

fib, knuth-bendix, and tak seem to suggest that there may be missed
simplifications before the SSA optimizations.  md5 is missing something in
the SSA optimizations, and again in the SSA2 optimizations.  checksum,
fib, knuth-bendix, and peek each seem to exhibit some cost being incurred
by implementing profiling.  (Though, with -profile drop, implementProfiling
should essentially just erase the profiling annotations.)
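
The reading above amounts to looking at where the ratio jumps between
adjacent columns.  A rough sketch of that attribution in Python follows;
the stage labels and the 0.10 cutoff are my own assumptions, chosen only
to make the jumps in these rows stand out, and differences of a few
percent are likely just noise.

```python
# Run-time ratios copied from the table above (columns MLton0..MLton7).
RATIOS = {
    "checksum":     [1.00, 1.00, 1.04, 1.00, 1.02, 1.03, 1.00, 1.65],
    "fib":          [1.00, 1.10, 1.10, 1.10, 1.10, 1.10, 1.10, 1.39],
    "knuth-bendix": [1.00, 1.10, 1.10, 1.10, 1.10, 1.10, 1.10, 1.24],
    "md5":          [1.00, 1.00, 1.25, 1.25, 1.40, 1.40, 1.40, 1.40],
    "peek":         [1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.25],
    "tak":          [1.00, 1.34, 1.35, 1.31, 1.31, 1.31, 1.31, 1.36],
}

# Hypothetical labels for what separates column i from column i+1:
# each later configuration keeps the annotations one step longer.
STAGES = [
    "before SSA opts",               # MLton0 -> MLton1
    "SSA opts",                      # MLton1 -> MLton2
    "SSA -> SSA2",                   # MLton2 -> MLton3
    "SSA2 opts",                     # MLton3 -> MLton4
    "SSA2 -> RSSA",                  # MLton4 -> MLton5
    "RSSA opts",                     # MLton5 -> MLton6
    "implementProfiling and later",  # MLton6 -> MLton7
]


def attribute(row, min_step=0.10):
    """Return the stages where the ratio rises by at least min_step."""
    hits = []
    for stage, before, after in zip(STAGES, row, row[1:]):
        # Round to two decimals so float noise does not hide a 0.10 step.
        if round(after - before, 2) >= min_step:
            hits.append(stage)
    return hits


if __name__ == "__main__":
    for name, row in RATIOS.items():
        print(f"{name:14s} {', '.join(attribute(row))}")
```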

The way forward is clear -- investigate md5 to isolate the missing
optimizations, then investigate the pre-SSA optimizations using tak,
and finally try to understand the cost of implementing profiling using
checksum.  But, I'm probably going to take a break from this, since each
of the benchmarks above is essentially a small loop, so I'm hopeful that
profiling has minimal impact on real programs.