SSA simplify passes

Matthew Fluet fluet@CS.Cornell.EDU
Fri, 4 Jan 2002 14:02:26 -0500 (EST)


Here's a little experiment I just ran.  I added a -loop-passes
integer option which sets the number of times to run through the list of
SSA simplification passe.  Here are the benchmark results:

MLton0 -- mlton -loop-passes 1
MLton1 -- mlton -loop-passes 2
compile time
benchmark         MLton0 MLton1
barnes-hut          3.44   3.95
checksum            0.84   0.80
count-graphs        2.30   2.48
DLXSimulator        6.14   6.87
fft                 1.71   1.86
fib                 0.74   0.73
hamlet             75.12  91.49
imp-for             0.77   0.82
knuth-bendix        3.12   3.34
lexgen              7.56   9.32
life                1.71   1.88
logic               4.32   5.59
mandelbrot          0.76   0.81
matrix-multiply     0.84   0.80
md5                 1.64   1.66
merge               0.80   0.80
mlyacc             29.03  37.35
mpuz                1.05   1.22
nucleic             3.79   3.88
peek                1.33   1.47
psdes-random        0.82   0.81
ratio-regions       3.34   5.62
ray                 4.78   5.60
raytrace           11.55  13.29
simple              9.61  12.10
smith-normal-form  10.85  11.51
tailfib             0.72   0.73
tak                 0.71   0.76
tensor              4.09   4.71
tsp                 1.98   2.03
tyan                5.03   6.01
vector-concat       0.84   0.84
vector-rev          0.80   0.79
vliw               16.70  24.14
wc-input1           2.19   2.45
wc-scanStream       2.17   2.50
zebra               7.32   9.95
zern                1.43   1.46
run time
benchmark         MLton0 MLton1
barnes-hut          5.44   5.00
checksum            4.06   4.04
count-graphs        6.07   5.52
DLXSimulator       22.50  22.41
fft                13.72  13.52
fib                 4.24   4.27
hamlet             10.60  10.29
imp-for            10.25  11.28
knuth-bendix        8.57   8.21
lexgen             13.69  12.43
life                8.74   9.62
logic              26.92  22.69
mandelbrot          7.72   7.87
matrix-multiply     5.48   1.35
md5                 2.53   0.50
merge              75.85  74.91
mlyacc             12.51  14.04
mpuz                5.69   5.59
nucleic            10.29  12.80
peek                4.05   3.74
psdes-random        4.18   3.89
ratio-regions      11.80  10.50
ray                 4.98   4.80
raytrace            6.00   6.54
simple              7.52   8.34
smith-normal-form   1.26   1.26
tailfib            19.29  19.28
tak                10.79  10.79
tensor              6.88   6.87
tsp                11.11  11.12
tyan               25.17  26.22
vector-concat       7.21   7.29
vector-rev          5.90   5.76
vliw                8.08   8.34
wc-input1           2.25   1.86
wc-scanStream       4.52   4.40
zebra               2.94   2.96
zern               48.77  44.88
run time ratio
benchmark         MLton1
barnes-hut          0.92
checksum            0.99
count-graphs        0.91
DLXSimulator        1.00
fft                 0.99
fib                 1.01
hamlet              0.97
imp-for             1.10
knuth-bendix        0.96
lexgen              0.91
life                1.10
logic               0.84
mandelbrot          1.02
matrix-multiply     0.25
md5                 0.20
merge               0.99
mlyacc              1.12
mpuz                0.98
nucleic             1.24
peek                0.92
psdes-random        0.93
ratio-regions       0.89
ray                 0.96
raytrace            1.09
simple              1.11
smith-normal-form   1.00
tailfib             1.00
tak                 1.00
tensor              1.00
tsp                 1.00
tyan                1.04
vector-concat       1.01
vector-rev          0.98
vliw                1.03
wc-input1           0.83
wc-scanStream       0.97
zebra               1.01
zern                0.92
size
benchmark            MLton0    MLton1
barnes-hut           69,982    69,326
checksum             21,585    21,585
count-graphs         44,713    41,777
DLXSimulator         99,417    97,273
fft                  33,069    32,613
fib                  21,585    21,585
hamlet            1,498,772 1,581,652
imp-for              21,577    21,577
knuth-bendix         65,938    62,698
lexgen              157,297   173,089
life                 41,481    42,729
logic                89,313    96,449
mandelbrot           21,545    21,513
matrix-multiply      22,009    21,785
md5                  31,986    27,746
merge                22,793    22,809
mlyacc              578,545   639,377
mpuz                 26,529    26,321
nucleic              63,769    63,081
peek                 31,514    31,370
psdes-random         22,553    22,569
ratio-regions        44,537    46,057
ray                  85,492    86,132
raytrace            204,769   202,833
simple              195,405   217,709
smith-normal-form   149,546   150,010
tailfib              21,257    21,257
tak                  21,705    21,689
tensor               72,321    76,353
tsp                  37,474    34,090
tyan                 91,954    94,978
vector-concat        22,393    22,393
vector-rev           22,209    21,905
vliw                340,541   416,821
wc-input1            45,514    46,474
wc-scanStream        46,794    47,746
zebra               127,954   121,154
zern                 28,684    28,484


Some interesting results:
md5 -- if you look at the .ssa of -loop-passes 1, you'll see three
functions all of the form:
fun x_148 (x_154) = L_88 ()
  L_88 ()
    ()
which are used in about 45 non-tail calls.  Before the last removeUnused
call, these functions do somthing, but their results aren't used, so they
are trimmed down to:
fun x_148 (x_154) = L_88 ()
  L_88 ()
    x_165 = Word32_ge (x_154, global_11)
    case x_165 of
      false => L_93 | true => L_94
  L_94 ()
    L_92 ()
  L_93 ()
    L_92 ()
  L_92 ()
    x_161 = Word32_sub (global_11, x_154)
    x_163 = Word32_ge (x_161, global_11)
    case x_163 of
      false => L_90 | true => L_91
  L_91 ()
    L_89 ()
  L_90 ()
    L_89 ()
  L_89 ()
    ()
which the shinker reduces to the above (this explains why removeUnused
didn't eliminate the function argument, although it is clearly unused in
the final function).

I think that just eliminating those extraneous non-tail calls sped it up
quite a bid.

matrix-multiply: As best I can make out, something (I don't know what
pass) establishes the fact that the source matrix is initialized, but
never updated -- so it replaces all the array subs (and associated
computation of subscripts) with the constant 1.0.

imp-for: This one slows down by 10%.  Very strange, especially considering
that when you look at the .ssa files, which are almost identical.  One
thing that is different is that there are a number of blocks whose
argument lists are reversed from what they are in -loop-passes 1.
Looking at the assembly, this is confirmed -- most of the code is
identical, where they differ are stack offset numbers and some reversals
of instructions in accesing arguents.  You can see above that the sizes of
the executables are identical.  Some subtle caching?

nucleic: Looking at the datatypes of the -loop-passes 2 version, my guess
is too much flattening of tuples.  In particular, something like this
looks bad:
fun x_8 (x_23,
	 x_22,
	 x_21,
	 x_20,
	 x_19,
	 x_18,
	 x_17,
	 x_16,
	 x_15,
	 x_14,
	 x_13,
	 x_12,
	 x_11,
	 x_10,
	 x_9) = L_16 ()
  L_16 ()
    x_68 = (x_12,
	    x_13,
	    x_14,
	    x_15,
	    x_16,
	    x_17,
	    x_18,
	    x_19,
	    x_20,
	    x_21,
	    x_22,
	    x_23)
    x_81 = #13 x_11
    case x_81 of
      A_0 => L_23 | _ => L_35
which is the expansion of this:
fun x_283 (x_521, x_520, x_519) = L_97 ()
  L_97 ()
    x_577 = #2 x_520
    x_578 = #1 x_520
    x_582 = #13 x_577
    case x_582 of
      A_0 => L_104 | _ => L_116


Those are the ones that I've looked at; I don't know if there is a point
to be made, but it's an experiment I've been meaning to try.