intraprocedural flattening

Stephen Weeks MLton@sourcelight.com
Thu, 19 Jul 2001 11:01:31 -0700


I implemented a simple intraprocedural flattening algorithm last night.  It
flattens a jump argument of tuple type if that argument only flows to select
statements.  This gets, for example, the problematic loop in Vector.unfold.  The
analysis is very fast, passes all the regressions, benchmarks, and
self-compiles.  The analysis runs at three places during CPS simplification,
pretty much right after contification.

On a self-compile, one of the passes saw 502 tupled jump arguments, and was able
to flatten 470 of them.  This didn't lead to spectacular performance
improvement in self-compile time, but did lead to a noticeable improvement.
We're now down to 474.5 seconds (breaking the 8 minute barrier), even including
the three extra passes of local flattening. 

Below are the benchmark results, where "old MLton" is without local flattening.
There is a bit of a slowdown (ratio .9 or .8 old MLton to new MLton) on a few of
the benchmarks, a bit of a speedup on about the same number, and a very nice
speedup on psdes-random and wc-scanStream, presumably due to removing an
allocation from of a loop.

I am a bit confused by the slowdown, since I am fairly sure that the
optimization will never introduce new allocation.  It does, however, move selects
earlier, and can cause selects to occur that would not have otherwise.  I think
I will combine the optimization with a forward analysis that only flattens a
jump argument if only tuples flow into it (in addition to the original
condition).

run time ratio
benchmark         old MLton
barnes-hut              1.0
checksum                0.9
count-graphs            1.2
DLXSimulator            0.9
fft                     1.1
fib                     1.0
hamlet                  1.1
knuth-bendix            1.0
lexgen                  1.0
life                    1.1
logic                   1.2
mandelbrot              1.1
matrix-multiply         0.9
md5                     1.0
merge                   1.0
mlyacc                  1.0
mpuz                    1.0
nucleic                 1.0
peek                    1.1
psdes-random            1.5
ratio-regions           1.0
ray                     1.0
raytrace                1.0
simple                  1.0
smith-normal-form       1.0
tak                     0.9
tensor                  0.9
tsp                     1.0
vector-concat           0.9
vector-rev              0.8
vliw                    1.1
wc-input1               1.0
wc-scanStream           1.6
zebra                   1.0
zern                    1.0

compile time
benchmark         MLton old MLton
barnes-hut          2.6       2.6
checksum            0.7       0.7
count-graphs        1.7       1.8
DLXSimulator        4.1       4.0
fft                 1.5       1.4
fib                 0.7       0.7
hamlet             51.1      48.2
knuth-bendix        2.4       2.3
lexgen              5.4       5.3
life                1.4       1.5
logic               7.2       7.3
mandelbrot          0.7       0.8
matrix-multiply     0.7       0.8
md5                 2.3       2.3
merge               0.7       0.8
mlyacc             19.2      19.1
mpuz                0.9       0.9
nucleic             4.0       4.0
peek                1.1       1.1
psdes-random        0.7       0.8
ratio-regions       3.0       3.0
ray                 3.3       3.2
raytrace            9.8       9.6
simple              7.1       7.2
smith-normal-form   7.7       7.7
tak                 0.7       0.7
tensor              3.0       2.9
tsp                 1.8       1.7
vector-concat       0.7       0.8
vector-rev          0.7       0.8
vliw               12.1      11.9
wc-input1           1.7       1.6
wc-scanStream       1.8       1.8
zebra               4.8       4.7
zern                1.2       1.1

run time
benchmark         MLton old MLton
barnes-hut          5.3       5.4
checksum            5.1       4.4
count-graphs        6.0       7.2
DLXSimulator       14.0      12.7
fft                 7.9       9.1
fib                 4.7       4.7
hamlet              9.8      10.7
knuth-bendix        8.4       8.1
lexgen             13.3      13.5
life               10.9      11.6
logic              26.8      31.0
mandelbrot          8.7       9.3
matrix-multiply     6.0       5.6
md5                 4.9       4.9
merge              39.0      40.8
mlyacc             10.7      10.9
mpuz                7.0       6.9
nucleic             8.3       8.5
peek                4.9       5.2
psdes-random        6.1       9.2
ratio-regions       9.4       9.3
ray                 6.0       6.1
raytrace            6.5       6.5
simple              7.3       7.4
smith-normal-form   1.1       1.1
tak                11.3      10.6
tensor              9.5       8.8
tsp                12.3      12.3
vector-concat       6.2       5.4
vector-rev          4.4       3.5
vliw                7.0       7.5
wc-input1           2.6       2.7
wc-scanStream       5.2       8.2
zebra               3.2       3.1
zern               45.1      44.8

size
benchmark           MLton old MLton
barnes-hut         63,672    63,904
checksum           21,812    23,396
count-graphs       41,076    43,228
DLXSimulator       88,060    87,500
fft                34,960    33,272
fib                21,564    23,212
hamlet            996,871   973,471
knuth-bendix       62,013    62,093
lexgen            129,916   131,004
life               38,548    40,052
logic             147,516   151,060
mandelbrot         21,532    23,276
matrix-multiply    22,236    23,740
md5                36,341    36,845
merge              22,708    24,348
mlyacc            427,948   423,196
mpuz               26,644    28,228
nucleic            58,780    60,420
peek               29,989    29,885
psdes-random       22,836    24,516
ratio-regions      60,372    59,932
ray                73,351    73,327
raytrace          183,628   180,572
simple            169,160   171,208
smith-normal-form 141,092   141,820
tak                21,596    23,236
tensor             69,259    68,619
tsp                39,101    39,069
vector-concat      22,356    23,828
vector-rev         22,364    23,836
vliw              272,776   262,936
wc-input1          42,661    42,533
wc-scanStream      45,285    45,093
zebra             111,501   111,485
zern               29,895    27,319