SSA simplify passes

Matthew Fluet <fluet@CS.Cornell.EDU>
Wed, 9 Jan 2002 16:03:23 -0500 (EST)


> > I could probably hack the register allocator to do these types of
> > memory-memory moves through a general purpose register.  With register
> > renaming, I shouldn't be slowed down by using the same register for both
> > components of the float (right?).
> 
> This wasn't that hard to add, although the results are mixed.  The
> original benchmarks I ran used one version, but I realized I could be a
> little more aggressive when doing this optimization, so I'm rerunning
> numbers.

Here are the latest floating point benchmark results:

MLton0 -- mlton -native-fastFMOV 0
MLton1 -- mlton -native-fastFMOV 1
MLton2 -- mlton -native-fastFMOV 2
MLton3 -- mlton -native-fastFMOV 3
MLton4 -- mlton -native-fastFMOV 4
MLton5 -- mlton -native-fastFMOV 5
run time ratio (relative to MLton0)
benchmark       MLton1 MLton2 MLton3 MLton4 MLton5
barnes-hut        1.00   1.01   1.00   1.01   1.00
fft               1.00   1.00   0.99   1.00   0.99
hamlet            1.00   1.00   1.00   1.00   1.00
mandelbrot        1.00   1.13   1.13   1.13   1.13
matrix-multiply   1.02   1.01   1.00   0.99   1.00
nucleic           1.00   0.92   0.92   0.92   0.92
ray               1.00   1.00   1.00   1.00   1.00
raytrace          1.00   1.01   1.01   1.01   1.01
simple            1.00   1.00   1.00   1.00   1.00
tensor            1.00   1.00   1.00   1.00   1.00
tsp               1.00   1.00   1.00   1.00   1.00
tyan              1.00   1.00   1.00   1.00   1.00
vliw              1.00   1.00   1.01   1.00   1.00
zern              1.00   0.99   0.99   0.99   0.99

O.k.  There are two orthogonal optimizations that can be done.  The second
optimization is the mem-mem move optimization I've been talking about;
this is on when -native-fastFMOV = 2, 3, 4, or 5 and off otherwise.  As a
"sub-option", we can make the "mayAlias heap locations do not overlap"
assumption and perform mem-mem moves of mayAliasing heap values in an
indeterminate order through a general purpose register; this is on when
-native-fastFMOV = 4 or 5 and off otherwise.
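
For concreteness, a mem-mem move of a double through the floating point
stack looks like the following (the offsets and the choice of %eax below
are illustrative, not actual codegen output):

        fldL (0+(104*1))(%ecx)
        fstpL (0+(112*1))(%ecx)

whereas the same move through a general purpose register copies the two
32-bit halves separately, reusing the same register for both halves
(which, per the register renaming comment quoted above, shouldn't cost
anything):

        movl (0+(104*1))(%ecx),%eax   # low 32 bits of the source double
        movl %eax,(0+(112*1))(%ecx)
        movl (4+(104*1))(%ecx),%eax   # high 32 bits
        movl %eax,(4+(112*1))(%ecx)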

The first optimization is orthogonal, but it occurred to me while I was
working on the other one.  By default, any floating-point move would
always force
the destination of the move to be the top of the floating point stack;
i.e., a floating point move always looked like fldL (address) or fld
%st(i); if there were no future uses of the destination, then the next
instruction would likely be an fstpL (address).  If the destination would
like to be an address (i.e., no future uses) and the source would like to
be an address (i.e., not already in the f.p. stack and no future uses of
it), then we have the mem-mem move case above.  If, on the other hand, the
source would like to be a f.p. register (i.e., already in the f.p. stack
or there are some future uses), then it would appear to be better to get
the source into the top of the floating point stack and perform the move
to the destination as fstL (address).  As a concrete example, this
optimization turns the code on the left into the code on the right:

        fld %st                       |         fstL (0+(104*1))(%ecx)
        fstpL (0+(104*1))(%ecx)       |         fstL (0+(112*1))(%ecx)
        fld %st                       |         fstL (0+(120*1))(%ecx)
        fstpL (0+(112*1))(%ecx)       |         fstL (0+(128*1))(%ecx)
        fld %st                       |         fstL (0+(136*1))(%ecx)
        fstpL (0+(120*1))(%ecx)       <
        fld %st                       <
        fstpL (0+(128*1))(%ecx)       <
        fld %st                       <
        fstpL (0+(136*1))(%ecx)       <
        fstpL (0+(144*1))(%ecx)                 fstpL (0+(144*1))(%ecx)

This would seem to be a win.  Anyway, this optimization is on when
-native-fastFMOV = 1, 3, or 5 and off otherwise.


Looking at the benchmarks, the most obvious observation is that the first
optimization seems to have no significant effect, while the second
optimization affects programs both positively and negatively.  mandelbrot
slows down for the reasons described in my previous comments; nucleic
seems to do a lot of shuffling of floating point values in and out of
argument positions, and these shuffles are all changed into mem-mem moves
through a GPR.

I'm torn: a couple hundred "good" optimizations lead to an 8% speed-up in
nucleic, while one "bad" optimization leads to a 13% slow-down on
mandelbrot.