RSSA backend

Matthew Fluet fluet@CS.Cornell.EDU
Thu, 17 Jan 2002 14:43:56 -0500 (EST)


> > 4. make the array initialization loops explicit in MACHINE; I guess this
> >     would be an insertArrayInitializations after insertSignalChecks
> > 
> > I think we eventually want 4, which solves all the problems at once.
> 
> Agreed.  I will do it right now.  It should be easy.

Great.  Other than that, what's currently checked in to the rssa branch
passes all the regressions, benchmarks, and a self-compile.  

Benchmark results are equally as strange as the C codegen ones, although
they don't seem to agree on what got speedups.

run time ratio
benchmark         MLton1
barnes-hut          1.02
checksum            0.99
count-graphs        0.80
DLXSimulator        0.88
fft                 1.08
fib                 1.01
hamlet              1.13
imp-for             0.79
knuth-bendix        0.98
lexgen              0.96
life                0.95
logic               1.13
mandelbrot          0.90
matrix-multiply     0.91
md5                 0.94
merge               0.99
mlyacc              0.97
mpuz                1.01
nucleic             0.95
peek                2.58
psdes-random        0.90
ratio-regions       0.96
ray                 1.11
raytrace            0.99
simple              0.99
smith-normal-form   1.00
tailfib             0.80
tak                 0.98
tensor              0.61
tsp                 1.09
tyan                1.09
vector-concat       0.41
vector-rev          0.98
vliw                0.97
wc-input1           1.15
wc-scanStream       0.61
zebra               0.77
zern                0.94

I've no real idea either; peek is ok, because knownCase in the stable
version is speeding it up to 0.3 of the running time of the SSA optimizer
in the RSSA branch.  logic and wc-input1 might be worth looking at.

I think I see why there are such significant speedups, at least for the
x86 codegen.  The limit check insertion and pseudo-register allocator are
doing "better" in the RSSA branch.  For example, the loop in tail fib
looks like this in RSSA branch

fibP_0 {kind = Jump, live = (RI(4), RI(3), RI(2), RI(0))}:
    Switch (RI(4), [(0, L_12)], Some (L_15))

and like this in the stable branch:

fibP_0 {kind = Jump, live = [SI(4), SI(8), SI(12), SI(16)]}:
Switch (SI(16), [(0, L_10)], Some (L_13))

So, the RSSA looks a lot more like -native-live-stack true; in any event,
the x86 codegen turns the first loop into:

fibP_0:
	testl %edi,%edi
	jz L_41
L_15:
	decl %edi
	jo L_10
L_34:
	addl %ecx,%edx
	jo L_43
L_33:
	xchgl %edx,%ecx
	jmp fibP_0

because it will pass pseudo-regs in registers by default,
while it turns the second into:

fibP_0:
	movl (0+(16*1))(%ebp),%esi
	testl %esi,%esi
	jz L_27
L_13:
	decl %esi
	jo L_8
L_19:
	movl (0+(12*1))(%ebp),%edi
	movl %edi,%edx
	addl (0+(8*1))(%ebp),%edx
	jo L_29
L_20:
	movl %esi,(0+(16*1))(%ebp)
	movl %edi,(0+(8*1))(%ebp)
	movl %edx,(0+(12*1))(%ebp)
	jmp fibP_0

because it loads and stores stack slots.

On the down side, this means that there is a little slowdown in compile
times, because the live sets manipulated by the codegen are larger.