SSA simplify passes

Matthew Fluet fluet@CS.Cornell.EDU
Wed, 9 Jan 2002 09:45:38 -0500 (EST)


> I could probably hack the register allocator to do these types of
> memory-memory moves through a general purpose register.  With register
> renaming, I shouldn't be slowed down by using the same register for both
> components of the float (right?).

This wasn't that hard to add, although the results are mixed.  The
original benchmarks I ran used one version, but I realized I could be a
little more aggressive when doing this optimization, so I'm rerunning
the numbers.

Note that there is a little bit of trickery in the way I'm handling
aliasing.  Consider the following:

	fldL (0+(56*1))(%ebp)
	fstpL (0+(80*1))(%ebp)

Easy enough to rewrite to:

	movl (0+(56*1))(%ebp),%eax
	movl %eax,(0+(80*1))(%ebp)
	movl ((0+(56*1))+4)(%ebp),%eax
	movl %eax,((0+(80*1))+4)(%ebp)

But, trying to rewrite

	fldL (0+(56*1))(%ebp)
	fstpL (0+(60*1))(%ebp)

to

	movl (0+(56*1))(%ebp),%eax
	movl %eax,(0+(60*1))(%ebp)
	movl ((0+(56*1))+4)(%ebp),%eax
	movl %eax,((0+(60*1))+4)(%ebp)

is incorrect, because the source and destination overlap: the first
movl clobbers the second word of the source float before it has been
read.  We need to perform the moves in the other order, as sketched
below.
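
That is, for this overlapping pair we have to move the high word first;
something like:

	movl ((0+(56*1))+4)(%ebp),%eax
	movl %eax,((0+(60*1))+4)(%ebp)
	movl (0+(56*1))(%ebp),%eax
	movl %eax,(0+(60*1))(%ebp)

Now the second word of the source has already been copied by the time
the low-word store overwrites it.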

(There is an orthogonal issue concerning the fact that the destination
address is not mod 8 aligned; I thought we had fixed all that, but maybe
not.)

So, the real problem is one of aliasing/overlapping.  By extending my
mayAlias predicate to give an ordering, I can choose the right order in
which to perform the moves.  This works fine for stack slots, because
we can always determine which address is smaller (we're just comparing
the offsets).  Unfortunately, with the heap this usually isn't
possible; two arbitrary array elements can always potentially alias,
and we can't tell which address is smaller.  Currently, if we're trying
to do a mem-mem floating point move between two locations that mayAlias
in an indeterminate order, I just punt and move it through the floating
point registers.  So, for example, the copy of a f.p. array into
another f.p. array will always bounce through the f.p. registers.  I
think we can avoid this, because heap locations have the added property
of never overlapping, even if two locations mayAlias: two heap doubles
either coincide exactly or are disjoint, and in either case the
word-by-word moves are safe in either order.  But, currently, I haven't
added that check.
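
For reference, the punted form is just the f.p. bounce itself; with the
two element addresses held in registers (%eax and %ecx here are only
illustrative) it would look something like:

	fldL (%eax)
	fstpL (%ecx)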


The mandelbrot benchmark had a 13% slowdown (13.98s to 15.85s) with the
"fast" f.p. move.  The difference between the two versions in the
assembly is two lines in L_17: the fldL/fstpL pair (left) becomes four
integer moves (right):

L_17:                                   L_17:
        fildl (0+(44*1))(%ebp)                  fildl (0+(44*1))(%ebp)
        faddL (0+(8*1))(%ebp)                   faddL (0+(8*1))(%ebp)
        fmulL (globaldouble+(2*8))              fmulL (globaldouble+(2*8))
        fstL (0+(48*1))(%ebp)                   fstL (0+(48*1))(%ebp)
        movl $0,(0+(72*1))(%ebp)                movl $0,(0+(72*1))(%ebp)
        fstpL (0+(64*1))(%ebp)                  fstpL (0+(64*1))(%ebp)
        fldL (0+(32*1))(%ebp)         |         movl (0+(32*1))(%ebp),%esi
        fstpL (0+(56*1))(%ebp)        |         movl %esi,(0+(56*1))(%ebp)
                                      >         movl ((0+(32*1))+4)(%ebp),%esi
                                      >         movl %esi,((0+(56*1))+4)(%ebp)
.p2align 4,,7                           .p2align 4,,7
loop3_0:                                loop3_0:
        movl (0+(72*1))(%ebp),%esi              movl (0+(72*1))(%ebp),%esi
        cmpl $2048,%esi                         cmpl $2048,%esi
        jnl L_11                                jnl L_11
L_16:                                   L_16:
        fldL (0+(64*1))(%ebp)                   fldL (0+(64*1))(%ebp)
        fld %st                                 fld %st
        fmul %st, %st                           fmul %st, %st
        fldL (0+(56*1))(%ebp)                   fldL (0+(56*1))(%ebp)
        fld %st                                 fld %st
        fmul %st, %st                           fmul %st, %st
        fld %st(2)                              fld %st(2)
        fadd %st(1), %st                        fadd %st(1), %st
        fldL (globaldouble+(3*8))               fldL (globaldouble+(3*8))
        fxch %st(1)                             fxch %st(1)
        fcompp                                  fcompp
        fnstsw %ax                              fnstsw %ax
        testw $0x4500,%ax                       testw $0x4500,%ax
        jz L_39                                 jz L_39

But, the original (left) version has:
[fluet@lennon mandelbrot]$ mlprof -x -d 2 mandelbrot.false mlmon.false.out 
13.98 seconds of CPU time
main_0                                           100.00% (13.98s)
     L_16                         59.01% (8.25s)                 
          L_16    100.00% (8.25s)                                
     L_17                         17.53% (2.45s)                 
          L_17    100.00% (2.45s)                                

and the new (right) version has:
[fluet@lennon mandelbrot]$ mlprof -x -d 2 mandelbrot.true mlmon.true.out 
15.85 seconds of CPU time
main_0                                             100.00% (15.85s)
     L_16                          64.10% (10.16s)                 
          L_16    100.00% (10.16s)                                 
     L_17                           16.28% (2.58s)                 
          L_17     100.00% (2.58s)                                 

There is a little slowdown in L_17, which is where the change is, but
that doesn't seem that significant; the difference in L_16 is the
striking one.  What I think is going on is that the pipelines of the
integer unit and the floating-point unit are sufficiently disjoint
that, when performing the fldL (0+(56*1))(%ebp) in L_16, the original
version can fetch the value from on chip (because it was recently
written by the floating-point unit at fstpL (0+(56*1))(%ebp)), but in
the new version the value can't be fetched from on chip because the
store is still tied up in the integer unit.