X86 floating point registers

Fri, 1 Jun 2001 11:57:54 -0700 (PDT)

> Speaking of X86 floating point registers, did you look at this paper on
> exactly that by the MLRisc people?
>
>     http://cm.bell-labs.com/cm/cs/what/smlnj/compiler-notes/x86-fp.ps

Yeah, I read that back when the working version that introduced it came out.
It's on my codegen todo list to revisit the floating point stuff and see
if any of those techniques might be applicable.  It probably wouldn't be
too bad.

> I remember not being convinced that things worked quite as they say from some
> quick experiments, but I seem to recall that the notion was that swapping
> floating point registers cost nothing because it just did register renaming.

Maybe.  I want to say that I've also heard that there are performance
issues with referencing deep locations on the stack.  The way things are
set up right now, I keep the number of exchanges pretty low.

Out of curiosity, here's what MLton is producing for the example in the
paper:

a = x * x
b = y * y
c = z * z
d = sqrt(a + b + c)
w = d / (a - b - c)

        Instructions                 Stack
	fldl (144*1)(%edi)           [x]
	fmul %st, %st                [x * x]
	fldl (152*1)(%edi)           [y, x * x]
	fmul %st, %st                [y * y, x * x]
	fld %st                      [y * y, y * y, x * x]
	fldl (%eax)                  [z, y * y, y * y, x * x]
	fmul %st, %st                [z * z, y * y, y * y, x * x]
	fld %st                      [z * z, z * z, y * y, y * y, x * x]
	fxch %st(2)                  [y * y, z * z, z * z, y * y, x * x]
	fadd %st(4), %st             [x^2 + y^2, z^2, z^2, y^2, x^2]
	faddp %st, %st(2)            [z^2, x^2 + y^2 + z^2, y^2, x^2]
	fxch %st(1)                  [x^2 + y^2 + z^2, z^2, y^2, x^2]
	fsqrt                        [sqrt(x^2 + y^2 + z^2), z^2, y^2, x^2]
	fxch %st(3)                  [x^2, z^2, y^2, sqrt(...)]
	fsubp %st, %st(2)            [z^2, x^2 - y^2, sqrt(...)]
	fsubrp %st, %st(1)           [x^2 - y^2 - z^2, sqrt(...)]
	fdivrp %st, %st(1)           [sqrt(...) / (x^2 - y^2 - z^2)]
	fstl (160*1)(%edi)           [sqrt(...) / (x^2 - y^2 - z^2)]
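
As a sanity check on the Stack column, here is a minimal Python sketch that
replays the listing on a list standing in for the x87 stack (index 0 is %st).
The names reference and replay are made up for illustration and nothing here
is MLton code.  The one subtle point is the AT&T operand quirk in GNU as: with
source %st and destination %st(i), fsub/fsubr and fdiv/fdivr behave reversed,
so "fsubp %st, %st(2)" leaves %st - %st(2) in %st(2).  That reading is what
the Stack column above assumes.

```python
import math

def reference(x, y, z):
    """The source-level expression from the paper's example."""
    a, b, c = x * x, y * y, z * z
    d = math.sqrt(a + b + c)
    return d / (a - b - c)

def replay(x, y, z):
    """Replay the listing; return (stored value, final depth, max depth)."""
    st = []          # st[0] models %st, st[i] models %st(i)
    max_depth = 0

    def push(v):                 # fld / fldl: push a value
        nonlocal max_depth
        st.insert(0, v)
        max_depth = max(max_depth, len(st))

    def fxch(i):                 # fxch %st(i): swap %st and %st(i)
        st[0], st[i] = st[i], st[0]

    def store_and_pop(i, v):     # the "...p" forms: write %st(i), pop %st
        st[i] = v
        st.pop(0)

    push(x)                              # fldl (144*1)(%edi)
    st[0] = st[0] * st[0]                # fmul %st, %st
    push(y)                              # fldl (152*1)(%edi)
    st[0] = st[0] * st[0]                # fmul %st, %st
    push(st[0])                          # fld %st
    push(z)                              # fldl (%eax)
    st[0] = st[0] * st[0]                # fmul %st, %st
    push(st[0])                          # fld %st
    fxch(2)                              # fxch %st(2)
    st[0] = st[0] + st[4]                # fadd %st(4), %st
    store_and_pop(2, st[2] + st[0])      # faddp %st, %st(2)
    fxch(1)                              # fxch %st(1)
    st[0] = math.sqrt(st[0])             # fsqrt
    fxch(3)                              # fxch %st(3)
    store_and_pop(2, st[0] - st[2])      # fsubp %st, %st(2)   (AT&T quirk)
    store_and_pop(1, st[1] - st[0])      # fsubrp %st, %st(1)  (AT&T quirk)
    store_and_pop(1, st[1] / st[0])      # fdivrp %st, %st(1)  (AT&T quirk)
    w = st[0]                            # fstl (160*1)(%edi): store, no pop
    return w, len(st), max_depth
```

For x = 3, y = 1, z = 1 both functions give sqrt(11)/7, one value is left on
the stack after the store, and the stack never gets deeper than 5 of the 8
x87 registers.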

The last fst isn't a pop because the code falls through to a
Real.toString, which begins by calling Real_class.  The computed value
still has to be saved before the C call, though, so the register allocator
tries to be smart: it recognizes when we've hit the last def of a variable
before the end of the block, and then looks for the first opportunity when
that variable is at the top of the stack to save it to memory.
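
A rough way to picture that heuristic, as a sketch only: this is not MLton's
allocator, and the block representation (one pair per instruction: the
variable it defines, if any, and the variable on top of the stack afterwards)
and the name save_point are invented for illustration.

```python
def save_point(block, var):
    """Return the index of the first instruction, at or after the last def
    of `var` in the block, after which `var` sits on top of the stack.
    Returns None if `var` is never defined or never reaches the top.

    `block` is a list of (defined_var_or_None, top_of_stack_var_or_None).
    """
    last_def = None
    for i, (defined, _top) in enumerate(block):
        if defined == var:
            last_def = i          # keep overwriting: we want the LAST def
    if last_def is None:
        return None
    for i in range(last_def, len(block)):
        if block[i][1] == var:    # var is at %st: a plain store works here
            return i
    return None

# Hypothetical block: w is defined at index 1 but only reaches the top
# of the stack after the exchange at index 3.
example_block = [
    ("t", "t"),     # defines t, t on top
    ("w", "t"),     # last def of w, but t is still on top
    (None, "t"),    # unrelated instruction
    (None, "w"),    # an exchange brings w to the top
]
```

With example_block, save_point(example_block, "w") picks index 3, the first
place where a store from the top of the stack would save w without any extra
exchange.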

Same number of instructions and memory references as the example in the
paper, but I have a few more loads in place of exchanges.  I don't know
about the pipelining, but I suspect the above is worse than the paper's.
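
For anyone who wants to redo the comparison, a small Python sketch that
tallies the MLton listing above.  The function name tally and the rule for
spotting a memory operand (a parenthesized %e.. base register) are
assumptions that happen to fit this listing, not a general x86 parser, and
the paper's own counts are not reproduced here.

```python
import re
from collections import Counter

LISTING = """\
fldl (144*1)(%edi)
fmul %st, %st
fldl (152*1)(%edi)
fmul %st, %st
fld %st
fldl (%eax)
fmul %st, %st
fld %st
fxch %st(2)
fadd %st(4), %st
faddp %st, %st(2)
fxch %st(1)
fsqrt
fxch %st(3)
fsubp %st, %st(2)
fsubrp %st, %st(1)
fdivrp %st, %st(1)
fstl (160*1)(%edi)
"""

def tally(listing):
    """Count instructions, memory references, exchanges and loads."""
    counts = Counter()
    for line in listing.strip().splitlines():
        mnemonic, _, operands = line.strip().partition(" ")
        counts["instructions"] += 1
        # Memory operands here look like (...)(%edi) or (%eax);
        # %st(2) does not match because of the leading "%st".
        if re.search(r"\(%e[a-z]{2}\)", operands):
            counts["memory_refs"] += 1
        if mnemonic == "fxch":
            counts["exchanges"] += 1
        if mnemonic in ("fld", "fldl"):
            counts["loads"] += 1
    return counts
```

On this listing the tally is 18 instructions, 4 memory references (three
fldl and the final fstl), 3 exchanges, and 5 loads, of which the two
"fld %st" are the register-to-register copies.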