CVS Commit

Matthew Fluet fluet@CS.Cornell.EDU
Wed, 12 Sep 2001 16:10:28 -0400 (EDT)


http://www.cs.cornell.edu/People/fluet/MLton/x86-allocate-registers.tgz

Changed the implementation of the Cache directive.  Rather than processing
each MemLoc->Register directive independently, inspect all of them and
choose the best order for processing.  Helps out on cases where we need to
load a value from memory into ebx, and move the value in ebx to eax.
Previously, we might have processed the mem->ebx first and gotten:
movl mem,%ecx
xchgl %ecx,%ebx
movl %ecx,%eax
Now, we get the better
movl %ebx,%eax
movl mem,%ebx


The above should be fine for the CVS log, but here's a little more
background.
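
To make the reordering in the log message concrete, here is a rough sketch
in Python.  It is not the code in x86-allocate-registers (which is SML);
the names order_cache, operand and REGS are made up for illustration, and
it assumes each destination register appears at most once in the list.
The idea: emit a move only when no other pending move still needs to read
its destination, and fall back to xchgl when only a register cycle is left.

```python
# Hypothetical sketch, not MLton's implementation.
REGS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}

def operand(x):
    """Format a register as %reg; leave anything else (a memory operand) alone."""
    return "%" + x if x in REGS else x

def order_cache(directives):
    """directives: list of (src, dst) pairs, dst a register name, src either
    a register name or a memory operand.  Assumes destinations are distinct.
    Returns AT&T-syntax instructions in an order that never clobbers a
    register before its old value has been read."""
    pending = [(s, d) for s, d in directives if s != d]
    out = []
    while pending:
        # A move is safe if no *other* pending move reads its destination.
        safe = None
        for s, d in pending:
            if all(s2 != d for s2, d2 in pending if d2 != d):
                safe = (s, d)
                break
        if safe is not None:
            s, d = safe
            out.append("movl %s,%s" % (operand(s), operand(d)))
            pending.remove(safe)
        else:
            # Every destination is still read by another move, so (with
            # distinct destinations) only register cycles remain.
            s, d = pending.pop(0)
            out.append("xchgl %s,%s" % (operand(s), operand(d)))
            # The two registers traded contents; update the remaining sources.
            swap = {s: d, d: s}
            pending = [(swap.get(s2, s2), d2) for s2, d2 in pending]
            pending = [(a, b) for a, b in pending if a != b]
    return out

# The case from the log message: mem -> ebx and ebx -> eax.
print("\n".join(order_cache([("mem", "ebx"), ("ebx", "eax")])))
# movl %ebx,%eax
# movl mem,%ebx
```

Processing the directives in the order given would overwrite ebx before its
old value reached eax, which is what forced the scratch register and the
xchgl in the old output.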

(I've been playing around with the live-transfer code, trying to improve
when stuff is loaded into registers.  For example, a common idiom that I'm
trying to eliminate is entry into a CPS function, where we've got
something like:

print_0:
	movl (4*1)(%edi),%edx
	movl (8*1)(%edi),%ebp
statementLimitCheckLoop_16:
	cmpl ((gcState+48)+(0*4)),%edi
	jae doGC_24
checkFrontier_1:
	leal (1048*1)(%esi),%esp
	cmpl ((gcState+8)+(0*4)),%esp
	jnbe doGC_26
skipGC_47:
	...

doGC_24:
doGC_26:
	...
	movl $L_505,(0*4)(%edi)
	call GC_gc
	...
	jmp *(0*4)(%edi)

L_505:
	addl $-20,%edi
	movl (4*1)(%edi),%edx
	movl (8*1)(%edi),%ebp
	jmp statementLimitCheckLoop_16

The "problem" is that I've decided that SI(4) and SI(8) are in %edx and
%ebp respectively on entry to statementLimitCheckLoop_16 -- so they are
loaded at both entries to the label.  Really, it would be better to delay
putting SI(4) and SI(8) into registers until skipGC_47.  Now, the GC loop
isn't a critical section, but we are duplicating code.  (On the other
hand, we may be helping things by the fact that we load the values well
before their use -- no stalling the pipeline.)  On yet another hand, we do
want to keep values in registers around an allocating loop.  Anyways,
these are the things I'm thinking about.  In playing around with stuff, I
got an experimental version that really wasn't doing too well relative to
the current version (about 1.2 on vector-concat).  I saw the above on one
of the hot loops (occurring in both the new and the old version), and made
the x86-allocate-registers changes.  Now the new version is about 0.6
relative to the old version -- even with the "bad" live register choices.
So, this should help out on the current version, although more so with
-native-live-stack true than with the default.)

Anyways, Steve spoiled me rotten this summer with access to a speedy PIII
with loads of memory -- this PIII 500MHz with 192M isn't nearly as much
fun.  Steve, if you get a chance to incorporate the changes and feel like
posting a G1 or G2, I'd appreciate it.