CVS Commit

Matthew Fluet Matthew Fluet <fluet@CS.Cornell.EDU>
Wed, 26 Sep 2001 09:56:37 -0400 (EDT)


http://www.cs.cornell.edu/People/fluet/MLton/tree.tgz
/src/lib/mlton/basic/tree.{sig,sml}

Added layout function.


http://www.cs.cornell.edu/People/fluet/MLton/x86-codegen.tgz
/src/mlton/codegen/x86-codegen/x86.{sig,fun}
                               x86-loop-info.{sig,fun} (* new files *)
                               x86-live-transfers.{sig,fun}
                               x86-generate-transfers.fun

Abstracted the calculation and conversion of the loop nesting forest to
x86-loop-info.*.  Modified x86-generate-transfers to be "smarter" about
code layout.  At an Iff {condition, truee, falsee} transfer, the heuristic
is as follows:
 if either the truee's block or falsee's block has already been generated,
  then branch to that one, and fall thru to the other
 if neither of the truee's and falsee's blocks have been generated,
  then fall thru to the label with the least "loop-distance" from the
  Iff's label (where loop distance from l1 to l2 is NONE if l1 and l2 are 
  in non-nested loops, and SOME (d(l2) - d(l1)) where d(l) is the depth of
  the deepest occurence of l in the loop forest); this favors staying in
  the same loop or entering nested loops over exiting loops
 if both labels have the same loop distance, fall thru to the label
  with more live variables
 if both labels have the same number of live variables, fall thru to truee
Changed the "todo" list of blocks to be processed to a double-ended queue;
Near transfer labels (gotos, ifs, cases) are pushed; far transfers
(handlers, conts) are enqued.  Nested loops tend to come out the right
way.

(* End CVS log.  More elaboration below. *)

Using the loop forest "solved" the problem in vector-concat, where the
inner-loop was split:

loop_10:
        cmpl %edx,%ecx
        movl %ecx,(28*1)(%ebp)
        je L_70
L_34:
        movl (36*1)(%ebp),%ecx
        xchgl %esi,%ebx
        xchgl %esi,%edi
loop_5:
        movl (-2*4)(%ebx),%eax
        cmpl %eax,%esi
        jnl L_71
L_35:
        cmpl %eax,%esi
        jae L_72
L_37:
        movl (%ebx,%esi,4),%eax
        incl %esi
        jo L_86
noOverflow_2:
        movl %eax,(%ecx,%edx,4)
        incl %edx
        jo L_86
noOverflow_3:
        movl (28*1)(%ebp),%ecx
        xchgl %edi,%ebx
        xchgl %esi,%edi
        jmp loop_10
...
.p2align
L_71:
        testl $0x3,%edi
        jnz L_79
L_102:
        movl (%edi),%ebx
        movl (4*1)(%edi),%edi
        xorl %esi,%esi
        jmp loop_5

Now we get:

loop_10:
	cmpl %edx,%ecx
	movl %ecx,(28*1)(%ebp)
	je L_71
L_34:
	movl (36*1)(%ebp),%ecx
	xchgl %esi,%ebx
	xchgl %esi,%edi
loop_5:
	movl (-2*4)(%ebx),%eax
	cmpl %eax,%esi
	jl L_72
L_40:
	testl $0x3,%edi
	jnz L_73
L_94:
	movl (%edi),%ebx
	movl (4*1)(%edi),%edi
	xorl %esi,%esi
	jmp loop_5
...
.p2align 2
L_72:
L_35:
	cmpl %eax,%esi
	jae L_74
L_37:
	movl (%ebx,%esi,4),%eax
	incl %esi
	jo L_84
noOverflow_2:
	movl %eax,(%ecx,%edx,4)
	incl %edx
	jo L_84
noOverflow_3:
	movl (28*1)(%ebp),%ecx
	xchgl %edi,%ebx
	xchgl %esi,%edi
	jmp loop_10

This has the "problem" that the bottom of the loop_10 loop is far away.  I
realized that this was due to the fact that I use enque to add blocks to
the layout todo list.  After finishing with the loop_5 loop, the bottom of
the loop_10 loop is ready to go, but it's after everything that has been
enqued so far.  So, I changed it to a double-ended queue.  Near transfer
labels (gotos, ifs, cases) are pushed; far transfers (handlers, conts) are
enqued.  For the most part, this works; nested loops generally come out
the right way:

loop_10:
	cmpl %edx,%ecx
	movl %ecx,(28*1)(%ebp)
	je L_71
L_34:
	movl (36*1)(%ebp),%ecx
	xchgl %esi,%ebx
	xchgl %esi,%edi
loop_5:
	movl (-2*4)(%ebx),%eax
	cmpl %eax,%esi
	jl L_72
L_40:
	testl $0x3,%edi
	jnz L_73
L_94:
	movl (%edi),%ebx
	movl (4*1)(%edi),%edi
	xorl %esi,%esi
	jmp loop_5
.p2align 2
L_73:
L_95:
	movl (globalpointer+(4*4)),%esi
	jmp L_0
.p2align 2
L_72:
L_35:
	cmpl %eax,%esi
	jae L_74
L_37:
	movl (%ebx,%esi,4),%eax
	incl %esi
	jo L_84
noOverflow_2:
	movl %eax,(%ecx,%edx,4)
	incl %edx
	jo L_84
noOverflow_3:
	movl (28*1)(%ebp),%ecx
	xchgl %edi,%ebx
	xchgl %esi,%edi
	jmp loop_10

The invervening L_73 block is raising an exception; it's small here, so
it's o.k.  Overall, this sped vector-concat up from 5.0s to 4.9s on the
benchmarks.  On the other hand, it slowed tailfib down from 13.5s to
15.1s; I was somewhat surprised by this, but I think I've figured it out. 
The assembly instructions are identical in the old and new versions.  The
key loops are the following in both versions: 

loop_1:
	testl %esi,%esi
	jz L_41
L_20:
	movl $38,%edx
	xorl %ecx,%ecx
	movl $1,%edi
fibP_0:
	testl %edx,%edx
	jz L_42
L_21:
	decl %edx
	jo L_34
noOverflow_0:
	addl %ecx,%edi
	jo L_44
noOverflow_1:
	xchgl %edi,%ecx
	jmp fibP_0
...
.p2align 2
L_42:
L_24:
	cmpl $39088169,%ecx
	jne L_48
L_26:
	decl %esi
	jo L_34
noOverflow_2:
	jmp loop_1

What's different in the two versions are the relative positions of
L_42,L_34,L_44, and L_48.  In the new version, L_44 is at the head of the
queue when we finish the fibP_0 loop; unfortunately, the entire program
has been reduced to one CPS function, and raise-to-jump means that L_44
leads right into the top-level exception handler.  By the time L_42 gets
to the head of the queue, we've layed down a lot of code, so we need a
large offset for jz L_42.  Similar arguments for the various other lables. 
I was able to recover the lost time simply by reordering some of the
blocks in the resulting assembly file.  (Note, in all versions, the order
of the blocks are the same; so the static branch predictions should be the
same in all cases; obviously the dynamic branch predicitions are
identical.)  So, I think I've convinced myself that it's just bad icache
usage with the new version.  I think the right thing to do is make sure
that L_42 is layed down earlier; again, the loop forests should help; the
overflow labels aren't part of any loops, whereas L_42 is related to the
fibP_0 loop.  I think the right thing is some sort of priority queue for
the todo list, but I'm not quite sure what the right metric is.  Any
suggestions?