[MLton] cvs commit: rewrote x86.Block.compress to run in linear time

Thu, 1 Jul 2004 20:31:22 -0700

> > * localRef is taking too long.  I still don't know why.  It's not the
> >   multi subpass.  Any ideas?  In any case, that's the only remaining
> >   glaring problem in the pre codegen.
> 
> You might trace the SSA- restore pass with a Control.traceBatch.

I put a trace around restore and here's what I got

	    localRef starting
	       multi starting
	       multi finished in 0.61 + 5.93 (91% GC)
	       restore starting
	       restore finished in 0.91 + 0.00 (0% GC)
	       restore starting
	       restore finished in 0.00 + 0.00 (0% GC)
	       restore starting
	       restore finished in 0.00 + 0.00 (0% GC)
	    localRef finished in 99.32 + 56.54 (36% GC)

So, it looks there are only three functions with local refs, and the
restores don't hurt.

> Looking at the -verbose 3 you sent out earlier, the biggest problem seems
> to be in copyPropagate.  I glanced at that code and it does have a
> quadratic running time in the size of the basic block.
> You can turn that off with -native-copy-prop false.
> 
> -native-optimize 0 will also cut down the toLiveness time in
> allocateRegisters.  Essentially, that is responsible for computing the
> hints to the register allocator.

Here's a comparison of a compile with -native-optimize {1,0}.

			-native-optimize 
			1		0
			----------	----------
compile time (s)	4836		4054
code size (bytes)	25,519,250	27,904,878
run time (s)		273		303

Overall, there was about a 20% improvement in compile time, 10% larger
code, and 10% slower code.

Here are some details from the x86 codegen.

-native-optimize 1
	 outputAssembly starting
	    translateChunk totals 39.02 + 10.62 (21% GC)
	    simplify totals 1584.76 + 92.90 (6% GC)
	       completeLiveInfo totals 215.69 + 3.44 (2% GC)
	       toLivenessBlock totals 128.13 + 0.95 (1% GC)
	       moveHoist totals 227.31 + 28.56 (11% GC)
	       peepholeLivenessBlock totals 249.96 + 27.58 (10% GC)
	       copyPropagate totals 744.30 + 30.50 (4% GC)
	    generateTransfers totals 115.97 + 4.83 (4% GC)
	    allocateRegisters totals 922.38 + 40.59 (4% GC)
	       toLiveness totals 652.69 + 3.31 (1% GC)
	       Assembly.allocateRegisters totals 264.26 + 36.12 (12% GC)
	 outputAssembly finished in 3636.43 + 152.73 (4% GC)

-native-optimize 0
	 outputAssembly starting
	    translateChunk totals 39.17 + 9.61 (20% GC)
	    simplify totals 127.05 + 2.73 (2% GC)
	       completeLiveInfo totals 126.81 + 2.73 (2% GC)
	       toLivenessBlock totals 0.00 + 0.00 (0% GC)
	       moveHoist totals 0.00 + 0.00 (0% GC)
	       peepholeLivenessBlock totals 0.00 + 0.00 (0% GC)
	       copyPropagate totals 0.00 + 0.00 (0% GC)
	       peepholeLivenessBlock_minor totals 0.00 + 0.00 (0% GC)
	    generateTransfers totals 160.16 + 4.87 (3% GC)
	    allocateRegisters totals 1777.27 + 39.32 (2% GC)
	       toLiveness totals 1315.40 + 1.59 (0% GC)
	       Assembly.allocateRegisters totals 454.60 + 35.82 (7% GC)
	 outputAssembly finished in 3061.40 + 60.55 (2% GC)

So, it looks like that with -native-optimize 0 there's a lot of
savings to the simplifier, much of it is lost in allocate registers.

If we could improve the quadratic-ness of copyPropagate, we could
probably get most of that time back.  Any thoughts on the difficulty
of that?