CWS paper

Stephen Weeks MLton@sourcelight.com
Tue, 21 Nov 2000 15:14:08 -0800 (PST)


> Kathleen mentioned that she had talked with you about it.  There is a draft
> of the paper on the Moby web page (http://www.cs.bell-labs.com/who/jhr/moby).
> I would be interested in any comments/suggestions you might have.

Hi John.  Thanks for the pointer to your paper.  I read it, and the optimization
is similar to the contification optimization performed in MLton.

The analogue of Moby's BOL in MLton is the CPS IL.  Moby's BOL is different in
that BOL uses direct style while MLton uses continuation-passing style, but they
share the idea of directly expressing the control flow of (nested) loops with
tail recursive functions that share the same stack frame.  The analogue of the
analysis and transformation you describe in your paper is a CPS to CPS
optimization in MLton called contification.  Both analyses determine when
functions "have a single continuation" and can share the stack frame of their
caller.  Both transform such functions so that nontail calls to the function
become tail calls and so that returns from within the function become tail calls
to the known continuation.

The analyses are different in that Moby computes an approximation to the set of
continuations that a function is called with, and transforms the function if
that set is a singleton.  On the other hand, MLton determines if a function is
called in exactly one place from outside its body and if all calls from within
its body are tail calls.  If so, MLton "contifies" the function (i.e. makes it
share the stack frame of its caller).  The transformations are slightly
different because MLton's CPS IL is already in continuation-passing style, so no
local CPS needs to be performed.  Also, MLton's CPS IL is first-order (closure
conversion has already happened), so the transformation does not have to worry
about free variables or be integrated with closure conversion.
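The call based criterion is simple enough to sketch in a few lines.  Here is a
rough rendering in Python (the call-graph representation is my own, not MLton's
actual data structures): a call site is a (caller, callee, isTail) triple, and a
function is contifiable if it has exactly one call site outside its own body and
every self call is a tail call.

```python
# Hypothetical sketch of the call based criterion; the representation
# here is mine, not MLton's.  A call site is (caller, callee, is_tail).

def contifiable_call_based(f, calls):
    # Call sites targeting f from outside f's own body.
    outside = [c for c in calls if c[1] == f and c[0] != f]
    # Nontail call sites targeting f from within f's own body.
    self_nontail = [c for c in calls if c[1] == f and c[0] == f and not c[2]]
    return len(outside) == 1 and not self_nontail
```

On the loop example below (f tail-calls loop once, and loop only tail-calls
itself), this predicate accepts loop, while a function with two outside call
sites is rejected.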

There are many situations where both analyses will determine that a given
function can be contified.  But there are also cases that each will catch that
the other does not (examples later).

As an experiment, I extended MLton's contification optimization with a
whole-program data-flow analysis, similar to the one in your paper, to determine
the set of continuations that a function is called with.  I used the same flat
lattice as in your paper.  So that we have some terminology, I'll call the analysis
in Moby "continuation based" and the analysis in MLton "call based".
Contification runs at three places in MLton's optimizer, with intervening
optimization passes.  I ran all my usual benchmarks, counting at each
contification pass how many functions each analysis determined could be
contified.  The following table shows the results.  

For each benchmark there are three rows, one for each contification pass.
There are four columns: the total number of functions in the program, the number
of functions that both analyses determined could be contified, and one column
each for the number of functions that each analysis determined could be
contified that the other could not.  So, for example, the first row of
barnes-hut shows that both analyses could contify 61 of the 119 functions.  The
third column shows that the continuation based analysis could contify an
additional 2 functions that the call based analysis could not.  The fourth
column shows that the call based analysis could contify an additional 5
functions that the continuation based analysis could not.

		 funcs  both cont call
barnes-hut                         
                   119    61    2    5
                    52     6    1    0
                     7     0    0    0
checksum                           
                    13     9    0    0
                     4     1    0    0
                     1     0    0    0
count-graphs                       
                   107    55    6    5
                    47    10    5    0
                    11     0    2    0
fft                                
                    59    38    0    1
                    20     3    0    0
                     2     0    0    0
fib                                
                    11     7    0    0
                     4     1    0    0
                     2     0    0    0
knuth-bendix                       
                   112    37    3    6
                    71     4    1    2
                    34     0    0    0
lexgen                             
                   244   102    6   18
                   124     3    5    0
                    37     0    0    0
life                               
                    44    23    0    2
                    19     1    0    0
                     7     0    0    0
logic                              
                    73     8    0    7
                    58     1    0    0
                    31     0    0    0
mandelbrot                         
                    14    11    0    0
                     3     1    0    0
                     1     0    0    0
matrix-multiply                    
                    21    14    0    0
                     7     2    0    0
                     1     0    0    0
merge                              
                    16    12    0    0
                     4     1    0    0
                     2     0    0    0
mlton                              
                  7703  2549  379  708
                  3964    71  374   13
                  1023     4   48    0
mlyacc                             
                   820   345   13   72
                   400     9   12    1
                   111     1    3    0
mpuz                               
                    36    23    0    2
                    11     1    0    0
                     2     0    0    0
nucleic                            
                    34    16    0    2
                    16     1    0    0
                     7     0    0    0
ratio-regions                      
                   118    64    0    2
                    52     6    0    0
                     8     0    0    0
simple                             
                   264    84    1    7
                   173     6    0    0
                    23     0    0    0
smith-normal-form                  
                   108    47    3    3
                    58     7    0    1
                     6     0    0    0
tak                                
                    12     8    0    0
                     4     1    0    0
                     2     0    0    0
tensor                             
                   157    90    3    7
                    60     8    2    1
                     7     0    0    0
tsp                                
                    58    27    2    1
                    30     3    1    0
                     8     0    0    0
vliw                               
                   506   188   19   39
                   275    22   15    2
                    92     0    7    0
wc-input1                          
                    68    34    1    4
                    30     1    0    0
                     6     0    0    0
wc-scanStream                      
                    76    40    1    3
                    33     3    0    0
                     6     0    0    0
zern                               
                    50    33    2    0
                    17     4    0    0
                     2     0    0    0


Keep in mind that while both analyses were being performed, only the call based
contification transformation was happening.  The numbers show that for most of
the benchmarks, by the last round of contification, there are no contifiable
functions detectable by either analysis.  However, for a few of the benchmarks
(count-graphs, mlyacc, mlton, vliw), there are some functions that remain
detectable by the continuation based analysis.

I extended MLton's contification pass to transform based on the results of
either, both, or neither analysis.  Here are the running times in seconds for
all of the benchmarks, compiled four different ways: without any contification,
with just call based contification, with just continuation based contification,
and with both.

	          neither  call   cont   both
barnes-hut           9.5    5.0    9.4    5.0
checksum             8.1    3.9    3.9    3.9
count-graphs        14.3   11.0    9.8    8.7
fft                 26.1   20.8   21.1   20.8
fib                  4.7    4.9    4.9    4.9
knuth-bendix        11.8   14.0   10.2   14.0
lexgen              35.1   25.3   27.6   25.8
life                34.6   24.3   32.4   24.6
logic               55.5   49.2   54.6   49.2
mandelbrot          37.8    8.6    8.6    8.6
matrix-multiply      8.6    5.4    5.3    5.4
merge               47.0   44.7   44.6   44.6
mlton              674.2  388.3  579.8  394.2
mlyacc              28.5   15.7   25.1   15.8
mpuz                50.0   19.8   50.1   19.8
nucleic             15.1   13.0   14.7   12.9
ratio-regions       31.4   14.3   14.6   14.2
simple              14.1   12.7   14.0   12.7
smith-normal-form    1.2    1.2    1.2    1.2
tak                 12.8   12.9   12.9   12.9
tensor               5.0    4.0    4.6    4.5
tsp                 20.0   13.9   13.9   13.9
vliw                20.6   10.6   15.8   10.6
wc-input1            3.6    3.3    3.4    3.3
wc-scanStream       37.4   11.4   11.3   12.2
zern                59.9   34.7   34.7   34.8

So, for many of the benchmarks, turning off contification causes a huge
performance loss.  For many benchmarks, call based and continuation based
perform equally well.  For some benchmarks (barnes-hut, life, logic, mlton,
mlyacc, mpuz, vliw), call based performs significantly better.  For a couple of
benchmarks (count-graphs, knuth-bendix), continuation based performs better.
Unfortunately, there is even one benchmark (knuth-bendix) where the combination
does worse than either alone.

My guess is that most of the speed differences arise because of subsequent
inlining opportunities.  In particular, in MLton, functions are inlined, but
continuations are not.  Given that the other optimizations on MLton's CPS IL
were added after contification (which was the first optimization added to the
CPS IL, in September 1998), it's no surprise that they seem to work better with
the call based analysis than with the continuation based analysis.  But I
believe that each analysis, when run in isolation, misses important cases.  I
also think it should be possible to have the optimizer behave as well when run
on the results of the combined analysis as with either separately.  So, for now, I
think I'll keep "both" as the default analysis in MLton and try to look into why
call based contification is hurting knuth-bendix.

Now, for some details, including examples of programs that each analysis catches 
that the other doesn't.

Here is a description of MLton's CPS IL.  A program is a collection of mutually
recursive functions.  The language is lexically scoped, simply-typed and first
order.  Functions take multiple arguments and return multiple values.  The
bodies of functions are in CPS form (it may be a bit hard to see that at
first).
Continuations are defined within a function, and can be implemented by sharing
the stack frame of their enclosing function.  Loops are expressed via recursive
continuations.  

Here are the atomic concepts in the CPS IL.

	c in Const
	C in Con
	k in Cont
	x in Var
	f in Func
	p in Prim

Here is the grammar.  In the rules, I use <> around both nonterminals and atoms.
Comments are at the right.

<program>  ::= <fun>*
<fun>      ::= fun <f> (<x>*) = <exp>		toplevel function declaration
<exp>      ::= let <dec>* in <transfer> end		
<dec>      ::= fun <k> (<x>*) = <exp>		continuation declaration
             | val <x> = <simple>		simple expression
<simple>   ::= <c>				constant
             | <x>				variable
             | <p> (<x>*)			primitive application
             | <C> (<x>*)			constructor application
             | (<x>*)				tuple construction
             | #i <x>				tuple selection
<transfer> ::= <f> (<x>*)			tail call
             | <k> (<f> (<x>*))			nontail call
	     | <k> (<x>*)			jump
	     | <x>*				return
	     | case <x> of (<C> => <k>)*	constructor discrimination
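To make the grammar concrete, here is one way the <transfer> alternatives might
be modeled as data.  This is a Python sketch of my own, not MLton's actual
representation.

```python
# My own modeling of the <transfer> alternatives from the grammar above,
# not MLton's internal representation.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TailCall:         # <f> (<x>*)
    func: str
    args: List[str]

@dataclass
class NontailCall:      # <k> (<f> (<x>*))
    cont: str
    func: str
    args: List[str]

@dataclass
class Jump:             # <k> (<x>*)
    cont: str
    args: List[str]

@dataclass
class Return:           # <x>*
    results: List[str]

@dataclass
class Case:             # case <x> of (<C> => <k>)*
    test: str
    branches: List[Tuple[str, str]]  # (constructor, continuation) pairs
```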


As an example, here is what the doubly nested loop example from your paper would
look like, before contification.  I've taken a little liberty with the syntax,
so that you can read the program as SML.  I've also assumed a simple definition
of f.  But hopefully you can see how this program fits in the grammar above.

fun f (x) = ()
fun applyf (n) = lp_i (n, 0)
and lp_i (n, i) = 
   let
      fun K1 () = ()
      fun K2 () = 
	 let
	    fun K3 () = 
	       let
		  val i' = i + 1
	       in
		  lp_i (n, i')
	       end
	 in
	    K3 (lp_j (n, 0))
	 end
      val b = i < n
   in
      case b of
	 false => K1 ()
       | true => K2 ()
   end
and lp_j (n, j) =
   let
      fun K4 () = ()
      fun K6 () =
	 let
	    val j' = j + 1
	 in
	    lp_j (n, j')
	 end
      fun K5 () = K6 (f (j))
      val b = j < n
   in
      case b of
	 false => K4 ()
       | true => K5 ()
   end

On the above program, call based analysis observes that lp_j is called in
exactly one place from outside itself and that all self calls within lp_j are
tail calls.  Hence, lp_j can be turned into a continuation within lp_i.  On the
other hand, continuation based analysis computes that the set of continuations
that lp_j is called with is {K3}.  So, lp_j can be turned into a continuation
and prefixed onto K3.  So, both analyses do the right thing on this program.
Here is the resulting contified program.

fun f (x) = ()
fun applyf (n) = lp_i (n, 0)
and lp_i (n, i) = 
   let
      fun K1 () = ()
      fun K2 () = 
	 let
	    fun K3 () = 
	       let
		  val i' = i + 1
	       in
		  lp_i (n, i')
	       end
	    fun lp_j (n, j) =
	       let
		  fun K6 () =
		     let
			val j' = j + 1
		     in
			lp_j (n, j')
		     end
		  fun K5 () = K6 (f (j))
		  val b = j < n
	       in
		  case b of
		     false => K3 ()
		   | true => K5 ()
	       end
	 in
	    lp_j (n, 0)
	 end
      val b = i < n
   in
      case b of
	 false => K1 ()
       | true => K2 ()
   end

Here is an example that the call based analysis catches that continuation based
does not. 

fun f () = let ... in loop () end
fun loop () = let ... in loop () end
fun g () = ...  K1 (f ()) ...  K2 (f ()) ...

In the above example, f makes a tail call to loop, which only makes tail calls to
itself.  So the call based analysis will contify loop within f.  On the other
hand, because g calls f with two different continuations, the continuation based 
analysis will compute the set of continuations that loop is called with as {K1,
K2}, and decide not to contify loop.  As the counts above showed, this does
happen in practice.
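The flat lattice computation on this example can be sketched as follows (my own
Python rendering; the formulation in the paper differs in detail).  Bottom is
represented as None, a single continuation by its name, and top by the string
"TOP".

```python
# Join in the flat lattice of continuations (a sketch, not the paper's
# actual code): None is bottom, "TOP" is top, and everything else is a
# single continuation name.

def join(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else "TOP"

# g calls f with continuations K1 and K2, so f's abstract continuation
# joins to top; loop is tail-called only from f, so it inherits f's value.
A_f = join(join(None, "K1"), "K2")   # "TOP"
A_loop = join(None, A_f)             # also "TOP", so loop is not contified
```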

Here are two examples that the continuation based analysis catches that the
call based analysis does not.

Example 1.

fun f () =
   let
      fun K () = ...
      ...
         K (g (x))
      ...
         K (g (y))
fun g () = ...

In the above example, there are two calls to g from outside g, so the call based 
analysis will not contify g.  On the other hand, since both calls are with the
same continuation, the continuation based analysis will compute the set of
continuations that g is called with as {K}, and can therefore contify g within
f.

Example 2.

fun f () = ... g () ... g () ...
fun g () = ... f () ... 
fun h () = ... K (f ()) ...

In the above example, there are two calls to f from outside of f and two calls
to g from outside of g, hence call based analysis will contify nothing.  On the
other hand, because all of the calls between f and g are tail calls, and there
is a single call from outside with continuation K, continuation based analysis
will be able to contify both f and g within h.
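Here is a sketch, again in Python of my own devising, of how the continuation
based analysis reaches a fixed point on Example 2: a tail call propagates the
caller's abstract continuation to the callee, and a nontail call contributes the
named continuation.

```python
def join(a, b):
    # Flat lattice: None is bottom, "TOP" is top.
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else "TOP"

# Call sites for Example 2: (caller, callee, cont), where cont is None
# for a tail call and a continuation name for a nontail call.
calls = [("f", "g", None), ("f", "g", None),
         ("g", "f", None), ("h", "f", "K")]

A = {"f": None, "g": None}   # abstract continuation of each function
changed = True
while changed:
    changed = False
    for caller, callee, cont in calls:
        contrib = cont if cont is not None else A.get(caller)
        new = join(A.get(callee), contrib)
        if new != A.get(callee):
            A[callee] = new
            changed = True
# A ends up {"f": "K", "g": "K"}: both singletons, so both are contifiable.
```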

I observed variants of both of these examples in the benchmarks.

A slight variant of example 2 gives an example that neither analysis catches.

fun f () = ... g () ... g () ...
fun g () = ... f () ... 
fun h () = ... K1 (f ()) ... K2 (f ()) ...

Call based analysis will fail for the same reason as above, and continuation
based analysis fails because there are now two continuations that f and g are
called with.  However, one would still like to contify g within f so that it
can share f's stack frame and so that more optimizations will apply.

A final point.  You may have noticed that MLton's CPS IL doesn't directly
support mutually recursive continuation declaration.  This could cause problems
when contifying a block of functions that share the same continuation, as
determined by the continuation based analysis.  However, because continuation
declarations can be nested, it is possible to define mutually recursive
continuations by nesting one within another.  This approach was sufficient to
contify all of the functions in all of the benchmarks detected as contifiable
by the continuation based analysis.