reusing pseudo-regs

Matthew Fluet <fluet@cs.cornell.edu>
Fri, 1 Mar 2002 13:25:09 -0500 (EST)


> > Reuse pseudo-regs.  While this doesn't shrink the size of the final
> > executable (because the pseudo-reg space is compiled into the BSS
> > section), it should reduce the size of the in-memory executable image.
> > Also, we could get better cache performance by reusing pseudo-regs.
> 
> Cool.  That was something I'd been wanting to do.  I'll be interested
> to see if there is a performance improvement.

I didn't run benchmarks, although it would be trivial to modify
allocateVar to do a Control lookup and switch between the global
allocation and the reuse allocation.

Looking at self-compiles, there is a slight slowdown in pre-codegen time
(I was just looking at -v, so I don't know whether it's all in
allocateRegisters), presumably due to the added cost of managing the
pseudo-reg allocation.  The representation of pseudo-reg allocations is
naive -- a {ty: Type.t, alloc: {index: int} list} list.  So, every
modification of the allocation requires partitioning the list by type
into a singleton list and the rest, making the update at the right type,
and rebuilding the list.  I tried to be careful with appendRev's and not
caring about order when it didn't matter, but I suspect all that list
traversal and rebuilding hurts.  Granted, it's all constant time, because
there are only ever 5 Type.t variants, but still.  What I didn't want to
do was tie allocateRegisters to a 5-tuple representation, which would
give faster lookup and update functions, but would mean having to modify
allocateRegisters when we introduce Int64 and Uint64 variants.  Instead,
I think that the MTYPE signature should export some form of finite
functional map which internally would be represented by a 5-tuple, but
the outside world would just see:

signature MAP = sig
   type 'a map
   val new: (t -> 'a) -> 'a map
   val sub: ('a map * t) -> 'a
   val update: ('a map * t * 'a) -> 'a map
end

That's pretty trivial to implement.  Probably even better to do something
like that in lib/mlton/basic as a finite-map functor.  So, is there
something like that already there that I just don't know the right name
for?
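
For concreteness, here's roughly what I have in mind for the implementation
behind that signature -- just a sketch, and it assumes the five Type.t
variants are named Char, Double, Int, Pointer, and Uint:

structure Map =
   struct
      (* assumes datatype t = Char | Double | Int | Pointer | Uint in scope *)
      type 'a map = 'a * 'a * 'a * 'a * 'a
      fun new f = (f Char, f Double, f Int, f Pointer, f Uint)
      fun sub ((c, d, i, p, u), t) =
         case t of
            Char => c
          | Double => d
          | Int => i
          | Pointer => p
          | Uint => u
      fun update ((c, d, i, p, u), t, x) =
         case t of
            Char => (x, d, i, p, u)
          | Double => (c, x, i, p, u)
          | Int => (c, d, x, p, u)
          | Pointer => (c, d, i, x, u)
          | Uint => (c, d, i, p, x)
   end

Then allocateRegisters only ever sees new/sub/update, and adding Int64 and
Uint64 variants means touching the map implementation, not allocateRegisters.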

But, it is fairly impressive in sheer number of pseudo-regs:

Locals(98, 20, 5109, 5289, 598); -- self-compile with global allocation
Locals(4, 12, 13, 502, 75);      -- self-compile with reuse allocation

The int case is the best reuse ratio -- that suggests that there is one
SSA function with 5109 variables allocated into pseudo-regs, only 13 of
which are ever live in the same block (although the 13 might occur in a
different function, in which case the result is even more surprising).  As
you point out, with per-statement granularity, we might be able to do
better.

Anyways, the overarching goal is to move to a point where I might be able
to use bit-vectors in the codegen to represent liveness sets.  Without
reuse, there are just too many variables (as you note in
backend/liveness.fun).  But, with even this coarse reuse, that's
approaching reasonable.  Allocating a vector of 18 words per basic
block is reasonable; allocating a vector of 347 words per basic block is
not.

Unfortunately, I still haven't worked out how to adapt dynamically enough.
The problem is that I don't know the set of memlocs that I'll be tracking
until after the compilation starts (and we check the -native-live-stack
flag).  By this time, all of the globally defined memlocs have been
allocated and hash-consed (that occurs at functor application, so
essentially before the actual compiler gets going).  Additionally, I could
use a fixed bit-vector size, but as seen above, in order to accommodate a
self-compile, the vectors need to be at least 18 words (and we should be
able to adapt to larger programs), and yet that hurts small programs that
may have many fewer live variables.  Really, it should be done on a
per-SSA-function basis -- figure out how many pseudo-regs are used, put an
upper bound on the size of the bit vector, and go from there.  But, that
doesn't work with the implementation of bit vectors as a functor.  I could
do something like List.set, and return a record of all the necessary
functions specialized to the bit-vector size, but with heavy
functor/structure code, that means threading the functions through
everything.  Another thought was an adaptive bit-vector implementation,
which would internally keep an array of "seen" elements and simply add to
it when necessary.  (In between processing of SSA functions, we'd reset
the "seen" array, and thus for the next function we could use smaller
vectors.)  Unfortunately, that slows down the implementation, because the
set operations won't always be on vectors of the same length (e.g., we
can't use Vector.map2, because it raises when the vectors aren't the same
size).
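
To make the List.set-style idea concrete, here's a rough sketch (nothing
like this exists yet; the names and the representation are made up).  The
point is just that all of the operations get specialized, at run time, to
the number of pseudo-regs in the current SSA function:

fun make {numElements: int} =
   let
      val wordSize = 32
      val numWords = (numElements + wordSize - 1) div wordSize
      (* each liveness set is an array of numWords machine words *)
      fun new () = Array.array (numWords, 0w0: Word32.word)
      fun index i = (i div wordSize, Word.fromInt (i mod wordSize))
      fun add (v, i) =
         let val (w, b) = index i
         in Array.update (v, w, Word32.orb (Array.sub (v, w), Word32.<< (0w1, b)))
         end
      fun contains (v, i) =
         let val (w, b) = index i
         in Word32.andb (Array.sub (v, w), Word32.<< (0w1, b)) <> 0w0
         end
      (* destructively union v2 into v1; both have numWords words by
       * construction, so there is no length mismatch to worry about *)
      fun union (v1, v2) =
         Array.modifyi (fn (w, x) => Word32.orb (x, Array.sub (v2, w))) v1
   in
      {new = new, add = add, contains = contains, union = union}
   end

We'd call make once per SSA function and thread the resulting record through,
which is exactly the threading-through-everything annoyance mentioned above,
but at least the vectors are only as big as each function needs.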

> BTW, here are a couple of other things that were on my todo relating
> to pseudo-reg allocation.
> 
> 1. Use targeting info to avoid shuffles
> 2. Use liveness info at a statement granularity

Both of those would be nice; I think 1 has the potential to be the bigger
win, although I'd be wary of forcing a variable that could live in a
pseudo-reg to live in a stack slot just because it is moved to or from
there.  (I think you had this in mind with your description.)