x86 backend design

Thu, 22 Jun 2000 18:48:00 -0400 (EDT)

Warning: this is a bit long, and a little rambling at times, but I think
there are some good ideas here and would appreciate feedback.

Here are some more notes/thoughts on the x86 backend.  I thought I had a
decent approach in mind, but I'm running into some snags.

Overall, here's the idea I had.  Given the types of Statements (and
PrimOps) and Transfers that emerge from backend.fun, the main instructions
are pretty much dictated.  So, an initial translation can be made into
assembly where we ignore all restrictions and let most of the operands be
some sort of reference to their memory location.  Ideally, at this point
in time, very few registers have actually been used for something.  Now,
we run over the instruction stream and "fix" up all the restrictions; ex.,
eliminate memory-memory moves by moving through a register, ensure that
multiply and divide use the correct registers, etc.  Also, by tracking
which memory locations have been moved into registers, we eliminate other
references to the same location by using the register.  In addition, the
instruction stream would have some directives to this fixing/optimizing 
pass. In particular, a "Flush" directive would force everything in
registers back to the memory locations in preparation for either a gc or a
ffi.  Similarly, indicating that particular registers are or are not
available during paricular sections.

Obviously, all sorts of optimizations could be incorporated into the
fixing phase, but initially, we can start of with a very simple one that
doesn't look ahead or back up and simply flushes the register to its
memory location when it needs to free something.

The snag that I'm currently running into is how best to translate the
MachineOutput.Operand type into these abstract memory locations (or
immediate values, which simply are't flushed).  Essentially, I would like
a representation such that equality indeed means the same memory location 
(at least locally).  Some of the variants are pretty simple:

Char, Int, UInt => immediate values (as integers)
Label           => immediate value (as string)
Global          => mem loc of immediate expression
                   (ex., Global{index = 5, ty = Int} 
                           => MemLoc(_globalI + (5 * 4))
                    and the linker will resolve the expression into an
                    absolute address into the statically allocated array 
                    of integer globals)
GlobalPointerNonRoot => mem loc of immediate expression
                        (same as Global, except with _globalNonRoot 
                         as base address)
Register        => mem loc of immediate expression
                   (same as for Global, except with _localX as base address;
                    I'm proposing the name "local" for what the C-backend
                    is calling registers, to avoid confusion with x86
                    registers)
Pointer         => immediate value (as integer)
                   (C backend only does a cast, so we ignore it.)

Here are some tricky ones:
ArrayOffset     =>
Offset          =>
StackOffset     => memory locations of some sort.
                   Essentially, they all have the same pattern of:
                     base + index * scale
                   Problem is, they don't all have the same type of base
                   or index.  ArrayOffset and Offset both have arbitrary
                   operands as base and index, while StackOffset has the
                   implicit stackTop as the base and a (constant) integer
                   offset.  In practice, I don't think that ArrayOffset
                   and Offset really have arbitrary operands.  (In fact, I
                   can verify (by changing the type of the offset
                   field) that ArrayOffset will always have a constant
                   integer offset.)  From looking through some source,
                   I've only seen Stack, Register, or Global in the base
                   position, and Stack, Register and int constant in the
                   offset position for ArrayOffset.
	           Anyways, the point being that these aren't "deep"
                   structures. But, they aren't totally simple from the
                   instruction point of view.
                   OI(SP(3),5) => movl stackTop,%eax ; %eax = stackTop
                                  movl 3(%eax),%eax  ; %eax = SP(3)
                                  movl 5(%eax),%eax  ; %eax = OI(SP(3),5)
                   Even more complicated for array offset with stack in
                   both base and index.

                   Now, simply representing these memory locations isn't
                   necessarily difficult, but doing so in a way that will
                   make optimizations feasible is a little tricky.  Ex.,
                   there is a sequence in the output that looks like:
                      RU(3) = XU(RP(2), 1);
                      RU(4) = XU(RP(2), 0);
                   It would be nice to expose the fact that RP(2) is
                   indeed the same memory location and might benefit from
                   being in a register.  Likewise, the whole XU(...) is of
                   a fairly nice form, in that the index is constant and
                   the base doesn't require any extra indirection
                   (as it would with SP(2)); that information should be
                   exposed as well.

                   Essentially, I don't want to turn the above into:
                      movl _localP + (2*4),%eax
                      movl $1,%ebx
                      movb (%eax,%ebx,1),%ah
                      movb %ah,_localU + 3
                      movl _localP + (2*4),%eax
                      movl $2,%ebx
                      movb (%eax,%ebx,0),%ah
                      movb %ah,_localU + 4
                   Better to do
                      movl _localP + (2*4),%eax ; only calculate RP(2) once
                      movb 1(%eax),%ah          ; use constant index
                      movb %ah,_localU + 3
                      movb 0(%eax),%ah          ; use constant index
                      movb %ah,_localU + 4
                   We save three instructions (including a memory
                   reference), and save a register.

                   There are other complications regarding the MLton
                   stack.  For example, if we just choose to equate %esp
                   with the stackTop, then we've got minor issues when we
                   go to do a "C" call, because if all SX references
                   default to off of %esp, then there's some trickery to
                   be done when pushing values for the "C" call onto the
                   restored "C" stack.  Hence, the desire to leave SX
                   operands as some offset from the dereferenced stackTop
                   static location.  We then hope that the fixing stage
                   recognizes that memory location at stackTop is being
                   used a lot, and allocates a register to it.  Then at
                   the site of a "C" call, we insert the directive to say 
                   that %esp is no longer available, and everything can be 
                   dealt with the same way.  Too much to ask??

CastInt         => mem location of some sort
                   This is slightly simpler than the above,
                   but has some of the same traits.
                   In general CastInt can be applied to any Operand,
                   but it's only ever used in a SwitchIP transfer,
                   so the argument is going to be a GP, RP, or SP.
                   In fact, in the programs I've looked at, it only
                   comes up once, but on an SP, which is the most
                   complicated.

                   Actually, as I think about it, we might be able to
                   ignore it all together.  It only adds a cast,
                   so I think that we just transparently deal with it
                   as we would the wrapped value.  In the int case,
                   we'll just be dealing with it as a long the same way
                   we would if it had been a pointer.

Less clear are the following:
Void            => ???  I can't actually write an sml program which
                   results in a Void operand printing "void" to the .c
                   file.
IntInf          => ???  Same as Void, I can't actually figure out when 
                   such an operand lives to the end.
Prim            => mem loc of immediate value (as string)
                   (Steve and I have talked some about what a nullary 
                    _prim or _ffi should be translated into.  This version
                    would mean that a file which uses a nullary prim
                    should be linked against an object file which
                    statically declares that variable equal to the value.
                    Not entirely satisfactory, since it means that #DEFINE
                    type _ffi's will incur a level of indirection (and not
                    benefit from whatever constant propagation gcc was
                    doing).  There probably other user end _ffi issues to
                    work out, so this is probably o.k. for now.)
Float           => immediate value (as string) 
                   (Maybe, maybe not.  I need to think a bit more about
                    how to handle floating point.  I'm not sure that we
                    want approach it the same way that gcc does it, which
                    ammounts to computing the bits of a floating point
                    constant, do an immediate move into a stack location,
                    and then push it onto the floating point register.
                    Since neither SML/NJ nor MLton has the PackReal
                    structure yet, it's just a lot of work to compute the
                    bits, and the assembler is willing to do it for us anyways.
                    Looking at Intel instructions, there isn't any way to
                    reference an immediate floating point value;
                    everything needs to be loaded in from a memory
                    location. The alternative to gcc is to treat
                    floats the way strings are treated.  Put all floating
                    point constants into the .data section via .float
                    constructors and load them via labels.  The .data
                    section grows a little, but the .text section shrinks
                    by just a little more, so no space worry.  Possibly a
                    slight runtime hit, since we might get a cache miss to
                    the .data section as opposed to a cache hit to the
                    stack.  Anyways, I think we could potentially forgoe
                    floats on the first version of the backend.)

That's what I'm mulling over right now.  After getting all these specifics
down, I'm sort of leaning towards a small hierarchical set of memory
location operands.  There would be a little more overhead in walking over
them, but it might be worth it to let the type system enforce the fact
that we won't ever have "deep" structures.

ex. operand = Register  \
            | Immediate  -> as before, these are "real" operands that
            | Address   /    make sense to show up in the assembly file
            | MemLoc of memloc
Initially, we translate into instructions that exclusively use MemLoc's as
operands.  Then the fixing phase turns all MemLoc's into the appropriate
addresses and/or registers.

ImmMemLoc = ImmMemLoc of immediate expression
(* like locals and globals 
 *
 * These could essentially be turned into Address operands,
 *  but we want to isolate it for the hierarcy.
 *
 * Also, we could have it that MemLoc's could potentially be
 *  cached in registers, but Address's will just be output.
 *
 * Note that ImmMemLoc's that are equal really are the same memory
 *  location throughout the program execution.
 *)

SimpMemLoc = SimpMemLoc of {base: ImmMemLoc,
                            offset: immediate expresion}
(* like stack offsets      SI(3) 
 * and some offsets        OI(RP(1), 3)
 * and some array offsets  XU(RP(2), 1)
 *)

ComplexMemLocA = ComplexMemLocA of {offset: SimpMemLoc,
                                    base: immediate expression}
(* like some offsets       OI(SP(1), 4) 
 * and some array offsets  XU(SP(2), 4) -- although I've never seen this
 *)

ComplexMemLocB = ComplexMemLocB of {offset: SimpleMemLoc,
                                    base: ImmMemLoc}
(* like some array offsets XU(SP(5), RI(2))
 *)

ComplexMemLocC = ComplexMemLockC of {base: SimpleMemLoc,
                                     offset: SimpleMemLoc}
(* like some array offsets XU(SP(5), SI(2))
 *)

There are some other combinations, but nothing should ever be more
complicated than ComplexMemLocC.  I've never seen XU(RP(3), RI(2)),
although it might happen.  (Actually, maybe it can't.  At the time of
array creation, the base is certainly live across the limit check that
occurs at the creation.  And if you pass the array to another function, it
will be passed on the stack.  So, I think that can't happen.)

This might also give us a way of "scaling up" the fixing phase.  Throw our
hands up at the ComplexMemLoc's until we've got something working, then
start to be more aggressive.

I know this is a bit much to swallow, but comments and suggestions are
welcome.  It's at least been beneficial to write down all the details and
think about all the issues.

-M