[MLton] A MLton / C-- Experiment

Sat, 12 Mar 2005 23:21:59 -0500 (EST)

After sufficient time and exposure to the C-- developers, I conceived
of what appeared to be a rather simple experiment in writing a C--
codegen for MLton.  Like the C, x86, and bytecode codegens, this
experimental backend would be required to abide by the existing
garbage collector API/invariants.

* MLton Codegen Background *

Some background for the C-- developers/users:

Each codegen in MLton translates from the Machine IL
(http://mlton.org/Machine), which is untyped intermediate language,
which corresponds to an abstract register machine.  It has proven
useful to share the MLton runtime system between all codegens.  What
this effectively means is that it is most convenient to separate the
ML execution stack from the system stack and decide on ML stack layout
_before_ any codegen takes over. In particular, we compute all the ML
stack frame info for each "procedure" (note, by this time, a
"procedure" is far removed from an individual SML function in the
source code), including stack frame size, GC masks for each frame,
etc. To do so, the Machine IL imagines an abstract machine with an
infinite number of (pseudo-)registers of every size.  A liveness
analysis determines, for each variable, whether or not it is live
across a point where the runtime system might take over (for example,
any garbage collection point) or a call to another "procedure".  Those
that are live go on the ML stack, while those that aren't live go into
psuedo-registers.  From this information, we know all we need to about
each stack frame.  On the downside, nothing further on is allowed to
change this stack info; it is set in stone.

At this point, we either generate C, x86 assembly, or bytecode.  The
one place where the runtime system distinguishes between the codegens
is in how it looks up the frame layout from a 32-bit word.  This
32-bit word is the translation of the return continuation of a
non-tail call in the Machine IL.

If we generate C, each of those pseudo-registers corresponds to a C
local variable.  C is maintaining it's own stack in parallel with the
stack that MLton constructed above.  Now gcc tries to do register
allocation, and it tries to put as many of these pseudo-registers into
physical registers as possible; when it is forced to spill, it spills
to the C stack.  Hence, it doesn't disturb the ML stack.

The C codegen suffers from many of the problems that C-- identifies;
insufficient support for proper tail calls and efficient exceptions.
To compensate, the C codegen uses convoluted trampolining.  Hence, the
C codegen uses small 32-bit integers as the translation of Machine IL
return continuations; these small integers can be used as indices into
a table when the runtime needs to lookup the frame layout.

* Sketch of Proposed C-- Codegen *

My thought was to develop a C-- codegen (for x86 on linux) that
closely follows the C codegen, but utilizes the better control
transfers of C--.  That is, each procedure in the Machine IL would
become a C-- procedure; the pseudo-registers of a procedure in the
Machine IL would become local variables in a C-- procedure.  Tail
calls in the Machine IL would become tail calls in C--.  Handler
continuations in the Machine IL would become C-- continuations, placed
on the ML stack in known locations; raises in the Machine IL would
become cut to transfers in C--.  Returns in the Machine IL could
become either return or cut to transfers in C--.  I've chosen to
translate them to cut to transfers, because the Machine IL associates
ML stack frame info with the return continuation of a non-tail call.
As noted above, we need a way of looking up frame layout information
from the translation of the return continuation of a non-tail call in
the Machine IL.  C-- spans appear to be the proper solution here, but
spans are only associated with activation frames / continuations, so
the Machine IL return continuation would become a C-- continuation.  I
could still translate Machine IL returns into C-- returns, but since
control flow in a C-- procedure cannot implicitly fall through to a
C-- continuation, the corresponding non-tail call would need to either
itself cut to the continuation or jump (via a goto) into the
continuation code.  The remainder of the translation of the Machine IL
is mostly primitives, many of which map directly onto C-- operators,
and C calls.

- Limitations of Proposed C-- Codegen -

The astute reader will also have realized that this scheme is by no
means using C-- to its full potential.  MLton continues to compute
information that C-- would be willing to provide (stack frame layouts,
etc.).

The astute reader also will have realized that C-- is necessarily
maintaining its own execution stack, which coincides with the system
stack.  This places a number of immediate limitations on such an
incarnation of a C-- codegen:
 1) Large ML stacks (corresponding to highly non-tail recusive
    programs) will give rise to large C-- stacks.  Hence, programs
    that require large ML stacks (handled by MLton in the other
    codegens by growing the ML stack in the ML heap) may overflow the
    C-- stack.  However, as MLton has already allocated all variables
    live across a non-tail call on the ML stack, the corresponding C--
    stack frames should be small.
 2) Programs that make use of MLton's thread library cannot be
    supported.  Attempting to multiplex the one C-- stack among
    multiple ML stacks would lead to disaster.
 3) Programs that make use of MLton's world library cannot be
    supported.  Loading a world picks up the ML stack from the ML
    heap, but will not be able to recover the C-- stack.

- Goals and Interpreting Results -

However, the number of programs ruled out by (2) and (3) are
relatively small; and I suspect that the number of programs ruled out
by (1) are also relatively few.  The, not inconceivable, goal would be
to develop a C-- codegen sufficient for compiling and correctly
running the whole of the MLton benchmark suite.  (Unfortunately, a
self-compile is ruled out by (3)).

Certainly the results of such an experiment would be open to
interpretation.  If C-- does well in this constrained setting, we
could reasonably expect that it would do better if the MLton runtime
were written to accomodate the C-- runtime API.  If the C-- compiler
performs does poorly (in compile-time), then we could reasonably
expect that it would do worse when given both the Machine IL
pseudo-registers and the Machine IL stack allocated variables.
In any case, I suspect it would be beneficial to the C-- developers to
have a source of additional C-- programs and to have some inkling of
how C-- fits into a full compiler.

* My C-- Experience : The Good, the Bad, and the Ugly *

The good news is that I can successfully compile and run a number of
tiny SML programs.  The bad news is that I have been unable to support
64-bit operations to a degree sufficient for running realistic
programs; MLton has committed to 64-bit file positions, so any I/O
necessarily pulls in a lot of 64-bit ops.

On the whole, I have found C-- to be a relatively easy language to
target, at least in this simple scenario.  There is a good match
between the Machine IL and C--, so, in the ideal case, it would be
very straightforward to emit the right C-- code.  It has certainly
helped that C-- is a readable language.  I could very easily navigate
through the generated code and look for relevant portions.

I am still confident that the MLton/C-- experiment described here can
be completed, but I believe that I have uncovered a few bugs in qc--,
which are preventing me from making any further progress.  I don't
know how tenable it is to go back to 32-bit file positions to complete
the experiment; a less pleasant solution would be to attempt to
completely simulate 64-bit data with 32-bit data.

- The 64-bit Saga -

Here I will describe my attempt to accomplish 64-bit operations in
C--.  Ultimately, it is the saga of trying to translate the following
Machine IL transfer variant:

   (* In Arith, dst is modified whether or not the prim succeeds. *)
   Arith of {args: Operand.t vector,
	     dst: Operand.t,
             overflow: Label.t,
	     prim: Type.t Prim.t,
	     success: Label.t}

The prim field indicates which operation (from {add,neg,mul,sub}).
The size (from {8,16,32,64}) is indicated by the prim field and the
size of the operands of the dst and args fields.  The overflow and
success fields indicate Machine IL procedure local (and hence, C--
procedure local) labels to which control should be transfered (and
hence, a C-- goto transfer suffices).

First off, without Kevin Redwine's widener, I would have abandoned my
attempts much earlier.  It was conceivable to simulate 64-bit ops, but
trying to simulate 8- and 16-bit ops as well would have been a
non-starter.

As a running example, consider translating

    Arith {prim = WordS64_addCheck,
	   args = (RW64(0): Word64, RW64(1): Word64),
	   dst = RW64(2): Word64,
	   overflow = L_498,
	   success = L_488}

where RW32(n) indicates the nth pseudo-registers (hence, C-- local
variable) of word32 type.

For completeness, the naive translation (note the use of a temporary,
as the dst may correspond to one of the args) yields:

	tmp11 = %add(RW64_0, RW64_1);
	if %add_overflows(RW64_0, RW64_1) {
		RW64_2 = tmp11;
		goto L_498;
	} else {
		RW64_2 = tmp11;
		goto L_488;
	}

to which qc-- complains along the lines of:

File "z.24.cmm", line 2819, characters 2-41: Warning: No acceptable widths for %add on target 'x86'; asked for 64->64->64, target accepts 32->32->32
This can't happen: Asked for temporary in space `general-purpose temporaries' with unsupported width 64
Fatal error: exception Impossible.Impossible("Asked for temporary in space `general-purpose temporaries' with unsupported width 64")

Fair enough.  The next reasonable option would be to peform the 64-bit
addition out of a 32-bit add and a 32-bit add with carry.
Unfortunately, there does not appear to be a way to directly
extracting the high bits of a C-- local variable.  I could extract the
low bits with %lobits, but without %shra at 64-bits, I can't get to
the high bits.  The only solution appears to be copying the 64-bit
source values to memory, extracting the hi and lo bits, perform the
operation, copy the hi and lo bits back to memory, and finally copying
the 64-bit destination value from memory.  So, I try the following:

	bits64[wideSrc1_1] = RW64_0;
	bits64[wideSrc2_1] = RW64_1;
	bits32[wideDst_1 + 4] = %add(bits32[wideSrc1_1 + 4], 
                                     bits32[wideSrc2_1 + 4]);
	bits32[wideDst_1 + 0] = %addc(bits32[wideSrc1_1 + 0], 
                                      bits32[wideSrc2_1 + 0], 
                                      %carry(bits32[wideSrc1_1 + 4], 
                                             bits32[wideSrc2_1 + 4], 
                                             0 :: bits1));
	tmp11 = bits64[wideDst_1];
	if %bool(%carry(bits32[wideSrc1_1 + 0], 
                        bits32[wideSrc2_1 + 0], 
                        %carry(bits32[wideSrc1_1 + 4], 
                               bits32[wideSrc2_1 + 4], 
                               0 :: bits1))) {
		RW64_2 = tmp11;
		goto L_498;
	} else {
		RW64_2 = tmp11;
		goto L_488;
	}

Note that without an %addc_overflows operator, I must instead coerce
the carry out bit to a boolean.  Unfortunately, C-- doesn't really
care for this cast:

Not implemented in qc--: non-binary comparison in conditional guard
Fatal error: exception Impossible.Impossible("non-binary comparison in conditional guard")

Well, coercing from a bit1 to a boolean should be equivalent to
checking for equality with 1 (or, equivalently, inequality with 0).
So, my next try is:

	bits64[wideSrc1_1] = RW64_0;
	bits64[wideSrc2_1] = RW64_1;
	bits32[wideDst_1 + 4] = %add(bits32[wideSrc1_1 + 4], 
                                     bits32[wideSrc2_1 + 4]);
	bits32[wideDst_1 + 0] = %addc(bits32[wideSrc1_1 + 0], 
                                      bits32[wideSrc2_1 + 0], 
                                      %carry(bits32[wideSrc1_1 + 4], 
                                             bits32[wideSrc2_1 + 4], 
                                             0 :: bits1));
	tmp11 = bits64[wideDst_1];
	if %ne(%carry(bits32[wideSrc1_1 + 0], 
                      bits32[wideSrc2_1 + 0], 
                      %carry(bits32[wideSrc1_1 + 4], 
                             bits32[wideSrc2_1 + 4], 
                             0 :: bits1)), 
               0 :: bits1) {
		RW64_2 = tmp11;
		goto L_498;
	} else {
		RW64_2 = tmp11;
		goto L_488;
	}

At this point, I believe that I have something that should be an
acceptable C--/qc-- program.  Indeed, some C-- programs with this
construct successfully compile and assemble.  However some raise the
error:

!! exhibited by bug1.cmm
This can't happen: live variable is in killed set
Fatal error: exception Impossible.Impossible("live variable is in killed set")

and some give a warning from the assembler:

!! exhibited by bug2.cmm
bug2.s:1082: Warning: rest of line ignored; first ignored character is `<'

which corresponds to the following in the generated assembly:

        movl $-112,%ecx
        <Par(Store(Reg(V, 0),Add(Add(Fetch(Reg(r, 4),32), Bits(0xa4::bits32)), F
etch(Reg(r, 1),32)),32),Setflags(X86_addflags(Add(Fetch(Reg(r, 4),32), Bits(0xa4
::bits32)), Fetch(Reg(r, 1),32))))>
        fildq (%eax)

I actually didn't encounter a C-- program that produced the assembly
warning until after I had pursued the "live variables in kill set" a
little further.  I was unable to reproduce the error with a tiny C--
program that did nothing but compute both the 64-bit addition and the
carry out bit of the 64-bit addition.  In particular, the following
two functions compile without error:

export Cmm_Word64_add;
Cmm_Word64_add(bits64 src1, bits64 src2) {
	bits64[wideSrc1_0] = src1;
	bits64[wideSrc2_0] = src2;
	bits32[wideDst_0 + 4] = %add(bits32[wideSrc1_0 + 4], 
                                     bits32[wideSrc2_0 + 4]);
	bits32[wideDst_0 + 0] = %addc(bits32[wideSrc1_0 + 0], 
                                      bits32[wideSrc2_0 + 0], 
                                      %carry(bits32[wideSrc1_0 + 4], 
                                             bits32[wideSrc2_0 + 4], 
                                             0 :: bits1));
	return (bits64[wideDst_0]);
}

export Cmm_Word64_addOverflows;
Cmm_Word64_addOverflows(bits64 src1, bits64 src2) {
	return (%zx32(%carry(bits32[wideSrc1_0 + 0], 
                             bits32[wideSrc2_0 + 0], 
                             %carry(bits32[wideSrc1_0 + 4], 
                                    bits32[wideSrc2_0 + 4], 
                                    0 :: bits1))));
}

So, I concluded that the "live variables in kill set" is a bug that
manifests itself only in the presence of larger functions with more
control flow.  I realized that with the above two functions in hand, I
could go back to a my naive translation, but using these two functions
in place of the %add and %add_overflows at 64-bits.  But, I think, I
can do even better: C-- lets me return multiple values, so I ought to
be able to return both the sum and the overflow boolean with the
following function:

export Cmm_Word64_addWOverflows;
Cmm_Word64_addWOverflows(bits64 src1, bits64 src2) {
	bits64[wideSrc1_0] = src1;
	bits64[wideSrc2_0] = src2;
	bits32[wideDst_0 + 4] = %add(bits32[wideSrc1_0 + 4], 
                                     bits32[wideSrc2_0 + 4]);
	bits32[wideDst_0 + 0] = %addc(bits32[wideSrc1_0 + 0], 
                                      bits32[wideSrc2_0 + 0], 
                                      %carry(bits32[wideSrc1_0 + 4], 
                                             bits32[wideSrc2_0 + 4], 
                                             0 :: bits1));
	return (bits64[wideDst_0],
	        %zx32(%carry(bits32[wideSrc1_0 + 0], 
                             bits32[wideSrc2_0 + 0], 
                             %carry(bits32[wideSrc1_0 + 4], 
                                    bits32[wideSrc2_0 + 4], 
                                    0 :: bits1))));
}

Alas:

!! exhibited by bug3.cmm
This can't happen: dpwiden couldn't widen %shrl(bits64L[wideDst_0], 32)
Fatal error: exception Impossible.Impossible("dpwiden couldn't widen %shrl(bits64L[wideDst_0], 32)")

Turns out that this behavior is not related to 64-bit arithmetic
operations, but rather to the form of the particular return.  To wit,
the following function also exhibits the same error:

Foo() {
	return (bits64[L_0],
	        %zx32(0 :: bits1));
}

!! exhibited by bug4.cmm
This can't happen: dpwiden couldn't widen %shrl(bits64L[L_0], 32)
Fatal error: exception Impossible.Impossible("dpwiden couldn't widen %shrl(bits64L[L_0], 32)")

So, I return to a naive translation, using the functions
Cmm_Word64_add and Cmm_Word64_addOverflows.  At this point, I can
successfully compile and assemble all the emitted C-- code, although
a number of files give assembler warnings occurrences of "<Par(...)>"
in the emitted assembly, so it is unlikely that resulting program
would behave correctly.  Unfortunately, one final obstacle arises:

!! exhibited by bug5.cmm
/tmp/fileGr3XfV.o(.text+0xc37): In function `L_178':
: undefined reference to `.Lexit_l5'
collect2: ld returned 1 exit status

This particular C-- file was one that had assembler warnings; the
above linker error indicates that the emitted assembly has an unbound
label introduced by qc--.  Note that while other C-- files resulted in
assembler warnings, this was the only one that resulted in unbound
labels.

At this point, I have exhausted all of the obvious avenues for
progress.

- Another Issue -

As I noted above, I use C-- spans to lookup the frame layout when
queried by the runtime system.  In particular, in the runtime I have
added: 

static Word32 returnAddressToFrameIndex (Word32 w) {
	Cmm_Cont *k = (Cmm_Cont*)w;
	Cmm_Activation a = Cmm_YoungestActivation (k);
	Cmm_Dataptr ptr = Cmm_GetDescriptor (&a, 1);
	Word32 fi = *((Word32*)ptr);
	return fi;
}

When translating a non-tail Machine IL call, I emit C-- code like the
following:

...
	StackTop = %add(StackTop, 20);
	bits32[StackTop - (1 * 4)] = L_227;
	bits32[gcState + 0] = Frontier;
	bits32[gcState + 8] = StackTop;
	recur_0()
		also cuts to L_227
		also aborts
		;
	return ();

span 1 frameLayoutsIndexSpanData_30 {
continuation L_227():
}
	Frontier = bits32[gcState + 0];
	StackTop = bits32[gcState + 8];
	StackTop = %sub(StackTop, 20);
...

section "data" {
	frameLayoutsIndexSpanData_30:
		bits32 {21};
}

The StackTop variable corresponds to the top of the ML stack; note
that it is pushed by 20 bytes and the return continuation is installed
just below the new stack top.  There is a non-tail call to recur_0;
the also cuts to L_227 flow is the path by which the recur_0 transfer
control back to this function; the also aborts flow corresponds to the
lack of a handler continuation in this function, so a raise by recur_0
will cut past this activation.  Now, since recur_0 will never return
by a normal return, I would prefer to annotate the non-tail call as
follows:

	recur_0()
		also cuts to L_227
		also aborts
		never returns;

However, I have encountered situations where annotating non-tail calls
in this manner causes Cmm_GetDescriptor to fail to find the
corresponding span data.  Note that Cmm_YoungestActivation succeeds,
so an activation corresponding to that continuation is found, but
Cmm_GetDescriptor returns NULL.

Unfortnately, I don't seem to be able to recreate this behavior, but
I'll keep my eye out.

- Minor Comments -

Section 4.1 of the C-- Language Specification states:

It is legitimate to "export" or "import" the same name more than
once.

However, qc-- gives a hard (compilation aborting) error when a name is
imported more than once.

I am sure that the set of primitive operations is a constant source of
discussion.  I'll throw in my 2 cents:

 * Having %neg but not %neg_overflows leads to the incongruous:

	tmp11 = %neg(RW32_0);
	if %sub_overflows((0 :: bits32), RW32_0) ...

   In a vague belief that it will be easier for the qc-- codegen to
   recognize that the overflow check can piggyback on the arithmetic
   op when they are syntactically equal, I have forgone using %neg(x)
   in favor of %sub(0, x).

 * Not having %addc_overflows and %subb_overflows leads to awkward
   constructions for successfully trapping overflow on wide arithmetic
   synthesized from narrow arithmetic.

 * %hibits32 would also be helpful for synthesizing wide arithmetic
   from narrow arithmetic.  Copying to and from memory to extract the
   hi and lo bits seems prohibitively expensive, though a compiler
   could probably optimize some of that away.

 * The only other operations missing from MLton's list of primitives
   were trigonometric functions.

I whole-heartedly endorse an "invisible" declaration on local register
variables.  At virtually every stage of lowering, one wants a supply
of temporaries that mean nothing to the higher IL.  Also, when
lowering from an SSA like form, one is likely to have numerous
variables corresponding to intermediate computations.  Knowing (or at
least believing) that C-- is maintaining per-variable information
makes me wary of introducing too many variables.

* Conclusions *

Despite being unable to achieve the goal of developing a C-- codegen
sufficient for compiling and correctly running the whole of the MLton
benchmark suite, I do not believe that we are that far off.  Overall,
I was quite pleased with the ease with which I could accomplish as
much as I did.  As far as I know, using 64-bit data in the way I am
doing at the end of the saga (that is, simply copying 64-bit values to
and from memory and local variables, and passing to and from foreign
"C--" or foreign "C" functions) is intended to be supported by the
current version of qc--.

-Matthew

PS.  
I will be sending the bug exhibiting examples to bugs@cminusminus.org
shortly.