[MLton] mlb support

Thu, 24 Jun 2004 11:54:20 -0700

> I've been working a little on supporting mlb files, as they were last
> discussed (http://www.mlton.org/pipermail/mlton/2004-March/015645.html).
> I've gotten the two "big pieces" working -- lexer/parser and elaborator.

Great!

> The elaborator is very straight-forward. 

Good.  I was worried there might be some issues with the fact that
environments are mutable or are missing some scoping capabilities.

> Interestingly, the elaboration of an MLB file yields no decs -- only
> the sub-elaboration of a .sml file yields decs.

Makes sense, since MLBs only talk about module-level stuff.

> 1) With symbolic links, one can have multiple paths to the same file.
> Should an .mlb file that is included through different paths be treated as
> the same .mlb (i.e., elaborated exactly once)?

Yes.  It seems easy enough to check the file identity (inode number).
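
A minimal sketch of that check, in Python rather than SML for brevity.
It assumes a POSIX-style file system, where the pair (device, inode)
identifies a file; the inode number alone is only unique within one
file system.  The names file_identity and MlbCache are made up for
illustration and are not part of MLton.

```python
import os


def file_identity(path):
    """Return a key that is equal for any two paths naming the same file."""
    st = os.stat(path)  # os.stat follows symbolic links
    return (st.st_dev, st.st_ino)


class MlbCache:
    """Hypothetical cache: elaborate each distinct .mlb file at most once."""

    def __init__(self, elaborate):
        self._elaborate = elaborate  # function: path -> elaboration result
        self._results = {}           # file identity -> elaboration result

    def get(self, path):
        key = file_identity(path)
        if key not in self._results:
            self._results[key] = self._elaborate(path)
        return self._results[key]
```

With this, a file reached both directly and through a symbolic link
maps to one cache entry, so it would be elaborated exactly once.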

To answer the other questions, let's consider in general the ways in
which the basis library is currently special (i.e. different from user
code).

  1. MLton knows (and has hardwired in various places) that the
     program is split into two pieces, the first piece being the basis
     library and the second piece being the user code.  This split
     affects the behavior of dead code elimination, and the many
     def-use info flags (-show-basis-used, -show-def-use,
     -warn-unused).
  2. The basis library can use various language extensions not
     available to user programs (rebinding of equals, _const
     expressions).
  3. Elaboration of the basis implicitly creates a primitive
     environment with basic types (bool, int, ...).
  4. The basis library sets some essential hooks that the compiler
     internals depend on.  For example, it calls Exn.setInitExtra,
     which is in turn used by the ImplementException pass.  Also, the
     use of _basisDone MLtonFFI records the structure needed by the
     elaborator to implement _import and _export declarations.

I think the key is that we need to separate the splitting of the
program into two pieces from the other facets and provide a way for
the user to specify the two pieces.  I propose that instead of viewing
every program as one mlb, we view it as two: b.mlb u.mlb.  We use this
split to treat b.mlb like we currently treat the basis and treat u.mlb
like we currently treat the user program.  The other facets can all be
implemented as annotations on mlb files.

So, for example, for backward compatibility, "f.sml" becomes

	$(SML_BASIS)/basis-2002.mlb 
	f.sml

where the second line means "the mlb file consisting of one line:
f.sml".  Similarly, "f.cm" becomes

	$(SML_BASIS)/basis-2002.mlb 
	f1.sml ... fn.sml

where the second line means "the mlb file consisting of n lines with
one line for each fi.sml".
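
The backward-compatible translation above could be sketched roughly
as follows (Python, for illustration only).  The function
as_two_pieces and its cm_members parameter are hypothetical; reading
the member list out of a real .cm file is not shown, so the caller
is assumed to supply it.

```python
import os

# The basis mlb named in the examples above.
BASIS = "$(SML_BASIS)/basis-2002.mlb"


def as_two_pieces(arg, cm_members=None):
    """Return (b_mlb_lines, u_mlb_lines) for one command-line argument.

    cm_members is the ordered list of .sml files a .cm file refers to
    (assumed to be computed elsewhere).
    """
    ext = os.path.splitext(arg)[1]
    if ext == ".sml":
        # u.mlb is "the mlb file consisting of one line: f.sml".
        return [BASIS], [arg]
    if ext == ".cm":
        # u.mlb has one line for each member file.
        return [BASIS], list(cm_members or [])
    raise ValueError("unsupported input: " + arg)
```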

We can use the notion of split to solve

> 2) Dead code pass.
> 5) -show-basis
> 6) -show-basis-used
> 7) -warn-unused
> 8) -show-def-use

For 5, the split causes us to display only the basis produced by
u.mlb, which I think corresponds to what happens now as well as to
the encoding you gave.  For 6, 7, 8, the call to Env.clearDefUses
occurs after elaborating b.mlb.  This will cause the def-use
information to be for u.mlb only.
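
To illustrate why clearing between the two pieces is enough, here is
a toy model in Python.  It is not MLton's Env: DefUseTable and
elaborate are invented names, and a "piece" is simply assumed to be
a list of ("def", name) and ("use", name) events.

```python
class DefUseTable:
    """Toy def-use table: maps each defined name to its number of uses."""

    def __init__(self):
        self._uses = {}

    def define(self, name):
        self._uses[name] = 0

    def use(self, name):
        # Uses of names that are no longer tracked are ignored.
        if name in self._uses:
            self._uses[name] += 1

    def clear(self):
        """Forget everything recorded so far (stands in for clearDefUses)."""
        self._uses = {}

    def unused(self):
        return sorted(n for n, count in self._uses.items() if count == 0)


def elaborate(table, piece):
    """Replay the def/use events of one piece into the table."""
    for kind, name in piece:
        if kind == "def":
            table.define(name)
        else:
            table.use(name)
```

Elaborating b.mlb, calling clear, and then elaborating u.mlb leaves
only names defined in u.mlb in the table, so an unused-definition
report would no longer mention unused parts of b.mlb.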

For 2, our dead code pass will treat b.mlb as it currently treats the
basis, i.e. with aggressive dead-code elimination, and u.mlb as it
currently treats the user program, i.e. with safe dead-code
elimination.  I think this is a better approach than using
annotations, at least for now, since it more directly generalizes
what we currently have.  It also interacts well with the -dead-code
flag.

To make splitting available for more than just the basis library, we
need to add the flags we discussed before: -{load,save}-basis.  The
idea is that building an MLton program consists of a sequence of
calls of the form

	mlton -load-basis z0.basis -save-basis z1.basis z1.mlb
	mlton -load-basis z1.basis -save-basis z2.basis z2.mlb
	...
	mlton -load-basis zn-1.basis -save-basis zn.basis zn.mlb

ending with a final call

	mlton -load-basis zn.basis z.mlb

which actually builds the executable.  This is where the split is
determined: we treat zn.basis as the basis and z.mlb as the user
program for the purposes of 5, 6, 7, 8.
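
A small Python sketch of how a build driver might generate that
sequence.  It only constructs argument lists and runs nothing; the
-load-basis and -save-basis flags are the ones proposed here, not
options confirmed to exist in any released mlton, and build_commands
is a hypothetical helper.

```python
def build_commands(initial_basis, stages, final_mlb):
    """Return the list of mlton invocations for a staged build.

    initial_basis: the basis file loaded by the first call (z0.basis).
    stages: ordered (mlb_file, saved_basis_file) pairs, one per call
            that saves a basis.
    final_mlb: the mlb treated as the user program (z.mlb).
    """
    commands = []
    current = initial_basis
    for mlb, saved in stages:
        commands.append(
            ["mlton", "-load-basis", current, "-save-basis", saved, mlb]
        )
        current = saved  # the next call loads what this one saved
    # The final call saves nothing; it builds the executable, and the
    # split falls between the loaded basis and final_mlb.
    commands.append(["mlton", "-load-basis", current, final_mlb])
    return commands
```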

The rest can be handled by annotations.  I don't know the right
syntax.  Maybe something like

<bdec> ::= ! <ann>* (<bdec>)

<ann> ::= allowConst
        | rebindEquals
        | setFFI <longstrid>
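
One way to read this grammar is that an annotation is in force only
inside the <bdec> it wraps.  The Python sketch below models that
reading.  The tuple encoding of bdecs and the function
annotations_by_file are assumptions made for illustration, and
setFFI's <longstrid> argument is left out.

```python
def annotations_by_file(bdec, active=frozenset()):
    """Map each source file in a bdec tree to the annotations in force.

    Assumed encoding of bdecs:
      ("file", name)        a single source file
      ("ann", anns, body)   ! <ann>* (<bdec>)
      ("seq", [bdec, ...])  a sequence of bdecs
    """
    kind = bdec[0]
    if kind == "file":
        return {bdec[1]: active}
    if kind == "ann":
        _, anns, body = bdec
        # Annotations extend the active set only for the enclosed body.
        return annotations_by_file(body, active | frozenset(anns))
    if kind == "seq":
        result = {}
        for child in bdec[1]:
            result.update(annotations_by_file(child, active))
        return result
    raise ValueError("unknown bdec kind: " + str(kind))
```

Under this reading, wrapping only the basis sources in allowConst
keeps constant lookup off for any user file that follows.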

> Stephen's mantra of being able to do everything without extra/proxy
> files.  I don't know that being able to annotate arbitrary basdecs
> is necessarily better.

Yeah, it still seems nice to me, as long as it doesn't cause problems,
to have the annotations apply to <bdec> rather than <foo>.mlb.

In any case, annotations solve

> 3) Lookup constants.  I can't use lookupConstantError, because both
> basis-library and user code are elaborated within the same .mlb.  Again,
> I suggest annotations as a way of turning on constant lookup within the
> basis and keeping it off within user code.  (Something similar might apply
> to the rebinding of equals.)

I also added setFFI as an annotation in the hope that we could move
_basis_done from being a language extension to part of mlbs.

Finally, regarding

> 9) Empty programs

Since we need the primitive datatypes to do anything, let's not worry
about handling completely empty programs.  I don't like the idea of
making the empty basis used to elaborate an mlb correspond to the
primitive basis.  That could be confusing to someone who writes an mlb
and forgets to include any basis.  I'd rather they see an error that
says "bool not defined" or whatever.  I'd like to see mlbs elaborated
in a completely empty environment, and use a _prim bdec to cause the
primitive environment to be included.  Prefixing "local _prim in end"
to all programs seems like the right fix to ensure that the primitive
decs are always there.

As to making sure that the basis is always included so that we get the
top-level hooks set, perhaps we could put in a check that happens once
we're ready to compile the whole program.  We could make sure that,
e.g., setFFI has been called and, if not, report an error, or at
least a warning.
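
Such a check might look roughly like this (a Python sketch).  The
hook names come from the discussion above; check_hooks, its strict
parameter, and the idea of collecting the hooks in a set are
hypothetical.

```python
# Hooks the compiler internals are assumed to depend on.
REQUIRED_HOOKS = ("setInitExtra", "setFFI")


def check_hooks(hooks_set, strict=True):
    """Check, just before whole-program compilation, that hooks were set.

    hooks_set: names of the hooks recorded during elaboration.
    Returns a list of warning messages (empty if nothing is missing);
    raises RuntimeError instead when strict is True.
    """
    missing = [h for h in REQUIRED_HOOKS if h not in hooks_set]
    if not missing:
        return []
    message = "basis hooks never set: " + ", ".join(missing)
    if strict:
        raise RuntimeError(message)
    return [message]
```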

BTW, if you found any errors in the static semantics that I sent,
please send a corrected version.  Hopefully it will make it into
documentation someday.