[MLton] HOL's Moscow ML implementation, and pushing MLton to emulate it

Stephen Weeks MLton@mlton.org
Fri, 8 Apr 2005 12:11:04 -0700


> > A while ago, I read Nick Benton's paper
> > (http://mlton.org/References#Benton05) and really liked it, but if I
> > recall correctly, its solution to the problem involved coercions.
> > These are slow because they take time proportional to the size of
> > the data structure being coerced.  Plus, there are difficulties with
> > sharing (I don't remember if the paper adequately addressed those).
> 
> Nick's paper suggests using the exceptions embedding/projection trick.  

I took a look at Nick's paper again.  In section 5, he explains the
embedding/projection approach using a universal datatype, and points
out

  Whilst this kind of embedding is semantically elegant, it is
  extremely inefficient.  The problem is that there are multiple
  representations for datatype values, and each time they cross the
  boundary between the two languages, they are converted in their
  entirety from one representation to another.
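To make the cost concrete, here is a minimal sketch of the universal-datatype approach Nick describes (constructor names are mine, not his): embedding or projecting an int list visits every cell, so each boundary crossing is O(n).

```sml
(* A toy universal type for the hosted language. *)
datatype univ = UInt of int | UList of univ list

(* Crossing the boundary converts the value in its entirety. *)
fun embed (xs : int list) : univ = UList (map UInt xs)

fun project (u : univ) : int list =
   case u of
      UList us => map (fn UInt i => i | _ => raise Fail "ill-typed") us
    | _ => raise Fail "ill-typed"
```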

I completely agree with his analysis of the problem, and reject that
approach because of its inefficiency (BTW, he also rejects laziness as
a solution -- see footnote #7 on page 14).  Nick then goes on to
explain the exceptions embedding/projection trick, about which he says

  This gains efficiency at the considerable cost of losing the ability
  to deal with datatypes polymorphically.  One also has to be careful
  not to apply newtype twice to the same ML type, since the resulting
  embeddings will be type incompatible.
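For reference, the exception trick he means can be sketched as follows (this is the standard construction, not code from the paper): exn is SML's extensible universal type, so each call to newtype mints a fresh embedding/projection pair that works in constant time, but two calls at the same ML type yield incompatible embeddings, which is exactly the pitfall quoted above.

```sml
(* Use exn as a universal type; each call generates a fresh
   embedding/projection pair via a generative local exception. *)
type univ = exn

fun 'a newtype () : ('a -> univ) * (univ -> 'a option) =
   let
      exception E of 'a
   in
      (E, fn E x => SOME x | _ => NONE)
   end
```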

I believe the implication of the inability to deal with datatypes
polymorphically is that there is no generic list embedding, and this
is precisely what causes the problem I mentioned:

> > The biggest
> > problem I've never resolved to my satisfaction is how to lift
> > polymorphic values in the host environment to the hosted environment.
> 
> I don't understand your concern.  The hosted environment is an interpreter 
> running over its own universal datatype.  When you expose the polymorphic 
> function to the interpreter, it is specialized to the universal datatype.  
> Even if the interpreter supports a typed language with polymorphism, the
> terms it interprets must all fall under one universal datatype.
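Concretely, instantiating a polymorphic host function at the interpreter's universal type looks like this (a sketch, with a hypothetical two-constructor universal type):

```sml
(* Hypothetical universal type for the interpreter. *)
datatype univ = UInt of int | UList of univ list

(* A polymorphic host function... *)
fun tabulate (n : int, f : int -> 'a) : 'a list =
   let fun loop i = if i >= n then [] else f i :: loop (i + 1)
   in loop 0 end

(* ...exposed to the interpreter is specialized to univ: *)
val tabulateU : int * (int -> univ) -> univ list = tabulate
```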

The problem occurs when values go between the two worlds.  In the
host world, we use MLton's specialized data representations, while in
the hosted world, we use a universal datatype.

Suppose that we export the following two functions from the host world
to the hosted world

 val tabulate1: int * (int -> 'a) -> 'a list
 val sum1: int list -> int

then in the hosted world we directly define

 val tabulate2: int * (int -> 'a) -> 'a list
 val sum2: int list -> int
 
and then compute

 val cs1 = tabulate1 (100, chr)
 val cs2 = tabulate2 (100, chr)
 val _ = sum1 cs1 + sum2 cs1 + sum1 cs2 + sum2 cs2

In the host world, sum1 is compiled to deal with whatever specialized
representation of int lists that MLton cooks up.  Hence, both cs1 and
cs2 must be coercible to that representation in constant time.  Since
lists are of unbounded size, the only way this can happen is for cs1
and cs2 to essentially contain the specialized representation of the
int list (unless Dan's laziness tricks work).  Hence, we must arrange
for both tabulate1 and tabulate2 to produce this specialized
representation.

It is difficult for the exported tabulate1 to produce the specialized
representation, since nowhere in the source code that MLton sees is
there a version of tabulate1 specialized to int list.  Somehow, the
tabulate1 that we export to the hosted world must be capable of
generating every specialized list representation that is used anywhere
in the source code.  Similarly, tabulate2, which is defined entirely
in the hosted world, must be capable of generating those same
specialized representations (while falling back to the universal
datatype for, e.g., real list, assuming that type doesn't appear in
the code).

Basically, I think the problem comes down to figuring out a way to
represent all of the types known at compile time using the host
representation, to embed these in the interpreter's universal type
(the exception trick isn't really necessary since we've got the whole
program), and to figure out how to make the interpreter always use the
right representation of a type so that there are no coercions.  I
believe doing this requires some pre-specialization of polymorphic
code (e.g. tabulate1 above) in anticipation of its use at certain
types in the interpreter.  This may be impractical due to code
explosion, or it may be impossible due to an infinite amount of
duplication.  I don't know.
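One way to picture the coercion-free universal type I have in mind (a hypothetical sketch, not a design): give it one constructor per host representation known at compile time, so a host value crosses the boundary by a single constructor application in constant time, never by copying its spine.

```sml
(* One constructor per specialized representation occurring in the
   whole program; injection is O(1), no per-element conversion. *)
datatype univ =
   UIntList of int list      (* MLton's specialized int list, injected whole *)
 | UCharList of char list
 | UInt of int
 | UFun of univ -> univ
```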