[MLton-user] 'a ref for C programmers: a primer

Mon Feb 12 10:02:02 PST 2007

As a former C programmer, I've always been confused by 'a ref in SML.
I wanted to state how I've come to think about them for anyone else  
who is confused.

Please let me know if there's a mistake anywhere!

Normal variable bindings are by reference, or pointers to the value  
(in principle).
For example,
> val x = SomeVector
> val y = x
will share space, both point to the same object in the heap. This  
applies (in principle) to every object, even integers. Every object  
lives on the heap. Bindings are just pointers to these objects.   
Whenever you access a value by a name, you are implicitly  
dereferencing the pointer to the object pointed to.

However, MLton is clever enough to sometimes recognize that it's  
cheaper to just copy a variable, rather than store it on the heap.  
This is subject to some safety criteria that ensure the program  
cannot distinguish the behaviour from the 'every binding is a  
pointer' interpretation above. Another criteria is that the object is  
not 'too big', which would waste memory.

With an 'a vector, MLton may choose to pack the object directly into  
the vector. For example, int vector will not include binding pointers  
for each cell. If the contained object is too big, or too  
complicated, MLton may still use a pointer in the vector if it chooses.

When you write
> val x = ref 0
> val y = x
this creates a special mutable pointer on the heap. The binding 'x'  
points to the mutable pointer. The mutable pointer points to the  
integer 0. Thus, when you copy x to y, both binding x and binding y  
point to the same mutable pointer. When updating a ref, you first  
follow the binding's pointer to the mutable pointer, then modify the  
mutable pointer to point at something else. This is why 'y := 5'  
updates the value of '!x'.

Again, MLton is clever about these pointers and can optimize them  
away subject to safety criteria. In the above example, it could  
recognize that the integer 0 could simply be stored where the mutable  
pointer should be. This is analogous to how it can decide to  
eliminate the binding pointer for non-ref values. However, in this  
case it is the mutable pointer saved, not the binding pointer. If the  
ref'd object is too complicated, it may not be able to apply this  
optimization.

MLton may also be able to save the binding pointer. However, this  
savings will never place the mutable pointer on the stack, as a  
reference from the heap to the mutable pointer could outlive the  
current stack frame. So, when you have a variable like 'x' above, you  
are pretty much guaranteed that there is at least one level of  
indirection going on.

When an 'a ref is contained in a larger structure, MLton may be  
clever enough to embed the mutable pointer directly into the  
structure. It can only do this if the structure lives on the heap.  
For example, if you have a datatype or tuple that includes a mutable  
pointer, MLton might choose to put the mutable pointer directly  
inside the object. MLton never copies an object with an embedded  
mutable pointer, as this would be violate the 'every binding is a  
pointer' interpretation. Unfortunately, if you ever need to make  
another binding to these embedded refs, this optimization cannot be  
applied due to MLton's GC (no pointers to the inside of an object).

'a array is similar, except that MLton can safely embed all the  
references in the containing vector. This is safe, because there do  
not exist functions to make an 'a ref from the cell of an array. This  
is not the case with 'a ref vector. Here, the user might copy a ref  
binding out of the vector, which forbids MLton from embedding the  
reference into the vector, due to the GC.

In fact, MLton is so clever that it can often apply both  
optimizations at the same time. For a int array, MLton knows it can  
use the array cells (the mutable pointers) to just store the int in  
place of the mutable pointer. Thus, int array works as you would like  
it to with the FFI. However, from a conceptual point of view, there  
are still the two levels of pointers. Similarly, with a datatype  
containing an int ref, MLton may just place the int directly into the  
datatype. Modifying the ref cell modifies the value in the datatype.  
This object will never be copied and the user cannot not make new  
bindings to the embedded ref (or this optimization would not happen).

So, there's a reason it's confusing: it is complicated! Conceptually,  
the 'every binding is a pointer' and 'a ref introduces a mutable  
pointer' are correct. However, in practice, both of these pointers  
can often be eliminated completely by MLton.