[MLton] More on parallelism

Matthew Fluet fluet at cs.cornell.edu
Mon Dec 4 12:41:25 PST 2006


> I have a few questions about the viability of parallelism.  I'm doing 
> distributed/parallel systems research, and I'd like to use ML as a primary 
> language.

That would be great; the conjecture that systems programming could be 
improved by using high-level languages can always use more support.

> I'd like to know what exactly would need to be done to implement CML using 
> OS-level threads?  I've run across some abstracts for talks by John Reppy on 
> the subject, and I read the papers from the earlier thread on parallelism in 
> october.  But exactly what would need to be done?  (Depending on what is 
> needed, I could possibly work on parts of it)

It's a complicated answer, but it seems to generally break down into a few 
major components:

1) Extend the runtime system so that multiple OS-threads can be spawned to
    run ML code.  This is really the biggest challenge, since it requires
    changing the garbage collector and the various invariants between the
    garbage collector and the mutator.

    At ICFP, we discussed some ways of handling this.  The suggestion by
    Stephen was to run a multi-threaded MLton program with the generational
    garbage collector, where the nursery is divided among the OS-level
    threads, effectively giving each one its own private allocation space.
    The advantage of this scheme is that it preserves some of the garbage
    collector structure -- like the fact that the heap is contiguous.

    There is a lot of work needed to accomplish this, but here are some of
    the pieces:

    a) Move the various fields of struct GC_state (runtime/gc/gc_state.h)
    that are used by ML code (in particular, frontier, limit, stackTop,
    stackLimit, exnStack, atomicState?, limitPlusSlop, stackBottom) into
    their own struct, which should be accessible in some OS-thread-local
    manner (see the sketch after (c) below).  The GC_state will likely
    maintain a linked list of these structs to keep tabs on all of the
    running ML code.

    b) Rework the GC code to parcel out the nursery into individual
    nurseries; rework the GC entry code so that when one mutator exhausts
    its private heap space, the other mutators are stopped (by setting
    their frontiers to 0, which will trigger them to enter the GC; see
    the sketch after (c) below).  This will allow a single-threaded GC to
    garbage collect the world, reset the various nurseries, and start up
    the mutators again.

    c) Fix various race conditions.  There may not be that many at this
    level, where we are only interested in maintaining the thread safety of
    the garbage collector and runtime system.  Card marking of the old
    generation doesn't need to be protected, as the modifications by the
    mutator are monotonic, and the single-threaded GC will clear all the
    cards.  Large allocations by the mutator that are pre-tenured need to
    be protected.
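
    To make (a) and (b) a little more concrete, here is a rough C sketch
    of the split and of the frontier-zeroing handshake.  The field names
    are the ones listed in (a); every other name (GC_mutatorState,
    GC_self, savedFrontier, GC_stopTheWorld), the types, and the locking
    are hypothetical and would have to be reconciled with what the
    generated code and the real struct GC_state actually expect.

        #include <pthread.h>
        #include <stdint.h>

        typedef uint8_t *pointer;

        /* Sketch: the mutator-visible fields pulled out of struct
         * GC_state into a per-OS-thread structure; the generated ML code
         * touches these on its fast paths (allocation, stack and limit
         * checks, exceptions). */
        struct GC_mutatorState {
          pointer frontier;        /* next free byte in this nursery piece */
          pointer limit;           /* end of this nursery piece */
          pointer limitPlusSlop;
          pointer stackTop;
          pointer stackLimit;
          pointer stackBottom;
          pointer exnStack;
          uint32_t atomicState;
          pointer savedFrontier;   /* so a zeroed frontier can be restored */
          struct GC_mutatorState *next;  /* linked into GC_state's list */
        };

        struct GC_state {
          /* ... existing shared fields: heap, old generation, etc. ... */
          struct GC_mutatorState *mutators;  /* all running mutators */
          pthread_mutex_t lock;              /* protects shared state */
        };

        /* One way to make a mutator's own state reachable in an
         * OS-thread-local manner. */
        static _Thread_local struct GC_mutatorState *GC_self;

        /* When one mutator exhausts its private nursery, it stops the
         * others by zeroing their frontiers, as described in (b), so
         * their next check sends them into the GC entry code.  A real
         * version also needs a barrier (or signal) so the write is seen
         * promptly, plus a rendezvous so the collector knows when
         * everyone has stopped. */
        void GC_stopTheWorld(struct GC_state *s) {
          struct GC_mutatorState *m;
          pthread_mutex_lock(&s->lock);
          for (m = s->mutators; m != NULL; m = m->next)
            if (m != GC_self) {
              m->savedFrontier = m->frontier;
              m->frontier = 0;
            }
          pthread_mutex_unlock(&s->lock);
          /* ... wait for all mutators, collect single-threaded,
           * re-parcel the nursery, restore frontiers, and restart the
           * mutators ... */
        }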

2) Rework the MLton.Thread implementation to make use of multiple OS-level
    threads.  It is likely that the right framework here is that a runtime
    option indicates the total number of OS-level threads that one wants,
    onto which an arbitrary number of MLton.Thread threads will be
    multiplexed.

    This is where one needs to start worrying about atomic operations for
    shared ML data structures.  In particular, one would likely have a
    shared queue of ML threads, with OS-level threads pulling off ML
    threads to run (see the sketch at the end of this item).  Signal
    handling (for preemptive ML threads) is another thing to figure out.

    Alternatively, one could design a MLton.PThread structure that exposes
    the OS-level threads.  Here, one would be able to start a new PThread
    and initialize it with the ML thread of control.  This ML thread of
    control could set up the signal handler (for preemptive ML threads) and
    then start pulling ML threads off the shared queue.

    There is probably a bit of design left at this level.
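
    To give the shared-queue idea above a concrete shape, here is a
    pthread-level skeleton in which a fixed set of OS-level worker
    threads pull opaque work items (standing in for suspended ML
    threads) off a shared queue.  All of the names are hypothetical, and
    this is not how MLton.Thread would actually be implemented; it only
    sketches the multiplexing and where the locking sits.

        #include <pthread.h>
        #include <stdlib.h>

        /* A work item standing in for a suspended ML thread of control. */
        struct work {
          void (*run)(void *);
          void *arg;
          struct work *next;
        };

        /* Shared run queue, protected by a mutex; workers block on the
         * condition variable when the queue is empty. */
        static struct work *head = NULL, *tail = NULL;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

        void enqueue(struct work *w) {
          w->next = NULL;
          pthread_mutex_lock(&lock);
          if (tail == NULL) head = w; else tail->next = w;
          tail = w;
          pthread_cond_signal(&nonempty);
          pthread_mutex_unlock(&lock);
        }

        static struct work *dequeue(void) {
          pthread_mutex_lock(&lock);
          while (head == NULL)
            pthread_cond_wait(&nonempty, &lock);
          struct work *w = head;
          head = w->next;
          if (head == NULL) tail = NULL;
          pthread_mutex_unlock(&lock);
          return w;
        }

        /* Each OS-level thread loops, pulling ML threads off the queue
         * and running them; a preemptive scheduler would instead switch
         * away on a timer signal and re-enqueue the interrupted thread. */
        static void *worker(void *arg) {
          (void)arg;
          for (;;) {
            struct work *w = dequeue();
            w->run(w->arg);
            free(w);
          }
          return NULL;
        }

        /* Spawn the number of OS-level threads requested, e.g. by a
         * runtime option; any number of ML threads is multiplexed over
         * them. */
        void startWorkers(int nProcs) {
          for (int i = 0; i < nProcs; i++) {
            pthread_t tid;
            pthread_create(&tid, NULL, worker, NULL);
            pthread_detach(tid);
          }
        }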

3) Finish off a library of atomic operations.

    After 1) and 2), you've probably got multiple ML threads running in
    parallel, but nothing is preventing race conditions on ML data (i.e.,
    on ref cells).

    There are a lot of choices here for lock-free data structures, etc.
    (a small sketch follows below).

    Also, one might want to make various portions of the Basis Library
    thread-safe.
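
    For flavor, here is the kind of thing such a library enables once a
    compare-and-swap primitive exists: a minimal lock-free (Treiber-style)
    stack push, sketched in C11 atomics with made-up names.  A real
    library would also need pop, which runs into ABA and memory
    reclamation questions, and the ML-level version would be built on
    primitives like those discussed below rather than written in C.

        #include <stdatomic.h>
        #include <stdlib.h>

        struct node {
          void *value;
          struct node *next;
        };

        /* Top of a lock-free (Treiber) stack of nodes. */
        static _Atomic(struct node *) top = NULL;

        /* Push by repeatedly trying to swing `top` from the value we
         * last observed to the new node; on failure the CAS writes the
         * currently observed top back into n->next, so we simply retry. */
        void push(void *value) {
          struct node *n = malloc(sizeof *n);  /* error handling omitted */
          n->value = value;
          n->next = atomic_load(&top);
          while (!atomic_compare_exchange_weak(&top, &n->next, n))
            ;
        }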

4) Rework the CML implementation to use the new MLton.Thread and atomic
    data structures.

    I know that John Reppy has already put a lot of thought into how this
    could be accomplished in SML/NJ.  As with the current CML
    implementation, we can borrow a lot of those ideas without much change.


I will note that 1) and 2) are really the major work.  And, after those 
portions, one would have a workable multi-threaded system.  MLton.Thread 
is a little low level, but you could certainly accomplish some things with 
it.

> Also, I'd like to know how difficult it would be to implement a structure 
> that would provide atomic read-modify-write operations (for lock-free data 
> structures).  I'm imagining something like this:
>
> signature ATOMIC_REF =
> sig
> type 'a ref
> val swap : 'a ref * 'a -> 'a
> val cmpSet : 'a ref * 'a -> bool
> val cmpSwap : 'a ref * 'a -> 'a
> end
>
> There'd be more functions, of course, but that shows the basic idea.  The 
> idea is the atomic operations are implemented as the corresponding CPU 
> instructions, which would require compiler support.

This would actually be fairly easy; one would just pattern match on the 
existing Ref primitives.  The semantics of cmpSet and cmpSwap need a 
little thought when 'a isn't a primitive type, though it presumably is 
pointer equality on boxed types.  There are various places where one 
would or wouldn't want flattening of atomic refs into containing data 
structures.  But, otherwise, these primitives would just flow through the
entire compilation and be emitted by the codegen as either the appropriate
C function call or the appropriate assembly code.
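
For instance, the C-function side might look something like the sketch
below, using C11 atomics on a pointer-sized cell.  The names and calling
convention are made up, the comparison is the pointer/word equality
mentioned above, and cmpSet/cmpSwap are shown with the usual separate
expected and new arguments, which the signature above presumably
abbreviates.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical C functions a codegen could emit calls to for atomic
     * operations on a pointer-sized ref cell; the native codegens would
     * instead emit the corresponding instructions (e.g. lock cmpxchg and
     * xchg on x86) inline. */

    typedef _Atomic(uintptr_t) *AtomicRef;

    /* swap: store the new contents, return the old contents. */
    uintptr_t Atomic_swap(AtomicRef r, uintptr_t newVal) {
      return atomic_exchange(r, newVal);
    }

    /* cmpSet: if *r equals expected, store newVal and return true. */
    bool Atomic_cmpSet(AtomicRef r, uintptr_t expected, uintptr_t newVal) {
      return atomic_compare_exchange_strong(r, &expected, newVal);
    }

    /* cmpSwap: like cmpSet, but return the value observed in *r, which
     * equals expected exactly when the store happened. */
    uintptr_t Atomic_cmpSwap(AtomicRef r, uintptr_t expected,
                             uintptr_t newVal) {
      atomic_compare_exchange_strong(r, &expected, newVal);
      return expected;
    }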



