[MLton] statically ensuring regular-language membership with fold

Stephen Weeks MLton@mlton.org
Sun, 16 Oct 2005 07:35:10 -0700


This note describes a construction that, given a regular language R
over alphabet {a, b, ...}, defines SML values "start" and "a", "b",
... such that the SML expression

  (start s1 s2 ... sn $): unit

is well typed if and only if the sequence <s1, s2, ..., sn> is in R,
where each si is in {a, b, ...}.  ("$" is our usual variable-arguments
terminator)

It is perhaps easier to understand the construction by example, so
here it is for the regular language (ab)*.  There is a three-state
DFA that accepts (ab)*.  Here is the transition table.

       a   b
      --- ---
   X | Y | Z |
   Y | Y | Y |
   Z | X | Y |
      --- ---

Z is the start state and is also the only accepting state.

The construction is built on top of the fold utility.

----------------------------------------------------------------------
fun $ (a, f) = f a
fun pass x f = f x
structure Fold =
   struct
      type ('a, 'b, 'c, 'd) step = 'a * ('b -> 'c) -> 'd
      type ('a, 'b, 'c, 'd) t = ('a, 'b, 'c, 'd) step -> 'd
      type ('a1, 'a2, 'b, 'c, 'd) step0 =
         ('a1, 'b, 'c, ('a2, 'b, 'c, 'd) t) step

      val fold = pass
      fun step0 h (a1, f) = fold (h a1, f)
   end
----------------------------------------------------------------------

Below is an expression of the DFA in an SML signature, where "start" is
a folder (of type Fold.t) and each symbol is a stepper (of type
Fold.step0).  Furthermore, each symbol directly expresses its column
in the transition table in its type.  Each step keeps track of the
state and each state has embedded within it whether it is accepting or
rejecting.  The finisher function of the fold requires the final state
to be accepting.

----------------------------------------------------------------------
signature AB =
   sig
      type accept
      type reject
      type ('a, 'b, 'c, 'd, 'e) state
      type ('a, 'b, 'c, 'd, 'e) trans
      type ('a, 'b, 'c) X = (reject, 'a, 'b, 'c, 'a) trans
      type ('a, 'b, 'c) Y = (reject, 'a, 'b, 'c, 'b) trans
      type ('a, 'b, 'c) Z = (accept, 'a, 'b, 'c, 'c) trans
      type ('t1, 't2, 't3, 'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
         = (('f1, 't1, 't2, 't3, ('f2, 'a, 'b, 'c, 'd) trans) state,
            ('f2, 'a, 'b, 'c, 'd) state,
            (accept, 'e, 'f, 'g, 'h) state,
            unit,
            'z) Fold.step0

      val start: ((accept, 'b, 'c, 'd, 'd) state,
                  (accept, 'e, 'f, 'g, 'h) state,
                  unit,
                  'z) Fold.t
      val a: (('a, 'b, 'c) Y,
              ('a, 'b, 'c) Y,
              ('a, 'b, 'c) X,
              'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
      val b: (('a, 'b, 'c) Z,
              ('a, 'b, 'c) Y,
              ('a, 'b, 'c) Y,
              'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
   end
----------------------------------------------------------------------

It is easy to check that that the above signature works as intended.
In the following functor, the uncommented examples type check and the
commented examples fail to typecheck.

----------------------------------------------------------------------
functor Test (S: AB): sig end =
   struct
      open S
      val () = start $
(*      val () = start a $ *)
(*      val () = start b $ *)
      val () = start a b $
(*      val () = start a b a $ *)
      val () = start a b a b $
   end
----------------------------------------------------------------------

There are a number of ways to implement AB.  Several of them are
mostly phantom and share some common definitions.

----------------------------------------------------------------------
functor GenAB (S: sig
                     type ('a, 'b, 'c, 'd, 'e) state
                     type ('a, 'b, 'c, 'd, 'e) trans
                  end) =
   struct
      open S
      type accept = unit
      type reject = unit
      type ('a, 'b, 'c) X = (reject, 'a, 'b, 'c, 'a) trans
      type ('a, 'b, 'c) Y = (reject, 'a, 'b, 'c, 'b) trans
      type ('a, 'b, 'c) Z = (accept, 'a, 'b, 'c, 'c) trans
      type ('t1, 't2, 't3, 'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
         = (('f1, 't1, 't2, 't3, ('f2, 'a, 'b, 'c, 'd) trans) state,
            ('f2, 'a, 'b, 'c, 'd) state,
            (accept, 'e, 'f, 'g, 'h) state,
            unit,
            'z) Fold.step0
   end
----------------------------------------------------------------------

Here is a pure phantom implementation.

----------------------------------------------------------------------
structure AB:> AB =
   struct
      structure S = GenAB (type ('a, 'b, 'c, 'd, 'e) state = unit
                           type ('a, 'b, 'c, 'd, 'e) trans = unit)
      open S
         
      fun start $ = Fold.fold ((), ignore) $
      fun symbol $ = Fold.step0 ignore $
      val a = symbol
      val b = symbol
   end
----------------------------------------------------------------------

Here is an implementation that dynamically keeps track of the state.

----------------------------------------------------------------------
structure AB:> AB =
   struct
      datatype state = X | Y | Z
      structure S = GenAB (type ('a, 'b, 'c, 'd, 'e) state = state
                           type ('a, 'b, 'c, 'd, 'e) trans = unit)
      open S

      fun start $ = Fold.fold (Z, ignore) $
      val symbol = Fold.step0
      fun a $ = symbol (fn X => Y | Y => Y | Z => X) $
      fun b $ = symbol (fn X => Z | Y => Y | Z => Y) $
   end
----------------------------------------------------------------------

Here is an implementation that dynamically keeps track of the
transitions, and prints them out at the end.

----------------------------------------------------------------------
structure AB:> AB =
   struct
      structure S = GenAB (type ('a, 'b, 'c, 'd, 'e) state = char list
                           type ('a, 'b, 'c, 'd, 'e) trans = unit)
      open S

      fun start $ =
         Fold.fold ([], fn cs => print (implode (rev (#"\n" :: cs)))) $
      fun symbol c = Fold.step0 (fn cs => c :: cs)
      fun a $ = symbol #"a" $
      fun b $ = symbol #"b" $
   end
----------------------------------------------------------------------

Calling our Test functor on the above AB yields

----------------------------------------------------------------------

ab
abab
----------------------------------------------------------------------

There is also a completely non-phantom implementation.  I'm not sure
if it has any practical use, but it does have one property that none
of the other implementations do -- one can use it without the
signature constraint and it still enforces membership in the regular
language.  All of the other implementations, because of their phantom
aspects, require the opaque signature constraint in order to function
properly.

I also mention this implementation for historical reasons, since it is
the one I built first, and used to help me derive the AB signature.

----------------------------------------------------------------------
structure AB: AB =
   struct
      datatype accept = Accept
      datatype reject = Reject
      datatype ('a, 'b, 'c) r = R of {X: 'a, Y: 'b, Z: 'c}
      type ('a, 'b, 'c, 'd, 'e) state = 'a * (('b, 'c, 'd) r -> 'e)
      type ('a, 'b, 'c) states =
         ((reject, 'a, 'b, 'c, 'a) state,
          (reject, 'a, 'b, 'c, 'b) state,
          (accept, 'a, 'b, 'c, 'c) state) r
      type ('a, 'b, 'c, 'd, 'e) trans =
         ('b, 'c, 'd) states -> ('a, 'b, 'c, 'd, 'e) state
      type ('a, 'b, 'c) X = (reject, 'a, 'b, 'c, 'a) trans
      type ('a, 'b, 'c) Y = (reject, 'a, 'b, 'c, 'b) trans
      type ('a, 'b, 'c) Z = (accept, 'a, 'b, 'c, 'c) trans
      type ('t1, 't2, 't3, 'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
         = (('f1, 't1, 't2, 't3, ('f2, 'a, 'b, 'c, 'd) trans) state,
            ('f2, 'a, 'b, 'c, 'd) state,
            (accept, 'e, 'f, 'g, 'h) state,
            unit,
            'z) Fold.step0

      fun state (f, sel): ('a, 'b, 'c, 'd, 'e) state =
         (f, fn R r => sel r)
      fun states (): ('a, 'b, 'c) states =
         R {X = state (Reject, #X),
            Y = state (Reject, #Y),
            Z = state (Accept, #Z)}
      fun run state = Fold.fold (state, fn (Accept, _) => ())
      fun trans sel (R states) = sel states
      val X: ('a, 'b, 'c) X = fn $ => trans #X $
      val Y: ('a, 'b, 'c) Y = fn $ => trans #Y $
      val Z: ('a, 'b, 'c) Z = fn $ => trans #Z $
      fun start $ = run (Z (states ())) $
      fun symbol r = Fold.step0 (fn (_, sel) => sel (R r) (states ()))
      fun a $ = symbol {X = Y, Y = Y, Z = X} $
      fun b $ = symbol {X = Z, Y = Y, Z = Y} $
   end         
----------------------------------------------------------------------

That's it for the example construction.  Hopefully the general
construction is now clear.  

Suppose we have a DFA with n states, S1, ..., Sn and m symbols u1,
... um.  Let d(Si, uj) be the state transitioned to from state Si on
symbol uj.  Define Fi to be "accept" if Si is accepting and to be
"reject" if Si is rejecting.  Let "s" be the index of the start state.
The signature corresponding to the DFA is the following.

  type accept
  type reject
  type ('a, 'a1, ..., 'an, 'e) state
  type ('a, 'a1, ..., 'an, 'e) trans
  type ('a1, ..., 'an) S1 = (F1, 'a1, ..., 'an, 'a1) trans
  ...
  type ('a1, ..., 'an) Sn = (Fn, 'a1, ..., 'an, 'an) trans
  type ('t1, ..., 'tn, 'f1, 'f2, 'a1, ..., 'an, 'b, 'c1, ..., 'cn, 'd, 'z)
     symbol
     = (('f1, 't1, ..., 'tn, ('f2, 'a1, ..., 'an, 'b) trans) state,
        ('f2, 'a1, ..., 'an, 'b) state,
        (accept, 'c1, ..., 'cn, 'd) state,
        unit, 'z) Fold.step0

  val start: ((Fs, 'a1, ..., 'an, 'as) state,
              (accept, 'b1, ..., 'bn, 'c) state,
              unit, 'z) Fold.t
  val u1: (('a1, ..., 'an) d(S1,u1),
           ...
           ('a1, ..., 'an) d(Sn,u1),
           'f1, 'f2, 'a1, ..., 'an, 'b, 'c1, ..., 'cn, 'd, 'z) symbol
  ...
  val um: (('a1, ..., 'an) d(S1,um),
           ...
           ('a1, ..., 'an) d(Sn,um),
           'f1, 'f2, 'a1, ..., 'an, 'b, 'c1, ..., 'cn, 'd, 'z) symbol

The size of the signature is proportional to the size of the DFA
(number of states times number of symbols).  An interesting question
that I do not know the answer to is how big the inferred type
derivations are.  I think because of the way fold works, they will be
quadratic in the number of input symbols, times the size of the DFA.

It wouldn't be too hard write some code that generates the signature
(and one of the implementations) for any DFA.

There are some immediate practical applications of this construction.
It allows one to easily conclude that a number of conditions for
optional arguments can be enforced statically, simply because they can
be expressed as a regular language.  For example, one can easily
guarantee that each optional argument occurs no more than once.
Furthermore, once that guarantee is in place, any other constraint can
be expressed, since all finite languages are regular.  So, for
example, one could require the optional arguments to occur in a
particular order.  Or, impose a pre-order.  Or, one could require that
the arguments occur as a prefix of some total order.

For many regular languages, there is a simpler signature that enforces
language membership than the construction above.  For example, for
(ab)*, here is a simpler signature.

----------------------------------------------------------------------
signature AB' =
   sig
      type x
      type z

      val start: (z, z, unit, 'w) Fold.t
      val a: (z, x, z, unit, 'w) Fold.step0
      val b: (x, z, z, unit, 'w) Fold.step0
   end
----------------------------------------------------------------------

This kind of simplification is possible only because of the very
simple nature of the DFA.  In addition to AB' being simpler than AB,
AB' also has the benefit of better error reporting.  With AB', one
gets a type error messages as soon as one attempts to enter the dead
state, while with AB one must wait until the $ terminator to get the
error.  I think it's not too hard to modify the general construction
to meet this property, though.

It is also possible to ensure membership in some non-regular
languages.  For example, here is a signature that ensures membership
in a^n b^n.

----------------------------------------------------------------------
signature ANBN =
   sig
      type A
      type B
      type Z
      type 'a S
      val start: (   Z *  A,    Z * 'c, unit, 'y) Fold.t
      val a:     (  'n *  A, 'n S *  A, 'w, 'x, 'y) Fold.step0
      val b:     ('n S * 'c,   'n *  B, 'w, 'x, 'y) Fold.step0
   end
----------------------------------------------------------------------

As ANBN makes clear, one can implement a type-level stack and do
pushes and pops within the steppers.  By merging this idea with DFA
construction above, I suspect it is possible to implement any pushdown
automaton (PDA), and hence to statically ensure membership in any
deterministic context-free language.  I'm not sure about
nondeterministic PDA's, but perhaps SML's type inference can be made
to do the guessing needed for nondeterminism.

Finally, it is possible to ensure membership in some non-context-free
languages.  For example, here is a signature that ensures membership
in a^n b^n c^n.

----------------------------------------------------------------------
signature ANBNCN =
   sig
      type A
      type B
      type C
      type Z
      type 'a S
      val start: (   Z *    Z *  A,    Z *    Z * 'd, unit, 'y) Fold.t
      val a:     (  'n *   'n *  A, 'n S * 'n S *  A, 'w, 'x, 'y) Fold.step0
      val b:     ('n S *   'm * 'd,   'n *   'm *  B, 'w, 'x, 'y) Fold.step0
      val c:     (   Z * 'm S * 'd,    Z *   'm *  C, 'w, 'x, 'y) Fold.step0
   end
----------------------------------------------------------------------