[MLton] lexical curiosities

Matthew Fluet matthew.fluet at gmail.com
Tue May 24 12:06:12 PDT 2011


I've been working on a Pygments (http://pygments.org) lexer for
Standard ML; see http://mlton.org/Pygments.

One issue with many of these syntax highlighters is that while they
admit the use of regular expressions for lexical tokens, they don't
quite have the same semantics as a traditional lexer generator.  For
instance, with Pygments, one gives regex/token rules much as one does
with a lexer generator.  But, Pygments chooses the first rule that
matches a prefix of the current input; in contrast, a traditional
lexer generator chooses the rule that matches the longest prefix of
the current input (and uses the first rule that matches in case of
ties).  So, for a traditional lexer generator, it suffices to give
rules like:

  ...
  "val" => Token.VAL
  ...
  "[a-zA-Z][a-zA-Z0-9_']+" => Token.ID(yytext)
  ...

and the input

  val value = 3

will be lexed as a VAL followed by an ID(value).  But, if one uses
these rules in this order with Pygments, then the input will be lexed
as VAL followed by VAL followed by ID(ue).
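
To see the difference concretely, here is a minimal sketch using
Pygments itself (ToyLexer and its rules are hypothetical, just enough
to reproduce the example):

  from pygments.lexer import RegexLexer
  from pygments.token import Keyword, Name, Text

  class ToyLexer(RegexLexer):
      name = 'Toy'
      tokens = {
          'root': [
              (r'val', Keyword),                  # tried first, even mid-word
              (r"[a-zA-Z][a-zA-Z0-9_']*", Name),  # identifier
              (r'\s+', Text),
          ],
      }

  # "value" comes out as Keyword 'val' followed by Name 'ue':
  for token, text in ToyLexer().get_tokens("val value"):
      print(token, repr(text))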

Looking around at some other Pygments lexers, a common way to resolve
this is to use word-boundary matches in the regular expression for
keywords:

  ...
  "\b(val)\b" => Token.VAL
  ...

Unfortunately, a "word" (for the purposes of word-boundary) is defined
to be a non-empty sequence of alphanumeric and underscore characters.
Yet, as noted above, an SML identifier admits alphanumeric and
underscore and single-quote characters.  This means that, when using
the word-boundary "trick" for keywords, the input

  val val'ue = 3

would be lexed as a VAL followed by a VAL followed by a TYVAR('ue).
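
The failure is easy to reproduce with Python's re module (which is
what Pygments uses); nothing here is specific to Pygments:

  import re

  # "'" is not a \w character, so a word boundary falls between "l" and
  # "'", and the keyword regex happily matches the prefix of "val'ue":
  print(re.match(r"\b(val)\b", "val'ue"))  # <re.Match ...; span=(0, 3) ...>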

I solved this problem in the Pygments lexer by just using the
"[a-zA-Z][a-zA-Z0-9_']*" regular expression and using a callback to
determine whether the match was a reserved word or an identifier.

Actually, I use a regular expression that matches full long
identifiers, ending in either an alphanumeric identifier or a
symbolic identifier, and use a callback to determine whether the
match was a reserved word or an identifier (long or short).
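
In Pygments, such a callback attaches directly to a rule and yields
(position, token, text) triples.  Here is a sketch of the simpler,
non-qualified version, with a truncated reserved-word set (the real
lexer enumerates all of SML's reserved words; a callback for long
identifiers is sketched at the end of this message):

  from pygments.lexer import RegexLexer
  from pygments.token import Keyword, Name, Text

  RESERVED = {'val', 'fun', 'let', 'in', 'end'}  # truncated for the sketch

  def id_or_keyword(lexer, match):
      # Classify the whole match after the fact, instead of trying to
      # exclude reserved words in the regular expression itself.
      text = match.group(0)
      token = Keyword.Reserved if text in RESERVED else Name
      yield match.start(), token, text

  class SMLishLexer(RegexLexer):
      name = 'SMLish'
      tokens = {
          'root': [
              (r"[a-zA-Z][a-zA-Z0-9_']*", id_or_keyword),
              (r'\s+', Text),
          ],
      }

  # "val val'ue" now tokenizes as Keyword.Reserved 'val', Name "val'ue":
  for token, text in SMLishLexer().get_tokens("val val'ue"):
      print(token, repr(text))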


Anyways, this got me thinking about the restriction that reserved
words are excluded from identifiers.  Consider the following
"program":

  val x = Int.val

Clearly, this is an illegal program, but the question is "Why is this
program illegal?".  Here's what MLton, HaMLet, and SML/NJ say:

[mtf@fenrir tmp]$ mlton z.sml
Error: z.sml 1.9.
  Undefined variable Int.val.
compilation aborted: parseAndElaborate reported errors
[mtf@fenrir tmp]$ hamlet z.sml
[loading standard basis library]
[processing z.sml]
z.sml:1.8-1.15: unknown identifier Int.val
[mtf@fenrir tmp]$ sml z.sml
Standard ML of New Jersey v110.73 [built: Fri May 20 11:55:37 2011]
[opening z.sml]
z.sml:1.13 Error: syntax error found at VAL
/Users/mtf/devel/smlnj/smlnj/bin/sml: Fatal error -- Uncaught
exception Compile with "syntax error" raised at
../compiler/Parse/main/smlfile.sml:15.24-15.46

Both MLton and HaMLet report a static semantic error; SML/NJ reports a
syntactic error.

SML/NJ's behavior is easily explained by the fact that SML/NJ
recognizes long identifiers as a syntactic form, rather than as a
lexical form.  (This is a minor deviation from the Definition;
Sections 2.4 and 2.5 clearly indicate that long identifiers are
lexical forms, which should not admit whitespace or comments between
components.)

MLton's and HaMLet's behavior is explained by the fact that they use
the following regular expressions to recognize long identifiers:

  alphanum=[A-Za-z'_0-9]*;
  alphanumId=[A-Za-z]{alphanum};
  sym=[-!%&$+/:<=>?@~`^|#*]|"\\";
  symId={sym}+;
  id={alphanumId}|{symId};
  longid=({alphanumId}".")*{id};

These regular expressions do not exclude reserved words from the
alphanumId and symId classes.  For non-qualified identifiers, the
order of rules in the lexer generator input dictates that reserved
words will be chosen over identifiers.  But, for long identifiers,
nothing prohibits the use of a reserved word as an identifier (in any
component).  Hence, both MLton and HaMLet also report a static
semantic error for the following program:

  val x = val.=>
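
One can check this directly by transliterating the definitions above
into Python's re syntax (a sketch; re serves only as a convenient
engine for testing membership):

  import re

  alphanumId = r"[A-Za-z][A-Za-z'_0-9]*"
  symId = r"[-!%&$+/:<=>?@~`^|#*\\]+"
  anyId = rf"(?:{alphanumId}|{symId})"
  longid = rf"(?:{alphanumId}\.)*{anyId}"

  # Nothing excludes reserved words from the components:
  print(bool(re.fullmatch(longid, "Int.val")))  # True
  print(bool(re.fullmatch(longid, "val.=>")))   # True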

Since regular languages are closed under difference, it would be
possible to define alphanumId and symId regular expressions that
excluded the reserved words.  (Indeed, the ml-ulex tool admits regular
expression complement, intersection, and union, and so would admit
relatively concise definitions of alphanumId and symId regular
expressions that excluded the reserved words.)
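
In Python's re syntax, one way to express the difference for
alphanumeric identifiers is a negative lookahead (a sketch with a
truncated reserved-word list):

  import re

  ALNUM_RES = r"(?:val|fun|let|in|end)"  # truncated
  # exactly the alphanumeric identifiers that are not reserved words
  alnumIdMinus = rf"(?!{ALNUM_RES}(?![A-Za-z0-9_']))[A-Za-z][A-Za-z0-9_']*"

  print(bool(re.fullmatch(alnumIdMinus, "value")))  # True
  print(bool(re.fullmatch(alnumIdMinus, "val")))    # False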

But, here's a curiosity.  Suppose that we were to define alphanumId
and symId regular expressions that excluded the reserved words.  The
behavior of a traditional lexer generator is to always choose the rule
that matches the longest prefix of the current input.  So, the
program:

  val x = Int.val

would be lexed as VAL, ID(x), EQUALS, LONGID(Int.va), ID(l); and the program:

  val x = val.=>

would be lexed as VAL, ID(x), EQUALS, VAL, and would then fail with a
lexical error: at "val.=>", the reserved-word rule's three-character
match "val" beats the longid rule's two-character match "va" (with
the exclusion, no longid has "val" as a component), and the bare "."
that follows matches no rule.  [The Definition allows "=" as a
symbolic identifier, as a special exception to the "exclude reserved
words" restriction, so a longid such as "l.=" would still be legal
even though "=>" is excluded.]
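
To make the longest-match behavior concrete, here is a small
maximal-munch simulation that reproduces both token streams above.  It
is a sketch under stated assumptions: the reserved-word list is
truncated, ID and LONGID are conflated, and re.fullmatch is applied to
successively shorter prefixes to get true longest-prefix-in-the-language
semantics (which a single greedy re match would not give):

  import re

  ALNUM_RES = r"(?:val|fun|let|in|end)"  # truncated
  SYM = r"[-!%&$+/:<=>?@~`^|#*\\]"
  # identifiers minus reserved words, via the lookahead encoding above;
  # "=" is kept, per the Definition's special exception, but "=>" is not
  alnumId = rf"(?!{ALNUM_RES}(?![A-Za-z0-9_']))[A-Za-z][A-Za-z0-9_']*"
  symId = rf"(?!=>(?!{SYM})){SYM}+"
  longid = rf"(?:{alnumId}\.)*(?:{alnumId}|{symId})"

  RULES = [                        # rule order breaks length ties
      ("VAL",    re.compile(r"val")),
      ("EQUALS", re.compile(r"=")),
      ("LONGID", re.compile(longid)),
      ("WS",     re.compile(r"\s+")),
  ]

  def longest(rx, s, pos):
      # length of the longest prefix of s[pos:] in rx's language
      for end in range(len(s), pos, -1):
          if rx.fullmatch(s, pos, end):
              return end - pos
      return 0

  def lex(s):
      pos = 0
      while pos < len(s):
          name, n = max(((nm, longest(rx, s, pos)) for nm, rx in RULES),
                        key=lambda rule: rule[1])
          if n == 0:
              raise ValueError(f"lexical error at {pos}: {s[pos:]!r}")
          if name != "WS":
              print(name, repr(s[pos:pos + n]))
          pos += n

  lex("val x = Int.val")  # VAL, LONGID 'x', EQUALS, LONGID 'Int.va', LONGID 'l'
  try:
      lex("val x = val.=>")  # VAL, LONGID 'x', EQUALS, VAL, then an error
  except ValueError as e:
      print(e)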

Hard to say if there is a "best" way to handle reserved words in long
identifiers.  (And, since such programs would be rejected for *some*
reason, it isn't clear that it needs to be handled in the lexer.)  The
Definition states "Comments and formatting characters separate items
[of lexical analysis]", which might argue that "Int.val" cannot be
lexed as LONGID(Int.va) followed by ID(l), though clearly we want
"f(x)" to be lexed as ID(f), LPAREN, ID(x), RPAREN, despite there
being no comments or formatting characters between items.  But,
arguably, lexing "Int.val" as LONGID(Int.va) followed by ID(l) leads
to even stranger static semantic errors than those presently reported
by MLton and HaMLet.

I actually handled this in the Pygments lexer by using the callback on
a long identifier to mark any reserved word in a qualified identifier
as an ERROR token, while marking any reserved word in a non-qualified
identifier as a KEYWORD.  This might be an appropriate mechanism for
flagging illegal long identifiers in the lexer.
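
In code, that long-identifier callback might look something like the
following sketch (again with a truncated reserved-word set, and
ignoring symbolic identifiers):

  from pygments.token import Error, Keyword, Name, Punctuation

  RESERVED = {'val', 'fun', 'let', 'in', 'end'}  # truncated for the sketch

  def longid_callback(lexer, match):
      text, pos = match.group(0), match.start()
      parts = text.split('.')
      qualified = len(parts) > 1
      for i, part in enumerate(parts):
          if i > 0:
              yield pos, Punctuation, '.'
              pos += 1
          if part in RESERVED:
              # a reserved word is a keyword on its own, but is illegal
              # inside a qualified identifier
              yield pos, Error if qualified else Keyword.Reserved, part
          else:
              yield pos, Name, part
          pos += len(part)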


