[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

Aaron Turon adrassi@gmail.com
Tue, 29 Nov 2005 09:39:42 -0600


> If you want to add Unicode support, then you have a working WideChar/
> String. Decoding UTF-8 into a WideChar is about 10-20 lines, so
> that's not much additional effort either. The real work is getting
> MLlex to support such a large character set. However, that's only
> needed for Unicode-enabled SML compilers.
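
To make the quoted size estimate concrete, here is a rough Python sketch of
a strict UTF-8 decoder.  The name decode_utf8 is hypothetical, and it returns
integer code points; an SML version would produce WideChar values instead.

```python
def decode_utf8(data):
    """Decode a bytes object holding UTF-8 into a list of code points."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # The lead byte gives the sequence length, the payload bits, and
        # the smallest code point allowed for that length (rejects
        # overlong encodings).
        if b0 < 0x80:
            length, cp, lowest = 1, b0, 0
        elif 0xC2 <= b0 <= 0xDF:
            length, cp, lowest = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, cp, lowest = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            length, cp, lowest = 4, b0 & 0x07, 0x10000
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)
        # Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong forms, surrogates, and values past U+10FFFF.
        if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point at offset %d" % i)
        out.append(cp)
        i += length
    return out
```

For example, decode_utf8("é".encode("utf-8")) yields a one-element list
holding the code point of "é".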

I have been working with John Reppy on a (largely)
backwards-compatible replacement for ML-lex.  The new tool is based on
Brzozowski's notion of regular expression derivatives[1], making it
easy to support boolean operations on REs such as intersection and
negation.  Code generation is not finalized, but will most likely be
control-flow-based (one function per state, with tail calls) rather
than table-based.
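
As a rough illustration of why derivatives make the boolean operations
cheap, here is a minimal Python sketch of a derivative-based matcher.  It is
not the tool's code: the tuple encoding and the names (nullable, deriv,
matches, and so on) are made up for this example, and it does no
simplification of the derived expressions, which a real implementation would
need in order to get a finite set of states.

```python
# Regular expressions as tagged tuples (an encoding chosen for brevity).
EMPTY = ("empty",)          # matches no string
EPS = ("eps",)              # matches only the empty string

def char(c):    return ("chr", c)
def cat(r, s):  return ("cat", r, s)
def alt(r, s):  return ("alt", r, s)
def and_(r, s): return ("and", r, s)   # intersection
def not_(r):    return ("not", r)      # negation (complement)
def star(r):    return ("star", r)

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("empty", "chr"):
        return False
    if tag in ("eps", "star"):
        return True
    if tag in ("cat", "and"):
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "not":
        return not nullable(r[1])
    raise ValueError(tag)

def deriv(r, c):
    """The derivative of r with respect to c: what may follow a leading c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "cat":
        first = cat(deriv(r[1], c), r[2])
        return alt(first, deriv(r[2], c)) if nullable(r[1]) else first
    if tag == "alt":
        return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "and":        # intersection just distributes
        return and_(deriv(r[1], c), deriv(r[2], c))
    if tag == "not":        # negation just commutes with the derivative
        return not_(deriv(r[1], c))
    if tag == "star":
        return cat(deriv(r[1], c), r)
    raise ValueError(tag)

def matches(r, s):
    """Take one derivative per input character, then test nullability."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# Strings over {a, b}, except the string "ab" itself.
not_ab = and_(star(alt(char("a"), char("b"))),
              not_(cat(char("a"), char("b"))))
```

Here matches(not_ab, "abb") is True and matches(not_ab, "ab") is False.
In a lexer generator each distinct derivative becomes a DFA state, so the
"and" and "not" cases cost no more than "alt".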

We have designed the tool to support Unicode.  I hope to have an
initial version out for testing some time next month -- please feel
free to send mail with suggestions or requests.

Best,
Aaron

[1]  Derivatives of Regular Expressions, Janusz A. Brzozowski, Journal
of the ACM, Volume 11, Issue 4, 1964.