[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

Geoffrey Alan Washburn geoffw@cis.upenn.edu
Tue, 29 Nov 2005 11:30:36 -0500


Aaron Turon wrote:
> I have been working with John Reppy on a (largely)
> backwards-compatible replacement for ML-lex.  The new tool is based on
> Brzozowski's notion of regular expression derivatives[1], making it
> easy to support boolean operations on REs such as intersection and
> negation.  Code generation is not finalized, but will most likely be
> control-flow-based (one function per state, with tail calls) rather
> than table-based.
>
> We have designed the tool to support unicode.  I hope to have an
> initial version out for testing some time next month -- please feel
> free to send mail with suggestions or requests.
>   
    This would be great.  In the past, to handle some ad-hoc uses of
UTF-8 in my parsers, I've had to build a custom version of ml-lex with
CharSetSize > 129.

    Though, given that there isn't yet an agreed-upon Basis module for
Unicode, what does your lexer generate in terms of strings?
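
    As a rough illustration of the derivative idea mentioned in the
quoted message, here is a minimal sketch in Python.  It is not the
tool's implementation (which targets SML and generates code per state);
all class and function names are hypothetical, and no simplification of
derivatives is done, so it only suits short inputs.  It does show why
intersection and negation come almost for free: each is one extra case
in `nullable` and `deriv`.

```python
from dataclasses import dataclass


class Re:
    """Base class for regular expression nodes (hypothetical names)."""


@dataclass(frozen=True)
class Empty(Re):      # matches no string at all
    pass


@dataclass(frozen=True)
class Eps(Re):        # matches only the empty string
    pass


@dataclass(frozen=True)
class Chr(Re):        # matches one code point
    c: str


@dataclass(frozen=True)
class Cat(Re):        # concatenation
    a: Re
    b: Re


@dataclass(frozen=True)
class Alt(Re):        # union
    a: Re
    b: Re


@dataclass(frozen=True)
class And(Re):        # intersection
    a: Re
    b: Re


@dataclass(frozen=True)
class Not(Re):        # complement
    a: Re


@dataclass(frozen=True)
class Star(Re):       # Kleene closure
    a: Re


def nullable(r: Re) -> bool:
    """True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Chr)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.a) and nullable(r.b)
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)
    if isinstance(r, Not):
        return not nullable(r.a)
    raise TypeError(r)


def deriv(r: Re, c: str) -> Re:
    """Derivative of r with respect to the single code point c."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Cat):
        left = Cat(deriv(r.a, c), r.b)
        return Alt(left, deriv(r.b, c)) if nullable(r.a) else left
    if isinstance(r, Alt):
        return Alt(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, And):
        return And(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, Not):
        return Not(deriv(r.a, c))
    if isinstance(r, Star):
        return Cat(deriv(r.a, c), r)
    raise TypeError(r)


def matches(r: Re, s: str) -> bool:
    """Take one derivative per code point, then test nullability."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)
```

    For example, `And(Star(Chr("a")), Not(Cat(Chr("a"), Chr("a"))))`
describes every string of a's except "aa".  Because Python strings
iterate by code point, the same functions work unchanged on non-ASCII
input; a real lexer generator would instead have to decide on a
character and string representation, which is the question above.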

-- 
[Geoff Washburn|geoffw@cis.upenn.edu|http://www.cis.upenn.edu/~geoffw/]

