[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

John Reppy jhr@cs.uchicago.edu
Tue, 29 Nov 2005 10:51:10 -0600


The lexer doesn't generate strings.  The input is assumed to be 8-bit
characters (i.e., type char), and one can specify 7-bit, 8-bit, or UTF-8
interpretations of the character stream (ML-lex only supports 7-bit and
8-bit).
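
To illustrate the difference between the 8-bit and UTF-8 interpretations
of the same stream, here is a minimal Python sketch. It is not code from
the tool; the names next_code_point and code_points are made up for this
example, and the decoder is only a plausible reading of what a "UTF-8
interpretation" involves.

```python
def next_code_point(data: bytes, i: int) -> tuple[int, int]:
    """Decode one UTF-8 sequence starting at data[i].

    Returns (code_point, index_of_next_sequence).
    """
    b0 = data[i]
    if b0 < 0x80:                      # 0xxxxxxx: 7-bit ASCII
        return b0, i + 1
    if 0xC2 <= b0 <= 0xDF:             # 110xxxxx: one continuation byte
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:           # 1110xxxx: two continuation bytes
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:           # 11110xxx: three continuation bytes
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:
        raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")
    if i + need >= len(data):
        raise ValueError(f"truncated sequence at offset {i}")
    for k in range(1, need + 1):
        b = data[i + k]
        if b & 0xC0 != 0x80:           # continuation bytes are 10xxxxxx
            raise ValueError(f"invalid continuation byte at offset {i + k}")
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong encodings, surrogates, and out-of-range values.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"ill-formed sequence at offset {i}")
    return cp, i + need + 1


def code_points(data: bytes) -> list[int]:
    """UTF-8 interpretation: one integer per code point."""
    out, i = [], 0
    while i < len(data):
        cp, i = next_code_point(data, i)
        out.append(cp)
    return out


stream = "λx".encode("utf-8")          # the bytes CE BB 78
eight_bit = list(stream)               # 8-bit interpretation: [206, 187, 120]
utf8 = code_points(stream)             # UTF-8 interpretation: [955, 120]
```

Under the 8-bit interpretation the lexer sees three symbols; under the
UTF-8 interpretation it sees two, the first being U+03BB.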

	- John

On Nov 29, 2005, at 10:30 AM, Geoffrey Alan Washburn wrote:

> Aaron Turon wrote:
>> I have been working with John Reppy on a (largely)
>> backwards-compatible replacement for ML-lex. The new tool is based on
>> Brzozowski's notion of regular expression derivatives[1], making it
>> easy to support boolean operations on REs such as intersection and
>> negation. Code generation is not finalized, but will most likely be
>> control-flow-based (one function per state, with tail calls) rather
>> than table-based. We have designed the tool to support Unicode. I hope
>> to have an initial version out for testing some time next month --
>> please feel free to send mail with suggestions or requests.
>     This would be great.  In the past, to handle some ad hoc uses of
> UTF-8 in my parsers, I've had to build a custom version of ml-lex
> with CharSetSize >129.
>
>     Though, given that there isn't yet an agreed-upon Basis module
> for Unicode, what does your lexer generate in terms of strings?
>
> -- [Geoff Washburn|geoffw@cis.upenn.edu|http://www.cis.upenn.edu/~geoffw/]
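
The point in the quoted message about derivatives making intersection
and negation easy can be sketched in a few lines. The following Python is
only an illustration of the general technique, not the tool's code: the
tuple encoding and all function names are assumptions made for this
example, and it omits the term simplification a real lexer generator
would need in order to build a finite set of states.

```python
# Regular expressions as tagged tuples (a hypothetical encoding).
EMPTY = ("empty",)                     # matches no string
EPS = ("eps",)                         # matches only the empty string

def chr_(c): return ("chr", c)
def cat(a, b): return ("cat", a, b)
def alt(a, b): return ("alt", a, b)
def and_(a, b): return ("and", a, b)   # intersection
def not_(a): return ("not", a)         # complement
def star(a): return ("star", a)

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "chr"):
        return False
    if tag in ("cat", "and"):
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "not":
        return not nullable(r[1])
    raise ValueError(f"unknown regex node: {tag}")

def deriv(r, c):
    """Derivative of r with respect to c: what may follow a leading c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "cat":
        left = cat(deriv(r[1], c), r[2])
        return alt(left, deriv(r[2], c)) if nullable(r[1]) else left
    if tag == "alt":
        return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "and":                   # distributes just like alternation
        return and_(deriv(r[1], c), deriv(r[2], c))
    if tag == "not":                   # commutes with the derivative
        return not_(deriv(r[1], c))
    if tag == "star":
        return cat(deriv(r[1], c), r)
    raise ValueError(f"unknown regex node: {tag}")

def matches(r, s):
    """Take one derivative per input symbol, then test nullability."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# Any string over {a, b} except exactly "ab".
ab_star = star(alt(chr_("a"), chr_("b")))
not_ab = and_(ab_star, not_(cat(chr_("a"), chr_("b"))))
```

With these definitions, matches(not_ab, "aab") holds while
matches(not_ab, "ab") does not. Intersection and negation each cost one
extra case in nullable and deriv, which is the convenience the message
refers to.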