[MLton] WideChar

Stephen Weeks MLton@mlton.org
Fri, 10 Dec 2004 15:06:54 -0800


> Note, none of this is really about chars, but about encodings; that
> is why the exception makes sense (to me).  It is caused exactly by
> the fact that the function from bytes to unicode which is UTF-8 is
> only a partial function.

True.  I'm not sure which way to go.  I'd like to see stuff developed
further and see some more uses.  The tradition of the other scan
functions (e.g. Int.scan) is to return NONE when encountering invalid
input.  So, if we're writing a decoder from some encoding to WideChar

  decode: Encoding.t * (char, 'a) reader -> (WideChar.t, 'a) reader

then having the WideChar reader return NONE when a character can't be
scanned seems plausible.  I find that confusing, though, because it
doesn't distinguish invalid input from reaching the end of the stream.
I don't use this reader stuff much, so I don't have a good intuition.
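
As a rough sketch of what the NONE-returning style might look like for
UTF-8, here is a reader transformer written against the Basis Library
only.  Everything in it is an assumption made for illustration: the name
decodeUtf8 is hypothetical, code points are plain ints rather than
WideChar.t, there is no Encoding.t argument, and overlong forms and
surrogates are not rejected.

```sml
(* Same shape as StringCvt.reader in the Basis Library. *)
type ('a, 's) reader = 's -> ('a * 's) option

(* decodeUtf8 (hypothetical name): turn a char reader into a reader of
   code points.  NONE is returned both at end of stream and on malformed
   input, so the caller cannot tell the two cases apart. *)
fun decodeUtf8 (getc : (char, 's) reader) : (int, 's) reader =
   fn s =>
   let
      (* Read n continuation bytes (10xxxxxx), adding 6 bits each. *)
      fun more (0, acc, s) = SOME (acc, s)
        | more (n, acc, s) =
             case getc s of
                NONE => NONE
              | SOME (c, s') =>
                   let val b = Char.ord c
                   in
                      if b >= 0x80 andalso b < 0xC0
                         then more (n - 1, acc * 64 + (b - 0x80), s')
                      else NONE
                   end
   in
      case getc s of
         NONE => NONE
       | SOME (c, s') =>
            let val b = Char.ord c
            in
               if b < 0x80 then SOME (b, s')
               else if b >= 0xC0 andalso b < 0xE0 then more (1, b - 0xC0, s')
               else if b >= 0xE0 andalso b < 0xF0 then more (2, b - 0xE0, s')
               else if b >= 0xF0 andalso b < 0xF8 then more (3, b - 0xF0, s')
               else NONE
            end
   end

(* The bytes C3 A9 decode to U+00E9 (233). *)
val ok = decodeUtf8 Substring.getc (Substring.full "\195\169")
(* Malformed input and empty input both come back as NONE. *)
val bad = decodeUtf8 Substring.getc (Substring.full "\255")
val eof = decodeUtf8 Substring.getc (Substring.full "")
```

Here bad and eof are indistinguishable, which is exactly the ambiguity
described above.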

> With regards to the multi-level table compression for Lex, long ago
> in building up a DFA string matcher I needed to store things
> compactly so that speed would be optimized.  I just used the
> following hack: divide the character space into equivalence classes
> where 2 characters are equivalent iff all transitions from all
> states are the same.

This is an excellent suggestion if the number of characters is small,
which is the case for Char1 or Char2.  But for full Unicode (21 bits)
it might be a bit overkill.  In any case, we use this trick in the MLton
regexp library (which is only for 8-bit chars) based on your making
this same suggestion many years ago :-).  It does work well.
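
For concreteness, the equivalence-class trick can be sketched as follows.
This is not the code in the MLton regexp library; the table
representation (a vector of rows indexed by state, then by character
code) and the names charClasses, compress and step are assumptions made
for the example.

```sml
(* charClasses: map each character code to a class id, where two
   characters share a class iff their columns of the transition table
   are identical.  Returns the class map and the number of classes. *)
fun charClasses (trans : int vector vector) : int vector * int =
   let
      val numChars =
         if Vector.length trans = 0 then 0
         else Vector.length (Vector.sub (trans, 0))
      (* Column for character c: the next state from every state. *)
      fun column c =
         Vector.foldr (fn (row, acc) => Vector.sub (row, c) :: acc) [] trans
      (* Give each previously unseen column a fresh class number. *)
      fun loop (c, seen, n, acc) =
         if c = numChars then (Vector.fromList (rev acc), n)
         else
            let val col = column c
            in
               case List.find (fn (col', _) => col' = col) seen of
                  SOME (_, k) => loop (c + 1, seen, n, k :: acc)
                | NONE => loop (c + 1, (col, n) :: seen, n + 1, n :: acc)
            end
   in
      loop (0, [], 0, [])
   end

(* compress: build a table with one entry per (state, class) instead of
   one per (state, character). *)
fun compress (trans : int vector vector) =
   let
      val (classOf, numClasses) = charClasses trans
      (* Representative of class k: the first character in that class. *)
      fun rep k =
         let fun find c = if Vector.sub (classOf, c) = k then c else find (c + 1)
         in find 0 end
      val table =
         Vector.map
         (fn row => Vector.tabulate (numClasses, fn k => Vector.sub (row, rep k)))
         trans
   in
      (classOf, table)
   end

(* step: one lookup for the class, one for the transition. *)
fun step ((classOf, table), state, c) =
   Vector.sub (Vector.sub (table, state), Vector.sub (classOf, Char.ord c))
```

The class map still has one entry per character, which is cheap for
8-bit chars but is the part that gets large for full Unicode.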

> With regard to word8 vs. int8, isn't it a problem either way?  I.e.,
> does the FFI support unsigned char?

The FFI supports unsigned char via the SML type Word8.word.  See

	http://mlton.org/ForeignFunctionInterfaceTypes

> If so then that should be word8 while char should be int8 

Yes, that's what we do.