[MLton] WideChar

Henry Cejtin henry@sourcelight.com
Fri, 10 Dec 2004 16:22:19 -0600


With regard to your point:

    When  converting  a  char  reader  to  a widechar reader, it is sometimes
        useful to raise an exception on encountering a widechar and sometimes
        useful to return NONE.  We should provide both types of converters.

The point I was making in my comment was connected with the fact that not all
collections of bytes are legal UTF-8 objects.   In  that  case,  the  problem
isn't  encountering a widechar, it is encountering bytes which make no sense.

Note, none of this is really about chars, but about encodings;  that  is  why
the exception makes sense (to me).  It is caused exactly by the fact that the
UTF-8 decoding function, from bytes to Unicode, is only a partial function.
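
To make the partiality concrete, here is a minimal sketch in Python (not
MLton code; the names decode_or_raise and decode_or_none are made up for
illustration) of the two converter styles under discussion.  One raises on
bytes that are not legal UTF-8, the other returns None:

```python
from typing import Optional


def decode_or_raise(data: bytes) -> str:
    """Strict converter: raises UnicodeDecodeError on bytes that are not legal UTF-8."""
    return data.decode("utf-8")


def decode_or_none(data: bytes) -> Optional[str]:
    """Lenient converter: returns None instead of raising on illegal input."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

In both cases the failure comes from the byte sequence itself (for example a
lone 0xFF, or a truncated multi-byte sequence), not from the decoded
character being "wide".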

With  regards  to  the  multi-level  table  compression  for Lex, long ago in
building up a DFA string matcher I needed to store things compactly  so  that
speed  would  be  optimized.   I  just  used  the  following hack: divide the
character space into equivalence classes where 2  characters  are  equivalent
iff all transitions from all states are the same.  Then you just do one extra
lookup (character to equivalence class) followed by the  usual  stuff.   That
first lookup would be the one where there would be lots of sharing.

I never saw a case where this performed poorly.
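
A rough sketch of that equivalence-class hack, in Python rather than ML, and
assuming the DFA is stored as a dense table indexed as table[state][char]
(the function name compress_columns and the table layout are my assumptions
here, not the original code).  Two characters fall in the same class exactly
when their columns of the table are identical:

```python
from typing import Dict, List, Tuple


def compress_columns(table: List[List[int]]) -> Tuple[List[int], List[List[int]]]:
    """Return (class_of, small) such that
    small[state][class_of[char]] == table[state][char] for every state and char."""
    num_chars = len(table[0]) if table else 0

    # Characters with identical columns (same transition from every state)
    # get the same class id; ids are assigned in order of first appearance.
    class_ids: Dict[Tuple[int, ...], int] = {}
    class_of: List[int] = []
    for ch in range(num_chars):
        column = tuple(row[ch] for row in table)
        class_of.append(class_ids.setdefault(column, len(class_ids)))

    # Pick the lowest-numbered character of each class as its representative.
    representative: List[int] = [0] * len(class_ids)
    for ch in reversed(range(num_chars)):
        representative[class_of[ch]] = ch

    # The compressed table has one column per class instead of one per character.
    small = [[row[rep] for rep in representative] for row in table]
    return class_of, small


def step(class_of: List[int], small: List[List[int]], state: int, ch: int) -> int:
    """One DFA transition: the extra lookup (char -> class), then the usual one."""
    return small[state][class_of[ch]]
```

The class_of array is the only structure whose size depends on the size of
the character set; the per-state rows shrink to the number of classes.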

With regard to word8 vs. int8, isn't it a problem either way?  I.e., does the
FFI support unsigned char?  If so then that should be word8 while char should
be  int8 (except on some machines (MIPS?) where chars default to unsigned and
it is signed char that would be int8).
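
The underlying issue is just that the same 8 bits denote different numbers
depending on signedness.  A small Python illustration using the standard
struct module (this says nothing about what the MLton FFI actually does; the
variable names are only for illustration):

```python
import struct

raw = b"\xff"  # a single byte with all bits set

# "b" reads the byte as a C signed char, "B" as a C unsigned char.
as_int8 = struct.unpack("b", raw)[0]
as_word8 = struct.unpack("B", raw)[0]
```

Here as_int8 is -1 and as_word8 is 255, so whichever type plain char is
mapped to has to match the C compiler's choice of signedness.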

With  regard  to  locale dependency, isn't one of the huge points of Unicode
exactly that isAlpha and isPrint do NOT depend on things like locale?

I have been pimped MANY times by code that depends on the locale  because  it
pretty  much  ONLY  makes sense when the output is to a human or the input is
from a human.  When things are between programs it is a disaster if  the  two
programs don't agree.