[MLton-user] SML unicode support

Henry Cejtin henry@sourcelight.com
Wed, 5 Jan 2005 13:18:09 -0600


There is no way to casually handle UTF-8 (or even Unicode) characters in C.
The encodings UTF-8 and UTF-16 do not store one character in 8 or 16 bits.
That would clearly not be possible, because there are more than 256, and
even more than 65,536, Unicode characters. UTF-8 and UTF-16 are ways of
encoding characters as COLLECTIONS of 8-bit bytes or 16-bit chunks. Not all
characters take the same number of bytes/chunks: UTF-8 uses 1 to 4 bytes per
character, and UTF-16 uses one or two 16-bit chunks. UTF-32 lets all
characters be the same size (32 bits, or 4 bytes), but almost no one stores
them that way externally (in files) because of the large waste of space.
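
To make the variable width concrete, here is a minimal sketch in C of a
UTF-8 encoder for a single code point. The name utf8_encode is made up for
illustration and is not from any library; the bit layout is the standard
UTF-8 one. It returns how many bytes the character needs, which is 1 for
plain ASCII and up to 4 for code points above U+FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   `out` must have room for 4 bytes.  Returns the number of bytes
   written, or 0 if cp cannot be encoded (a surrogate, or a value
   above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 7 bits: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 16 bits: 1110xxxx + 2 more */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 21 bits: 11110xxx + 3 more */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, 'A' (U+0041) comes out as the single byte 41, while the euro
sign (U+20AC) comes out as the three bytes E2 82 AC.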

The expectation is that files will be in UTF-8 or UTF-16 and that, on
reading them, they will be converted to something more convenient. (Note
that if you keep a string in UTF-8 itself, you can't go to the N-th
character without walking through all the previous characters to see how
long they are.)
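
A rough sketch of both points in C follows. The function names (utf8_seq_len,
utf8_nth, utf8_to_utf32) are hypothetical, "character" here means a Unicode
code point, and the input is assumed to be a NUL-terminated UTF-8 string.
The code checks lead and continuation bytes but, to stay short, does not
reject overlong encodings or surrogates, so it is not a full validator.
utf8_nth has to walk from the start of the string, while the array filled in
by utf8_to_utf32 can be indexed directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the UTF-8 sequence starting with lead byte b,
   or 0 if b cannot start a sequence. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;               /* continuation byte or invalid lead byte */
}

/* Pointer to the n-th character (0-based) of s, or NULL if s has
   fewer than n+1 characters or is malformed.  Cost grows with n,
   because every earlier character must be measured first. */
const char *utf8_nth(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    while (*p != 0) {
        size_t len = utf8_seq_len(*p);
        if (len == 0)
            return NULL;
        if (n == 0)
            return (const char *)p;
        for (size_t i = 1; i < len; i++)
            if ((p[i] & 0xC0) != 0x80)     /* also catches an early NUL */
                return NULL;
        p += len;
        n--;
    }
    return NULL;
}

/* Decode s into at most cap 32-bit code points in out.  Returns the
   number written, or (size_t)-1 on malformed input.  After this,
   out[i] is the i-th character with no walking required. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap)
{
    /* Payload bits of the lead byte, indexed by sequence length. */
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;
    while (*p != 0 && count < cap) {
        size_t len = utf8_seq_len(*p);
        if (len == 0)
            return (size_t)-1;
        uint32_t cp = (uint32_t)(*p & lead_mask[len]);
        for (size_t i = 1; i < len; i++) {
            if ((p[i] & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
        }
        out[count++] = cp;
        p += len;
    }
    return count;
}
```

The trade-off is the one described above: the decoded array costs 4 bytes
per character in memory, in exchange for constant-time access to any
position.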