[MLton] WideChar?

Tue, 7 Dec 2004 18:52:51 +0100

On Tue, Dec 07, 2004 at 08:55:16AM -0500, Matthew Fluet wrote:
> > Is there a reason this hasn't already been added using Word16/32?
> The usual: lack of time, lack of requests, lack of expertise.

I have some clue about internationalization issues, but am no expert.

> Also, my (limited) understanding of Unicode and related 'things' suggests
> that it isn't as easy as identifying Word16 and Word32 with corresponding
> character structures; that, variable length encodings, etc. make things
> more difficult.

By variable length encodings I presume you mean UTF-8 and UTF-16?

The intention is that UCS2, UCS4, and ISO-8859-1 be the working / internal
(to a program) representation of strings. The reason is that it is otherwise
hard to do simple operations like: truncating strings, measuring character
length, iterating over strings, etc.

Note that UCS2 simply does not include the characters outside of the BMP
(Basic Multilingual Plane, aka Plane 0). When the so-called 'surrogate
characters' d800-dfff are used to encode UCS4 (aka UTF-32) code points
outside the BMP, Word16s are no longer simple UCS2, but rather UTF-16.
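
To make the UCS2 / UTF-16 distinction concrete, here is a minimal sketch of
surrogate-pair decoding. It assumes code points and 16-bit units are both
modelled as Word.word (a real implementation would presumably use Word16 and
Word32), and the names fromSurrogates and decodeUtf16 are made up for
illustration; none of this is existing MLton or Basis code.

```sml
fun isHighSurrogate (w : word) = w >= 0wxD800 andalso w <= 0wxDBFF
fun isLowSurrogate (w : word) = w >= 0wxDC00 andalso w <= 0wxDFFF

(* Combine a surrogate pair into one code point above U+FFFF. *)
fun fromSurrogates (hi : word, lo : word) : word option =
  if isHighSurrogate hi andalso isLowSurrogate lo
  then SOME (0wx10000
             + Word.<< (hi - 0wxD800, 0w10)
             + (lo - 0wxDC00))
  else NONE

(* Decode a list of UTF-16 units into code points.
   A lone or mismatched surrogate yields NONE rather than a guess. *)
fun decodeUtf16 (units : word list) : word list option =
  let
    fun go ([], acc) = SOME (rev acc)
      | go (u :: rest, acc) =
          if isHighSurrogate u then
            (case rest of
               lo :: rest' =>
                 (case fromSurrogates (u, lo) of
                    SOME cp => go (rest', cp :: acc)
                  | NONE => NONE)
             | [] => NONE)
          else if isLowSurrogate u then NONE
          else go (rest, u :: acc)
  in
    go (units, [])
  end
```

A list of units with no surrogates passes through unchanged, which is exactly
the case where UTF-16 and UCS2 coincide.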

Encodings such as UTF-8 and UTF-16 are intended to be used as external
representations where Unicode-friendly programs interface with other,
possibly non-Unicode applications. For example, XML files are usually
encoded in UTF-8 so that they may both include Unicode characters and
be edited by older text editors. These encodings have variable length.
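
A small sketch of why variable length makes "simple operations" awkward: with
UTF-8 held in an ordinary string, size counts bytes, not characters. The
helpers utf8Width and utf8Length below are hypothetical names, code points are
plain ints, and the input is assumed to be well-formed UTF-8 (no validation
is attempted).

```sml
(* Number of bytes UTF-8 needs for one code point in 0 .. 0x10FFFF. *)
fun utf8Width (cp : int) : int =
  if cp < 0x80 then 1
  else if cp < 0x800 then 2
  else if cp < 0x10000 then 3
  else 4

(* Count code points in a UTF-8 encoded string by skipping
   continuation bytes, which all have the bit pattern 10xxxxxx. *)
fun utf8Length (s : string) : int =
  let
    fun isContinuation c =
      Word.andb (Word.fromInt (Char.ord c), 0wxC0) = 0wx80
  in
    CharVector.foldl (fn (c, n) => if isContinuation c then n else n + 1) 0 s
  end
```

For instance, "caf\195\169" (the UTF-8 bytes for "café") has size 5 but
utf8Length 4, so truncating it at a byte offset can split a character.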

The other character sets are all 'obsolete'. =)
Libraries like libiconv allow conversion between most of these, however.

What I would see as the 'ideal' SML solution would be:

There are Char : CHAR, UCS2 : CHAR, and WideChar = UCS4 : CHAR structures.
You would use a LargeChar to pick the largest available Char type.
UCS2String and LargeString should also exist.
The values #"a" and "dfgdfsg" should be polymorphic (overloaded), just like
the literal 5.

It should be made explicit that Char is ISO-8859-1 (so Char.ord is Unicode).
It should be made explicit that SML source code files are by default UTF-8 
 (to permit Unicode characters inside strings).

There is an ICONV signature including at least:
  type string
  type char
  exception UnknownCharset of string

  (* The leading string argument names the charset. *)
  val decode: string -> (char, 'a) reader -> (LargeChar.char, 'a) reader
  val encode: string -> LargeChar.char -> string

  val registerDecoder: string -> ((char, 'a) reader -> (LargeChar.char, 'a) reader) -> unit
  val registerEncoder: string -> (LargeChar.char -> string) -> unit

Two structures for representing the source string type:
  IConv, IConvUCS2 (string = UCS2String)

encode and decode take a string naming the source charset.
If the runtime does not support the charset, UnknownCharset is raised.
The runtime promises to provide "UTF-{8,16,32}" and "ISO-8859-1".
The register* functions allow the user to add more charsets to the runtime.
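
As a rough illustration of the registration idea, the sketch below keeps
encoders in an association list keyed by charset name. Everything in it is an
assumption made for the example: the structure name IConvSketch, the use of
plain ints in place of LargeChar.char, and the choice to raise Overflow for
code points the charset cannot represent. Decoders are left out to keep it
short.

```sml
structure IConvSketch =
struct
  exception UnknownCharset of string

  (* An encoder turns one code point into its external byte string. *)
  type encoder = int -> string

  (* Association list of registered encoders, newest first. *)
  val encoders : (string * encoder) list ref = ref []

  fun registerEncoder name enc = encoders := (name, enc) :: !encoders

  (* Look up the encoder for a charset name, or raise UnknownCharset. *)
  fun encode name =
    case List.find (fn (n, _) => n = name) (!encoders) of
      SOME (_, enc) => enc
    | NONE => raise UnknownCharset name

  (* ISO-8859-1: code points 0 .. 255 map to a single byte. *)
  fun latin1 cp =
    if 0 <= cp andalso cp <= 255 then String.str (Char.chr cp)
    else raise Overflow

  val () = registerEncoder "ISO-8859-1" latin1
end
```

With this, IConvSketch.encode "ISO-8859-1" 233 gives the one-byte string
"\233", and asking for a charset nobody registered raises UnknownCharset.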

Sadly, I don't think MLton can make use of iconv() to do this.
On the other hand, I think SML will yield faster and better code anyways.

Maybe add toLarge and fromLarge in CHAR and STRING signatures like INT.
Raising Overflow seems appropriate on conversion failure (where you tried to
pack a non-BMP code into UCS2, or non ISO-8859-1 code into Char).
This might not be needed since CHAR.ord and CHAR.chr work more or less.
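
A minimal sketch of what such narrowing conversions could look like, assuming
code points are plain ints and using the hypothetical names toChar and toUCS2.
Rejecting the surrogate range in toUCS2 is my own assumption, not something
the Basis Library specifies. Note that Char.chr on its own raises Chr, not
Overflow, for an out-of-range argument.

```sml
(* Narrow a code point to an ISO-8859-1 Char, raising Overflow on failure. *)
fun toChar (cp : int) : char =
  if cp < 0 orelse cp > 255 then raise Overflow
  else Char.chr cp

(* Narrow a code point to a UCS2 unit (modelled as Word.word here). *)
fun toUCS2 (cp : int) : word =
  if cp < 0 orelse cp > 0xFFFF then raise Overflow      (* outside the BMP *)
  else if 0xD800 <= cp andalso cp <= 0xDFFF
  then raise Overflow                                   (* surrogate, not a character *)
  else Word.fromInt cp
```

So toChar 233 succeeds, while toChar 0x20AC and toUCS2 0x1F600 both raise
Overflow.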

The whole CHAR.is* family should never have been dumped inside CHAR.
The note that these are locale dependent under WideChar makes it even worse.
The idea of a per-process locale is a C screw-up that even C++ fixes.
The best way to handle this in SML is not clear (no OOP - hrm).

What is the idea behind WideTextPrimIO?
I don't know how to deal with all of the scan functions that expect a
Char.char. Suggestions? I think they should work with WideChar too.

How does one make official changes to the SML Basis Library anyways?

For MLton, I'd be happy to provide a MLton.IConv and MLTON_ICONV, WideChar,
WideString, and MLton.UCS2. However, these things really should be fixed in
the standard too.

If I implemented WideChar in MLton, I would at present ignore the comment
that WideChar.is* should be localized; that's just wrong. I would need help
making #"g" and "Dgfsg" polymorphic; also, making source files UTF-8 is
probably beyond my present understanding of MLton.

Fixing characters would be the first step.
Defining SML source files as UTF-8 should have been done at the outset.
I only hope it's not too late to do now ...

The localization issues of CHAR.is* and Date are things I know less about.

-- 
Wesley W. Terpstra