[MLton] Unicode / WideChar

Wesley W. Terpstra wesley@terpstra.ca
Mon, 21 Nov 2005 14:08:28 +0100


On Nov 21, 2005, at 1:42 PM, skaller wrote:
>> Also, Char{1,2,4}.{<,toUpper,isAlpha,...} are all locale
>> *IN*-dependent.
>
> These are Unicode specific functions. By definition
> they're locale independent.

Florian is correct; some casing is locale dependent.
However, Unicode does specify a 'default' casing policy
which is what the Char{1,2,4} structures will adopt.
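
To make the difference between default and locale-tailored casing concrete, here is a minimal sketch in Python (chosen purely for illustration; this is not MLton code). It relies on Python's str.upper applying Unicode's default, locale-independent case mapping. The helper names upper_default and upper_turkish, and the small TURKISH_UPPER table, are hypothetical and cover only the best-known Turkish exception.

```python
# Hypothetical tailoring table: in Turkish, lowercase "i" uppercases to
# U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) instead of plain "I".
TURKISH_UPPER = {"i": "\u0130"}

def upper_default(s):
    """Unicode default casing: the same result in every locale."""
    return s.upper()

def upper_turkish(s):
    """Default casing plus the Turkish exception for 'i' (sketch only)."""
    return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in s)

default_result = upper_default("istanbul")   # plain capital I
turkish_result = upper_turkish("istanbul")   # dotted capital I first
```

The default policy is what a locale-independent toUpper can offer; anything like the Turkish rule would have to live in a separate, locale-aware layer.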

> I think some care should be taken to separate the
> Unicode functions from the String data structures.

I am obligated by STRING and CHAR to provide
a few functions in Char{1,2,4} and String{1,2,4}.
Everything I put in there will be locale independent.

For the rest of the Unicode stuff, and locale stuff,
you will use i18n.mlb (internationalization.mlb).

> I would actually argue, that Char? is wrong.
> They're not chars, they're integers, and they
> are not associated with any particular code set.

I disagree; a Char2 differs from Int16 in exactly
the fact that a Char2 has something to do with
a particular code set. That's why CHAR has all
those isAlpha, isAlphaNum, etc., functions.

> The reason is: you could write these functions for
> a different character set. There is a whole swag
> of archaic 8 bit character sets for example.

None of which will be supported by MLton.

If you want to work with another character set,
you need to import that data via charset converter
into WideChar (Char4) and work with it there.

> for example to decode say BIG5 and convert to Unicode,
> you will have to go through hoops .. and won't be able
> to do it at all without a BIG5_char abstraction.

Sure you can. Run it through a charset converter and
read the result into a WideChar/WideString. Unicode is
designed as a superset of the other code sets.
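
As an illustration of the "convert first, then work in Unicode" approach, here is a small Python sketch (Python only for demonstration; the helper name to_code_points is hypothetical). It uses Python's built-in "big5" codec as the charset converter and returns a list of code points, which plays the role of a wide string here.

```python
def to_code_points(raw, charset):
    """Decode bytes in a legacy charset and return Unicode code points."""
    return [ord(ch) for ch in raw.decode(charset)]

# Build some BIG5 input by encoding two CJK characters (U+4E2D, U+6587),
# then bring it back into Unicode through the converter.
big5_bytes = "\u4e2d\u6587".encode("big5")
code_points = to_code_points(big5_bytes, "big5")
```

No BIG5-specific character type appears anywhere: the legacy encoding exists only as bytes at the boundary, and everything after the conversion works on Unicode code points.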

Unicode keeps track of a whole load of extra
properties like isUpper, but specific to particular
scripts and languages: isKatakana for Japanese, say.
These will all be available through i18n.mlb.

> BTW: I am curious about the Unicode database implementation:
> the database is BIG. How are you going to represent this
> efficiently? (Eg,  case mapping function)

Case mappings I haven't done yet. :-)

Property mapping is easier because Unicode nicely
grouped code points with similar properties together,
so a table of ranges can cover them compactly.
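
One common way to exploit that grouping is a sorted table of ranges searched by bisection. The sketch below is in Python and is only an assumption about how such a table could look, not a description of the MLton implementation. The names build_ranges and lookup, and the default limit of 0x3100, are hypothetical; the property used is the general category from Python's unicodedata module.

```python
import bisect
import unicodedata

def build_ranges(prop, limit=0x3100):
    """Collapse runs of consecutive code points that share a property
    value into parallel lists: range start points and their values."""
    starts, values = [], []
    for cp in range(limit):
        value = prop(chr(cp))
        if not values or values[-1] != value:
            starts.append(cp)      # a new run begins at this code point
            values.append(value)
    return starts, values

def lookup(starts, values, cp, limit=0x3100):
    """Find the run containing cp by binary search over the run starts."""
    if not 0 <= cp < limit:
        raise ValueError("code point outside the table")
    return values[bisect.bisect_right(starts, cp) - 1]

starts, values = build_ranges(unicodedata.category)
category_of_A = lookup(starts, values, ord("A"))
```

Each lookup costs a logarithmic number of comparisons in the number of runs, and the table stores one entry per run rather than one per code point.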