[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

John Reppy jhr@cs.uchicago.edu
Tue, 29 Nov 2005 19:10:49 -0600


I think that this proposal is too heavyweight for its usefulness.
The Basis design assumes that there is an implementation of the TEXT
signature for each char/string/substring type, so you'll have all the
arrays, vectors, slices, etc. for each type.  Furthermore, there need
to be conversion functions between types and perhaps multiple versions
of TextIO.

A different strategy (one that we considered at one point in the Basis
design, but then abandoned for reasons that I cannot remember) is to
separate the notion of character classification from representation.
For example, one could have two types of char (Char.char and
WideChar.char), but multiple classification modules (e.g., Ascii,
ISO8859_1, ...) that provide interpretations of these types.  Functions
like isAlpha would be part of these classification modules.
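
To make the idea concrete, here is a rough sketch, written in Python
rather than SML only so that it can be run as-is.  Everything in it is
an assumption for illustration: the class names mirror the module
names above, a character is modelled as a bare integer code point, and
the Latin-1 letter test is deliberately simplified.  It is not a
proposed Basis interface.

```python
# Hypothetical sketch: one representation (an integer code point) and
# several classification "modules" that each interpret it differently.

class Ascii:
    """Interprets code points 0x00-0x7F; anything else is not a letter."""

    @staticmethod
    def is_alpha(c: int) -> bool:
        # A-Z or a-z
        return 0x41 <= c <= 0x5A or 0x61 <= c <= 0x7A


class ISO8859_1:
    """Interprets code points 0x00-0xFF, adding the Latin-1 letters."""

    @staticmethod
    def is_alpha(c: int) -> bool:
        if Ascii.is_alpha(c):
            return True
        # Simplified: treat 0xC0-0xFF as letters, except the
        # multiplication sign (0xD7) and the division sign (0xF7).
        return 0xC0 <= c <= 0xFF and c not in (0xD7, 0xF7)


e_acute = 0xE9  # the same char value, seen through two classifications
print(Ascii.is_alpha(e_acute), ISO8859_1.is_alpha(e_acute))
```

The point is that the value 0xE9 has a single representation, and
whether it counts as alphabetic depends only on which classification
module is asked.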

	- John

On Nov 29, 2005, at 4:36 PM, Stephen Weeks wrote:

>
>> Keeping with the mindset that a structure matching CHAR is in fact a
>> character set, not just a bag of integers, how about this:
>>
>> Char (8 bit, high ascii 'undefined') <-- required (raises Chr for
>> values beyond FF)
>> Ascii (7 bit) <-- required (raises Chr for values beyond 7F)
>>
>> Iso8859_1 (8 bit) <-- optional (raises Chr for values beyond FF)
>> Ucs2 (16 bit) <-- optional (raises Chr for surrogates and values
>> beyond FFFF)
>> WideChar (must be Unicode) <-- optional (raises Chr for surrogates
>> and values beyond 10FFFF)
>
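
For reference, the range checks in the quoted list could be sketched
roughly as follows.  This is Python rather than SML, purely so it can
be run; make_chr and the *_chr names are made up for illustration, the
Chr class merely stands in for the Basis exception of the same name,
and a character is again modelled as a plain integer.

```python
# Hypothetical sketch of the per-structure chr range checks quoted above.

class Chr(Exception):
    """Stand-in for the Basis Chr exception."""


def make_chr(max_ord, ban_surrogates):
    """Build a chr-like function for one character set (illustrative only)."""
    def chr_(i):
        if i < 0 or i > max_ord:
            raise Chr(i)
        # Surrogate code points D800-DFFF are rejected where requested.
        if ban_surrogates and 0xD800 <= i <= 0xDFFF:
            raise Chr(i)
        return i
    return chr_


char_chr = make_chr(0xFF, False)          # Char: values beyond FF raise Chr
ascii_chr = make_chr(0x7F, False)         # Ascii: values beyond 7F raise Chr
iso8859_1_chr = make_chr(0xFF, False)     # Iso8859_1: values beyond FF raise Chr
ucs2_chr = make_chr(0xFFFF, True)         # Ucs2: surrogates and > FFFF raise Chr
widechar_chr = make_chr(0x10FFFF, True)   # WideChar: surrogates and > 10FFFF raise Chr
```
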
> I like this proposal.
>
> As to whether \U escapes should accept 6 or 8 hex digits, I lean
> towards 8 because it seems possible that in the future we will need
> more than 6 digits, and I wouldn't want to break old code or to
> support 6 and 8 simultaneously.  Also, we have \u for the common case
> of 4 digits.  Finally, with source files allowed to be UTF-8, \U
> escapes should be pretty rare.
>
>> If we are banning values beyond 10FFFF, then perhaps we should also
>> ban values between D800-DFFF which may not appear in a conforming
>> UTF-32 string.
>
> Yes, that makes sense if we are really thinking of WideChar as
> Unicode.
>
>> One question is whether or not the Ucs2/Iso8859_1/Ascii structures
>> should have all of the extra structures that go with them
>> (Ucs2String, Ucs2Vector, Ucs2Substring, ...).
>
> One way to go would be to export functors that let people build these
> if they really want them.