[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

Dave Berry dave@berrybental.me.uk
Sun, 27 Nov 2005 21:24:26 +0000


Hi Wesley,

It's good to see someone working on Unicode & SML.  IMO, this area of the 
Basis is likely to need some tweaks (at least) as we gain more practical 
experience.

Your first question is about the character set of the Char structure.  The 
idea behind this structure is that it should be the locale-independent 
7-bit ASCII characters, with the other 128 characters having no special 
semantics - analogous to the "C" locale.  For other character sets, you 
need to use WideChar.  This was largely a pragmatic decision, so that we 
could rely on one locale-independent character set that was easy to 
implement, while still providing locale support for those who needed it.
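
To make that concrete, here's a small sketch (assuming a Basis-conformant
Char): the classification functions attach meaning only to the 7-bit range,
so a Latin-1 letter above chr(127) is not alphabetic as far as Char is
concerned.

    (* Char gives special meaning only to 7-bit ASCII; chr 0xE9 is
       e-acute in Latin-1, but Char attaches no semantics to it *)
    val _ = Char.isAlpha #"a"             (* true *)
    val _ = Char.isAlpha (Char.chr 0xE9)  (* false *)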

You are right that the basis does not specify locale parameters or how to 
set global locales.  It does use a global model - the perceived advantage
being that the same code could be run in different locales just by changing 
the environment, rather than changing the code.  Setting the locale was 
left for either an extension to the Basis or for the environment to specify.

You are also right that (WideChar.isX o WideChar.chr o Char.ord) != 
Char.isX, but only if (a) the character set used for WideChar is not a 
superset of 7-bit ASCII, or (b) the character tested is > chr(127), which 
is outwith the defined range of meaningful values for Char.  If you are 
dealing with ISO-8859-1 (say) then Char is by definition inadequate.
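
Spelled out as code (a sketch, assuming both structures are present), the
property in question is:

    (* holds whenever WideChar's character set extends 7-bit ASCII
       and c is within Char's meaningful range, i.e. <= chr 127 *)
    fun agreesOnAscii c =
      WideChar.isAlpha (WideChar.chr (Char.ord c)) = Char.isAlpha c

For a Unicode WideChar this agreement is automatic, since Unicode's first
128 code points are exactly 7-bit ASCII.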

Underlying your whole post is the assumption that WideChar characters must 
be using Unicode.  This is not an assumption that the Basis makes - it 
allows for other wide character sets.  The WideChar structure was modelled 
on the C wchar_t type, which in turn was designed to support a 
character-set independent approach to handling international characters, as 
opposed to the universal character set approach of Unicode.  I don't know 
whether C still takes this approach or whether it's the best one to take, 
but it may explain why the structure is specified as it is.

If I understand your proposal correctly, you are suggesting that we make 
WideChar always be Unicode, make the existing WideChar use the default 
categorisation of Unicode, and add a new module for locale-dependent 
operations.  That seems a plausible approach.  It really needs someone to
implement it and try it in anger.
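
For concreteness, here is a rough sketch of the shape such a module might
take - the name LOCALE_CHAR and everything in it are my invention, not
something from your proposal:

    signature LOCALE_CHAR =
      sig
        type locale
        val getLocale : string -> locale option   (* e.g. "pl_PL" *)
        val isAlpha   : locale -> WideChar.char -> bool
        val isUpper   : locale -> WideChar.char -> bool
        val toUpper   : locale -> WideChar.char -> WideChar.char
        val collate   : locale
                        -> WideString.string * WideString.string -> order
      end

The point being that the existing WideChar functions keep Unicode's default
categorisation, and anything locale-sensitive takes an explicit locale
value.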

Perhaps it would make sense to have an 8-bit equivalent of the 
locale-dependent module as well?  Then programmers could explicitly support 
ISO-8859-1 (and -2, -3, etc.).

I'm not familiar with isNumber, but supporting it seems a reasonable
suggestion.  Which characters are included in isNumber but not isDigit?

I think we can remove the requirement that isAlpha = isLower + isUpper for 
WideChar.  I assume the rationale for this is that some languages don't 
have the concept of case?
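
As one concrete case (assuming WideChar == Unicode and that isAlpha follows
Unicode's notion of a letter): U+4E00, a CJK ideograph, is a letter with no
case at all.

    val c = WideChar.chr 0x4E00
    val alpha = WideChar.isAlpha c                            (* true  *)
    val cased = WideChar.isLower c orelse WideChar.isUpper c  (* false *)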

It may be pragmatic to specify Char to be ISO-8859-1, to match Unicode (and 
HTML).  However, I'm against it because it gives people a misplaced 
expectation that it significantly addresses the 
internationalisation/localisation question.  E.g. I think your statement 
that ISO-8859-1 covers most of the "major" European languages is culturally 
biased.  Even if we define "major" as the main official languages of states 
in the European Union, several are not covered (e.g. Polish, Czech, Greek, 
Slovak, Maltese, Latvian, Lithuanian, ...).  I think it's worth noting that 
Poland is one of the larger EU states.  (And as I live in Scotland, I'll 
mention the Celtic languages of Gaelic and Welsh, while conceding that
these are spoken by small populations).  I'd rather keep Char as 7-bit ASCII.

Moving on to your section 2, I believe that the reason that chr and ord 
deal in ints is purely for backwards compatibility.  So I guess that having 
chr raise an exception for values > 0x10FFFF would work OK when WideChar ==
Unicode.
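
A minimal sketch of the check (assuming WideChar == Unicode, and reusing
the Chr exception that chr already raises for out-of-range arguments):

    (* reject anything outside the Unicode code space; a real
       implementation might also want to reject the surrogate
       range D800-DFFF, which are not scalar values *)
    fun uchr (i : int) : WideChar.char =
      if i < 0 orelse i > 0x10FFFF then raise Chr
      else WideChar.chr i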

There's nothing preventing an implementation from providing other
structures that match CHAR - they just won't be portable if they rely on
compiler magic.  I'd have thought we could consider a Char16 structure if 
enough people are interested.
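
For example (purely hypothetical, and assuming the implementation provides
Word16), a user-level Char16 could start along these lines, with the caveat
that there would be no Char16 literals without compiler support:

    structure Char16 =
      struct
        type char = Word16.word   (* 16-bit code units *)
        val maxOrd = 0xFFFF
        fun ord (c : char) : int = Word16.toInt c
        fun chr (i : int) : char =
          if i < 0 orelse i > maxOrd then raise Chr
          else Word16.fromInt i
        (* ...plus the comparison, classification and scanning
           functions that the CHAR signature requires... *)
      end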

Your suggestions on parsing and serialisation seem reasonable to me.

If we allow source files that are encoded in UTF-8, what effect would this 
have on portability to compilers that don't use Unicode?  Or, to put this 
another way, what would be the minimum amount of support that an 
implementation would have to provide for UTF-8, and how much work would it 
be to implement?
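
To give a feel for the size of the job, here is a rough sketch of the core
of it - decoding UTF-8 bytes to code points.  It omits the validation
(overlong forms, stray continuation bytes, range checks) that a real front
end would need:

    fun decodeUtf8 (bytes : Word8.word list) : int list =
      let
        (* fold one continuation byte into the accumulated code point *)
        fun cont (w, acc) = acc * 0x40 + Word8.toInt w mod 0x40
        fun go [] = []
          | go (b :: bs) =
              let
                val n = Word8.toInt b
              in
                if n < 0x80 then n :: go bs                   (* 1 byte  *)
                else if n < 0xE0 then                         (* 2 bytes *)
                  (case bs of
                     c1 :: rest => cont (c1, n mod 0x20) :: go rest
                   | [] => raise Fail "truncated sequence")
                else if n < 0xF0 then                         (* 3 bytes *)
                  (case bs of
                     c1 :: c2 :: rest =>
                       cont (c2, cont (c1, n mod 0x10)) :: go rest
                   | _ => raise Fail "truncated sequence")
                else                                          (* 4 bytes *)
                  (case bs of
                     c1 :: c2 :: c3 :: rest =>
                       cont (c3, cont (c2, cont (c1, n mod 0x08))) :: go rest
                   | _ => raise Fail "truncated sequence")
              end
      in
        go bytes
      end

    (* e.g. decodeUtf8 (map Word8.fromInt [0xC3, 0xA9]) = [0xE9] *)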

Thank you for taking the time to write up your thoughts.  I hope my reply 
has helped to explain the rationale for the current design.

Best wishes,

Dave.