[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

Wesley W. Terpstra wesley@terpstra.ca
Tue, 29 Nov 2005 16:04:11 +0100


I've re-ordered parts of your email to address things with some  
linearity. :-)

On Nov 27, 2005, at 10:24 PM, Dave Berry wrote:
> If I understand your proposal correctly, you are suggesting that we  
> make WideChar always be Unicode, make the existing WideChar use the  
> default categorisation of Unicode, and add a new module for locale- 
> dependent operations.

That's exactly what I am proposing.
While we're at it, there needs to be a charset encoder/decoder  
included too. Locale-specific date/time/number/currency formatting  
would also be good, as would something like gettext.

> Perhaps it would make sense to have an 8-bit equivalent of the  
> locale-dependent module as well?  Then programmers could explicitly  
> support ISO-8859-1 (and -2, -3, etc.)

I think this would add little. You can decode your input ISO-8859-x/
whatever into Unicode, and then work with it there. This is more
flexible anyway, since your code will then work with any character set.
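
For ISO-8859-1 in particular that decoding step is nearly free, since
its code points coincide with the first 256 Unicode code points. A
minimal sketch, assuming WideChar is Unicode (the name latin1ToWide is
just for illustration):

(* Lift an ISO-8859-1 string into a WideString; ISO-8859-1 code points
   coincide with U+0000..U+00FF. *)
val latin1ToWide : string -> WideString.string =
    WideString.implode o List.map (WideChar.chr o Char.ord) o String.explode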

> I'm not familiar with isNumber, but it looks like a reasonable
> suggestion to support it.  Which characters are included in
> isNumber but not isDigit?

In ISO-8859-1, there are the 1/4 and 1/2 symbols. Also, other
languages include stranger concepts of number that aren't necessarily
decimal (Roman numerals, for example), and thus can't be called a
digit (base 10).

> I think we can remove the requirement that isAlpha = isLower +  
> isUpper for WideChar.  I assume the rationale for this is that some  
> languages don't have the concept of case?

Yes.

> I believe that the reason that chr and ord deal in ints is purely  
> for backwards compatibility.  So I guess that having chr raise an  
> exception for values > 10FFFF would work OK, when WideChar == Unicode.

That's what I will do then.

If we are banning values beyond 10FFFF, then perhaps we should also  
ban values between D800-DFFF which may not appear in a conforming  
UTF-32 string.
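
Concretely, the values that remain are the Unicode scalar values. A
sketch of the test chr would apply, assuming WideChar is Unicode:

(* The values a Unicode-only WideChar.chr would accept: everything up
   to 10FFFF except the surrogate range D800-DFFF. *)
fun isScalarValue (i : int) : bool =
    i >= 0 andalso i <= 0x10FFFF
    andalso not (i >= 0xD800 andalso i <= 0xDFFF)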

I also think the restriction that WideChar be a multiple of 8 bits  
should be removed. It serves no real purpose, AFAICT, so why limit an  
implementer's choices?

> If we allow source files that are encoded in UTF-8, what effect  
> would this have on portability to compilers that don't use  
> Unicode?  Or, to put this another way, what would be the minimum  
> amount of support that an implementation would have to provide for  
> UTF-8, and how much work would it be to implement?

Compilers without Unicode support already do the right thing:
complain if given high ASCII. UTF-8 includes ASCII as a subset, so
any file that uses only ASCII will work under both sorts of
compilers. If you have high ASCII, it means you have included
Unicode values in your strings, which means that the SML source file
requires Unicode. If the compiler doesn't support Unicode, then this
is grounds for an error.

So, as far as I can see, the minimum work to implement this is  
changing the error message from something like 'high ascii forbidden'  
to 'this compiler doesn't support Unicode'.

If you want to add Unicode support, then you have a working WideChar/ 
String. Decoding UTF-8 into a WideChar is about 10-20 lines, so  
that's not much additional effort either. The real work is getting  
MLlex to support such a large character set. However, that's only  
needed for Unicode-enabled SML compilers.
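
To give an idea of the size, here is a rough sketch of such a
decoder. It assumes WideChar is Unicode and trusts its input:
overlong forms, surrogates and truncated sequences are not checked.

(* Decode a UTF-8 byte list into WideChar values (sketch only). *)
fun decodeUtf8 (bytes : Word8.word list) : WideChar.char list =
    let
        fun cont (b, acc) = acc * 64 + (Word8.toInt b - 0x80)
        fun go [] = []
          | go (b :: rest) =
            let
                val n = Word8.toInt b
                val (len, first) =
                    if n < 0x80 then (0, n)
                    else if n < 0xE0 then (1, n - 0xC0)
                    else if n < 0xF0 then (2, n - 0xE0)
                    else (3, n - 0xF0)
            in
                WideChar.chr (List.foldl cont first (List.take (rest, len)))
                :: go (List.drop (rest, len))
            end
    in
        go bytes
    end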

> Your first question is about the character set of the Char  
> structure.  The idea behind this structure is that it should be the  
> locale-independent 7-bit ASCII characters, with the other 128  
> characters having no special semantics - analogous to the "C" locale.

The problem is that Char.is* assigns semantics to the high ASCII
characters. At least for me, the distinction between a simple Word8
and a Char is that a Char carries a character set with it. Observe
the difference in their interfaces. The fact that all those functions
are defined for 'char' means that you *have* assigned semantic
meaning to high ASCII. You've defined that all high ASCII is: not a
control and not printable.

> It may be pragmatic to specify Char to be ISO-8859-1, to match  
> Unicode (and HTML).  However, I'm against it because it gives  
> people a misplaced expectation that it significantly addresses the  
> internationalisation/localisation question.  E.g. I think your  
> statement that ISO-8859-1 covers most of the "major" European  
> languages is culturally biased.

I concede the point that ISO-8859-1 is inadequate for Europe. :-)

My primary motivation for specifying that Char = ISO-8859-1 was that  
I wanted the character set to be a subset of Unicode. I thought that  
this would be the path of 'least surprise' for a programmer migrating  
his code from Char to WideChar. However, changing this definition  
could break existing SML programs, which expect isAlpha to return  
false above 7F.

> Underlying your whole post is the assumption that WideChar  
> characters must be using Unicode.  This is not an assumption that  
> the Basis makes - it allows for other wide character sets.  The  
> WideChar structure was modelled on the C wchar_t type, which in  
> turn was designed to support a character-set independent approach  
> to handling international characters, as opposed to the universal  
> character set approach of Unicode.  I don't know whether C still  
> takes this approach or whether it's the best one to take, but it  
> may explain why the structure is specified as it is.

Ok. This makes sense, and clears up a lot of the background reasoning  
for me.

If I were to predict the future, I would say that Unicode is the  
ASCII of tomorrow.
The basis already grants privileged status to ASCII, so it should do
the same for Unicode.

That said, I agree that it is useful to allow for extra structures to  
match the CHAR interface. The Russians might like their SML  
implementations to include a KOI8R structure. However, after lifting   
the 'multiple of 8 bit' restriction, I'd like to impose a different  
restriction: 'must be fixed width'. This means that UTF-8 and UTF-16  
may not match the CHAR signature.

With respect to WideChar, it intuitively appeared to me that this was
trying to be like LargeInt, i.e. something which could contain all
integers, or in this case, all characters. This is exactly what
Unicode tries to be: a superset of all character sets. Therefore, I
would argue that specifying that WideChar MUST be Unicode is a
perfectly natural thing to do.

> I'd rather keep Char as 7-bit ASCII.

Then why not raise Chr if you try to put in a value above 0x7F?

> There's nothing preventing any implementation from implementing
> other structures that match CHAR - they just won't be portable if
> they rely on compiler magic.  I'd have thought we could consider a
> Char16 structure if enough people are interested.

Keeping with the mindset that a structure matching CHAR is in fact a  
character set, not just a bag of integers, how about this:

Char (8 bit, high ASCII 'undefined') <-- required (raises Chr for values beyond FF)
Ascii (7 bit) <-- required (raises Chr for values beyond 7F)

Iso8859_1 (8 bit) <-- optional (raises Chr for values beyond FF)
Ucs2 (16 bit) <-- optional (raises Chr for surrogates and values beyond FFFF)
WideChar (must be Unicode) <-- optional (raises Chr for surrogates and values beyond 10FFFF)
... plus any number of locale-specific charsets the implementor likes.

We have nice subset behaviour for Ascii, Iso8859_1, Ucs2, WideChar.
Char works as it always has, and is explicitly NOT a subset of the  
others, though it agrees for all values which have isAscii = true.  
This fact should be documented in bright flashing red.
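
To make the 'raises Chr' behaviour concrete, here is a sketch of what
the Ascii structure might look like. Only two functions are shown; a
real structure would of course match all of CHAR.

(* Ascii shares Char's representation, but chr rejects values above
   0x7F. *)
structure Ascii =
struct
    type char = Char.char
    val ord = Char.ord
    fun chr (i : int) : char =
        if i < 0 orelse i > 0x7F then raise Chr else Char.chr i
end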

One question is whether or not the Ucs2/Iso8859_1/Ascii structures  
should have all of the extra structures that go with them  
(Ucs2String, Ucs2Vector, Ucs2Substring, ...).

> You are right that the basis does not specify locale parameters or
> how to set global locales.  It does use a global model - the
> perceived advantage being that the same code could be run in
> different locales just by changing the environment, rather than
> changing the code.

I think that this is a very bad thing to do.

First, it is rarely as simple as just changing an environment  
variable unless the application developer has put effort into  
internationalization. Simply using WideChar does not indicate that  
the program has been internationalized; I can think of several non- 
internationalized applications which would need to use WideChar. If a  
program has not been carefully internationalized, it is quite  
possible that changing the environment locale will render the  
software inoperable or introduce mysterious bugs (imagine if bash  
looked at the locale variable when running shell scripts; how many  
would break if the number formatting changed?). If a programmer is  
using a method that depends on the environment, I think this should  
be made very clear in the interface to help prevent such problems.

Second, only single-user applications have a single locale; if the
application is a server, then it needs to be able to operate in a
different locale for each user. Furthermore, even single-user
applications may need to operate with multiple locales. For example,
a login program (like gdm) needs to allow a user to select his
language/locale as part of the login procedure.

> Setting the locale was left for either an extension to the Basis or  
> for the environment to specify.

Allowing the global locale to be changed is an even more frightening
prospect. Doesn't this deeply conflict with the design principles of
a functional programming language? This would predicate large
components of the software off of what amounts to a mutable global
variable. Suppose I memoized some function of mine which internally
used the is* functions. How does a global locale switch affect this?
Hidden dependencies are simply bad.

I see no problem with having an immutable startup locale which is
specified by the environment. This is similar in some respects to a
command-line argument. However, I would argue that nothing should be
predicated off of it unless specifically instructed by the
programmer, e.g.:

signature LOCALE = sig
    type locale
    val initialLocale : locale
    val lookupLocale : string -> locale

    structure CharCategories : sig
        val isSpace : locale * WideChar.char -> bool
        ...
    end
end

This addresses all of my concerns, and still provides the
functionality you would have gotten from a C-style global locale.
BTW, you will notice this is the same approach taken by C++, which
also recognized the problem with a global hidden locale and chose to
throw out the C scheme. Of course, rather than a pair of arguments,
it has locale objects, but this amounts to more or less the same
thing.
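
To illustrate, assuming an implementation provides a structure Locale
matching LOCALE, a caller might write something like the following
(the function itself is purely illustrative):

(* Strip leading spaces using a locale chosen per call, not a hidden
   global. *)
fun dropLeadingSpace (loc : Locale.locale) (s : WideString.string) =
    WideSubstring.string
        (WideSubstring.dropl
             (fn c => Locale.CharCategories.isSpace (loc, c))
             (WideSubstring.full s))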

> Your suggestions on parsing and serialisation seem reasonable to me.

You understand why I listed it as an incompatible change?
What I suggested meant that 0xDA, previously converted to '\xDA', is
now converted to '\xC3\x9A'.
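
To spell out the arithmetic: 0xDA falls in the two-byte range of
UTF-8, and the two bytes are built as below (a sketch of the two-byte
case only).

(* Two-byte UTF-8 encoding for code points 0x80-0x7FF: 110xxxxx
   10xxxxxx.  For 0xDA this yields (0xC3, 0x9A), i.e. "\xC3\x9A". *)
fun utf8TwoByte (cp : word) : Word8.word * Word8.word =
    ( Word8.fromInt (Word.toInt (Word.orb (0wxC0, Word.>> (cp, 0w6))))
    , Word8.fromInt (Word.toInt (Word.orb (0wx80, Word.andb (cp, 0wx3F)))) )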

Actually, I just did some poking around, and found this:
#include <stdio.h>
#include <wchar.h>

int main() {
   wchar_t x = L'\U12345678';   /* C99 universal character name escape */
   printf("%x\n", (int)x);
   return 0;
}

So, forget the bit about toCString being a problem. C99 adds \u and
\U. For consistency, I suppose SML should accept \U12345678 instead
of \U123456, even though the first two digits must be zero, since the
value has to be less than 0x110000.

Anyways, my new proposal for CHAR no longer has any points which  
would break compatibility with existing SML programs. That's a pretty  
big improvement from just one email round. Keep the comments coming,  
please. :-)