[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

John Reppy jhr@cs.uchicago.edu
Wed, 30 Nov 2005 11:22:20 -0600


On Nov 30, 2005, at 10:50 AM, Wesley W. Terpstra wrote:

> On Nov 30, 2005, at 3:10 PM, John Reppy wrote:
>> Having a character type without a corresponding string/substring
>> type seems weird.  Once you have string/substring, then you
>> effectively have the vector and slice structures too, so why not add
>> arrays and array slices to get the complete set?
>
> As Matthew already said, I think we should have those structures, but
> perhaps not at the top-level namespace.  If it's little work for an
> implementation to provide them, and they don't pollute the namespace,
> then I see no problem with having them.  It's also not a particularly
> confusing concept... probably less confusing than splitting
> representation and meaning into orthogonal concepts.

If they exist they have to be implemented, maintained, and specified.
I think that they will make working with text significantly more
complicated, so I want to be convinced that they are necessary before
committing to supporting them.
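
For reference, the "complete set" in question is roughly the family
below, written as a signature sketch.  The signature name WIDE_TEXT is
invented; the component names are just the obvious wide-prefixed
counterparts of the existing text structures.

   (* Sketch only: WIDE_TEXT is an invented name. *)
   signature WIDE_TEXT =
      sig
         structure WideChar : CHAR
         structure WideString : STRING
         structure WideSubstring : SUBSTRING
         structure WideCharVector : MONO_VECTOR
         structure WideCharVectorSlice : MONO_VECTOR_SLICE
         structure WideCharArray : MONO_ARRAY
         structure WideCharArraySlice : MONO_ARRAY_SLICE
      end

A real specification would also need the type-sharing constraints that
tie the char, string, and element types together, which is part of the
specification burden mentioned above.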

>
>>> What will Char.toWide do?  As I already mentioned, high ASCII
>>> (128-255) is undefined.  What does it map to in a WideChar?!  I
>>> still think defining high ASCII to be *something* is better than
>>> nothing.
>>
>> I think that it depends on how one views the Char.char type.  In my
>> view, it is an enumeration of 256 values.  There is a collection of
>> predicates that classify these values and there is a standard string
>> representation that corresponds to the SML notion of
>> character/string literals.  The value #"\128" is perfectly well
>> defined, it just doesn't happen to have a tight binding to a
>> particular glyph.
>
> Right, but when you convert it to Unicode you are binding it to a
> glyph.  So, which glyph do you bind it to?  Or do you raise Chr?

I don't see converting to WideChar as "converting to Unicode".
Instead, it is a mapping of the data in the 8-bit representation to
the 16- or 32-bit representation.  The question is, what is the
semantics of that mapping?  There are different ways to specify that,
and so we should provide multiple conversion operations.  Some of
these may raise Chr and others may not; it depends on the semantics.
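
As a sketch of what "multiple conversion operations" could mean (none
of these names are in the Basis; they are only for illustration): one
conversion reads the 8-bit values as Latin-1 and embeds them
unchanged, another accepts only the 7-bit subset and raises Chr for
anything above 127.

   (* Illustrative only: neither function is a Basis proposal. *)
   fun toWideLatin1 (c : Char.char) : WideChar.char =
      (* read the 8-bit value as ISO-8859-1 and embed its code point *)
      WideChar.chr (Char.ord c)

   fun toWideAscii (c : Char.char) : WideChar.char =
      (* admit only 7-bit ASCII; values 128-255 have no agreed meaning *)
      if Char.ord c < 128
         then WideChar.chr (Char.ord c)
      else raise Chr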

>
>>> Instead, you should use BinaryIO and compose it with a charset
>>> decoder.  An implementation will only have a few charset
>>> representations in main memory and certainly no variable width
>>> ones.  If you use a general charset decoder for reading, then you
>>> can support all charsets with the same code.
>>
>> For converting between data on disk/wire/etc., filters are the way
>> to go (TextIO already has this property for newline conversion), but
>> there is the issue of OS interfaces; for example, pathnames.
>
> I'm not sure I understand your point here...
> Do you mean that some system/kernel calls will need a particular
> charset?  As far as I know, the only kernel with that feature is the
> Windows kernel, which can take UCS2 strings as well as ASCII.
> (Another reason UCS2 is needed.)
>
> For filenames on UNIX, I suppose you might want to write out UTF-8
> strings.  That's not a big problem, though, since the same structure
> which can wrap the BinIO readers also converts WideString.string to
> Word8Vector.vector with the charset you specify.
>
> I don't really see how any of this relates to the usefulness of
> TextIO, though.  You wouldn't have used TextIO to create filenames
> anyway, would you?

Two points: TextIO already does translation, so one can imagine
translation for other multi-byte characters.  Second, TextIO takes
string arguments that specify filenames.  BTW, MacOS X supports
Unicode filenames, but I think they use UTF-8 to encode them.
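
For the filename case, the kind of conversion being described could
look roughly like the sketch below: a UTF-8 encoder from
WideString.string to Word8Vector.vector.  The function name is made
up, and a real charset library would be table-driven, handle decoding
as well, and reject invalid code points; this only shows the shape of
the operation.

   (* Sketch only: "encodeUtf8" is not a Basis or MLton function. *)
   fun encodeUtf8 (s : WideString.string) : Word8Vector.vector =
      let
         val byte = Word8.fromInt
         (* encode one code point as its UTF-8 byte sequence *)
         fun enc n =
            if n < 0x80 then [byte n]
            else if n < 0x800 then
               [byte (0xC0 + n div 64), byte (0x80 + n mod 64)]
            else if n < 0x10000 then
               [byte (0xE0 + n div 4096),
                byte (0x80 + (n div 64) mod 64),
                byte (0x80 + n mod 64)]
            else
               [byte (0xF0 + n div 262144),
                byte (0x80 + (n div 4096) mod 64),
                byte (0x80 + (n div 64) mod 64),
                byte (0x80 + n mod 64)]
      in
         Word8Vector.fromList
            (List.concat (map (enc o WideChar.ord) (WideString.explode s)))
      end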

>
>>> 1. If you write a string in SML 'val x = "asfasf"', then this
>>> string must contain the code points which correspond to the symbol
>>> with shape 'a', then 's', ...  When you have a single storage type,
>>> with multiple charsets, then this is ambiguous.  I.e.: Is #"€" 0xA4
>>> or 0x80?  Depends on your charset!
>>
>> This was not the reason.  This problem is more of an editor problem
>> and one of the reasons that I'm not a big fan of extending the
>> source token set of SML beyond ASCII.
>
> I think I have explained myself badly; the problem I was trying to
> describe has nothing to do with the editor.  You are giving an SML
> compiler an input file, and that input file is in some character set
> the compiler understands.  The compiler knows that #"€" is the Euro
> sign, and the charset in which it was written in the editor is
> irrelevant at this point, because the compiler has already decoded
> the file into its internal representation.
>
> Rather, the problem comes in after the compiler does type inference.
> The compiler has this character and it says, "Ok!  This is going to
> be a Char8.char which has an unspecified charset".  Now, it has to
> think: what will the binary value be that I write into the output
> program's text segment?  The compiler, as per your suggestion,
> doesn't know the charset of Char8, because you left it unspecified.
> Now it must decide what on earth to do with a Euro sign.  Should it
> use 0xA4 for an ISO-8859-15 type of Char8 or 0x80 for a Windows
> extended-ASCII Char8?  The compiler knows that you want a Euro sign,
> because that's what you wrote in the input file, but because Char8
> does not include a concept of charset, it is unable to decide what
> binary value this turns into.
>
> This problem also appears for normal ASCII.
> Take the character #"c". What should the compiler do with it?
>
> It doesn't know your Char8 is going to be KOI8-R, so it would
> probably just use ASCII, and that means that when you later use the
> characters as if they were KOI8-R you would get some completely
> random glyph, when what you clearly meant was the Russian letter 'c'
> (which sounds like 's').
>
> Does that make things clearer?
> This makes the fact that Char is ASCII extremely important.
> Otherwise, the compiler would have no way of transforming string
> literals (which have been decoded/parsed already) into values in the
> heap.

I think you are drawing the wrong conclusion.  Instead of saying that
Char.char is ASCII, you should say that SML programs are interpreted
as being encoded in the ASCII character set (I think that the
Definition actually states this assumption, but I don't have my copy
handy to check).
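
To make the choice being described concrete, here is a toy version of
the decision a compiler would face when lowering a literal into an
8-bit type of unspecified charset.  The datatype and function are
invented for illustration; the hard facts used are that the Euro sign
is U+20AC, that it sits at 0xA4 in ISO-8859-15 and 0x80 in
Windows-1252, and that it has no encoding in ASCII or ISO-8859-1.

   (* Illustration only: not a proposal for the Basis or for MLton. *)
   datatype charset = ASCII | LATIN1 | LATIN9 | WINDOWS1252

   (* Which byte, if any, represents the given Unicode code point? *)
   fun literalByte (cs : charset, cp : int) : Word8.word option =
      case (cs, cp) of
         (ASCII, n)  => if n < 0x80 then SOME (Word8.fromInt n) else NONE
       | (LATIN1, n) => if n < 0x100 then SOME (Word8.fromInt n) else NONE
       | (LATIN9, 0x20AC)      => SOME 0wxA4  (* Euro in ISO-8859-15 *)
       | (WINDOWS1252, 0x20AC) => SOME 0wx80  (* Euro in Windows-1252 *)
       | _ => NONE  (* a real table would cover the rest of each charset *)

With that in hand, literalByte (LATIN9, 0x20AC) and
literalByte (WINDOWS1252, 0x20AC) give different bytes, and
literalByte (ASCII, 0x20AC) gives NONE, which is exactly the ambiguity
being argued about.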

>
>>> Finally, you would still need at least three representations
>>> (1,2,4 byte).  My proposal had five, which isn't terribly worse,
>>> and saves on the classification structures.  If we say
>>> Char=ISO-8859-1, then there are only three structures in my
>>> proposal too.  (Char, Ucs2, WideChar)
>>>
>>> I keep coming back to arguing for Char being ISO-8859-1.  It makes
>>> the problem of conversion between WideChar and Char so much
>>> cleaner...
>>
>> Why not just have 8-bit Char.char and 32-bit WideChar.char?
>
> Nearly all of Unicode fits into the first 16 bits. As a matter of
> practicality, many people use this for in-memory Unicode.
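
For concreteness about "nearly all": code points at or below 0xFFFF
fit in a single 16-bit unit, while those in the supplementary planes
(U+10000 through U+10FFFF) need a surrogate pair in a 16-bit
representation.  A sketch of that split, with an invented function
name:

   (* Sketch only: "toUtf16Units" is not a Basis function. *)
   fun toUtf16Units (cp : int) : word list =
      if cp < 0x10000
         then [Word.fromInt cp]            (* fits in one 16-bit unit *)
      else
         let
            val n = cp - 0x10000
         in
            [Word.fromInt (0xD800 + n div 0x400),   (* high surrogate *)
             Word.fromInt (0xDC00 + n mod 0x400)]   (* low surrogate *)
         end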