[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

John Reppy jhr@cs.uchicago.edu
Wed, 30 Nov 2005 08:10:02 -0600


On Nov 30, 2005, at 6:49 AM, Wesley W. Terpstra wrote:

> On Nov 30, 2005, at 2:10 AM, John Reppy wrote:
>> I think that this proposal is too heavyweight for its usefulness.
>
> I agree that it's pretty heavyweight.
> However, at least in MLton, creating the structures isn't a big deal.
>
>> The Basis design assumes that there is an implementation of the TEXT
>> signature for each char/string/substring type, so you'll have all the
>> arrays, vectors, slices, etc. for each type.
>
> What if we just said that only Char and WideChar had the structures
> at the toplevel? All the others only provide Ucs2Text, AsciiText, ...
> That has very little namespace pollution, yet provides everything
> desired. From my experience with MLton's Char, most/all of these
> structures can be cookie-cutter stamped out of a functor, so it's
> not much trouble to implement.
>
> Re: namespace pollution, MLton has Int{1-64} and Word{1-64}.
> Now *that* is heavyweight! :-)
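
Wesley's cookie-cutter observation can be made concrete.  A minimal
sketch of such stamping, with hypothetical names (CHAR_CORE,
CharClassFn) rather than anything in MLton or the Basis:

    (* Derive classification functions from a bare ord/chr pair. *)
    signature CHAR_CORE =
    sig
      eqtype char
      val ord : char -> int
      val chr : int -> char
    end

    functor CharClassFn (C : CHAR_CORE) =
    struct
      open C
      fun isAscii c = ord c < 128
      fun isDigit c = let val n = ord c in 48 <= n andalso n <= 57 end
      fun isUpper c = let val n = ord c in 65 <= n andalso n <= 90 end
      fun toLower c =
          let val n = ord c
          in if 65 <= n andalso n <= 90 then chr (n + 32) else c end
    end

    (* Instantiating it per representation is then mechanical: *)
    structure AsciiClass =
      CharClassFn (struct
                     type char = Char.char
                     val ord = Char.ord
                     val chr = Char.chr
                   end)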

Having a character type without a corresponding string/substring type
seems weird.  Once you have string/substring, then you effectively have
the vector and slice structures too, so why not add arrays and array
slices to get the complete set?  My main concern is that you end up
with a lot of modules that most users won't use or understand.
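
For reference, "the complete set" is exactly what the Basis TEXT
signature packages up per character type (abridged here; the sharing
constraints are omitted):

    signature TEXT =
    sig
      structure Char            : CHAR
      structure String          : STRING
      structure Substring       : SUBSTRING
      structure CharVector      : MONO_VECTOR
      structure CharArray       : MONO_ARRAY
      structure CharVectorSlice : MONO_VECTOR_SLICE
      structure CharArraySlice  : MONO_ARRAY_SLICE
      (* ... plus where-type constraints tying the components together *)
    end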
>
>> Furthermore, there need to be conversion functions between types ...
>
> Well, a simple 'toWide' and 'fromWide' would take care of that.
> (Analogous to the promotion to/from LargeInt.)
>
> However, there are a couple problems here:
>
> WideChar does not exist on many platforms. Is it possible to have
> these elements of the CHAR signature marked as required iff
> WideChar exists?
>
> What will Char.toWide do? As I already mentioned, high ASCII
> (128-255) is undefined. What does it map to in a WideChar?! I still
> think defining high ASCII to be *something* is better than nothing.

I think that it depends on how one views the Char.char type.  In my
view, it is an enumeration of 256 values.  There is a collection of
predicates that classify these values, and there is a standard string
representation that corresponds to the SML notion of character/string
literals.  The value #"\128" is perfectly well defined; it just
doesn't happen to have a tight binding to a particular glyph.
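
On that reading, Wesley's toWide/fromWide are just injections on
ordinals.  A minimal sketch, assuming the optional WideChar structure
exists (neither function is in the Basis):

    (* #"\128" has a definite ordinal even without a glyph: *)
    val _ = Char.ord #"\128"                        (* 128 *)

    (* Conversion as an injection on code points: *)
    fun toWide (c : Char.char) : WideChar.char =
        WideChar.chr (Char.ord c)
    fun fromWide (w : WideChar.char) : Char.char option =
        if WideChar.ord w <= Char.maxOrd
        then SOME (Char.chr (WideChar.ord w))
        else NONE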

>
>> and perhaps multiple versions of TextIO.
>
> I don't think this is desirable.
>
> Instead, you should use BinIO and compose it with a charset decoder.
> An implementation will only have a few charset representations in
> main memory and certainly no variable-width ones. If you use a
> general charset decoder for reading, then you can support all
> charsets with the same code.

For converting data on disk/wire/etc., filters are the way to go
(TextIO already has this property for newline conversion), but there
is the issue of OS interfaces; for example, pathnames.
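
Such a composition might look like the following sketch, assuming a
hypothetical decoder of type Word8Vector.vector -> WideString.string
and the optional WideString structure; only BinIO itself is
Basis-provided:

    (* Read a whole binary file, then decode it with the supplied
       charset decoder. *)
    fun readDecoded (decode : Word8Vector.vector -> WideString.string)
                    (path : string) : WideString.string =
        let
          val ins   = BinIO.openIn path
          val bytes = BinIO.inputAll ins
        in
          BinIO.closeIn ins;
          decode bytes
        end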
>
>> A different strategy (one that we considered at one point in the
>> Basis design, but then abandoned for reasons that I cannot
>> remember) is to separate the notion of character classification
>> from representation.
>
> I think I know why this wasn't done:
>
> 1. If you write a string in SML, 'val x = "asfasf"', then this
> string must contain the code points which correspond to the symbol
> with shape 'a', then 's', ...  When you have a single storage type
> with multiple charsets, this is ambiguous. I.e., is #"€" 0xA4 or
> 0x80? It depends on your charset!

This was not the reason.  This problem is more of an editor problem
and one of the reasons that I'm not a big fan of extending the source
token set of SML beyond ASCII.

>
> 2. Simply taking a string which was previously considered an
> ISO-8859-1 string and declaring that it is now an ISO-8859-15
> string would be type-safe, yet buggy. If you used phantom types
> like 'charset char, you might be able to avoid the worst (see the
> sketch after this list).
>
> 3. Maybe not then, but now: backwards compatibility.
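
The phantom-type idea in point 2 might look like the following
sketch; all of the names here are hypothetical:

    (* 'cs is a phantom parameter: never inspected at runtime, it
       only tags which charset a value belongs to. *)
    signature TAGGED_CHAR =
    sig
      type 'cs char
      type iso_8859_1
      type iso_8859_15
      val fromByte : Word8.word -> iso_8859_1 char
      (* Re-tagging is an explicit conversion, not a free cast: *)
      val recode   : iso_8859_1 char -> iso_8859_15 char
    end

    structure TaggedChar :> TAGGED_CHAR =
    struct
      type 'cs char    = Word8.word
      type iso_8859_1  = unit
      type iso_8859_15 = unit
      fun fromByte b = b
      (* A real recode would remap the handful of code points that
         differ (e.g. 0xA4); the identity here is a placeholder. *)
      fun recode c = c
    end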

I think that the reason had more to do with reducing the number of
modules in the specification, but I'd have to dig through the mail to
figure this out.

>
> Finally, you would still need at least three representations (1-,
> 2-, and 4-byte). My proposal had five, which isn't terribly worse,
> and saves on the classification structures. If we say Char =
> ISO-8859-1, then there are only three structures in my proposal too
> (Char, Ucs2, WideChar).
>
> I keep coming back to arguing for Char being ISO-8859-1. It makes
> the problem of conversion between WideChar and Char so much
> cleaner...

Why not just have 8-bit Char.char and 32-bit WideChar.char?

	- John