[MLton] Unicode... again

Fri Feb 9 13:51:25 PST 2007

On Feb 9, 2007, at 10:05 PM, Matthew Fluet wrote:
>> Once again I find myself needing Unicode in MLton.
>
> Just to orient the discussion of "what to implement where": you  
> find yourself needing to process Unicode files with an SML program  
> compiled by MLton /or/ you find yourself needing to have Unicode  
> strings in an SML program compiled by MLton.
>
> The former doesn't require any changes to the compiler (not that they  
> wouldn't be welcome).

TBH, I don't see much of a distinction here. If I'm processing Unicode  
files with an SML program compiled by MLton, I will almost certainly  
need WideChar from MLton, which does not exist. I could certainly  
fake it with a Word32, but the basis tells me to use another type.

>> - CharX differs from IntX in that a CharX contains a character.  
>> This sounds obvious, but it caused considerable debate earlier. I  
>> hope that given the above definition of character, things are  
>> clear. A character corresponds to our concept of the letter 'a',  
>> irrespective of the font. A character is NOT a number. It is not  
>> even a code point.
>
> I don't recall the details of the earlier debate, but while  
> expecting CharX to differ from IntX sounds good, it doesn't give  
> much insight into the representation.  In particular the 'X' would  
> almost certainly seem to imply a fixed-width word/integer.

At the moment, there is only Char and WideChar in what I've been  
writing. I never meant to actually call them Char8/16/32 this time  
around. I think you are completely correct that it would otherwise  
imply how the character is stored. The representation is now  
controlled the same way int width is controlled. I've also  
generalized this for Char (though not all the way to adding the  
command-line option, as that would break Byte).

>> - For the time being I choose to ignore the basis' claim that "in  
>> WideChar, the functions toLower, toLower, isAlpha,..., isUpper  
>> and, in general, the definition of a ``letter'' are locale- 
>> dependent" and raise an Unimplemented exception for these methods.  
>> I think the standard is dreadfully misguided in assuming a global  
>> locale, and I defer what to do here till later as it is what  
>> blocked my progress last time. (IMO these functions have only  
>> questionable use, anyway)
> I think that is reasonable.

Actually, since I've functorized the Char implementation in the  
basis, it's presently following the exact same rules for WideChar as  
well. Locale-specific methods should be in another structure IMO, one  
that is parameterized by the locale.

> As I understand the implementation of the latter in MLton, any  
> string that has \uXXXX will be inferred to have type  
> String16.string = Char16Vector.vector and any string that has  
> \UXXXXXXXX will be inferred to have type String32.string =  
> Char32Vector.vector.  (Inference might also force the type to a  
> higher StringN.string type.)

That's exactly what I expected. :-)

> That would seem to lend more support for Char16 as BMP and Char32  
> as full unicode.

skaller has changed my mind since I last wrote this. Providing a  
Char16 (under any name) encourages people to use it. Just providing a  
WideChar (=Char32) is probably better. If people need a more memory  
efficient representation, they can convert WideChar/WideString into a  
Word8Vector.vector that is UTF-8 encoded.
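
A minimal sketch of that conversion, assuming Word32.word as a stand-in
for a WideChar code point (WideChar does not exist in MLton yet). The
structure name Utf8Sketch and its functions are hypothetical and not
part of the basis or of MLton:

```sml
(* Sketch only: Word32.word stands in for a WideChar code point. *)
structure Utf8Sketch =
struct
  (* Keep the low 8 bits of a Word32 as a Word8. *)
  fun byte (w : Word32.word) : Word8.word =
    Word8.fromLarge (Word32.toLarge w)

  (* Continuation byte 10xxxxxx holding the six bits of cp that
     start at bit position `shift`. *)
  fun cont (cp : Word32.word, shift : word) : Word8.word =
    byte (Word32.orb (0wx80, Word32.andb (Word32.>> (cp, shift), 0wx3F)))

  (* Encode one code point as 1 to 4 bytes.  Surrogates and values
     above 0x10FFFF are rejected with Chr. *)
  fun encode (cp : Word32.word) : Word8.word list =
    if cp < 0wx80 then
      [byte cp]
    else if cp < 0wx800 then
      [byte (Word32.orb (0wxC0, Word32.>> (cp, 0w6))), cont (cp, 0w0)]
    else if cp >= 0wxD800 andalso cp <= 0wxDFFF then
      raise Chr
    else if cp < 0wx10000 then
      [byte (Word32.orb (0wxE0, Word32.>> (cp, 0w12))),
       cont (cp, 0w6), cont (cp, 0w0)]
    else if cp <= 0wx10FFFF then
      [byte (Word32.orb (0wxF0, Word32.>> (cp, 0w18))),
       cont (cp, 0w12), cont (cp, 0w6), cont (cp, 0w0)]
    else
      raise Chr

  (* Encode a whole sequence of code points into one byte vector. *)
  fun encodeAll (cps : Word32.word list) : Word8Vector.vector =
    Word8Vector.fromList (List.concat (List.map encode cps))
end
```

With this, Utf8Sketch.encode 0wx20AC (the euro sign) gives the three
bytes E2 82 AC, while anything below 0x80 stays a single byte.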

> I don't see CharX or StringX as any encoding.

Actually, with hindsight, they DO have an encoding. The succ/pred  
methods in CHAR imply that we encode them as code points.

>> Agreed? Can I just whip this up and check it in? ;-)
>
> I believe that there is still a unicode branch in the repository.   
> I would recommend that you merge changes from trunk into that  
> branch and continue development there.
> That gives people a chance to see development and suggest changes  
> before we merge them into trunk.

The branch hit a dead end. The new 64-bit changes also obsoleted it.  
My new changeset started fresh off the current trunk, and is almost  
complete. I could make a new branch with it or send a patch to the  
list.

> The latest version of SML/NJ (ver 110.62) includes
>   signature UTF8
>   structure UTF8 : UTF8

I'll take a look at this to see about UTF-8 conversion, once we have  
WideChar in svn.

PS. I need a heap sort under MLton's licence. Does anyone have a  
well-tested (and short) implementation?