[MLton] Unicode... again

Fri Feb 9 03:20:31 PST 2007

On Feb 9, 2007, at 4:32 AM, skaller wrote:
> In the spirit I'd drop 16 bit support initially. Just provide
> * 32 bits
> * 8 bits
> * UTF-8
> I18N consensus is that 32 bits is the right compromise,
> and UTF-8 encoding is the right space efficient one if you're
> willing to give up random access.

This raises an interesting point: should UTF-8 have its own type?

For most character encoding schemes (CES) it makes sense to just  
convert them to Word8Vector.vector. However, UTF-8 and UTF-16 can  
also be used as an in-memory character encoding form (CEF). Gtk,  
SQLite3, and many others make this choice. When you don't need random  
access, they are good choices.

Looking over the STRING signature, the only things you could not do  
with a UTF-8/16 string CEF in reasonable time complexity are sub,  
extract, and substring (and thus Substring). If you added an iterator  
interface in their place, this might be acceptable. Such a string  
would have char = WideChar.char.
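
As a rough illustration of the iterator idea (a sketch in Python rather than SML, and not a proposed MLton API; the names `iter_code_points` and `sub` are hypothetical), sequential decoding of UTF-8 is cheap per step, while emulating sub means walking from the start of the string:

```python
def iter_code_points(data: bytes):
    """Yield code points from UTF-8 bytes, front to back.

    Hypothetical helper. Assumes well-formed input: it checks lead bytes
    and truncation, but not continuation bytes, overlong forms, or
    surrogates.
    """
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:              # 1-byte sequence: 0xxxxxxx
            cp, length = b, 1
        elif b >> 5 == 0b110:     # 2-byte sequence: 110xxxxx
            cp, length = b & 0x1F, 2
        elif b >> 4 == 0b1110:    # 3-byte sequence: 1110xxxx
            cp, length = b & 0x0F, 3
        elif b >> 3 == 0b11110:   # 4-byte sequence: 11110xxx
            cp, length = b & 0x07, 4
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        if i + length > n:
            raise ValueError("truncated UTF-8 sequence")
        # Fold in the low 6 bits of each continuation byte.
        for k in range(1, length):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        i += length
        yield cp


def sub(data: bytes, index: int) -> int:
    """Emulate String.sub on a UTF-8 string: linear in index."""
    for pos, cp in enumerate(iter_code_points(data)):
        if pos == index:
            return cp
    raise IndexError("Subscript")
```

The iterator never needs to know a character's index, which is why a fold- or reader-style interface suits a variable-width CEF better than sub does.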

I would defer this till later, though. First we need WideChar and an  
iconv binding.

> On Fri, 2007-02-09 at 09:59 +1100, Michael Norrish wrote:
>> Pragmatically, I
>> wonder how important you think providing the 16 bit character type  
>> is.
>> It seems a kind of optional extra for people who want space-efficient
>> BMP.  Or do you imagine the vast majority of people will want to just
>> use the BMP, and will therefore resent wasting 16 bits per char?
>
> That's the problem! The vast majority of people do indeed want
> primarily the BMP. But they shouldn't be allowed easy access to it:
> the whole point of a Standard is as a guide for what everyone
> SHOULD do to facilitate communication and interoperability, and
> 32 bits is the way to go here, not 16, which is a stupid compromise
> made prematurely by greedy industrial powers.

Initially I thought a BMPChar would be very often used. However,  
skaller is right: we shouldn't be encouraging the use of a 16-bit  
character. If you need space efficiency, you should use a CES as a  
CEF. You lose random access, but if you were concerned about space  
efficiency, perhaps you don't care.
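
To put rough numbers on that trade-off, here is a small Python sketch (the helper name `encoded_sizes` is hypothetical) that reports how many bytes the same text occupies in each encoding form:

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32.

    The "-le" codecs are used so that no byte order mark is counted.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }
```

For ASCII text the 32-bit form is four times the size of UTF-8, while a character outside the BMP takes four bytes in all three forms.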

>>  (It certainly does seem as if there won't be much use of stuff  
>> outside BMP, but who can tell?)

The People's Republic of China requires more than the BMP for Mandarin.  
AFAIK, that's it for modern written human languages.

At any rate, it sounds like everyone who has spoken so far thinks  
that WideChar should be 32 bits.

The remaining question is whether to include a BMPChar and BMPString  
(16-bit) at all. Initially I thought this was a given necessity, but now  
I am not sure. I'll start whipping up a patch that adds WideChar as  
Char32, with no Char16.

gbuday at gmail.com wrote:
> I'm getting into using ml-ulex, which is a unicode-able lexer for
> sml/nj. As far as I understood, it uses 4-byte chars:
> ml-lpt/ml-ulex/BackEnds/SML/template-ml-ulex.sml contains
>
>      structure W = Word32
>      type wchar = W.word
>
> It would be nice to be able to use ml-ulex with your proposed unicode
> library. For the first step I'll try to compile ml-ulex with mlton.

That's great! Is it backwards compatible with the existing lex used  
in MLton?

I'm still waiting for feedback from the major stakeholders, of  
course...