[MLton] Unicode... again

skaller skaller at users.sourceforge.net
Fri Feb 9 05:50:54 PST 2007


On Fri, 2007-02-09 at 12:20 +0100, Wesley W. Terpstra wrote:
> On Feb 9, 2007, at 4:32 AM, skaller wrote:
> > In the spirit I'd drop 16 bit support initially. Just provide
> > * 32 bits
> > * 8 bits
> > * UTF-8
> > I18N consensus is that 32 bits is the right compromise,
> > and UTF-8 encoding is the right space efficient one if you're
> > willing to give up random access.
> 
> This raises an interesting point: should UTF-8 have its own type?

That's a hard question IMHO, to which I have ideas but no
definitive answer. Bill Plauger once told me that C does best
without any character type at all (char is really an integer).
Furthermore the meaning of escape sequences and various 
'glyphs' isn't defined by the C standard -- other than those
required to represent C itself.

I think the idea is that a type system isn't expressive enough to
really handle the huge complexity of character sets, encodings,
etc., and leaving it open was the best way because it admitted
many solutions.

The problem with this is that it also leaves all the solutions
non-portable wrt 'characters', although calculations with
integers are fairly deterministic.

The problem I try to look at is this one: what does

	"\x88"

mean? 

* In a byte string: one byte, hex code 88.
* In a UCS4 string: one 32 bit word, value hex 88.
* In a UTF-8 string, the two byte encoding of hex 88.

Note none of these meanings has anything to do with character
sets or Unicode; each depends only on the encoding (CEF?).
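
As a minimal sketch of the three readings above (Python is used here only
for illustration, and big-endian byte order for the 32 bit word is an
assumption, since the text does not specify one), the same number hex 88
produces three different byte sequences:

```python
# Three readings of the literal "\x88", all starting from the number 0x88.
code_point = 0x88

# Byte string: a single byte with value 0x88.
as_byte_string = bytes([code_point])

# UCS4 string: one 32 bit word (big-endian byte order assumed here).
as_ucs4 = chr(code_point).encode("utf-32-be")

# UTF-8 string: the two byte encoding of code point 0x88.
as_utf8 = chr(code_point).encode("utf-8")

for name, data in [("byte", as_byte_string), ("UCS4", as_ucs4), ("UTF-8", as_utf8)]:
    print(f"{name:5} {data.hex(' ')}")
```

The three results have lengths 1, 4 and 2 bytes respectively, which is the
distinction the list above is drawing.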

BTW: the use of \x88 here just means 'code point hex 88'.
You MIGHT choose instead that \x88 is byte 88 even in UTF-8,
and that you have to use \u0088 if you want the UTF-8 encoding
of a 32 bit code point .. the point is that the same 'number'
results in 3 distinct byte sequences .. suggesting these three
kinds of string MUST be distinct types .. note this only
applies to literals, and the argument is based on the assumption
that the same 'escape' sequence has the same meaning (code point hex 88).


On the other hand consider

	"A'" -- with an accent of some kind, ONE character

This is really hard for my brain. What this means cannot be portable
as such. It isn't a portable representation, but an encoding
in the input character set of the compiler, which has to be
translated into an encoding of the same code point in the
run time string format (which need not agree with that
used by the compiler!).

Again, the string has 3 distinct representations, assuming
we can understand which 'character' the literal is supposed
to be encoding.
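
A rough sketch of that translation step, under stated assumptions: the
compiler's input character set is taken to be Latin-1, the accented
character is taken to be capital A with acute (code point hex C1), and
big-endian order is assumed for the 32 bit form. None of these choices come
from the discussion above; they are picked only to make the example concrete.

```python
# Assumption: the source file is stored in Latin-1, so the accented
# character appears in the file as the single byte 0xC1.
source_bytes = b"\xc1"

# Step 1: interpret the source bytes in the compiler's input character set
# to recover the code point the literal is supposed to denote.
character = source_bytes.decode("latin-1")

# Step 2: re-encode that code point in each candidate run time format.
as_8bit = character.encode("latin-1")    # one byte
as_utf8 = character.encode("utf-8")      # two bytes
as_ucs4 = character.encode("utf-32-be")  # one 32 bit word

print(hex(ord(character)), as_8bit.hex(" "), as_utf8.hex(" "), as_ucs4.hex(" "))
```

If step 1 used a different input character set, the same source byte would
denote a different code point, which is why the literal is not portable as
such.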

So again, I think there MUST be three distinct types,
and again, this only applies to literals.

I'm sure alternatives exist to this interpretation.
I'm NOT arguing in favour of it or against it, just
presenting it.

Fact is .. I'd really like to find an answer to the question
myself. My language Felix only provides two types at the moment:

	"...." // 8 bit string
	u".. " // 32 bit string

and 

	"\x88" --> byte x88, even if it is invalid UTF-8
	"\u0088" --> UTF-8 encoding of code point x88
	u"\u0088" --> UCS4 encoding of code point x88
	u"\x88" --> GAK I HAVE NO IDEA .. probably should be illegal?

The downside of this scheme is that the same string can be used
for 

* 8 bit code points
* UTF-8 encoding

at the same time, which is not only inconsistent logically,
it is also unsound in that you can generate a string you thought
was UTF-8, but which contains an invalid UTF-8 sequence.

If that happens due to I/O that might be acceptable but it should
never happen as a result of the compiler transcoding a literal.
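
The unsoundness can be reproduced in a few lines. This is only a sketch in
Python, not Felix: `is_valid_utf8` is a hypothetical helper, and the
concatenation stands in for a single 8 bit string literal that mixes a raw
\x88 byte with a \u0088 escape.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

raw_byte = b"\x88"                  # "\x88": byte x88, not valid UTF-8 alone
encoded = "\u0088".encode("utf-8")  # "\u0088": UTF-8 encoding of code point x88
mixed = raw_byte + encoded          # both conventions in one 8 bit string

print(is_valid_utf8(encoded))  # the pure \u form is well formed
print(is_valid_utf8(mixed))    # the mixed string is not
```

The mixed string fails because x88 is a continuation byte and cannot start a
UTF-8 sequence. With a separate UTF-8 string type, a compiler could reject
such a literal at translation time.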

It's a tradeoff between flexibility and safety here.
I don't know the answer. What do you think?


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


