[MLton] Unicode... again
Wesley W. Terpstra
wesley at terpstra.ca
Fri Feb 9 07:00:49 PST 2007
On Feb 9, 2007, at 2:50 PM, skaller wrote:
> The problem with this is that it also leaves all the solutions
> being non-portable wrt 'characters' although calculations with
> integers is fairly deterministic.
I don't really agree here. We are making the choice that characters
are opaque. When you use the \x \u and \U syntax, you are specifying
a Unicode code point. That's consistent with the definition of SML
without WideChar, and makes good sense for SML with WideChar.
SML prevents you from doing arithmetic with characters. Therefore,
the CEF used by MLton is completely arbitrary. For efficiency
reasons, we will use the Unicode code point as an integer, but we
need not do so. The 'CHAR.ord' and 'CHAR.chr' functions are defined
to map characters <-> code points according to Unicode. Therefore,
the operations done in the integer form are likewise well defined.
> The problem I try to look at is this one: what does "\x88" mean?
> * In a byte string: one byte, hex code 88.
> * In a UCS4 string a 32 bit word, value hex 88
> * In a UTF-8 string, the two byte encoding of hex 88.
> Note none of these meaning has anything to do with character
> sets or Unicode, but depends only on the encoding (CEF?).
MLton would have to use the correct CEF for the string literal, based
on the inferred type. It has to do this even for #"x", because that
might be a single byte, two bytes, or four bytes. If we added a UTF-8
type string, your "\x88" example would be interpreted as "\u0088" and
converted into the appropriate two-byte encoding under UTF-8.
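
As a rough illustration of those three readings of code point 0x88, the byte values can be worked out with plain integers. This is only a sketch: the names are made up, the 32-bit form is shown big-endian for readability, and the two-byte UTF-8 formula holds only for code points from 0x80 to 0x7FF.

```sml
(* Code point 0x88 under the three readings; all values are plain ints. *)
val cp = 0x88

val asByte = [cp]                   (* byte string: one byte, hex 88 *)
val asUcs4 = [0, 0, 0, cp]          (* UCS4: one 32-bit word, big-endian here *)

(* UTF-8, two-byte form 110xxxxx 10xxxxxx: high bits, then low six bits. *)
val asUtf8 = [0xC0 + cp div 0x40, 0x80 + cp mod 0x40]   (* hex C2 88 *)
```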
I'm now leaning towards not adding UTF-8, as I think it's
unnecessary, and it would be nasty to have no corresponding Char
type. (The basis quite heavily assumes there is a Char type specific
to each String type, and that each String type is the same as the
monomorphic vector over that Char type.)
> BTW: the use of \x88 here just means 'code point hex 88'.
> You MIGHT chose instead that \x88 is byte 88 even in UTF-8,
I think this would be a mistake. If you wanted to (for some odd
reason) write your string literal as UTF-8 escaped with SML \x
escapes, then you should put that into a Word8Vector via the Byte
structure. There are no type problems then, as the input would be a
CharVector.
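
A minimal sketch of that route, using Byte.stringToBytes from the standard basis. Standard SML '97 byte escapes are decimal (\ddd), so the sketch writes the UTF-8 bytes C2 88 as \194\136 rather than with the \x syntax discussed in this thread; the value names are illustrative.

```sml
(* Raw UTF-8 bytes for code point 0x88, kept as bytes rather than characters. *)
val raw : Word8Vector.vector = Byte.stringToBytes "\194\136"

val n    = Word8Vector.length raw       (* 2 *)
val b0   = Word8Vector.sub (raw, 0)     (* first byte, hex C2 *)
val isC2 = (b0 = 0wxC2)                 (* true *)
```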
> suggesting these three kinds of string MUST be distinct types
This has to be the case anyway, as not all Char implementations have
the same width.
> On the other hand consider
> "A'" -- with an accent of some kind, ONE character
> This is really hard for my brain. What this means cannot be
> portable as such.
I don't agree. Suppose the source code was written as:
val x = "пришет"
The source file had some CES (which I've argued we should just make
UTF-8). This is parsed at compile-time into Unicode characters
(possibly with this ml-ulex). After being parsed, MLton knows the
sequence of Unicode code points in that string literal. When MLton
needs to write this into the text segment of the binary, it would do
so depending on the inferred string type. If the inferred string type
was simply String.string, you should get a compile-time error to the
effect that this string is "too big" for the type. If the type is
WideString.string, then each code point will be written as a four-byte
value in machine-endian order. If there were a UTF-8 type, the
literal would be written in the UTF-8 CEF.
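
The type-directed emission described above could be sketched roughly as follows. This is not MLton's implementation: the datatype, exception, and function names are hypothetical, code points are modelled as ints, a little-endian host is assumed for the wide case, and the exception merely stands in for the compile-time error.

```sml
(* Hypothetical stand-ins for String.string and WideString.string. *)
datatype target = Narrow | Wide

(* Models the "too big" compile-time error; carries the offending code point. *)
exception TooBig of int

(* Four bytes of one code point, least significant byte first. *)
fun le32 cp =
  [cp mod 0x100, (cp div 0x100) mod 0x100,
   (cp div 0x10000) mod 0x100, cp div 0x1000000]

(* Turn the parsed code points of a literal into the bytes to be written. *)
fun emit Narrow cps =
      map (fn cp => if cp < 0x100 then cp else raise TooBig cp) cps
  | emit Wide cps = List.concat (map le32 cps)

(* First three code points of the literal above: Cyrillic pe, er, i. *)
val cps = [0x43F, 0x440, 0x438]

val wide   = emit Wide cps      (* 12 bytes, 4 per code point *)
val narrow = (emit Narrow cps; "accepted")
             handle TooBig _ => "rejected: too big"
```

A UTF-8 target would be a third constructor whose case encodes each code point with a variable number of bytes.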
> Fact is .. I'd really like to find an answer to the question
> myself. My language Felix only provides two types at the moment:
> "...." // 8 bit string
> u".. " // 32 bit string
> "\x88" --> byte x88, even if it is invalid utf-8
> "\u0088" --> UTF-8 encoding of code point x88
> u"\u0088" --> UCS4 encoding of code point x88
> u"\x88" --> GAK I HAVE NO IDEA .. probably should be illegal?
I think the choices you've listed are all consistent with how I
intend for this to work. The u"\x88" should be code point 0x88.
> The downside of this scheme is that the same string can be used for
> * 8 bit code points
> * UTF-8 encoding
> at the same time, which is not only inconsistent logically,
> it is also unsound in that you can generate a string you thought
> was UTF-8, but which contains an invalid UTF-8 sequence.
> If that happens due to I/O that might be acceptable but it should
> never happen as a result of the compiler transcoding a literal.
> Tradeoff between flexibility and safety here..
I don't see this point. There is exactly one type inferred for each
string. MLton will never write incorrect UTF-8 to the text segment,
because it would generate that UTF-8 from a sequence of code points.
If your input source file had invalid UTF-8, that would be a parse
error. Even if the
source file had UTF-8 format and used a UTF-8 string literal, MLton
would decode the source file into WideString, decide that the literal
is used as a UTF8String, and then re-encode that WideString back to
UTF-8 in the program's text segment.
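
The decode step can be sketched as below, with bytes and code points both modelled as ints. The names are hypothetical, and the checks are deliberately simplified: bad lead bytes and bad continuation bytes are rejected, but overlong three- and four-byte forms, surrogates, and values above 0x10FFFF are not, which a real lexer would also need to refuse.

```sml
(* Stands in for the parse error on malformed source text. *)
exception BadUtf8

(* Decode a list of byte values into a list of code points. *)
fun decode [] = []
  | decode (b :: rest) =
      let
        (* Consume n continuation bytes (10xxxxxx), accumulating into acc. *)
        fun cont (0, acc, bs) = (acc, bs)
          | cont (n, acc, c :: bs) =
              if c >= 0x80 andalso c < 0xC0
              then cont (n - 1, acc * 0x40 + (c - 0x80), bs)
              else raise BadUtf8
          | cont _ = raise BadUtf8
        val (cp, rest') =
          if b < 0x80 then (b, rest)
          else if b >= 0xC2 andalso b < 0xE0 then cont (1, b - 0xC0, rest)
          else if b >= 0xE0 andalso b < 0xF0 then cont (2, b - 0xE0, rest)
          else if b >= 0xF0 andalso b < 0xF5 then cont (3, b - 0xF0, rest)
          else raise BadUtf8
      in
        cp :: decode rest'
      end

val good = decode [0xC2, 0x88]                        (* [0x88] *)
val bad  = (decode [0x88]; false) handle BadUtf8 => true   (* true *)
```

A lone byte 0x88 is a continuation byte with no lead byte, so it is rejected at this stage and never reaches the re-encoding step.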