[MLton] Unicode and WideChar support

Wesley W. Terpstra wesley@terpstra.ca
Fri, 25 Nov 2005 19:41:14 +0100



Good evening!

I have been working on adding support for Unicode to the MLton
compiler (www.mlton.org). To the best of my knowledge, no other
SML compiler presently supports Unicode (please let me know if
I am wrong here). Stephen Weeks redirected me to this list as the
appropriate place to push for changes to the CHAR signature. :-)

I have several concerns regarding the definition of CHAR.
This email is rather long, so I have attempted to organize this
into the following sections:

1. is{Alpha,Graph,...}
2. Representation of WideChar in memory
3. Parsing/serialization
4. The charset of SML source files
5. My suggested amendments

-----
1. is{Alpha,Graph,...}
-----

The standard specifies that a normal 'Char.char' is taken to be in
the "extended ASCII 8-bit characters". Clearly, the intent is that the
type be 8 bits wide, and include ASCII in the first 128 characters.
However, what is intended for the latter 128 characters? There are
many extensions of ASCII to 8 bit.

It seems to me that the most reasonable thing to say is that the
character set is ISO-8859-1, which is the most popular of the 8 bit
ASCII extensions. Also, ISO-8859-1 enjoys the property that
(WideChar.chr o Char.ord) will leave the character unchanged
since ISO-8859-1 is embedded, code point for code point, into the
Unicode standard. Finally, ISO-8859-1 suffices to cover most of
the major European languages.

Next, the CHAR standard states:
> In WideChar, the functions toLower, toLower, isAlpha,..., isUpper
> and, in general, the definition of a ``letter'' are locale-dependent.

However, nowhere in the CHAR interface is there a provision for
specifying the locale to the is* methods. Requiring that the entire
program has one global locale is a very bad idea. C does this,
and it is a source of many problems. Furthermore, this means that
WideChar.isSpace x may disagree with Char.isSpace x. Also, it
means that the behaviour of utility programs might depend, in
unexpected ways, on the environment variables they are run with.

I think a more sane thing to specify would be that WideChar
use the 'default categorization' given by Unicode:
http://www.unicode.org/Public/4.1.0/ucd/UCD.html#General_Category_Values
This means that WideChar, too, would be locale independent.

However, it is definitely the case that there are locale dependent
to{Upper,Lower} and is{Digit,...} methods, but these cannot fit into
the CHAR interface as it exists now, anyway. For one, toUpper
may need to return multiple characters given one input character.
Furthermore, other languages use entirely different is* methods.

I propose that locale dependent classification methods be moved
into a distinct structure which operates exclusively on WideChars.
Such a structure should provide methods for determining character
class for a given (character, locale) pair. Furthermore, there should
be a whole lot more is* methods reflecting other languages. This is
outside the scope of what I want to talk about now.
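
As a rough sketch only, the signature of such a structure might look
like the following. Every name in it (LOCALE_CHAR, locale, and so on)
is hypothetical and appears in no existing basis; the point is just
that the locale is passed explicitly and that case mapping returns a
string (German sharp s, for example, uppercases to "SS").

```sml
(* Hypothetical signature; none of these names exist in the Basis Library. *)
signature LOCALE_CHAR =
sig
  type char      (* intended to be WideChar.char *)
  type string    (* intended to be WideString.string *)
  type locale    (* an opaque handle for one locale *)

  (* Look up a locale by a tag such as "de_DE"; NONE if unknown. *)
  val locale : String.string -> locale option

  (* Classification takes the locale explicitly; there is no global state. *)
  val isAlpha : locale -> char -> bool
  val isUpper : locale -> char -> bool
  val isLower : locale -> char -> bool

  (* Case mapping returns a string, since one character may map to several. *)
  val toUpper : locale -> char -> string
  val toLower : locale -> char -> string
end
```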

There are further problems in the definition of the is* methods.
Under Unicode, the containments are strict:
	isAlpha \superset isUpper \cup isLower
	isAlphaNum \superset isAlpha \cup isDigit
... this contradicts the guarantees made in the basis for WideChar:
> isAlpha c
> returns true if c is a letter (lowercase or uppercase).
> isAlphaNum c
> returns true if c is alphanumeric (a letter or a decimal digit).

If those two requirements were removed, then the map could be
(http://www.unicode.org/Public/4.1.0/ucd/UCD.html#General_Category_Values):
	isUpper = category 'Lu'
	isLower = 'Ll'
	isAlpha = 'L?'
	isAlphaNum = '{L,N}?'
	isCntrl = 'C?'
	isDigit = 'Nd'
	isNumber = 'N?' <----- I want this added to CHAR
	isHexDigit = Hex_Digit (http://www.unicode.org/Public/4.1.0/ucd/UCD.html#Hex_Digit)
	isSpace = White_Space (http://www.unicode.org/Public/4.1.0/ucd/UCD.html#White_Space)
	isPrint = not 'C?'
	isGraph = isPrint andalso not isSpace
	isPunct = isGraph andalso not isAlphaNum
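
The mapping above could be written down roughly as follows. This is
a sketch, not MLton's implementation: the functor Classify and its
three parameters (category, whiteSpace, hexDigit) are assumed lookups
into the Unicode Character Database that an implementation would have
to supply from its own tables.

```sml
(* Unicode general categories, grouped by their first letter. *)
datatype category =
    Lu | Ll | Lt | Lm | Lo                (* letters *)
  | Mn | Mc | Me                          (* marks *)
  | Nd | Nl | No                          (* numbers *)
  | Pc | Pd | Ps | Pe | Pi | Pf | Po      (* punctuation *)
  | Sm | Sc | Sk | So                     (* symbols *)
  | Zs | Zl | Zp                          (* separators *)
  | Cc | Cf | Cs | Co | Cn                (* other *)

fun isL cat = List.exists (fn x => x = cat) [Lu, Ll, Lt, Lm, Lo]   (* 'L?' *)
fun isN cat = List.exists (fn x => x = cat) [Nd, Nl, No]           (* 'N?' *)
fun isC cat = List.exists (fn x => x = cat) [Cc, Cf, Cs, Co, Cn]   (* 'C?' *)

(* The three lookups are assumptions: they stand for UCD table queries. *)
functor Classify (type char
                  val category   : char -> category
                  val whiteSpace : char -> bool
                  val hexDigit   : char -> bool) =
struct
  fun isUpper c    = category c = Lu
  fun isLower c    = category c = Ll
  fun isAlpha c    = isL (category c)
  fun isDigit c    = category c = Nd
  fun isNumber c   = isN (category c)
  fun isAlphaNum c = isAlpha c orelse isNumber c
  fun isCntrl c    = isC (category c)
  val isHexDigit   = hexDigit
  val isSpace      = whiteSpace
  fun isPrint c    = not (isCntrl c)
  fun isGraph c    = isPrint c andalso not (isSpace c)
  fun isPunct c    = isGraph c andalso not (isAlphaNum c)
end
```

Note that isAlpha is no longer the union of isUpper and isLower here,
since 'Lt', 'Lm' and 'Lo' are letters that are neither.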

I am glad that the standard did not require that
toUpper o toLower = toUpper
toLower o toUpper = toLower
... as these also cannot be guaranteed.
A footnote to this effect might be a good idea.

If you agree with me up to this point, then we have another problem.
(WideChar.isX o WideChar.chr o Char.ord) != Char.isX

This is due to the fact that the character set of Char was not
specified beyond the ASCII range. If taken to be ISO-8859-1,
then the following changes happen:

isUpper accepts C0-D6,D8-DE
isLower accepts AA,B5,BA,DF-F6,F8-FF
isDigit - unchanged
isNumber = isDigit + B2,B3,B9,BC-BE
isAlpha accepts AA,B5,BA,C0-D6,D8-F6,F8-FF
isAlphaNum ...
isHexDigit - unchanged
isGraph accepts A1-FF
isPrint accepts A0-FF
isPunct accepts A1-A9,AB-B1,B4,B6-B8,BB,BF,D7,F7
isCntrl accepts 80-9F
isSpace accepts A0 and 85

toLower and toUpper cover these new characters too.
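
For illustration, a few of these predicates can be written as plain
range tables over Char.ord. The structure name Latin1 and the helper
inAny are made up for this sketch; the ranges above 7F are the ones
listed above, and the ASCII ranges are the usual ones.

```sml
structure Latin1 =
struct
  (* True if the code of c lies in one of the inclusive (lo, hi) ranges. *)
  fun inAny (ranges : (int * int) list) (c : char) =
    let val n = Char.ord c
    in List.exists (fn (lo, hi) => lo <= n andalso n <= hi) ranges
    end

  val isUpper  = inAny [(0x41, 0x5A), (0xC0, 0xD6), (0xD8, 0xDE)]
  val isLower  = inAny [(0x61, 0x7A), (0xAA, 0xAA), (0xB5, 0xB5),
                        (0xBA, 0xBA), (0xDF, 0xF6), (0xF8, 0xFF)]
  fun isAlpha c = isUpper c orelse isLower c
  val isDigit  = inAny [(0x30, 0x39)]
  val isNumber = inAny [(0x30, 0x39), (0xB2, 0xB3), (0xB9, 0xB9),
                        (0xBC, 0xBE)]
  val isCntrl  = inAny [(0x00, 0x1F), (0x7F, 0x7F), (0x80, 0x9F)]
  val isSpace  = inAny [(0x09, 0x0D), (0x20, 0x20), (0x85, 0x85),
                        (0xA0, 0xA0)]
end
```

With this, Latin1.isUpper (Char.chr 0xC9) holds, while D7 (the
multiplication sign) is neither upper nor lower case.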

-----
2. Representation of WideChar in memory
-----

According to the current basis standard:
> The optional WideChar structure defines wide characters, which are
> represented by a fixed number of 8-bit words (bytes)
and the WideChar structure must provide:
> val maxChar : char
> val maxOrd : int
>
> val ord : char -> int
> val chr : int -> char

At the present time, the Unicode standard specifies "code points"
(Unicode speak for characters) in the range 0..10FFFF. This requires
21 bits of memory. The typical representations of Unicode in main
memory are UTF-8, UTF-16, and UTF-32, with code units of 8, 16, and
32 bits respectively. UTF-8 and UTF-16 are both variable length
encodings. Since String.sub should be constant time, only UTF-32 is a
viable choice out of these for WideString (and thus WideChar).
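
To make the constant-time argument concrete, here is a small sketch
(the function names are invented for it) of how many code units each
encoding spends on a code point, and of the offset computation that
indexing would need. Only for UTF-32 is the offset simply the index.

```sml
(* Code units needed for one code point in each encoding. *)
fun utf8Units cp =
  if cp < 0x80 then 1
  else if cp < 0x800 then 2
  else if cp < 0x10000 then 3
  else 4

fun utf16Units cp = if cp < 0x10000 then 1 else 2

fun utf32Units (_ : int) = 1

(* Offset, in code units, of the i-th code point of cps: every
   preceding code point has to be inspected, so this is linear in i
   unless all widths are equal. *)
fun offset units cps i =
  List.foldl (op +) 0 (List.map units (List.take (cps, i)))
```

For the code points [0x41, 0x41F, 0x20AC, 0x1D11E], the fourth one
starts at offset 6 in UTF-8 but at offset 3 in UTF-32.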

In MLton, an int is 32 bits wide, which leaves only 31 bits for
non-negative values. This means that an arbitrary 32-bit UTF-32 value
cannot be returned by 'ord'.

I see a number of ways out:

1. 'chr' should raise Chr if the parameter is > 10FFFF
... therefore, 'ord' never has a problem.
2. we allow implementations to use 21 bits (not a byte multiple)
... what purpose does this constraint serve anyways?
3. we have 'ord' and 'chr' use word instead
... which actually makes sense since code points are in hex

I have no real opinion as to which approach is best.
I do think that it would be insane to actually pack Unicode into
anything other than 32 bits, but a 21 bit wrapper seems ok.
Perhaps I lean towards #1 for compatibility, but #3 for purity.
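
Option 1 might look roughly like this. The structure name CodePoint
and the choice of Word.word as the representation are assumptions
made for the sketch (any word type of at least 21 bits would do);
Chr is the standard exception.

```sml
structure CodePoint =
struct
  (* Hypothetical representation: one machine word per code point. *)
  type char = Word.word

  val maxOrd = 0x10FFFF

  (* Reject anything outside 0..10FFFF, so every char fits in an int. *)
  fun chr i =
    if i < 0 orelse i > maxOrd then raise Chr else Word.fromInt i

  (* Never overflows, because chr is the only way to build a char. *)
  fun ord (c : char) = Word.toInt c

  val maxChar = chr maxOrd
end
```

A real implementation would make the type abstract so that chr really
is the only constructor.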

Also, I think the basis should allow additional structures to
match the CHAR signature. A 16-bit type would be nice.

-----
3. Parsing/serialization
-----

WideChar.toString and WideChar.scan must pack Unicode into a
normal String.string. For code points up to FFFF this is easy; use
the \uFFFF SML escape. For larger values, this presents a problem.
MLton already includes a \U12345678 escape for values that are too
big. (If we opt to restrict WideChar to 10FFFF, this will be changed
to only allow 6 digits.) I think that this is the best solution here.
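
A minimal sketch of the escaping rule, assuming the 8-digit form of
\U and handling only the escaped case (printable ASCII would be
passed through unchanged by a real toString). The function name
escapeCodePoint is made up for the example.

```sml
(* Render a code point as \uXXXX when it fits in four hex digits,
   and as \UXXXXXXXX otherwise. *)
fun escapeCodePoint cp =
  if cp < 0 orelse cp > 0x10FFFF then raise Chr
  else
    let
      val hex = Int.fmt StringCvt.HEX cp
    in
      if cp <= 0xFFFF
      then "\\u" ^ StringCvt.padLeft #"0" 4 hex
      else "\\U" ^ StringCvt.padLeft #"0" 8 hex
    end
```

For example, escapeCodePoint 0x41F yields the six characters \u041F,
and escapeCodePoint 0x1D11E yields \U0001D11E.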

WideChar.{from,to}CString is more of a problem. C does not include
Unicode escapes, as far as I know. Should we mandate that the
string first be converted into UTF-8 and then escaped as C? This
would be the most useful thing I can think of for these functions.
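
What "convert to UTF-8, then escape as C" could mean is sketched
below. The names utf8 and toCStringUtf8 are invented here, code
points are taken as plain ints, and surrogates are not rejected; the
C escaping itself is left to the existing String.toCString.

```sml
(* Encode one code point as a string of UTF-8 bytes. *)
fun utf8 cp =
  let
    fun byte n = String.str (Char.chr n)
    fun cont n = byte (0x80 + n mod 64)   (* 10xxxxxx continuation byte *)
  in
    if cp < 0 orelse cp > 0x10FFFF then raise Chr
    else if cp < 0x80 then byte cp
    else if cp < 0x800 then
      byte (0xC0 + cp div 64) ^ cont cp
    else if cp < 0x10000 then
      byte (0xE0 + cp div 4096) ^ cont (cp div 64) ^ cont cp
    else
      byte (0xF0 + cp div 262144) ^ cont (cp div 4096)
        ^ cont (cp div 64) ^ cont cp
  end

(* Encode every code point, then apply the ordinary C escaping, which
   turns each byte outside printable ASCII into an escape sequence. *)
fun toCStringUtf8 cps = String.toCString (String.concat (List.map utf8 cps))
```

As a check, utf8 0x41F produces the two bytes D0 9F, which is the
UTF-8 form of the first letter of the Russian example in section 4.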

-----
4. The charset of SML source files
-----

It is unreasonable to require programmers to write:
val x = "\u041f\u0440\u0438\u0432\u0435\u0442"
instead of
val x = "Привет"

Therefore, for MLton, we intend to specify that SML source files
are encoded in UTF-8, not ASCII. Furthermore, anything which
WideChar.isPrint would accept, may appear in strings. I think that
any other SML implementation supporting Unicode should do
likewise.

-----
5. My suggested amendments
-----

Here's an executive summary of what I think needs to change.

Documentation-only changes (if there are no existing Unicode SMLs):
   Define WideChar.is* by the Unicode default category
     - i.e. also locale independent
   Relax the requirement that isAlpha = isUpper + isLower
   Redefine isAlphaNum as isAlpha + isNumber (not isDigit)
   Declare that Char is ISO-8859-1; not just any extension

Extensions that should be safe:
   Add to CHAR an isNumber method
   Accept source files in UTF-8 (not just ASCII)
   Allow WideChar.isPrint inside SML source strings
   Add a \U123456 or \U12345678 escape to SML
     - use it in WideChar.{to,from}String

Changes that may cause breakage to existing code:
   Modify Char.is* definition to line up with Unicode
   Modify Char.to{Upper,Lower} to match Unicode
   Redefine toCString to first encode to UTF-8

--Apple-Mail-18-18051494
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
	charset=UTF-8

<HTML><BODY style=3D"word-wrap: break-word; -khtml-nbsp-mode: space; =
-khtml-line-break: after-white-space; ">Good evening!<DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>I have been working on =
adding support for Unicode to the MLton</DIV><DIV>compiler (<A =
href=3D"http://www.mlton.org">www.mlton.org</A>). To the best of my =
knowledge, no other</DIV><DIV>SML compiler presently supports Unicode =
(please let me know if</DIV><DIV>I am wrong here). Stephen Weeks =
redirected me to this list as the</DIV><DIV>appropriate place to push =
for changes to the CHAR signature. :-)</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>I have several concerns =
regarding the definition of CHAR.</DIV><DIV>This email is rather long, =
so I have attempted to organize this</DIV><DIV>into the following =
sections:</DIV><DIV><BR class=3D"khtml-block-placeholder"></DIV><DIV>1. =
is{Alpha,Graph,...}</DIV><DIV>2. Representation of WideChar in =
memory</DIV><DIV>3. Parsing/serialization</DIV><DIV>4. The charset of =
SML source files</DIV><DIV>5. My suggested=C2=A0amendments</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>-----</DIV><DIV>1. =
is{Alpha,Graph,...}</DIV><DIV>-----</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>The standard specifies that =
a normal 'Char.char' is taken to be in</DIV><DIV>the "extended ASCII =
8-bit characters". Clearly, the intent is that the</DIV><DIV>type be 8 =
bits wide, and include ASCII in the first 128 =
characters.</DIV><DIV>However, what is intended for the latter 128 =
characters? There are</DIV><DIV>many extensions of ASCII to 8 =
bit.</DIV><DIV><BR class=3D"khtml-block-placeholder"></DIV><DIV>It seems =
to me that the most reasonable thing to say is that =
the</DIV><DIV>character set is ISO-8859-1, which is the most popular of =
the 8 bit</DIV><DIV>ASCII extensions. Also, ISO-8859-1 enjoys the =
property that</DIV><DIV>(WideChar.chr o Char.ord) will leave the =
character unchanged</DIV><DIV>since ISO-8859-1 is embedded code point =
for character into the</DIV><DIV>Unicode standard. Finally, ISO-8859-1 =
suffices to cover most of</DIV><DIV>the major European =
languages.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>Next, the CHAR standard =
states:</DIV><DIV><BLOCKQUOTE type=3D"cite"><DIV><FONT =
class=3D"Apple-style-span" face=3D"Times" size=3D"4"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 16px;">In =
</SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#WideChar:STR:SPEC"><FONT =
class=3D"Apple-style-span" color=3D"#0021E7" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;">WideChar</SPAN></FONT></A><FONT class=3D"Apple-style-span" =
face=3D"Times" size=3D"4"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 16px;">, the functions </SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#SIG:CHAR.toLower:VAL:SPEC"><FONT =
class=3D"Apple-style-span" color=3D"#0021E7" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;">toLower</SPAN></FONT></A><FONT class=3D"Apple-style-span" =
face=3D"Times" size=3D"4"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 16px;">, </SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#SIG:CHAR.toLower:VAL:SPEC"><FONT =
class=3D"Apple-style-span" color=3D"#0021E7" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;">toLower</SPAN></FONT></A><FONT class=3D"Apple-style-span" =
face=3D"Times" size=3D"4"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 16px;">, </SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#SIG:CHAR.isAlpha:VAL:SPEC"><FONT =
class=3D"Apple-style-span" color=3D"#0021E7" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;">isAlpha</SPAN></FONT></A><FONT class=3D"Apple-style-span" =
face=3D"Times" size=3D"4"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 16px;">,..., </SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#SIG:CHAR.isUpper:VAL:SPEC"><FONT =
class=3D"Apple-style-span" color=3D"#0021E7" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;">isUpper</SPAN></FONT></A><FONT class=3D"Apple-style-span" =
face=3D"Times" size=3D"4"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 16px;"> and, in general, the definition of a =
``letter'' are =
locale-dependent.</SPAN></FONT></DIV></BLOCKQUOTE><DIV><BR =
class=3D"khtml-block-placeholder"></DIV>However, nowhere in the CHAR =
interface is there a provision for<BR></DIV><DIV>specifying the locale =
to the is* methods. Requiring that the entire</DIV><DIV>program has one =
global locale is a very bad idea. C does this,</DIV><DIV>and it is a =
source of many problems.=C2=A0Furthermore, this means =
that</DIV><DIV>WideChar.isSpace x may disagree with Char.isSpace x. =
Also, it</DIV><DIV>means that the behaviour of a utility programs might =
depend on</DIV><DIV>the environment variables they are run on, in =
unexpected ways.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>I think a more sane thing =
to specify would be that WideChar</DIV><DIV>use the 'default =
categorization' given by Unicode:</DIV><DIV><A =
href=3D"http://www.unicode.org/Public/4.1.0/ucd/UCD.html#General_Category_=
Values">http://www.unicode.org/Public/4.1.0/ucd/UCD.html#General_Category_=
Values</A></DIV><DIV>This means that WideChar, too, would be locale =
independent.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>However, It is definitely =
the case that there are locale dependent</DIV><DIV>to{Upper,Lower} and =
is{Digit,...} methods, but these cannot fit into</DIV><DIV>the CHAR =
interface as it exists now, anyways. For one, toUpper</DIV><DIV>may need =
to return multiple characters given one input =
character.</DIV><DIV>Furthermore, other languages use entirely different =
is* methods.</DIV><DIV><BR class=3D"khtml-block-placeholder"></DIV><DIV>I =
propose that locale dependent classification methods be =
moved</DIV><DIV>into a distinct structure which operates exclusively on =
WideChars.</DIV><DIV>Such a structure should provide methods for =
determining character</DIV><DIV>class for a given (character, locale) =
pair. Furthermore, there should</DIV><DIV>be a whole lot more is* =
methods reflecting other languages. This is</DIV><DIV>outside the scope =
of what I want to talk about now.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>There are further =
problems=C2=A0in the definition of the is* methods:</DIV><DIV><SPAN =
class=3D"Apple-tab-span" style=3D"white-space:pre">	</SPAN>isAlpha =
\superset isUpper \cup isLower</DIV><DIV><SPAN class=3D"Apple-tab-span" =
style=3D"white-space:pre">	</SPAN>isAlphaNum \superset isAlpha \cup =
isDigit</DIV><DIV>... this contradicts the guarantees made in the basis =
for WideChar:</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; "><BLOCKQUOTE type=3D"cite"><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 13px;">isAlpha =
</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;"><I>c</I></SPAN></FONT></BLOCKQUOTE></DIV><DIV style=3D"margin-top: =
0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; =
"><BLOCKQUOTE type=3D"cite"><DIV style=3D"margin-top: 0px; margin-right: =
0px; margin-bottom: 0px; margin-left: 0px; "><FONT =
class=3D"Apple-style-span" face=3D"Times" size=3D"4"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 16px;">returns =
</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;">true</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Times" =
size=3D"4"><SPAN class=3D"Apple-style-span" style=3D"font-size: 16px;"> =
if </SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Times" =
size=3D"4"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
16px;"><I>c</I></SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Times" size=3D"4"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 16px;"> is a letter (lowercase or =
uppercase).</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Times" =
size=3D"4"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
16px;">=C2=A0</SPAN></FONT></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 13px;">isAlphaNum =
</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;"><I>c</I></SPAN></FONT></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><FONT =
class=3D"Apple-style-span" face=3D"Times" size=3D"4"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 16px;">returns =
</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;">true</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Times" =
size=3D"4"><SPAN class=3D"Apple-style-span" style=3D"font-size: 16px;"> =
if </SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Times" =
size=3D"4"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
16px;"><I>c</I></SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Times" size=3D"4"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 16px;"> is alphanumeric (a letter or a decimal =
digit).</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Times" =
size=3D"4"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
16px;">=C2=A0</SPAN></FONT></DIV></BLOCKQUOTE></DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>If those two requirements =
were removed, then the map could be</DIV><DIV>(<A =
href=3D"http://www.unicode.org/Public/4.1.0/ucd/UCD.html#General_Category_=
Values">http://www.unicode.org/Public/4.1.0/ucd/UCD.html#General_Category_=
Values</A>):</DIV><DIV><SPAN class=3D"Apple-tab-span" =
style=3D"white-space:pre">	</SPAN>isUpper =3D category =
'Lu'</DIV><DIV><SPAN class=3D"Apple-tab-span" style=3D"white-space:pre">	=
</SPAN>isLower =3D 'Ll'</DIV><DIV><SPAN class=3D"Apple-tab-span" =
style=3D"white-space:pre">	</SPAN>isAlpha =3D 'L?'</DIV><DIV><SPAN =
class=3D"Apple-tab-span" style=3D"white-space:pre">	=
</SPAN>isAlphaNum =3D '{L,N}?'</DIV><DIV><SPAN class=3D"Apple-tab-span" =
style=3D"white-space:pre">	</SPAN>isCntrl =3D 'C?'</DIV><DIV><SPAN =
class=3D"Apple-tab-span" style=3D"white-space:pre">	</SPAN>isDigit =3D=
 'Nd'</DIV><DIV><SPAN class=3D"Apple-tab-span" style=3D"white-space:pre">	=
</SPAN>isNumber =3D 'N?' &lt;----- I want this added to =
CHAR</DIV><DIV><SPAN class=3D"Apple-tab-span" style=3D"white-space:pre">	=
</SPAN>isHexDigit =3D Hex_Digit (<A =
href=3D"http://www.unicode.org/Public/4.1.0/ucd/UCD.html#Hex_Digit">http:/=
/www.unicode.org/Public/4.1.0/ucd/UCD.html#Hex_Digit</A>)=C2=A0</DIV><DIV>=
<SPAN class=3D"Apple-tab-span" style=3D"white-space:pre">	=
</SPAN>isSpace =3D White_Space (<A =
href=3D"http://www.unicode.org/Public/4.1.0/ucd/UCD.html#White_Space">http=
://www.unicode.org/Public/4.1.0/ucd/UCD.html#White_Space</A>)</DIV><DIV><S=
PAN class=3D"Apple-tab-span" style=3D"white-space:pre">	</SPAN>isPrint =3D=
 not 'C?'</DIV><DIV><SPAN class=3D"Apple-tab-span" =
style=3D"white-space:pre">	</SPAN>isGraph =3D isPrint andalso not =
isSpace</DIV><DIV><SPAN class=3D"Apple-tab-span" =
style=3D"white-space:pre">	</SPAN>isPunct =3D isGraph andalso not =
isAlphaNum</DIV><DIV><BR class=3D"khtml-block-placeholder"></DIV><DIV>I =
am glad that the standard did not require that=C2=A0</DIV><DIV>toUpper o =
toLower =3D toUpper</DIV><DIV>toLower o toUppser =3D =
toLower</DIV><DIV>... as these can also not be guaranteed.</DIV><DIV>A =
footnote to this effect might be a good idea.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>If you agree with me up to =
this point, then we have another problem.</DIV><DIV>(WideChar.isX o =
WideChar.chr o Char.ord) !=3D Char.isX</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>This is due to the fact =
that the character set of Char was not</DIV><DIV>specified beyond the =
ASCII range. If taken to be ISO-8859-1,</DIV><DIV>then the following =
changes happen:</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>isUpper accepts =
C0-D6,D8-DE</DIV><DIV>isLower accepts =
AA,B5,BA,DF-F6,F8-FF</DIV><DIV>isDigit - unchanged</DIV><DIV>isNumber =3D =
isDigit + B2,B3,B9,BC-BE</DIV><DIV>isAlpha accepts =
AA,B5,BA,C0-D6,D8-F6,F8-FF</DIV><DIV>isAlphaNum ...</DIV><DIV>isHexDigit =
- unchanged</DIV><DIV>isGraph accepts A1-FF</DIV><DIV>isPrint accepts=C2=A0=
 A0-FF</DIV><DIV>isPunct accepts =
A1-A9,AB-B1,B4,B6-B8,BB,BF,D7,F7</DIV><DIV>isControl accepts =
80-9F</DIV><DIV>isSpace accepts A0 and 85</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>toLower and toUpper cover =
these new characters too.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>-----</DIV><DIV>2. =
Representation of Char in memory</DIV><DIV>-----</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>According to the current =
basis standard:</DIV><DIV><BLOCKQUOTE type=3D"cite"><FONT =
class=3D"Apple-style-span" face=3D"Times" size=3D"4"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 16px;">The optional =
</SPAN></FONT><FONT class=3D"Apple-style-span" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;">WideChar</SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Times" size=3D"4"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 16px;"> structure defines wide characters, which are =
represented by a fixed number of 8-bit words =
(bytes)</SPAN></FONT></BLOCKQUOTE>and the WideChar structure must =
provide:<BR></DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; "><BLOCKQUOTE type=3D"cite"><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; "><FONT class=3D"Apple-style-span" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;"><B>val</B></SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;">=C2=A0</SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#SIG:CHAR.maxChar:VAL"><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 13px;"><FONT =
class=3D"Apple-style-span" =
color=3D"#0021E7">maxChar</FONT></SPAN></FONT></A><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0</SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;"><B>:</B></SPAN></FONT><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0char</SPAN></FONT></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;"><B>val</B></SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;">=C2=A0</SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#SIG:CHAR.maxOrd:VAL"><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 13px;"><FONT =
class=3D"Apple-style-span" =
color=3D"#0021E7">maxOrd</FONT></SPAN></FONT></A><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0</SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;"><B>:</B></SPAN></FONT><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0int</SPAN></FONT></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal =
normal normal 13px/normal Courier; min-height: 16px; "><BR></DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; "><FONT class=3D"Apple-style-span" face=3D"Courier" =
size=3D"3"><SPAN class=3D"Apple-style-span" style=3D"font-size: =
13px;"><B>val</B></SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;">=C2=A0</SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#SIG:CHAR.ord:VAL"><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 13px;"><FONT =
class=3D"Apple-style-span" =
color=3D"#0021E7">ord</FONT></SPAN></FONT></A><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0</SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;"><B>:</B></SPAN></FONT><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0char=C2=A0</SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;"><B>-&gt;</B></SPAN></FONT><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0int</SPAN></FONT></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;"><B>val</B></SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;">=C2=A0</SPAN></FONT><A =
href=3D"http://mlton.org/basis/char.html#SIG:CHAR.chr:VAL"><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: 13px;"><FONT =
class=3D"Apple-style-span" =
color=3D"#0021E7">chr</FONT></SPAN></FONT></A><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0</SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;"><B>:</B></SPAN></FONT><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0int=C2=A0</SPAN></FONT><FONT class=3D"Apple-style-span" =
face=3D"Courier" size=3D"3"><SPAN class=3D"Apple-style-span" =
style=3D"font-size: 13px;"><B>-&gt;</B></SPAN></FONT><FONT =
class=3D"Apple-style-span" face=3D"Courier" size=3D"3"><SPAN =
class=3D"Apple-style-span" style=3D"font-size: =
13px;">=C2=A0char</SPAN></FONT></DIV></BLOCKQUOTE></DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; "><BR class=3D"khtml-block-placeholder"></DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">At the present time, the Unicode standard specifies =
"code points"</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; ">(Unicode speak for characters) =
between 0..10FFFF. This requires</DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">21 bits of =
memory. The typical representations of Unicode in main</DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">memory are UTF-8, UTF-16, and UTF-32, using bit =
widths 8,16,32</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; ">respectively. UTF-8 and UTF-16 =
are both variable length encodings.</DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Since =
String.sub should be constant time, only UTF-32 is a viable</DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">choice out of these for WideString (and thus =
WideChar).</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; "><BR =
class=3D"khtml-block-placeholder"></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">In MLton, an =
int can only fit 32 bits, or 31 bits for positive numbers.</DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">This means that a UTF-32 value can't fit into =
'ord'.</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; "><BR =
class=3D"khtml-block-placeholder"></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">I see a =
number of ways out:</DIV><DIV style=3D"margin-top: 0px; margin-right: =
0px; margin-bottom: 0px; margin-left: 0px; "><BR =
class=3D"khtml-block-placeholder"></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">1.=C2=A0'chr' =
should raise Chr if the parameter is &gt; 10FFFF</DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">... therefore, 'ord' never has a problem.</DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">2. we allow implementations to use 21 bits (not a =
byte multiple)</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; ">... what purpose does this =
constraint serve anyways?</DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">3. we have =
'ord' and 'chr' use word instead</DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">... which =
actually makes sense since code points are in hex</DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; "><BR class=3D"khtml-block-placeholder"></DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">I have no real opinion as to which approach is =
best.</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; ">I do think that it would be =
insane to actually pack unicode into</DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">anything =
other than 32 bits, but a 21 bit wrapper seems ok.</DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">Perhaps I lean towards #1 for compatibility, but #3 =
for purity.</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; "><BR =
class=3D"khtml-block-placeholder"></DIV><DIV style=3D"margin-top: 0px; =
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Also, I think =
the basis should allow additional structures to</DIV><DIV =
style=3D"margin-top: 0px; margin-right: 0px; margin-bottom: 0px; =
margin-left: 0px; ">match the CHAR signature. A 16-bit type would =
be nice.</DIV><DIV style=3D"margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; "><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>-----</DIV><DIV>3. =
Parsing/serialization</DIV><DIV>-----</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>WideChar.toString and =
WideChar.scan must pack Unicode into a</DIV><DIV>normal String.string. =
For code points up to FFFF this is easy; use</DIV><DIV>the \uFFFF SML =
escape. For larger values, this presents a problem.</DIV><DIV>MLton =
already includes a \U12345678 for values that are too big.</DIV><DIV>(If =
we opt to restrict WideChar to 10FFFF, this will be changed =
to</DIV><DIV>only allow 6 digits) I think that this is the best solution =
here.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>WideChar.{from,to}CString =
is more of a problem. C does not include</DIV><DIV>Unicode escapes, as =
far as I know. Should we mandate that the</DIV><DIV>string first be =
converted into UTF-8 and then escaped as C? This</DIV><DIV>would be the =
most useful thing I can think of for these functions.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>-----</DIV><DIV>4. The =
charset of SML source files</DIV><DIV>-----</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV><DIV>It is unreasonable to =
require programmers to write:</DIV><DIV>val x =3D =
"\u041f\u0440\u0438\u0432\u0435\u0442"</DIV><DIV>instead =
of</DIV><DIV>val x =3D "=D0=9F=D1=80=D0=B8=D0=B2=D0=B5=D1=82"</DIV><DIV><B=
R class=3D"khtml-block-placeholder"></DIV><DIV>Therefore, for MLton, we =
intend to specify that SML source files</DIV><DIV>are encoded in UTF-8, =
not ASCII. Furthermore, anything which</DIV><DIV>WideChar.isPrint would =
accept, may appear in strings. I think that</DIV><DIV>any other SML =
implementation supporting Unicode should =
do</DIV><DIV>likewise.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>-----</DIV><DIV>5. My =
suggested=C2=A0amendments</DIV><DIV>-----</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>Here's an executive summary =
of what I think needs to change.</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>Documentation only changes =
(if no existing Unicode SMLs):</DIV><DIV>=C2=A0 Define WideChar.is* by =
the Unicode default category</DIV><DIV>=C2=A0=C2=A0 =C2=A0- ie: also =
locale independent</DIV><DIV>=C2=A0 Relax the requirement that isAlpha =3D=
 isUpper + isLower</DIV><DIV>=C2=A0 Redefine isAlphaNum as isAlpha + =
isNumber (not isDigit)</DIV><DIV>=C2=A0 Declare that Char is ISO-8859-1; =
not just any extension</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>Extensions that should be =
safe:</DIV><DIV>=C2=A0 Add to CHAR an isNumber method</DIV><DIV>=C2=A0 =
Accept source files in UTF-8 (not just ASCII)</DIV><DIV>=C2=A0 Allow =
WideChar.isPrint inside SML source strings</DIV><DIV>=C2=A0 Add an =
\U123456 or \U12345678 escape to SML</DIV><DIV>=C2=A0=C2=A0 =C2=A0- use =
it in WideChar.{to,from}String</DIV><DIV><BR =
class=3D"khtml-block-placeholder"></DIV><DIV>Changes that may cause =
breakage to existing code:</DIV><DIV>=C2=A0 Modify Char.is* definition =
to line up with Unicode</DIV><DIV>=C2=A0 Modify Char.to{Upper,Lower} to =
match Unicode</DIV><DIV>=C2=A0 Redefine toCString to first encode to =
UTF-8</DIV></DIV></BODY></HTML>=

--Apple-Mail-18-18051494--