[MLton] WideChar

Stephen Weeks MLton@mlton.org
Fri, 10 Dec 2004 12:24:48 -0800


Some thoughts on the WideChar stuff.  Some of this is covered in
others' email and some is simply summarizing and stating my current
position.

* To make it clear that the char type is ISO-8859-1, I added a note to
  http://mlton.org/BasisLibrary.

* The behavior of the Char and String functions does not depend on
  locale.

* Writing string constants using only \u escapes makes it practically
  impossible to write non-English programs.  We will support UTF-8
  encoded strings. 

* \u is not sufficient to get all of Unicode, since four hex digits
  reach only the first 65536 code points -- an unacceptable omission
  in the Definition.  We should also allow \Uxxxxxxxx.  We should not
  drop support for \u, since the Definition requires it, and it is by
  far the more common case anyway.
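
  As a sketch, assuming string constants may be resolved at type
  WideString.string (as discussed in this thread; WideString is the
  proposed structure):

     (* \uxxxx is required by the Definition but reaches only the
      * first 65536 code points. *)
     val alpha: WideString.string = "\u03B1"    (* GREEK SMALL LETTER ALPHA *)
     (* \Uxxxxxxxx is the proposed extension covering all of Unicode. *)
     val clef: WideString.string = "\U0001D11E" (* MUSICAL SYMBOL G CLEF *)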

* Restricting variable names to printable ASCII is painful for
  non-English-speaking programmers.  We should move toward making the
  base alphabet that MLton accepts Unicode, and make the default
  encoding of programs be UTF-8.

* It is a mistake to argue for extensions to the Definition based on
  the fact that they only hurt portability away from MLton, not to
  MLton.  There are other valid reasons for extensions, but this is
  not one of them.  This is exactly the argument that SML/NJ has used
  many times, and it has harmed the SML community by fragmenting it.

* There are standard table compression techniques (multi-level tables
  with sharing) that can make a Unicode ML-Lex feasible.
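
  A minimal sketch of the two-level idea (the names and the 256-entry
  block size are illustrative):

     (* The high bits of a code point index a vector of block numbers;
        identical 256-entry blocks are stored once and shared, which is
        what keeps full-Unicode tables small. *)
     type table = {index: int vector, blocks: int vector vector}

     fun lookup ({index, blocks}: table) (cp: int): int =
        let
           val hi = cp div 256   (* which block *)
           val lo = cp mod 256   (* position within the block *)
        in
           Vector.sub (Vector.sub (blocks, Vector.sub (index, hi)), lo)
        end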

* When converting a widechar reader to a char reader, it is sometimes
  useful to raise an exception on encountering a widechar outside the
  char range and sometimes useful to return NONE.  We should provide
  both types of converters.
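
  A sketch of the two styles in terms of StringCvt.reader (WideChar is
  the proposed structure; the function names are illustrative):

     exception WideCharInStream

     (* Raise on a widechar that does not fit in char. *)
     fun narrowExn (rdr: (WideChar.char, 'a) StringCvt.reader)
                   : (char, 'a) StringCvt.reader =
        fn s =>
           case rdr s of
              NONE => NONE
            | SOME (wc, s') =>
                 let val i = WideChar.ord wc
                 in if i < 256
                       then SOME (Char.chr i, s')
                    else raise WideCharInStream
                 end

     (* Return NONE instead; note this conflates end-of-stream with
        out-of-range, which is why both converters are worth having. *)
     fun narrowOpt (rdr: (WideChar.char, 'a) StringCvt.reader)
                   : (char, 'a) StringCvt.reader =
        fn s =>
           case rdr s of
              NONE => NONE
            | SOME (wc, s') =>
                 let val i = WideChar.ord wc
                 in if i < 256 then SOME (Char.chr i, s') else NONE
                 end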

* In the basis library, char is defined as int8 rather than word8 so
  that the FFI works.  In C, char is typically signed, and signed char
  may have a different calling convention than unsigned char.  If we
  defined char as Word8.word, then in order to import or export a
  function that deals with chars, one would have to use int8 and
  coerce on the SML side.  That seems like a major pain.
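
  For example, with a hypothetical C function char my_upper (char c),
  the import needs no coercions on the SML side:

     (* char on the SML side lines up with C's char directly. *)
     val myUpper = _import "my_upper": char -> char;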

* Using a datatype for encoding names is preferable to using strings.
  Using datatypes does not introduce any problems with adding new
  encodings.  For example, with strings, pattern matches will look
  like 

	case enc of
	   "UTF8" => ...
	 | "UTF16" => ...
	 | "my-encoding" => ...
	 | _ => error "unknown encoding"

  With a datatype, the same match would look like

	case enc of
	   UTF8 => ...
	 | UTF16 => ...
	 | X "my-encoding" => ...
	 | _ => error "unknown encoding"

  In either case, adding a new encoding (either as an extension or a
  special variant) causes no problems.  And with the datatype of known
  encodings, one gets the benefit in the common case of type-checker
  supported agreement of encoding name.
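
  A sketch of the datatype (constructor names follow the match above;
  X is the escape hatch for encodings unknown to the library):

     datatype encoding =
        UTF8
      | UTF16
      | X of string

     fun name (e: encoding): string =
        case e of
           UTF8 => "UTF-8"
         | UTF16 => "UTF-16"
         | X s => s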

* We should put the new localization stuff in an MLB library, not the
  MLton structure.

* I don't see what OOP buys in handling locales (i.e. I don't see any
  use for inheritance, dynamic dispatch, or otherwise).  We can simply
  have functions that depend on the locale.

  signature LOCALE =
     sig
        type t

        val make: ??? -> t
        val isAlpha: t * LargeChar.t -> bool
        val isPrint: t * LargeChar.t -> bool
     end

  You could make this look more OOP by using a record of member
  methods, as below.

  signature LOCALE_FACTORY =
     sig
        val make: ??? -> {isAlpha: LargeChar.t -> bool,
                          isPrint: LargeChar.t -> bool}
     end

  Since these signatures are completely equivalent (one can write a
  functor mapping between them), I'd go for the first approach, as it
  is more idiomatic SML.
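
  A sketch of that functor, abstracting the unspecified argument type
  of make (written ??? above) as args, and assuming the proposed
  LargeChar.t:

     signature LOCALE =
        sig
           type t
           type args   (* stands in for the ??? above *)
           val make: args -> t
           val isAlpha: t * LargeChar.t -> bool
           val isPrint: t * LargeChar.t -> bool
        end

     functor LocaleFactory (L: LOCALE) =
        struct
           (* Each call to make closes over a single locale value. *)
           fun make (a: L.args) =
              let
                 val t = L.make a
              in
                 {isAlpha = fn c => L.isAlpha (t, c),
                  isPrint = fn c => L.isPrint (t, c)}
              end
        end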