[MLton] Unicode / WideChar

Mon, 21 Nov 2005 19:15:18 +1100

On Mon, 2005-11-21 at 08:11 +0100, Florian Weimer wrote:
> * Henry Cejtin:
> 
> > Ah, sorting by codepoints is just what I want: something NOT locale
> > dependent.
> 
> But it still won't work well within UTF-16 environments. 8-(
> 
> > (Actually, there is still the problem of alternate representations of the
> > same codepoint in UTF-8, right?  I think it has to do with marks.)
> 
> Correctly implemented UTF-8 does not suffer from the multiple
> representation problem.

Just to clarify: Unicode can be sorted by code point.
There is no issue of marks as such. All of UTF-8, UTF-16,
UCS-2be, UCS-2le, UCS-4be, UCS-4le .. and other representations,
can be sorted with appropriate low-level comparisons. (For
UTF-16 the comparison has to account for surrogate pairs if
the result is to match code point order.)

In particular, UTF-8 will sort correctly with any 8-bit clean
bytewise sort, as will UCS-2be and UCS-4be. Little endian
representations need a word-size aware sort.

There is no need to bother with any special considerations!
Only 3 sort algorithms are required to handle all cases:

1. byte wise
2. 16 bit word wise
3. 32 bit word wise

On a big endian machine, all three are identical: comparing
native words gives the same result as comparing their bytes.
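
To illustrate the three cases, here is a minimal Python sketch.
It is not from the original thread: the sample strings and the
helpers word_key and sort_encoded are made up for illustration,
and Python's "utf-32" and "utf-16" codecs stand in for UCS-4 and
(for BMP-only text) UCS-2. It checks which comparisons reproduce
code point order.

```python
import struct

def word_key(data: bytes, width: int, little_endian: bool) -> tuple:
    """Hypothetical helper: read `data` as unsigned words of `width` bytes."""
    code = {2: "H", 4: "I"}[width]
    order = "<" if little_endian else ">"
    return struct.unpack(f"{order}{len(data) // width}{code}", data)

def sort_encoded(words, codec, key=None):
    """Encode each word, sort the raw bytes, then decode for display."""
    encoded = sorted((w.encode(codec) for w in words), key=key)
    return [b.decode(codec) for b in encoded]

words = ["z", "\u00e9", "\u4e2d", "\U0001f600", "a\u0100", "a\u00ff"]
by_codepoint = sorted(words)  # Python compares str values by code point

# 1. byte wise: enough for UTF-8 and for big endian UCS-4
utf8_sorted = sort_encoded(words, "utf-8")
ucs4be_sorted = sort_encoded(words, "utf-32-be")

# 3. 32 bit word wise: needed for little endian UCS-4
ucs4le_wordwise = sort_encoded(
    words, "utf-32-le", key=lambda b: word_key(b, 4, True))
ucs4le_bytewise = sort_encoded(words, "utf-32-le")  # wrong algorithm on purpose

# 2. 16 bit word wise: little endian UCS-2 (sample restricted to the BMP)
bmp_words = [w for w in words if all(ord(c) < 0x10000 for c in w)]
ucs2le_wordwise = sort_encoded(
    bmp_words, "utf-16-le", key=lambda b: word_key(b, 2, True))

print(utf8_sorted == by_codepoint, ucs4le_bytewise == by_codepoint)
```

The bytewise sort of the little endian data is included only to
show that it disagrees with code point order for this sample.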

Don't be confused though .. the codec used has
NOTHING to do with Unicode at all. UTF-8 for example
has NOTHING to do with Unicode. It is a way of
representing ANY sequence of a subrange of integers.
Thus it can ALSO be used for, say, Latin-1, Latin-2,
or any other 'character set'.
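
As a rough illustration of that point, the sketch below applies
the UTF-8 bit layout to plain integers without consulting any
Unicode tables. The function name utf8_encode_int and the sample
values are hypothetical, and the sketch only covers the forms up
to four bytes.

```python
def utf8_encode_int(n: int) -> bytes:
    """Encode one non-negative integer using the UTF-8 bit layout."""
    if n < 0:
        raise ValueError("negative values cannot be encoded")
    if n < 0x80:
        return bytes([n])
    if n < 0x800:
        return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
    if n < 0x10000:
        return bytes([0xE0 | (n >> 12),
                      0x80 | ((n >> 6) & 0x3F),
                      0x80 | (n & 0x3F)])
    if n < 0x200000:
        return bytes([0xF0 | (n >> 18),
                      0x80 | ((n >> 12) & 0x3F),
                      0x80 | ((n >> 6) & 0x3F),
                      0x80 | (n & 0x3F)])
    raise ValueError("value outside the range covered by this sketch")

# Arbitrary integers: nothing here says they are Unicode code points.
values = [70000, 5, 300, 0xE9, 5000]
encoded = [utf8_encode_int(v) for v in values]

# Bytewise order of the encodings follows numeric order of the values.
assert sorted(encoded) == [utf8_encode_int(v) for v in sorted(values)]

# Latin-1 byte values can be pushed through the same layout.
latin1_bytes = "caf\u00e9".encode("latin-1")
recoded = b"".join(utf8_encode_int(b) for b in latin1_bytes)
```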

In Unicode, certain 'characters' have more than one
representation: for example certain accented characters
can be represented by either one Unicode code point,
or two: the base character plus a combining character
(the accent). In general, the order of combining
characters isn't specified, and there is a lot of
ambiguity for certain texts (such as Arabic I think),
where the context makes a difference.

This is also true in European languages, where
the first letter of the first word of a sentence
is gratuitously Capitalised.

There is a standard for *normalisation* to remove
some of these ambiguities. E.g. in the composed form, a
base character plus combining character is replaced by
the one code point representation when one is available,
and when multiple combining characters are required,
their order is specified.

You may need to normalise text before sorting it
if you need more deterministic behaviour.
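
A small sketch of what that looks like in practice, using
Python's standard unicodedata module. The sample strings are
made up for illustration; NFC and NFD are the composed and
decomposed normalisation forms.

```python
import unicodedata

composed = "\u00e9"      # e with acute as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two spellings differ as code point sequences until normalised.
nfc_e = unicodedata.normalize("NFC", decomposed)

# Two combining marks in either order normalise to one fixed order.
acute_then_dot = "a\u0301\u0323"
dot_then_acute = "a\u0323\u0301"
same_after_nfd = (unicodedata.normalize("NFD", acute_then_dot)
                  == unicodedata.normalize("NFD", dot_then_acute))

# Code point sorting gives different results before and after.
raw = ["e\u0301t", "\u00e9a"]
raw_sorted = sorted(raw)
normalised_sorted = sorted(unicodedata.normalize("NFC", w) for w in raw)

print(composed == decomposed, nfc_e == composed, same_after_nfd)
```

In the unnormalised list the decomposed word sorts first because
it starts with a plain "e"; after normalisation both words start
with the same code point and the second letter decides.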

Finally, note that NONE of this is even remotely related
to collating sequences. Collation is a way to sort
human readable script. It only applies to fixed
subsets of Unicode, and it is heavily application
dependent and very complicated. For example
to produce a book index in English you must
use the collating sequence:

AaBbCc ...

which is NOT the same as the code point ordering.
Additionally, some code points, such as control characters,
are not in the sequence at all .. they cannot be printed,
and it's an error to include them in the input to a
collation algorithm.
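
A toy sketch of the difference, assuming only the 52 unaccented
English letters take part. The names COLLATING_SEQUENCE, RANK and
collation_key are hypothetical, and a real collation algorithm is
far more involved than this.

```python
import string

# The sequence AaBbCc ... built from the ASCII letters.
COLLATING_SEQUENCE = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase))
RANK = {ch: i for i, ch in enumerate(COLLATING_SEQUENCE)}

def collation_key(word: str) -> list:
    """Map each character to its position in the collating sequence."""
    try:
        return [RANK[ch] for ch in word]
    except KeyError as exc:
        # Anything outside the sequence (e.g. a control character) is an error.
        raise ValueError(
            f"character {exc.args[0]!r} is not in the collating sequence"
        ) from None

entries = ["banana", "Apple", "apple", "Banana"]
by_codepoint = sorted(entries)
by_collation = sorted(entries, key=collation_key)

print(by_codepoint)
print(by_collation)
```

Code point order puts every upper case entry before every lower
case one, while the collating sequence keeps "Apple" and "apple"
next to each other.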

The bottom line for system tools is: provide the three
algorithms above. And nothing else. Do not attempt
to detect the encoding and choose the algorithm
automatically -- it has to be specified by the user.
HTML, for example, SHOULD be sorted as UTF-8 *unless*
it contains a specific encoding tag in the header,
or for some reason you know the 'system' has already
transcoded it (e.g. to UCS-4).

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net