[MLton] Unicode / WideChar

Stephen Weeks MLton@mlton.org
Tue, 29 Nov 2005 19:47:18 -0800


> The increase in binary size comes from storing the unicode mapping.
> I've worked at compressing the Unicode database, and have gotten all
> the information needed down to under 6k now.
...
> > Would it be possible to set things up so that
> > the startup memory use and time, as well as the code-size increase,
> > would only happen if the WideChar functions were actually used?
> 
> Because I functorized the Char stuff into CharFn and this uses the
> Unicode common lookup table, Char itself will pull this in.

Ah.  Now that I have looked at your code, I understand.  I don't think
it would be too hard to tweak things to share most of the Char stuff,
while not sharing the is* functions, which require the Unicode table.
In fact, I guess this is now necessary, since Char is not a subset of
Unicode.

I can imagine it will be useful to special-case Char in other places
too, for example in the memoize function.

> > hash(c) = ((c >> 14) * 16837 + c) & 0x7FFF
> 
> By the way, you might notice that if MLton kept track of the bit
> widths of promoted types (which I still want for 32*32 muls!!!),
> then this becomes a simple hash(c) = c for ASCII.

True.  If it became necessary, it would be easy enough to abstract out
the hash function to improve the efficiency of Char.
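
To make the ASCII observation concrete, here is a small sketch in
Python (not the SML from the branch).  The formula is the one quoted
above; the names unicode_hash and ascii_is_identity are hypothetical
and only for illustration.  For any c below 2^14 the shifted term is
zero and the mask leaves c untouched, so the hash reduces to c.

```python
def unicode_hash(c: int) -> int:
    """Hash formula as quoted in the thread (name is hypothetical)."""
    return ((c >> 14) * 16837 + c) & 0x7FFF


# For c < 2**14, (c >> 14) == 0, and c already fits in 15 bits,
# so the whole expression collapses to c itself.
ascii_is_identity = all(unicode_hash(c) == c for c in range(128))

# Above 2**14 the multiplier starts to contribute.
first_mixed = unicode_hash(0x4000)
```

Here ascii_is_identity comes out True, which is the case a
width-aware compiler could in principle simplify away.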

BTW, for anyone who wants to follow Wesley's code, it's in the
"unicode" branch of the SVN repository, at
branches/unicode/basis-library/text.

One other thing that would be useful in there is a Makefile that
builds and calls gen-hash and gen-lists in the right way.