[MLton] Unicode... again
Wesley W. Terpstra
wesley at terpstra.ca
Fri Feb 9 10:12:45 PST 2007
On Feb 9, 2007, at 5:56 PM, skaller wrote:
>>> Tradeoff between flexibility and safety here..
>> I don't see this point.
> The point is people want to work with other charsets and
> encodings whether you like it or not, and for that purpose
> you can choose whether to let them use the string type,
> or force them to use a raw encoding like (Word8Vector?)
That's what I plan to do. A String is supposed to hold only
characters with code point < 256. If you want to work with encoded
text in some other charset, that's a blob, and blobs are
Word8Vector.vector. We will provide a function similar to iconv that
allows incremental conversion of WideChar <-> Word8. If I recall
correctly, we were planning to model it on the ('a, 'b) reader types
already used in SML.
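
To make the reader idea concrete, here is a rough sketch, written in
Python rather than SML purely for illustration. It models a reader as
a function from a stream state to either None or an (item, new state)
pair, which is the shape of the SML ('a, 'b) reader type. The names
byte_reader, utf8_decoder and read_all are hypothetical and are not
the planned MLton API. The decoder turns a byte reader into a code
point reader and does its work one code point at a time.

```python
from typing import Callable, List, Optional, Tuple

# A reader maps a stream state to None (end of input) or (item, next_state).
ByteReader = Callable[[int], Optional[Tuple[int, int]]]
CodePointReader = Callable[[int], Optional[Tuple[int, int]]]


def byte_reader(data: bytes) -> ByteReader:
    """Reader over an in-memory blob; the state is just an index."""
    def get(pos: int) -> Optional[Tuple[int, int]]:
        return (data[pos], pos + 1) if pos < len(data) else None
    return get


def utf8_decoder(get_byte: ByteReader) -> CodePointReader:
    """Turn a byte reader into a code point reader (hypothetical helper)."""
    def get_code_point(state: int) -> Optional[Tuple[int, int]]:
        first = get_byte(state)
        if first is None:
            return None
        b0, state = first
        if b0 < 0x80:
            return b0, state
        # (continuation bytes, payload bits of lead byte, smallest legal value)
        if 0xC2 <= b0 <= 0xDF:
            extra, cp, low = 1, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            extra, cp, low = 2, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            extra, cp, low = 3, b0 & 0x07, 0x10000
        else:
            raise ValueError("invalid UTF-8 lead byte 0x%02X" % b0)
        for _ in range(extra):
            nxt = get_byte(state)
            if nxt is None:
                raise ValueError("truncated UTF-8 sequence")
            b, state = nxt
            if b & 0xC0 != 0x80:
                raise ValueError("invalid UTF-8 continuation byte 0x%02X" % b)
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong forms, surrogates and values past U+10FFFF.
        if cp < low or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("ill-formed UTF-8 sequence")
        return cp, state
    return get_code_point


def read_all(reader: CodePointReader, state: int) -> List[int]:
    """Drain a reader into a list, starting from the given state."""
    out = []
    step = reader(state)
    while step is not None:
        item, state = step
        out.append(item)
        step = reader(state)
    return out
```

Because the decoder only asks the underlying reader for the bytes of
the next code point, a caller can stop, resume, or swap in a different
byte source without changing the decoder.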
> However if you want to read strings
> from files in UTF-8 format (at run time i mean)
> you can get errors, and if you want
> to do it FAST you cannot detect them: validating the input
> isn't really an option (it will cause a whole extra pass
> on the file which is WAY too expensive).
I don't see how this differs from parsing any other input. You can
parse it incrementally, and when an error occurs your parser reports
it. Using the SML ('a, 'b) reader abstraction, you can attach a
UTF-8 (or other) decoder to a BinIO stream and connect that in turn
to your parser. MLton will inline all the abstraction anyway. :-)
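
As a rough illustration of that single-pass pipeline, here is a
Python analogue (not SML, and not MLton code). It swaps in Python's
standard codecs incremental decoder for the planned reader-based one,
reads a binary stream in small chunks, and feeds the decoded
characters to a toy whitespace tokenizer. The helper names
decoded_chars, parse_words and parse_until_error are hypothetical.
Malformed input is reported when the decoder reaches it, with no
separate validation pass.

```python
import codecs
import io
from typing import BinaryIO, Iterable, Iterator, List, Optional, Tuple


def decoded_chars(stream: BinaryIO, chunk_size: int = 4) -> Iterator[str]:
    """Lazily decode a binary stream as UTF-8, one chunk at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # final=True makes a sequence cut off at end of input an error.
            yield from decoder.decode(b"", final=True)
            return
        # Multi-byte sequences split across chunks are buffered internally.
        yield from decoder.decode(chunk)


def parse_words(chars: Iterable[str]) -> Iterator[str]:
    """Toy parser: split a character stream into whitespace-separated words."""
    word: List[str] = []
    for ch in chars:
        if ch.isspace():
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(ch)
    if word:
        yield "".join(word)


def parse_until_error(
    data: bytes, chunk_size: int = 4
) -> Tuple[List[str], Optional[str]]:
    """Parse in one pass; return the words seen and an error message, if any."""
    words: List[str] = []
    try:
        for w in parse_words(decoded_chars(io.BytesIO(data), chunk_size)):
            words.append(w)
    except UnicodeDecodeError as exc:
        return words, str(exc)
    return words, None
```

In this sketch the words that precede the bad chunk have already been
handed to the caller by the time the error is raised. How much valid
input gets through before the error depends on the chunk size.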
As an aside: I've made the width of WideChar (16 or 32 bits) a
compiler flag, similar to how the default int type is chosen.