[MLton] Unicode... again

Fri Feb 9 17:59:05 PST 2007

On Fri, 2007-02-09 at 19:12 +0100, Wesley W. Terpstra wrote:

> > However if you want to read strings
> > from files in UTF-8 format (at run time, I mean)
> > you can get errors, and if you want
> > to do it FAST you cannot detect them: validating the input
> > isn't really an option (it will cause a whole extra pass
> > on the file which is WAY too expensive).
> 
> I don't see how this is any different from parsing any input in  
> general.

In principle it isn't. In practice, suppose you are Google
and you're doing text searches of web pages made by
other people .. you mmap the file into memory and start
scanning. You're using advanced search algorithms on the
text to avoid inspecting every character (Boyer-Moore
etc) .. there's no way you're going to want to bother
validating that the text is UTF-8 compliant; that would
defeat the whole point of ultra-fast scan algorithms.
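
To make the point concrete, here is a minimal sketch in present-day
Python 3 of searching raw bytes without ever decoding or validating
them. The function name find_utf8 and the sample data are made up for
illustration, and bytes.find stands in for whatever fast substring
search you would really use. The same call is available on mmap
objects. It relies on UTF-8 being self-synchronizing: in well-formed
text, an encoded needle can only match at a character boundary.

```python
def find_utf8(haystack: bytes, needle: str) -> int:
    """Return the byte offset of needle in haystack, or -1 if absent.

    The haystack is never decoded, so malformed sequences elsewhere
    in the data cost nothing and cannot abort the search.
    """
    return haystack.find(needle.encode("utf-8"))


# Hypothetical page: junk bytes at the front, a truncated
# multi-byte sequence at the end, valid UTF-8 text in between.
page = b"\xff\xfe broken header " + "na\u00efve caf\u00e9".encode("utf-8") + b" \xc3"

offset = find_utf8(page, "caf\u00e9")   # succeeds despite the bad bytes
missing = find_utf8(page, "th\u00e9")   # -1: not present
```

Calling page.decode("utf-8") on the same data raises
UnicodeDecodeError, which is exactly the extra pass (and the failure)
the search above avoids.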

Even if you do this incrementally, say using a buffer
and a stream concept, so that you get good locality
and avoid cache spills, you probably still won't
do it: the extra overhead is still critical, and
correctness is not merely unimportant, it's the other
way around. If the web page has a bad encoding, well,
too bad: you're trying to index someone else's page,
and you need to proceed even if the page is buggy.

And yes I do know this is an extreme example.

I have a related real-life case that is not.

I have actual Python code that tries to do I18n stuff.
It includes some string literals with embedded hex codes.
My code actually does stuff like charset conversions,
and I have tests for stuff like Latin-1 to UTF-8 etc.
All using Python strings.

Only Python changed the rules on me and banned certain
data in strings .. and in later versions actually enforced it.

This didn't just break my code .. it broke Debian,
which stupidly recompiles the whole of site-packages
when it installs Python .. including my code, thus
breaking Python (due to a bug in Debian packaging).

All this because of an illegal char in a string
that is never used.

I believe you must consider this: when you're transcoding,
you need options for handling errors. In 90% of real-life
cases you do NOT want to abort or throw an exception; you
substitute a replacement and keep going. This makes
transcoding routines hard to write, because there are
many ways to deal with errors and it's hard to provide
them all.
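
As a rough illustration of what "options for handling errors" can
look like, here is a sketch using the errors argument that Python 3's
bytes.decode and str.encode accept. The transcode helper and the
sample bytes are hypothetical; only the "strict", "replace" and
"ignore" policy names come from the standard codecs machinery. The
same policy is applied to both the decode and the encode step, which
is a simplification: a real routine might want them to differ.

```python
def transcode(data: bytes, src: str, dst: str, errors: str = "replace") -> bytes:
    """Convert data from encoding src to encoding dst.

    errors selects the policy for bad input and for characters the
    target encoding cannot represent:
      "strict"  - raise an exception (abort)
      "replace" - substitute a marker and keep going
      "ignore"  - silently drop the offending data
    """
    text = data.decode(src, errors=errors)
    return text.encode(dst, errors=errors)


# Hypothetical input: valid UTF-8 with one stray 0xFF byte in it.
bad = b"caf\xc3\xa9 \xff ok"

replaced = transcode(bad, "utf-8", "latin-1", "replace")  # b"caf\xe9 ? ok"
dropped = transcode(bad, "utf-8", "latin-1", "ignore")    # b"caf\xe9  ok"

try:
    transcode(bad, "utf-8", "latin-1", "strict")
    aborted = False
except UnicodeDecodeError:
    aborted = True  # the strict policy gives up at the first bad byte
```

Even this small helper shows the problem: three policies are built
in, and anything else (say, treating stray bytes as Latin-1) needs a
custom handler, which Python lets you add with codecs.register_error.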

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net