TextIO.scanStream problems

Stephen Weeks sweeks@intertrust.com
Fri, 29 Sep 2000 09:39:04 -0700 (PDT)


I believe I completely understand what's going on with the huge memory usage of
TextIO.scanStream.  There are two problems, one due to the specification of
scanStream, the other due to MLton's horrible implementation (as of 20000921).

The specification (and MLton implementation) of scanStream is:

      fun scanStream f ins =
         case f StreamIO.input1 (getInstream ins) of
            NONE => NONE
          | SOME (v, s) => (setInstream (ins, s); SOME v)

The unfortunate result of the spec is that ins, which points at the front of the
instream, is alive throughout the entire scanning operation (the call to f).
Thus, without some pretty magical compiler analysis (not present in MLton), the
entire instream must be kept alive until the setInstream.  So, when scanning in
a file, you will need at least space to store the characters in the file.

If that's not bad enough (I think it's so bad I'm gonna send mail to
sml-implementers), MLton has a *huge* per-character overhead.  Henry, you were
being extremely generous when you guessed 12 bytes per character.  The correct
number, according to my calculations, is 9 words, or 36 bytes.  Coincidentally,
this comes out to about 500M for a 13.9M file.  Thus you won't even be able to
read it on a 1G machine.
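
For reference, the arithmetic behind the 500M figure can be written out as
below.  This is only a worked check, assuming 4-byte words (which is what makes
9 words equal 36 bytes) and taking "M" to mean 10^6 bytes; the names are
illustrative.

```sml
(* Assumption: 32-bit words, so one word is 4 bytes. *)
val bytesPerWord = 4
(* Per-character overhead quoted in the text. *)
val wordsPerChar = 9
val bytesPerChar = wordsPerChar * bytesPerWord   (* 36 bytes *)
(* Assumption: the 13.9M file is 13.9 * 10^6 characters. *)
val fileChars = 13900000
(* Space needed if every character stays live during the scan. *)
val totalBytes = fileChars * bytesPerChar        (* 500400000, about 500M *)
```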

I think I can get the overhead to an acceptable level by using lists of arrays
of chars.  This should make it so that when there is a pointer to the head of
the stream and a pointer to the end, the overhead is under 2 bytes per
character.  I'm looking into it.
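
A rough sketch of what a "list of arrays of chars" representation could look
like follows.  It is not MLton's code: the names pos, chunksOf, and input1 are
hypothetical, strings stand in for the char arrays, and the whole input is
assumed to be in memory already rather than read lazily.  The point is that the
data retained between the head and the end of the stream is one byte per
character plus a small per-chunk cost, not several words per character.

```sml
(* A stream position: an offset into the first chunk of a list of chunks.
   Hypothetical representation, for illustration only. *)
type pos = {off : int, chunks : string list}

(* Split a string into chunks of at most n characters (n > 0 assumed). *)
fun chunksOf (n : int) (s : string) : string list =
   let
      val len = String.size s
      fun loop i =
         if i >= len
            then []
         else
            let
               val k = Int.min (n, len - i)
            in
               String.substring (s, i, k) :: loop (i + k)
            end
   in
      loop 0
   end

(* A reader in the StringCvt sense: return the next char and the new
   position, or NONE at end of stream. *)
fun input1 ({off, chunks} : pos) : (char * pos) option =
   case chunks of
      [] => NONE
    | c :: rest =>
         if off < String.size c
            then SOME (String.sub (c, off), {off = off + 1, chunks = chunks})
         else input1 {off = 0, chunks = rest}
```

Since input1 has the shape of a StringCvt.reader, it can be handed to the usual
scan functions, for example
Int.scan StringCvt.DEC input1 {off = 0, chunks = chunksOf 4 "12345 rest"}.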

But I emphasize once again that SML's streams are insane and should not be used.