fast I/O

Henry Cejtin henry@sourcelight.com
Tue, 19 Sep 2000 01:23:01 -0500


I  have  thought some more about the general problem of mutable vs. immutable
I/O stuff, and here is the problem as I  see  it:  First  note  that  reading
almost  anything  requires  at  least  one character of look ahead.  Consider
reading an integer: you don't know it is done until you  look  one  character
too far.

This fact does not, of course, require immutable objects marking positions in
I/O streams of course.  It doesn't technically require any  support  at  all.
Unfortunately,  without  any support from the underlying I/O system, you will
be forced to hold the looked ahead character in a variable  and  to  pass  it
around.  What's worse, some times you won't have it, so you will have to read
it just to pass in.

Rejecting that approach, we have to add some support in the I/O system.  What
Pascal  did,  which was quite elegant, but kind of a pain to use, was to have
the notion of 1 character look ahead built in by providing the equivalent  of
a  `peek'  function.   This returned the next character, but did NOT advance.
The reason why this is nice is because all combinations of  calls  are  legal
(see  the  C  solution  for a system where this is not the case), but it is a
pain to use.  Consider using this the code to read  in  a  (weakly)  positive
integer (no white space skipping):

    fun getuint (stream): int =
           let fun loop ac =
                      let val ch = peek stream
                      in if isDigit ch
                            then (advance stream;
                                  loop (10*ac + ord ch - ord #"0"))
                            else ac
               val ch = peek stream
           in if isDigit ch
                 then loop (ord ch - ord #"0")
                 else raise ???
           end

Note  the  need  to  advance  the stream.  2 calls per character, and even if
advance returns the next character, it is a pain to use.

The approach used in C is to provide ungetc instead of peek.  This  is  quite
ugly  because:  it  isn't  legal  to  call  unget  twice  in a row without an
intervening getc call, and also what happens if  you  unget  something  other
than  the last character read?  What does ungetting EOF mean?  Ignoring that,
it is easy to use and a bit more compact than peek.

You could, of course, generalize unget  so  that  you  could  unget  as  many
characters in a row as you wanted (and then you would get them in the reverse
order).  This is all quite consistent and nice, but note it  isn't  quite  as
modular  as the functional approach.  The reason is that anything that does a
read which may have to be pushed back has to know that so that it  will  save
the  characters.  With the functional approach, no one has to know except for
the routine that is going to decide to go back.  The down  side  of  this  is
that only the GC will know that some save set of characters can be gotten rid
of now.  The code generator is almost always going to have  to  generate  the
saved  state  even  though  the code typically only wants 1 character of look
ahead.

I think that my conclusion is that the  arbitrary-unget  is  the  way  to  go
because  per-character overhead is just too high with anything else.  I don't
see any way to make this work with the `reader' concept  though.   Wait,  you
could  do this by having the `state' be the last character looked at (maybe).

Any way, I think that the mod you sent me, although correct, is going  to  be
either  too  slow,  or  else  is  going  to  force me to carry around the one
character of look ahead in my code.  Both of  these  alternatives  seem  very
very bad.