serialization

Stephen Weeks sweeks@wasabi.epr.com
Mon, 2 Aug 1999 01:30:10 -0700 (PDT)


This weekend, I implemented (de)serialization in MLton.  Externally,
what's available is
      val serialize: 'a -> Word8Vector.vector
      val deserialize: Word8Vector.vector -> 'a
The implementation is about 250 lines in src/runtime/gc.c.

Right now there are two problems:
	- It doesn't work if 'a is an arrow type.
	- It isn't safe, in that if you feed deserialize a bogus
		vector, unpredictable things may happen.

As to the arrow type problem, I propose to change the flow analysis as 
follows:
	- have one set of lambdas for each arrow type that is serialized
		(recall that the flow analysis runs on SXML, so this
		is a known finite number of sets)
	- the result of deserialization to type t is the set for t
	- insert a coercion at calls to serialize from the argument 
		set to the serialize set for that type

As to the safety problem, there are several possible solutions I have
thought of, none of which I am entirely happy with.
	1. Build a predicate for each type that checks if a 
		Word8Vector.vector is a valid serialization of some
		object of that type.  Deserialize calls the predicate
		before running. 
	2. Statically choose a random number r_t (say 128 bits) for
		each type t.  Prefix every serialized object of type t
		with r_t.  Deserialize checks the prefix before running.
	3. Dynamically create serializer/deserializer pairs of
		functions with the random approach of (2).
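
To make option (2) concrete, here is a minimal sketch, written in Python rather than SML purely for illustration.  Everything in it is an assumption: the names type_tag, tagged_serialize and tagged_deserialize are hypothetical, the tag is derived from a hash of the type's name as a stand-in for the statically chosen r_t, and JSON stands in for the real heap encoding.

```python
import hashlib
import json

TAG_BYTES = 16  # 128 bits, the size suggested in option (2)


def type_tag(type_name: str) -> bytes:
    """Return a fixed 128-bit tag for a type.

    Assumption: the tag is a truncated hash of the type's name. The
    proposal only says "random number chosen statically", so any fixed
    per-type value would serve the same purpose.
    """
    return hashlib.sha256(type_name.encode("utf-8")).digest()[:TAG_BYTES]


def tagged_serialize(type_name: str, value) -> bytes:
    """Prefix the encoded value with the tag for its type."""
    payload = json.dumps(value).encode("utf-8")  # stand-in encoding
    return type_tag(type_name) + payload


def tagged_deserialize(type_name: str, data: bytes):
    """Check the prefix against the expected type's tag before decoding."""
    if data[:TAG_BYTES] != type_tag(type_name):
        raise ValueError("prefix does not match the tag for " + type_name)
    return json.loads(data[TAG_BYTES:].decode("utf-8"))
```

A vector serialized at one type is rejected when deserialized at another, but the check only looks at the prefix: anyone who copies a valid prefix onto an arbitrary payload gets past it.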

Here are some of the tradeoffs.

(1) is completely safe, but doesn't seem very easy to implement.  The
	predicates would have to be constructed (automatically) per
	program.  I don't think they can be written in Cps, so this
	would have to be done at the Machine level.  I am pretty sure
	the necessary type information is reasonably accessible to
	the backend.
(2) is reasonably safe (i.e. a bogus vector passes the prefix check
	only with very low probability).  It is, however, not safe
	with respect to malicious users who purposely feed bad
	Word8Vectors.
(3) can be completely implemented in SML, given the primitives defined 
	above.  However, it has the extreme disadvantage that two
	MLton processes started separately, even from the same
	executable, cannot communicate, since the random numbers are
	chosen dynamically.  This would seem to defeat one of the
	major uses of serialization.
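
The drawback of (3) can be seen in a small model, again in Python only for illustration.  The name make_pair is hypothetical, raw bytes stand in for the serialized object, and each call to make_pair models one process choosing its random number at startup.

```python
import os

KEY_BYTES = 16  # 128 bits, as in option (2)


def make_pair():
    """Create a serializer/deserializer pair sharing a fresh random key.

    Assumption: the key is drawn when the pair is created, which models
    the "dynamically chosen" random number of option (3).
    """
    key = os.urandom(KEY_BYTES)

    def ser(payload: bytes) -> bytes:
        # Prefix the payload with this pair's private key.
        return key + payload

    def de(data: bytes) -> bytes:
        # Accept only vectors produced by the matching serializer.
        if data[:KEY_BYTES] != key:
            raise ValueError("vector was not produced by the matching serializer")
        return data[KEY_BYTES:]

    return ser, de
```

A pair round-trips its own output, but the deserializer from a second call to make_pair rejects vectors from the first (barring a 128-bit collision), which is the situation of two separately started processes.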

Any ideas?

BTW, along the way, I also fixed MLton.size so that it runs in time
proportional to the number of pointers in the object instead of having 
to do a full GC.
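
For readers wondering what a size computation that follows pointers looks like, here is a rough sketch.  It is not the code in the runtime: the heap is modelled as a Python dictionary, and reachable_size and the object layout are assumptions made for illustration.

```python
def reachable_size(heap, root):
    """Sum the sizes of all objects reachable from root.

    heap maps an object id to (size_in_bytes, list_of_pointed_to_ids).
    Each object is visited once and each pointer is examined once, so
    the work is proportional to the pointers reachable from root, not
    to the size of the whole heap.
    """
    seen = {root}
    stack = [root]
    total = 0
    while stack:
        obj = stack.pop()
        size, pointers = heap[obj]
        total += size
        for p in pointers:
            if p not in seen:  # handles sharing and cycles
                seen.add(p)
                stack.append(p)
    return total
```

Objects that are not reachable from the root are never touched, which is what avoids the cost of a full GC in this model.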