[MLton] Re: [MLton-commit] r6352

Thu Jan 24 19:27:57 PST 2008

On Thu, 24 Jan 2008, Vesa Karvonen wrote:
> On Jan 24, 2008 10:33 PM, Matthew Fluet <fluet at mlton.org> wrote:
>> Added a primitive (structural) polymorphic hash.
>> On objects with identity (array,ref,etc.), it returns a constant.
>> (This matches the behavior of MoscowML's Polyhash.hash function.)
>
> Interesting!  This touches something that I've been thinking about
> recently.  I think that SML's arrays and refs (anything with identity)
> should include a hash value.  Without it (or something else that
> allows efficient discrimination), there just isn't any way to create
> an efficient mapping with objects with identity as the domain.

Agreed.  But, there are tricky bits.

> I've
> been thinking about redefining all array types in my extended basis
> library to include a generated hash value with the array.  With arrays
> it would probably be possible to do it without having to change any
> client programs (as long as all parts of the program are compiled
> against the extended basis).  Assuming that a program doesn't call the
> hash function on arrays, a compiler like MLton could probably drop the
> hash field from the wrapped array objects, so there would likely be no
> performance penalty for programs don't use the feature (and the
> penalty would likely be fairly small anyway).

MLton could probably drop the hash field (there are specific passes for 
dropping unused components of tuples and constructors).  Since the 
generated hash value would probably be derived from a stateful value 
(e.g., a global counter), MLton won't eliminate the updates to the state.

> Unfortunately, refs
> can't be redefined in a library, so a language/basis library solution
> would be better.

The only aspects of refs that can't be redefined is the ref constructor 
for use in patterns.  It is a fairly little used feature (and, arguably, a 
misfeature), so I suspect few clients would notice the difference.

>> We could do better on objects with identity (e.g., gensym a hash value
>> at birth points and flow to the hashing function), but this suffices
>> for my immediate purposes.
> [...]
>
> I'd love to have that feature in SML (generated hash value for objects
> with identity).  I find it both theoretically and practically
> important.  MLton should show the way here! :-)

Unfortunately, the only really robust way I can think of to give objects 
with identity a hash value is to grow the object header and include the 
hash value there.  Using program analysis to add hash values works for a 
lot of code, but not quite all.  The obvious thing to do is to determine 
the ref/array variables in the program that can flow to the hash function. 
Then, simply pair each of those variables with their hash values.  At 
Ref_ref and Array_array (i.e,. the construction points), we generate a 
hash value.  This can work out, even with nested types.  For example, we 
would translate
   int array ref
to
   (int array * word32) ref * word32
Where the inner word32 is the hash of the array and the outer word32 is 
the hash of the ref.  Every update of the ref with a new array would also 
update with the new array's hash.

The problem is with FFI.  We allow ref/array types in the FFI, which 
means, it is possible to have the following:

  val f = _import "foo": int array ref * int array ref -> int array ref;

Now, suppose I do:

  val r = f (r1, r2)
  val w1 = MLton.hash r
  val w2 = MLton.hash (!r)

We can't change the FFI type: both, because we don't allow tuple types in 
the FFI and because even if we did, the C code won't be expecting tuple 
types.  We could almost get by with a slightly different translation: we 
would translate
   int array ref
to
   int array ref * word32 ref * word32
Now we carry the array's hash in a separate ref that is updated whenever 
the first is updated.  This ensures that the "int array ref" remains and 
could be passed through the FFI (by simply dropping the hash components). 
But, note that the "foo" function returns an "int array ref" -- under the 
translation, this would need to have a matching "word32 ref" and "word32" 
components, because the returned value is hashed.  But, there's no way to 
conjur up the the righ hash values -- in particular, "foo" is probably 
returning one of the two input refs, but we can't know which one and it 
would be incorrect to generate a new hash value at this introduction 
point, because the ref already exists with a hash value.

Anyways, there is a slight variation on this, where do the flow analysis 
to determine which ref/arrays are hashed and which are used in FFI. 
ref/arrays that are hashed but not used in FFI could be handled as 
described above, by generating and pairing then with a hash value. 
ref/arrays that are hashed and used in FFI lose -- they get treated as 
they are now, they all get a constant hash value.  I suspect that this 
would work well for many applications.