[MLton] hash-set.sml low level points

Stephen Weeks MLton@mlton.org
Tue, 13 Jul 2004 17:30:26 -0700


> I  was  looking  at  hash-set.sml  in mlton/lib/mlton/basic and noticed a few
> strange things:
> 
> In newOfSize, it isn't correct to have the multiplication by 4 done on  words
> because  then  an overflow will not be detected.  

True.

> I would think that the line should change from
>
>     { hash = hash,
>       numBuckets = 0w4 * Word.roundUpToPowerOfTwo (Word.fromInt size) }
> 
> to
> 
>     { hash = hash,
>       numBuckets = Word.roundUpToPowerOfTwo (Word.fromInt (4 * size)) }

This doesn't quite fix it, because Word.roundUpToPowerOfTwo may miss
an overflow as well.  I fixed things by making all the computations on
integers.

> In fromList, the function body passed in the second arg to List.foreach is of
> the form
> 
>     (ignore (...); ())
> 
> Clearly  this  is  redundant.   (I  would say just use ignore to get the unit
> value.)

Yep.  Fixed.

> For the stats', if you want the avg  to  be  the  expected  number  of  times
> equality  is  checked  on  lookups,  then  that  is not the average size of a
> bucket, but, assuming that all keys are equally likely, the avergae size^2/2.
> Isn't this what people really care about?

Dunno.  If I were to publicize this interface, I would make sure
to expose enough so that people could compute whatever stats they
want.  I have no objection to you adding another field to the stats
with what you want.

> Finally,  I'm curious that you make the table size so big (4 times the number
> of elements.  

No good reason.

> Obviously the optimal value all depends on how expensive testing for
> key equality is, but I would guess that it is almost always very
> quick.  Even in cases where general equality can be expensive, it
> tends not to be in the hash table because you only compare it
> against other items with the same hash (mod the table size), so
> often the test is very fast.  E.g., for strings, the sizes or first
> few characters are often different.

Agreed.

> I wonder what happens if you replace the 4 by 1 (which is what I
> typically use) in a self compile.

I await the results of your experiment.

> It is a bit also a bit strange that tables are never shrunk,
> although it all depends on how much you remove elements.

Yeah.  HashSet.remove is used only once in MLton, so I've never been
motivated to fix it.  I would be happy to see it fixed.