[MLton] floating points in Mlton

Matthew Fluet fluet@cs.cornell.edu
Thu, 25 Mar 2004 10:17:30 -0500 (EST)


> I have a question about mlton and its handling of floating points. Should
> there be a performance difference between a call that pushes a tuple of
> floating point numbers onto the stack and a call that pushes each float
> individually onto the stack?

There will certainly be a performance difference in the time to set up
the call.

> The first scheme will simply pass an address
> to where the tuple is located on the heap, the latter will pass multiple
> addresses where each float is located on the heap (they are boxed,
> correct?).

No primitive types ({int,word}{8,16,32,64},real{32,64}) are boxed.

Passing a tuple (of anything) means passing a single 32-bit pointer, while
passing three doubles means passing three 64-bit values.  Furthermore, on
the x86, manipulating floating-point values requires bouncing through the
floating-point register stack, which is usually slower than using the
general-purpose registers.
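
To put rough numbers on that (assuming 32-bit pointers and 64-bit reals,
as above), for a single tuple of three reals:

  boxed tuple argument:  1 pointer        =  4 bytes pushed per call
  flattened arguments:   3 real64 values  = 24 bytes pushed per call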

> Thus, it seems like we have the exact same number of memory
> indirections we need to follow to get to the floats.

Since real64 values aren't boxed, we don't need to follow a pointer to get
them.  When the three real64 values are passed, we can read them directly
from the stack.  When the tuple is passed, we read the pointer directly
from the stack and then indirect through it to reach the real64 values.

If the function you are calling reads each of the floating-point values
and doesn't call any other function that needs them, then the overhead of
each approach should be about the same.  But the flattened version will
start to do much worse if the floating-point values are just being passed
through this function to another.
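
For example, here is a hypothetical pass-through function (norm and h are
made-up names, not from your benchmarks).  If h's tuple argument is
flattened, every call to h copies three real64 values onto the stack and
then copies them again for the call to norm; unflattened, only a single
pointer is copied at each step:

fun norm (x : real, y : real, z : real) = Math.sqrt (x * x + y * y + z * z)
fun h (flag : bool, v : real * real * real) =
  if flag then norm v else 0.0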

> As a sanity check, I compared my flattener to mlton without flatten and
> local flatten (i.e., no flattening, since we should always be better than
> this basic case).

I don't think that is true.  I'm sure that there are cases where no
flattening is better than all flattening.  Imagine the following
pathological case:

fun f (n : int, x : (int * int * int), y : (int * int * int)) =
  if n = 0
     then (#1 x) + (#2 x) + (#3 x)
     else f (n - 1, y, x)
val x = f (100, (1, 2, 3), (4, 5, 6))

The all-flatten case will need to move 3x the data around on every
recursive call: six ints are copied instead of two pointers.
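
For comparison, a sketch (not MLton's actual intermediate code; f' is a
hypothetical name) of what f looks like after full flattening, where every
recursive call shuffles six ints instead of swapping two pointers:

fun f' (n, x1, x2, x3, y1, y2, y3) =
  if n = 0
     then x1 + x2 + x3
     else f' (n - 1, y1, y2, y3, x1, x2, x3)
val _ = f' (100, 1, 2, 3, 4, 5, 6)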

> What I found is that there are only 3 benchmarks that we
> do worse, and in each case we increase the number of arguments passed
> around by a significant factor. For example, in nucleic we do so by a
> factor of 10.

I'm not surprised by the big slowdown in a flattened nucleic.