[MLton] Constant Folding of FP Operations

Vesa Karvonen vesa.a.j.k at gmail.com
Mon Jun 2 07:14:57 PDT 2008


On Sun, Jun 1, 2008 at 10:29 PM, Matthew Fluet <fluet at tti-c.org> wrote:
> On Fri, 23 May 2008, Vesa Karvonen wrote:
[...]
> Seems like a good optimization, and a sound one.  Did you run the
> benchmarks and observe any speedups?

No, I don't think I ran the benchmarks with this optimization, but I see
that you did.  However, I did try the optimization with a slightly
modified version (note the parentheses in "store + (...)"):

  fun eval_real_poly n = let
     val msg = n div 10
     val x1 = 4.0
     val x2 = x1
     val x3 = x1
     val x4 = x1
     val x5 = x1
     val x6 = x1
     fun eval (0,store) = store
       | eval (n,store) =
         let in
            if n mod msg = 0 then print ("iter: " ^ Int.toString n ^
                                         "\n") else ();
            eval (n-1,
                  store + ((~x1)*x4 + x2*x5 +(~x3)*x5 +(~x2)*x6 + x3*x6 +
                           x4*(x2 + x3 + x5 + x6) + x4*(~x1)+x4*(~x4)))
         end
  in
     eval (n,0.0)
  end

  val _ = print (Real.toString (eval_real_poly 500000000) ^ "\n")

of Sean McLaughlin's example:

  http://mlton.org/pipermail/mlton-user/2007-December/001294.html

After changing the SML expression to match the C expression (as shown
above), MLton performed constant folding on the right-hand side
expression, and performance was much closer to that of C.
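
For instance, with x1 through x6 all bound to 4.0 as above, the whole
parenthesized subexpression can be folded to a single constant:

  (~16.0) + 16.0 + (~16.0) + (~16.0) + 16.0 + 4.0*16.0 + (~16.0) + (~16.0)
  = 16.0

so each iteration reduces to just store + 16.0.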

> On amd64-linux, I get the following:
>
> MLton0 -- ~/devel/mlton/mlton.svn.trunk/build/bin/mlton -codegen amd64
> MLton1 -- ~/devel/mlton/mlton.svn.trunk/build/bin/mlton -codegen c
> MLton2 -- ~/devel/mlton/mlton.svn.trunk/build.real-cf/bin/mlton -codegen amd64
> MLton3 -- ~/devel/mlton/mlton.svn.trunk/build.real-cf/bin/mlton -codegen c
> run time ratio
> benchmark         MLton0 MLton1 MLton2 MLton3
> barnes-hut          1.00   1.04   0.84   0.88
[...]
> ray                 1.00   1.06   0.94   1.05
[...]
>
> This seems to only show a non-noise speedup on barnes-hut

That would match my expectation that this optimization applies only
rarely.  I'm actually surprised to see such a large improvement on
barnes-hut.

> (and maybe ray).

Probably.  As we know, gcc already performs constant folding of FP
operations, so I would not be surprised if this optimization does not
improve performance with the C backend (as seems to be the case with
ray).

> I'd really like to know a good way of cutting down the noise in the
> benchmarks; consider that fib and tailfib (which use no FP operations,
> and so yield identical assembly code) show a 1.04 and 1.06 slowdown,
> respectively.

How are the times measured?  Looking very briefly at benchmark/main.sml
(line 79), it would seem to use an average of times over a 60 second
sampling period.  Using the average probably does not give the best
results.  For a deterministic benchmark, it would be best to run the
benchmark multiple times and take only the lowest time.  Any time above
the lowest possible time for a deterministic benchmark should be noise
from the system (switches to other processes trashing caches, etc.), so
by taking an average you are actually letting the noise enter the
measurement.  (This was mentioned in the paper on dominators

  http://www.cs.rice.edu/~keith/EMBED/dom.pdf

based on which I implemented the new dominators algorithm for MLton a
while back.)
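
For what it's worth, here is a minimal sketch of that min-over-runs idea
(this is not the code in benchmark/main.sml; Timer and Time are just the
Basis Library structures, and minTime is a made-up name):

  (* Assumes runs >= 1. *)
  fun minTime (runs : int) (thunk : unit -> 'a) : Time.time =
     let
        (* Time a single run of the thunk with a wall-clock timer. *)
        fun once () =
           let
              val timer = Timer.startRealTimer ()
              val _ = thunk ()
           in
              Timer.checkRealTimer timer
           end
        (* Keep only the lowest time; anything above it is noise. *)
        fun loop (0, best) = best
          | loop (n, best) =
            let val t = once ()
            in loop (n - 1, if Time.< (t, best) then t else best)
            end
     in
        loop (runs - 1, once ())
     end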

> The C-- specification of floating-point expressions makes the rounding mode
> an explicit argument; e.g.: [...]

That sounds reasonable for an intermediate (or target) language.

>> About the patch.  The compiler changes do not compile with SML/NJ,
>> which does not support 32-bit floating point (Real32), [...]
>
> You might consider dynamically checking the value of Real32.precision,
> which would allow you to distinguish a "real" Real32 from a "dummy"
> Real32 that is bound to Real64.

Hmm... That approach would probably make it reasonably easy to get the
code to compile with SML/NJ.  The approaches I had in mind were somewhat
more complicated.  I'll have to try this.
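
Something along these lines, I suppose (haveRealReal32 and foldReal32 are
just names made up for illustration):

  (* If Real32 is merely a "dummy" bound to Real64, its precision is
     53 rather than 24, and Real32 constant folding can be skipped. *)
  val haveRealReal32 = Real32.precision = 24

  fun foldReal32 (f : Real32.real * Real32.real -> Real32.real)
                 (args : Real32.real * Real32.real) =
     if haveRealReal32 then SOME (f args) else NONE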

> There also seems to be a little-endian assumption for constant-folding
> castToWord and castFromWord.  [...]

Not exactly.  The assumption is that the endianness of reals and words
agree in the sense that if you convert a real/word to little-endian bytes
and then convert the little-endian bytes to a word/real, you'll get the
desired cast.  The operations are (in pseudo code):

  castFromWord = RealFromLittle o WordToLittle
  castToWord = WordFromLittle o RealToLittle

That should be identical to:

  castFromWord = RealFromBig o WordToBig
  castToWord = WordFromBig o RealToBig

(I think that the Pack structures are also missing from SML/NJ, so they
have to be faked as well.)
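
In terms of the Basis Pack structures, the two casts could look roughly
like this (assuming PackReal64Little and PackWord64Little, or faked
equivalents, are available; this is a sketch, not the patch's actual
code):

  fun castToWord (r : Real64.real) : Word64.word =
     let
        (* Serialize the real to little-endian bytes, then read the
           bytes back as a little-endian word. *)
        val bytes = Word8Array.array (8, 0w0)
        val () = PackReal64Little.update (bytes, 0, r)
     in
        Word64.fromLarge (PackWord64Little.subArr (bytes, 0))
     end

  fun castFromWord (w : Word64.word) : Real64.real =
     let
        (* The same trip in the opposite direction. *)
        val bytes = Word8Array.array (8, 0w0)
        val () = PackWord64Little.update (bytes, 0, Word64.toLarge w)
     in
        PackReal64Little.subArr (bytes, 0)
     end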

> It shouldn't be difficult to allow Prim.apply to return more complicated
> expressions.  There aren't that many (static) uses of Prim.apply in the
> compiler; you would simply need to allow uses of Prim.apply to yield a
> sequence of IL statements, so that you could bind constants to variables
> before using them in the IL primitive.

Right.  I was thinking about returning an expression (tree), but a
sequence of expressions probably fits the ILs better.  However, a
difficulty with the sequence-of-statements approach would seem to be that
Prim.apply is polymorphic in the type of variables, so something needs to
be done to make it possible to create new variables within Prim.apply.
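
Purely as an illustration (these types and names are made up, not MLton's
actual Prim.apply interface), one option would be to have the caller pass
in a fresh-variable generator:

  datatype ('var, 'exp) statement =
     Bind of 'var * 'exp           (* bind a folded constant to a variable *)
   | Apply of string * 'var list   (* residual primitive application *)

  (* Polymorphic in 'var, yet able to introduce new variables, because
     the caller supplies newVar. *)
  fun applyPrim {newVar : unit -> 'var,
                 prim : string,
                 args : 'var list,
                 folded : 'exp option} : ('var, 'exp) statement list =
     case folded of
        SOME exp => [Bind (newVar (), exp)]
      | NONE => [Apply (prim, args)]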

-Vesa Karvonen


