[MLton] Constant Folding of FP Operations

Matthew Fluet fluet at tti-c.org
Mon Jun 2 10:09:32 PDT 2008


On Mon, 2 Jun 2008, Vesa Karvonen wrote:
> On Sun, Jun 1, 2008 at 10:29 PM, Matthew Fluet <fluet at tti-c.org> wrote:
>> On Fri, 23 May 2008, Vesa Karvonen wrote:
> [...]
>> Seems like a good optimization, and a sound one.  Did you run the
>> benchmarks and observe any speedups?
>
> No, I don't think I ran the benchmarks with this optimization, but I see
> that you did.  However, I did try the optimization with a slightly
> modified version (note the parentheses in "store + (...)") of Sean
> McLaughlin's example:
>
>  http://mlton.org/pipermail/mlton-user/2007-December/001294.html
>
> After changing the SML expression to match the C expression (as shown
> above), MLton performed constant folding on the right-hand side expression
> and performance was much closer to C.

Good to see that the optimization kicked in, even if it is an artificial 
example.

> Probably.  As we know, gcc already performs constant folding of FP, so I
> would not be surprised if the optimization did not improve performance
> with the C backend (as seems to be the case with ray).

Except that after considering Sean McLaughlin's example, we don't allow 
gcc to constant fold any floating-point operations (in the C-codegen):
   http://mlton.org/cgi-bin/viewsvn.cgi?view=rev&rev=5799
This is accomplished by making all real constants globals, accessed via 
indirect loads, so gcc can't see the constant values.

So, since the C-codegen didn't show any improvement with FP 
constant-folding on the ray benchmark, I suspect that the improvement in 
the native codegen was just system noise.

>> I'd really like to know a good way of cutting down the noise in the
>> benchmarks; consider that fib and tailfib (which use no FP operations,
>> and so yield identical assembly code) show a 1.04 and 1.06 slowdown,
>> respectively.
>
> How are the times measured?  Looking very briefly at benchmark/main.sml
> (line 79) it would seem to use an average of times over a 60 second
> sampling period.  Using the average probably does not give the best
> results.  For a deterministic benchmark it should be best to run the
> benchmark multiple times and only take the lowest time.  Any time above
> the lowest possible time for a deterministic benchmark should be noise
> from the system (switches to other processes trashing caches, etc.).  So,
> by taking an average, you are actually letting the noise enter the
> measurement.  (This was mentioned in the paper on dominators
>
>  http://www.cs.rice.edu/~keith/EMBED/dom.pdf
>
> based on which I implemented the new dominators algorithm for MLton a
> while back.)

That's an interesting view, to take the minimum time, rather than the 
average time.
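The minimum-of-runs idea is easy to sketch; here is a hypothetical Python harness (not MLton's benchmark driver) that keeps the lowest observed time:

```python
import time

def min_time(f, runs=10):
    """Time f over several runs and keep the minimum.

    For a deterministic benchmark, any time above the minimum is
    noise from the system (context switches, cache pollution, etc.),
    so the minimum estimates the true cost better than the average.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        f()
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `min_time(lambda: sum(range(10**5)))` returns the best wall-clock time over ten runs.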

>>> About the patch.  The compiler changes do not compile with SML/NJ,
>>> which does not support 32-bit floating point (Real32), [...]
>>
>> You might consider dynamically checking the value of Real32.precision,
>> which would allow you to distinguish a "real" Real32 from a "dummy"
>> Real32 that is bound to Real64.
>
> Hmm... That approach would probably make it reasonably easy to make it
> compile with SML/NJ.  The approaches I had in mind were somewhat more
> complicated.  I'll have to try this.
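The Real32.precision check can be mimicked outside SML. A hypothetical Python probe recovers the significand precision of a packed IEEE format by finding the smallest k such that 1 + 2^-k rounds back to 1:

```python
import struct

def precision_of(fmt):
    """Significand precision, in bits, of a packed IEEE float format.

    The smallest k for which 1 + 2^-k round-trips to exactly 1.0 is
    the precision: 24 for binary32 ('<f'), 53 for binary64 ('<d').
    """
    def roundtrip(x):
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    k = 1
    while roundtrip(1.0 + 2.0 ** -k) != 1.0:
        k += 1
    return k

# A genuine single-precision type reports 24; a "dummy" Real32
# that is actually bound to Real64 would report 53.
```

This is the same dynamic test: branch on the reported precision rather than assuming the structure is truly 32-bit.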

In the long term, it would be good to avoid reliance on Real32 for any 
purposes other than optimization.  For example, we could make
   datatype RealX.t = T of {size: RealSize.t,
                            value: IEEEReal.decimal_approx}
which mimics the treatment of WordX.t.

We currently require Real32.{fromString,isFinite} for elaborating real 
constants, Real32.{==,signBit} for equality on real constants, and 
Real32.format for emitting real constants.  If Real32 is bound to Real64, 
then we might accept Real32.real floating-point constants that are finite 
in Real64 but would not be finite in Real32.[*] If Real32 is bound to 
Real64, then we might judge two floating-point constants to be unequal 
when they are equal at Real32.real; this is benign, since we only use 
equality of floating-point constants to combine equal constants, so a 
conservative equality is fine.  If Real32 is bound to Real64, then we 
might emit more digits than necessary to represent the value; again, this 
is benign.

[*] We currently use Real<N>.isFinite to give a compile-time error for 
floating-point constants that do not denote finite values.  The argument 
for this is that The Definition states:
   "it is a compile-time error if the constant does not make sense or does
   not denote a value within the machine representation chosen for the
   type."
Of course, one might argue that "1.0e123" denotes a specific 32-bit IEEE 
754 value (namely, +inf) just as much as "1.0e~1" does -- for neither 
does the mathematical value of the IEEE value equal the mathematical 
value of the string.
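The finiteness point can be checked concretely; a Python sketch, using the fact that struct's "f" format is IEEE binary32:

```python
import math
import struct

def finite_as_float32(x):
    """True if the finite double x also has a finite binary32 value."""
    try:
        (y,) = struct.unpack("<f", struct.pack("<f", x))
    except OverflowError:
        # CPython refuses to pack a double that overflows binary32
        return False
    return math.isfinite(y)

assert finite_as_float32(1.0e-1)       # finite at both precisions
assert not finite_as_float32(1.0e123)  # finite in binary64, not in binary32
```

So an isFinite check performed at 64-bit precision would accept "1.0e123" even though the constant has no finite 32-bit value.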

>> There also seems to be a little-endian assumption for constant-folding
>> castToWord and castFromWord.  [...]
>
> Not exactly.  The assumption is that the endianness of reals and words
> agree in the sense that if you convert a real/word to little endian bytes
> and then convert the little endian bytes to a word/real, you'll get the
> desired cast.  The operations are (in pseudo code):
>
>  castFromWord = RealFromLittle o WordToLittle
>  castToWord = WordFromLittle o RealToLittle
>
> That should be identical to:
>
>  castFromWord = RealFromBig o WordToBig
>  castToWord = WordFromBig o RealToBig

Ah, I see.
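The claim that the little-endian and big-endian compositions agree can be verified directly; a Python sketch of the 64-bit case:

```python
import struct

def cast_to_word_le(r):
    # castToWord = WordFromLittle o RealToLittle
    return int.from_bytes(struct.pack("<d", r), "little")

def cast_to_word_be(r):
    # castToWord = WordFromBig o RealToBig
    return int.from_bytes(struct.pack(">d", r), "big")

def cast_from_word_le(w):
    # castFromWord = RealFromLittle o WordToLittle
    return struct.unpack("<d", w.to_bytes(8, "little"))[0]

# Both compositions denote the same bit-cast, regardless of the
# byte order used in the intermediate step:
assert cast_to_word_le(1.0) == cast_to_word_be(1.0) == 0x3FF0000000000000
assert cast_from_word_le(cast_to_word_le(2.5)) == 2.5
```

The intermediate byte order cancels out as long as the real-to-bytes and bytes-to-word steps use the same order, which is the sense in which the constant folder's assumption is endianness-neutral.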


