overflow checking benchmarks

Matthew Fluet fluet@CS.Cornell.EDU
Fri, 3 Nov 2000 14:20:23 -0500 (EST)


> Anything else?  Would it be easy to run the benchmarks, once with jo's and once
> without, everything else being equal?

Here are the results from running the benchmarks under six different
configurations.  Essentially, I added two flags to the x86 backend:
-native-force-check and -native-emit-jo.

-native-force-check is a bool option.  When NONE, it translates Int_{} and
Int_{}Check as normal.  When SOME true, it translates both Int_{} and
Int_{}Check as the normal Int_{}Check.  When SOME false, it translates both
Int_{} and Int_{}Check as the normal Int_{}.  So NONE is the default
behavior, and SOME b either forces everything to be checked or forces
nothing to be checked.

-native-emit-jo is a bool.  When true, it emits jo/jno instructions as
normal.  When false, it doesn't emit any jo/jno instructions, but
everything else is the same.
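
For concreteness, here is roughly the shape of the code in question (a
hand-written sketch in gas syntax, not actual backend output; the
registers and the handler label are made up):

        addl    %ebx, %eax          # the (possibly overflowing) add
        jo      L_overflowHandler   # -native-emit-jo true: jump to the
                                    # overflow handler if OF is set

With -native-emit-jo false, only the addl is emitted; everything else,
including the breaking up of blocks around the check, stays the same.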

Along with -DMLton_detectOverflow={0,1}, that gives a total of 12
different combinations.  For convenience, I'm going to write 1TF to mean the
following set of flags:
-DMLton_detectOverflow=1  -native-force-check SOME true  -native-emit-jo false
and similarly for other combinations.

It turns out that there are only 6 "equivalence classes" of flags:
-DMLton_detectOverflow either makes everything Int_{} or everything
Int_{}Check in the middle-end, so -native-force-check NONE coincides with
one of the SOME cases, and -native-emit-jo doesn't make a difference when
nothing is translated as Int_{}Check.

Here are the groups:

0FF,0FT,0NF,0NT
All of these correspond to the default MLton behavior with no overflow
checking.

0TF
This corresponds to adding overflow checking only in the backend, but not
emitting jo instructions.

0TT
This corresponds to adding overflow checking only in the backend, and
emitting jo instructions.
Comparing this with 0TF should measure branch prediction penalties.

1FF,1FT
These two correspond to adding overflow checking in the middle-end, but
not translating to overflow checking in the backend.
Comparing this with 0N_ should measure missed CPS simplifications.

1TF,1NF
These correspond to adding overflow checking in the middle-end and
translating to overflow checking in the backend, but not emitting jo
instructions.

1TT,1NT
These correspond to the default MLton behavior with overflow checking.
Comparing this with 1TF should measure branch prediction penalties.


One caveat on 0TT: because there is no overflow checking in the
middle-end, there are no handler labels to set as the target of the jo
instruction.  So, I just picked a random static label corresponding to one
of my temporary variables.  That label ends up in the BSS section, which
sits above the text section, so all of those jo instructions should look
like forward conditional jumps.  If I remember correctly, x86 static
branch prediction will predict a backward conditional jump as taken, but a
forward conditional jump as not taken.  In any event, all of the jo
directions are the same in 0TT; for 1TT, there is probably a mix of
forward and backward jumps, depending on where the handler blocks are
emitted relative to the jo instructions.
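
To make the direction issue concrete, here is a sketch (again hand-written,
with made-up labels, and assuming the backward-taken/forward-not-taken rule
above):

        # 0TT: the jo target is a label in the BSS section, above the
        # text section, so every jo is a forward jump.  Statically it
        # should be predicted not taken, which also happens to be the
        # right prediction, since the jo is essentially never taken.
                addl    %ebx, %eax
                jo      someBssLabel        # forward: predicted not taken

        # 1TT: if the handler block happens to be emitted before the
        # add, the jo is a backward jump.  Statically it should be
        # predicted taken, which is the wrong prediction on essentially
        # every execution.
        L_handler:
                # ... code that raises Overflow ...
                addl    %ebx, %eax
                jo      L_handler           # backward: predicted taken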

Here are the numbers (running times in seconds on a 266MHz PII):

                    0F_     0TF     0TT     1F_     1TF     1TT
----------------------------------------------------------------
barnes-hut         15.27   15.13   15.10   16.12   15.91   15.91
checksum           10.35   11.39   12.74   10.91   12.18   13.72
count-graphs       14.40   14.76   14.86   14.98   15.19   15.07
fft                71.89   72.02   71.11   71.57   71.77   71.97
fib                12.38   12.37   12.90   12.35   12.37   12.90
knuth-bendix       22.84   23.05   23.19   23.13   22.86   22.85
lexgen             40.60   40.75   40.74   40.52   39.95   40.09
life               75.39   74.22   76.67   77.84   72.24   80.10
logic              78.99   79.22   79.27   79.29   79.16   78.86
mandelbrot         20.31   20.21   21.18   25.61   26.66   24.76
matrix-multiply    13.59   13.92   14.52   11.80   11.39   12.15
merge             231.85  231.65  231.82  231.52  232.03  231.44
mpuz               50.74   45.49   48.02   50.38   47.43   48.26
nucleic            26.85   26.84   26.88   26.79   26.78   26.80
ratio-regions      26.69   26.54   27.29   26.50   26.90   27.29
simple             15.06   15.02   15.18   14.70   14.56   14.89
smith-normal-form   3.33    3.34    3.34    3.36    3.35    3.34
tak                30.84   30.83   31.33   30.82   30.82   31.93
tensor             14.89   14.30   14.97   14.60   18.12   22.00
tsp                36.27   35.62   35.61   35.80   35.89   35.89
vliw               22.98   23.46   23.76   23.02   23.44   23.83
wc                  6.37    6.39    6.27    8.69    8.77    8.35
zern               13.50   16.09   15.49   14.17   13.88   24.04

I'm not sure what conclusions to draw.  Comparing 0F_ and 1F_ would seem
to indicate that there really aren't that many missed CPS simplifications.
In fact, some benchmarks are better under 1F_ than under 0F_
(e.g., matrix-multiply), which is surprising.

Comparing 0TF with 0TT, and 1TF with 1TT, shows that there is a
non-trivial penalty to be paid for having jo instructions; checksum, for
example, goes from 11.39 to 12.74 seconds under 0T_, about 12% slower.
And there really isn't any way to make that penalty go away.

The fact that there is no clear ordering between 0F_ and 0TF, or between
1F_ and 1TF, tells me that in some cases the backend simplifications are
being too aggressive, so the breaking up of large blocks by the overflow
checking actually helps by preventing those optimizations, while in other
cases that breaking up of blocks penalizes the running time.

Any other insights?