"root" of ChunkPerFunc chunk

Thu, 2 Nov 2000 16:10:15 -0500 (EST)

> So the upshot is that the overflow checks are (in almost all cases) not
> interfering with the results of the code generator other than inserting jo's,
> right?  

Well, the exceptions behind "in almost all cases" are things like

RI(0) = Int_add(SI(4), 15)
SI(4) = RI(0)
with no previous or future uses of SI(4) in the block.

By default, this becomes

addl 15,SI(4)
(SI(4) as an address)

but with overflow checking becomes

movl SI(4),reg
addl 15,reg
jo
movl reg,SI(4)
(SI(4) as an address in both cases)

So, there really are more instructions (three versus one), even with the jo
stripped out.  On the other hand, there is (about) the same amount of memory
traffic: SI(4) is read once and written once either way.
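
As a rough C analogue of the two shapes above (purely illustrative; this is
not what the code generator emits, and the names add_unchecked and add_checked
are made up for the sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Unchecked form: one read-modify-write on the stack slot, like
   `addl 15,SI(4)`.  The add is done in unsigned arithmetic so that
   wraparound is explicit; converting back to int32_t is
   implementation-defined but wraps on common compilers. */
static void add_unchecked(int32_t *slot, int32_t k) {
    assert(slot != NULL);
    *slot = (int32_t)((uint32_t)*slot + (uint32_t)k);
}

/* Checked form: load into a temporary, add, test, then store.
   Returns false on overflow and skips the store, which is the
   assumption made here about what the jo target does. */
static bool add_checked(int32_t *slot, int32_t k) {
    assert(slot != NULL);
    int64_t wide = (int64_t)*slot + (int64_t)k;  /* movl SI(4),reg ; addl 15,reg */
    if (wide > INT32_MAX || wide < INT32_MIN)    /* stands in for jo */
        return false;
    *slot = (int32_t)wide;                       /* movl reg,SI(4) */
    return true;
}
```

Either way the slot is read once and written at most once; the checked form
just routes the value through a temporary.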

> If that's the case, then other possible explanations for the slowdown
> due to overflow checks are
>
> * The jo instructions
> * Missed CPS simplification
> 
> Anything else?  Would it be easy to run the benchmarks, once with jo's and once
> without, everything else being equal?

I'll see about running the benchmarks with a few different options:

1. with -DMLton_overflow, but translate Int_{a,s,m}Check as Int_{a,s,m}

2. with -DMLton_overflow, normal translation of Int_{a,s,m}Check,
   but never emit a jo instruction

3. without -DMLton_overflow, but translate Int_{a,s,m} as Int_{a,s,m}Check,
   but never emit a jo instruction

4. without -DMLton_overflow, but translate Int_{a,s,m} as Int_{a,s,m}Check

Option 1 should highlight missed CPS simplifications.
Options 2, 3, and 4 should highlight backend simplifications, branch
prediction, and instruction cache issues.
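
One way to read the four options is as a small matrix of settings.  The sketch
below is only my interpretation: the struct, its field names, and the helper
are hypothetical, and the emit_jo value for option 1 is inferred (no checked
primitives reach the backend, so no jo would be emitted).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one benchmark configuration. */
struct config {
    int  id;
    bool define_overflow;  /* built with -DMLton_overflow */
    bool backend_checked;  /* backend translates the Int_{a,s,m}Check forms */
    bool emit_jo;          /* jo instructions actually emitted */
};

static const struct config configs[4] = {
    {1, true,  false, false},  /* Check translated as the plain primitive */
    {2, true,  true,  false},  /* normal translation, jo suppressed */
    {3, false, true,  false},  /* plain translated as Check, jo suppressed */
    {4, false, true,  true },  /* plain translated as Check, jo kept */
};

/* True when two configurations differ only in whether jo is emitted,
   so a timing difference between them points at the jo's themselves. */
static bool differ_only_in_jo(const struct config *a, const struct config *b) {
    assert(a != NULL && b != NULL);
    return a->define_overflow == b->define_overflow
        && a->backend_checked == b->backend_checked
        && a->emit_jo != b->emit_jo;
}
```

Under this reading, options 3 and 4 are the only pair that differ solely in
the jo's, while 2 and 3 differ in whether the CPS passes saw the checked
primitives.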