Matthew Fluet <fluet@CS.Cornell.EDU>
Wed, 26 Sep 2001 16:00:10 -0400 (EDT)
> I agree that 10% seems way huge. One thing I have seen, but this would only
> happen if the code or data is large enough, is to get burned by the cache
> associativity. The problem is that the cache uses physical addresses, so an
> 8K chunk of logically contiguous memory (the point being bigger than a page)
> might not lay out well in the cache. I thought that the associativity of the
> L1 (even the D-cache) was large enough (at least size of cache / number of
> pages) that this effect could never happen there, but I have definitely seen
> it in the L2.
> You can test for this by copying the 2 a.out files and then running the
> copies. This will force the kernel to use different chunks of physical
> memory and hence will roll the dice again as far as cache layout.
Why will the copies necessarily get different chunks of physical memory?
> I think that this is worth trying (or even the more extreme of rebooting),
> but I don't see how it could happen in something so small.
Well, I tried it and got the same results. The numbers I quoted in the
e-mail were produced on this machine, but I first discovered the anomaly on
another machine.
What's even stranger, if we go with the assumption that just decoding the
bigger offsets produces the speedup: when I rearrange the code a little
more, moving the L_37 and L_30 blocks up before the top-level handler code
(which yields smaller offsets for some of the other conditional branches),
the performance gets worse (not as bad as base, but worse than the
optimized version).
Anyway, the other machine that I was working on has the perfcntr kernel
patch, so I profiled a few runs. That didn't make much sense either.
Assuming I was measuring what I thought I was measuring, then where the two
programs differed by more than noise, I would have expected base to be
better (fewer cache misses, etc.). I don't remember the exact counters I
was measuring, but I saw some oddities.