[MLton-user] Problem with Timer.checkCPUTimer on Sparc

Stephen Weeks sweeks@sweeks.com
Mon, 10 Nov 2003 08:46:31 -0800


> The spec seems clear: "Note that implementations should include any time spent
> in the garbage collector in the usr time component."

I agree, which is why we implemented it in MLton the way we did.  In
any case, John Reppy said what was intended and that he will add some
clarification to the spec.

> But I'll editorialise further: the spec is broken either way. You
> have two axes for time: usr/sys & gc/non-gc. Four bins in the cross
> product. If you fold any two of these bins together, you are going to
> screw up some measurement. The API needs to provide all four.

I agree that four would be best.  (These are available via a single
call to MLton.Rusage.rusage.)
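
For reference, a rough sketch of reading out all four bins via
MLton.Rusage.rusage (which returns {children, gc, self} records of
{utime, stime}); this assumes self includes time spent in the gc, so
non-gc is self minus gc:

  (* Sketch: split total CPU time into the four bins.
     Non-gc time is self minus gc. *)
  fun fourBins () =
     let
        val {gc, self, ...} = MLton.Rusage.rusage ()
        val gcUsr = #utime gc
        val gcSys = #stime gc
        val nonGcUsr = Time.- (#utime self, gcUsr)
        val nonGcSys = Time.- (#stime self, gcSys)
     in
        {gcUsr = gcUsr, gcSys = gcSys,
         nonGcUsr = nonGcUsr, nonGcSys = nonGcSys}
     end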

> Further, as you noticed yourself, there is an atomicity problem, in that you
> get CPU time and its related GC time with two different calls, but a GC could
> occur in-between. We can't language-lawyer our way out of that one; the API
> is busted.

Again, I agree.

I will forward your points to the basis library discussion list and cc
you.

> Run on a Sparc, compiled with 
>     -detect-overflow true -safe true -inline 100000
> so I get whatever the default GC machinery is.

Almost certainly stop-and-copy.  You can use "@MLton gc-summary --" to
confirm this.
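
For example, with a hypothetical executable name (runtime args go
between @MLton and --):

  % ./merge @MLton gc-summary -- <your program's args>

The summary is printed when the program exits.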

> - Clearly, the recursive impl is the big winner on the SPARC. But it's such a
>   huge winner it's suspicious.

Indeed.  If your program doesn't inspect the output of merge1, then
MLton's optimizer will be smart enough to avoid allocating the result.
At that point, merge1 becomes a simple tail recursive walk down two
lists until either one is empty.  Sadly, MLton's optimizer is not
smart enough to know that lists are of finite length, and so cannot
optimize away that loop :-).  In any case, I suspect that this is
what's happening.  You can confirm it by compiling with -profile alloc
and checking that no allocation is attributed to merge1, or by looking
at the .ssa if you really want.
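
The profiling workflow is roughly (hypothetical file names):

  % mlton -profile alloc merge.sml
  % ./merge
  % mlprof merge mlmon.out

and then check that merge1 is charged with (essentially) no
allocation.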

> These numbers are weird in *multiple* ways, as I'll outline below. 
> 
> - First, the measurements in the sexp in the left columns are
>       #usr(CheckCPUTimer tm)	(usr time -- gc & non-gc)
>       #sys(CheckCPUTimer tm)	(sys time -- gc & non-gc)
>       CheckGCTime tm		(gc time -- usr & sys)
>     in usec. The *meaning* of these things is, as indicated to the side, 
>     your older semantics (as opposed to your brand-new fixup).

To make sure we're on the same page, here is my understanding of how
the timings in 20030716-2 work.  Let's allocate every unit of time spent
in the program to one of the four bins.

	gc	non-gc
user	A	B
sys	C	D

Then, the Timer structure gives you

 #usr(CheckCPUTimer tm)	= A + B + C
 #sys(CheckCPUTimer tm)	= C + D
 CheckGCTime tm         = A + C

>   The measurements in the columns on the right are in msec & hand-generated
>   from the left columns: non-gc = (usr+sys-gc).

So, your non-gc = (A + B + C) + (C + D) - (A + C) = B + C + D.
I.e., it includes gc system time.

In the new world, with my patch, Timer gives you

 #usr(CheckCPUTimer tm)	= A + B
 #sys(CheckCPUTimer tm)	= C + D
 CheckGCTime tm         = A

So, your non-gc = (A + B) + (C + D) - A = B + C + D.

OK.
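
In code, the subtraction above looks roughly like this (names are
mine):

  (* Sketch: B + C + D, i.e. everything except gc user time,
     under the patched semantics. *)
  fun nonGC tm =
     let
        val {usr, sys} = Timer.checkCPUTimer tm   (* A + B, C + D *)
        val gc = Timer.checkGCTime tm             (* A *)
     in
        Time.- (Time.+ (usr, sys), gc)
     end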

> - It's kind of suspicious that the usec-precise times are always in 
>   10msec units. 1/100th of a second is pretty coarse!

Timer is implemented using getrusage(), which appears to return values
with this granularity, at least on my Linux and SPARC machines.  I'd
be interested to hear from any experts out there who can tell me how to
get more precision out of getrusage or some other function.

But I always like to run benchmarks for at least 5-10 seconds anyway,
to try to avoid other noise and make this issue moot.

> - It's kind of suspicious that gc *dominates* these measurements --
>   these runs are spending 10x as much time in the gc as in the computing
>   (look at the right columns, where I did some subtraction by hand).
>   Huh?

I'm suspicious of benchmarks that run for such a short time.  I ran
the three merges on two lists of sum length 100,000, repeating each
merge 100 times.  I used a version that adds 1 to the first element of
the result to ensure that merge1 actually produces the list.  Here are
the times I see on my 500 MHz SPARC, in microseconds.

((usr 5720000) (sys 990000) (gc 770000))
((usr 4660000) (sys 460000) (gc 800000))
((usr 4490000) (sys  10000) (gc 560000))

This puts gc in the ~15% range, which jibes with intense allocation, a
copying GC, and a heap live ratio of 8 (the MLton default).  It also
jibes with the percentage given by @MLton gc-summary.
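
The harness was roughly along these lines (merge1 here is a stand-in
recursive merge, not necessarily the code in question; the point is
that consuming the head of each result keeps the optimizer from
dropping the list):

  fun merge1 ([], ys) = ys
    | merge1 (xs, []) = xs
    | merge1 (x :: xs, y :: ys) =
         if x <= y
            then x :: merge1 (xs, y :: ys)
         else y :: merge1 (x :: xs, ys)

  val l1 = List.tabulate (50000, fn i => 2 * i)
  val l2 = List.tabulate (50000, fn i => 2 * i + 1)

  (* Adding 1 to the first element of each result and summing it
     keeps the result list from being optimized away. *)
  fun loop (0, acc) = acc
    | loop (n, acc) = loop (n - 1, acc + 1 + hd (merge1 (l1, l2)))

  val tm = Timer.startCPUTimer ()
  val sum = loop (100, 0)
  val {usr, sys} = Timer.checkCPUTimer tm
  val gc = Timer.checkGCTime tm
  val _ = print (concat ["usr ", Time.toString usr,
                         "  sys ", Time.toString sys,
                         "  gc ", Time.toString gc,
                         "  (checksum ", Int.toString sum, ")\n"])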

If you run your benchmarks for longer, what kind of numbers do you
see?  Similar to this?

> - Is this slow? 20 msec = 1/50th of a second to merge 10,000 integers?
>   That's one list element every 2 usec, on a 450Mhz SPARC,
>   i.e., one list element every 900 cycles, yeah, that seems a
>   little slow. 
...
>   That's about 25 insn per list element? So we're talking ~ 36 cycles/insn,
>   back-of-the-envelope? That seems real high.

For my merge1 numbers, I get 671 nanoseconds per element, i.e., 336
cycles per element, which at your estimate of 25 instructions per
element comes out to about 13 cycles/insn.
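
Spelling out the arithmetic, from the merge1 numbers above (usr + sys
over 100 reps of merging 100,000 elements):

  (5,720,000 + 990,000) usec / (100 * 100,000) elements = 671 ns/element
  671 ns/element at 500 MHz (2 ns/cycle)                = ~336 cycles/element
  336 cycles/element / 25 insns/element                 = ~13 cycles/insn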

> Here are the numbers, for SML/NJ & MLton, on a linux x86 & a Solaris Sparc:
> 
> Merging numbers: ((listlen 100000) (nreps 10) (nlists 1000))
> Measuring usr+sys non-gc time in msec
> x86 (mongkok -- 1.2Ghz)
>   Mlton (total process run time: usr=6m1.6s / sys=0.37s)
>     (rec     58530)
>     (iter   140880)
>     (iter2  131230)
>   SML/NJ
>     (rec    482710)
>     (iter   264830)
>     (iter2  266200)
> SPARC (forge -- 450Mhz)
>   Mlton
>     (rec     33990)	;  15.3 cycles/elt & 20 insns/elt => 0.76 cycles/insn
>     (iter   253530)	; 114.1 cycles/elt & 25 insns/elt => 4.56 cycles/insn
>     (iter2  250540)	; 112.7 cycles/elt & 20 insns/elt => 5.64 cycles/insn
>   SML/NJ
>     (rec    920900)
>     (iter   627720)
>     (iter2  608730)

Here's what I see with listlen 100,000, nreps 10,000, and nlists 1,
with times in seconds.
x86
((usr 80.750000) (sys 0.330000) (gc  3.750000))
((usr 76.720000) (sys 0.280000) (gc 17.930000))
((usr 76.200000) (sys 0.130000) (gc 17.620000))
SPARC
((usr 551.750000) (sys 1.240000) (gc 39.110000))
((usr 451.090000) (sys 1.040000) (gc 57.780000))
((usr 448.910000) (sys 0.430000) (gc 58.070000))

For merge1 on the SPARC, this comes out to 553 nanoseconds per
element, comparable to what I saw before.


So, the only phenomenon still unexplained to me is the bizarre timings
you saw for your benchmarks that ran for only 1-2 tenths of a second.
Hopefully those go away when you do longer runs.