[MLton-user] Problem with Timer.checkCPUTimer on Sparc

shivers@cc.gatech.edu shivers@cc.gatech.edu
Sat, 8 Nov 2003 18:00:54 -0500


   > Note that gc time is 170 usec & usr time is 140 usec. But that is
   > impossible, since usr time *includes* gc time, according to the SML doc
   > at
   > 
   >     http://www.standardml.org/Basis/timer.html
   ...
   > So it seems like my code took negative time. ??? Is this a bug? 
   > Did I make a mistake?

   The Basis spec doesn't make it clear to me whether system time spent
   during garbage collection should be allocated to usr or sys. Right
   now, MLton gets the usr and sys values from a call to getrusage.  This
   has the effect of allocating system time spent during garbage
   collection to sys.  To allocate the system time to usr would require
   extra bookkeeping to keep track of the system time spent in gc and to
   subtract it off from sys and add it to usr.  That seems odd to me.
   I'll ask around to see what other SML implementors think.

The spec seems clear: "Note that implementations should include any time spent
in the garbage collector in the usr time component."

But I'll editorialise further: the spec is broken either way. You
have two axes for time: usr/sys & gc/non-gc. Four bins in the cross
product. If you fold any two of these bins together, you are going to
screw up some measurement. The API needs to provide all four.
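
Concretely, a four-bin interface might look something like the sketch
below. (This is hypothetical -- it is not the current Basis Timer
signature, and the names are mine.)

    signature CPU_TIMER =
      sig
        type cpu_timer
        val startCPUTimer : unit -> cpu_timer
        (* All four bins of the usr/sys x gc/non-gc cross product. *)
        val checkCPUTimes : cpu_timer
              -> {usr : {gc : Time.time, nongc : Time.time},
                  sys : {gc : Time.time, nongc : Time.time}}
      end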

Further, as you noticed yourself, there is an atomicity problem, in that you
get CPU time and its related GC time with two different calls, but a GC could
occur in-between. We can't language-lawyer our way out of that one; the API
is busted.
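
Here is the race, concretely, with the current two-call API
(computation stands in for the code being timed):

    val tm = Timer.startCPUTimer ()
    val _  = computation ()
    val {usr, sys} = Timer.checkCPUTimer tm
    (* A GC can run right here, between the two calls... *)
    val gc = Timer.checkGCTime tm
    (* ...so gc can count collections the usr/sys figures don't. *)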

   So, with the current MLton implementation, usr + sys should always be
   greater than or equal to gc, but usr alone might not be.

Here is what I am measuring: I have three implementations of list-merge. I
create a pair of random, ordered int lists, whose sum length is 10,000 elts.
Then for each merge function, I turn on the timer, merge the lists, and turn
off the timer. I do this for three such randomly generated merge-pairs. I can
pretty tightly bound the amount of alloc that's happening here:

  - impl 1: recursive. Needs up to 10,000 stack frames & up to 10,000 conses.

  - impl 2: iterate&reverse. Builds the answer backwards in a loop, 
      then reverses it. Needs 0 stack frames & up to 20,000 conses.

  - impl 3: As in 2, but the inner loop has 7 vars, not 3, and the loop is
      "rotated" to ensure that it's operating on non-empty lists. Should run
      faster if there are enough callee-saves regs to hold all 7 iter vars
      (or if the < compare gets inlined and there are enough general-purpose
      regs). Allocates as in impl 2.

(The 3 functions appear at the end of this msg.) That gives you 9 merges: 3
impls * 3 merge-pairs. Below are the nine timings, straight out of
checkCPUTimer & checkGCTime. Total heap alloc for program: 3 * (10k + 20k +
20k) = 150k conses.
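
Each timing is taken roughly like so -- a sketch, not the verbatim
harness; merge is one of the three functions with its comparison
already supplied, e.g. (merge1 Int.<):

    fun timeMerge merge (xs : int list, ys : int list) =
      let val tm = Timer.startCPUTimer ()
          val result = merge (xs, ys)
          val {usr, sys} = Timer.checkCPUTimer tm
          val gc = Timer.checkGCTime tm
      in (result, {usr = usr, sys = sys, gc = gc})
      end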

Run on a SPARC, compiled with
    -detect-overflow true -safe true -inline 100000
so I get whatever the default GC machinery is.

; (usec)				   (msec)
; Measurements				   non-gc   gc  impl & merge-pair
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
((usr      0) (sys     0) (gc       0))         0    0  recursive 1
((usr 150000) (sys 20000) (gc  160000))        10  160  iterate1  1
((usr 130000) (sys 10000) (gc  140000))         0  140  iterate2  1
((usr      0) (sys     0) (gc       0))         0    0  recursive 2
((usr 100000) (sys 70000) (gc  150000))        20  150  iterate1  2
((usr 110000) (sys 30000) (gc  130000))        10  130  iterate2  2
((usr      0) (sys     0) (gc       0))         0    0  recursive 3
((usr 140000) (sys 20000) (gc  140000))        20  140  iterate1  3
((usr 100000) (sys 40000) (gc  110000))        30  110  iterate2  3

These numbers are weird in *multiple* ways, as I'll outline below. 

- First, the measurements in the sexp in the left columns are
      #usr(checkCPUTimer tm)    (usr time -- gc & non-gc)
      #sys(checkCPUTimer tm)    (sys time -- gc & non-gc)
      checkGCTime tm            (gc time -- usr & sys)
    in usec. The *meaning* of these things is, as indicated to the side,
    your older semantics (as opposed to your brand-new fixup).

  The measurements in the columns on the right are in msec & hand-generated
  from the left columns: non-gc = (usr+sys-gc).
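
  In code, that subtraction is just (Time.+ and Time.- are ordinary
  Basis Time arithmetic; nonGC is my name):

      (* non-gc time, in msec: (usr + sys) - gc *)
      fun nonGC {usr, sys, gc} =
        Time.toMilliseconds (Time.- (Time.+ (usr, sys), gc))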

- It's kind of suspicious that the usec-precise times are always in 
  10msec units. 1/100th of a second is pretty coarse!

- It's kind of suspicious that gc *dominates* these measurements --
  these runs are spending 10x as much time in the gc as in the computing
  (look at the right columns, where I did some subtraction by hand).
  Huh?

- It's kind of suspicious that the gc spends so much time in the OS. It's just
  wandering around doing reads & writes to memory. (The *program* is
  *definitely* spending *zero* time in the OS during the timed portions --
  it's just merging two integer lists. Unless paging is a lot more
  cpu-intensive than I'd previously thought?)

- Clearly, the recursive impl is the big winner on the SPARC. But it's such a
  huge winner it's suspicious.

- Is this slow? 20 msec = 1/50th of a second to merge 10,000 integers?
  That's one list element every 2 usec, on a 450MHz SPARC,
  i.e., one list element every 900 cycles. Yeah, that seems a
  little slow. The inner loop on the iterate1 impl is:
    - 2 nil tests
    - 2 car loads; 2 cdr loads
    - an inlined (yeah, MLton!) fixnum <
    - a cons
    - jump to top of loop
    plus the cost of a revAppend, which is likely
    - nil test
    - car & cdr load
    - cons
    - jump
  (I've appended the actual code to this msg; it's function merge2.)
  That's about 25 insn per list element? So we're talking ~ 36 cycles/insn,
  back-of-the-envelope? That seems real high.

More data: I did more runs, this time with 1000 different 100k-elt list-pairs.
I would turn the timer on, do the merge 10 times, and then turn it off. I
collected usr+sys-gc, in accordance w/your claim that there is some gc time in
both the usr & sys component, so I measured "non-gc sys+usr" time. (How I
generate a list pair: I count from 100k down to 1; for each int, I flip a
pseudo-random coin and put the int in list xs or list ys. When done, I have
two ordered lists, xs & ys, whose sum length is 100k elts.)
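
In code, the generator is roughly this -- a sketch; Random is the
SML/NJ library's pseudo-random structure, and the seeds are arbitrary:

    fun makeMergePair n =
      let val rng = Random.rand (13, 17)
          (* Counting down & consing leaves both lists ascending. *)
          fun flip (0, xs, ys) = (xs, ys)
            | flip (i, xs, ys) =
                if Random.randRange (0, 1) rng = 0
                then flip (i - 1, i :: xs, ys)
                else flip (i - 1, xs, i :: ys)
      in flip (n, [], []) end

    val (xs, ys) = makeMergePair 100000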

Here are the numbers, for SML/NJ & MLton, on a Linux x86 & a Solaris SPARC:

Merging numbers: ((listlen 100000) (nreps 10) (nlists 1000))
Measuring usr+sys non-gc time in msec
x86 (mongkok -- 1.2GHz)
  MLton (total process run time: usr=6m1.6s / sys=0.37s)
    (rec     58530)
    (iter   140880)
    (iter2  131230)
  SML/NJ
    (rec    482710)
    (iter   264830)
    (iter2  266200)
SPARC (forge -- 450MHz)
  MLton
    (rec     33990)	;  15.3 cycles/elt & 20 insns/elt => 0.76 cycles/insn
    (iter   253530)	; 114.1 cycles/elt & 25 insns/elt => 4.56 cycles/insn
    (iter2  250540)	; 112.7 cycles/elt & 20 insns/elt => 5.64 cycles/insn
  SML/NJ
    (rec    920900)
    (iter   627720)
    (iter2  608730)

Check the cycles/insn counts I calculated out to the right for the SPARC MLton
run. For the iter function, cycles/insn looks in the ballpark for what is
probably a memory-bound loop with bad locality. However, the value for the
recursive function -- less than 1 cycle/insn -- is WAY too low. Something
funny is going on there -- maybe the compiler is getting the insn count down
in some way I don't understand (see my calculations below), or the timing
accounting is still off. Weird.

Other notes:
  - In MLton, recursive merge crushes iterate&reverse,
    but with SML/NJ, it's the other way 'round.
  - MLton trounces SML/NJ in all instances, by *integer* factors.

    -Olin


===============================================================================
Here's how I calculate cycles/insn
-------------------------------------------------------------------------------
(define (cycles/insn time-sec insns/elt listlen nreps nlists cycles/sec)
  (let* ((cycles/elt (cycles/elt time-sec listlen nreps nlists cycles/sec)))
    (/ cycles/elt insns/elt)))	     

(define (cycles/elt time-sec listlen nreps nlists cycles/sec)
  (let* ((elts (* listlen nlists nreps))   ; total elts processed
         (cycles (* time-sec cycles/sec))) ; total cycles elapsed
    (exact->inexact (/ cycles elts))))

Back-of-the-envelope estimates for insns/elt
-------------------------------------------------------------------------------
(assuming loop iter vars all live in regs, w/no stack traffic across < call.)
  recur: ~ 20 insns
    2 nil-tests (2 insns)
    2 car loads & 2 cdr loads (4 insns)
    1 fixnum < (1 insn)
    1 recursive call (4 insns)
    1 cons (5 insns)
    1 return (2 insns)
  iter: ~ 25 insns
    2 nil-tests (2 insns)		    	[iter]
    2 car loads & 2 cdr loads (4 insns)
    1 fixnum < (1 insn)
    1 cons (5 insns)
    1 jump (1 insn)
    1 nil test (1 insn)		    		[revApp]
    1 car load & 1 cdr load (2 insns)
    1 cons (5 insns)
    1 jump (1 insn)
  iter2: ~ 20 insns
    1 fixnum < (1 insn)		    [iter]
    1 nil test (1 insn)
    1 car & cdr load (2 insns)
    1 cons (5 insns)
    1 jump (1 insn)
    1 nil test (1 insn)		    [revApp]
    1 car load & 1 cdr load (2 insns)
    1 cons (5 insns)
    1 jump (1 insn)


The actual merge functions
-------------------------------------------------------------------------------
fun merge1 op< (xs,ys) =	(* Recursive *)
  let fun m([], ys) = ys
	| m(xs, []) = xs
	| m(xs as (x :: xt), ys as (y :: yt)) = 
          if y < x then y :: m(xs,yt) else x :: m(xt,ys) 
  in m(xs,ys)
  end

fun merge2 op< (xs,ys) =	(* Iterate/reverse *)
  let fun m([], ys, racc) = List.revAppend(racc,ys)
	| m(xs, [], racc) = List.revAppend(racc,xs)
	| m(xs as (x :: xt), ys as (y :: yt), racc) = 
	  if y < x then m(xs, yt, y :: racc)
	  else m(xt, ys, x :: racc)
  in m(xs,ys,[])
  end

(* Iterate/reverse; better if we have plenty of registers. *)
fun merge3 op< (xs,[]) = xs
  | merge3 op< ([],ys) = ys
  | merge3 op< (xs as (x::xrest), ys as (y::yrest)) =
    let val revapp = List.revAppend
	(* Invariant: xs = x::xt and ys = y::yt throughout. *)
	fun m(xs,x,xt, ys,y,yt, racc) =
	    if y < x
	    then case yt of [] => revapp(racc, y::xs)
			  | y2::ys => m(xs,x,xt,  yt,y2,ys, y::racc)
	    else case xt of [] => revapp(racc, x::ys)
			  | x2::xs => m(xt,x2,xs, ys,y,yt,  x::racc)
    in m(xs,x,xrest, ys,y,yrest, [])
    end