SSA

Matthew Fluet <fluet@CS.Cornell.EDU>
Mon, 22 Oct 2001 12:30:16 -0400 (EDT)


> > > We could (?) always put the exnStack element in a place relative to the
> > > stack, say in the word immediately below stackBottom.  (Although that sort
> > > of messes with getting stacks mod 8 aligned.)  The GC_thread struct
> > > seems a little overly abstract -- it doesn't seem as though two threads
> > > with different exnStacks could point to the same execution stack?
> > 
> > Yes, I think you're right.  We could flatten the GC_stack data
> > structure into the GC_thread data structure.  That would be nice,
> > would save a little space, simplify the runtime a little, and enable
> > tracing the exception stack.  
> 
> I like tracing the exception stack; it seems a little more natural than
> the stub continuation label that fakes the liveness of both the cont and
> handler.  I think I'll take a crack at this over the weekend.

O.k.  Here are the results of my experiments with the runtime.

First, we can't flatten GC_stack into GC_thread.  The reason is that
SML code and objects refer only to GC_threads, never to GC_stacks.  When
the current thread runs out of stack space, it copies its stack
data to a new, larger stack and updates the stack field of the current
thread.  This ensures that all ML objects referring to the current
thread "see" the new stack.  My first attempt at combining all the
GC_stack and GC_thread data works o.k. for single-threaded apps, and even
for some multi-threaded apps, but mutex.sml and thread2.sml both
failed.
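To make the indirection concrete, here's a rough C sketch (the struct
layouts and names are made up for illustration, not the actual gc.h
definitions):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* ML objects hold a GC_thread*, never a GC_stack*, so growing the
 * stack only has to update the one stack field in the thread. */
typedef struct { size_t reserved; char *bottom; } GC_stack;
typedef struct { GC_stack *stack; } GC_thread;

static GC_stack *newStack(size_t reserved) {
    GC_stack *s = malloc(sizeof *s);
    s->reserved = reserved;
    s->bottom = malloc(reserved);
    return s;
}

/* Copy the frames into a larger stack and swing the thread's pointer.
 * Every ML object referring to the thread now "sees" the new stack. */
static void growStack(GC_thread *t) {
    GC_stack *old = t->stack;
    GC_stack *new = newStack(2 * old->reserved);
    memcpy(new->bottom, old->bottom, old->reserved);
    t->stack = new;
}
```

If GC_stack were flattened into GC_thread, the stack data couldn't move
without updating every ML object that points at the thread.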

So, I just moved the exnStack field from the GC_thread struct to the
GC_stack struct.  This slightly penalizes setting handlers and raising,
because access to the exnStack field goes through one extra level of
indirection.
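The extra indirection looks roughly like this (a sketch with
illustrative field names, not the real runtime structs):

```c
#include <assert.h>
#include <stddef.h>

typedef struct GC_stack {
    size_t used;          /* bytes of frames currently on the stack   */
    ptrdiff_t exnStack;   /* moved here from GC_thread                */
} GC_stack;

typedef struct GC_thread {
    GC_stack *stack;      /* ML objects point at the thread, which
                           * points here                              */
} GC_thread;

/* Setting a handler now costs one extra dereference through
 * thread->stack, instead of writing the thread's own field. */
static void setExnStack(GC_thread *t, ptrdiff_t offset) {
    t->stack->exnStack = offset;   /* was: t->exnStack = offset */
}
```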

That all worked out o.k., so I took a crack at setting frame layouts for
handlers and walking the handler stack as part of GC.

Setting everything up wasn't too difficult.  The changes to the backend were
pretty minimal, although if we decide to go this route, I'll go through
and clean up some more.  The C codegen needed almost no changes.  The x86
codegen just needed to save the frame index with a handler label.

The runtime required a little more work.  I disabled the "Frame layouts"
portion of the invariant check in gc.c, which was checking that all of the
live slots in a frame were less than the frame size.  The issue here
is what we mean by frame size.  I decided that I wanted it to mean "the
amount to subtract from top to get to the frame for which these offsets
make sense."  For continuations, this is the intuitive notion of frame
size -- we subtract the whole frame.  For handlers, it isn't quite the
same: we actually subtract the offset of the slot for the code pointer.  In
general, we allocate the slot for the code pointer before all the slots
that are live in a handler, so you can get situations where some of the
live offsets are larger than the "size" of the frame.
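A toy illustration of that convention, with a made-up frame descriptor
(not the real GC_frameLayout):

```c
#include <assert.h>
#include <stddef.h>

/* "size" is the amount to subtract from top to reach the base against
 * which liveOffsets make sense.  For a continuation, size covers the
 * whole frame; for a handler, it only reaches the code-pointer slot. */
typedef struct {
    size_t size;
    size_t liveOffsets[2];
    int numLive;
} FrameLayout;

static char *frameBase(char *top, const FrameLayout *l) {
    return top - l->size;
}
```

With a continuation layout every live offset is below size; with a
handler layout a live offset can exceed the "size", which is why the
old invariant check had to go.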

Actually changing GC_foreachPointerInObject to walk the handler stack was
not difficult.  However, forward implicitly assumes that we only ever
visit a pointer in an object once.  When we walk the continuation stack
and then walk the handler stack, we're likely to hit some pointers more
than once.  As a quick hack, I made forward check whether the pointer
has already been forwarded to toSpace; if so, it just returns.  This could
slow down the GC, but I don't see it happening in practice (see benchmarks
below).  I think we can make GC_foreachPointerInObject a little more
complicated and have it walk the continuation and handler stacks at the
same time, hitting each pointer only once.  That way only stacks pay a
little more in GC time.
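The quick hack looks roughly like this, with a stand-in header encoding
and a fake toSpace (nothing here is the actual gc.c code):

```c
#include <assert.h>
#include <stdint.h>

#define FORWARDED ((uintptr_t)1)  /* hypothetical tag in the header */

typedef struct {
    uintptr_t header;   /* FORWARDED => payload holds the new address */
    uintptr_t payload;
} Obj;

/* Forward *pp into toSpace, but return immediately if the object was
 * already moved by an earlier visit (e.g. once via the continuation
 * stack, then again via the handler stack). */
static void forward(Obj **pp, Obj *toSpace, int *next) {
    Obj *p = *pp;
    if (p->header == FORWARDED) {        /* the quick-hack check */
        *pp = (Obj *)p->payload;
        return;
    }
    Obj *copy = &toSpace[(*next)++];
    *copy = *p;                          /* copy to toSpace */
    p->header = FORWARDED;               /* leave a forwarding pointer */
    p->payload = (uintptr_t)copy;
    *pp = copy;
}
```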

Anyways, here are the results of the benchmarks:

MLton0 -- mlton-cvs
MLton1 -- mlton-newer
compile time
benchmark         MLton0 MLton1
barnes-hut           3.1    2.8
checksum             0.6    0.6
count-graphs         1.9    2.0
DLXSimulator         4.8    5.0
fft                  1.3    1.3
fib                  0.6    0.6
hamlet              56.2   56.4
knuth-bendix         2.6    2.6
lexgen               6.2    6.3
life                 1.5    1.5
logic                6.9    6.8
mandelbrot           0.7    0.6
matrix-multiply      0.7    0.7
md5                  1.5    1.4
merge                0.7    0.6
mlyacc              26.1   25.9
mpuz                 1.0    1.0
nucleic              3.6    3.6
peek                 1.1    1.1
psdes-random         0.7    0.6
ratio-regions        3.0    3.0
ray                  3.8    3.9
raytrace            10.3   10.3
simple               7.6    7.5
smith-normal-form    9.3    9.3
tailfib              0.6    0.6
tak                  0.6    0.6
tensor               3.5    3.5
tsp                  1.8    1.8
tyan                 4.4    4.5
vector-concat        0.7    0.7
vector-rev           0.7    0.6
vliw                14.5   14.5
wc-input1            1.8    1.8
wc-scanStream        1.9    1.9
zebra               14.1   14.2
zern                 1.1    1.1

run time
benchmark         MLton0 MLton1
barnes-hut           4.9    4.9
checksum             4.1    4.1
count-graphs         5.8    6.1
DLXSimulator        20.2   21.0
fft                 10.8   12.6
fib                  3.9    4.0
hamlet              10.0   10.2
knuth-bendix         8.2    8.4
lexgen              12.9   12.8
life                 9.9   10.2
logic               32.2   32.3
mandelbrot           8.6   15.3
matrix-multiply      6.8    6.7
md5                  0.6    0.6
merge               70.3   67.8
mlyacc              11.8   11.8
mpuz                 5.6    5.5
nucleic              8.7    8.7
peek                 4.3    4.4
psdes-random         4.2    4.2
ratio-regions       10.9   10.9
ray                  4.6    4.9
raytrace             5.7    5.7
simple               7.4    7.4
smith-normal-form    1.2    1.2
tailfib             20.3   19.1
tak                  9.8    9.5
tensor               7.6    7.2
tsp                 11.5   11.5
tyan                23.6   23.6
vector-concat        7.7    7.3
vector-rev           5.7    5.7
vliw                 7.6      *
wc-input1            2.6    2.6
wc-scanStream        4.4    4.4
zebra                2.8    2.8
zern                47.1   45.4

run time ratio
benchmark         MLton1
barnes-hut           1.0
checksum             1.0
count-graphs         1.0
DLXSimulator         1.0
fft                  1.2
fib                  1.0
hamlet               1.0
knuth-bendix         1.0
lexgen               1.0
life                 1.0
logic                1.0
mandelbrot           1.8
matrix-multiply      1.0
md5                  1.0
merge                1.0
mlyacc               1.0
mpuz                 1.0
nucleic              1.0
peek                 1.0
psdes-random         1.0
ratio-regions        1.0
ray                  1.1
raytrace             1.0
simple               1.0
smith-normal-form    1.0
tailfib              0.9
tak                  1.0
tensor               1.0
tsp                  1.0
tyan                 1.0
vector-concat        1.0
vector-rev           1.0
wc-input1            1.0
wc-scanStream        1.0
zebra                1.0
zern                 1.0

size
benchmark          MLton0  MLton1
barnes-hut         60,594  61,018
checksum           38,365  38,733
count-graphs       57,917  58,469
DLXSimulator       95,749  96,357
fft                46,953  47,353
fib                38,349  38,749
hamlet            970,824 978,280
knuth-bendix       77,590  78,638
lexgen            140,717 141,565
life               56,269  56,677
logic             166,389 166,829
mandelbrot         38,325  38,709
matrix-multiply    38,773  39,141
md5                46,014  46,470
merge              39,357  39,741
mlyacc            432,621 434,221
mpuz               44,133  44,549
nucleic            77,973  78,357
peek               46,118  46,598
psdes-random       39,365  39,733
ratio-regions      59,661  60,109
ray                84,592  85,304
raytrace          178,237 179,213
simple            165,345 165,969
smith-normal-form 143,886 144,326
tailfib            38,069  38,453
tak                38,397  38,781
tensor             63,021  63,533
tsp                51,310  51,710
tyan               94,958  95,766
vector-concat      39,013  39,397
vector-rev         38,837  39,221
vliw              281,705 283,145
wc-input1          56,918  57,502
wc-scanStream      59,278  59,846
zebra             133,550 134,198
zern               43,960  44,360

The easy comments: the fact that we have frame layouts for both
continuations and handlers increases the static data size of the executables.

For the most part, the GC hacks aren't noticeable in the running times.

mandelbrot has me completely stumped.  The x86 assembly produced by the
x86 codegen is identical in both the old and new compilers.  Running with
gc-summary reveals that neither program ever invokes the GC, so the
runtime changes can't explain it.  Alignment of the tight loops?  The
change in the runtime did affect the addresses of the produced assembly.
I tried aligning loop headers at mod 4 and mod 8 boundaries, and recovered
about a second -- nothing compared to the 6.5 seconds we lost.  Profiling
shows that the same loops are hot in both versions; one just takes a lot
longer.

[fluet@lennon mandelbrot]$ mlprof -d 1 mandelbrot.cvs mlmon.cvs.out 
18.42 seconds of CPU time
main_0              100.00%
     L_30    48.21%        
     L_26    31.27%        
     loop3_1  3.91%        
     L_45     3.37%        
     L_44     3.09%        
     L_27     3.04%        
     L_31     2.82%        
     L_46     2.06%        
     loop2_1  1.14%        
     L_32     1.09%        
[fluet@lennon mandelbrot]$ mlprof -d 1 mandelbrot.newer mlmon.newer.out 
26.47 seconds of CPU time
main_0              100.00%
     L_30    56.18%        
     loop3_1 17.34%        
     L_26    10.31%        
     L_46     4.00%        
     loop2_1  3.93%        
     L_44     2.72%        
     L_31     2.34%        
     L_45     1.21%        
     L_27     1.10%        
     L_32     0.87%        

Here are the top 3 blocks in .ssa and .s:

  L_26 ()
    x_25 = Real_fromInt (x_20)
    x_24 = Real_add (x_26, x_25)
    x_10 = Real_mul (x_24, global_15)
    loop3_1 (x_6, x_10, global_8)
  loop3_1 (x_13, x_14, x_2)
    x_17 = Int_lt (x_2, global_12)
    case x_17 of
      false => L_34 | true => L_30
  L_30 ()
    x_12 = Real_mul (x_14, x_14)
    x_11 = Real_mul (x_13, x_13)
    x_16 = Real_add (x_11, x_12)
    x_15 = Real_gt (x_16, global_14)
    case x_15 of
      false => L_32 | true => L_31


L_26:
	fildl (0+(32*1))(%ebp)
	faddL (0+(8*1))(%ebp)
	fmulL (globaldouble+(1*8))
	fstL (0+(40*1))(%ebp)
	movl $0,(0+(64*1))(%ebp)
	fstpL (0+(56*1))(%ebp)
	fldL (0+(24*1))(%ebp)
	fstpL (0+(48*1))(%ebp)
loop3_1:
	movl (0+(64*1))(%ebp),%esi
	cmpl $2048,%esi
	jnl L_27
L_30:
	fldL (0+(56*1))(%ebp)
	fld %st
	fmul %st, %st
	fldL (0+(48*1))(%ebp)
	fld %st
	fmul %st, %st
	fld %st(2)
	fadd %st(1), %st
	fldL (globaldouble+(0*8))
	fxch %st(1)
	fcompp
	fnstsw %ax
	testw $0x4500,%ax
	jz L_63

Like I said, they are identical in both versions.  

[fluet@lennon mandelbrot]$ nm -n mandelbrot.cvs | grep "..."
0804908c t loop1_1
080490b1 t loop2_1
080490c0 t L_26
080490df t loop3_1
080490ee t L_30
[fluet@lennon mandelbrot]$ nm -n mandelbrot.newer | grep "..."
08049094 t loop1_1
080490b9 t loop2_1
080490c8 t L_26
080490e7 t loop3_1
080490f6 t L_30

Alignment doesn't seem significantly better in the first.  Long shot
guess: alignment of individual floating point instructions?  I can't find
any support for this in the Intel documentation, but I'm at a loss.


The failure of vliw under the new compiler is not directly related to the
changes I made.  Using the currently checked-in sources, I get the
following:

[fluet@lennon vliw]$ ./vliw @MLton gc-summary fixed-heap 600k --
Segmentation fault

And compiling with -g, I get:

[fluet@lennon vliw]$ ./vliw @MLton gc-summary fixed-heap 600k --
gc.c 583: assert(0 == s->fromSize or (s->frontier <= s->limit + LIMIT_SLOP
and s->limit == s->base + s->fromSize - LIMIT_SLOP)) failed.
Aborted

Commenting out that assert and looking at where the seg fault occurs, I
think I've got a guess.  One SSA function is the following:

fun member_1 (x_6660, x_6659) = L_4441()
  L_4441 ()
    loop_385 (x_6659, x_6660)
  loop_385 (x_6677, x_6664)
    x_6678 = #2 x_6677
    x_6663 = #1 x_6677
    case x_6678 of
      ::_2 => L_4454 | nil_1 => L_4443
  L_4454 (x_6662, x_6673)
    case x_6663 of
      Env_2 => L_4453 | Env_15 => L_4452
  L_4453 ()
    x_6676 = String_equal (x_6673, x_6664)
    L_4445 (x_6676)
  L_4452 (x_6672, x_6670)
    L_4451 (explode_0 (x_6664, x_6672))
  L_4451 (x_6675)
    L_4450 (pref_0 (x_6675))
  L_4450 (x_6674)
    L_4449 (implode_0 (x_6674, x_6670))
  L_4449 (x_6668)
    L_4448 (explode_0 (x_6673, x_6672))
  L_4448 (x_6671)
    L_4447 (pref_0 (x_6671))
  L_4447 (x_6669)
    L_4446 (implode_0 (x_6669, x_6670))
  L_4446 (x_6666)
    x_6667 = String_equal (x_6668, x_6666)
    L_4445 (x_6667)
  L_4445 (x_6665)
    case x_6665 of
      false => L_4444 | true => L_4442
  L_4444 ()
    x_6661 = (x_6663, x_6662)
    loop_385 (x_6661, x_6664)
  L_4443 ()
    global_12
  L_4442 ()
    global_83

There is an allocation in L_4444 (which seems like it could be avoided
using some sort of local flattening, but that doesn't really matter).
Looking at the corresponding .toMOut, there is no limit check with
bytes > 0 in this code.  So, I think the seg faults come from simply
running past the end of the heap frontier.
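For reference, the check that ought to guard the allocation looks
roughly like this (names and the heap layout are made up; the real
codegen emits this inline, not as a function call):

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char *base, *frontier, *limit;
    int gcCalls;
} Heap;

/* Stand-in for invoking the collector; just resets the frontier. */
static void doGC(Heap *h) {
    h->gcCalls++;
    h->frontier = h->base;
}

/* Bump-allocate bytes, but only after the limit check.  Without the
 * check, allocation silently runs past limit -- the seg fault above. */
static char *allocate(Heap *h, size_t bytes) {
    if (h->frontier + bytes > h->limit)   /* the missing limit check */
        doGC(h);
    char *res = h->frontier;
    h->frontier += bytes;
    return res;
}
```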

Anyways, I just found this, so I haven't taken a look at what needs to be
done to fix it.