[MLton-devel] heap profiling

Stephen Weeks MLton@mlton.org
Wed, 11 Sep 2002 19:22:27 -0700


Alain writes:

> I was wondering if we could not benefit from some form of heap
> profiling. We have some apps that really do allocate a lot, with
> GC time in some cases higher than 66%. Do you have ideas on this
> topic ?

We discussed heap profiling in August 2001 and came up with what we
think is a reasonable approach, but never implemented anything.  Our
idea was to do something similar to mlprof that would provide dynamic
bytes allocated counts at the SSA (then called CPS) function and basic
block level.  We would keep a counter for each allocation point in the
program and bump that counter by the number of bytes allocated each
time the allocation point was reached.  There are some messy details
dealing with how the data is stored and implementing 64 bit
operations, but having read through our discussions again it all looks
reasonable to me.
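[Editorial note: a minimal C sketch of the counter-per-allocation-point idea described above. The names `bytesAllocated` and `bumpAlloc` are illustrative, not MLton's; the real number of allocation points would be known at compile/link time.]

```c
#include <stdint.h>

#define NUM_ALLOC_POINTS 3  /* illustrative; fixed at compile/link time in practice */

/* One 64 bit counter per allocation point; the compiler would emit a
 * bump of the matching counter each time the point is reached. */
static uint64_t bytesAllocated[NUM_ALLOC_POINTS];

static inline void bumpAlloc(unsigned point, uint32_t bytes) {
        bytesAllocated[point] += bytes;
}
```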

For the record, here is my summary of our previous mail on the topic.
We had the following design choices to make

* How to store the data while running
  - linked list of counters, one per point (linked in when the point
	is reached)
  - array with as many elements as there are bytes in the text
	segment (just like mlprof)
  - parallel arrays with one element per allocation point
	one array stores the program counter for each allocation point
	one array stores the bytes allocated for each allocation point
* How to store the data for analysis
  - dense (like mlprof currently does)
  - sparse
* Whether to use 32 or 64 bit bins
* How to increment the counter
  - C call at each allocation
  - inline code
* At what compiler pass to insert the increments

Here is what we decided.

* How to store the data while running
  - for mlprof, keep as it is now, with an array with 32 bit bins, one
    bin for every byte in the program.
  - for mlprof-space, use parallel arrays with one element per
    allocation point.  The program counter array uses 32 bit bins and
    the bytes allocated array uses 64 bit bins.

* How to store the data for analysis
  - for both mlprof and mlprof-space, use a sparse, unsorted
    representation with 32 bits for each address and 64 bits for each
    count.  For mlprof, this will be an improvement over the current
    dense representation as long as less than 1/3 of the bins are
    nonzero, which seems likely.
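[Editorial note: the 1/3 crossover follows from the record sizes; a sketch of the arithmetic in C, with illustrative names:]

```c
#include <stdint.h>

/* Dense layout: one 32 bit bin (4 bytes) for every byte of text.
 * Sparse layout: a 32 bit address plus a 64 bit count (12 bytes) per
 * *nonzero* bin.  Sparse is smaller exactly when fewer than 1/3 of
 * the bins are nonzero. */
static uint64_t denseBytes(uint64_t numBins)      { return 4 * numBins; }
static uint64_t sparseBytes(uint64_t nonzeroBins) { return 12 * nonzeroBins; }
```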

* How to increment the counter
  - Use a few instructions inline.
    One disadvantage of this is that we have to add a little compiler
    support for 64 bit add -- but we may want this anyway.  The other
    disadvantage is the compile-time and run-time cost.

* At what compiler pass to insert the increments
  - At the time Matthew proposed adding code to the x86-codegen.
    However, since then the C codegen has continued to survive, and we
    have seen the simplification that comes from moving stuff out of
    the codegens into the backend.  So, I propose to put it in the
    backend.

One thing we didn't consider a year ago was how the approach would
interact with the selective profiling infrastructure that we have
since added.  I think the parallel arrays work very nicely with
allowing the user program to switch which set of counters is being
used.  We simply need to allocate one array of 64 bit bins for each
unit of profiling data.

So, it all seems pretty reasonable to me.  I have a pretty good feel
for the difficulty of everything except for adding the 64 bit support
to the native codegen.  Matthew, could you comment on that?  If that's
too hard, we could go with a C call per allocation.  Maybe the
performance hit will be acceptable.

Alain, I'm not sure if you've played around enough with mlprof to know
whether this kind of heap profiling information would be helpful.  But
if you have, let us know what you think.

For ease of reference, here is all the previous mail.

--------------------------------------------------------------------------------

From: Matthew Fluet <mfluet@intertrust.com>
Date: Thu, 23 Aug 2001 10:14:24 -0700 (PDT)
Subject: assembly/profiling question

I've been thinking about extending profiling to gather allocation stats.
Essentially, I'd like to reuse the existing profiling infrastructure
(mlprof, mlmon.out file structure, even the bin datastructure and file
output in prof.c, if possible).  Ideally, I'd like to just change every
addl X,gcState.frontier
to
addl X,gcState.frontier
addl X,(appropriate bin)

Now, when you ran mlprof, you'd see the % of total program allocation done
in each CPS function and in each CPS block.

It seems to me that by the time the entire program is linked, then
the address of the appropriate bin can be known -- it's just
(bin_addr + (current_addr - text_sec_start_addr))
so, I want to write
addl X,(bin_symbol + (. - _start))

(Yes, this is slightly different than the way prof.c is set up right now.
I was thinking of linking in an assembly directive that allocated space
for the bins in the bss section.)

But, I can't even get (. - _start) to assemble; the _start symbol is
unknown in the current file.  Am I just stuck?

I know I can do it with a few additional instructions to compute the
address, but the goal had been to not need any additional registers.
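[Editorial note: the bin-address computation Matthew describes, sketched in C. `bins` and `textStart` are assumed names, and C does the scaling-by-4 implicitly that the assembly version must do by hand.]

```c
#include <stdint.h>

/* The bin array is parallel to the text segment: one 32 bit bin per
 * byte of code, so the bin for a code address sits at the same byte
 * offset into the array. */
static inline uint32_t *binFor(uint32_t *bins, const char *textStart,
                               const char *codeAddr) {
        return &bins[codeAddr - textStart];
}
```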

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Thu, 23 Aug 2001 12:49:10 -0500
Subject: Re: assembly/profiling question

The  problem is (almost certainly) that you can't use an external symbol in a
negative context.  I.e., all uses of externals must be external+constant.

If you want to profile allocation  (which  you  can't  do  if  you  are  also
profiling run time since it would skew the times) then wouldn't it be easier
to just add one extra word allocated which  had  the  program  counter?   I'm
thinking that the GC would first do a scan of the old space and write out the
data.  Maybe this is more trouble.

Ah, how about this technique (a bit  more  expensive  run-time  wise):  every
allocation point (or function, or what ever granularity you want) allocates a
counter which also has a  link  field.   Then  allocation  (or  entering  the
allocation region) consists of

    if link NULL:
        link = start of list of used bins
        start of used bins = my counter

The result is that on exit you have a linked list of all the counters that
were used.

I seem to recall that some scheme like  this  is  what  gprof  uses  for  its
function call counts.
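[Editorial note: a C sketch of Henry's linked-list scheme, with one added assumption: a sentinel head node, so that a never-used counter (link NULL) can be distinguished from the first counter on the used list.]

```c
#include <stddef.h>
#include <stdint.h>

struct bin {
        struct bin *next;  /* NULL until this counter is first used */
        uint64_t count;
};

/* Sentinel head: a non-NULL end-of-list marker, so next == NULL can
 * mean "not yet linked" even for the first counter used. */
static struct bin sentinel;
static struct bin *usedBins = &sentinel;

static inline void bump(struct bin *b, uint64_t bytes) {
        if (b->next == NULL) {  /* first time this point is hit */
                b->next = usedBins;
                usedBins = b;
        }
        b->count += bytes;
}
```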

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Thu, 23 Aug 2001 19:33:18 -0700
Subject: Re: assembly/profiling question

Matthew and I talked a bit about profiling, and it really would help if
space-mlprof uses the same infrastructure as time-mlprof.  So, he's going to
stick with a variant of the scheme he mentioned, 

> addl X,gcState.frontier
> to
> addl X,gcState.frontier
> addl X,(appropriate bin)

except that since it's not possible to get the assembler to stick in the
constant we want, there will be several instructions inserted that actually
compute the address at runtime.  This won't affect allocation behaviour of the
program at all, so we should be fine.

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Thu, 23 Aug 2001 21:46:19 -0500
Subject: Re: assembly/profiling question

I don't understand what he wants to do.  You can't use
	addl	X, (bin_symbol + (. - _start))
(ignoring linker problems) because the bins are going to be word sized, so
you have to scale.  Besides that, it would require an array whose size is the
same as the size of the program times 4.  (The profiler uses this much.)
If you are willing to do that, then it is only one extra indirection.  You
store the base address of the array minus _start in some cell, and then
the store is to the contents of that word plus `.'.
I can understand the convenience of having the output file be in the same
format (what ever that exactly means) as the mlprof.out file, but the data
in memory can just be what ever and be traversed when the file is written out.
A linked list is just fine.  You follow the links into a new array, sort them
and write it out.  What is the problem?
Note, either way the code being space-prof'd is going to be slower, so you
can just call a C function at each allocation point.  This function can now
compute where you are (from the return address) and just needs to get the
size in some register.
I must be missing something since all of these seem really simple to me.

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Thu, 23 Aug 2001 19:59:47 -0700
Subject: Re: assembly/profiling question

> Besides that, it would require an array whose size is the
> same as the size of the program times 4.  (The profiler uses this much.)

We want that for space profiling too.  In fact, 2^32 could really be a
limitation in self compiles.

> If you are willing to do that, then it is only one extra indirection.  You
> store the base address of the array minus _start in some cell, and then
> the store is to the contents of that word plus `.'.

Sounds excellent.  I await Matthew's approval.

> I can understand the convenience of having the output file be in the same
> format (what ever that exactly means) as the mlprof.out file, but the data
> in memory can just be what ever and be traversed when the file is written out.
> A linked list is just fine.  You follow the links into a new array, sort them
> and write it out.  What is the problem?

No problem.  What you say is true enough.  It's just a question of the
complexity of the code introduced into the code generator and the performance
overhead.  If we can come up with a scheme with only an extra instruction per
allocation, that would be ideal.  Maybe what you suggested above works.

> Note, either way the code being space-prof'd is going to be slower, so you
> can just call a C function at each allocation point.  This function can now
> compute where you are (from the return address) and just needs to get the
> size in some register.

I think the slowdown for this would be unacceptable.

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Thu, 23 Aug 2001 22:12:43 -0500
Subject: Re: assembly/profiling question

I certainly didn't mean to imply that I thought that the slow down would be
too large to make the space profiling usable.  It is just that you have to
remember never to do time profiling on the same run.

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Thu, 23 Aug 2001 22:20:35 -0500
Subject: space profiling

I  just  realized that the space profiling is definitely going to have a real
problem with overflowing the bins.  If I have a hot loop allocating  lots  of
junk,  it  isn't  hard  for one location to be responsible for more than 2^32
bytes of allocation.  This seems like another  argument  for  going  for  the
procedure call, and having the linked list use 64 bits worth of counter (plus
32 bits for the link).  This way you are safe from wrapping.

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Fri, 24 Aug 2001 09:52:59 -0700
Subject: space profiling

> I  just  realized that the space profiling is definitely going to have a real
> problem with overflowing the bins.  If I have a hot loop allocating  lots  of
> junk,  it  isn't  hard  for one location to be responsible for more than 2^32
> bytes of allocation. 

I agree.

> This seems like another  argument  for  going  for  the
> procedure call, and having the linked list use 64 bits worth of counter (plus
> 32 bits for the link).  This way you are safe from wrapping.

Couldn't we just go to 64 bit bins?

--------------------------------------------------------------------------------

From: Matthew Fluet <mfluet@intertrust.com>
Date: Fri, 24 Aug 2001 09:53:53 -0700 (PDT)
Subject: Re: assembly/profiling question

Just so we all have the same context, here's a fuller description of what
I'd like to add.

Add a Profiling structure to the MLton structure:

structure Profiling : PROFILING

signature PROFILING =
  sig
     (* a compile-time constant, like MLton.debug *)
     val profiling: bool

     (* reset all the bin counters to 0 *)
     val reset: unit -> unit

     (* write out all of the bins to a mlmon.out format file,
      * with the given file name
      *)
     val write: string -> unit
  end

Append (prepend?) to the input source something like

val _ = if MLton.Profiling.profiling
          then OS.Process.atExit (fn () => MLton.Profiling.write "mlmon.out")
          else ()

(Yes, this has slightly different semantics than the current profiling, I
think.  The mlmon.out file won't have the timing info for GC_done.  Maybe
the right thing is to let the final write be handled as it is, by
registering atExit with C, rather than with ML.)

The whole point being that I can now profile sub-portions of the program.
For example, I think it would be great to modify Control.trace to
automatically profile each top-level pass.

To support all this, we need to change prof.c and export a few more
functions.  But I don't think that is so bad.

Then I was thinking about different things I could do with the bins and
mlprof.  The advantage of mlprof is that it is already set up to process
and display information by CPS/SSA functions, CPS/SSA blocks, and x86
blocks.  So, by accumulating allocation counts into bins, I automatically
can see which CPS functions are doing the majority of allocation, and even
figure out which blocks in the function are doing the most allocation.
(Probably obvious, but the key here is that I get to see dynamic counts;
it's easy to look at a block and see that it's allocating 10 tuples, but I
don't know whether that block is executed once or 10 million times.)
Similarly, we could get dynamic function counts, by incrementing a bin on
entry to each CPS function.  Likewise, dynamic block counts.  The only
issue with all this is that with only one set of bins, we can only profile
one thing at a time, and it will be a compile-time choice, not a runtime
choice.  (But, that seems to be o.k.  If I'm profiling time, I don't want
the results polluted by additional checks in the program to see
if it's currently gathering allocation stats.)

> I don't understand what he wants to do.  You can't use
> 	addl	X, (bin_symbol + (. - _start))
> (ignoring linker problems) because the bins are going to be word sized, so
> you have to scale.  Besides that, it would require an array whose size is the
> same as the size of the program times 4.  (The profiler uses this much.)

Yes, I forgot the scaling factor.  But that just scales the expression
above by a constant value.  I still claim that if the bin array is
statically allocated (e.g., in bss), then I could compute the absolute
address of the bin to increment during the final link.  I don't see the
big issue with the linker.  I can do
addl X, (external_symbol)
in some assembly file, and the assembler just leaves a 32bit field ready
to be filled in later.  The only difference between this and the above is
that the above is a calculated 32bit value, rather than just dropping in
the address of external_symbol.

(Killer MLton app:  write a linker that does "what I want." ;)

> If you are willing to do that, then it is only one extra indirection.  You
> store the base address of the array minus _start in some cell, and then
> the store is to the contents of that word plus `.'.

That's fine, and probably what I'll settle on.  The only issue will be
convincing the codegen that references to memory that use the symbol "."
are always different (and presumably may-aliasing).  (The issue is that it
looks at memory references as syntactic constructs, and implicitly assumes
that label/symbol immediates are constant.)

> > in memory can just be what ever and be traversed when the file is written out.
> > A linked list is just fine.  You follow the links into a new array, sort them
> > and write it out.  What is the problem?
>
> No problem.  What you say is true enough.  It's just a question of the
> complexity of the code introduced into the code generator and the performance
> overhead.  If we can come up with a scheme with only an extra instruction per
> allocation, that would be ideal.  Maybe what you suggested above works.

The linked list would work o.k., but as Steve said, there is the
complexity (and location) of the code introduced.  Remember, everything
that is added by the codegen is added as assembly.  So, checking the link
pointer, setting the link pointer, updating the start of bins, and
incrementing the counter looks nice in C, but is messier in x86.  There is
also the fact that introducing the test for the link pointer inserts new
blocks and join points to the program.  A similar issue arises with
making a C function call at each allocation (and I agree with Steve here
that the cost of that would be prohibitive).  Likewise, I either insert
the additional code during the translation from the Machine IL (in which
case the extra code goes through all of the intervening passes) or I do it
sometime later, where I might have to be more careful (e.g., if I insert
after register allocation, I've got to step carefully around available
registers).

--------------------------------------------------------------------------------

From: Matthew Fluet <mfluet@intertrust.com>
Date: Fri, 24 Aug 2001 09:57:26 -0700 (PDT)
Subject: Re: space profiling

> I  just  realized that the space profiling is definitely going to have a real
> problem with overflowing the bins.  If I have a hot loop allocating  lots  of
> junk,  it  isn't  hard  for one location to be responsible for more than 2^32
> bytes of allocation.  This seems like another  argument  for going  for  the
> procedure call, and having the linked list use 64 bits worth of counter (plus
> 32 bits for the link).  This way you are safe from wrapping.

Yeah, the overflow might be an issue.  This would require additional
changes to mlprof.

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Fri, 24 Aug 2001 12:55:00 -0500
Subject: Re: space profiling

You could just use 64-bit bins, but then the space requirements double again
(8 bytes of bin for every byte of object code) and now the generated code at
each allocation point is a 32-bit add to a 64-bit counter.  That seems to push
harder for calling a procedure which can do it and do the linking to save
space.

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Fri, 24 Aug 2001 11:01:50 -0700
Subject: Re: space profiling

> You could just use 64-bit bins, but then the space requirements double again
> (8 bytes of bin for every byte of object code) and now the generated code at
> each allocation point is a 32-bit add to a 64-bit counter.  That seems to push
> harder for calling a procedure which can do it and do the linking to save
> space.

The largest program we know of is MLton, whose text segment is currently just
under 6M.  I can live with 48M (since it's not in the MLton heap), although I
agree that's pushing it.

Another possibility would be to coalesce the bins.  How bad would it be to
cut the number of bins in half?

--------------------------------------------------------------------------------

From: Matthew Fluet <mfluet@intertrust.com>
Date: Fri, 24 Aug 2001 11:14:55 -0700 (PDT)
Subject: Re: space profiling

> > You could just use 64-bit bins, but then the space requirements double again
> > (8 bytes of bin for every byte of object code) and now the generated code at
> > each allocation point is a 32-bit add to a 64-bit counter.  That seems to push
> > harder for calling a procedure which can do it and do the linking to save
> > space.
>
> The largest program we know of is MLton, whose text segment is currently just
> under 6M.  I can live with 48M (since it's not in the MLton heap), although I
> agree that's pushing it.

But, that 48M competes with the heap for space.  (Yes, we don't need to
look at it during a GC, but if we allocate the space before GC_init, it's
that much less for the heap; if we allocate the space after GC_init, we
might fail because the heap might be taking up too much.  Or maybe it
doesn't matter -- we'll just end up swapping in and out sections of the
bins.)

> Another possibility would be to coalesce the bins.  How bad would it be to
> cut the number of bins in half?

It makes the computation of the right bin just a little more complicated.
It might also give "incorrect" results, where time profile ticks for
blocks get shifted forward or backwards.  Probably o.k. for space
profiling, because there would be enough "buffer" space (i.e., setting
everything up for the allocation and the tick) between the start of a
block and the occurrence of the tick, and likewise between the tick and the
end of the block.

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Fri, 24 Aug 2001 11:58:42 -0700
Subject: Re: space profiling

> It makes the computation of the right bin just a little more complicated.
> It might also give "incorrect" results, where time profile ticks for
> blocks get shifted forward or backwards.  Probably o.k. for space
> profiling, because there would be enough "buffer" space (i.e., setting
> everything up for the allocation and the tick) between the start of a
> block and the occurrence of the tick, and likewise between the tick and the
> end of the block.

That's what I was thinking.  32 bit bins for time and 64 bit bins for space.

It's a minor change to mlprof (once we get Int64 added to MLton) to make it
handle both formats.

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Sat, 25 Aug 2001 00:13:40 -0500
Subject: Re: assembly/profiling question

I  finally  had a chance to read through your (Matthew) mail on the Profiling
structure.  Note that at the very least  we  need  separate  flags  for  heap
profiling  and  for  time profiling since you can't do both correctly.  Since
the heap profiling requires different code, I agree  that  it  has  to  be  a
compile-time   option.    The   time   profiling   can,   if   you   use  the
`__attribute__((weak))' trick used in gc.c, be just a link-time option, which
is kind of nice, but not a big deal.

Next, I agree that having the same format file for heap and time profiling is
rather convenient, but I suggest that the translation can be done in the code
to  write  the  file.   Actually,  it  would  probably be good to make the time
profiling data be sparse, so just go with an all sparse  form  of  the  data.
I.e.,  the output consists of some header (different from the current one for
compatibility) and then a sequence of bins, each of which contains an address
and  a non-zero count.  I would probably vote for the bin addresses NOT being
sorted in the file, and require the mlprof program to sort  them.   With  the
sparse  bins,  it  would  probably  be  ok  to always make them 64 bits.  The
argument is that with a 100 Hz clock rate even if the program  runs  for  one
hour  of  CPU time you have a maximum of 360,000 bins, so at 12 bytes per bin
(4 bytes for the address and 8 for the counter), this is only 4.3  meg.   The
sparse  representation  with 64 bit counters costs 3 times the space per non-
zero bin, so it is smaller if the bins are at most  1/3  non-zero.   This  is
quite  likely  since  lots  of bins are zero because of instructions that are
more than one byte long.

The only funny extra thing about time profiling is the `unknown'  count  (the
number  of  times  that  an  interrupt  happened  when  you  were in a shared
library).

Hm, I was thinking that the only code  added  for  heap  profiling  would  be
something  which  loaded  one  register  with  the  size of the block, loaded
another with the address of the bin and then did a jsr to  a  fixed  function
(written  by  hand).   Now that I think about this I see that this won't work
because you don't have a stack.  Still, that doesn't really matter since  the
following would work:

    One register gets the size of the block being allocated.
    One register gets the address of the bin.
    One register gets the address to return to.

and  then  you jump to a fixed location.  This location (in libmlton.a) would
have the hand-written code to do the linking and incrementing of the bin.   I
guess  that  the  code  generator would have to know to emit this sequence of
loads and the jump (probably easy) and also that the registers are  going  to
get used (could be hard).

Actually,  here  is  what I would do: there would be a fixed area of size 3*4
bytes.  Now the code at each allocation would be

    Store register 1 in the first word of the fixed area.
    Load register 1 with the size of the block being allocated.
    Store register 2 in the second word of the fixed area.
    Load register 2 with the address of the bin.
    Store register 3 in the third word of the fixed area.
    Load register 3 with the address following the jump instruction.
    Jump to the fixed code.

Note that now this code has NO  effect  that  the  code  generator  needs  to
understand.  The only connection between it and the code generator is getting
the size of the block.

As to the time overhead, I claim it will be essentially squat.  I.e., I would
expect  even  in  a program allocating lots of very small things that running
with heap profiling on would probably be 1/4 the speed of  running  with  it
off  at worst.  Is that too much?  (I admit, the above is just a gut feeling,
but the point is that you are doing a bit more  than  a  dozen  extra  memory
references, and the base line is at least 2-3 words allocated.)

Now to make the format of the file invariant between heap and time profiling,
you translate in the code which writes out the file.  In the  heap  case  you
don't  do much: just walk the bins and write them out.  (Note, in memory the
bin includes a next pointer, an address (of the allocation point in the  code
stream)  and  a 64 bit counter, so 4 words.  The write function doesn't write
out the next field.)

In the time profiling case, you just walk  through  the  bins  (which  are  a
contiguous  array) and write out the corresponding location and count, padded
to 64 bits, for each count which is  non-zero.   Note,  the  reason  for  the
difference  in  memory  layout  is  that the time profiling must minimize the
impact on CPU time (since that is what we are measuring) and  also  we  don't
know  what  bins  are  possible  to use when we start, so we just allocate an
array for all of them.  Also the bins only have to be  32  bits  (since  that
won't overflow for over a year of CPU time).
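[Editorial note: a sketch of the write-out step Henry describes for the dense time bins, translating to the sparse on-disk form. The function name and record layout here are assumptions, not the actual prof.c code.]

```c
#include <stdint.h>
#include <stdio.h>

/* Walk the dense array of 32 bit time bins and emit one sparse record
 * per nonzero bin: a 32 bit address plus the count widened to 64 bits,
 * matching the space-profiling records.  Returns records written. */
static size_t writeSparse(FILE *out, const uint32_t *bins, uint32_t numBins,
                          uint32_t textStart) {
        size_t written = 0;
        for (uint32_t i = 0; i < numBins; i++) {
                if (bins[i] != 0) {
                        uint32_t addr = textStart + i;
                        uint64_t count = bins[i];  /* pad to 64 bits */
                        fwrite(&addr, sizeof addr, 1, out);
                        fwrite(&count, sizeof count, 1, out);
                        written++;
                }
        }
        return written;
}
```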

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Mon, 27 Aug 2001 16:06:47 -0500
Subject: space profiling

No comment on my suggestion for heap profiling?

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Mon, 27 Aug 2001 15:32:56 -0700
Subject: space profiling

> No comment on my suggestion for heap profiling?

I just took a look.  I am trying to sort through the many issues in my head.
Here are some thoughts.

* I like your idea for inserting instructions so that the code generator doesn't
  have to know anything about how profiling works or how whatever statements get
  inserted muck with registers.  I think that is the way to go and should be
  applicable to any scheme, if needed.

* I like the idea of separating the constraints on in-memory representation of
  profiling information from the constraints on file format.

* I like the idea of using the same sparse, unsorted, file format both for space
  and time. 

* I agree that for time profiling, we should stick with an array of 32 bit bins,
  one bin per byte of code.

* I think that for space profiling, we should use an array of 96 bit bins, one
  bin *per allocation point*, not per byte of executable.  The bin stores the 32
  bit code address and the 64 bit counter.  Each allocation point is assigned a
  unique integer.  With this approach, we might even be able to use Matthew's
  idea for one increment instruction to do the bump (the table and the unique
  integer are compile/link time constants -- the only unknown at runtime is
  the allocation amount, and then only in the case of arrays).  The number of
  allocation points, even for a self-compile, is small enough to make this
  approach fine.  I just measured and there are 55,651 allocations and 1,922
  array allocations in a self compile.

* I am not willing to give up a run time factor of four for space profiling, and
  probably not even a factor of 2.  I would like to continue brainstorming until
  we find an approach that we are all happy with that has hopefully no more than
  a 10% or 20% hit.  Maybe what I suggested above meets this?

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Mon, 27 Aug 2001 18:02:29 -0500
Subject: Re: space profiling

It sounds like you like all of my idea except for the run-time cost.  I haven't
actually produced a test of it yet (I will later today), but my claim was that
I can't believe that it would EVER be more than a factor of 4 times slower.
Note that if the program isn't allocating much then the overhead is irrelevant,
and if it is, then the temporary area and allocation bins are all going to
be in L1 cache, which will make this very fast.  I think that it will really
not hurt much at all.
I'll do some tests today hacking up some assembler code to see how much of a
burden it is.

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Mon, 27 Aug 2001 16:10:52 -0700
Subject: Re: space profiling

> It sounds like you like all of my idea except for the run-time cost.  I haven't
> actually produced a test of it yet (I will later today), 

CVS is *much* more important.

> but my claim was that
> I can't believe that it would EVER be more than a factor of 4 times slower.
> Note that if the program isn't allocating much then the overhead is irrelevant,
> and if it is, then the temporary area and allocation bins are all going to
> be in L1 cache, which will make this very fast.  I think that it will really
> not hurt much at all.
> I'll do some tests today hacking up some assembler code to see how much of a
> burden it is.

How about my idea of an array with a bin per allocation point?  It seems to have
both low time and space overhead.

Also, I don't completely understand your linked list approach.  How do you get
from a code address to the appropriate bucket?  Is a pointer stored in the code?
Or is it like my approach with a per code point spot known at compile time?  If
so, then the only difference between our approaches is that mine allocates all
the bins statically in an array, while yours allocates them dynamically.  As
I've argued, the most we're gonna see is 60,000 bins, so why not allocate them
statically.

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Mon, 27 Aug 2001 18:49:36 -0500
Subject: Re: space profiling

One problem (perhaps not a big deal) with statically allocating the bins is
that you have to know the number of allocation points.  Also, if you are
going to avoid writing out 0's (which will probably make the mlprof file
much smaller), then you still have to scan the bins to see the size.  I
agree, the difference isn't much either way.

If the bins are statically allocated, then the code at an allocation point is
going to look like:

    save flags some where
    addl    <allocation size>, <low 32 bits of counter>
    adcl    $0, <high 32 bits of counter>
    restore flags from some where

If the bins are dynamically allocated, then the code at an  allocation  point
is going to be:

    movl    %esp, <save area>
    movl    <save area - 4>, %esp
    pushl   <allocation size>
    pushl   $<my bin address>
    call    <magic routine>

You can't store the bin after the call because it is in code space, which
is not writable.

And now the magic routine can do whatever is needed, including restoring
everything (not trivial to get the stack back and then do the return, but
doable).
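In C, the magic routine for dynamically allocated bins might look like the sketch below: on the first visit to an allocation point it links a fresh bin into a global list, and on every visit it bumps a 64-bit byte counter.  The struct layout and all names (`Bin`, `magicBump`, `allBins`) are assumptions for illustration, not actual MLton code.

```c
#include <stdint.h>
#include <stdlib.h>

/* A dynamically allocated bin, linked in when its point is first reached. */
struct Bin {
	uint32_t pc;      /* code address of the allocation point */
	uint64_t bytes;   /* running total of bytes allocated */
	struct Bin *next; /* chain of all bins seen so far */
};

static struct Bin *allBins = NULL;

static void magicBump(struct Bin **slot, uint32_t pc, uint32_t size) {
	if (*slot == NULL) { /* first time this point is reached */
		struct Bin *b = malloc(sizeof *b);
		b->pc = pc;
		b->bytes = 0;
		b->next = allBins;
		allBins = *slot = b;
	}
	(*slot)->bytes += size;
}
```

The `slot` pointer stands in for the per-point spot known at compile time; the linked list exists only so the output routine can find every bin that was ever touched.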

Ok, I agree, the fixed bins is a win.

Did you agree or disagree that for timing the bins should still be 96 bits on
output?   I  think  they should just so that the code in mlprof doesn't worry
about more than one kind of bin.  Going to the sparse variety  will  probably
save more space than this will cost us.
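The 96 bits break down as a 32-bit program counter plus a 64-bit count.  A hedged C sketch of one output record (field names are illustrative; on disk the two fields occupy 4 + 8 = 12 bytes):

```c
#include <stdint.h>

/* One 96-bit profile bin, the single format shared by time and space
   profiles.  Names are illustrative only. */
struct ProfileBin {
	uint32_t pc;    /* address of the allocation (or sampled) point */
	uint64_t count; /* bytes allocated, or clock ticks */
};
```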

CVS moves another slot higher on the magic list.

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Mon, 27 Aug 2001 16:56:08 -0700
Subject: Re: space profiling

> One  problem  (perhaps not a big deal) with statically allocating the bins is
> that you have to know the number of allocation points. 

No problem.  I think it is easy for Matthew to recognize frontier increments.

> If the bins are statically allocated, then the code at an allocation point is
> going to look like:
> 
>     save flags some where
>     addl    <allocation size>, <low 32 bits of counter>
>     adcl    $0, <high 32 bits of counter>
>     restore flags from some where

Yuck, but it makes sense.

> Ok, I agree, the fixed bins is a win.

Excellent.

> Did you agree or disagree that for timing the bins should still be 96 bits on
> output?   I  think  they should just so that the code in mlprof doesn't worry
> about more than one kind of bin.  Going to the sparse variety  will  probably
> save more space than this will cost us.

I agree.  One format for both time and space, sparse, 96 bit.

--------------------------------------------------------------------------------

From: Matthew Fluet <fluet@CS.Cornell.EDU>
Date: Tue, 28 Aug 2001 09:16:22 -0400 (EDT)
Subject: Re: space profiling

> > One  problem  (perhaps not a big deal) with statically allocating the bins is
> > that you have to know the number of allocation points. 
> 
> No problem.  I think it is easy for Matthew to recognize frontier increments.

Yes, that's no problem.  I now believe that we can do space profiling with
no impact on register allocation and without extraneous help from the
linker (at the expense of an insane number of symbols in the resulting
executable).  The trick is that the register allocation will build up a
global data section, to be emitted as a final .S file.

My proposal is to do the following.

We recognize when register allocation is processing an instruction of the
form
addl <size of allocation>,<frontier location>

This should be allocated as
addl <reg or immediate>,<reg>

Now, we gensym two labels: LabelPC and LabelBin

Now, we emit the allocated instruction as

LabelPC:
	addl <reg or immediate>,<reg>
	addl <reg or immediate>,LabelBin+4
	adcl $0,LabelBin+8

And we add to the global data section

LabelBin:
	.long $LabelPC, 0, 0

Once we're done with register allocation for the whole program, the global data
section will look like:

LabelBinA:
	.long $LabelPCA, 0, 0
LabelBinB:
	.long $LabelPCB, 0, 0
...
LabelBinZ:
	.long $LabelPCZ, 0, 0

We choose some common label known to the output routine and emit the
global data section as:

.data
LabelSpaceProfBins:
	.long <number of bins>
LabelBinA:
	.long $LabelPCA, 0, 0
LabelBinB:
	.long $LabelPCB, 0, 0
...
LabelBinZ:
	.long $LabelPCZ, 0, 0

Now, on emission of space profiling info, the output routine should just
need to copy the right number of bins from LabelSpaceProfBins.  (Or, I
guess, just emit the non-zero bins, to use the sparse format.)

> > If the bins are statically allocated, then the code at an allocation point is
> > going to look like:
> > 
> >     save flags some where
> >     addl    <allocation size>, <low 32 bits of counter>
> >     adcl    $0, <high 32 bits of counter>
> >     restore flags from some where

So, this is similar to what I'm proposing above.  I don't quite know what
you mean by "save flags some where".  The previous instruction should be
the frontier increment, which will set the condition flags in some way,
but they are never used.  If you meant save registers some where, then the
proposal above eliminates the need.

> > Ok, I agree, the fixed bins is a win.

The only disadvantage is that the size of the executable will be much
larger, but if Steve is saying that there are only 60,000 allocation
points, then that shouldn't be a problem.  We might get some slow link
time as the linker attempts to resolve all 120,000 symbols.

O.k.  New proposal (although this puts more burden on the output routine).
Instead of gensyming new symbols, we just keep a counter of the number of
allocation points seen.  Each bin is now a constant offset from the start
of the bins.  We still need symbols to store the PC somewhere.  With this
method, I'd suggest two static arrays.  One of just bins and one of just
PC addresses.  (The one of just bins could be put in .bss section, which
saves a little space in the final executable.)  The output routine needs
to walk both arrays in tandem, but that shouldn't be a big problem.
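The tandem walk over the two static arrays, keeping only non-zero bins for the sparse format, could look roughly like this.  `emitSparse` and the in-memory output layout are assumptions; a real output routine would write records to the profile file instead.

```c
#include <stddef.h>
#include <stdint.h>

/* One sparse record: the PC paired back up with its non-zero count. */
struct SparseBin {
	uint32_t pc;
	uint64_t count;
};

/* Walk the parallel PC and bin arrays in tandem, copying out only the
   bins with non-zero counts; returns how many records were written. */
static size_t emitSparse(const uint32_t *pcs, const uint64_t *counts,
                         size_t n, struct SparseBin *out) {
	size_t written = 0;
	for (size_t i = 0; i < n; i++)
		if (counts[i] != 0) {
			out[written].pc = pcs[i];
			out[written].count = counts[i];
			written++;
		}
	return written;
}
```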

> > Did you agree or disagree that for timing the bins should still be 96 bits on
> > output?   I  think  they should just so that the code in mlprof doesn't worry
> > about more than one kind of bin.  Going to the sparse variety  will  probably
> > save more space than this will cost us.
> 
> I agree.  One format for both time and space, sparse, 96 bit.

Sounds good, although that means we need 64-bit integer support.  I don't
think that will be too difficult.  I looked at what gcc does for long long
operations, and it's pretty straightforward.  Probably the biggest issue
will be redoing the C calling convention to support returns of 64 bit
values in %eax and %edx.
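What gcc does for long long on a 32-bit machine amounts to word-at-a-time arithmetic with an explicit carry, mirroring the addl/adcl pair above.  A small illustrative sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* 64-bit add built from 32-bit words: add into the low word, and if it
   wrapped around (unsigned overflow), carry into the high word. */
static void add64(uint32_t *lo, uint32_t *hi, uint32_t n) {
	uint32_t oldLo = *lo;
	*lo += n;        /* addl: low word, may wrap */
	if (*lo < oldLo) /* wraparound means a carry occurred */
		(*hi)++; /* adcl: propagate the carry */
}
```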

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Tue, 28 Aug 2001 09:32:22 -0700
Subject: Re: space profiling

> Yes, that's no problem.  I now believe that we can do space profiling with
> no impact on register allocation and without extraneous help from the
> linker (at the expense of an insane number of symbols in the resulting
> executable). 
...
> O.k. New proposal (although this puts more burden on the output routine).
> Instead of gensyming new symbols, we just keep a counter of the number of
> allocation points seen.  Each bin is now a constant offset from the start
> of the bins.  We still need symbols to store the PC somewhere.  With this
> method, I'd suggest two static arrays.  One of just bins and one of just
> PC addresses.  (The one of just bins could be put in .bss section, which
> saves a little space in the final executable.)  The output routine needs
> to walk both arrays in tandem, but that shouldn't be a big problem.

I prefer this to the insane number of symbols approach.

Does it make sense to move any of this into MachineOutput?

--------------------------------------------------------------------------------

From: Henry Cejtin <henry@sourcelight.com>
Date: Tue, 28 Aug 2001 13:57:08 -0500
Subject: Re: space profiling

In my mail, the reason for the `save flags some where' and `restore flags
from some where' was the idea that the entire rest of the code generator
could be completely unaware that anything was going on.  If this is no big
deal to the code generator, then it looks like the two parallel arrays are
the way to go.

As to the 64 bit integer support, for now just using IntInf's should be fine.
It isn't like mlprof does  much.   As  to  the  code  in  the  program  being
profiled, I would still probably vote (weakly) for doing this in C.  With the
global nature of MLton, any ML code could cause changes in other things.

--------------------------------------------------------------------------------

From: Matthew Fluet <fluet@CS.Cornell.EDU>
Date: Wed, 29 Aug 2001 11:57:56 -0400 (EDT)
Subject: Re: space profiling

> As to the 64 bit integer support, for now just using IntInf's should be fine.
> It isn't like mlprof does  much.   

True.  But, we'll want them eventually, particularly for Position.int's in
files.

> As  to  the  code  in  the  program  being
> profiled, I would still probably vote (weakly) for doing this in C.  With the
> global nature of MLton, any ML code could cause changes in other things.

Vote for doing what in C?

--------------------------------------------------------------------------------

From: Stephen Weeks <sweeks@intertrust.com>
Date: Wed, 29 Aug 2001 09:25:30 -0700
Subject: Re: space profiling

> > As  to  the  code  in  the  program  being
> > profiled, I would still probably vote (weakly) for doing this in C.  With the
> > global nature of MLton, any ML code could cause changes in other things.
> 
> Vote for doing what in C?

I believe he means the atExit stuff.  And it is a good point that we don't want
profiling to interfere with optimization of the program -- the easiest way to do
that is to keep it out of ML.

--------------------------------------------------------------------------------


-------------------------------------------------------
In remembrance
www.osdn.com/911/
_______________________________________________
MLton-devel mailing list
MLton-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mlton-devel