before first PROFILE line

Henry Cejtin henry@clairv.com
Fri, 9 Jul 1999 00:05:06 -0500


I have just done a careful (by hand) investigation of the profiling result on
ratio regions, which reported a lot of time in `before first PROFILE line' of
Chunks.   I  deduce  that  the output of mlprof is completely correct, and we
really are spending that time there.

To be more specific, in my run, mlprof claimed that 5.93%  of  the  time  was
spent  before  the first PROFILE line of Chunk2.  Since there were a total of
2361 ticks (23.61 CPU seconds) in the run, that means 140 ticks.

The code at the start of Chunk2 is:

        0804d710 <Chunk2>:
        Chunk2():
        /tmp/ratio.c:6772
         804d710:       83 ec 58                subl   $0x58,%esp
         804d713:       55                      pushl  %ebp
         804d714:       57                      pushl  %edi
         804d715:       56                      pushl  %esi
         804d716:       53                      pushl  %ebx
         804d717:       8b 1d 4c 21 05 08       movl   0x805214c,%ebx
         804d71d:       8b 35 28 21 05 08       movl   0x8052128,%esi
        /tmp/ratio.c:6814
         804d723:       83 3d c8 21 05 08 40    cmpl   $0x40,0x80521c8

The first PROFILE messages of Chunk2 is line 6813:
        /* PROFILE: MLTON_chunkSwitch (magic) */
and so the first line of the chunkSwitch should be line 6814, and as  we  can
see  in the above code, this is exactly correct (the compare to $0x40 (or 64)
is indeed the check for the first case).

Summing all this up, we see that Chunk2 starts at 0x804D710 and that the code
for MLTON_chunkSwitch starts at 0x804D723.

Ok,  now  here  is  the  information  from the gmon.out file for the start of
Chunk2:

                address         number of ticks
        [0x804d710, 0x804d714)          76
        [0x804d714, 0x804d718)          45
        [0x804d71c, 0x804d720)          7
        [0x804d720, 0x804d724)          12

Note, I charge all of a bin to the center of the bin, so all these  bins  get
charged to the before first PROFILE area, and the only doubt is that the last
bin might not really all be correct (some part of those 12 ticks should  have
been charged to MLTON_chunkSwitch, but we don't know how much).

Thus the real correct answer is that between
        76 + 45 + 7 = 128
and 140 tick (or between 5.42% and 5.93%) really was used by Chunk2 before it
got to the chunkSwitch.  Note, the only thing that  this  code  is  doing  is
allocating  some  space  on  the C stack (a subtract from a register), saving
some registers on the stack, and loading some globals  into  registers.   The
last two actions are the code generated for

        char    *stackTop = gcState.stackTop;
        pointer frontier = gcState.frontier;

Now this could only happen if we are entering calling the C function Chunk2 a
lot.  In the 23.61  CPU  seconds  that  ratio  runs  for,  Chunk2  is  called
29,291,557 times.  The time charged to this code is
and so each call is taking
        1.4 seconds / 29291557 = 47.8 nanoseconds = 19 cycles
on  my  400 MHz machine.  7 instructions, 8 memory references (6 writes and 2
reads), each of 32 bits.  Definitely not too bad.

My summary: mlprof is working perfectly.