mlprof output

Matthew Fluet <fluet@CS.Cornell.EDU>
Thu, 22 Mar 2001 18:33:51 -0500 (EST)


Here's what I currently have working for the mlprof output.  (I generated
the gmon.out file by compiling logic with -p -native false, so the counts
don't mean anything relative to the executable compiled with -native true,
but at least it produced enough data to get some results.)

[fluet@lennon mlprof-native]$ ./mlprof 
wrong number of args
Usage: mlprof [-d{0|1|2}] [-s] a.out-file gmon.out-file

The -s option is the same as before: it determines whether or not static
functions are included.  The -d option determines the detail (or depth) of
the profiling for the generated native code.  The default is 0, which
yields the following:

[fluet@lennon mlprof-native]$ ./mlprof logic gmon.out 
60.61 seconds of CPU time
x_52                                        63.09% 
x_178                                       13.03% 
main_0                                       7.99% 
x_781                                        7.03% 
init_dummy (C @ 0x8048C18)                   2.69% 
saveGlobals (C)                              2.19% 
x_973                                        1.55% 
x_56                                         0.41% 
x_603                                        0.35% 
x_585                                        0.33% 
x_715                                        0.25% 
x_558                                        0.23% 
oc_4                                         0.21% 
x_540                                        0.18% 
x_521                                        0.15% 
x_97                                         0.08% 
x_688                                        0.07% 
x_83                                         0.05% 
x_149                                        0.03% 
x_670                                        0.03% 
x_651                                        0.03% 
x_733                                        0.02% 

Everything not marked (C) corresponds to information derived from the
static labels.  For depth 0, these correspond to top-level cps functions.
There is a slight tradeoff here, I think, compared with the old version of
mlprof; because we default to ChunkPerFunc for the native backend, this
coarsest-grain profiling can correspond to very large chunks of code.  To
get more detailed information, we crank up the detail option.  Here's a
snippet of -d 1:

60.61 seconds of CPU time
x_52                                                63.09% 
     L_2330,L_889                           13.02%   8.22% 
     L_2324,L_885                            7.95%   5.02% 
     L_1349,L_2813                           7.95%   5.02% 
     L_1937                                  6.69%   4.22% 
     L_1359,L_2823                           4.68%   2.95% 
     ... (a little less than 200 more)
x_178                                               13.03% 
     L_115,L_1542                           43.80%   5.71% 
     L_113,L_1539                           21.01%   2.74% 
     L_105,L_1530                            5.19%   0.68% 
     L_1518,L_95                             3.92%   0.51% 
     L_100,L_1524                            3.16%   0.41% 
     ...
main_0                                               7.99% 
     L_1463,L_33                            48.14%   3.84% 
     main_0                                 25.21%   2.01% 
     L_33                                   12.60%   1.01% 
     L_13                                    7.64%   0.61% 
     L_12                                    3.72%   0.30% 
     L_10                                    1.65%   0.13% 
     L_60                                    0.41%   0.03% 
     loop_8                                  0.21%   0.02% 
     L_38                                    0.21%   0.02% 
     L_0                                     0.21%   0.02% 
...

With detail level 1, we start incorporating labels that were generated by
backend.fun.  Depth 0 labels correspond to cps functions (whose names are
passed to blocks in the machine IL in the profileName field); depth 1
labels correspond to labels on blocks in the machine IL.  A lot of these
labels were inherited directly from CPS IL continuation declarations, so
you should be able to trace them back to the .cps and correlate somewhat.
Combined labels, like L_2330,L_889, correspond to two blocks that were
combined by one of the native-codegen simplification passes.  Labels that
don't appear in the CPS IL were introduced by backend.fun in the
translation to the machine IL.  The two columns give percentages relative
to the containing depth 0 function and to the whole program, respectively.
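
As a sanity check on those two columns, using only the numbers above:
L_2330,L_889 accounts for 8.22% of the whole run, and 8.22 / 63.09 is
about 0.1303, which agrees (up to rounding of the printed values) with the
13.02% shown relative to x_52.  Here's a minimal sketch of the
computation; the function and field names are made up for illustration and
are not the actual mlprof code:

   (* Minimal sketch, not the actual mlprof code: the two columns printed
    * at -d 1 for a block, given tick counts for the block, its containing
    * depth 0 function, and the whole run. *)
   fun pct (part: int, whole: int): real =
      100.0 * Real.fromInt part / Real.fromInt whole

   fun columns {blockTicks, funcTicks, totalTicks} =
      {ofFunction = pct (blockTicks, funcTicks),  (* percent of the function *)
       ofProgram = pct (blockTicks, totalTicks)}  (* percent of the program  *)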

Finally, detail level 2 incorporates every label introduced by the native
codegen.

60.61 seconds of CPU time
x_52                                                        63.09% 
     L_2330,L_889                                   13.02%   8.22% 
          L_889,skipGC_125                 100.00%  13.02%   8.22% 
     L_2324,L_885                                    7.95%   5.02% 
          L_885,skipGC_123                 100.00%   7.95%   5.02% 
     L_1349,L_2813                                   7.95%   5.02% 
          L_1349,L_2813                    100.00%   7.95%   5.02% 
     L_1937                                          6.69%   4.22% 
          L_1938                            50.00%   3.35%   2.11% 
          doGC_96                           50.00%   3.35%   2.11% 
     L_1359,L_2823                                   4.68%   2.95% 
          L_1359,L_2823                    100.00%   4.68%   2.95% 

I don't know if this will be particularly useful, but it was easy enough
to compute.  We see more blocks that were combined by simplification
passes.  Labels like skipGC, doGC, and statementLimitCheckLoop are all
introduced by the native codegen, so they don't correspond to any real
code in the CPS IL.  On the other hand, they do give some insight into the
interaction with the runtime; for example, time spent in a skipGC block
will generally correspond to some allocation being done after performing a
GC check.  Of course, we'll need to look at real gmon.out files for any of
this to be valid.

I looked through a -d 2 profile without filtering out blocks with no
ticks, and depth 2 is almost always just one block corresponding to the
depth 1 block; when it isn't, it's because the block was broken up by a GC
check, an array allocation loop, or an overflow check.

Let me know if this gives all the information we would like.  It should be
trivial to modify the gmon reader to work with whatever Henry cooks up.
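
In case it helps to see the idea concretely, here's a rough sketch of the
attribution step.  Assume the gmon.out histogram has already been read as
a vector of tick counts covering [lowpc, highpc), with addresses converted
to ints, and that the executable's static labels come in as a list of
(address, name) pairs sorted by increasing address.  The names and the
int-address assumption are just for the sketch; this is not the actual
mlprof code.  If the on-disk format changes, only the part that fills in
lowpc, highpc, and bins should need to change.

   (* Rough sketch, not the actual mlprof code: attribute each histogram
    * bin's ticks to the last static label at or below the bin's pc. *)
   fun attribute {lowpc: int, highpc: int, bins: int vector,
                  labels: (int * string) list} : (string * int) list =
      let
         val nbins = Vector.length bins
         val span = highpc - lowpc
         (* approximate pc at the midpoint of bin i *)
         fun binPC i = lowpc + (i * span + span div 2) div nbins
         (* name of the last label at or below pc; labels are sorted *)
         fun owner pc =
            List.foldl (fn ((a, name), best) =>
                           if a <= pc then name else best)
                       "<unknown>" labels
         (* merge ticks into per-label totals *)
         fun bump (name, n, []) = [(name, n)]
           | bump (name, n, (name', n') :: rest) =
                if name = name'
                   then (name', n' + n) :: rest
                else (name', n') :: bump (name, n, rest)
      in
         Vector.foldli
            (fn (i, ticks, acc) =>
                if ticks = 0 then acc else bump (owner (binPC i), ticks, acc))
            [] bins
      end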