profBug

Matthew Fluet mfluet@intertrust.com
Mon, 4 Jun 2001 08:31:09 -0700 (PDT)


> I  agree that if you don't specify -s it can get it wrong, but the problem is
> that if you clump all the static routines together in `<static>',  it  really
> isn't telling you anything useful.

True.  Hopefully most profiles won't be dominated by C code, so this might
be a minor issue.  And, I suspect that when C code is a hot spot, it will
likely be the GC, which is easily recognized by the labels.

> It's  interesting that you prefer writing the stuff bottom-up.  I have always
> really liked top down.  Way way back I remember being pained by the bottom-up
> order  in  Pascal.   I  can  see why ML uses bottom-up, since everything is a
> `let*', but it still hurts.  I use structures to fix some of it, but lots  of
> times I would just like something like Haskell's `where'.  Oh well.

I agree that top down is the way to design something, but I've just gotten
into the habit of putting aux functions before their use.  Probably biased
by ML, but even in C, I would need to prototype functions, so I would just
put the function bodies there too.

> One  thing I'm still not completely clear on with the profiling.  What is the
> notion of the -d stuff.  Is the default (`-d 0') supposed to be  at  the  CPS
> function  level and everything above that only mapping to the assembler code?
> Have we abandoned trying to relate the profiling to the CPS stuff at all?  It
> seems that currently the profiling stuff only relates to the assembly code we
> generate.  (Of course this could just be my out-of-date version of MLton.)
> What exactly are the higher -d values supposed to show?  (I know it is more
> detail, but I'm trying to ask if there is any concept of what more I get.)

Here's the info I wrote up about the -d option.  This is from an old
e-mail, so the output format is a little different, but otherwise it's the
same.  The remaining option, -t n, is a threshold option: it suppresses
the display of labels that account for less than n% of the time, which is
good for cutting out all of the really minor labels.
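To make the -t semantics concrete, here's a sketch of the thresholding in
Python (the label names and tick counts are invented for illustration; this
is not mlprof's actual code, which is written in SML):

```python
# Sketch of the -t n behavior: drop labels under n% of total time.
def apply_threshold(samples, threshold_pct):
    """samples: dict mapping label -> tick count.  Returns (label, pct)
    pairs at or above threshold_pct, sorted by descending share."""
    total = sum(samples.values())
    rows = [(label, 100.0 * ticks / total) for label, ticks in samples.items()]
    rows = [(label, pct) for label, pct in rows if pct >= threshold_pct]
    return sorted(rows, key=lambda row: -row[1])

# Made-up tick counts; -t 1 hides the two labels under 1%.
ticks = {"x_52": 630, "x_178": 130, "main_0": 80, "x_733": 2, "x_651": 3}
for label, pct in apply_threshold(ticks, 1.0):
    print(f"{label:10s} {pct:6.2f}%")
```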

From fluet@hoho.cs.cornell.edu Mon Jun  4 08:18:46 2001
Date: Thu, 22 Mar 2001 18:33:51 -0500 (EST)
From: Matthew Fluet <fluet@hoho.cs.cornell.edu>
To: MLton@sourcelight.com
Subject: mlprof output

Here's what I currently have working for the mlprof output.  (I generated
the gmon.out file by compiling logic with -p -native false, so the counts
don't mean anything relative to an executable compiled with -native true,
but at least it produced enough data to get some results.)

[fluet@lennon mlprof-native]$ ./mlprof
wrong number of args
Usage: mlprof [-d{0|1|2}] [-s] a.out-file gmon.out-file

The -s option is the same as before: it determines whether or not static
functions are included.  The -d option determines the detail (or depth) of
the profiling for the generated native code.  The default is 0, and yields
the following:

[fluet@lennon mlprof-native]$ ./mlprof logic gmon.out
60.61 seconds of CPU time
x_52                                        63.09%
x_178                                       13.03%
main_0                                       7.99%
x_781                                        7.03%
init_dummy (C @ 0x8048C18)                   2.69%
saveGlobals (C)                              2.19%
x_973                                        1.55%
x_56                                         0.41%
x_603                                        0.35%
x_585                                        0.33%
x_715                                        0.25%
x_558                                        0.23%
oc_4                                         0.21%
x_540                                        0.18%
x_521                                        0.15%
x_97                                         0.08%
x_688                                        0.07%
x_83                                         0.05%
x_149                                        0.03%
x_670                                        0.03%
x_651                                        0.03%
x_733                                        0.02%

Everything not marked (C) corresponds to information derived from the
static labels.  For depth 0, these correspond to top-level CPS functions.
There is a slight tradeoff here, I think, with the old version of mlprof;
because we default to ChunkPerFunc for the native backend, this
coarsest-grain profiling can correspond to very large chunks of code.  To
get more detailed information, we crank up the detail option.  Here's a
snippet of -d 1:

60.61 seconds of CPU time
x_52                                                63.09%
     L_2330,L_889                           13.02%   8.22%
     L_2324,L_885                            7.95%   5.02%
     L_1349,L_2813                           7.95%   5.02%
     L_1937                                  6.69%   4.22%
     L_1359,L_2823                           4.68%   2.95%
     ... (a little less than 200 more)
x_178                                               13.03%
     L_115,L_1542                           43.80%   5.71%
     L_113,L_1539                           21.01%   2.74%
     L_105,L_1530                            5.19%   0.68%
     L_1518,L_95                             3.92%   0.51%
     L_100,L_1524                            3.16%   0.41%
     ...
main_0                                               7.99%
     L_1463,L_33                            48.14%   3.84%
     main_0                                 25.21%   2.01%
     L_33                                   12.60%   1.01%
     L_13                                    7.64%   0.61%
     L_12                                    3.72%   0.30%
     L_10                                    1.65%   0.13%
     L_60                                    0.41%   0.03%
     loop_8                                  0.21%   0.02%
     L_38                                    0.21%   0.02%
     L_0                                     0.21%   0.02%
...

With detail level 1, we start incorporating labels that were generated by
backend.fun.  Depth 0 labels correspond to CPS functions (which are passed
to blocks in the machine IL in the profileName field); depth 1 labels
correspond to labels on blocks in the machine IL.  A lot of these labels
were inherited directly from CPS IL continuation declarations, so you
should be able to trace back to the .cps output and correlate somewhat.
Combined labels, like L_2330,L_889, correspond to two blocks that were
combined by one of the native-codegen simplification passes.  Labels that
don't appear in the CPS IL were introduced by backend.fun in the
translation to the machine IL.  The two columns give percentages relative
to the containing depth 0 function and to the whole program.
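The per-label attribution and the two percentage columns can be sketched
like this (addresses, label names, and samples here are invented; this is
only an illustration of the bookkeeping, not mlprof's implementation):

```python
import bisect

# Attribute each sampled PC to the nearest preceding label, then report
# each block's share of its depth-0 function and of the whole run.
def attribute(samples, labels):
    """labels: list of (start_addr, block_label, function_label),
    sorted by start_addr.  samples: list of sampled PCs.
    Returns a dict mapping (function, block) -> tick count."""
    starts = [addr for addr, _, _ in labels]
    counts = {}
    for pc in samples:
        i = bisect.bisect_right(starts, pc) - 1
        if i < 0:
            continue  # PC falls below the first known label
        _, block, func = labels[i]
        counts[(func, block)] = counts.get((func, block), 0) + 1
    return counts

labels = [(0x100, "L_10", "main_0"), (0x140, "L_33", "main_0"),
          (0x200, "L_889", "x_52")]
samples = [0x104, 0x150, 0x152, 0x210, 0x220, 0x230]
counts = attribute(samples, labels)
total = sum(counts.values())
per_func = {}
for (func, _), n in counts.items():
    per_func[func] = per_func.get(func, 0) + n
# First column: share of the containing function; second: share of the run.
for (func, block), n in sorted(counts.items()):
    print(f"{func:8s} {block:8s} {100*n/per_func[func]:6.2f}% {100*n/total:6.2f}%")
```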

Finally, detail level 2 incorporates every label introduced by the native
codegen.

60.61 seconds of CPU time
x_52                                                        63.09%
     L_2330,L_889                                   13.02%   8.22%
          L_889,skipGC_125                 100.00%  13.02%   8.22%
     L_2324,L_885                                    7.95%   5.02%
          L_885,skipGC_123                 100.00%   7.95%   5.02%
     L_1349,L_2813                                   7.95%   5.02%
          L_1349,L_2813                    100.00%   7.95%   5.02%
     L_1937                                          6.69%   4.22%
          L_1938                            50.00%   3.35%   2.11%
          doGC_96                           50.00%   3.35%   2.11%
     L_1359,L_2823                                   4.68%   2.95%
          L_1359,L_2823                    100.00%   4.68%   2.95%

I don't know if this will be particularly useful, but it was easy enough
to compute.  We see more blocks that were combined by simplification
passes.  Labels like skipGC, doGC, and statementLimitCheckLoop are all
introduced by the native codegen, so they don't correspond to any real
code in the CPS IL.  On the other hand, they do give some insight into the
interaction with the runtime; for example, time spent in a skipGC block
will generally correspond to some allocation being done after performing a
GC check.  Of course, we'll need to look at real gmon.out files for any of
this to be valid.

I looked through a -d 2 profile without filtering out blocks with no
ticks, and depth 2 is almost always just one block corresponding to the
depth 1 block; when it isn't, it's because the block is broken up by a GC
check, an array-allocation loop, or an overflow check.

Let me know if this gives all the information we would like.  It should be
trivial to modify the gmon reader to work with whatever Henry cooks up.
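For reference, here's a minimal sketch of reading the histogram record from
a GNU-gprof-style gmon.out, assuming the 32-bit GNU layout ("gmon" cookie,
4-byte version, 12 spare bytes, then tagged records, where tag 0 is the
time histogram).  The format actually emitted on a given system may differ,
so treat this as an illustration of the structure, not a drop-in reader:

```python
import struct

# Assumes the 32-bit GNU gmon.out layout: a histogram record is tag byte 0
# followed by low_pc, high_pc, bin count, profiling rate, a 15-byte unit
# name, one abbreviation char, and then 16-bit sample bins.
def read_histogram(data):
    cookie, version = data[:4], struct.unpack("<I", data[4:8])[0]
    assert cookie == b"gmon" and version == 1, "unrecognized gmon.out header"
    off = 20                       # 4 cookie + 4 version + 12 spare bytes
    tag = data[off]; off += 1
    assert tag == 0, "expected a time-histogram record first"
    low_pc, high_pc, nbins, rate = struct.unpack_from("<IIII", data, off)
    off += 16 + 15 + 1             # skip the dimension name and abbreviation
    bins = struct.unpack_from("<%dH" % nbins, data, off)
    return low_pc, high_pc, rate, list(bins)

# Build a tiny synthetic gmon.out in memory to exercise the reader.
hdr = b"gmon" + struct.pack("<I", 1) + b"\0" * 12
hist = bytes([0]) + struct.pack("<IIII", 0x1000, 0x1010, 4, 100)
hist += b"seconds".ljust(15, b"\0") + b"s"
hist += struct.pack("<4H", 5, 0, 7, 0)
low, high, rate, bins = read_histogram(hdr + hist)
print(hex(low), hex(high), rate, bins)
```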