[MLton] fixing -codegen c -profile time for the release

Tue, 15 Nov 2005 20:47:16 -0800

Matthew's recent mail to MLton-user summarized our profiling situation
nicely.

  Time profiling is broken when the C-codegen is used with gcc 4.

To state the obvious:

  Time profiling works fine with the native codegen.
  Time profiling works fine with gcc 2 and 3.
  All non-time profiling works.

I state this because I want to keep clear what part of our world is
broken and why it's broken, so we don't try to fix things that work
fine.  Our basic approach to profiling is to propagate annotations
from the front end to the codegen, and ask the codegen to insert
labels into the executable, so that the runtime can map an IP
(instruction pointer) to a label and from there back to the
annotation.  This works flawlessly with the native codegen, because
our native codegen never duplicates the labels.  It fails with gcc 4,
which sometimes duplicates labels.  Furthermore, it could have failed
with gcc 2 or 3, at least according to the spec, but hasn't because
we've been lucky.
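
To make the label mechanism concrete, here is a minimal C sketch of
the runtime-side lookup.  It is not the actual MLton runtime code: the
ProfileLabel struct, its fields, and source_index_for_ip are
hypothetical names, and the sketch assumes a table of label addresses
sorted in ascending order, with each label occurring at exactly one
address.  That last assumption is the one that label duplication
violates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one per profile label emitted by the codegen. */
typedef struct {
    uintptr_t addr;       /* address of the label in the executable */
    int source_index;     /* index of the source annotation it stands for */
} ProfileLabel;

/*
 * Map a sampled instruction pointer to a source annotation index.
 * Returns the source_index of the label with the greatest address
 * that is <= ip, or -1 if ip lies before every label.
 * Assumes labels[] is sorted by addr in ascending order.
 */
int source_index_for_ip(const ProfileLabel *labels, size_t n, uintptr_t ip)
{
    assert(labels != NULL || n == 0);

    /* Binary search for the first entry whose addr is > ip. */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (labels[mid].addr <= ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* The entry just before that one, if any, covers ip. */
    return lo == 0 ? -1 : labels[lo - 1].source_index;
}
```

The lookup is only meaningful if the code following a label really
belongs to that label's annotation, which is what we cannot guarantee
once the C compiler is free to copy or move the labeled code.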

It seems to me that the ideal fix is to figure out how to tell GCC
exactly what we want, while inhibiting its optimization as little as
possible.  Unfortunately, as far as I can tell, no one has even
proposed a way to do the former, let alone while simultaneously
achieving the latter (in fact, this seems impossible).

Any solution that uses volatile asm is broken because gcc can
duplicate it.  So, my earlier proposal of adding -fno-tree-ch was
broken -- it happens to work with the particular gcc and case that we
tested, but is likely to break in the future.  It also has the
disadvantage of inhibiting gcc's optimization.  Matthew's tweaks to
{Declare,}ProfileLabel leave the volatile asm in place, and so may
also break.  I don't entirely understand Florian's separate-section
solution, but I still see the volatile asm, so I guess it would have
similar problems.

The right thing is to ask the gcc people more directly whether there
is an easy way to do what we want.  Florian, could you do this?

On the assumption that there is no good way to do what we want with
gcc, and based on the fact that anything we do there is relying on
gcc-specific trickery, perhaps we should go for a different approach.

Wesley proposed an approach based on debugging formats and
interpreting them at runtime (or perhaps at mlprof time).  That seems
to me to be headed in the wrong direction, as it ties us more closely
to the messy details of other systems and hence makes us less
portable.  It may make sense as a longer-term project, and for use in
debugging, but doesn't fit well with our desire to release ASAP.

Matthew proposed using another approach entirely -- to maintain the
current source position via program operations, which we already know
how to handle in the optimizer and every codegen.  Previously, I
rejected that approach because of its significant impact on
performance, and hence its perturbation of the profile data.  However,
the benchmarks that were done at the time and which supported this
view were all with the native codegen, not the C codegen.  Because the
C codegen has worse performance than the native codegen, the relative
impact of the approach is likely to be smaller.  Also, gcc is a very
different beast from our codegen, which could make the impact
different as well.
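
For reference, here is a rough C sketch of what maintaining the source
position via program operations could look like in generated code.
The names (current_source, enter_source, profile_tick, ticks_for) and
the fixed NUM_SOURCES are hypothetical, chosen for illustration only,
and this is not Matthew's actual implementation.  The idea is that the
generated code performs an ordinary store whenever it enters an
annotated region, and the timer signal handler charges each tick to
whatever position is current.

```c
#include <assert.h>
#include <signal.h>

#define NUM_SOURCES 3   /* hypothetical number of annotated source positions */

/* Index of the source position the program is currently executing. */
static volatile sig_atomic_t current_source = 0;

/* Ticks charged to each source position (zero-initialized). */
static volatile sig_atomic_t ticks[NUM_SOURCES];

/* What the codegen would emit on entry to an annotated region:
 * an ordinary store, with no labels and no asm involved. */
void enter_source(int idx)
{
    assert(idx >= 0 && idx < NUM_SOURCES);
    current_source = idx;
}

/* Timer signal handler (e.g. for SIGPROF): charge one tick to the
 * source position that was current when the signal arrived. */
void profile_tick(int signum)
{
    (void)signum;
    ticks[current_source]++;
}

/* Read back the tick count for one source position. */
long ticks_for(int idx)
{
    assert(idx >= 0 && idx < NUM_SOURCES);
    return (long)ticks[idx];
}
```

On a POSIX system the handler would typically be installed with
sigaction for SIGPROF and driven by setitimer with ITIMER_PROF.  If the
C compiler duplicates a region, it duplicates the store along with it,
so the attribution stays correct.  The stores themselves are the
runtime cost discussed above.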

So, my conclusion is that we should go with Matthew's approach for the
release.  But it should only be used when time profiling with the C
codegen -- there's no reason to hurt profiling in other situations.
The simplicity, robustness, and portability of the approach outweigh
the performance impact in this one case.