[MLton] Crashes with 64-bit native code generator on Windows

Mon Nov 23 17:39:30 PST 2009

Hello Wesley,

Wesley W. Terpstra wrote:
> Sorry for the slow reply.
>
> On Wed, Nov 11, 2009 at 5:10 AM, David Hansel
> <hansel at reactive-systems.com> wrote:
> >> The code below does cause the crash when called from within our (large)
>> code.  It does not produce a crash when called within a small example.
>> As mentioned,  this is the only FFI call that is actually called by
>> the code.  We do have to include another function making an FFI call
>> in order to make the crash happen.  However,  that call is never executed
>> before the crash.  It could be executed some time later.  If that second
>> FFI code is not present,  the crash does not happen.
>
> I went ahead and tried to build it.
> [...]
> The resulting program worked. Are you using similar compile options?

We are using Microsoft Visual C++ to create our DLLs so (just in case there
was some obscure compiler setting that we were missing) I gave it a try
and compiled our DLL with gcc -- which didn't change anything.  Our MLton
command line is:

mlton @MLton gc-summary hash-cons 1.0 --
  -target x86_64-w64-mingw32
  -codegen native -profile no -profile-stack false -const 'Exn.keepHistory false'
  -drop-pass deepFlatten -link-opt -ldl -output foo.exe -verbose 2 foo.mlb

I do not know exactly what the '-drop-pass deepFlatten' does but it was put
in by Stephen Weeks back in 2006 when he assisted us in making our code
compile with MLton.  If I remember correctly there was a compiler performance
issue.  However,  as you said before, optimization settings are probably not
the problem here.

> In the time since your last post have you perhaps found a more
> complete crash example?

Unfortunately no.  I have been trying but the crash goes away anytime I cut
down the code some more to produce a smaller example.

>> My question basically is this:  do you have any suggestions on how to
>> debug this any further?  Any MLton command-line options for debugging?
>
> Well, there's -debug true, but gdb under 64-bit windows is so flakey I
> wouldn't bother trying that. In fact, the MLton.msi doesn't include
> the debug version of the runtime (it is over 200MB due to the windows
> debugging format), so you would need to build MLton from source to get
> the debug library. I doubt it would help you, though.

That's unfortunate.

>> Are there any optimization passes that we should try to disable?
>
> I doubt this is an optimization problem.
>
>> Do you know of any caveats that we might have missed when creating our
>> DLLs?
>
> Ok, here are the things I can think of off the top of my head:
> 0) You're loading a 32-bit dll instead of a 64-bit one. Double check.

Double- and triple-checked that.

> 1) Windows might require a stack alignment that doesn't match the
> amd64 FFI codegen. Your program happens to end up with bad alignment,
> and my programs have just never been unlucky. You could declare a
> volatile local 64-bit variable and printf its address in the C code.
> See if the offset of this variable fails to be 64-bit aligned (only)
> in the failing programs.

An alignment problem or something similar is what I suspect,  too.
Creating a local variable won't help because the process dies even
before the first time it enters the code in the DLL,  so any printf
in there will not happen before the crash.
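
For reference, the probe described in (1) would look roughly like the
sketch below.  The function name stack_probe_offset is made up for
illustration and is not part of our DLL or of MLton.  Win64 expects the
stack to be 16-byte aligned at call sites, so the sketch reports the
offset modulo 16 as well as modulo 8.  Since the compiler decides where
in the frame the local lives, only a difference between the failing and
the working programs would mean anything, and as noted above the call is
never reached in the crashing case.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of MLton or of our DLL): prints the
   address of a 64-bit stack local and returns that address modulo 16. */
unsigned stack_probe_offset(void)
{
    volatile uint64_t probe = 0;            /* 64-bit local on the stack */
    uintptr_t addr = (uintptr_t)&probe;     /* its address as an integer */
    unsigned offset = (unsigned)(addr % 16u);

    printf("probe at %p, offset mod 16 = %u, mod 8 = %u\n",
           (void *)&probe, offset, (unsigned)(addr % 8u));
    fflush(stdout);                         /* flush before any later crash */
    return offset;
}
```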

> 2) The __stdcall is confusing gcc. There is only one calling
> convention under win64. Try specifying nothing.

I've tried with and without.  No difference.

> However, I am guessing blind! Without a way to reproduce this I can't
> really help. I've used the FFI quite heavily under win64 in one of our
> recent projects without problems, so FFI definitely works most of the
> time. It's possible you've found a corner case, which can often be an
> alignment problem.
> Is the program really too secret to release the buggy part of its
> source code? MLton is free. ;)

It's good to hear that the FFI has been tested in win64.  I completely
understand about the guessing,  we do have the same problems with our
customer (our product not working with their code,  can't send the code).
Unfortunately since this is a commercial application and we do have to
include a large part of our code to make the crash happen I can definitely
not post the code to the list.  If we can't figure this out otherwise we
might be able to set up an NDA with you so we could send the code to
you in private.

One thing I can make available is the executable that actually experiences
the crash as well as the MLton-produced assembly code.  I don't know what
kind of debugging tools you have available and whether that would be
any help.  Please let me know.

There are two observations that I have made since my last post.  They
may or may not be related to the actual problem but I thought I'd
mention them anyways:

I was looking into what could be causing the problem and came across
file MLton/lib/mlton/sml/mlnlffi-lib/memory/linkage-libdl.sml which
is of course used by the FFI.  I wasn't completely sure what the "era"
deal in that code is,  so I changed the body of function "get" to just
"f()",  resolving the FFI function's address before every call.  After
that change,  all crashes were gone.  Furthermore,  changing the body
of "get" to just "a" does NOT fix the crashes.  That looked good
so I added some "print" statements in "get" to see whether there is
a problem with the address not being resolved properly.  Unfortunately,
just adding the "print" statements also made the crashes go away. In
fact,  just adding 'print "";' at the beginning of "get" eliminates
the crashes.  Interestingly,  this eliminates the crashes completely.
With other changes in our code I was able to eliminate some instances
of the crashes but new ones would pop up at other places.  I suspect
that the proximity of this code to the actual FFI calls might play
a role in that.
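
To make the change above concrete: assuming "get" follows the usual
cache-and-revalidate pattern (return the cached address "a" while the
era is unchanged, otherwise call "f" and remember the result), the idea
looks roughly like the C sketch below.  This is not the code from
linkage-libdl.sml, and all names in it (cached_addr, cached_get,
current_era, demo_resolve) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical resolver type: looks up a symbol address, e.g. via dlsym. */
typedef void *(*resolver_fn)(void);

/* Assumed to change whenever previously resolved addresses become stale. */
unsigned current_era = 1;

struct cached_addr {
    void       *addr;     /* last resolved address ("a") */
    unsigned    era;      /* era in which addr was resolved */
    resolver_fn resolve;  /* how to resolve it again ("f") */
};

/* Return the cached address, re-resolving only when the era has changed. */
void *cached_get(struct cached_addr *c)
{
    if (c->era != current_era) {
        c->addr = c->resolve();
        c->era  = current_era;
    }
    return c->addr;
}

/* Stand-in resolver for demonstration: counts how often it is called. */
int demo_target;
int demo_resolve_calls = 0;

void *demo_resolve(void)
{
    demo_resolve_calls++;
    return &demo_target;
}
```

Replacing the body of cached_get with "return c->resolve();" corresponds
to my "f()" change, and replacing it with "return c->addr;" corresponds
to the "a" change.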

I gave the "Debugging Tools for Windows" debugger a try and loaded
the crashing executable there.  With that,  I was able to track the
crash in our simplest example to the following assembly code:

00000000`0054b2c8 4c897df8        mov     qword ptr [rbp-8],r15
00000000`0054b2cc 48892d3d408000  mov     qword ptr [rsim4c_mlton!MLton_main+0x402901 (00000000`0080403d)],rbp
00000000`0054b2d3 4c892526408000  mov     qword ptr [rsim4c_mlton!MLton_main+0x4028ea (00000000`00804026)],r12
00000000`0054b2da ff15683d8000    call    qword ptr [rsim4c_mlton!MLton_main+0x40262c (00000000`00803d68)] ds:00000000`00d4f048=0000000000000000

Note the "=0" at the end.  The crash happens because the target address
that the indirect call loads is 0,  which could be some hint but I don't
know how to look into this any further.  Do you have a suggestion how to
track this back to MLton's assembly output or even to the original ML code?
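
One detail that may help in reading the last line of the listing: the
bytes "ff 15" encode an indirect call with a RIP-relative 32-bit
displacement, so the pointer is loaded from the address of the next
instruction plus the displacement.  That matches the ds: address shown
by the debugger rather than the MLton_main+... label.  A small C check
of the arithmetic, with a helper name (rip_relative_target) that is made
up for illustration:

```c
#include <stdint.h>

/* Effective address of a RIP-relative operand: the address of the
   following instruction plus the sign-extended 32-bit displacement. */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                             int32_t disp32)
{
    uint64_t next_insn = insn_addr + insn_len;
    return next_insn + (uint64_t)(int64_t)disp32;
}
```

With the values from the call above (address 0x54b2da, length 6 bytes,
displacement 0x00803d68) this gives 0xd4f048, which is the slot the
debugger reports as holding 0.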

Best regards,

David

-- 
  ----------------------------------------------------------
  David Hansel
  http://www.reactive-systems.com/
  OpenPGP (GnuPG) public key file:
  http://www.reactive-systems.com/~hansel/pgp_public_key.txt
  ----------------------------------------------------------