SML numerical benchmark and MLton

Mon, 19 Jul 1999 14:22:46 -0700 (PDT)

> 1) SML/NJ is producing a big overhead when compiling functors. I
> have not delved seriously into the compiler internals, but using
> a functor seems to generate a set of generic (!) procedures,
> much like the overhead implied by virtual methods in C++. Since
> MLton does produce a different compilation unit for each module
> this overhead is suppressed.

Yes, when you use the compilation manager, SML/NJ compiles each
file separately, which means generating generic code with boxing
and no inlining across files.  Matthias Blume did some work on
cross-module inlining as part of his thesis, but I don't know whether
it ever made it into the mainline SML/NJ development, and I don't
think it ever showed tremendous benefit.

> 2) SML/NJ does not inline the numerical code for complex
> numbers. Even though I was able to suppress the functor overhead
> by using "grep" and "sed" and manually generating modules,
> SML/NJ still refuses to inline the small operations of addition,
> product, etc, that involve complex numbers.

I'm surprised that SML/NJ won't inline the really small functions
given the chance.

In any case, since you're using sed and grep, I thought you might be
interested in a couple of tricks that I use to speed up programs under
SML/NJ.  First, you can get SML/NJ to compile the whole program at once
by wrapping it in a local declaration:

	val _ = SMLofNJ.Internals.GC.messages false ;   
	local
	<contents of _test.sml>
	in
	end

The GC.messages line is just there to keep the GC from spitting out so
much drivel.  More importantly, the local declaration causes SML/NJ to
compile the whole program as one unit, which allows it to do
cross-module optimizations.  In principle, this means that SML/NJ could
do as well as MLton.  In particular, functors get turned into lambdas
in one of SML/NJ's intermediate languages, and the compiler could
choose to inline them.  In practice, SML/NJ isn't as aggressive as
MLton about inlining or unboxing, and it also has trouble because the
results of type inference on the original program trickle down to later
phases of the compiler.
For example, if they see
  functor F() = struct fun 'a f(x: 'a) = x end
the original type inference pass records that f is polymorphic, and
it is then difficult to specialize f to particular types later on in
compilation, even if f is only used at one type.
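
To make the specialization point concrete, here is a small hand-written
sketch.  The names S, n, fInt and n' are made up for illustration, and
the second half shows what one would like a compiler to produce, not
something either compiler is known to emit.

```sml
(* The functor from the example above, applied once. *)
functor F() = struct fun 'a f (x: 'a) = x end
structure S = F()

(* The only use of f in the whole program is at type int. *)
val n = S.f 3

(* What a specializer would ideally produce: a monomorphic copy of f
 * that can work on unboxed ints.  fInt is a hypothetical name. *)
fun fInt (x: int): int = x
val n' = fInt 3
```

Both n and n' are 3; the difference is only in how much the compiler
knows about the argument type when it generates code for the function.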

Another option is MLton's defunctorizer (mlton -d), which takes
an SML program with modules and transforms it into a "core" SML program
without modules by duplicating the body of each functor for each
functor application.  I have sometimes seen as much as a factor of 2
or 3 speedup from defunctorization.
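
As a rough illustration of what defunctorization means, here is a
hand-written sketch.  It is not actual mlton -d output, and every name
in it (NUM, Double, IntNum, RealNum, doubleInt, ...) is invented for
the example.

```sml
(* Before: one functor, applied at two different structures. *)
signature NUM =
   sig
      type t
      val add: t * t -> t
   end

functor Double(N: NUM) =
   struct
      fun double (x: N.t): N.t = N.add (x, x)
   end

structure IntNum =
   struct
      type t = int
      fun add (x: int, y: int): int = x + y
   end

structure RealNum =
   struct
      type t = real
      fun add (x: real, y: real): real = x + y
   end

structure DI = Double(IntNum)
structure DR = Double(RealNum)

(* After: the functor body is duplicated once per application, with N
 * replaced by the actual argument, leaving only core declarations. *)
fun intAdd (x: int, y: int): int = x + y
fun realAdd (x: real, y: real): real = x + y
fun doubleInt (x: int): int = intAdd (x, x)
fun doubleReal (x: real): real = realAdd (x, x)
```

Once the copies exist, each one has concrete types, so small functions
like intAdd and realAdd become easy targets for inlining and unboxing.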

> I have attached you a newer version. It is very alpha stage,
> since I have rewritten it from scratch while I was far from
> home. Now MLton gets 44 ms and SML/NJ 110 ms on a +* test.

Here are the numbers that I get for your +* benchmark on 300x300
complex tensors.  The factor of about 4 between NJ whole and MLton is
more in line with what I am used to seeing on numerical benchmarks.

                time (s)
NJ separate     25.696
NJ whole        16.188
NJ defunc       15.657
MLton            4.401

BTW, in testing out the defunctorizer on your code, I found a bug in
MLton's implementation of the basis library (the bug only affects the
defunctorizer).  If you want to use the 1999-7-12 version, you will
need to replace basis-library/arrays-and-vectors/mono-array.sml with
the file at the end of this message, remove lib/world.mlton, and then
run the following from within the src directory:
	make ../lib/world.mlton

--------------------------------------------------------------------------------

(* Copyright (C) 1997-1999 NEC Research Institute.
 * Please see the file LICENSE for license information.
 *)
functor MonoArray(V: MONO_VECTOR):> MONO_ARRAY =
   struct
      structure Vector = V
      type elem = V.elem
      open Array
      type array = elem array
   end

structure Word8Array = MonoArray(Word8Vector)
structure CharArray = MonoArray(CharVector)
structure BoolArray = MonoArray(BoolVector)
structure IntArray = MonoArray(IntVector)
structure RealArray = MonoArray(RealVector)
structure Real64Array = RealArray