OCAML bignums

Stephen Weeks MLton@sourcelight.com
Thu, 16 Nov 2000 10:13:27 -0800 (PST)


Here is a post I made to comp.lang.functional about comparing bignums in OCAML
and MLton.

--------------------------------------------------------------------------------

> This is dreadful! Magma takes 0.611 seconds for this big
> integer computation, meaning OCaml is 10 times slower
> than Magma.
> This surprises me since OCaml's performance in general
> seems excellent. It also surprises me since I don't
> understand how Num can be fast and Big_int slow ...?
>  
> [my platform is a big Sparc machine running Solaris, all code
> compiled with unsafe option, O4 optimisation passed to gcc,
> inlining set aggressively, etc]

I tried out your program on my 400 MHz P6.  Here is the command line I
used to build the executable.

	ocamlopt -inline 100 -ccopt -O4 -unsafe -cc gcc -cclib	\
		-lnums unix.cmxa nums.cmxa z.ml

Here are the times it printed.  They are similar to what you saw on
the Sparc.

0.120u 0.000s 0.12 99.1%
1.210u 0.000s 1.20 100.5%
6.790u 0.000s 6.79 100.0%

As a reminder, 0.12 is the time for the normal integer loop, 1.21 is
the time for the Num loop, and 6.79 is the time for the Big_int loop.

As a comparison, I tried the integer and big integer computations in
SML (there are no standard arbitrary-precision rationals) using MLton
(http://www.sourcelight.com/MLton).  I made one tweak so that the
computation runs for 1E7 instead of 1E6 iterations.  I also added
print statements and array update statements to make sure the result
is correct and not optimized away.  Here is the SML code.

fun timeit (f, a, toString) =
   let
      fun time() =
	 let val {utime, stime, ...} = Posix.ProcEnv.times ()
	 in Time.+ (utime, stime)
	 end
      val t = time ()
      val r = f a
      val t = Time.- (time (), t)
      val _ = print (concat ["result is ", toString r, "\n",
			     "time is ", Time.toString t, "\n"])
   in ()
   end
fun timebasic (f, xo, a) = Array.foldl f xo a
val d = 10000000
val a = Array.array (d, 1)
val _ = Array.update (a, 0, 2)
val _ = timeit (timebasic, (op +, 0, a), Int.toString)
val a = Array.array (d, 1)
val _ = Array.update (a, 0, 2)
val _ = timeit (timebasic, (IntInf.+, 0, a), IntInf.toString)

This was compiled with the command line "mlton z.sml".  Here are the results.

result is 10000001
time is 0.410
result is 10000001
time is 0.650

So, the penalty in MLton for using big integers is less than a factor
of 2, while it was more than a factor of 50 in OCAML.  As John Prevost
pointed out, the hit in OCAML is due to Big_int always using a less
efficient representation, even for small ints.  But the hit in OCAML
for using Num, which uses the efficient representation for small
integers, is still a factor of 10, which is pretty high.  Another
strangeness is that the integer loop runs about 3 times faster in
MLton than in OCAML (10 times the computation in 3 times the time) and
the big integer loop runs about 100 times faster (10 times the
computation in one tenth the time).  I guess we have an explanation
for the big integer loop, but I don't have one for the normal integer
loop.  Also, even the Num code in OCAML, which uses the specialized
representation for small integers, is 20X slower than the big integer
code in MLton (one tenth the computation in twice the time).
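
To spell out the arithmetic behind those ratios, here is a
back-of-the-envelope sketch in SML, using only the loop counts (1E6
for the OCAML runs, 1E7 for the MLton runs) and the times reported
above; the bindings are just for illustration, not new measurements.

(* Per-iteration times, computed from the figures in this message. *)
val ocamlInt    = 0.12 / 1.0e6   (* ~1.2e-7 s per iteration              *)
val mltonInt    = 0.41 / 1.0e7   (* ~4.1e-8 s, so about 3 times faster   *)
val ocamlBigInt = 6.79 / 1.0e6   (* ~6.8e-6 s                            *)
val mltonBigInt = 0.65 / 1.0e7   (* ~6.5e-8 s, so about 100 times faster *)
val ocamlNum    = 1.21 / 1.0e6   (* ~1.2e-6 s, about 20X mltonBigInt     *)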

FYI, here is how MLton handles big integers.  Integers in the range
[-2^31, 2^31) are represented unboxed, with a single tag bit to
distinguish them from pointers.  Larger integers are represented as a
pointer to a data structure in the GNU multiprecision library (gmp).
Integer addition first does a test on its inputs, and if both are
small, does the addition without calling gmp.  So, for this benchmark,
all of the additions happen inline on unboxed integers.
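
To make that concrete, here is a rough SML-level sketch of the
small/large dispatch.  It is only an illustration of the idea; MLton
does this on tagged machine words in the runtime, and the datatype,
function, and bound below are made up for the example.

(* Illustration only: Small stands in for an unboxed, tagged integer *)
(* and Large for a pointer to a gmp bignum.                          *)
datatype big = Small of int | Large of IntInf.int

val bound = 536870912  (* 2^29, an illustrative bound on Small values *)

fun add (Small x, Small y) =
      let val s = x + y
      in if ~bound <= s andalso s < bound
            then Small s                      (* fast path: no gmp call *)
         else Large (IntInf.fromInt s)        (* sum left the small range *)
      end
  | add (x, y) =
      let fun toInf (Small i) = IntInf.fromInt i
            | toInf (Large i) = i
      in Large (IntInf.+ (toInf x, toInf y))  (* slow path: bignum add *)
      end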

Anyways, I'd be interested to see the OCAML performance with a more
efficient big integer package.

> But this still doesn't explain why adding two arrays
> containing a million Nums each (all 1) was so slow compared
> to adding a million Nums (all 1). This difference in speed
> suggests some performance problem with arrays.

My (complete) guess is that there is some overhead due to the fold,
possibly from missed inlining.  Perhaps that same overhead also
explains why the OCAML integer loop is 3X slower than in MLton?
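
One way to test that guess would be to time a hand-written loop
against the fold, along these lines (an untested sketch; sumLoop is
not part of the benchmark above):

(* Untested sketch: sum the array with an explicit loop instead of *)
(* Array.foldl, to see how much of the time is fold overhead.      *)
fun sumLoop a =
   let
      val n = Array.length a
      fun go (i, acc) =
         if i = n then acc else go (i + 1, acc + Array.sub (a, i))
   in
      go (0, 0)
   end

Timing timeit (sumLoop, a, Int.toString) against the timebasic runs
above should show whether the fold accounts for the extra time.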