[MLton] Vector.fromList[] performance

Stephen Weeks sweeks at sweeks.com
Wed Oct 25 14:54:47 PDT 2006


> I often find myself converting tuples (x,y) into vectors using
> Vector.fromList[x,y]. 
...
> Unfortunately, the performance of this utterly sucks
...
> In the program below, I notice a 10x degradation over C.
...
> SML/NJ allows this using the #[] construct, which helps SML/NJ
> on this test program knock the pants off mlton.

You see a ratio for MLton-to-C of 11.2 and a ratio for SML/NJ-to-C of
3.0.  On my machine, an x86, I see a ratio for MLton-to-C of 11.1 and
a ratio for SML/NJ-to-C of 4.1.  So, we're seeing roughly the same
numbers.

I'm not sure if your benchmark measures what you think, though.  I
believe that the reason SML/NJ does so much better than MLton is
that it is not allocating *anything* for the vectors.  Because of
the use of the #[...] syntax, SML/NJ does indeed optimize the vector
creation, avoiding creation of the list.  Furthermore, its optimizer
can see the creation of an immutable object, the extraction of a field
from that object, and that the object is not returned.  So, it can
simply drop creation of the object altogether, holding on to the
fields until the Vector.sub is complete.
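
To make the pattern concrete, here is a made-up fragment (not your
benchmark) of the kind of code where that optimization applies: the
vector never escapes and only one of its elements is read, so the
#[...] allocation can be dropped and just the fields kept.

  ------------------------------------------------------------
  (* Hypothetical fragment, using SML/NJ's #[...] syntax.  The vector
     is built, a single element is read, and the vector itself never
     escapes, so an optimizer may drop the allocation and keep just
     the three fields. *)
  fun pick (i, j) =
     let
        val v = #[i, i + 1, i * j]
     in
        Vector.sub (v, j mod 3)
     end
  ------------------------------------------------------------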

To test this hypothesis, I changed your program to the following one
(I added another factor of ten to the iteration count to get longer
running times).

  ------------------------------------------------------------
  val n = 100
  val a = Array.array (n, #[0, 0, 0, 0, 0])
  fun vectors (i, sum) = let
     val () = Array.update (a, sum mod n, #[i, 1+sum, 5*sum*sum, 6*i, 2*i])
     val v = Array.sub (a, sum mod n)
     val sumadd = Vector.sub(v, sum mod 5)
     val sum = (sum + sumadd) mod 763
  in
     if i = 1 then sum else vectors (i-1, sum)
  end
  fun doit () = print (Int.toString (vectors (100 * 1000 * 1000, 0)) ^ "\n")
  ------------------------------------------------------------

This program also creates one vector per iteration of the loop.
However, it stashes that vector in an array and immediately pulls it
out.  That is enough to confuse both MLton's and SML/NJ's optimizers so
that they can't tell that the vector is the one that was just created.
On my machine, this slows down the SML/NJ-compiled program by more
than a factor of two (from 5.6s to 12.6s).

I tried three more experiments with variants of the program above; a
rough sketch of each follows the list:

  1. a 5-element array instead of #[...]
  2. Vector.fromList instead of #[...]
  3. a 5-element tuple instead of #[...]
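
Roughly, the element construction in each variant looks like the
following; this is my reconstruction of the shape of the code, not
the exact benchmark source, and the subscripting on the other end
changes to match (Array.sub, Vector.sub, or a case over the tuple
fields).

  ------------------------------------------------------------
  (* Reconstruction of the three variants' element construction, as
     it would appear inside the vectors loop where i and sum are in
     scope; not the exact benchmark code. *)
  val e1 = Array.fromList  [i, 1+sum, 5*sum*sum, 6*i, 2*i]   (* 1. array  *)
  val e2 = Vector.fromList [i, 1+sum, 5*sum*sum, 6*i, 2*i]   (* 2. vector *)
  val e3 =                 (i, 1+sum, 5*sum*sum, 6*i, 2*i)   (* 3. tuple  *)
  ------------------------------------------------------------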

Those three are all SML, and so I could compare MLton and SML/NJ.
Here are the running times (in seconds) that I saw.

     SML/NJ  MLton
  1.   24.8   17.7
  2.   18.7   21.2
  3.   13.4   11.4

This seems consistent with my hypothesis.  Once you prevent SML/NJ's
optimizer from eliminating the vector, it has to allocate it, just
like MLton, and performs roughly the same.  Furthermore, the running
times for the tupled version (13.4 and 11.4) are roughly comparable to
SML/NJ's version with #[...] (12.6).

As another test, I changed the representation of vectors in your
program by prefixing it with the following

  ------------------------------------------------------------
  structure Vector = struct
     datatype 'a t = T of 'a * 'a * 'a * 'a * 'a
     val make5 = T
     fun sub (T t, i) =
        case i of
           0 => #1 t
         | 1 => #2 t
         | 2 => #3 t
         | 3 => #4 t
         | 4 => #5 t
         | _ => raise Fail "sub"
  end
  ------------------------------------------------------------

With this change, the running times are now the following:

     SML/NJ  MLton
       8.9     5.5

So, MLton completely optimized away the tuple allocation, just as
SML/NJ did with the #[...] version, and now runs about 3 times faster,
i.e. at the same speed as SML/NJ with the #[...].  I'm not sure why
switching to tuples slows SML/NJ down, though.

All right, so that explains the differences between MLton and SML/NJ
on this example.  What are my conclusions?

It would be helpful for MLton to optimize Vector.fromList, although
not by as much as your initial benchmark might lead one to think.
Comparing variants (2) and (3) of my modified version of your program
shows that switching from Vector.fromList to tuples yielded a 2x speedup.
That would be nice to get.  I did look into whether MLton could
optimize Vector.fromList using its current infrastructure, and
Matthew's comment is spot on.

> 'Vector.fromList' is implemented in SML, so the compiler doesn't
> treat it specially.  Trying to identify exactly this code would be
> pretty tough.

For anyone who wants to see what MLton sees, I've attached the SSA
control-flow graph for the original "vectors" function, which uses
Vector.fromList.  The relevant blocks in the SSA do the following (a
rough source-level sketch follows the list):

  1. cons up the list (L_178)
  2. loop to compute the length of the list (L_179, L_180)
  3. allocate an array (L_184)
  4. loop through the list, filling in the array (L_186, L_187)
  5. convert the array to a vector (L_188)
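
For comparison, here is a hand-written, source-level sketch of what
those steps amount to.  This is not MLton's actual basis code, and
step 1 (consing up the list) happens at the call site, but the shape
is the same.

  ------------------------------------------------------------
  (* Source-level sketch of a fromList; not the real basis
     implementation.  Array.vector copies here, whereas MLton's basis
     can convert the array to a vector without copying. *)
  fun fromList (l: 'a list): 'a vector =
     case l of
        [] => Vector.tabulate (0, fn _ => raise Fail "empty")
      | x :: _ =>
           let
              val n = List.length l       (* loop to compute the length *)
              val a = Array.array (n, x)  (* allocate an array *)
              val _ =                     (* fill in the array from the list *)
                 List.foldl (fn (x, i) => (Array.update (a, i, x); i + 1)) 0 l
           in
              Array.vector a              (* convert the array to a vector *)
           end
  ------------------------------------------------------------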

So, it's pretty much out of the question to turn things into a direct
vector construction by that point.  On the other hand, adding a
Vector_fromList primitive, implementing Vector.fromList in the basis
via the primitive, and treating the primitive specially in the
optimizer shouldn't be too hard.  With a primitive, the optimizer
would simply have to notice that the argument to fromList is a
directly constructed list.
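
In source terms, the effect would be that a call like
Vector.fromList [x, y] turns into a direct two-element construction
along the lines of the following sketch (the real transformation
would happen on the SSA once the primitive is in place).

  ------------------------------------------------------------
  (* Sketch of the intended rewrite, rendered at the source level:
     the directly constructed list disappears, and the vector is
     built from the elements themselves. *)
  fun make2 (x, y) =
     let
        val a = Array.array (2, x)
        val () = Array.update (a, 1, y)
     in
        Array.vector a
     end
  ------------------------------------------------------------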

Anyways, since MLton doesn't currently optimize Vector.fromList, the
only thing I can think of to improve your current program is to
implement your own vectors, with special cases for construction of
small ones.  That is, implement something like the following
signature.

  ------------------------------------------------------------
  signature VECTOR = sig

     type 'a t

     val make0: unit -> 'a t
     val make1: 'a -> 'a t
     val make2: 'a * 'a -> 'a t
     val make3: 'a * 'a * 'a -> 'a t
     val make4: 'a * 'a * 'a * 'a -> 'a t
     val make5: 'a * 'a * 'a * 'a * 'a -> 'a t
     val sub: 'a t * int -> 'a
     val tabulate: int * (int -> 'a) -> 'a t

  end
  ------------------------------------------------------------

I can see a couple of different ways to implement VECTOR.  You could
use a sum type that uses tuples to represent small vectors.  Something
like the following.

  ------------------------------------------------------------
  structure Vector:> VECTOR = struct
     
     datatype 'a t =
        V0
      | V1 of 'a
      | V2 of 'a * 'a 
      | V3 of 'a * 'a * 'a
      | V4 of 'a * 'a * 'a * 'a
      | V5 of 'a * 'a * 'a * 'a * 'a
      | VN of 'a vector
  
     fun make0 () = V0
     val make1 = V1
     val make2 = V2
     val make3 = V3
     val make4 = V4
     val make5 = V5
  
     fun tabulate (n, f) =
        case n of
           0 => V0
         | 1 => V1 (f 0)
         | 2 => V2 (f 0, f 1)
         | 3 => V3 (f 0, f 1, f 2)
         | 4 => V4 (f 0, f 1, f 2, f 3)
         | 5 => V5 (f 0, f 1, f 2, f 3, f 4)
         | _ => VN (Vector.tabulate (n, f))
  
     fun sub (v, i) =
        case v of
           V0 => raise Subscript
         | V1 a =>
              if i = 0 then a else raise Subscript
         | V2 (a0, a1) =>
              (case i of
                  0 => a0
                | 1 => a1
                | _ => raise Subscript)
         | V3 (a0, a1, a2) =>
              (case i of
                  0 => a0
                | 1 => a1
                | 2 => a2
                | _ => raise Subscript)
         | V4 (a0, a1, a2, a3) =>
              (case i of
                  0 => a0
                | 1 => a1
                | 2 => a2
                | 3 => a3
                | _ => raise Subscript)
         | V5 (a0, a1, a2, a3, a4) =>
              (case i of
                  0 => a0
                | 1 => a1
                | 2 => a2
                | 3 => a3
                | 4 => a4
                | _ => raise Subscript)
         | VN v => Vector.sub (v, i)
  
  end
  ------------------------------------------------------------

This representation even lets MLton optimize away local allocations,
just as SML/NJ did in your original code.  It does, however, cost an
additional layer of boxing for large vectors, plus some additional
dispatching.

Another possibility is to implement VECTOR using arrays, relying on
the signature to hide the mutability.

  ------------------------------------------------------------
  structure Vector1:> VECTOR = struct
  
     datatype 'a t = T of 'a array
  
     fun tabulate (n, f) = T (Array.tabulate (n, f))
  
     fun sub (T a, i) = Array.sub (a, i)
  
     fun make0 () = T (Array.tabulate (0, fn _ => raise Fail "make0"))
  
     fun make1 a = T (Array.array (1, a))
  
     fun make2 (a0, a1) = let
        val a = Array.array (2, a0)
        val () = Array.update (a, 1, a1)
     in
        T a
     end
  
     fun make3 (a0, a1, a2) = let
        val a = Array.array (3, a0)
        val () = Array.update (a, 1, a1)
        val () = Array.update (a, 2, a2)
     in
        T a
     end
  
     fun make4 (a0, a1, a2, a3) = let
        val a = Array.array (4, a0)
        val () = Array.update (a, 1, a1)
        val () = Array.update (a, 2, a2)
        val () = Array.update (a, 3, a3)
     in
        T a
     end
  
     fun make5 (a0, a1, a2, a3, a4) = let
        val a = Array.array (5, a0)
        val () = Array.update (a, 1, a1)
        val () = Array.update (a, 2, a2)
        val () = Array.update (a, 3, a3)
        val () = Array.update (a, 4, a4)
     in
        T a
     end
  
  end
  ------------------------------------------------------------
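
With either structure in place, call sites change along these lines
(a made-up example using the first structure's name; with the second
implementation, substitute Vector1 for Vector).

  ------------------------------------------------------------
  (* Hypothetical call site: the components are passed directly to
     make2 instead of being consed into a list for Vector.fromList. *)
  fun fromPair (x, y) = Vector.make2 (x, y)  (* was: Vector.fromList [x, y] *)
  fun first v = Vector.sub (v, 0)
  fun second v = Vector.sub (v, 1)
  ------------------------------------------------------------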

Either of these approaches could benefit from a vector-literal syntax
patterned after

  http://mlton.org/ArrayLiteral


Hopefully this helps some.  Let us know if you're able to recover some
of the speed.


