slow matrix multiply

Matthew Fluet mfluet@intertrust.com
Tue, 10 Jul 2001 19:37:25 -0700 (PDT)



Here's annotated assembly for my working version.  I stress working;
there are some obvious problems with what's here:

> fun loop (k, sum) =
>    if k < 0
>      then sum
>    else loop (k - 1, sum + sub (m1, i, k) * sub (m2, k, j))
>

loop_42:
# Assume sum -> %ebx
# Assume j -> %ecx
# Assume ??? -> %edx
# Assume k -> %ebp
# Assume i -> %esp
# Assume frontier -> %esi
# Assume stackTop -> %edi
	cmpl $0,%ebp			# if k < 0
	jl L_369
L_370:
	movl %ebp,%eax			# %eax = k
	movl %esi,(gcState+12)		# gcState.frontier = %esi
	movl %ebp,%esi			# %esi = k
	decl %eax			# %eax = k - 1
	movl %eax,(200*1)(%edi)         # store k - 1 (spill)
	movl %esp,%eax			# %eax = i
	movl %edx,(172*1)(%edi)         # store ??? (spill)
	movl %esp,(188*1)(%edi)         # store i (spill)
	movl $30,%esp			# %esp = 30
	imull %esp			# %esp = i * 30
	addl %eax,%esi                  # %esi = i * 30 + k
	movl (144*1)(%edi),%esp         # %esp = m1
	movl %ebx,(196*1)(%edi)         # store sum (spill)
	movl (%esp,%esi,4),%ebx         # %ebx = sub(m1, i, k)
	movl %ebp,%eax                  # %eax = k
	movl $30,%ebp                   # %ebp = 30
	imull %ebp                      # %eax = k * 30
	addl %ecx,%eax                  # %eax = k * 30 + j
	xchgl %ebx,%eax                 # %eax = sub(m1, i, k), %ebx = k * 30 + j
	movl (160*1)(%edi),%ebp         # %ebp = m2
	imull (%ebp,%ebx,4)             # %eax = sub(m1, i, k) * sub(m2, k, j)
	addl (196*1)(%edi),%eax         # sum = sum + ...
	movl (gcState+12),%esi          # %esi = gcState.frontier
	movl (172*1)(%edi),%edx         # reload ???
	movl %eax,%ebx                  # %ebx = sum + ...
	movl (188*1)(%edi),%esp         # %esp = i
	movl (200*1)(%edi),%ebp         # %ebp = k - 1
	jmp loop_42

Unfortunately, this is no shorter than what's produced by the release.
We're hurt somewhat by the fact that multiplications kill %edx.
That's why we spill %edx at the top of the loop and reload it at the
end.  %edx is used close to the top of L_369, so the heuristic for
what to carry in registers says to carry it.  (Unfortunately, there is
a imul before the actual use, so it's spilt and restored anyways.
This may be an argument for not having %edx in the carrying registers
list; or at least not into blocks with multiplications or divisions.)
k is about the same in both versions; in the new version we spill and
reload in the same iteration; maybe this is better? (there's at least
one jump instruction between the load and the use, so maybe there's
better pipelining??).  The spilling of i is somewhat unfortunate; it's
defed somewhere before entering this loop, so I assume that the value
carried in %esp is not synchronized with the stack slot for i; hence,
when we need to free up a register, we spill it and write it back to
the stack slot.  Since i is never defed in this loop, it would have
been better to notice that i is loop invariant, and therefore ensure
that i is synchronized before entering the loop.  Same could be said
about j, although we didn't need to spill j.  Sum is also similar, but
because it's defed in this loop

Also, notice gcState.frontier.  Never used, but the "calling convention"
for near jumps is to carry frontier in %esi.  So we spill and reload;  On
the todo list is to track the liveness of stackTop and frontier, which
should then recognize that frontier can be left out of the values carried
in registers around this loop.