Team PLClub ICFP entry -- comparing the performance of OCAML and SML

Fri, 13 Oct 2000 14:46:26 -0400

|> So, you are correct for Real.Math.sqrt -- the SML/NJ
|> implementation is about 10 times slower than the C math library.  I
|> then computed a rough estimate for the number of seconds spent by
|> SML/NJ in sqrt for each benchmark by taking the number of sqrts
|> divided by SML/NJ's rate for sqrts.  Here are the numbers for each
|> benchmark, along with the original running time measurements.
|>                                                 
|>               number of    number SML/NJ   Total running time
|>                calls to        of   sqrt 
|>               Real.Math     sqrts   time  OCAML  MLton SML/NJ
|> -----------   ---------  --------  -----  -----  -----  -----
|> holes            252374    252359    0.6    1.8    3.2    3.9
|> fov              416558    416545    0.9    1.5    1.8    3.2
|> intercyl         789548    735426    1.6    1.6    2.1    4.3
|> snowgoon         470715    362464    0.8    2.9    3.3    5.1
|> dice            1263757    925063    2.0    3.9    4.9    8.8
|> golf             193489    130088    0.3    1.5    2.4    3.1
|> cone-fractal     603638    455931    1.0    3.7    4.3    6.5
|> large             73500     73487    0.6    4.3    3.0    6.7
|> pipe            1087559    983314    2.2    5.4    4.6    7.9
|> chess           1192926   1030668    2.3   16.0   15.5   21.6
|> fractal         2296066   2229266    4.9   12.2    8.5   45.4
|> 
|> From this table, I conclude that the performance penalty for using
|> sqrt in SML/NJ is a significant cause for why SML/NJ is slower than
|> OCAML or MLton, but not the only cause.
|> 

Thanks for all the experiments. But I would not make any conclusion
this quickly. For example, I am surprised that SML/NJ only spent
10% of its time executing sqrt on chess and fractal if "sqrt" is
indeed a hot spot. 

SML/NJ also implements many of its Real primitives in ML itself
(e.g., floor, trunc, round, cell, ...), so they may also contribute
to the slowdown. 

|> As to my statement "MLton did not penalize the programming style as
|> much as SML/NJ did", I apologize for not being more clear.  I was
|> referring to the table in my post that showed the speedup John
|> Reppy achieved by changing the programming style.  
|> ....................
|>

But John only made three modifications: using unsafe array operations,
uncurrying, and replacing loop index reference cells by tail recursive
functions. Since (in your previous message) you didn't think unsafe
array operation is the problem, I assume you were talking about the
other two aspects (in which presumably MLton does a better job
supporting them than SML/NJ).

|> > I guess you were trying to say that MLton does a better uncurrying
|> > optimization and supports "reference cell" more efficiently than SML/NJ.
|> 
|> I was not.  In fact, MLton does not at present have an explicit
|> uncurrying optimization, although the combination of some of its
|> other optimizations has the effect of some uncurrying.  There are
|> some known cases where MLton has a performance penalty for using
|> curried functions.  The MLton developers are trying to eliminate
|> them. As to reference cells, I did not mention them in my post.

See above.

|> 
|> > But perhaps this has more to do with the fact that MLton does
|> > more aggressive whole-program analysis while SML/NJ emphasize
|> > more on the scalability.
|> 
|> My conclusion is that some of the performance difference between
|> SML/NJ and OCAML or MLton is due to sqrt and some is unexplained.
|> As to the unexplained part, it may be due to whole program analysis
|> or may be due to more mundane things like register allocation or
|> peephole optimization.  I don't know.  If I had to guess, I would
|> say it is not due to whole-program analysis, since OCAML achieves
|> very good performance with separate compilation.
|> 

If I had to guess, I still think it is mostly due to the fact that 
Ocaml/MLton try to do more work in native C while SML/NJ tries to do 
more in ML itself.

I don't believe the unexplained part has anything to do with whole
program analysis, register allocation, or peephole optimization.
The new SML/NJ backend does a pretty good job of all these and a 
more sophisticated optimizer can seldom yield more than 10% speedup.

-Zhong Shao
(shao-zhong@cs.yale.edu)