discussion of X86 floating point on comp.lang.ml

Thu, 19 Oct 2000 10:44:55 -0700 (PDT)

Matthew, I don't know if you follow comp.lang.ml, so just in case, you will
probably find the following article interesting.

http://x53.deja.com/threadmsg_md.xp?thitnum=1&mhitnum=4&CONTEXT=971977159.1300693019&new=1&AN=683376389.1&uniq=971977204.1300758563

 Subject: Re: Team PLClub ICFP entry -- comparing the performance of
          OCAML and SML
 Date:    10/19/2000
 Author:  Xavier Leroy <Xavier.Leroy@see.my.sig.for.address>

 Allen Leung <leunga@cs.nyu.edu> writes:

 >    Actually, the SML/NJ backend currently uses the ``wrong'' framework for
 > FP register allocation on the x86.  Instead of using the FP stack registers
 > as registers, it uses them only as temporaries for evaluating expressions.
 > Virtual registers are actually placed on the (memory) stack.

 OCaml does exactly the same, and I believe this is actually the
 ``right'' framework for FP on the x86 -- at least the one intended
 by the designers of the Pentium and Pentium Pro/II/III.  For
 instance, loading from / storing to a register deep in the FP
 register stack is nearly as expensive as loading from / storing to
 the memory stack, provided it is in L1 cache.

 I did various experiments with using the FP register stack as real
 registers, and it did not improve performance w.r.t. the
 simple-minded strategy you describe above.

 But it is true that better FP performance will be obtained by using
 the SSE2 extension (announced on the latest Pentiums and on
 AMD's x86-64 processor), which at last provides "real"
 floating-point registers.

 > So there is a huge penalty with FP intensive loops, compared to using the
 > ``right'' framework.  How many of these benchmarks are FP intensive?  The
 > performance of SML/NJ may have something to do with the RA.

 As I said, OCaml uses the same framework as SML/NJ here, and
 this doesn't prevent it from outperforming SML/NJ by a good
 factor on FP-intensive stuff.  So, the explanation of SML/NJ's
 performance is to be found elsewhere.

 - Xavier Leroy
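
To make the strategy discussed in the post concrete, here is a minimal
sketch of it in Python. It is not taken from the SML/NJ or OCaml code
generators. The instruction names (fld, fldc, fstp, faddp, ...) are
simplified stand-ins loosely modeled on x87 mnemonics, and the classes
and functions (Var, Const, BinOp, compile_assign, run) are hypothetical
names introduced only for this illustration. Every variable lives in a
memory slot, and the FP stack holds only the temporaries of the
expression currently being evaluated.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Var:
    name: str            # a "virtual register", kept in a memory slot


@dataclass
class Const:
    value: float


@dataclass
class BinOp:
    op: str              # one of "+", "-", "*", "/"
    left: "Expr"
    right: "Expr"


Expr = Union[Var, Const, BinOp]

# Simplified pseudo-instructions (assumption: "pop b, pop a, push a op b").
ARITH = {"+": "faddp", "-": "fsubp", "*": "fmulp", "/": "fdivp"}
SEMANTICS = {
    "faddp": operator.add,
    "fsubp": operator.sub,
    "fmulp": operator.mul,
    "fdivp": operator.truediv,
}
FP_STACK_DEPTH = 8       # the x87 register stack has eight entries


def compile_expr(e: Expr, code: List[Tuple]) -> None:
    """Emit postfix code: the FP stack is used for temporaries only."""
    if isinstance(e, Var):
        code.append(("fld", e.name))      # load from the memory slot
    elif isinstance(e, Const):
        code.append(("fldc", e.value))    # load a constant (pseudo-op)
    else:
        compile_expr(e.left, code)
        compile_expr(e.right, code)
        code.append((ARITH[e.op],))


def compile_assign(dst: str, e: Expr) -> List[Tuple]:
    """Compile `dst = e`; the result goes straight back to memory."""
    code: List[Tuple] = []
    compile_expr(e, code)
    code.append(("fstp", dst))
    return code


def run(code: List[Tuple], memory: Dict[str, float]) -> Dict[str, float]:
    """Interpret the pseudo-instructions on a copy of `memory`."""
    mem = dict(memory)
    stack: List[float] = []
    for ins in code:
        name = ins[0]
        if name == "fld":
            stack.append(mem[ins[1]])
        elif name == "fldc":
            stack.append(ins[1])
        elif name == "fstp":
            mem[ins[1]] = stack.pop()
        else:
            b = stack.pop()
            a = stack.pop()
            stack.append(SEMANTICS[name](a, b))
        assert len(stack) <= FP_STACK_DEPTH, "expression too deep for FP stack"
    return mem


# z = (x + y) * x - 1.0, with x and y read from their memory slots
expr = BinOp("-", BinOp("*", BinOp("+", Var("x"), Var("y")), Var("x")), Const(1.0))
program = compile_assign("z", expr)
result = run(program, {"x": 2.0, "y": 3.0})
```

In this sketch the FP stack is empty between statements, so no value
ever has to be fetched from deep inside it. The cost is one memory load
per variable use and one store per assignment, which the post argues is
cheap as long as the slots stay in L1 cache.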