Re[4]: [ODE] Faster ODE

Henri Hakl henri at
Tue Nov 26 06:43:01 2002

Sounds good.

I know Intel's SML - I used it (among other things) to write a small
specialized matrix library of my own. It only works on 3x1 and 4x1 vectors
and 3x3 and 4x4 matrices, with no LU decomposition or other "higher"
functionality, but it has damn fast "basic" functionality (normalize,
transpose, multiply, arithmetic, etc). If anybody is interested go look at:

You should know that the SML still has some room for improvement; for
example their 4x4 matrix multiply can still gain another 10% speed with some
clever use of SSE2 instructions. Nonetheless it is in some cases futile to
implement SIMD algorithms, as FPU versions are as fast or faster - and less
hassle to create. Unfortunately you have to write both before you can figure
out which one is better suited...
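To make the "do both and measure" point concrete, here is a minimal sketch
in C of a scalar 4x4 multiply, an SSE variant of the same routine, and a
crude timing helper. The names (Mat4, mat4_mul_scalar, mat4_mul_simd,
time_mul) are made up for illustration; this is neither SML's nor ODE's
code. When the compiler does not define __SSE__ the SIMD variant simply
falls back to the scalar one.

```c
#include <assert.h>
#include <time.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical 4x4 row-major matrix type (not ODE's or SML's). */
typedef struct { float m[16]; } Mat4;

typedef void (*mat4_mul_fn)(Mat4 *out, const Mat4 *a, const Mat4 *b);

/* Plain scalar ("FPU") version: out = a * b. out must not alias a or b. */
static void mat4_mul_scalar(Mat4 *out, const Mat4 *a, const Mat4 *b)
{
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a->m[r * 4 + k] * b->m[k * 4 + c];
            out->m[r * 4 + c] = s;
        }
    }
}

/* SSE version: each output row is a linear combination of the rows of b.
   Unaligned loads/stores are used so no alignment is assumed here. */
static void mat4_mul_simd(Mat4 *out, const Mat4 *a, const Mat4 *b)
{
#if defined(__SSE__)
    const __m128 b0 = _mm_loadu_ps(&b->m[0]);
    const __m128 b1 = _mm_loadu_ps(&b->m[4]);
    const __m128 b2 = _mm_loadu_ps(&b->m[8]);
    const __m128 b3 = _mm_loadu_ps(&b->m[12]);
    for (int r = 0; r < 4; ++r) {
        __m128 row = _mm_mul_ps(_mm_set1_ps(a->m[r * 4 + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a->m[r * 4 + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a->m[r * 4 + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a->m[r * 4 + 3]), b3));
        _mm_storeu_ps(&out->m[r * 4], row);
    }
#else
    mat4_mul_scalar(out, a, b);  /* no SSE available: scalar fallback */
#endif
}

/* Crude timing helper: seconds of CPU time for `iterations` multiplies. */
static double time_mul(mat4_mul_fn fn, int iterations)
{
    Mat4 a, b, out;
    volatile float sink = 0.0f;  /* keeps the loop from being optimized away */
    assert(iterations > 0);
    for (int i = 0; i < 16; ++i) {
        a.m[i] = (float)(i + 1);
        b.m[i] = (float)(16 - i);
    }
    clock_t start = clock();
    for (int i = 0; i < iterations; ++i) {
        fn(&out, &a, &b);
        sink += out.m[i & 15];
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_mul(mat4_mul_scalar, n) and time_mul(mat4_mul_simd, n) with
the same n gives a rough comparison on a given machine. The numbers depend
heavily on compiler flags and CPU, so treat them as a hint only.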

I've written a separate LDLT decomposition using the "classic" approach (not
the one used in ODE) and FPU optimization which is quite fast. And I wrote a
separate assembler FPU version of the replacement unit I provided, but it is
only about 1% faster. (So I think the readability is more important in this

Oh yeah... and don't forget that you should have separate FPU, MMX, 3DNow,
3DNow!, SSE and SSE2 versions to keep everybody happy and running as
efficiently as possible. (Oh... and don't use an SSE version for the Athlon
processors - although they support SSE it is nowhere near as fast as the
Intel equivalents.)
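One common way to keep several versions side by side is to pick a table of
function pointers once at startup. The sketch below is only an
illustration in C: the feature flags, the MathBackend struct and
select_backend are hypothetical names, CPU detection itself is not shown
(the flags are passed in), and every backend here points at the same
scalar routine as a stand-in for a real per-instruction-set version. MMX
is left out because it only handles integer data. The selection order
encodes the Athlon advice above: prefer 3DNow over SSE on AMD parts.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 4-component vector type. */
typedef struct { float x, y, z, w; } Vec4;

typedef void (*vec4_add_fn)(Vec4 *out, const Vec4 *a, const Vec4 *b);

/* Hypothetical feature flags; a real program would fill these from CPUID. */
enum {
    CPU_3DNOW      = 1 << 0,
    CPU_SSE        = 1 << 1,
    CPU_SSE2       = 1 << 2,
    CPU_VENDOR_AMD = 1 << 3
};

typedef struct {
    const char *name;  /* label for logging and debugging */
    vec4_add_fn add;   /* one entry per accelerated routine */
} MathBackend;

/* Stand-in kernel: each real backend would supply its own version. */
static void vec4_add_generic(Vec4 *out, const Vec4 *a, const Vec4 *b)
{
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
    out->w = a->w + b->w;
}

static const MathBackend backend_fpu   = { "fpu",   vec4_add_generic };
static const MathBackend backend_3dnow = { "3dnow", vec4_add_generic };
static const MathBackend backend_sse   = { "sse",   vec4_add_generic };
static const MathBackend backend_sse2  = { "sse2",  vec4_add_generic };

/* Choose a backend once; callers then go through the returned table. */
static const MathBackend *select_backend(unsigned features)
{
    /* On AMD parts prefer 3DNow even if SSE is reported. */
    if ((features & CPU_VENDOR_AMD) && (features & CPU_3DNOW))
        return &backend_3dnow;
    if (features & CPU_SSE2)
        return &backend_sse2;
    if (features & CPU_SSE)
        return &backend_sse;
    if (features & CPU_3DNOW)
        return &backend_3dnow;
    return &backend_fpu;
}
```

The indirect call costs a little on very small routines, so in practice
the dispatch probably belongs at the level of a whole solver step rather
than a single vector add.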

Still :)) - I'll be happy to look at the code you develop for ODE and see if
I can fit some additional speed in there. ;)

----- Original Message -----
From: "Nguyen Binh" <>
To: <>; "Henri Hakl" <>
Cc: <>
Sent: Tuesday, November 26, 2002 3:50 AM
Subject: Re[4]: [ODE] Faster ODE

> Hi ,
> HH> I've been thinking about SIMD (MMX, 3DNow(!), SSE(2)) instructions for
> HH> and it is quite possible that it can bring about harmony and speed. But one
> HH> thing that is likely going to cause problems is the SSE(2) code.
> HH> For optimal performance a number of details need to be implemented.
> HH> and matrices need to be of a horizontal size that is a multiple of 4 (this is
> HH> implemented and the reason why, for example, a 3x3 matrix is defined as a 12
> HH> TReal (3x4) structure).
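As an illustration of the padded layout described in the quoted paragraph,
here is a minimal sketch in C of a 3x3 matrix stored as 3 rows of 4 TReal
each. Only the name TReal comes from the thread; Mat3Padded, ROW_STRIDE,
mat3_at, mat3_identity and mat3_mul_vec are hypothetical, and TReal is
assumed to be float here.

```c
#include <assert.h>
#include <string.h>

typedef float TReal;   /* assumption: single-precision build */

#define ROW_STRIDE 4   /* each row padded from 3 to 4 elements */

/* Hypothetical padded 3x3 matrix: 12 TReal, the 4th column is unused. */
typedef struct { TReal v[3 * ROW_STRIDE]; } Mat3Padded;

/* Address of element (row, col) in the padded layout. */
static TReal *mat3_at(Mat3Padded *m, int row, int col)
{
    assert(row >= 0 && row < 3 && col >= 0 && col < 3);
    return &m->v[row * ROW_STRIDE + col];
}

/* Set m to the identity; the padding column is left at zero. */
static void mat3_identity(Mat3Padded *m)
{
    memset(m, 0, sizeof *m);
    for (int i = 0; i < 3; ++i)
        *mat3_at(m, i, i) = (TReal)1;
}

/* out = m * v, with vectors also padded to 4 elements. Because every row
   is a full group of 4 values, a SIMD version could load one row per
   128-bit register. */
static void mat3_mul_vec(TReal out[4], const Mat3Padded *m, const TReal v[4])
{
    for (int r = 0; r < 3; ++r) {
        const TReal *row = &m->v[r * ROW_STRIDE];
        out[r] = row[0] * v[0] + row[1] * v[1] + row[2] * v[2];
    }
    out[3] = (TReal)0;  /* keep the padding element defined */
}
```

The cost is 25% extra memory per matrix in exchange for rows that each
fill exactly one 128-bit register.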
>     Take a look at Small Matrix Lib (SML) of Intel, you'll see that we
>     have at least a way to solve this.
> HH> However, the structures also have to be aligned onto 16-byte boundaries. To
> HH> allow for optimal SSE(2) access (using movaps) each 128-bit memory
> HH> that is accessed has to be aligned on a 16-byte memory boundary. This is a
> HH> problem in ODE, as every math structure now is required to be 16-byte
> HH> aligned; this is difficult to achieve because ODE calls/uses sub-matrices of
> HH> matrices, and it may be difficult to guarantee that every sub-matrix
> HH> is >>also<< correctly 16-byte aligned.
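The sub-matrix concern in the quoted paragraph can be shown with a small
C11 sketch. Mat4Aligned, is_aligned16 and submatrix_start are hypothetical
names, and float elements are assumed. A sub-matrix that starts at the
beginning of a row of 4 floats keeps the 16-byte alignment of its parent,
while one that starts in the middle of a row does not, so an aligned load
such as movaps could not be used on it.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical 4x4 row-major matrix whose storage is 16-byte aligned. */
typedef struct { alignas(16) float v[16]; } Mat4Aligned;

/* Nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Pointer to the sub-matrix that starts at (row, col) of the parent. */
static const float *submatrix_start(const Mat4Aligned *m, int row, int col)
{
    assert(row >= 0 && row < 4 && col >= 0 && col < 4);
    return &m->v[row * 4 + col];
}
```

With 4-byte floats, is_aligned16(submatrix_start(&m, 1, 0)) holds, since
the offset is a whole row (16 bytes), while the same check fails for
submatrix_start(&m, 1, 1), which sits 4 bytes past a boundary. Code taking
arbitrary sub-matrices would need either such a check with an unaligned
fallback, or a rule that sub-matrices only start at column 0.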
>     Also, SML solved these.
>     But we may consider moving all matrices, vectors, ... of ODE to the
>     matrices, vectors, ... of SML. I assure you it will not be hard, because
>     right now the vectors and matrices of ODE are barely a typedef TReal*.
>     Moreover, in SML we have a dimension-variable matrix type that
>     has a built-in LU decomposition function.
> HH> Additionally SSE2 primarily adds double-float functionality to the
> HH> instructions. This can help somewhat for speed in the TReal = double
> HH> but isn't likely (just my guess) to have as tremendous a speed bonus as 4
> HH> single floats that can be handled simultaneously for TReal = single.
>     OK, just your guess. I'm SIMDing ODE, I'll post my benchmarks when
>     it is finished.
> --
> Best regards,
> ---------------------------------------------------------------------
>    Nguyen Binh
>    Software Engineer
>    Glass Egg Digital Media
>    Me Linh Point Tower, 10th Floor
>    2 Ngo Duc Ke
>    District 1, Ho Chi Minh City
>    Vietnam
>    Fax:  (84.8)823-8392
> ---------------------------------------------------------------------