Re[2]: [ODE] Faster ODE
Henri Hakl
henri at cs.sun.ac.za
Mon Nov 25 13:46:01 2002
I've been thinking about SIMD (MMX, 3DNow!, SSE/SSE2) instructions for
ODE, and it is quite possible that they can bring about harmony and
speed. But one thing that is likely to cause problems is the SSE(2)
code.

For optimal performance a number of details need to be in place.
Vectors and matrices need a row length that is a multiple of 4. (This
is already implemented, and it is the reason why, for example, a 3x3
matrix is defined as a structure of 12 TReals, i.e. 3x4.)
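
To make the padded layout concrete, here is a minimal sketch in C. It
assumes TReal is float, and the names Matrix3, mat3_index and
mat3_set_identity are hypothetical ones made up for illustration, not
ODE's actual API. The point is only that each row occupies 4 slots, so
element (row, col) is found with a stride of 4 rather than 3.

```c
#include <assert.h>
#include <stddef.h>

typedef float TReal;   /* assumption: single-precision build */

/* STRIDE is the row length padded up to a multiple of 4. */
enum { ROWS = 3, COLS = 3, STRIDE = 4 };

/* Hypothetical type: a 3x3 matrix stored as 3 rows of 4 TReals. */
typedef struct {
    TReal m[ROWS * STRIDE];   /* 12 elements; indices 3, 7, 11 are padding */
} Matrix3;

/* Element (row, col) lives at row*STRIDE + col, not row*COLS + col. */
static size_t mat3_index(size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    return row * STRIDE + col;
}

/* Write the identity matrix and zero the padding slots as well. */
static void mat3_set_identity(Matrix3 *a)
{
    for (size_t i = 0; i < ROWS * STRIDE; ++i)
        a->m[i] = 0.0f;
    for (size_t i = 0; i < ROWS; ++i)
        a->m[mat3_index(i, i)] = 1.0f;
}
```

With this layout every row starts 4 floats (16 bytes) after the previous
one, which is what lets a whole row be moved as one 128-bit vector.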

However, the structures also have to be aligned on 16-byte boundaries.
To allow for optimal SSE(2) access (using movaps), each 128-bit vector
that is accessed in memory has to be aligned on a 16-byte boundary.
This is a problem in ODE, because every math structure would then be
required to be 16-byte aligned. That is difficult to achieve because
ODE passes around sub-matrices of matrices, and it may be difficult to
guarantee that every sub-matrix is >also< correctly 16-byte aligned.
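
The sub-matrix problem can be illustrated with a small sketch in C. It
assumes TReal is float and a padded row stride of 4; is_aligned16 and
submatrix are hypothetical helpers written for this example, not
functions from ODE. It shows that a sub-matrix starting in column 0
inherits the parent's alignment, while one starting in any other column
does not.

```c
#include <assert.h>
#include <stdint.h>

typedef float TReal;   /* assumption: single-precision build */

enum { STRIDE = 4 };   /* padded row length: 4 floats = 16 bytes */

/* Returns 1 if p sits on a 16-byte boundary, as movaps requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Hypothetical helper: pointer to the sub-matrix whose top-left
   element is (row, col) of the parent matrix m. */
static const TReal *submatrix(const TReal *m, int row, int col)
{
    return m + row * STRIDE + col;
}
```

If the parent matrix is 16-byte aligned, submatrix(m, 1, 0) is still
aligned because each row is exactly 16 bytes long, but
submatrix(m, 0, 1) is offset by 4 bytes. A movaps on that address
faults, so the code would have to fall back to the unaligned movups
load or copy the block into aligned storage first.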

Additionally, SSE2 primarily adds double-precision floating-point
functionality to the SIMD instructions. This can help somewhat with
speed in the TReal = double case, but it isn't likely (just my guess)
to give as tremendous a speed bonus as the 4 single floats that can be
handled simultaneously for TReal = single, since only 2 doubles fit in
a 128-bit register.
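
The difference in width can be sketched with compiler intrinsics. This
assumes an x86 compiler with SSE and SSE2 enabled (the default on
x86-64); add4_float and add2_double are hypothetical names for this
example only. One SSE instruction adds 4 floats, while one SSE2
instruction adds 2 doubles.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE:  __m128,  4 x float  */
#include <emmintrin.h>   /* SSE2: __m128d, 2 x double */

/* out[i] = a[i] + b[i] for 4 floats, using one packed add (addps).
   All three pointers must be 16-byte aligned, because _mm_load_ps
   and _mm_store_ps are the aligned (movaps) forms. */
static void add4_float(const float *a, const float *b, float *out)
{
    _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
}

/* Same operation for 2 doubles, using one packed add (addpd).
   The same 16-byte alignment requirement applies. */
static void add2_double(const double *a, const double *b, double *out)
{
    _mm_store_pd(out, _mm_add_pd(_mm_load_pd(a), _mm_load_pd(b)));
}
```

Both functions touch exactly 16 bytes per operand, so for the same
number of instructions the float version processes twice as many
elements.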
Anyway... ;)
Henri
----- Original Message -----
From: "Nguyen Binh" <ngbinh@glassegg.com>
To: <ode-admin@q12.org>; "Russ Smith" <russ@q12.org>
Cc: "Jeffrey Palmer" <jeffrey.palmer@acm.org>; <ode@q12.org>
Sent: Monday, November 25, 2002 5:11 AM
Subject: Re[2]: [ODE] Faster ODE
>
> I think the best way to improve ODE speed is using CPU-
> specialized instructions like MMX,SIMD,SSE(2).
>
> The refs can be :
> http://LibSimd.sourceforge.net
> SML library of Intel. (Very nice!)
>
> --
> Best regards,
>
> ---------------------------------------------------------------------
> Nguyen Binh
> Software Engineer
> Glass Egg Digital Media
> Me Linh Point Tower, 10th Floor
> 2 Ngo Duc Ke
> District 1, Ho Chi Minh City
> Vietnam
> Fax: (84.8)823-8392
> www.glassegg.com
> ---------------------------------------------------------------------
>
>
> _______________________________________________
> ODE mailing list
> ODE@q12.org
> http://q12.org/mailman/listinfo/ode