Problems designing my vectorized math library

**Chris_F** · November 10th, 2010, 01:47 AM

Ok, so let's make this part clear right off the bat: I know there are plenty of math libraries that already exist (D3D's, XNA's, the list goes on). I'm aware of this.

My first approach was to create a class named float4 which would contain 4 packed, 16-byte-aligned floating point numbers. It would offer initializers and overloaded operators written with SSE intrinsics for vector and matrix math.

The problem is that operator overloading doesn't mix well with intrinsics, especially on my compiler, MSVC.

Here is an example of my implementation:

Code:

class float4
{
private:
	__m128 v;

public:

...

	inline float4& operator+=(const float4& rhs)
	{
		v = _mm_add_ps(v, rhs.v);
		return *this;
	}
	inline const float4 operator+(const float4& rhs) const
	{
		return float4(*this) += rhs;
	}
};

When you compare this code like so:

Code:

float4 a(1,2,3,4);
float4 b(5,6,7,8);
float4 c = a + b;

// VS

__m128 a = { 1,2,3,4 };
__m128 b = { 5,6,7,8 };
__m128 c = _mm_add_ps(a, b);

You will find that the float4 class with operators is 10x slower than the pure intrinsic code (in my tests; no need to question my methods, they were accurate enough).

I found this article http://www.gamasutra.com/view/featur...form_simd_.php which talks about this issue. Basically, the compiler is unnecessarily moving the data sse_register >> memory >> x87 FPU >> memory >> sse_register, which is what kills the performance. It's probably slower than pure x87 code. In the article the author claims that the solution is to use a C-like interface instead, one which inlines everything and passes by value to avoid moving data out of the SSE registers.

The headache gets worse at this point. I have switched to a C-like design in which float4 is actually just a typedef for __m128, as the article suggested.

Code:

//Example
typedef __m128 float4;

inline float4 ps_add(float4 v1, float4 v2)
{
	return _mm_add_ps(v1, v2);
}
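To make the by-value style a bit more concrete, here is a rough sketch of how a slightly larger operation could be built the same way. The names ps_mul, ps_dot and demo_dot are hypothetical helpers made up for illustration (they are not from the article), and the horizontal add shown is just one of several possible shuffle sequences.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

typedef __m128 float4;

// Everything is inline and takes/returns float4 by value,
// so the compiler is free to keep operands in SSE registers.
inline float4 ps_add(float4 v1, float4 v2) { return _mm_add_ps(v1, v2); }
inline float4 ps_mul(float4 v1, float4 v2) { return _mm_mul_ps(v1, v2); }

// Hypothetical helper: 4-component dot product returned as a scalar.
inline float ps_dot(float4 a, float4 b)
{
	float4 m = ps_mul(a, b);                        // (m0, m1, m2, m3)
	float4 s = _mm_add_ps(m, _mm_movehl_ps(m, m));  // lane 0 = m0+m2, lane 1 = m1+m3
	// Add lane 1 into lane 0, giving m0+m1+m2+m3 in lane 0.
	s = _mm_add_ss(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)));
	return _mm_cvtss_f32(s);                        // extract lane 0
}

// Hypothetical usage: same values as the a/b example earlier in the post.
inline float demo_dot()
{
	float4 a = _mm_setr_ps(1, 2, 3, 4);
	float4 b = _mm_setr_ps(5, 6, 7, 8);
	float4 c = ps_add(a, b);                        // (6, 8, 10, 12)
	assert(_mm_cvtss_f32(c) == 6.0f);
	return ps_dot(a, b);                            // 1*5 + 2*6 + 3*7 + 4*8
}
```

_mm_setr_ps is used so the arguments read in memory order (lane 0 first), which matches the { 1,2,3,4 } initializers above.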

Reimplementing complex math using this instead of the class version is orders of magnitude faster, but these are the issues I'm currently facing:

1: Microsoft's STL implementation of vector is no good for aligned data. You can implement an aligned allocator, but their version of vector<>.resize() passes by value instead of by reference, so if you try to make a vector of any type of struct or class that contains aligned data, you will receive a compile-time error (oddly, it works OK if you use __m128 in it, but not a class that contains a __m128). The solution to this may be to switch to a different STL; I'm looking into STLport at the moment.

2: [strike]I don't know if STLport will solve this next problem. Using MSVC STL, if I create a vector<float4> (ok in this case because float4 is just a __m128), it works just fine, but I can't use push_back at all because if I try to push back a float4 it says ".push_back' must have class/struct/union"[/strike]

3: The question of how to implement a float4x4, which is a 4x4 float matrix. My first thought was to simply use:

Code:

typedef float4* float4x4;

inline float4x4 matrix_create()
{
	return (float4x4)_aligned_malloc(sizeof(float4)*4, 16);
}

It's easy enough to work with, and performs well, but now I have to remember to use _aligned_free to delete every matrix I make. I tried to use std::array instead, but that doesn't offer a custom allocator in its template arguments, so I can't guarantee 16-byte alignment.
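One possible way around the "remember to free it" part of issue 3 is to let a smart pointer own the allocation. The sketch below assumes a C++11 compiler (for std::unique_ptr) and swaps in _mm_malloc/_mm_free for _aligned_malloc/_aligned_free so it is not MSVC-only. The float4x4 struct, aligned_deleter, float4x4_ptr and demo_matrix are hypothetical names for illustration. Since the struct holds __m128 members, the compiler gives it 16-byte alignment itself for locals and members, so only the heap case needs the aligned allocation.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <new>
#include <xmmintrin.h>

typedef __m128 float4;

// Hypothetical matrix type: four rows, alignment inherited from __m128.
struct float4x4 { float4 row[4]; };

// Deleter that pairs _mm_free with the _mm_malloc used below.
struct aligned_deleter
{
	void operator()(float4x4* p) const { _mm_free(p); }
};

typedef std::unique_ptr<float4x4, aligned_deleter> float4x4_ptr;

// Same idea as matrix_create() above, but the result frees itself.
inline float4x4_ptr matrix_create()
{
	void* p = _mm_malloc(sizeof(float4x4), 16);
	if (!p) throw std::bad_alloc();
	return float4x4_ptr(new (p) float4x4);  // placement new into aligned storage
}

// Hypothetical usage: build an identity matrix on the heap and sum its rows.
inline float demo_matrix()
{
	float4x4_ptr m = matrix_create();
	assert(reinterpret_cast<std::uintptr_t>(m.get()) % 16 == 0);
	m->row[0] = _mm_setr_ps(1, 0, 0, 0);
	m->row[1] = _mm_setr_ps(0, 1, 0, 0);
	m->row[2] = _mm_setr_ps(0, 0, 1, 0);
	m->row[3] = _mm_setr_ps(0, 0, 0, 1);
	float4 sum = _mm_add_ps(_mm_add_ps(m->row[0], m->row[1]),
	                        _mm_add_ps(m->row[2], m->row[3]));  // (1, 1, 1, 1)
	return _mm_cvtss_f32(sum);              // lane 0 of the row sum
}   // m goes out of scope here and aligned_deleter calls _mm_free
```

A matrix that only lives on the stack can simply be declared as a float4x4 local, with no allocation at all; the pointer version is only for cases where the matrix really has to live on the heap.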

I feel like I'm stuck between a rock and a hard place.
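For reference, the kind of aligned allocator mentioned in issue 1 might look roughly like this. It is a minimal sketch assuming a C++11 standard library (where resize takes its fill value by const reference) and assuming _mm_malloc/_mm_free are available through <xmmintrin.h>. The names aligned_allocator, particle and demo_aligned_vector are hypothetical. The allocator only controls where the vector's storage comes from; it does not change how a given library declares resize.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#include <xmmintrin.h>

typedef __m128 float4;

// Minimal C++11-style allocator that hands out Alignment-byte-aligned blocks.
template <typename T, std::size_t Alignment = 16>
struct aligned_allocator
{
	typedef T value_type;

	aligned_allocator() {}
	template <typename U>
	aligned_allocator(const aligned_allocator<U, Alignment>&) {}

	// Explicit rebind is needed because of the non-type Alignment parameter.
	template <typename U>
	struct rebind { typedef aligned_allocator<U, Alignment> other; };

	T* allocate(std::size_t n)
	{
		void* p = _mm_malloc(n * sizeof(T), Alignment);
		if (!p && n != 0) throw std::bad_alloc();
		return static_cast<T*>(p);
	}
	void deallocate(T* p, std::size_t) { _mm_free(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return false; }

// Hypothetical struct that contains aligned data, as in issue 1.
struct particle { float4 position; float4 velocity; };

typedef std::vector<particle, aligned_allocator<particle> > particle_vector;

// Hypothetical usage: push_back and resize on a vector of such structs.
inline float demo_aligned_vector()
{
	particle p;
	p.position = _mm_setr_ps(1, 2, 3, 4);
	p.velocity = _mm_setr_ps(5, 6, 7, 8);

	particle_vector v;
	v.push_back(p);
	v.resize(8, p);  // fill value taken by const reference in C++11 libraries
	assert(reinterpret_cast<std::uintptr_t>(&v[0]) % 16 == 0);

	// Lane 0 of position + velocity for the last element: 1 + 5.
	return _mm_cvtss_f32(_mm_add_ps(v[7].position, v[7].velocity));
}
```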

Thanks in advance to anyone who took the time to read all of this.
