Problems designing my vectorized math library

**Chris_F** · November 10th, 2010, 01:47 AM

OK, so let's make this part clear right off the bat: I know that plenty of math libraries already exist (D3D's, XNA's, the list goes on). I'm aware of this.

My first approach was to create a class named float4 which would contain 4 packed and 16 byte aligned floating point numbers. It would offer initializers and overloaded operators written with SSE intrinsics for vector and matrix math.

The problem is that operator overloading doesn't mix well with intrinsics, especially on my compiler, MSVC.

Here is an example of my implementation:

Code:

class float4
{
private:
	__m128 v;

public:

...

	inline float4& operator+=(const float4& rhs)
	{
		v = _mm_add_ps(v, rhs.v);
		return *this;
	}
	inline const float4 operator+(const float4& rhs) const
	{
		return float4(*this) += rhs;
	}
};

When you compare this code like so:

Code:

float4 a(1,2,3,4);
float4 b(5,6,7,8);
float4 c = a + b;

// VS

__m128 a = { 1,2,3,4 };
__m128 b = { 5,6,7,8 };
__m128 c = _mm_add_ps(a, b);

You will find that the float4 class with operators is 10x slower than the pure intrinsic code (in my tests, no need to question my methods, they were accurate enough.)

I found this article http://www.gamasutra.com/view/featur...form_simd_.php which talks about this issue. Basically, the compiler is unnecessarily moving the data sse_register >> memory >> x87 FPU >> memory >> sse_register which is what kills the performance. It's probably slower than pure x87 code. In the article he claims that the solution is to use a C like interface instead, one which inlines everything and passes by value to avoid moving data out of the SSE registers.

The headache gets worse at this point, as I have switched to a C like design in which float4 is actually just a typedef for __m128 as the article suggested.

Code:

//Example
typedef __m128 float4;

inline float4 ps_add(float4 v1, float4 v2)
{
	return _mm_add_ps(v1, v2);
}

Reimplementing complex math using this instead of the class version is orders of magnitude faster, but these are the issues I'm currently facing:

1: Microsoft's STL implementation of vector is no good for aligned data. You can implement an aligned allocator, but their version of vector<>.resize() passes by value instead of by reference, so if you try to make a vector of any type of struct or class that contains aligned data, you will receive a compile-time error (oddly, it works OK if you use __m128 in it, but not a class that contains a __m128). The solution to this may be to switch to a different STL; I'm looking into STLport atm.

2: [strike]I don't know if STLport will solve this next problem. Using MSVC STL, if I create a vector<float4> (OK in this case because float4 is just a __m128), it works just fine, but I can't use push_back at all because if I try to push back a float4 it says "'.push_back' must have class/struct/union"[/strike]

3: The question of how to implement a float4x4, which is a 4x4 float matrix. My first thought was to simply use:

Code:

typedef float4* float4x4;

inline float4x4 matrix_create()
{
	return (float4x4)_aligned_malloc(sizeof(float4)*4, 16);
}

It's easy enough to work with, and performs well, but now I have to remember to use _aligned_free to delete every matrix I make. I tried to use std::array instead, but that doesn't offer a custom allocator in its template arguments, so I can't guarantee 16-byte alignment.
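A note on the std::array route: the alignment of a std::array comes from its element type rather than from an allocator, so an array of 16-byte-aligned elements is itself 16-byte aligned when it lives on the stack or inside another object. A minimal sketch, assuming a stand-in float4 struct (declared with alignas(16) so it builds without SSE headers) and a hypothetical matrix_identity helper, neither of which is from the code above:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for __m128 (assumption): four floats with 16-byte alignment.
struct alignas(16) float4 { float v[4]; };

// Hypothetical matrix type: four rows stored by value, no heap allocation.
typedef std::array<float4, 4> float4x4;

// The array inherits the alignment requirement of its element type.
static_assert(alignof(float4x4) >= 16, "rows keep float4's alignment");

// Returns true when p is a multiple of 16 bytes.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: builds an identity matrix and returns it by value.
inline float4x4 matrix_identity() {
    float4x4 m = {};  // zero-initialise every row
    for (std::size_t i = 0; i < 4; ++i) m[i].v[i] = 1.0f;
    assert(is_aligned16(&m));  // automatic storage honours alignas
    return m;
}
```

Heap allocation is the caveat: before C++17, plain new is not required to honor alignment stricter than the default, which is where _aligned_malloc or an aligning allocator comes back in.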

I feel like I'm stuck between a rock and a hard place.

Thanks in advance to anyone who took the time to read all of this.

**superbonzo** · November 10th, 2010, 03:14 AM

Originally Posted by Chris_F

1: Microsoft's STL implementation of vector is no good for aligned data. You can implement an aligned allocator, but their version of vector<>.resize() passes by value instead of by reference, so if you try to make a vector of any type of struct or class that contains aligned data, you will receive a compile-time error (oddly, it works OK if you use __m128 in it, but not a class that contains a __m128). The solution to this may be to switch to a different STL; I'm looking into STLport atm.

Take a look at how the Eigen library approaches this problem; its solutions consist of respecializing the whole std::vector template. In particular, the newest solution respecializes those vectors specified with the Eigen allocator (which acts both as an aligning allocator and a specialization tag) ... one more reason to leave this kind of thing to library writers.
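As a rough illustration of the "aligning allocator" half of that idea (this is not Eigen's code), a minimal allocator can over-allocate with malloc and hand back an aligned pointer. The names aligned_allocator and is_aligned are made up for this sketch, and it provides only the allocator members that C++11 and later require:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical allocator: every block it returns is aligned to Align bytes.
template <class T, std::size_t Align>
struct aligned_allocator {
    static_assert(Align >= sizeof(void*) && (Align & (Align - 1)) == 0,
                  "Align must be a power of two no smaller than a pointer");
    typedef T value_type;

    aligned_allocator() noexcept {}
    template <class U>
    aligned_allocator(const aligned_allocator<U, Align>&) noexcept {}

    // Spelled out because Align is a non-type template parameter.
    template <class U>
    struct rebind { typedef aligned_allocator<U, Align> other; };

    T* allocate(std::size_t n) {
        // Room for the payload, worst-case padding, and the saved raw pointer.
        void* raw = std::malloc(n * sizeof(T) + Align + sizeof(void*));
        if (!raw) throw std::bad_alloc();
        std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        std::uintptr_t aligned =
            (start + Align - 1) & ~static_cast<std::uintptr_t>(Align - 1);
        reinterpret_cast<void**>(aligned)[-1] = raw;  // remember what to free
        return reinterpret_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t) noexcept {
        if (p) std::free(reinterpret_cast<void**>(p)[-1]);
    }
};

// The allocator is stateless, so any two instances compare equal.
template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept { return false; }

// Returns true when p is a multiple of a bytes.
inline bool is_aligned(const void* p, std::size_t a) {
    return reinterpret_cast<std::uintptr_t>(p) % a == 0;
}
```

An allocator like this only controls where the storage lands. It does nothing about a resize() that takes its argument by value, which is presumably why Eigen goes further and respecializes the container.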

....

**D_Drmmr** · November 10th, 2010, 03:15 AM

Originally Posted by Chris_F

You will find that the float4 class with operators is 10x slower than the pure intrinsic code (in my tests, no need to question my methods, they were accurate enough.)

Why not provide some test code for others to run? There's little point talking about performance without being able to measure things.
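A shareable test could be as small as the following sketch. It is not the original benchmark, which was never posted; float4_class, sum_with_class, sum_with_intrinsics and seconds_for are names invented here, and the class only imitates the member-operator style shown in the first post. It assumes an x86 target with SSE and a C++11 compiler for <chrono>:

```cpp
#include <chrono>
#include <xmmintrin.h>

// Wrapper in the style of the first post: member operators, pass by reference.
class float4_class {
    __m128 v;
public:
    explicit float4_class(__m128 x) : v(x) {}
    float4_class& operator+=(const float4_class& rhs) {
        v = _mm_add_ps(v, rhs.v);
        return *this;
    }
    float4_class operator+(const float4_class& rhs) const {
        return float4_class(*this) += rhs;
    }
    __m128 raw() const { return v; }
};

// Adds 1.0f to every lane `iterations` times through the class operators.
inline float sum_with_class(int iterations) {
    float4_class acc(_mm_setzero_ps());
    const float4_class step(_mm_set1_ps(1.0f));
    for (int i = 0; i < iterations; ++i) acc = acc + step;
    return _mm_cvtss_f32(acc.raw());  // lane 0 of the accumulator
}

// The same loop written directly against the intrinsics.
inline float sum_with_intrinsics(int iterations) {
    __m128 acc = _mm_setzero_ps();
    const __m128 step = _mm_set1_ps(1.0f);
    for (int i = 0; i < iterations; ++i) acc = _mm_add_ps(acc, step);
    return _mm_cvtss_f32(acc);
}

// Runs f(iterations), stores its result, and returns the elapsed seconds.
template <class F>
double seconds_for(F f, int iterations, float& result) {
    const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    result = f(iterations);
    const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Both functions should return the same value. Comparing the two timings at a large iteration count in an optimized build, together with the disassembly, is what would show whether the wrapper costs anything on a given compiler.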

Originally Posted by Chris_F

In the article he claims that the solution is to use a C like interface instead, one which inlines everything and passes by value to avoid moving data out of the SSE registers.

I am always doubtful when people claim you shouldn't use C++ constructs because they are too slow, and then see them not using common C++ practices. In this case, unnecessary use of pointers and declaring all variables at the beginning of a function.
Why not try passing by value when using a class?

Originally Posted by Chris_F

(oddly it works OK if you use __m128 in it, but not a class that contains a __m128).

Perhaps there is a specialization for vector<__m128>.

Originally Posted by Chris_F

but I can't use push_back at all because if I try to push back a float4 it says "'.push_back' must have class/struct/union"

Show the code.

Originally Posted by Chris_F

It's easy enough to work with, and performs well, but now I have to remember to use _aligned_free to delete every matrix I make.

That's what you get when you throw C++ language features overboard. Instead, you can wrap this in a class and make your life easier.
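A minimal sketch of such a wrapper, assuming C++11 for std::unique_ptr and using plain float storage as a stand-in for four __m128 rows. The names aligned_alloc16 and aligned_free16 and the layout of the float4x4 class are invented for illustration; on MSVC the helpers forward to _aligned_malloc and _aligned_free, elsewhere to posix_memalign and free:

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <stdlib.h>
#include <utility>
#if defined(_MSC_VER)
#include <malloc.h>
#endif

// Hypothetical helpers: 16-byte aligned allocation and the matching release.
inline void* aligned_alloc16(std::size_t bytes) {
#if defined(_MSC_VER)
    return _aligned_malloc(bytes, 16);
#else
    void* p = nullptr;
    return posix_memalign(&p, 16, bytes) == 0 ? p : nullptr;
#endif
}

inline void aligned_free16(void* p) {
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    free(p);
#endif
}

// Owns 16 floats (four rows of four) in 16-byte aligned storage.
// The destructor releases the block, so no manual free call is needed.
class float4x4 {
    struct deleter {
        void operator()(float* p) const { aligned_free16(p); }
    };
    std::unique_ptr<float, deleter> data_;

public:
    float4x4() : data_(static_cast<float*>(aligned_alloc16(16 * sizeof(float)))) {
        if (!data_) throw std::bad_alloc();
        for (int i = 0; i < 16; ++i) data_.get()[i] = 0.0f;  // start zeroed
    }
    // Pointer to the first element of row r; each row starts on a 16-byte boundary.
    float* row(int r) { return data_.get() + 4 * r; }
    const float* row(int r) const { return data_.get() + 4 * r; }
};
```

Because the only data member is a unique_ptr, the class is movable but not copyable; a copy constructor would have to be written by hand if matrices need to be duplicated.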

**Chris_F** · November 10th, 2010, 03:42 AM

Originally Posted by D_Drmmr

Why not provide some test code for others to run? There's little point talking about performance without being able to measure things.

I don't see the point. I looked at a disassembly and the code is horribly mangled, and in addition I did some basic tests and timed them. If you want to call my methods of testing inaccurate, go ahead, but I assure you they wouldn't account for an order of magnitude difference in performance. We're talking about the difference between waiting around 20 seconds for a test to complete and waiting around for over 3 minutes.

I am always doubtful when people claim you shouldn't use C++ constructs because they are too slow, and then see them not using common C++ practices. In this case, unnecessary use of pointers and declaring all variables at the beginning of a function.
Why not try passing by value when using a class?

That's exactly how I feel too, but given the results I got, I can't really argue with it unless someone can prove it all wrong. He gets good results with the Intel compiler, but I'm stuck with MSVC.

Show the code.

What? There isn't exactly much to show.

Code:

vector<__m128> myVector;
__m128 a = _mm_set_ps1(0);
myVector.push_back(a);    // Compiler Error, __m128 is apparently neither class, struct nor union

That's what you get when you throw C++ language features overboard.

Believe me, I'd rather not; after all, I tried to do things "the C++ way" the first time around.

**JohnW@Wessex** · November 10th, 2010, 03:59 AM

Code:

vector<__m128> myVector(); // A function called 'myVector' that takes no parameters and returns a vector<__m128>
__m128 a = _mm_set_ps1(0);
myVector.push_back(a); // Compile error!
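The same effect can be checked at compile time. A small sketch, using int elements so that it builds without SSE headers; the two names are made up for the example:

```cpp
#include <type_traits>
#include <vector>

// With empty parentheses this is a function declaration, not an object.
std::vector<int> declared_with_parens();

// Without them it is an empty vector object.
std::vector<int> declared_without_parens;

static_assert(std::is_function<decltype(declared_with_parens)>::value,
              "empty parentheses declare a function");
static_assert(!std::is_function<decltype(declared_without_parens)>::value,
              "no parentheses defines an object");
```

Only the second name has member functions such as push_back, which is why MSVC complains that the left side of '.push_back' is not a class, struct or union.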

**Chris_F** · November 10th, 2010, 04:13 AM

Originally Posted by JohnW@Wessex

Code:

vector<__m128> myVector(); // A function called 'myVector' that takes no parameters and returns a vector<__m128>
__m128 a = _mm_set_ps1(0);
myVector.push_back(a); // Compile error!

That's not what I meant. I didn't copy/paste that, if that's what you think, just scribbled something down between code brackets to illustrate my point but made a mistake.

Try it for yourself. Make a std::vector of __m128 and try to push one back if you don't believe me.

Edit: Actually I can't seem to replicate the error. In any case, the last time it happened I was using something to the effect of:

Code:

vector<float4> myVector(4096);

That issue was the least of my worries.

**JohnW@Wessex** · November 10th, 2010, 04:21 AM

Originally Posted by Chris_F

Try it for yourself. Make a std::vector of __m128 and try to push one back if you don't believe me.

Well, I removed the parentheses, and this compiles and works.

Code:

#include <vector>

using namespace std;

#include <xmmintrin.h>

int main() 
{
    vector<__m128> myVector;
    __m128 a = _mm_set_ps1(0);
    myVector.push_back(a);  
}

**D_Drmmr** · November 10th, 2010, 05:12 AM

Originally Posted by Chris_F

I don't see the point. I looked at a disassembly and the code is horribly mangled, and in addition I did some basic tests and timed them. If you want to call my methods of testing inaccurate, go ahead,

No need to get defensive. I didn't say your method of testing is flawed.
However, others could benefit from being able to run the test you made, maybe even try to optimize the original code.

Originally Posted by Chris_F

That's exactly how I feel too, but given the results I got, I can't really argue with it unless someone can prove it all wrong. He gets good results with the Intel compiler, but I'm stuck with MSVC.
...
Believe me, I rather not, after all I tried to do things "the C++ way" the first time around.

Even if you need to abandon C++ constructs in some parts of the code for the sake of performance, it doesn't mean you have to resort to C style for everything. IMO, if you observe that some code works faster than some other code, but don't understand why, you should consider it a special circumstance and not a general rule. Only if you understand why one thing is faster than another (maybe not in detail, but at least it shouldn't be magic), you are able to form a general rule.

**Paul McKenzie** · November 10th, 2010, 05:26 AM

Originally Posted by Chris_F

Try it for yourself. Make a std::vector of __m128 and try to push one back if you don't believe me.

A vector is not anything special. It is just C++ code. Therefore, if there is a problem with vector::push_back(), then there is a problem in general with copying and assigning that type.

Maybe that type cannot be assigned or copied safely, which is a requirement for vector. In other words, just run-of-the-mill C++ code would also be faulty if you used value semantics on that type (i.e. passing by value, returning by value, assignment, copying, etc.)

Regards,

Paul McKenzie

**itsmeandnobodyelse** · November 10th, 2010, 05:55 AM

Originally Posted by Chris_F

That's not what I meant. I didn't copy/paste that, if that's what you think, just scribbled something down between code brackets to illustrate my point but made a mistake.

Try it for yourself. Make a std::vector of __m128 and try to push one back if you don't believe me.

Edit: Actually I can't seem to replicate the error. In any case, the last time it happened I was using something to the effect of:

Code:

vector<float4> myVector(4096);

That issue was the least of my worries.

If you can't reproduce the issue, you should remove all of the STLport things you have already configured. I have had very bad experiences with STLport, especially with the MSVC compilers. STLport was developed at a time when each vendor had its own, less compatible implementation of the STL. MSVC 6.0 even shipped with an STL released prior to the C++ Standard in 1998. Nowadays all major compilers have good to very good STL implementations, and most issues are due not to incompatibility but to changes made to conform to the standard. One example is the vector issue you had, which would do what you expected in VC6 and VC7 but not in VC8 (VS2005) or later.

BTW, the standard guarantees that the elements of a std::vector are stored contiguously, like a plain C array. So you can perform any C array operation on them apart from reallocating, and still have the power of a dynamic array with safe allocation.
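A short sketch of what that allows, assuming a hypothetical C-style routine scale_in_place that only understands a pointer and a count:

```cpp
#include <cstddef>
#include <vector>

// C-style routine (hypothetical): knows nothing about std::vector.
inline void scale_in_place(float* data, std::size_t count, float factor) {
    for (std::size_t i = 0; i < count; ++i) data[i] *= factor;
}

// Hands the vector's contiguous storage to the C-style routine.
inline std::vector<float> scaled_copy(std::vector<float> values, float factor) {
    if (!values.empty()) {
        // &values[0] points at the first of values.size() adjacent floats.
        scale_in_place(&values[0], values.size(), factor);
    }
    return values;
}
```

The pointer stays valid only until the vector reallocates, for example after a push_back that exceeds capacity().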

**Lindley** · November 10th, 2010, 07:37 AM

I think you should take a close look at how the Eigen library (linked above) approaches the issue. They also use SSE accelerations when appropriate. This means they've either found a way to make them efficient in a C++ interface, or they might better understand the reasons why you can't do that in some cases. In any event you could probably benefit from asking questions on their mailing list.

**jwbarton** · November 10th, 2010, 02:11 PM

I took a quick look at the code generation for an intrinsic __m128 using direct calls to _mm_add_ps() to add the 4 floats, and then the member routine in your float4 class which also calls _mm_add_ps() to add the 4 floats.

The big difference is that the compiler can't treat your float4 class as an intrinsic __m128, so it needs to update the memory of the object when it produces the result. When directly using an intrinsic __m128, the compiler will generate code that keeps intermediate results in the processor registers. This can make a huge difference in performance.

One way that you can get some of the syntactic sugar that you are looking for is to implement non-member overloads for the basic operations (instead of members of a float4 class). This will allow the compiler to use the result of the operation as an intrinsic and keep it in a processor register.

For example, you can implement the operator+ as follows:

Code:

__m128 operator+(__m128 a, __m128 b)
{
   return _mm_add_ps(a, b);
}

You should be able to provide non-member overloads for +, -, *, / which make calls to the appropriate _mm routines.

You can't implement an operator+= this way, as assignment operators can't be non-member functions.

Note: I am using VS2010, so the results may be different on an earlier version of the compiler.

**laserlight** · November 10th, 2010, 03:20 PM

Originally Posted by jwbarton

You can't implement an operator+= this way, as assignment operators can't be non-member functions.

Actually, you can: it is only the copy assignment operator that must be a member function.

**jwbarton** · November 10th, 2010, 03:48 PM

I stand corrected.

I had the syntax wrong and the compiler error I got didn't help.

Is this correct?

Code:

__m128& operator+=(__m128& a, __m128 b)
{
   return (a = _mm_add_ps(a, b));
}

I am not clear on what the return value of operator+= should be. Should it be a reference or a value?

**laserlight** · November 10th, 2010, 03:56 PM

Originally Posted by jwbarton

I am not clear on what the return value of operator+= should be. Should it be a reference or a value?

The first parameter should be of a reference type since the argument is supposed to be modified. Consequently, it makes sense to return a reference, given that it is normally more efficient to do so.

Is your idea here to keep float4 as a POD type with only a single __m128 member so that the compiler can treat a float4 object as an intrinsic __m128?
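A minimal sketch of that idea, assuming an x86 target with SSE: a plain struct holding a single __m128, with non-member operators that take their arguments by value. make_float4 and lane are hypothetical helpers added for the example. Wrapping the register in a struct also sidesteps a portability issue with overloading operators directly on __m128: GCC and Clang treat __m128 as a built-in vector type and reject such overloads.

```cpp
#include <xmmintrin.h>

// Hypothetical POD wrapper: one __m128 member, no constructors or virtuals.
struct float4 {
    __m128 v;
};

// Builds a float4 with x in lane 0 and w in lane 3.
inline float4 make_float4(float x, float y, float z, float w) {
    float4 r;
    r.v = _mm_setr_ps(x, y, z, w);
    return r;
}

// Non-member operator+, arguments and result passed by value.
inline float4 operator+(float4 a, float4 b) {
    float4 r;
    r.v = _mm_add_ps(a.v, b.v);
    return r;
}

// Non-member operator+=: modifies the left operand and returns it by reference.
inline float4& operator+=(float4& a, float4 b) {
    a.v = _mm_add_ps(a.v, b.v);
    return a;
}

// Reads lane i (0 to 3) back out through memory; intended for inspection only.
inline float lane(float4 a, int i) {
    float out[4];
    _mm_storeu_ps(out, a.v);
    return out[i];
}
```

Whether a given MSVC version keeps such a struct in an SSE register across calls is something to confirm in the disassembly rather than assume.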
