Problems designing my vectorized math library

**jwbarton** · November 10th, 2010, 04:01 PM

I am not the original poster. I was just curious what the reasons were for the performance problems reported by the original poster.

My idea was to try to provide the original poster with the syntactic sugar that would make it possible to use the __m128 intrinsic type with normal math operators instead of needing to call the _mm routines.

The original poster set up the float4 class to provide operator overloads for this syntactic sugar.

**Chris_F** · November 10th, 2010, 05:46 PM

Thanks for the suggestions, jwbarton.

I gave those a try. The non-member overloads seem to perform just as well as the inline functions, at least in release mode. In debug, they are a couple of times slower.

It still seems as if the best method of implementing float4 is as a typedef for __m128, with some non-member overloads for convenience, but I am still unsure of how to implement float4x4.
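A minimal sketch of that typedef-plus-overloads idea, not the code from the thread. It assumes an x86 target with SSE. The helpers `lane` and `scale_and_offset` are hypothetical names added for illustration. The overloads are only compiled for MSVC, because as far as I know GCC and Clang treat `__m128` as a built-in vector type that already supports these operators and will not accept user overloads on it.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_storeu_ps, ...

// float4 is just another name for the intrinsic type, so the compiler can
// keep values in SSE registers as it would for raw __m128.
typedef __m128 float4;

#if defined(_MSC_VER) && !defined(__clang__)
// MSVC models __m128 as a union, so the arithmetic operators are supplied
// here as non-member overloads that forward to the _mm intrinsics.
inline float4 operator+(float4 a, float4 b) { return _mm_add_ps(a, b); }
inline float4 operator-(float4 a, float4 b) { return _mm_sub_ps(a, b); }
inline float4 operator*(float4 a, float4 b) { return _mm_mul_ps(a, b); }
inline float4& operator+=(float4& a, float4 b) { a = _mm_add_ps(a, b); return a; }
#endif
// Assumption: on GCC and Clang the built-in vector operators are used instead.

// Hypothetical helper: copy one lane (0..3) out of a float4 for inspection.
inline float lane(float4 v, int i) {
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}

// Hypothetical helper showing how an expression reads with the operators.
inline float4 scale_and_offset(float4 v, float4 scale, float4 offset) {
    return v * scale + offset;
}
```

This says nothing yet about float4x4, which is still the open question.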

**Chris_F** · November 10th, 2010, 06:17 PM

Actually, I took a look through the code in the article I linked in my original post, and he has non-member overloads as well, with the following comment above them:

Code:

//	Overloaded operators, left here just as a reference.
//	WARNING: This bloats the code as expressions grow

**monarch_dodra** · November 11th, 2010, 06:39 AM

Originally Posted by Chris_F

Thanks for the suggestions, jwbarton.

I gave those a try. The non-member overloads seem to perform just as well as the inline functions, at least in release mode. In debug, they are a couple of times slower.

It still seems as if the best method of implementing float4 is as a typedef for __m128, with some non-member overloads for convenience, but I am still unsure of how to implement float4x4.

Performance of a debug build is irrelevant.

The difference between a non-member (potentially non-friend) operator, and a member operator, is purely conceptual. The only difference should be if the compiler allows or doesn't allow the operator, but the result should be the same.

Originally Posted by Chris_F

Actually, I took a look through the code in the article I linked in my original post, and he has non-member overloads as well, with the following comment above them:

Code:

//	Overloaded operators, left here just as a reference.
//	WARNING: This bloats the code as expressions grow

Inline methods are, by definition, code bloat, but a good kind of code bloat. The alternatives are either to make them non-inline, in which case you'd probably feel a performance difference of several orders of magnitude, or not to provide the overloads, in which case users would just write the same thing by hand.

The important part is for the users to understand the cost of each operation, and always choose the right one:

Code:

float4 a = b + c + d;
vs
float4 a = b;
a+=c;
a+=d;

Chances are the second version is much faster. I know some libraries use template magic and temporary objects to optimize the first version, but I call it pointless. It's nothing more than syntactic sugar for programmers who should be good enough to understand why they shouldn't have been using the first version in the first place.

PS: for operator+, consider:

Code:

float4 operator+(const float4& lhs, const float4& rhs)
{
    float4 ret = lhs; ret+=rhs;
    return ret;
}

The act of creating a named variable, rather than a temporary, can help trigger NRVO (named return value optimization). You can read more about it here:

Boost::operators, or better yet, just use boost operators, and forget about it.
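For completeness, a self-contained sketch of the operator+ pattern above. The `float4` here is a hypothetical stand-in made of four plain floats (no SSE), chosen only so the example compiles on any platform. Whether NRVO actually kicks in is up to the compiler.

```cpp
// Hypothetical stand-in for float4: four plain floats, no intrinsics.
struct float4 {
    float x, y, z, w;

    // Compound assignment does the real work, in place.
    float4& operator+=(const float4& rhs) {
        x += rhs.x;
        y += rhs.y;
        z += rhs.z;
        w += rhs.w;
        return *this;
    }
};

// Non-member operator+ written in terms of operator+=.
inline float4 operator+(const float4& lhs, const float4& rhs) {
    float4 ret = lhs;  // named local, so the compiler may apply NRVO
    ret += rhs;
    return ret;
}
```

With this in place, `b + c` and `a = b; a += c;` are guaranteed to agree, since there is only one implementation of the addition.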

**jwbarton** · November 11th, 2010, 02:56 PM

Originally posted by monarch_dodra
The difference between a non-member (potentially non-friend) operator, and a member operator, is purely conceptual. The only difference should be if the compiler allows or doesn't allow the operator, but the result should be the same.

While conceptually this is true, in practice it depends on the compiler implementation. It is true that the computed result of using a non-member operator with __m128 is the same as making a class that contains an __m128 and using a member operator.

However, the code generated isn't the same (at least with the VS2010 compiler that I use). The compiler understands __m128 as an intrinsic type which it can pass around in the SSE registers of the processor. When using an __m128 member of a class, it stops passing around the results in the SSE registers, and makes significantly more loads and stores from the member variable of the class.
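To make the comparison concrete, here is a rough sketch of the two layouts being contrasted, assuming an x86 target with SSE. All names (`float4_wrapped`, `float4_raw`, `add`, `sum3_wrapped`, `sum3_raw`) are hypothetical and only for illustration. Both variants compute the same values, so any difference shows up only in the generated code, and that will vary by compiler and version.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps

// Variant 1: a class that contains an __m128 and uses a member operator.
struct float4_wrapped {
    __m128 v;

    float4_wrapped& operator+=(const float4_wrapped& rhs) {
        v = _mm_add_ps(v, rhs.v);
        return *this;
    }
};

inline float4_wrapped operator+(const float4_wrapped& lhs, const float4_wrapped& rhs) {
    float4_wrapped ret = lhs;
    ret += rhs;
    return ret;
}

// Variant 2: the intrinsic type itself, passed by value to a free function.
typedef __m128 float4_raw;

inline float4_raw add(float4_raw a, float4_raw b) {
    return _mm_add_ps(a, b);
}

// The same three-term sum written against each variant. Comparing the
// disassembly of these two functions shows whether the wrapper costs
// extra loads and stores on a given compiler.
inline float4_wrapped sum3_wrapped(const float4_wrapped& a,
                                   const float4_wrapped& b,
                                   const float4_wrapped& c) {
    return a + b + c;
}

inline float4_raw sum3_raw(float4_raw a, float4_raw b, float4_raw c) {
    return add(add(a, b), c);
}
```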

Originally posted by monarch_dodra

Code:

float4 a = b + c + d;
vs
float4 a = b;
a+=c;
a+=d;

Chances are the second version is much faster.

As far as whether using the first or second version is faster, this is also a quality of implementation issue. As far as I can tell (I only tried this simple example), the code generation looks the same for both versions when using non-member operators with the __m128 intrinsic. It may change when using a more complicated expression. Someone looking for the fastest possible result would need to verify that the compiler generated acceptable code or would need to code it explicitly.
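One way to check this on your own compiler is to put the two forms in separate functions and compare the assembly listing (for example `/FA` with MSVC, or `-S` with GCC and Clang). The sketch below assumes x86 with SSE. The function names `sum_expression` and `sum_compound` are hypothetical, and the overloads are only compiled for MSVC on the assumption that GCC and Clang already provide operators for `__m128`.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps

typedef __m128 float4;

#if defined(_MSC_VER) && !defined(__clang__)
// Non-member overloads for MSVC; other compilers use built-in vector operators.
inline float4 operator+(float4 a, float4 b) { return _mm_add_ps(a, b); }
inline float4& operator+=(float4& a, float4 b) { a = _mm_add_ps(a, b); return a; }
#endif

// Form 1: a single chained expression.
float4 sum_expression(float4 b, float4 c, float4 d) {
    float4 a = b + c + d;
    return a;
}

// Form 2: copy, then compound assignment.
float4 sum_compound(float4 b, float4 c, float4 d) {
    float4 a = b;
    a += c;
    a += d;
    return a;
}
```

If both functions come out as the same pair of add instructions, the choice between the two forms is purely about readability for that compiler and that expression.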

**Chris_F** · November 11th, 2010, 05:28 PM

Yes monarch_dodra, the "bloat" I was referring to was not the usual kind that results from inlining a function, but instead the unnecessary shuffling of data in and out of the SSE registers.
