  1. #1
    Join Date
    Aug 2008
    Posts
    902

    Problems designing my vectorized math library

    OK, so let's make this part clear right off the bat: I know there are plenty of math libraries that already exist (D3D's, XNA's, the list goes on). I'm aware of this.

    My first approach was to create a class named float4 which would contain 4 packed, 16-byte-aligned floating point numbers. It would offer initializers and overloaded operators written with SSE intrinsics for vector and matrix math.

    The problem is that operator overloading doesn't mix well with intrinsics, especially on my compiler, MSVC.

    Here is an example of my implementation:

    Code:
    class float4
    {
    private:
        __m128 v;
    
    public:
    
    ...
    
        inline float4& operator+=(const float4& rhs)
        {
            v = _mm_add_ps(v, rhs.v);
            return *this;
        }
        inline const float4 operator+(const float4& rhs) const
        {
            return float4(*this) += rhs;
        }
    };
    When you compare this code like so:

    Code:
    float4 a(1,2,3,4);
    float4 b(5,6,7,8);
    float4 c = a + b;
    
    // VS
    
    __m128 a = { 1,2,3,4 };
    __m128 b = { 5,6,7,8 };
    __m128 c = _mm_add_ps(a, b);
    You will find that the float4 class with operators is 10x slower than the pure intrinsic code (in my tests, no need to question my methods, they were accurate enough.)

    I found this article, http://www.gamasutra.com/view/featur...form_simd_.php, which talks about this issue. Basically, the compiler is unnecessarily moving the data sse_register >> memory >> x87 FPU >> memory >> sse_register, which is what kills the performance. It's probably slower than pure x87 code. In the article he claims that the solution is to use a C-like interface instead, one which inlines everything and passes by value to avoid moving data out of the SSE registers.

    The headache gets worse at this point, as I have switched to a C-like design in which float4 is actually just a typedef for __m128, as the article suggested.

    Code:
    //Example
    typedef __m128 float4;
    
    inline float4 ps_add(float4 v1, float4 v2)
    {
    	return _mm_add_ps(v1, v2);
    }
    Reimplementing complex math using this instead of the class version is orders of magnitude faster, but these are the issues I'm currently facing:

    1: Microsoft's STL implementation of vector is no good for aligned data. You can implement an aligned allocator, but their version of vector<>.resize() passes by value instead of by reference, so if you try to make a vector of any type of struct or class that contains aligned data, you will receive a compile-time error (oddly, it works OK if you use __m128 in it, but not a class that contains a __m128). The solution to this may be to switch to a different STL; I'm looking into STLport atm.

    2: [strike]I don't know if STLport will solve this next problem. Using MSVC STL, if I create a vector<float4> (ok in this case because float4 is just a __m128), it works just fine, but I can't use push_back at all because if I try to push back a float4 it says ".push_back' must have class/struct/union"[/strike]

    3: The question of how to implement a float4x4, which is a 4x4 float matrix. My first thought was to simply use:

    Code:
    typedef float4* float4x4;
    
    inline float4x4 matrix_create()
    {
    	return (float4x4)_aligned_malloc(sizeof(float4)*4, 16);
    }
    It's easy enough to work with, and performs well, but now I have to remember to use _aligned_free to delete every matrix I make. I tried to use std::array instead, but that doesn't offer a custom allocator in its template arguments, so I can't guarantee 16-byte alignment.

    I feel like I'm stuck between a rock and a hard place.

    Thanks in advance to anyone who took the time to read all of this.
    Last edited by Chris_F; November 10th, 2010 at 04:23 AM.

  2. #2
    Join Date
    Oct 2008
    Posts
    1,456

    Re: Problems designing my vectorized math library

    Quote Originally Posted by Chris_F View Post
    1: Microsoft's STL implementation of vector is no good for aligned data. You can implement an aligned allocator, but their version of vector<>.resize() passes by value instead of by reference, so if you try to make a vector of any type of struct or class that contains aligned data, you will receive a compile-time error (oddly, it works OK if you use __m128 in it, but not a class that contains a __m128). The solution to this may be to switch to a different STL; I'm looking into STLport atm.
    Take a look at how the Eigen library approaches this problem. Its solutions consist of specializing the whole std::vector template; in particular, the newest solution specializes those vectors declared with the Eigen allocator (which acts both as an aligning allocator and as a specialization tag). One more reason to leave this kind of thing to library writers.
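
    The allocator half of this can be sketched as follows. This is only a minimal illustration and not Eigen's implementation: the names aligned_allocator, vec4, vec4_array and is_aligned are hypothetical, and it assumes a C++11 standard library plus _mm_malloc/_mm_free from <xmmintrin.h>. It only controls where the storage comes from; it does nothing about the by-value resize() signature described in the first post.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#include <xmmintrin.h>

// Hypothetical allocator that hands out storage aligned to `Alignment` bytes.
template <typename T, std::size_t Alignment = 16>
struct aligned_allocator {
    typedef T value_type;

    aligned_allocator() noexcept {}
    template <typename U>
    aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    // Spelled out because Alignment is a non-type template parameter,
    // so the default rebind in std::allocator_traits does not apply.
    template <typename U>
    struct rebind { typedef aligned_allocator<U, Alignment> other; };

    T* allocate(std::size_t n) {
        void* p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { _mm_free(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept { return false; }

// Example element type (an assumption): a struct wrapping one SSE register.
struct vec4 { __m128 v; };

typedef std::vector<vec4, aligned_allocator<vec4> > vec4_array;

// Helper to check an address against an alignment in bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

    With this, a vec4_array of any size should report is_aligned(&a[0], 16) as true, including after it grows and reallocates.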

  3. #3
    Join Date
    Jul 2005
    Location
    Netherlands
    Posts
    2,042

    Re: Problems designing my vectorized math library

    Quote Originally Posted by Chris_F View Post
    You will find that the float4 class with operators is 10x slower than the pure intrinsic code (in my tests, no need to question my methods, they were accurate enough.)
    Why not provide some test code for others to run? There's little point talking about performance without being able to measure things.
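
    For illustration, a harness along these lines would let others reproduce the comparison. It is only a sketch: the float4 constructor, the helper names seconds, sum_class and sum_raw, and the use of C++11 <chrono> are assumptions, not the code that was actually timed.

```cpp
#include <chrono>
#include <xmmintrin.h>

// Member-operator class in the style of the first post.
// The constructor and first() accessor are assumptions added for the test.
class float4 {
    __m128 v;
public:
    explicit float4(__m128 x) : v(x) {}
    float4& operator+=(const float4& rhs) { v = _mm_add_ps(v, rhs.v); return *this; }
    const float4 operator+(const float4& rhs) const { return float4(*this) += rhs; }
    float first() const { return _mm_cvtss_f32(v); }  // lowest lane
};

// Runs f once and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds(F f) {
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    f();
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Adds 1.0f to every lane n times through the class operators.
inline float sum_class(int n) {
    float4 acc(_mm_setzero_ps());
    const float4 one(_mm_set1_ps(1.0f));
    for (int i = 0; i < n; ++i) acc = acc + one;
    return acc.first();
}

// The same loop written directly with intrinsics.
inline float sum_raw(int n) {
    __m128 acc = _mm_setzero_ps();
    const __m128 one = _mm_set1_ps(1.0f);
    for (int i = 0; i < n; ++i) acc = _mm_add_ps(acc, one);
    return _mm_cvtss_f32(acc);
}
```

    Wrapping each loop in seconds(...) in an optimized build, and storing the returned sums somewhere observable so the loops cannot be discarded, gives numbers that can be compared across compilers and settings.
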
    Quote Originally Posted by Chris_F View Post
    In the article he claims that the solution is to use a C like interface instead, one which inlines everything and passes by value to avoid moving data out of the SSE registers.
    I am always doubtful when people claim you shouldn't use C++ constructs because they are too slow, and then see them not using common C++ practices. In this case, unnecessary use of pointers and declaring all variables at the beginning of a function.
    Why not try passing by value when using a class?
    Quote Originally Posted by Chris_F View Post
    (oddly it works OK if you use __m128 in it, but not a class that contains a __m128).
    Perhaps there is a specialization for vector<__m128>.
    Quote Originally Posted by Chris_F View Post
    but I can't use push_back at all because if I try to push back a float4 it says ".push_back' must have class/struct/union"
    Show the code.
    Quote Originally Posted by Chris_F View Post
    It's easy enough to work with, and performs well, but now I have to remember to use _aligned_free to delete every matrix I make.
    That's what you get when you throw C++ language features overboard. Instead, you can wrap this in a class and make your life easier.
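
    A minimal sketch of such a wrapper, under some assumptions: it reuses the name float4x4 for a class instead of the pointer typedef, it uses _mm_malloc/_mm_free from <xmmintrin.h> in place of _aligned_malloc/_aligned_free, and the row-indexing interface is made up for the example.

```cpp
#include <cstddef>
#include <new>
#include <xmmintrin.h>

typedef __m128 float4;

// Hypothetical RAII owner of four 16-byte-aligned rows.
class float4x4 {
public:
    float4x4() : rows_(allocate()) {
        for (std::size_t i = 0; i < 4; ++i) rows_[i] = _mm_setzero_ps();
    }
    float4x4(const float4x4& other) : rows_(allocate()) {
        for (std::size_t i = 0; i < 4; ++i) rows_[i] = other.rows_[i];
    }
    float4x4& operator=(const float4x4& other) {
        // Copies element-wise into existing storage, so self-assignment is harmless.
        for (std::size_t i = 0; i < 4; ++i) rows_[i] = other.rows_[i];
        return *this;
    }
    ~float4x4() { _mm_free(rows_); }  // no manual free needed by callers

    float4& operator[](std::size_t i) { return rows_[i]; }
    const float4& operator[](std::size_t i) const { return rows_[i]; }

private:
    static float4* allocate() {
        void* p = _mm_malloc(sizeof(float4) * 4, 16);
        if (!p) throw std::bad_alloc();
        return static_cast<float4*>(p);
    }
    float4* rows_;
};
```

    The matrix is then released automatically when it goes out of scope, and copies own their own storage.
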
    Cheers, D Drmmr

    Please put [code][/code] tags around your code to preserve indentation and make it more readable.

    As long as man ascribes to himself what is merely a possibility, he will not work for the attainment of it. - P. D. Ouspensky

  4. #4
    Join Date
    Aug 2008
    Posts
    902

    Re: Problems designing my vectorized math library

    Quote Originally Posted by D_Drmmr View Post
    Why not provide some test code for others to run? There's little point talking about performance without being able to measure things.
    I don't see the point. I looked at a disassembly and the code is horribly mangled, and in addition I did some basic tests and timed them. If you want to call my methods of testing inaccurate, go ahead, but I assure you they wouldn't account for an order of magnitude difference in performance. We're talking about the difference between waiting around 20 seconds for a test to complete and waiting around for over 3 minutes.

    I am always doubtful when people claim you shouldn't use C++ constructs because they are too slow, and then see them not using common C++ practices. In this case, unnecessary use of pointers and declaring all variables at the beginning of a function.
    Why not try passing by value when using a class?
    That's exactly how I feel too, but given the results I got, I can't really argue with it unless someone can prove it all wrong. He gets good results with the Intel compiler, but I'm stuck with MSVC.

    Show the code.
    What? There isn't exactly much to show.

    Code:
    vector<__m128> myVector;
    __m128 a = _mm_set_ps1(0);
    myVector.push_back(a);    // Compiler Error, __m128 is apparently neither class, struct nor union

    That's what you get when you throw C++ language features overboard.
    Believe me, I'd rather not; after all, I tried to do things "the C++ way" the first time around.
    Last edited by Chris_F; November 10th, 2010 at 04:14 AM.

  5. #5
    Join Date
    Jul 2002
    Location
    Portsmouth. United Kingdom
    Posts
    2,727

    Re: Problems designing my vectorized math library

    Code:
    vector<__m128> myVector(); // A function called 'myVector' that takes no parameters and returns a vector<__m128>
    __m128 a = _mm_set_ps1(0);
    myVector.push_back(a); // Compile error!
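
    The distinction can be checked at compile time. A small sketch, using int instead of __m128 for brevity and assuming C++11 for static_assert and <type_traits>; the names as_function and as_object are made up for the example:

```cpp
#include <type_traits>
#include <vector>

// With empty parentheses this declares a function returning a vector.
std::vector<int> as_function();

// Without parentheses this defines an actual vector object.
std::vector<int> as_object;

static_assert(std::is_function<decltype(as_function)>::value,
              "as_function is a function declaration, not an object");
static_assert(!std::is_function<decltype(as_object)>::value,
              "as_object is a real vector");
```

    Only as_object has member functions such as push_back; applying .push_back to as_function is what produces the "must have class/struct/union" style of error.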
    "It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong."
    Richard P. Feynman

  6. #6
    Join Date
    Aug 2008
    Posts
    902

    Re: Problems designing my vectorized math library

    Quote Originally Posted by JohnW@Wessex View Post
    Code:
    vector<__m128> myVector(); // A function called 'myVector' that takes no parameters and returns a vector<__m128>
    __m128 a = _mm_set_ps1(0);
    myVector.push_back(a); // Compile error!
    That's not what I meant. I didn't copy/paste that, if that's what you think, just scribbled something down between code brackets to illustrate my point but made a mistake.

    Try it for yourself. Make a std::vector of __m128 and try to push one back if you don't believe me.

    Edit: Actually I can't seem to replicate the error. In any case, the last time it happened I was using something to the effect of:

    Code:
    vector<float4> myVector(4096);
    That issue was the least of my worries.
    Last edited by Chris_F; November 10th, 2010 at 04:22 AM.

  7. #7
    Join Date
    Jul 2002
    Location
    Portsmouth. United Kingdom
    Posts
    2,727

    Re: Problems designing my vectorized math library

    Quote Originally Posted by Chris_F View Post
    Try it for yourself. Make a std::vector of __m128 and try to push one back if you don't believe me.
    Well, I removed the parentheses, and this compiles and works.
    Code:
    #include <vector>
    
    using namespace std;
    
    #include <xmmintrin.h>
    
    int main() 
    {
        vector<__m128> myVector;
        __m128 a = _mm_set_ps1(0);
        myVector.push_back(a);  
    }

  8. #8
    Join Date
    Jul 2005
    Location
    Netherlands
    Posts
    2,042

    Re: Problems designing my vectorized math library

    Quote Originally Posted by Chris_F View Post
    I don't see the point. I looked at a disassembly and the code is horribly mangled, and in addition I did some basic tests and timed them. If you want to call my methods of testing inaccurate, go ahead,
    No need to get defensive. I didn't say your method of testing is flawed.
    However, others could benefit from being able to run the test you made, maybe even try to optimize the original code.
    Quote Originally Posted by Chris_F View Post
    That's exactly how I feel too, but given the results I got, I can't really argue with it unless someone can prove it all wrong. He gets good results with the Intel compiler, but I'm stuck with MSVC.
    ...
    Believe me, I'd rather not; after all, I tried to do things "the C++ way" the first time around.
    Even if you need to abandon C++ constructs in some parts of the code for the sake of performance, it doesn't mean you have to resort to C style for everything. IMO, if you observe that some code works faster than some other code, but don't understand why, you should consider it a special circumstance and not a general rule. Only if you understand why one thing is faster than another (maybe not in detail, but at least it shouldn't be magic) are you able to form a general rule.

  9. #9
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Problems designing my vectorized math library

    Quote Originally Posted by Chris_F View Post
    Try it for yourself. Make a std::vector of __m128 and try to push one back if you don't believe me.
    A vector is not anything special. It is just C++ code. Therefore, if there is a problem with vector::push_back(), then there is a problem in general with copying and assigning that type.

    Maybe that type cannot be assigned or copied safely, which is a requirement for vector. In other words, just run-of-the-mill C++ code would also be faulty if you used value semantics on that type (i.e. passing by value, returning by value, assignment, copying, etc.)

    Regards,

    Paul McKenzie

  10. #10
    Join Date
    Oct 2009
    Posts
    577

    Re: Problems designing my vectorized math library

    Quote Originally Posted by Chris_F View Post
    That's not what I meant. I didn't copy/paste that, if that's what you think, just scribbled something down between code brackets to illustrate my point but made a mistake.

    Try it for yourself. Make a std::vector of __m128 and try to push one back if you don't believe me.

    Edit: Actually I can't seem to replicate the error. In any case, the last time it happened I was using something to the effect of:

    Code:
    vector<float4> myVector(4096);
    That issue was the least of my worries.
    If you can't reproduce the issue, you should remove all of the STLport things you already have configured. I have had very bad experiences with STLport, especially with the MSVC compilers. STLport was developed at a time when each vendor had its own, less compatible implementation of the STL. MSVC 6.0 even shipped an STL released prior to the C++ Standard in 1998. Nowadays all major compilers have good to very good STL implementations, and most issues are due not to incompatibility but to changes made to conform to the standard. One example is the vector issue you had, which would do what you expected in VC6 and VC7 but not in VC8 (VS2005) or later.

    BTW, the standard guarantees that the elements of a std::vector are stored contiguously, like a plain C array. So you can perform all C array operations on it, apart from reallocating, and still have the power of a dynamic array with safe allocation.
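
    As a small sketch of what that allows (the function names scale_in_place and scale_vector are made up for the example), a C-style routine can work directly on the vector's storage:

```cpp
#include <cstddef>
#include <vector>

// C-style routine that only knows about raw arrays.
inline void scale_in_place(float* data, std::size_t count, float factor) {
    for (std::size_t i = 0; i < count; ++i) data[i] *= factor;
}

// Hands the vector's contiguous storage to the C-style routine.
inline void scale_vector(std::vector<float>& values, float factor) {
    if (!values.empty())  // &values[0] is only valid for a non-empty vector
        scale_in_place(&values[0], values.size(), factor);
}
```

    The pointer stays valid only until the vector reallocates, for example through push_back or resize, so it should be fetched again after any such call.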

  11. #11
    Lindley
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: Problems designing my vectorized math library

    I think you should take a close look at how the Eigen library (linked above) approaches the issue. They also use SSE accelerations when appropriate. This means they've either found a way to make them efficient in a C++ interface, or they might better understand the reasons why you can't do that in some cases. In any event you could probably benefit from asking questions on their mailing list.

  12. #12
    Join Date
    Jan 2001
    Posts
    253

    Re: Problems designing my vectorized math library

    I took a quick look at the code generation for an intrinsic __m128 using direct calls to _mm_add_ps() to add the 4 floats, and then the member routine in your float4 class which also calls _mm_add_ps() to add the 4 floats.

    The big difference is that the compiler can't treat your float4 class as an intrinsic __m128, so it needs to update the memory of the object when it produces the result. When directly using an intrinsic __m128, the compiler will generate code that keeps intermediate results in the processor registers. This can make a huge difference in performance.

    One way that you can get some of the syntactic sugar that you are looking for is to implement non-member overloads for the basic operations (instead of members of a float4 class). This will allow the compiler to use the result of the operation as an intrinsic and keep it in a processor register.

    For example, you can implement the operator+ as follows:
    Code:
    __m128 operator+(__m128 a, __m128 b)
    {
       return _mm_add_ps(a, b);
    }
    You should be able to provide non-member overloads for +, -, *, / which make calls to the appropriate _mm routines.

    You can't implement an operator+= this way, as assignment operators can't be non-member functions.

    Note: I am using VS2010, so the results may be different on an earlier version of the compiler.

  13. #13
    Join Date
    Jan 2006
    Location
    Singapore
    Posts
    6,765

    Re: Problems designing my vectorized math library

    Quote Originally Posted by jwbarton
    You can't implement an operator+= this way, as assignment operators can't be non-member functions.
    Actually, you can: it is only the copy assignment operator that must be a member function.
    C + C++ Compiler: MinGW port of GCC
    Build + Version Control System: SCons + Bazaar

    Look up a C/C++ Reference and learn How To Ask Questions The Smart Way
    Kindly rate my posts if you found them useful

  14. #14
    Join Date
    Jan 2001
    Posts
    253

    Re: Problems designing my vectorized math library

    I stand corrected.

    I had the syntax wrong and the compiler error I got didn't help.

    Is this correct?

    Code:
    __m128& operator+=(__m128& a, __m128 b)
    {
       return (a = _mm_add_ps(a, b));
    }
    I am not clear on what the return value of operator+= should be. Should it be a reference or a value?

  15. #15
    Join Date
    Jan 2006
    Location
    Singapore
    Posts
    6,765

    Re: Problems designing my vectorized math library

    Quote Originally Posted by jwbarton
    I am not clear on what the return value of operator+= should be. Should it be a reference or a value?
    The first parameter should be of a reference type since the argument is supposed to be modified. Consequently, it makes sense to return a reference, given that it is normally more efficient to do so.

    Is your idea here to keep float4 as a POD type with only a single __m128 member so that the compiler can treat a float4 object as an intrinsic __m128?
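
    One reading of that idea, as a hedged sketch: a float4 struct with a single __m128 member and non-member operators that take their arguments by value. The helpers make_float4 and lane0 are hypothetical, and whether a given compiler really keeps such a struct in SSE registers has to be confirmed from the generated code. Note also that the first post reports a compile-time error on MSVC when a type containing aligned data is passed by value (in vector resize()), so the by-value signatures below may need adjusting there.

```cpp
#include <xmmintrin.h>

// Hypothetical POD wrapper: one __m128 member, no constructors, no virtuals.
struct float4 {
    __m128 v;
};

// Builds a float4 with x in the lowest lane (_mm_setr_ps takes lanes low to high).
inline float4 make_float4(float x, float y, float z, float w) {
    float4 r;
    r.v = _mm_setr_ps(x, y, z, w);
    return r;
}

// Non-member operators taking their operands by value.
inline float4 operator+(float4 a, float4 b) {
    float4 r;
    r.v = _mm_add_ps(a.v, b.v);
    return r;
}

inline float4 operator*(float4 a, float4 b) {
    float4 r;
    r.v = _mm_mul_ps(a.v, b.v);
    return r;
}

// Compound assignment modifies its first argument and returns it by reference.
inline float4& operator+=(float4& a, float4 b) {
    a.v = _mm_add_ps(a.v, b.v);
    return a;
}

// Reads the lowest lane, mainly for checking results.
inline float lane0(float4 a) { return _mm_cvtss_f32(a.v); }
```

    Because operator+= returns a reference to its first argument, expressions such as (c += a) += b behave as they do for built-in types.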
