I am not the original poster. I was just curious about the cause of the performance problems the original poster reported.

My idea was to give the original poster the syntactic sugar that makes it possible to use the __m128 intrinsic type with normal math operators instead of calling the _mm routines directly.

The original poster set up the float4 class to provide operator overloads for this syntactic sugar.
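For anyone following along, here is a rough sketch of what such a wrapper might look like. This is not the original poster's code: the member name v, the constructors, and the store() helper are my own assumptions for illustration. It only relies on the standard SSE intrinsics from <xmmintrin.h>.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps

// Hypothetical float4 wrapper; the original poster's class may differ.
struct float4 {
    __m128 v;  // four packed single-precision floats

    float4() : v(_mm_setzero_ps()) {}
    explicit float4(__m128 m) : v(m) {}
    // _mm_setr_ps stores its arguments in memory order: x is element 0.
    float4(float x, float y, float z, float w) : v(_mm_setr_ps(x, y, z, w)) {}

    // Unaligned store, so `out` does not need 16-byte alignment.
    void store(float* out) const { _mm_storeu_ps(out, v); }
};

// Each operator forwards to exactly one packed intrinsic.
inline float4 operator+(float4 a, float4 b) { return float4(_mm_add_ps(a.v, b.v)); }
inline float4 operator-(float4 a, float4 b) { return float4(_mm_sub_ps(a.v, b.v)); }
inline float4 operator*(float4 a, float4 b) { return float4(_mm_mul_ps(a.v, b.v)); }
inline float4 operator/(float4 a, float4 b) { return float4(_mm_div_ps(a.v, b.v)); }
```

With this in place, something like (a + b) * c can be written instead of _mm_mul_ps(_mm_add_ps(a, b), c). Whether the compiler keeps the wrapper in registers as well as it does a bare __m128 depends on the compiler and optimization settings, so it is worth checking the generated assembly.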