I think you should take a close look at how the Eigen library (linked above) approaches this issue. Eigen also uses SSE acceleration where appropriate, so its developers have either found a way to make it efficient behind a C++ interface, or they understand better why that can't be done in some cases. Either way, you could probably benefit from asking questions on their mailing list.