Before we can suggest ways to speed it up, we need to know how you are currently doing it.

But do you even need to copy in the first place? Could you just use a UInt16 array from the start instead of the original int array?

And does this need to be fully portable, or is machine-specific code (including assembler) allowed?