I have been trying to write a reduction kernel to sum the contents of a very large array. I asked this question on Stack Overflow, but I still don't fully understand parts of the answer. For starters, what is meant by Grizzly's last suggestion? In the example below, a step reduction, what is meant by stride, for example? Do I call this with a global size smaller than the number of items in the array, so that it reduces the array to a new array with as many items as the global work size?

Code:
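__kernel void reduction_step(__global const unsigned long* A, __global unsigned long* C, uint size) {
    // start and stride are not defined anywhere in this snippet
    unsigned long sum = 0;
    // unsigned counter so the comparison against the uint size is not mixed-sign
    for (uint i = start; i < size; i += stride)
        sum += A[i];
    // each work-item writes its partial sum to its own slot in C
    C[get_global_id(0)] = sum;
}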
http://stackoverflow.com/questions/8...ting-cuda-code
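
To make the question concrete, here is how I currently read the suggestion, sketched as plain C that emulates the kernel on the host. It rests on an assumption that is not stated in the snippet: that start is get_global_id(0) and stride is get_global_size(0), so each work-item sums every stride-th element beginning at its own index. The names reduce_item and reduction_step_host are made up for this illustration.

```c
#include <assert.h>

/* Emulates one work-item of reduction_step on the host.
   ASSUMPTION (not in the original snippet):
     start  = get_global_id(0)   -> gid
     stride = get_global_size(0) -> gsize */
static unsigned long reduce_item(const unsigned long *A, unsigned int size,
                                 unsigned int gid, unsigned int gsize)
{
    unsigned long sum = 0;
    for (unsigned int i = gid; i < size; i += gsize)
        sum += A[i];
    return sum;
}

/* Emulates one kernel launch: the loop over gid stands in for the
   work-items, which would run in parallel on the device.
   C must have room for global_size partial sums. */
static void reduction_step_host(const unsigned long *A, unsigned long *C,
                                unsigned int size, unsigned int global_size)
{
    assert(global_size > 0); /* a stride of 0 would never terminate */
    for (unsigned int gid = 0; gid < global_size; ++gid)
        C[gid] = reduce_item(A, size, gid, global_size);
}
```

If that reading is right, then with A holding 1 to 10 and a global size of 4, C ends up with 4 partial sums (work-item 0 adds A[0], A[4] and A[8], and so on), and running the step again on C with a global size of 1 leaves the full total in a single element.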