Optimization suggestions

**lawl_rock** · January 21st, 2011, 04:22 PM

I hope I'm not overstepping any boundaries, but I would really appreciate some help in optimizing these functions I wrote. I'm just looking for some general pointers.

Code:

int sum(int arr[], int start, int length) {
	int re=0;
	for (int i=start; i<length; i++) re+=arr[i];
	return re;
}

void divArr(int org[], int db, int length) {
	for (int i=0; i<length; i++) org[i]/=db;
}

void zeros(int arr[], int length) {
	for (int i=0; i<length; i++) arr[i]=0;
}

int round(float r) {
	return (int)floor(r+0.5);
}

void fastsmooth(int Y[], int smoothwidth, int length) {
	int w=smoothwidth;
	int SumPoints=sum(Y, 0, w);
	int* s=new int[length];
	zeros(s, length);
	int halfw=round((float)w/2);
	for (int k=0; k<length-w; k++) {
		s[k+halfw-1]=SumPoints;
		SumPoints-=Y[k];
		SumPoints+=Y[k+w];
	}
	divArr(s, w, length);
	SumPoints=sum(s, 0, w);
	zeros(Y, length);
	for (int k=0; k<length-w; k++) {
		Y[k+halfw-1]=SumPoints;
		SumPoints-=s[k];
		SumPoints+=s[k+w];
	}
	divArr(Y, w, length);
	delete [] s;
}

void deriv(int Y[], int length) {
	int* d=new int[length];
	d[0]=Y[1]-Y[0];
	d[length-1]=Y[length-1]-Y[length-2];
	for (int j=1; j<length-1; j++) {
		d[j]=(Y[j+1]-Y[j-1])/2;
	}
	memcpy((void*)Y, (void*)d, length*sizeof(float));
	delete [] d;
}

These functions are called for quite large sets of data, with 10,000,000 being a reasonable average length (I'm dealing with PCM data), and several (five, generally) such arrays being handled at once.

This code is also applied to similar data, and I'm aware that there may not be too many options for it, but again, I'd appreciate any general direction.

Code:

float Butterworth::Run(float input) {
	float output=input*gain;
	float new_hist;

	output-=history1*coef0;
	new_hist=output-history2*coef1;

	output=new_hist+history1*2.f;
	output+=history2;

	history2=history1;
	history1=new_hist;

	output-=history3*coef2;
	new_hist=output-history4*coef3;

	output=new_hist+history3*2.f;
	output+=history4;

	history4=history3;
	history3=new_hist;

	return output;
}

I realize that this is a lot, so thank you in advance if you take the time to give me any tips!

**Lindley** · January 21st, 2011, 04:30 PM

For the function zeros(), a call to memset() will usually be much faster than looping.

For the others, consider marking the functions as inline and putting them in a header file, as this may improve compiler optimization opportunities.

One thing which may or may not help is the use of standard algorithms. For instance, sum() is doing essentially the same thing as std::accumulate() in <numeric>, but the latter may be slightly faster if the compiler vendor was able to take advantage of any tricks for speed when writing it.

For deriv, you could consider using stack allocation (non-standard but usually available in the form of alloca() or something similar) rather than heap allocation for speed. (Variable sized arrays actually are in C99 and would do the same thing, but they aren't supported in C++.)

**Moschops** · January 21st, 2011, 07:38 PM

Loop unrolling sometimes helps.

Instead of

Code:

for (int i=0; i<some_val; i++)
{
  someFunction_involving(array[i]);
}

Code:

for (int i=0; i<some_val; i+=10)
{
  someFunction_involving(array[i]);
  someFunction_involving(array[i+1]);
  someFunction_involving(array[i+2]);
  someFunction_involving(array[i+3]);
  someFunction_involving(array[i+4]);
  someFunction_involving(array[i+5]);
  someFunction_involving(array[i+6]);
  someFunction_involving(array[i+7]);
  someFunction_involving(array[i+8]);
  someFunction_involving(array[i+9]);
}

Depending on the start and end values of i, you'll have to allow for end cases, but nonetheless I find it often helps.

**D_Drmmr** · January 21st, 2011, 07:45 PM

First off all, what's wrong with your space bar? Your code is packed together like a bunch of penguins.

Originally Posted by lawl_rock

These functions are called for quite large sets of data, with 10,000,000 being a reasonable average length (I'm dealing with PCM data), and several (five, generally) such arrays being handled at once.

In that case, I think you should try to optimize cache locality. You are going over the entire dataset dozens of times. Something like this

Code:

	int* s=new int[length];
	zeros(s, length);

is totally unnecessary, you can initialize the values to 0 when allocating them.

Code:

	for (int k=0; k<length-w; k++) {
		s[k+halfw-1]=SumPoints;
		SumPoints-=Y[k];
		SumPoints+=Y[k+w];
	}
	divArr(s, w, length);
	SumPoints=sum(s, 0, w);
	zeros(Y, length);

Here you perform three loops (disregarding the call to sum) where only one loop is necessary. The call to zeros is almost a complete waste, since the values will be overwritten in the following loop anyway.

**Philip Nicoletti** · January 21st, 2011, 08:01 PM

You might want to investigate OpenMP

**lawl_rock** · January 21st, 2011, 08:17 PM

Originally Posted by Moschops

Loop unrolling sometimes helps.

I've attempted that but found that it made minimal difference and made my code 10 times harder to read, but I might go back and optimize in that way further down the line.

Originally Posted by D_Drmmr

First off all, what's wrong with your space bar? Your code is packed together like a bunch of penguins.

I started programming writing mods for a sandbox type game (it was named Burning Sand), which had stricter spacing requirements, and it stuck.

Originally Posted by D_Drmmr

Something like this

Code:

	int* s=new int[length];
	zeros(s, length);

is totally unnecessary, you can initialize the values to 0 when allocating them.

Code:

	for (int k=0; k<length-w; k++) {
		s[k+halfw-1]=SumPoints;
		SumPoints-=Y[k];
		SumPoints+=Y[k+w];
	}
	divArr(s, w, length);
	SumPoints=sum(s, 0, w);
	zeros(Y, length);

Here you perform three loops (disregarding the call to sum) where only one loop is necessary. The call to zeros is almost a complete waste, since the values will be overwritten in the following loop anyway.

Here, my concern is that a 0 might overwrite a previously calculated value, although looking closer now, I don't think it's possible. I'm sure that with some if statements I'll be able to fix the problem.

Originally Posted by Philip Nicoletti

You might want to investigate OpenMP

I will definitely look into that, considering that I'm dealing with multiple arrays at once.

**nuzzle** · January 22nd, 2011, 03:23 AM

Originally Posted by lawl_rock

These functions are called for quite large sets of data, with 10,000,000 being a reasonable average length (I'm dealing with PCM data), and several (five, generally) such arrays being handled at once.

1. Reduce the number of loops. For example in fastsmooth() you first calculate all SumPoints in one loop and then you have another loop where you divide everything with w. You could as well do this in just one loop:

s[k+halfw-1]=SumPoints / w;

Furthermore you "smooth" the dataset twice, first once to s[] and then exactly the same back to Y[] again. Preferably both "smoothings" should be done once in the same loop.

And the "zeroing" of the whole arrays is not necessary since you owerwrite most elements later anyway..

2. Optimize the loop itself. The compiler probably will do it for you but why take the chance? I think you should subtract the 1 from halfw outside the loop and rearrange the calculation of the new SumPoints like this,

s[k+halfw] = SumPoints / w;
SumPoints += (Y[k+w] - Y[k]);

It may also pay off to use a pointer as loop variable instead of indexed array accesses.

3. Get rid of the big temporary arrays. You use them not to overwrite array elements but there's another way and that's to calculate ahead of writing. For example in fastsmooth() you could temporary store halfw SumPoints (for example in a circular buffer). When you calculate a new SumPoint you use Y[k+w] and Y[k] but you cannot write it back to Y[k+halfw] as of yet because this element is needed later. But the Y[k] position becomes available and here you write the SumPoint you calculated halfw iterations before (and which is now available in the circular buffer).

The above technique is even easier to use in deriv(). You don't need d[] and you don't need the memcopy. By delaying the overwrite of an element until it's read you can write to Y[] directly.

4. On processors with multiple cores you can divide up the work among the cores. For example with 4 cores each core can independently process 1/4 of a Y[] array at the same time. In practice this should speed up things 3-4 times.

**nuzzle** · January 22nd, 2011, 10:58 AM

In addition to my previous reply, this is how I think deriv should look (not tested):

Code:

void deriv(int Y[], int length) {
	int b1 = Y[0];
	int b2 = Y[1];
	Y[0] = b2 - b1;
	int L = length - 1;
	for (int j=1; j<L;) {
		int b0 = b1;
		b1 = b2;
		b2 = Y[j+1];
		Y[j++] = (b2 - b0)/2;
	}
	Y[L] -= Y[L-1];
}

The b0, b1 and b2 variables replace the circular buffer I was talking about because the number of elements that need to be saved are so few. In principle just two b-variables would be necessary but by using one extra the Y array needs to be read just once per iteration and that's good.

There are a few more operations per iteration but since the allocation/deallocation of the big temporary buffer and the copying step are gone I still think it's faster than the original. Finally I'm sure the biggest improvement would come from utilizing multiple cores by letting the algorithm process several parts of the Y array in parallel.

Optimizing fastsmooth would be more involved but essentially according to the same principle as deriv.

**VladimirF** · January 22nd, 2011, 02:00 PM

Originally Posted by lawl_rock

...I would really appreciate some help in optimizing these functions I wrote. I'm just looking for some general pointers.

A couple of general suggestions.
Since you have a LARGE array of data, I would look into utilizing SSE instructions, either by hand or by using compiler capable of generating SSE.
What is the valid range of your data? Can you go to something smaller than int? That will benefit you in two ways: better use of cache and possibility to pack more data items into single SSE instruction.
I would avoid micro-optimization (like loop unrolling) until all other options are exhausted. Besides, if the good compiler didn’t do it – most likely it is not going to help (and may hurt). BTW, what compiler do you use?
And, of course, the best optimization is the selection of best algorithm.
It looks like in your deriv() function you can avoid new array allocation and copy by using a couple of temp variables to store new calculated value of Y[n] for a step or two.
About your Butterworth::Run() function – are all undeclared variable changing outside of that function? I am 99% sure that floating point operations are performed on a double values internally, so using float adds extra conversions to/from double. When called millions of times, that could get expensive.
And the last thing (that is actually the first most important thing) – what did your profiler say? What part of your code takes most of the time?

**lawl_rock** · January 24th, 2011, 08:41 PM

nuzzle, thank you, your suggestions really helped.
VladimirF, I'm using VC++. I changed float to double and didn't see a noticeable change, but it might have just been subtle. According to the profiler and my observations, Butterworth::Run() is currently the most expensive function. The variables in that function are members of the Butterworth class and only change when I set the sample rate and frequency of the Butterworth filter.

**JohnW@Wessex** · January 25th, 2011, 03:55 AM

On my copy of Visual Studio 2008 I have the option of selecting 'precise' 'strict' or 'fast' floating point calculations. We found a significant improvement in speed when selecting 'fast', with minimal impact on results in our applications.

**VladimirF** · January 28th, 2011, 09:26 PM

Originally Posted by lawl_rock

I'm using VC++. I changed float to double and didn't see a noticeable change, but it might have just been subtle. According to the profiler and my observations, Butterworth::Run() is currently the most expensive function. The variables in that function are members of the Butterworth class and only change when I set the sample rate and frequency of the Butterworth filter.

Well, in my test switching from float to double saved about 5%. Not bad, considering the effort.
However, you get way more than that by using SSE, and since SSE operates on 128-bit register, it can process four floats (or only two doubles) at a time.
I have modified your code a little. First, to ensure proper alignment, I used 4-elements arrays for your coef and history variables.
Then I noticed (please correct me if I am wrong) that you multiply each of your history variable by corresponding coef. I did it in one SSE instruction instead of four separate multiplications.
Then, since the history is already loaded into xmm register, I multiplied it by 2. I know that you only use two of these four products, but you get four for the price of one anyway

Then I analyzed these two lines:

Code:

output-=history1*coef0;
new_hist=output-history2*coef1;

Of course, in my notation it was:

Code:

output-= tmp[0]; // history[0]*coef[0];
new_hist=output-tmp[1]; // history[1]*coef[1];

You can see that tmp[0] was added to tmp[1] (and then subtracted from something).
Since you do the same for tmp[2] and tmp[3], I packed both addition into one HADDPS (Horizontal-Add-Packed-Single) instruction (once again supplying dummy second parameter and ignoring half of the results).
Here is the class I used for testing:

Code:

class Butterworth
{
public:
	__declspec(align(16)) float coef[4];
	__declspec(align(16)) float history[4];
	__declspec(align(16)) float twos[4];
	__m128 coef4; 
	__m128 twos4; 
	float gain;
	Butterworth()
	{
		coef[0] = 2;
		coef[1] = 3;
		coef[2] = 4;
		coef[3] = 5;
		history[0] = 0;
		history[1] = 0;
		history[2] = 0;
		history[3] = 0;
		twos[0] = 2;
		twos[2] = 2;
		gain = 7;

		coef4 = _mm_load_ps(coef); 
		twos4 = _mm_load_ps(twos); 
	}
	float Run(float input);
	float RunSSE(float input);
};

Here is the test framework:

Code:

int _tmain(int argc, _TCHAR* argv[])
{
	Butterworth b;
	DWORD t1 = GetTickCount();
	for(int i = 9; i < 1000000; i++)
	{
		b.RunSSE((float)i);
		//b.Run((float)i);
	}
	DWORD t2 = GetTickCount();
	cout << t2 - t1 << endl;

	return 0;
}

And here are your original and my SSE function:

Code:

float Butterworth::Run(float input) 
{
	float output=input*gain;
	float new_hist;

	output-=history[0]*coef[0];
	new_hist=output-history[1]*coef[1];

	output=new_hist+history[0]*2.f;
	output+=history[1];

	history[1]=history[0];
	history[0]=new_hist;

	output-=history[2]*coef[2];
	new_hist=output-history[3]*coef[3];

	output=new_hist+history[2]*2.f;
	output+=history[3];

	history[3]=history[2];
	history[2]=new_hist;

	return output;
}

Code:

float Butterworth::RunSSE(float input) 
{
	__m128 hist4 = _mm_load_ps(history); 
	__declspec(align(16)) float tmp[4];
	__declspec(align(16)) float tmp2[4];
	
	// horisontal add; don't need the second register, just use hist4
	_mm_store_ps(tmp, _mm_hadd_ps(_mm_mul_ps(coef4, hist4), hist4)); 
	_mm_store_ps(tmp2, _mm_mul_ps(twos4, hist4));

	float output=input*gain;
	float new_hist;

	new_hist=output-tmp[0];

	output=new_hist+ tmp2[0];
	output+=history[1];

	history[1]=history[0];
	history[0]=new_hist;

	new_hist=output- tmp[1];

	output=new_hist+ tmp2[2];
	output+=history[3];

	history[3]=history[2];
	history[2]=new_hist;

	return output;
}

All in all I got almost 50% time saving. I am sure that going to assembly code you can save some more time on moving stuff in and out of xmm registers.
Oh, you would need these includes:

Code:

#include <emmintrin.h>
#include <xmmintrin.h>
#include <intrin.h>

Please let me know if that all made any sense.
Vlad

**lawl_rock** · February 1st, 2011, 08:00 PM

Vlad, I definitely appreciate that help. However, the SSE implementation does not function as the original function did, instead returning basically all zeroes. (The Butterworth class is a butterworth low-pass filter to be used on audio data.) I'm not sure whether this problem derived from my implementation or the original code you gave me, since I am not familiar with SSE. I will scrutinize the code further, however.
EDIT: Sorry, that was my fault! I left out the _mm_load_ps instructions (for coef4 and two4). I'm still scrutinizing the code so I can understand it (I hate having code and not knowing how it works). Thank you very much, I definitely observed an increase in performance.

Thread: Optimization suggestions

Thread Tools

Display

Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Re: Optimization suggestions

Posting Permissions