CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 13 of 13
  1. #1
    Join Date
    Dec 2010
    Posts
    27

    Optimization suggestions

    I hope I'm not overstepping any boundaries, but I would really appreciate some help in optimizing these functions I wrote. I'm just looking for some general pointers.
    Code:
    int sum(int arr[], int start, int length) {
    	int re=0;
    	for (int i=start; i<length; i++) re+=arr[i];
    	return re;
    }
    
    void divArr(int org[], int db, int length) {
    	for (int i=0; i<length; i++) org[i]/=db;
    }
    
    void zeros(int arr[], int length) {
    	for (int i=0; i<length; i++) arr[i]=0;
    }
    
    int round(float r) {
    	return (int)floor(r+0.5);
    }
    
    void fastsmooth(int Y[], int smoothwidth, int length) {
    	int w=smoothwidth;
    	int SumPoints=sum(Y, 0, w);
    	int* s=new int[length];
    	zeros(s, length);
    	int halfw=round((float)w/2);
    	for (int k=0; k<length-w; k++) {
    		s[k+halfw-1]=SumPoints;
    		SumPoints-=Y[k];
    		SumPoints+=Y[k+w];
    	}
    	divArr(s, w, length);
    	SumPoints=sum(s, 0, w);
    	zeros(Y, length);
    	for (int k=0; k<length-w; k++) {
    		Y[k+halfw-1]=SumPoints;
    		SumPoints-=s[k];
    		SumPoints+=s[k+w];
    	}
    	divArr(Y, w, length);
    	delete [] s;
    }
    
    void deriv(int Y[], int length) {
    	int* d=new int[length];
    	d[0]=Y[1]-Y[0];
    	d[length-1]=Y[length-1]-Y[length-2];
    	for (int j=1; j<length-1; j++) {
    		d[j]=(Y[j+1]-Y[j-1])/2;
    	}
    	memcpy((void*)Y, (void*)d, length*sizeof(float));
    	delete [] d;
    }
    These functions are called for quite large sets of data, with 10,000,000 being a reasonable average length (I'm dealing with PCM data), and several (five, generally) such arrays being handled at once.

    This code is also applied to similar data, and I'm aware that there may not be too many options for it, but again, I'd appreciate any general direction.
    Code:
    float Butterworth::Run(float input) {
    	float output=input*gain;
    	float new_hist;
    
    	output-=history1*coef0;
    	new_hist=output-history2*coef1;
    
    	output=new_hist+history1*2.f;
    	output+=history2;
    
    	history2=history1;
    	history1=new_hist;
    
    	output-=history3*coef2;
    	new_hist=output-history4*coef3;
    
    	output=new_hist+history3*2.f;
    	output+=history4;
    
    	history4=history3;
    	history3=new_hist;
    
    	return output;
    }
    I realize that this is a lot, so thank you in advance if you take the time to give me any tips!

  2. #2
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: Optimization suggestions

    For the function zeros(), a call to memset() will usually be much faster than looping.

    For the others, consider marking the functions as inline and putting them in a header file, as this may improve compiler optimization opportunities.

    One thing which may or may not help is the use of standard algorithms. For instance, sum() is doing essentially the same thing as std::accumulate() in <numeric>, but the latter may be slightly faster if the compiler vendor was able to take advantage of any tricks for speed when writing it.

    For deriv, you could consider using stack allocation (non-standard but usually available in the form of alloca() or something similar) rather than heap allocation for speed. (Variable sized arrays actually are in C99 and would do the same thing, but they aren't supported in C++.)

  3. #3
    Join Date
    Apr 2008
    Posts
    118

    Re: Optimization suggestions

    Loop unrolling sometimes helps.

    Instead of
    Code:
    for (int i=0; i<some_val; i++)
    {
      someFunction_involving(array[i]);
    }
    Code:
    for (int i=0; i<some_val; i+=10)
    {
      someFunction_involving(array[i]);
      someFunction_involving(array[i+1]);
      someFunction_involving(array[i+2]);
      someFunction_involving(array[i+3]);
      someFunction_involving(array[i+4]);
      someFunction_involving(array[i+5]);
      someFunction_involving(array[i+6]);
      someFunction_involving(array[i+7]);
      someFunction_involving(array[i+8]);
      someFunction_involving(array[i+9]);
    }
    Depending on the start and end values of i, you'll have to allow for end cases, but nonetheless I find it often helps.
    Last edited by Moschops; January 21st, 2011 at 07:43 PM. Reason: I've been drinking

  4. #4
    Join Date
    Jul 2005
    Location
    Netherlands
    Posts
    2,042

    Re: Optimization suggestions

    First off all, what's wrong with your space bar? Your code is packed together like a bunch of penguins.
    Quote Originally Posted by lawl_rock View Post
    These functions are called for quite large sets of data, with 10,000,000 being a reasonable average length (I'm dealing with PCM data), and several (five, generally) such arrays being handled at once.
    In that case, I think you should try to optimize cache locality. You are going over the entire dataset dozens of times. Something like this
    Code:
    	int* s=new int[length];
    	zeros(s, length);
    is totally unnecessary, you can initialize the values to 0 when allocating them.
    Code:
    	for (int k=0; k<length-w; k++) {
    		s[k+halfw-1]=SumPoints;
    		SumPoints-=Y[k];
    		SumPoints+=Y[k+w];
    	}
    	divArr(s, w, length);
    	SumPoints=sum(s, 0, w);
    	zeros(Y, length);
    Here you perform three loops (disregarding the call to sum) where only one loop is necessary. The call to zeros is almost a complete waste, since the values will be overwritten in the following loop anyway.
    Cheers, D Drmmr

    Please put [code][/code] tags around your code to preserve indentation and make it more readable.

    As long as man ascribes to himself what is merely a posibility, he will not work for the attainment of it. - P. D. Ouspensky

  5. #5
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,725

    Re: Optimization suggestions

    You might want to investigate OpenMP

  6. #6
    Join Date
    Dec 2010
    Posts
    27

    Re: Optimization suggestions

    Quote Originally Posted by Moschops View Post
    Loop unrolling sometimes helps.
    I've attempted that but found that it made minimal difference and made my code 10 times harder to read, but I might go back and optimize in that way further down the line.

    Quote Originally Posted by D_Drmmr View Post
    First off all, what's wrong with your space bar? Your code is packed together like a bunch of penguins.
    I started programming writing mods for a sandbox type game (it was named Burning Sand), which had stricter spacing requirements, and it stuck.

    Quote Originally Posted by D_Drmmr View Post
    Something like this
    Code:
    	int* s=new int[length];
    	zeros(s, length);
    is totally unnecessary, you can initialize the values to 0 when allocating them.
    Code:
    	for (int k=0; k<length-w; k++) {
    		s[k+halfw-1]=SumPoints;
    		SumPoints-=Y[k];
    		SumPoints+=Y[k+w];
    	}
    	divArr(s, w, length);
    	SumPoints=sum(s, 0, w);
    	zeros(Y, length);
    Here you perform three loops (disregarding the call to sum) where only one loop is necessary. The call to zeros is almost a complete waste, since the values will be overwritten in the following loop anyway.
    Here, my concern is that a 0 might overwrite a previously calculated value, although looking closer now, I don't think it's possible. I'm sure that with some if statements I'll be able to fix the problem.

    Quote Originally Posted by Philip Nicoletti View Post
    You might want to investigate OpenMP
    I will definitely look into that, considering that I'm dealing with multiple arrays at once.

  7. #7
    Join Date
    May 2009
    Posts
    2,413

    Re: Optimization suggestions

    Quote Originally Posted by lawl_rock View Post
    These functions are called for quite large sets of data, with 10,000,000 being a reasonable average length (I'm dealing with PCM data), and several (five, generally) such arrays being handled at once.
    1. Reduce the number of loops. For example in fastsmooth() you first calculate all SumPoints in one loop and then you have another loop where you divide everything with w. You could as well do this in just one loop:

    s[k+halfw-1]=SumPoints / w;

    Furthermore you "smooth" the dataset twice, first once to s[] and then exactly the same back to Y[] again. Preferably both "smoothings" should be done once in the same loop.

    And the "zeroing" of the whole arrays is not necessary since you owerwrite most elements later anyway..

    2. Optimize the loop itself. The compiler probably will do it for you but why take the chance? I think you should subtract the 1 from halfw outside the loop and rearrange the calculation of the new SumPoints like this,

    s[k+halfw] = SumPoints / w;
    SumPoints += (Y[k+w] - Y[k]);

    It may also pay off to use a pointer as loop variable instead of indexed array accesses.

    3. Get rid of the big temporary arrays. You use them not to overwrite array elements but there's another way and that's to calculate ahead of writing. For example in fastsmooth() you could temporary store halfw SumPoints (for example in a circular buffer). When you calculate a new SumPoint you use Y[k+w] and Y[k] but you cannot write it back to Y[k+halfw] as of yet because this element is needed later. But the Y[k] position becomes available and here you write the SumPoint you calculated halfw iterations before (and which is now available in the circular buffer).

    The above technique is even easier to use in deriv(). You don't need d[] and you don't need the memcopy. By delaying the overwrite of an element until it's read you can write to Y[] directly.

    4. On processors with multiple cores you can divide up the work among the cores. For example with 4 cores each core can independently process 1/4 of a Y[] array at the same time. In practice this should speed up things 3-4 times.
    Last edited by nuzzle; January 22nd, 2011 at 11:27 AM.

  8. #8
    Join Date
    May 2009
    Posts
    2,413

    Re: Optimization suggestions

    In addition to my previous reply, this is how I think deriv should look (not tested):

    Code:
    void deriv(int Y[], int length) {
    	int b1 = Y[0];
    	int b2 = Y[1];
    	Y[0] = b2 - b1;
    	int L = length - 1;
    	for (int j=1; j<L;) {
    		int b0 = b1;
    		b1 = b2;
    		b2 = Y[j+1];
    		Y[j++] = (b2 - b0)/2;
    	}
    	Y[L] -= Y[L-1];
    }
    The b0, b1 and b2 variables replace the circular buffer I was talking about because the number of elements that need to be saved are so few. In principle just two b-variables would be necessary but by using one extra the Y array needs to be read just once per iteration and that's good.

    There are a few more operations per iteration but since the allocation/deallocation of the big temporary buffer and the copying step are gone I still think it's faster than the original. Finally I'm sure the biggest improvement would come from utilizing multiple cores by letting the algorithm process several parts of the Y array in parallel.

    Optimizing fastsmooth would be more involved but essentially according to the same principle as deriv.
    Last edited by nuzzle; January 23rd, 2011 at 01:05 AM.

  9. #9
    Join Date
    Aug 2000
    Location
    New York, NY, USA
    Posts
    5,656

    Re: Optimization suggestions

    Quote Originally Posted by lawl_rock View Post
    ...I would really appreciate some help in optimizing these functions I wrote. I'm just looking for some general pointers.
    A couple of general suggestions.
    Since you have a LARGE array of data, I would look into utilizing SSE instructions, either by hand or by using compiler capable of generating SSE.
    What is the valid range of your data? Can you go to something smaller than int? That will benefit you in two ways: better use of cache and possibility to pack more data items into single SSE instruction.
    I would avoid micro-optimization (like loop unrolling) until all other options are exhausted. Besides, if the good compiler didn’t do it – most likely it is not going to help (and may hurt). BTW, what compiler do you use?
    And, of course, the best optimization is the selection of best algorithm.
    It looks like in your deriv() function you can avoid new array allocation and copy by using a couple of temp variables to store new calculated value of Y[n] for a step or two.
    About your Butterworth::Run() function – are all undeclared variable changing outside of that function? I am 99% sure that floating point operations are performed on a double values internally, so using float adds extra conversions to/from double. When called millions of times, that could get expensive.
    And the last thing (that is actually the first most important thing) – what did your profiler say? What part of your code takes most of the time?
    Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
    Convenience and productivity tools for Microsoft Visual Studio:
    FeinWindows - replacement windows manager for Visual Studio, and more...

  10. #10
    Join Date
    Dec 2010
    Posts
    27

    Re: Optimization suggestions

    nuzzle, thank you, your suggestions really helped.
    VladimirF, I'm using VC++. I changed float to double and didn't see a noticeable change, but it might have just been subtle. According to the profiler and my observations, Butterworth::Run() is currently the most expensive function. The variables in that function are members of the Butterworth class and only change when I set the sample rate and frequency of the Butterworth filter.

  11. #11
    Join Date
    Jul 2002
    Location
    Portsmouth. United Kingdom
    Posts
    2,727

    Re: Optimization suggestions

    On my copy of Visual Studio 2008 I have the option of selecting 'precise' 'strict' or 'fast' floating point calculations. We found a significant improvement in speed when selecting 'fast', with minimal impact on results in our applications.
    "It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong."
    Richard P. Feynman

  12. #12
    Join Date
    Aug 2000
    Location
    New York, NY, USA
    Posts
    5,656

    Re: Optimization suggestions

    Quote Originally Posted by lawl_rock View Post
    I'm using VC++. I changed float to double and didn't see a noticeable change, but it might have just been subtle. According to the profiler and my observations, Butterworth::Run() is currently the most expensive function. The variables in that function are members of the Butterworth class and only change when I set the sample rate and frequency of the Butterworth filter.
    Well, in my test switching from float to double saved about 5%. Not bad, considering the effort.
    However, you get way more than that by using SSE, and since SSE operates on 128-bit register, it can process four floats (or only two doubles) at a time.
    I have modified your code a little. First, to ensure proper alignment, I used 4-elements arrays for your coef and history variables.
    Then I noticed (please correct me if I am wrong) that you multiply each of your history variable by corresponding coef. I did it in one SSE instruction instead of four separate multiplications.
    Then, since the history is already loaded into xmm register, I multiplied it by 2. I know that you only use two of these four products, but you get four for the price of one anyway
    Then I analyzed these two lines:
    Code:
    output-=history1*coef0;
    new_hist=output-history2*coef1;
    Of course, in my notation it was:
    Code:
    output-= tmp[0]; // history[0]*coef[0];
    new_hist=output-tmp[1]; // history[1]*coef[1];
    You can see that tmp[0] was added to tmp[1] (and then subtracted from something).
    Since you do the same for tmp[2] and tmp[3], I packed both addition into one HADDPS (Horizontal-Add-Packed-Single) instruction (once again supplying dummy second parameter and ignoring half of the results).
    Here is the class I used for testing:
    Code:
    class Butterworth
    {
    public:
    	__declspec(align(16)) float coef[4];
    	__declspec(align(16)) float history[4];
    	__declspec(align(16)) float twos[4];
    	__m128 coef4; 
    	__m128 twos4; 
    	float gain;
    	Butterworth()
    	{
    		coef[0] = 2;
    		coef[1] = 3;
    		coef[2] = 4;
    		coef[3] = 5;
    		history[0] = 0;
    		history[1] = 0;
    		history[2] = 0;
    		history[3] = 0;
    		twos[0] = 2;
    		twos[2] = 2;
    		gain = 7;
    
    		coef4 = _mm_load_ps(coef); 
    		twos4 = _mm_load_ps(twos); 
    	}
    	float Run(float input);
    	float RunSSE(float input);
    };
    Here is the test framework:
    Code:
    int _tmain(int argc, _TCHAR* argv[])
    {
    	Butterworth b;
    	DWORD t1 = GetTickCount();
    	for(int i = 9; i < 1000000; i++)
    	{
    		b.RunSSE((float)i);
    		//b.Run((float)i);
    	}
    	DWORD t2 = GetTickCount();
    	cout << t2 - t1 << endl;
    
    	return 0;
    }
    And here are your original and my SSE function:
    Code:
    float Butterworth::Run(float input) 
    {
    	float output=input*gain;
    	float new_hist;
    
    	output-=history[0]*coef[0];
    	new_hist=output-history[1]*coef[1];
    
    	output=new_hist+history[0]*2.f;
    	output+=history[1];
    
    	history[1]=history[0];
    	history[0]=new_hist;
    
    	output-=history[2]*coef[2];
    	new_hist=output-history[3]*coef[3];
    
    	output=new_hist+history[2]*2.f;
    	output+=history[3];
    
    	history[3]=history[2];
    	history[2]=new_hist;
    
    	return output;
    }
    Code:
    float Butterworth::RunSSE(float input) 
    {
    	__m128 hist4 = _mm_load_ps(history); 
    	__declspec(align(16)) float tmp[4];
    	__declspec(align(16)) float tmp2[4];
    	
    	// horisontal add; don't need the second register, just use hist4
    	_mm_store_ps(tmp, _mm_hadd_ps(_mm_mul_ps(coef4, hist4), hist4)); 
    	_mm_store_ps(tmp2, _mm_mul_ps(twos4, hist4));
    
    	float output=input*gain;
    	float new_hist;
    
    	new_hist=output-tmp[0];
    
    	output=new_hist+ tmp2[0];
    	output+=history[1];
    
    	history[1]=history[0];
    	history[0]=new_hist;
    
    	new_hist=output- tmp[1];
    
    	output=new_hist+ tmp2[2];
    	output+=history[3];
    
    	history[3]=history[2];
    	history[2]=new_hist;
    
    	return output;
    }
    All in all I got almost 50% time saving. I am sure that going to assembly code you can save some more time on moving stuff in and out of xmm registers.
    Oh, you would need these includes:
    Code:
    #include <emmintrin.h>
    #include <xmmintrin.h>
    #include <intrin.h>
    Please let me know if that all made any sense.
    Vlad
    Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
    Convenience and productivity tools for Microsoft Visual Studio:
    FeinWindows - replacement windows manager for Visual Studio, and more...

  13. #13
    Join Date
    Dec 2010
    Posts
    27

    Re: Optimization suggestions

    Vlad, I definitely appreciate that help. However, the SSE implementation does not function as the original function did, instead returning basically all zeroes. (The Butterworth class is a butterworth low-pass filter to be used on audio data.) I'm not sure whether this problem derived from my implementation or the original code you gave me, since I am not familiar with SSE. I will scrutinize the code further, however.
    EDIT: Sorry, that was my fault! I left out the _mm_load_ps instructions (for coef4 and two4). I'm still scrutinizing the code so I can understand it (I hate having code and not knowing how it works). Thank you very much, I definitely observed an increase in performance.
    Last edited by lawl_rock; February 1st, 2011 at 08:18 PM.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured