CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 14 of 14
  1. #1
    Join Date
    May 2008
    Posts
    36

    Memory allocation performance

    Hi,

    (You can skip straight to the section "Specific questions" instead of reading all the other stuff)

    Project background
    I build alot of models in excel then manually compile them in java.
    The project I am looking to develop will enable me to automatically compile any excel model into code.

    My goals for this project are;
    • A reasonably short development time
    • All tools are stable and require no future support
    • The complied model is performance focused (as opposed to memory efficient)
    • For as much as possible the complied model is portable (runs on windows & linux, 32bit & 64bit I may also look at including graphic cards but that will be in stage 2). I might use cross-compilers or cygwin for linux builds.

    In order to meet all but the performance points I will be developing as much as possible in C#\Java. The actual calculation will be done in c++ and called by a java tool.

    The excel model will be translated into c++ by declaring a double for each cell and referencing all cells by a double pointer array. The pointer array will allow me to use Excel functions such as Offset and Index. All cells are assumed to be in double format (ie no strings\DateTime etc)

    I plan to support very large excel models (up to 1million cells) with the expectation that the memory required to hold the calculation will exceed 8MB (windows stack size)

    About me
    To date I have only ever programmed in managed languages (C#,Java mainly) and have no experience with low level languages/memory/cpu.

    I have spent the last few days thinking and reading about c++ and the stack/heap so I would say that understand only the basics

    As the only c++ used will be for calculations I think I will only need to know the basics. But given the possibly large number of variables declared a good knowledge of the heap/stack/pointers will be crucial to performance. Hence my questions.

    Specific questions
    I have some questions about c++ & memory that will effect the way in which I design my application (It performs a large calculation which is looped). I expect that 95% of the variables declared will be double64 and the remainder will be strings.

    First I do not care about the time it takes to allocate/deallocate memory as this will only happen once and the vast majority of the time will be spent in the calcs (ie retrieving and setting a variables value).
    Secondly I have to assume that I have no admin access to the machines running the applications so if stack size is OS dependent I cant change it.

    Question 1
    In an attempt to simulate my calculation I designed an application which allocated memory to the stack and again to the heap and tested the performance of each.
    What I found was that it was nearly 30% faster to use the heap than it was to use the stack (I wasn't expecting this). Code is in the next post.
    Are my findings correct, are computations using "double*"s declared on the heap faster than "double"s declared on the stack when you have alot of double64 variables declared?

    Question 2
    Is there a more time efficient approach that can be used to carry out a large calculation such as this?
    eg store the most accessed variables 'here' store the others 'here'

    Question 3
    After each calculation (loop) I need to to copy a section of variables and add them to a running total. So to do this the code; runs a calculation, copies a bunch of variables to a location in memory, runs another calculation, adds the same bunch of variables (different values) to the previously copied values, runs a calculation, ....

    To give you an idea of the size of the data copied I would not be surprised if it exceeded 50,000 variables.

    Each calculation is something like 4m operations (+-*/) so I expect a reasonably large amount of time will be spent copying the data so I would like to make this stage as efficient as possible.

    If all variables I wished to copy were declared in an array (and therefore declared consecutively in memory) is there an efficient way to do this, eg something like "memcpy_add()"

    Question 4
    Do c++ compilers add any extra statements that could slow performance. eg in C# there is arithmetic overflow and array bounds checking.
    I can make the assumption that there will be no such events.

    Thanks,
    Ger.

  2. #2
    Join Date
    May 2008
    Posts
    36

    Re: Memory allocation performance

    Here is the code I used when testing the relative performance of heap/stack.
    The results was that the heap ("double*") was 30% faster than the stack ("double").
    Is there something funny going on, the compiler doing something?

    Setup: Visual studio 2010, Win32 application, Release Build, everything else is default.

    Stack Code:
    Code:
    double Run_stack(double d1,double &tt) { //d1 is an initial variable, tt = TimeTaken
    	double minVal = 100000000;
    
    	double d2;
    	double d3;
    	double d4;
    	...   //declare variables all the way to d1000
    	double d1000;
    	
    	double start = clock();
    	for(int i=0;i<10000;i++) {
    		d2 = (d1 / 2) * d1;
    		d3 = (d2 / 3) * d2;
    		d4 = (d3 / 4) * d3;
    		....  // Run above calc all the way to d1000
    		d1000 = (d999 / 999) * d999;
    		d1 = d1000;
    		
    		minVal = min(minVal,d2);
    		minVal = min(minVal,d3);
    		minVal = min(minVal,d4);
    		...
    		minVal = min(minVal,d1000);
    	}
    
    	tt = (clock()-start);
    	return minVal;
    }
    Heap Code:
    Code:
    double Run_heap(double* d1,double &tt) {  //d1 is an initial variable, tt = TimeTaken
    	double* minVal = new double;*minVal = 100000000;
    
    	double* d2 = new double;
    	double* d3 = new double;
    	double* d4 = new double;
    	...   // declare variables all the way to d1000
    	double* d1000 = new double;
    	
    	double start = clock();
    	for(int i=0;i<10000;i++) {
    		*d2 = (*d1 / 2) * *d1;
    		*d3 = (*d2 / 3) * *d2;
    		*d4 = (*d2 / 4) * *d3;
    		...   // run calc all the way to d1000
    		*d1000 = (*d999 / 999) * *d999;
    		*d1 = *d1000;
    		
    		*minVal = min(*minVal,*d2);
    		*minVal = min(*minVal,*d3);
    		*minVal = min(*minVal,*d4);
    		...
    		*minVal = min(*minVal,*d1000);
    	}
    
    	tt = (clock()-start);
    	return *minVal;
    }

  3. #3
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Memory allocation performance

    Quote Originally Posted by gleesonger View Post
    Here is the code I used when testing the relative performance of heap/stack.
    The results was that the heap ("double*") was 30% faster than the stack ("double").
    Is there something funny going on, the compiler doing something?
    First, are you timing a fully optimized build? If you're debugging a "debug" build, then your timings are not valid.

    Second, there is no way that calling "new" is faster than just declaring a double. When you call "new", you are calling the allocator to get memory from the heap. Then that new memory has to be recorded by the heap manager. Then a value has to be returned from the call and assigned to the pointer.

    Third, your HeapCode has a memory leak. Java is not C++, as a call to "new" in C++ is not the same as calling "new" in Java or C#. You failed to deallocate the memory in the Heap Code using the proper call to "delete". So your "heap" example does not properly do everything it needs to do to be correct. Unlike Java or C#, C++ has no "garbage collection" when you use "new" in this fashion.

    But to be honest, no C++ programmer writes code as you have done with the "Heap Code". The first method would have been written by someone who claims to be a C++ programmer, as the second method is a tell-tale sign that a C++ programmer didn't write the code, but someone who is a Java/C# programmer who are not aware of the differences between C++ and those other languages. Again, I'm being honest with you.

    Also, why not just declare a single array, instead of declaring 1000 "d" variables?

    Regards,

    Paul McKenzie
    Last edited by Paul McKenzie; July 28th, 2013 at 05:50 PM.

  4. #4
    Join Date
    May 2008
    Posts
    36

    Re: Memory allocation performance

    Quote Originally Posted by Paul McKenzie View Post
    First, are you timing a fully optimized build? If you're debugging a "debug" build, then your timings are not valid.
    I have selected a "Release" build. Are there more optimizations that can be selected in VS2010?

    Quote Originally Posted by Paul McKenzie View Post
    there is no way that calling "new" is faster than just declaring a double
    I am only concerned with how fast the calculation (loop) executes. The timing (start = clock()) is declared after the variables are declared and (timetaken = (clock()-start)) immediately after the loop quits.

    Quote Originally Posted by Paul McKenzie View Post
    your HeapCode has a memory leak
    I am aware of that but as I am not concerned with the time taken to deallocate memory and because this app is just for testing I wasn't bothered putting in another 1000 lines of code.

    Quote Originally Posted by Paul McKenzie View Post
    But to be honest, no C++ programmer writes code as you have done with the "Heap Code"....Again, I'm being honest with you
    As my c++ code will be automatically generated I wanted to quantify the impact of using the heap over the stack. If the heap was only marginally slower I'd stick everything on the heap, otherwise I'd build the logic to put as much variables as possible on the stack and the remainder on the heap. I had expected the stack to be faster and was surprised by the results I got. The reason for using the heap is that I believe my app will exceed the size of the stack and rather than running the risk of a stack overflow I thought just stick everything on the heap.

    Quote Originally Posted by Paul McKenzie View Post
    why not just declare a single array, instead of declaring 1000 "d" variables?
    When the code is generated I want to be able to debug it. In the actual build "d1" will be replaced with a more understandable name which will be lifted from the excel model. Its harder to debug array locations.

    Quote Originally Posted by Paul McKenzie View Post
    Again, I'm being honest with you
    Thats exactly what I'm after
    I believe I have answered all your points and that my original questions still stand if you would be able to revise my understanding that'd be great.
    Last edited by gleesonger; July 29th, 2013 at 03:47 AM.

  5. #5
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Memory allocation performance

    Quote Originally Posted by gleesonger View Post
    I am aware of that but as I am not concerned with the time taken to deallocate memory and because this app is just for testing I wasn't bothered putting in another 1000 lines of code.
    Regardless of your goal, the test is not valid if you do not issue the calls to "delete". Those delete calls must be timed also, otherwise you're just going to be fooled into thinking that the function is faster when it isn't.
    As my c++ code will be automatically generated I wanted to quantify the impact of using the heap over the stack. If the heap was only marginally slower I'd stick everything on the heap,
    In your heap example, where do those pointers reside? You declared 1,000 pointer variables, and they are also on the stack. So there is no difference in stack usage between your "stack" example and "heap" example in terms of local variables.
    otherwise I'd build the logic to put as much variables as possible on the stack and the remainder on the heap.
    See previous paragraph. You've done nothing to alleviate stack usage in the pointer example, since you declared 1000 pointer variables, and those variables exist on the stack.

    Regards,

    Paul McKenzie

  6. #6
    Join Date
    May 2008
    Posts
    36

    Re: Memory allocation performance

    Quote Originally Posted by Paul McKenzie View Post
    Regardless of your goal, the test is not valid if you do not issue the calls to "delete". Those delete calls must be timed also, otherwise you're just going to be fooled into thinking that the function is faster when it isn't.
    I am not calculating the time it takes to call the function just the loop\calculation it self. In the real application the doubles themselves will only ever be declared once using an "Initialize()" function (variables will have class scope). Then I will call a "do_work()" function a few hundred times. Each call to the "do_work()" function will take between 10 and 60 sec to complete. So unless it takes more than a few seconds to allocate/release the variables then I am not concerned with how long it takes to manage variables on the heap.

    Quote Originally Posted by Paul McKenzie View Post
    See previous paragraph. You've done nothing to alleviate stack usage in the pointer example, since you declared 1000 pointer variables, and those variables exist on the stack.
    I had considered this, but I do not know exactly what the compiler is doing. As the pointer is never reassigned to a new memory address and nor is that address return to the code (ie we dont actually make use of the pointer) I am wondering is there some optimizations going on, eg not using pointers on the stack but referencing the heap memory directly.
    Suppose there is an easy way to test this, repeat above but declare (8MB (size_of_stack) / 32byte (size_of_pointer) + 1) doubles, if my hopes are false then this should cause an overflow.

    Other than optimization in the compiler there are other things which may be skewing results; eg due to the frequency of use the variables may be mapped to the processors cache, speeding up the calc greatly. If I increase the size of my test from 1000 doubles to a million then I will exceed the cache and the resulting paging will show the true performance of the calcs. Though I do not see why this would affect the stack and heap differently.

  7. #7
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Memory allocation performance

    Quote Originally Posted by gleesonger View Post
    I had considered this, but I do not know exactly what the compiler is doing.
    Generate an assembly language listing of that function. Variables that are not used are optimized away by a good optimizing compiler.
    Its harder to debug array locations.
    What is hard about debugging arrays? But in general, you don't write code in a certain way because you don't have an adequate debugger. No one would declare 1,000 doubles when an array is what is called for. How about timing an array, as opposed to 1,000 doubles? Then the array is guaranteed to be contiguous, as opposed to 1,000 separate variables. Analogous to that, the array new (new[] / delete[]) would be used.

    Currently, you have created a contrived example that no one would code, and that is a function that contains 1,000+ different variables. Code that exceeds a certain size can also have its issues (I know for a fact that functions thousands of lines long will blow out the stack). If you changed that to an array (which is one variable), then you have a much better picture of what happens. For example:

    Code:
    #include <algorithm>
    //...
        double minVal = 100000000;
        double dArr[1000];
        dArr[0] = d1;
    
        for(int i=0;i<10000;i++) 
        {
             for (int j = 1; j < 1000; ++j)
                dArr[j] =  (dArr[j-1] / 2) * dArr[j-1];
             dArr[0] = dArr[999];
             double oldMinVal = minVal
             minval = *std::min_element(dArr + 1, dArr + 1000);
             minVal = min(oldMinVal, minVal);
        }
    Regards,

    Paul McKenzie
    Last edited by Paul McKenzie; July 29th, 2013 at 10:11 AM.

  8. #8
    Join Date
    May 2008
    Posts
    36

    Re: Memory allocation performance

    Quote Originally Posted by Paul McKenzie View Post
    Generate an assembly language listing of that function.
    I tried that but I wasn't able to locate my function and even if I did I probably wouldn't be able to understand the code.

    Quote Originally Posted by Paul McKenzie View Post
    What is hard about debugging arrays? But in general, you don't write code in a certain way because you don't have an adequate debugger.
    Apologies, when I said debugging I was referring to how easy the generated code is to read,
    "cashflow_5 = cashflow_4 + 5" is easier to read than "array[123] = array[56] + 5"

    Though I have just realized that I could use macros and "#define cashflow_5 123" and use "array[cashflow_5] = array[cashflow_4] + 5"

    Do you know how compilers handle array referencing when the location is hard coded?
    So does "array[X]" result in the machine doing the following; use pointer to get the start of array, move X * sizeof(variable) bytes from start of array.
    Or as X and the array size is know at compile time does the compiler just hard code the location of array[X] into the machine code. (Hopefully I explained myself there)

    Again this is me trying to reduce the number of operations to a minimum.

    Quote Originally Posted by Paul McKenzie View Post
    I know for a fact that functions thousands of lines long will blow out the stack
    I didn't know that. This will definitely effect me.
    I think my fix will be to split the code between different functions that way the stack will be release after each function is called. Thanks.

    Quote Originally Posted by Paul McKenzie View Post
    If you changed that to an array (which is one variable), then you have a much better picture of what happens
    I agree that this is the case with the example I gave but when I'm generating the code I wont be able to roll up the code so I cant use a loop I need to declare each calculation as a separate statement.
    True I could check for repetitions and try to condense some sections into loops but this adds to the complexity of my Excel->cpp compiler

  9. #9
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Memory allocation performance

    Quote Originally Posted by gleesonger View Post
    I tried that but I wasn't able to locate my function and even if I did I probably wouldn't be able to understand the code.
    If you're using Visual C++, there is an option to mix the source line with the corresponding assembly code. Knowledge of assembly language is crucial if the goal is to figure out what a compiler is doing with your code in terms of optimizations.
    Apologies, when I said debugging I was referring to how easy the generated code is to read,
    Isn't generated code created so that it does the job, and not necessarily to be read (or even changed) by the user? For all the various code generators I know of, the goal is to generate compilable code -- whether the code can be read or understood by humans is usually irrelevant.
    Do you know how compilers handle array referencing when the location is hard coded?
    So does "array[X]" result in the machine doing the following; use pointer to get the start of array, move X * sizeof(variable) bytes from start of array.
    Or as X and the array size is know at compile time does the compiler just hard code the location of array[X] into the machine code. (Hopefully I explained myself there)
    A compiler can do anything, as long as the final result is the same. That's all that can be stated. Again, if you want to know what your particular brand of compiler does, the only way to know for sure is to generate an assembly listing.
    Again this is me trying to reduce the number of operations to a minimum.
    Then maybe you should code in assembly language.

    Persons who try to "outsmart the C++ compiler" by doing tricks to supposedly speed up the code usually wind up outsmarting themselves. The code is usually the same speed, if not slower than what an optimizing compiler can produce.

    The engineers that put the compiler together aren't novices -- they know the hardware they're targeting, and produce optimizing code that takes advantage of the hardware. In other words, unless you write your routine in assembly language, you will be very hard-pressed to beat the compiler at its game when it comes to optimizations.

    Also, the goal is to make your program work first. A program that is "optimized" but doesn't work correctly is hard if not impossible to fix or adjust, and therefore basically worthless and resulted in a waste of time for the programmer who spent hours creating this supposed optimized code. Once you have a working program, then you use a profiler to determine what is slow (if anything) and make adjustments.
    I didn't know that. This will definitely effect me.
    Actually, not the stack, but I know that a paging fault can occur with functions that are thousands of lines long. This is the case with Visual Studio set of compilers, and I would believe any Windows C++ compiler.

    Regards,

    Paul McKenzie
    Last edited by Paul McKenzie; July 29th, 2013 at 01:14 PM.

  10. #10
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Memory allocation performance

    from a technical POV. there is no such thing as "heap memory" and "stack memory".
    the computer only has "memory", and all of it is the same and the same speed (this is elementary PC hardware knowledge).

    if you try to put memory modules in your pc with different speeds, then either
    1) it won't work and your pc won't boot
    2) if you're lucky it'll work at the worst case settings for either module (i.e. all memory will be accessed SLOWER than the slowest of the modules).
    3) if you're really really lucky, it'll run at the speed of the slower memory module.


    that said. Within your program "some memory" will be organised as a stack, and some as heap, you have no control over which memory is put into which physical memory (modules), or even if they are continuously kept in memory.

    Due to how the CPU works:
    - stack based variables (up to the first 128 bytes of them) will be slightly faster than heap. because it doesn't need a pointer indirection.
    - beyond the first 128bytes worth of variables on the stack (per function), your mileage may vary


    As to actual results you're getting
    - your stack based solution is technically faster than your heap based one
    - you're getting the opposite result because the stack based approach exceeds your code cache and your heap based one doesnt. so you're getting the false impression the former is slower.
    - the array based solution paul gave will prove this, even though again, pure technically it does the same thing (with less code).


    You're going about things totally the wrong way, optimisation is a last concern, not a primary one. Make sure you have a well thought through design first that ALLOWS various ways to optimize. If you try to optimize at the start, then you could end up with a horrible design, that's hard or impossible to maintain and change, and that very likely won't even run as fast as an unoptimised "good design".

    If you know from the start that calculations are going to be your bottleneck...
    then paralellism is your best track towards performance. Either by maximizing technology like MMX, SSE, SSE2... and multithreading. Those are things you need to "design in" at the start, trying to optimize on a single core, and THEN trying to add in paralellisms isn't going to work.

    If you think memory allocation or stack/heap is your bottleneck... then I doubt you have much experience at all with any sort of real optimisation issues... because they rarely are the root cause. And if they are, "pooling" tends to be the way out.

  11. #11
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Memory allocation performance

    Quote Originally Posted by gleesonger View Post
    Apologies, when I said debugging I was referring to how easy the generated code is to read,
    "cashflow_5 = cashflow_4 + 5" is easier to read than "array[123] = array[56] + 5"

    Though I have just realized that I could use macros and "#define cashflow_5 123" and use "array[cashflow_5] = array[cashflow_4] + 5"
    no the idea is that you write
    cashflow[5] = cashflow[4] + 5;

    Do you know how compilers handle array referencing when the location is hard coded?
    So does "array[X]" result in the machine doing the following; use pointer to get the start of array, move X * sizeof(variable) bytes from start of array.
    Or as X and the array size is know at compile time does the compiler just hard code the location of array[X] into the machine code. (Hopefully I explained myself there)
    this is something you shouldn't worry about with C++, the compiler will do it's best to optimize it in whatever form ends up being better.

    Again this is me trying to reduce the number of operations to a minimum.
    number of operations do not equate to performance.
    sometimes more (but faster) operations or operations that can be execcuted simultaneously is better than a single more compelx operation that can't be paired at all.


    I didn't know that. This will definitely effect me.
    if performance is your concern,
    then the biggest optimisations will result from a better algorithm then parallelism, then optimizing for code/data cache alignment and reusage, loop unrolling, "better" CPU instructions.
    it seems like you're tryign to approach this from the wrong end. this is never a good idea.

  12. #12
    Join Date
    May 2008
    Posts
    36

    Re: Memory allocation performance

    Quote Originally Posted by OReubens View Post
    from a technical POV. there is no such thing as "heap memory" and "stack memory".
    the computer only has "memory", and all of it is the same and the same speed (this is elementary PC hardware knowledge).
    I am aware of where the heap and stack are located (a result of which is the speed is the same) and how allocation/deallocation differs.
    Am I right in saying that in normal operation accessing (r/rw) the variables on the stack is generally faster because, due to their frequency of use the variables are mapped to the processors cache?

    Quote Originally Posted by OReubens View Post
    you're getting the opposite result because the stack based approach exceeds your code cache and your heap based one doesnt. so you're getting the false impression the former is slower.
    Thanks.

    Quote Originally Posted by OReubens View Post
    You're going about things totally the wrong way, optimisation is a last concern, not a primary one. Make sure you have a well thought through design first that ALLOWS various ways to optimize.
    I totally agree with this. However there are some simple choices I have to make early on that if I wish to change may require a fairly decent architectural change.

    I intend to write the compiler and then in later versions add in optimizations. An example of which is SSE2 (which I wasn't aware of until you pointed it out, I knew it was in GPUs but didn't know CPUs supported it as well, thanks for that info)

    Quote Originally Posted by OReubens View Post
    then paralellism is your best track towards performance. Either by maximizing technology like MMX, SSE, SSE2... and multithreading.
    The most likley use of the compiled code is in grid computing. In this scenario the grid software will handle all parallelism. Other wize I will be multithreading in the application wrapping the c++ module (java)

    Quote Originally Posted by OReubens View Post
    no the idea is that you write
    cashflow[5] = cashflow[4] + 5;
    I have decided that I will write one large array to the heap but I'll dump the excel formula as a comment in the source.
    I originally didn't want to do this but it does seem to be the best approach.

  13. #13
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Memory allocation performance

    Quote Originally Posted by gleesonger View Post
    Am I right in saying that in normal operation accessing (r/rw) the variables on the stack is generally faster because, due to their frequency of use the variables are mapped to the processors cache?
    no.

    variables on the stack are faster than a pointer because.
    1) is the pointer is on the stack, you still need to get the pointer as well (and indirect the pointer on top)
    2) if the pointer is global, the stack still wins (by a narrow margin) if it's in the first 128bytes of the stack pointer because it'll have a shorter instruction. (global var will need an R/M+4bytes address, stack will need an ESP or EBP with a 1 byte relative).

    either of the 2 could end up executing faster depending on influences on the data cache. which you can't "organize" except if you're willing to go into specific processor optimisations.
    the "generic" answer is... the better approach (=stack if within 128bytes) will tend to be favoured across all CPU brands and types.


    I totally agree with this. However there are some simple choices I have to make early on that if I wish to change may require a fairly decent architectural change.
    from an architectural POV... you shouldn't use the stack for storing more than "a few" local variables.
    if your design needs thousands of bytes on the stack in a single function, you have a serious design issue.
    Last edited by OReubens; August 1st, 2013 at 07:30 AM.

  14. #14
    Join Date
    Aug 2000
    Location
    New York, NY, USA
    Posts
    5,656

    Re: Memory allocation performance

    Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
    Convenience and productivity tools for Microsoft Visual Studio:
    FeinWindows - replacement windows manager for Visual Studio, and more...

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured