[RESOLVED] Performance of dynamic memory

**Mybowlcut** · June 2nd, 2009, 10:41 PM

Are there any performance advantages/disadvantages to using dynamic memory? From a bit of searching:

Originally Posted by http://www.codeguru.com/cpp/com-tech/atl/performance/article.php/c11395

It is a well-known fact that dynamic memory allocation is expensive. It incurs performance overhead both in execution time and space.

Originally Posted by http://www.springerlink.com/content/rye97wg111wken4e/

The most significant problem in dynamic memory allocation is fragmentation, which can cause the system to run out of memory and crash, if it is left unchecked.

I rarely use dynamic memory in my programming - generally only when I need polymorphism or control over the lifetime of objects... Should I keep it this way?

From reading on CG, I have found that most regulars here condemn the unnecessary usage of dynamic memory, but is there a time when performance requires it, or can I safely stick to the stack?

Cheers.

**Lindley** · June 2nd, 2009, 10:54 PM

Well, first off, it's pretty much impossible to avoid using dynamic memory. All of the STL containers use it internally, including std::string. But you may mean that you avoid using it manually, which is wise.

The basic problem is this: Allocating space on the stack requires incrementing a pointer. That's it. But allocating space on the heap requires:
1) Identifying the available blocks
2) Running an algorithm to choose which block best fits the allocation request (there are multiple possible meanings to "best fits" there)
3) Splitting the block into the memory to be returned and the remaining free space in that block
4) Upon release, determining whether (and when) to "coalesce" neighboring free blocks into a single larger block

You can generally expect allocation to be more or less logarithmic time, and freeing is *usually* going to be more or less constant time, but it's still longer than a pointer increment.

The "fragmentation" mentioned occurs when you get lots of non-contiguous free blocks which are *just* too small to fill your request. You could have half the memory free, but if you can't find a contiguous block of it big enough, you can't allocate.

There are three ways you can mitigate these problems:
1) Allow objects which use dynamic memory internally, such as std::vectors, to have long lifetimes when possible.

2) Cultivate allocation patterns which avoid worst-case situations. For example, I know an otherwise very bright researcher who is *always* doing the "reallocate one slot larger" pattern on his arrays----which is pretty much worst-case. The STL containers do pretty well in effecting good allocation patterns, but you can help them out---for instance, by proper use of vector::reserve().

3) Use custom allocators when appropriate. I admit I don't know many details on this. The standard allocators are a good compromise in a wide range of situations, but if you know more specifically what sort of allocation patterns you're going to have, you can sometimes find better allocator algorithms to suit them; and the STL containers allow you to substitute these custom allocators if you wish.

**Mybowlcut** · June 2nd, 2009, 11:10 PM

Originally Posted by yuenqi

Is this new of advertisement you can collect to tell the world of what you/your relatives have written ?

If it were you the writer, is there a checkpoint in the US to clearif.y if you were actually the writer of the book or someone else that "offered" you with a price ?

Can you do us all a favour and stop filling this forum with crap, please?

Originally Posted by Lindley

Well, first off, it's pretty much impossible to avoid using dynamic memory. All of the STL containers use it internally, including std::string. But you may mean that you avoid using it manually, which is wise.

The basic problem is this: Allocating space on the stack requires incrementing a pointer. That's it. But allocating space on the heap requires:
1) Identifying the available blocks
2) Running an algorithm to choose which block best fits the allocation request (there are multiple possible meanings to "best fits" there)
3) Splitting the block into the memory to be returned and the remaining free space in that block
4) Upon release, determining whether (and when) to "coalesce" neighboring free blocks into a single larger block

You can generally expect allocation to be more or less logarithmic time, and freeing is *usually* going to be more or less constant time, but it's still longer than a pointer increment.

The "fragmentation" mentioned occurs when you get lots of non-contiguous free blocks which are *just* too small to fill your request. You could have half the memory free, but if you can't find a contiguous block of it big enough, you can't allocate.

There are three ways you can mitigate these problems:
1) Allow objects which use dynamic memory internally, such as std::vectors, to have long lifetimes when possible.

2) Cultivate allocation patterns which avoid worst-case situations. For example, I know an otherwise very bright researcher who is *always* doing the "reallocate one slot larger" pattern on his arrays----which is pretty much worst-case. The STL containers do pretty well in effecting good allocation patterns, but you can help them out---for instance, by proper use of vector::reserve().

3) Use custom allocators when appropriate. I admit I don't know many details on this. The standard allocators are a good compromise in a wide range of situations, but if you know more specifically what sort of allocation patterns you're going to have, you can sometimes find better allocator algorithms to suit them; and the STL containers allow you to substitute these custom allocators if you wish.

Thanks for the detailed reply, Lindley.

I am aware that the STL uses dynamic memory for its containers, but I am curious whether doing something like pushing back (after calling reserve, of course) dynamically allocated (shared) pointers will speed up my programs when a memory-intensive operation (like loading a large level in a game) is required or just slow it down. Since vectors already allocate memory internally for elements, is that just like adding another layer of superfluous allocation?

**Lindley** · June 2nd, 2009, 11:45 PM

Originally Posted by Mybowlcut

Can you do us all a favour and stop filling this forum with crap, please?

I figure he's either a bot, a troll, or just insane. Either way best to just ignore.

**Chipmunk Baby** · June 2nd, 2009, 11:55 PM

Originally Posted by Lindley

I figure he's either a bot, a troll, or just insane. Either way best to just ignore.

Then that is ignorant he is! My family learn from those craps!

They use a boy to search for our security holes online, and introduce their old girl to him but he refused severely. They use his name to make up for their relatives and more, all about web online they have digged deeply.
It is Tran Hung and Hoang Mai Linh standing behind the scenes of mess. Oh Vietnamese in the US, modern people with modern plans of f.u.cking people young and old worldwide

**Mybowlcut** · June 3rd, 2009, 12:24 AM

Originally Posted by Lindley

I figure he's either a bot, a troll, or just insane. Either way best to just ignore.

Haha I'm too stubborn to ignore them... I'll make Brad aware of him/her or something...

Do you have any advice on my last reply?

**Lindley** · June 3rd, 2009, 06:58 AM

To the extent that copying the object itself (as opposed to the pointer to it) may be slow, going the shared pointer route could be helpful. However, I wouldn't make a practice of it without solid justification.

**JVene** · June 3rd, 2009, 07:44 AM

Lindley detailed the basics nicely, here's my 2 cents on the topic.

The thing(s) about the stack that seem pertinent to your performance concern is that it's likely associated in the CPU cache (one might say certainly under most conditions), but it's a limited resource. Windows dynamically extends the stack, and though I've not read how expensive a process that is, the over-reliance on the generous stack available under most modern operating systems does affect portability somewhat.

The heap is "all over the place". I just posted on another performance inquiry which reminded me just how this plays out (this doesn't come up very often). An algorithm separates an image in RGB into 3 color planes. Looping through pixels and copying them to 3 output buffers is a simple task, easily visualized. When the loop takes in a pixel (or a group of pixels) from the source, it then posts pixels (one at a time or in groups) to each of the 3 output buffers - all in a single loop - performance suffers.

I unrolled the loop to process the input 3 times, but send the output to each of the color separations one at a time, turning one loop of 3 outputs into 3 loops each with one output. I didn't expect this to have much of an effect, but overall the routine was over twice the speed. The original algorithm, which also incorporated array indexing of the output buffers, ran around 30 1Mpixel images per second, whereas the 'unrolled' and slightly better optimized routine ran over 350 images per second.

To the common modern CPU (x86 x64, even PowerPC) the stack is just another block of RAM. It's advantage is really that the heap memory management 'system' (CRT/OS) isn't consulted, and that it's likely been recently accessed so the cache has freshly pointed there.

Custom allocators are a significant source of performance improvement (Stroustrup indicates in at least a few texts or quotes that it can be 10 times the performance). In my own experience the result can range from moderate to stunning, and doesn't just involve the STL.

Two general categories come to mind in my own work experience. First, the simple fragmentation. Smart pointers of all types are, usually, the same size. A simple custom "new" for them improves their performance considerably. Imagine that instead of allocating smart pointers from the "standard" CRT allocation system, they're basically allocated out of an array (vector or whatever is available). I've seen performance increases of considerable import, but it's not the smart pointer (that's usually either on the stack or a member or similar) - it's the 'node' inside which represents the reference count.

The general allocation system gives you RAM which contains a block (which you don't or 'shouldn't see) that is basically a linked list pair of pointers (implementation/OS defined). This takes room. Many custom allocators that work off vectors avoid this, saving space - but that also implies a "step" in allocating and releasing which can be avoided. The other advantage is that if you group like objects or blocks by size, you can manage fragmentation through intelligent re-use.

A while ago I helped with a research project that used the STL to perform a proof of an algorithm that required a rather large in RAM processing step. The array problem Lindley mentioned struck, and though the added RAM (maxed out 32bit's 2Gbyte and then some using PAE), it wasn't enough. The 'blessing' of automatic vector expansion caused fragmentation. There was available RAM, just not contiguous. I call crashing a performance problem

So, I supplied a version of the vector (a stand-in) which worked like a vector but allocated in chunks, which served the purpose adequately and erased the fragmentation problem. The project went from bust to proof - the general topic was simply related to allocation of the underlying container.

Another issue I've run into is related to smart pointers, lots of small objects used by an algorithm and threading. The general allocator of the CRT is serialized. If several threads are busy allocating lots of small objects, which a container of smart pointers to objects is, there can be a performance hit that stalls the cores. What happens is that one thread 'locks the heap' by requesting RAM. A second thread attempts, but waits. Likely, on a 4 core that is being stuffed for work, a 3rd and 4th thread wait too.

The first thread releases, but is now going to jump right back into line because it's performing another "new" - it's now last in line.

Thread 2 can go - but it's just sitting there. Waiting on an event to be signaled. The event was signaled when the previous thread 1 released the lock on the heap, but the OS hasn't gotten around to giving thread 2 attention. That waits on the OS task scheduler. Thread 3 waits on thread 2, thread 2 gets in line. Thread 4 waits on 3, 1 waits on 4 - it goes on in this cycle. Individually, the threads could run million of these in a second, but in a threaded paradigm their stalling at the task scheduling switch speed - about 100,000 a second for all threads combined. CPU usage is nearly zero.

The key, for my own work, was to create a 'multiplexed memory allocator' - I've posted about it long ago around here. It basically uses a combination of custom "new" and thread local storage for targeted objects I know are going to be candidates of this problem (smart pointers and their internals are high candidates, but many others qualify). Instead of allocating from the heap 1 at a time, I allocated hundreds (tunable - sometimes a few thousand) at once. The subsequent several hundred are drawn from the thread local 'cache' of pre-allocated material. Threads no longer stall against each other, and even in single threaded versions of the same algorithm/application actually get a speed boost in most situations.

I apologize for not having exact references - I read about custom allocators long ago and have long forgotten the source - but the topic is treated in some dusty corner of the literature. It pays big dividends when it's really needed, but hardly matters much until the performance issue creeps into an application.

**nuzzle** · June 3rd, 2009, 08:41 AM

Originally Posted by Mybowlcut

I have found that most regulars here condemn the unnecessary usage of dynamic memory, but is there a time when performance requires it, or can I safely stick to the stack?

Shouldn't everything that's unnecessary be condemned?

In C++, stack allocation is favored mainly because the standard heap allocator delivered with most compilers is slow. Or as Alexandrescu puts it in Modern C++ Design,

"For occult reasons, the default allocator is noriously slow."

(before he goes on to describe an optimized small-object allocator for the Loki library).

What's important to realize though is that slow is relative. I'd say it's only if you're using the OO paradigm extensively heap allocation may be a performance problem. In all other situations using the standard allocator is fine. I would never recommend anyone to refrain from using the dynamic STL containers for performance reason. It's a very bad policy. If an STL container is slow is far more likely that you've picked the wrong one, rather than because of slow heap allocations.

But if you rely on OO and allocate zillions of small objects (including smart pointers) then it may pay off to replace the standard allocator with one optimized for that allocation pattern. Or maybe, as an alternative, have special allocation on a per class basis. For example using an intrusive smart pointer avoids heap allocation (the access counter is stored inside the object instead for on the heap).

Sometimes you get an optimized small-object allocator for free. I'm using the Intel TBB package (for fine-grain multithreading) which recomments you to replace the standard allocator with a special TBB allocator which also speeds up small object allocation dramatically. But there are also other replacement allocators available, both open source and commersial.

I'd say you can safely stick to the stack but you shouldn't refrain from using dynamic STL containers for performance reasons.

**ltcmelo** · June 3rd, 2009, 09:32 AM

Originally Posted by Mybowlcut

I rarely use dynamic memory in my programming - generally only when I need polymorphism or control over the lifetime of objects... Should I keep it this way?

Dynamic memory is also useful when writing exception safe code. In particular, when the strong guarantee is desired.

A good example is one provided in gotw where a function needs to compose a string. Returning the string by value isn't strongly safe because its copy constructor (or assingment) might throw. Passing it to the function by reference as an argument isn't enough either because assingment might throw.

**Mybowlcut** · June 3rd, 2009, 11:03 PM

Wow thanks for the replies, guys.

I feel that, after reading everything put forward, my concerns were probably exaggerated... although the heap fragmentation being solved by a custom allocator sounds like something I will try out one day once my programs become more memory-hungry.

Cheers!

**nuzzle** · June 3rd, 2009, 11:41 PM

Originally Posted by Mybowlcut

although the heap fragmentation being solved by a custom allocator

Note that the purpose of a custom heap allocator is not to conserve memory space. On the contrary. To make heap allocations/deallocations faster it rather uses more memory than the standard allocator.

In C++, memory on the heap cannot be freely moved around because the language requires that once an object has been allocated it sticks to that very physical position (otherwise pointers suddenly would point to something else). This prevents the allocator from performing defragmentation runs. To allow that, C++ would need relative pointers like .NET and Java. Here pointers (or references as they're called) only indirectly points at physical memory which means they can be updated when objects are moved around by the garbage collector.

**Mybowlcut** · June 4th, 2009, 12:06 AM

Originally Posted by nuzzle

Note that the purpose of a custom heap allocator is not to conserve memory space. On the contrary. To make heap allocations/deallocations faster it rather uses more memory than the standard allocator.

In C++, memory on the heap cannot be freely moved around because the language requires that once an object has been allocated it sticks to that very physical position (otherwise pointers suddenly would point to something else). This prevents the allocator from performing defragmentation runs. To allow that, C++ would need relative pointers like .NET and Java. Here pointers (or references as they're called) only indirectly points at physical memory which means they can be updated when objects are moved around by the garbage collector.

How does it use more memory? Could you use JVene's example (the allocator that allocated in chunks to prevent fragmentation)?

**nuzzle** · June 4th, 2009, 12:58 AM

Originally Posted by Mybowlcut

How does it use more memory? Could you use JVene's example (the allocator that allocated in chunks to prevent fragmentation)?

A common way is to use fixed sized block (bins). See here Doug Lea's allocator for example,

http://gee.cs.oswego.edu/dl/html/malloc.html

The standard allocators in use today should be at least as sophisticated as this one. But I cannot swear on it because frankly I don't know

. Maybe someone else knows for sure what's shipped with the common compilers?

Thread: [RESOLVED] Performance of dynamic memory

Thread Tools

Display

[RESOLVED] Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Re: Performance of dynamic memory

Posting Permissions