Heap efficiency?

**Russco** · November 27th, 2008, 11:33 AM

Originally Posted by TheCPUWizard

1) Be careful about object lifetime. Are you calling new/delete at the most efficient times? Can you "re-use" blocks instead of returning them to the general heap?

Is this advocating liberal use of placement new or an allocator that does all this for you behind the scenes such as the small object allocator in Loki?

**TheCPUWizard** · November 27th, 2008, 10:14 PM

Originally Posted by Russco

Is this advocating liberal use of placement new or an allocator that does all this for you behind the scenes such as the small object allocator in Loki?

Actually it is advocating...

1) READ the material from the book mentioned in Reply #3

2) Run experiments with YOUR compiler on YOUR (target) hardware, and measure the performance differentials for various alternatives.

3) Measure your application performance against a defined performance specification.

4) IF Your application measurements show insufficient performance, AND if the amount of time measured for heap management is significant AND IF the differential experiements performed earlier indicate that any one of the three items (mentionedin previous posts), then those methodologies should be applied.

In general I recommend against "liberal use" of anything "throughout" an application. IT is (in general) easier to implement, test, measure, and maintain something (such as placement new, or a small object allocatior) if it is encapuslated and centrally implemented.

**_uj** · December 29th, 2008, 09:39 AM

I've investigated this matter further and come to a conclusion.

The reason I became worried about heap efficiency in the first place was a line from Modern C++ Design by Andrei Alexandrescu. It reads: "For occult reasons, the default allocator is notoriously slow". He then goes on to address the problem by designing a small object allocator to serve as a replacement.

I interpret this to mean that it's the consenus view of the C++ community that the default allocator supplied with common compilers generally is too slow to handle a programming style which makes frequent use of object allocations on the heap, such as for example OO (object oriented) programming. I also interpret this to mean that it's a very good idea to replace the default allocator with something more efficient.

What I didn't realize at the time was that I already had what I was looking for. I'm using TBB, an open source fine-grain multithreading library from Intel. I post a link here to an article describing the TBB memory allocator for two reasons. Firstly it's independent from the rest of the TBB package so people may want to consider using it although they may not be interested in multithreading for the time being. And secondly the article contains references to other allocators, both free and commersial.

ftp://download.intel.com/technology/...e_Software.pdf

My conclusion is that for serious OO programming to be feasible in C++ the standard memory allocator must be replaced with something more efficient. The standard allocator sucks and is likely to be a major bottleneck. That's the plain and simple truth, and it amazes me that this fact isn't out more in the open, especially since C++ is considered to be THE language for efficiency conscious programmers. I suggest anybody with an ambition to write fast and nimble OO programs immediately starts looking for a better memory allocator.

**TheCPUWizard** · December 29th, 2008, 10:03 AM

Originally Posted by _uj

My conclusion is that for serious OO programming to be feasible in C++ the standard memory allocator must be replaced with something more efficient. The standard allocator sucks and is likely to be a major bottleneck. That's the plain and simple truth, and it amazes me that this fact isn't out more in the open. Especially since C++ is considered to be THE language for efficiency conscious programmers. I suggest anybody with an ambition to write fast and nimble OO programs immediately start looking for a better memory allocator.

In order to support that conculsion, you would have to analyze a sample set of applications with different allocators, and measure the application level performance difference.

But there is an even eaiser way to contratict your conclusion...There are hundreds of thousands of C"Serious OO programs" written in C++ using the default allocator. If you conclusion were true, then these programs would not exist.

I HAVE done a significant amount of performance analysis on applications ranging from massive business applications to high performance real-time systems (including industrial process/machine control and military weapons systems [no good shooting where the plane was]. Only in very very rare cases was heap management even measurable as an impact on performance.

The following numbers are from a project I just completed on real-time gas dispersion:

Sample Time: 132.3 mS
Heap Allocations: 32,418
Heap ManagemenrtTime: 2.83mS

So we have (using the allocator provided with VC++ 9.0, a 2.13% time spent dealing with the heap.

Assume you have an allocator which is 100 times faster. You would get.

Sample Time: 129.5 mS
Heap Allocations: 32,,418
Heap Management Time: 0.0283mS

This represents a 2.11% performance improvement. Hardly a significant issue.

You are falling into the trap of premature optimization. The cases where an optimized allocator makes sense is when the number of (small) allocation/release cycles are extremely high AND there is little or no code in the constructor/destructor [ie you are making heavy use of the FlyWieght pattern]

**_uj** · December 29th, 2008, 11:52 AM

---

**_uj** · December 30th, 2008, 04:26 AM

Originally Posted by TheCPUWizard

Only in very very rare cases was heap management even measurable as an impact on performance.

Contrary to your personal experience C/C++ applications allocate quite a lot of dynamic memory. In the Intel link I supplied there's a reference to a study which shows that rather 20% of the execution time is spent allocating memory than 2% (as your example suggests). This study is from 1994 and with the increasing use of OO the percentage is more likely to be higher today than lower.

You are falling into the trap of premature optimization.

Well, in that case I'm in good company. I'm sure Alexandrescu didn't design a small object allocator in his book just to show off, but because he felt there's a need for one.

Introducing a more efficient memory allocator will in fact counteract premature optimizations. The reason is that if programmers feel object allocations are expensive they will avoid them and this will negatively influence the proper use of OO. Proper object allocations will be avoided for the sake of reducing heap usage, and that's premature optimization. In principle inefficient object allocation is the worst enemy of good OO design.

What says exactly the standard memory allocator has the ultimate level of efficiency? If a compiler supplier improves the standard allocator are they then guilty of premature optimization? If not, why is improving the standard allocator a virtue but replacing it a sin?

So my conclusion still is that if you're into modern OO, which relies on generous object allocation (including shared smart pointers), it's a very good idea to replace the standard memory allocator, just like Alexandrescu suggests. And if you program to utilize multiple processor cores, then using a more efficient object allocator is close to mandatory.

I feel I have heavy support for my view. I specifically draw on the Modern C++ Design book by Alexandrescu, a book you yourself in this thread recommended every programmer to read and understand. I also draw on the experience of Intel designing the TBB library.

**superbonzo** · December 30th, 2008, 08:33 AM

Hi _uj, thanks for the interesting paper...

Originally Posted by _uj

In the Intel link I supplied there's a reference to a study which shows that rather 20% of the execution time ...

I was so shocked to see this result that I felt the need to read it;

Actually both the Intel paper and the referenced paper do not speak of a <mean> relative execution time of 20%:
the intel paper speaks of un upper estimate and the Detlefs's paper reports such high ratios only for 2 over 5 allocators (one of which implements a garbage collector).
Furthermore, all mentioned non-garbage collected allocators exhibit a high variance of results (GNU: mean<17%>,min<1.6%>,max<52%>, ULTRIX: mean<6.97>,min<0.6%>,max<20%>, ...) indicating a high specificity on algorithm allocation usage.

Moreover, relative instruction counts were measured reather then relative execution times: the two concepts coincide only neglecting idle time (that is wrong, in general)

( It's also curious that the most allocation dependent code reported in that paper is CFRAC: a large integers factoring program ! ... )

Of course, the problem of memory allocation in a multithreaded environment has an entirely different perspective:

by the way, I have a question for you: as the Intel paper argues, heap access needs some synchronization mechanism; fixed a "typical" C++ compiler + standard allocator + OS, what's the relation between heap locking and object costruction ? (for example, does the heap remain locked until object costruction is fully completed (eg until the ctor returns) ? )

**TheCPUWizard** · December 30th, 2008, 09:28 AM

As you pointed out, the Intel study was done 15 years ago (published 14 years ago). Much has changed since then. As object complexity increases (especially constructor complexity with the adoption of RAII) the dynamics radically change.

Consider the following code:

Code:

Code:
{
   MyClass  instance1;
}

{
   MyClass *instance2 = new MyClass();
   delete instance2;
}The overhead of the heap management is the DIFFERENCE in the times.

When the intel study was done, it was based on "C" code which did NOT have constructor/destructor semantics. The initialization of state for the instance was done OUTSIDE the timed regions.

This radically skewed the impact when applied to complex C++ objects.

For most applications, performance tuning comes NOT from changing the allocator, but rather from the usage of Object Pools. It is typically much faster to restore an object instance to a known state then it is to initialize an object from scratch.

The result is that the percentage of time in allocation/deallocation dramatically to 2%-5% of the total time.

**TheCPUWizard** · December 30th, 2008, 09:39 AM

I specifically draw on the Modern C++ Design book by Alexandrescu, a book you yourself in this thread recommended every programmer to read and understand

Definately!...Now look at the last word that I quoted.

The default allocator is designed to give "reasonable" performance under "worst case" conditions. This means that it will "work" for nearly every application. The cost of this is a sacrifice in performance under specific conditions (most notably large numbers of small allocation/release cycles).

Nearly every replacement (including Alexandrescu's and Intel's) will have worse performance than the default allocator under specific conditions. In some cases (especially varying size large allocations on low memory systems) it will be much worse.

This is where understanding becomes critically important, and this understanding comes from actual measurements [of application level performance, NOT micro-benchmarking) performed on real applications using real data on actual systems.

There ARE cases where avoiding the default allocator is a relatively obvious choice. It can be avoided either by using a diefferent allocator, or by implementing a pool explicitly (both accomplish the same thing of having a ready supply of memory blocks of a specific size).

A good case in point is an image processing application where individual pixels are going to be represented by distinct objects using the Flyweight design pattern. In this case you may have 100 million (or more) allocations (each of size_t *n n=1,2,3). Even uS differences in allocation speeds will translate to minutes of processing time. [The PET scan processor I helped develop took this approach with 1GB+ raw data samples).

Even in this extreme case, it made sense [based on timing analysis] to implement a custom scheme ONLY for the flyweights, and let all of the other objects use the default allocator.

**Yves M** · December 30th, 2008, 11:38 AM

Well I agree more with that last analysis. For specific cases it may make sense to have a specific allocator for those kinds of objects. I've had plenty of examples where I used a custom allocator because something around 50% or more of the time was spent on heap management. However I've only used those on the objects that were posing the problem. Undo management, parsing trees and string caches are the examples that come to mind. However the advice is mostly as everyone has already said:
Measure how much it represents and if it's important then find a way to change it.

A general OO program that doesn't crunch data and has to maintain huge collections with non-standard execution behavior doesn't suffer from the standard allocators. They are actually designed so they can handle those cases well.

**cj-wijtmans** · December 30th, 2008, 12:49 PM

i dont understand how a reallocation improved allocator, talking about malloc or heapalloc and the heap expansion function, with placement new[] can be any worse in any scenario. I dont buy it. Give me the example!

Ofcourse placement new only has to be called on non-POD classes, this can even improve performance by ALOT. and this is easily done by using a typeinfo class.

i dont know anything about STL, so i have no idea if an allocator can implement the typeinfo class in that way.

also since when is it standard to expect "low memory" systems?
and since when do all systems have swapfiles?

**TheCPUWizard** · December 30th, 2008, 01:48 PM

Originally Posted by cj-wijtmans

i dont understand how a reallocation improved allocator, talking about malloc or heapalloc and the heap expansion function, with placement new[] can be any worse in any scenario. I dont buy it. Give me the example!

THREE typical examples...

1) On the very first allocation, the allocator must get memory from the system. Generally this can not be done faster than the default allocator does it. There is almost always additional overhead to set up the tracking structures that the "optimized" allocator will use.

2) Many implementations use the following (psuedo code) pattern:

Code:

void *MyAllocator(int size)
{
   if (size < threshold)
      return GetOptimizedSmallBlock();
  else
      return SystemAllocatedMemoryBlock();
}

By the very definition, this implementation will NOT be faster than a first call to "SystemAllocatedMemoryBlock".

3) [previously mentioned, but repeated for completeness]. The application requests large blocks (relative to the amount to total heap available) in varying sizes (non-repeatable) with overlapping allocations. Any scheme which retains blocks of a given size for faster access will NOT find appropriate memory blocks. It must then fall back into a searching pattern. Because of the focused optimization for repeated allocations of the same (typically small) size, it is common that this path incurs additional overhead (e.g. checking the initial lists of known block sizes)

I am willing to state, that if you give me ANY implementation, I can find at least one case where the timing is worse than the default allocator.

**superbonzo** · December 31st, 2008, 05:01 AM

Hi TheCPUWizard, your examples are very convincing but you defined "heap overhead" as

Code:

{
   MyClass  instance1;
}
/* <-- DIFFERENCE in the times of --> */
{
   MyClass *instance2 = new MyClass();
   delete instance2;
}

but _uj's concern was about the impact of heap synchronization in multithreaded environments, where such a time difference isn't,in general, well defined.

Maybe I'm missing something obvious, but how can you state that the default allocator (that I suppose makes sequential accesses to a shared heap) is optimal in heavy multithreaded C++ appications ?

**TheCPUWizard** · December 31st, 2008, 08:58 AM

Originally Posted by superbonzo

Maybe I'm missing something obvious, but how can you state that the default allocator (that I suppose makes sequential accesses to a shared heap) is optimal in heavy multithreaded C++ appications ?

Actually I never did.

My statement is and always has been that:

The default allocator, has been designed to give decent performance across the widest range of applications, while replacement allocators are targeted towards specific situations. Therefore until you have mesaured a specific situation, you do not know the impact of the allocator, and stating that it must be changed, is a premature decision.

The conslusion that can be drawn from this statement is two fold:

1) For ANY specificcondition, and allocator can be designed that will perform better than the default allocator.

2) Until you know the specific characteristics, you do not have suffient information to know if the allocator is a significant impact to application level performance. Nor do you know which replacement allocator will be best.

Much has focused on "small objects", but the situation with heavy multi-threaded applications is very similar. In the past 18 months I did two heavily threaded applications (100+ active threads running on SMP servers).

The first used a messaging system to communicate between threads. The threads did alot of dynamic data processing. The only cross thread data was the messages (and related synchronization). The chosen design was that each thread got its own heap, and memory (other than the messages) had to be allocated and freed on the same thread. NOTE: this is very similar to the way memory across DLL's can be handled.

The second did intense CPU processing on the threads along with some IO. Data was heavily shared, but 90%+ of the allocations occured on one specific thread. The percentage of time spend being blocked (due to serialization of the heap management) was negligable, and no specific action was taken.

Again, at no point did I ever saw that custom allocators were "bad", or any such thing. Simply that until you analyze a specific application and determine that the allocator is a significant issue, there are many many more important issues to be concerned with.

Remember the original question was: (emphasis added)

I wonder how efficient is the C++ heap generally?

And I stand by my statement, that when looking at typical applications that are well designed, it is generally efficient enough to NOT be a primary or even secondary concern.

**_uj** · January 1st, 2009, 06:29 AM

Originally Posted by TheCPUWizard

And I stand by my statement, that when looking at typical applications that are well designed, it is generally efficient enough to NOT be a primary or even secondary concern.

And I still think my conclusion is correct. In a typical well designed OO application (characterized by generous object allocations on the heap and ubiquitous use of non-intrusive smart pointers) the standard memory allocator will be a major bottleneck. I think there's a very good reason why Alexandrescu devotes 20 percent of his much acclaimed Modern C++ Design book to efficient small object allocation and smart pointers.

I used to avoid heap allocations for performance reasons even though I knew I shouldn't for design reasons. I'm sure many share this experience with me. Knowing I have an efficient heap allocator has put an end to this self-destructive behaviour. This psychological effect shouldn't be underestimated. So in addition to having a real effect on runtime performance, efficient object allocation also helps avoid premature optimization.

I realize my conclussions are based on hearsay and personal experience. Therefore I would wellcome an ambitious investigation from some independent party into how different allocators influence the performance of OO programs. If it turns out that standard (or maybe they should be called general) allocators are good enougth then fair enougth. But until then I prefer an allocator with the outspoken objective of performing well under the OO programming style. That's the "standard" allocator I want. I don't quite understand why there's such an opposition to such a wish. Why would it be so much better if I use the allocator that came with the compiler?

Anyway I've made up my mind so I leave this discussion. Thank you all for your input.

Thread: Heap efficiency?

Thread Tools

Display

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Re: Heap efficiency?

Posting Permissions