  1. #1
    Join Date
    Sep 2014
    Posts
    7

    Hash table efficiency

    I am wondering if anyone has experimented with a dynamic hash table class that reallocates its buckets based on the number of collisions encountered while storing new key-value pairs. If the efficiency were to fall below a certain rating, the hash table would be rebuilt using a larger number of buckets, with that number chosen according to the growth model the programmer expects (for example linear, exponential, or logarithmic).

    I know that if the hash table is reallocated with a new number of buckets, the data in the table must be redistributed, so there is a time tradeoff whenever the table is rebuilt. The question is: would the tradeoff be worth it in the long run?
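
    A minimal sketch of what such a self-resizing table could look like, using separate chaining and a simple doubling policy (the class name CollisionAwareTable and the rebuild threshold are invented for illustration; this is not production code):

[code]
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Minimal separate-chaining table that rebuilds itself when the ratio of
// collisions to stored elements crosses a threshold. Purely illustrative.
template <typename K, typename V, typename Hash = std::hash<K>>
class CollisionAwareTable {
public:
    explicit CollisionAwareTable(std::size_t buckets = 16)
        : buckets_(buckets ? buckets : 16) {}

    void insert(const K& key, const V& value) {
        auto& chain = buckets_[index(key)];
        if (!chain.empty())
            ++collisions_;                       // bucket already occupied
        chain.emplace_back(key, value);
        ++size_;
        // Rebuild with twice the buckets once collisions dominate.
        if (collisions_ > size_ / 2)
            rehash(buckets_.size() * 2);
    }

    const V* find(const K& key) const {
        for (const auto& kv : buckets_[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

private:
    std::size_t index(const K& key) const {
        return Hash{}(key) % buckets_.size();
    }

    void rehash(std::size_t new_bucket_count) {
        std::vector<std::list<std::pair<K, V>>> fresh(new_bucket_count);
        for (auto& chain : buckets_)
            for (auto& kv : chain)
                fresh[Hash{}(kv.first) % new_bucket_count].push_back(std::move(kv));
        buckets_.swap(fresh);
        collisions_ = 0;                         // statistics restart after a rebuild
    }

    std::vector<std::list<std::pair<K, V>>> buckets_;
    std::size_t size_ = 0;
    std::size_t collisions_ = 0;
};
[/code]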

  2. #2
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Hash table efficiency

    A good hash table (properly sized) has a good hash function tailored to that table.
    If you have neither a good hash table nor a good hash function: "who cares".

    The whole point of a hash table is to find an acceptable balance between size, the number of collisions, and the collision-resolution methodology.

    If you have issues with growth, then a hash table isn't the ideal container for you.

  3. #3
    Join Date
    Sep 2014
    Posts
    7

    Re: Hash table efficiency

    Quote Originally Posted by OReubens View Post
    If you have neither a good hash table nor a good hash function: "who cares".
    I DO understand the point of a hash table. So let us assume that I DO have a good hash table AND a good hash function, and answer the question (if possible). My point isn't to resolve issues with growth. My point is to come up (if possible) with an improved hash table that can accommodate anywhere from a few dozen to tens of thousands (or in the future possibly hundreds of thousands) of key-value pairs with equal efficiency, without knowing the size of the table ahead of time, since my solution must deal with both scenarios. The only thing I do know with certainty is that the table will be built before a significant number of look-ups occur. A red-black tree will NOT work with the potentially vast numbers of key-value pairs that will need to be dealt with in a good number of situations. (My project is module based, and each module will potentially be vastly different.)
    Last edited by primem0ver; November 5th, 2014 at 12:34 PM.

  4. #4
    2kaud (Super Moderator)
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,824

    Re: Hash table efficiency

    The issue with hash tables is clashes and the cost of dealing with a clash - both in terms of writes and of reads. In my experience I haven't done real-time dynamic hash table re-allocation, but rather 'off-line' hash table re-allocation as required, based upon collected clash statistics.
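
    For the 'collected clash statistics' approach, something like the following can be run against a populated table to gather the numbers for an off-line sizing decision (a small sketch using the standard std::unordered_map bucket interface; the key/value types are just examples):

[code]
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

// Dump per-bucket occupancy of an existing unordered_map so the numbers can
// be analysed off-line and a better bucket count chosen for the next run.
void dump_bucket_stats(const std::unordered_map<std::string, int>& table) {
    std::size_t worst = 0, collided = 0;
    for (std::size_t b = 0; b < table.bucket_count(); ++b) {
        std::size_t n = table.bucket_size(b);
        if (n > 1) collided += n - 1;   // entries sharing a bucket with another key
        if (n > worst) worst = n;
    }
    std::cout << "buckets: "            << table.bucket_count()
              << "  load factor: "      << table.load_factor()
              << "  colliding entries: " << collided
              << "  longest chain: "     << worst << '\n';
}
[/code]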
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  5. #5
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Hash table efficiency

    Quote Originally Posted by primem0ver View Post
    I DO understand the point of a hash table.
    It would appear you don't. Or rather, not well enough.

    If you have a true hash table, it needs to be big enough to hold all possible values plus some extra room to accommodate collisions. How much extra room is where the "art" lies in getting a good hash table with a good hash function. A perfect hash function would cause no collisions and need no extra room. Obviously this is unlikely to happen, so you find a balance between sacrificing extra memory and sacrificing performance by having more collisions to deal with.

    So the mere fact of "not knowing ahead of time the size of the table" fundamentally violates the basic premise of a hash table.
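
    To make the "true hash table" point concrete, here is a minimal sketch of a fixed-capacity, open-addressing table with linear probing (the name FixedHashTable is invented for illustration): the capacity is chosen once, up front, and when the slack reserved for collisions runs out, insertion simply fails rather than growing.

[code]
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Fixed-capacity, open-addressing table: capacity is fixed at construction,
// lookups would probe the same linear sequence. Illustrative only.
template <typename K, typename V>
class FixedHashTable {
public:
    explicit FixedHashTable(std::size_t capacity) : slots_(capacity) {}

    bool insert(const K& key, const V& value) {
        std::size_t start = std::hash<K>{}(key) % slots_.size();
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            auto& slot = slots_[(start + i) % slots_.size()];   // linear probe
            if (!slot) { slot.emplace(key, value); return true; }
        }
        return false;   // table full - no room left to absorb collisions
    }

private:
    std::vector<std::optional<std::pair<K, V>>> slots_;
};
[/code]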


    Now... if you are talking about a hash table combined with a secondary container to store collisions (typically a linked list, but it could be something else too), that's a somewhat different matter.
    In that case, resizing and rebuilding can technically be done, but it will produce a noticeable hiccup in your software, over which neither the user nor the consumer of your class has any control. That could be anything from mildly annoying to catastrophic in cases where the consumers expect O(x)-type behaviour for an insert. And the problem only gets worse the bigger the table gets.
    So technically possible: yes.
    Good idea: no.


    Hash tables don't make sense unless they're at least big enough to warrant the calculation of a hash key and collision resolution, and big enough that they're going to be faster than something comparable like a binary search tree (or, more appropriately, a red/black tree). That means a few thousand elements at least. Again, your better bet is making it big enough that it never deteriorates so badly that a bigger hash table would pay off, and hoping that you don't waste too much memory in the process of making it that big.
    Hash tables are awesome when used right. More often than not, if you don't really know what you're doing, you're better off with a std::map.

  6. #6
    Join Date
    Jul 2005
    Location
    Netherlands
    Posts
    2,042

    Re: Hash table efficiency

    Quote Originally Posted by OReubens View Post
    So the mere fact of "not knowing ahead of time the size of the table" fundamentally violates the basic premise of a hash table.
    When you say 'hashtable size' do you mean 'number of buckets'?
    I think OP meant not knowing the maximum number of data elements to be stored ahead of time.
    Quote Originally Posted by OReubens View Post
    Now... if you are talking about a hash table combined with a secondary container to store collisions (typically a linked list, but it could be something else too), that's a somewhat different matter.
    In that case, resizing and rebuilding can technically be done, but it will produce a noticeable hiccup in your software, over which neither the user nor the consumer of your class has any control. That could be anything from mildly annoying to catastrophic in cases where the consumers expect O(x)-type behaviour for an insert. And the problem only gets worse the bigger the table gets.
    So technically possible: yes.
    Good idea: no.
    What makes you say this is a bad idea?
    IIRC that's pretty much what the VC++ implementation of std::unordered_map uses. Last time I measured, lookup was about 5x faster using std::unordered_map than using std::map, with somewhere in the order of 10,000 to 100,000 elements.
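
    For anyone who wants to reproduce that kind of comparison, a rough timing sketch along these lines works (the exact ratio depends heavily on key type, hardware and optimisation level; integer keys are used here just to keep it short):

[code]
#include <chrono>
#include <iostream>
#include <map>
#include <unordered_map>

// Rough lookup comparison between std::map and std::unordered_map.
int main() {
    constexpr int N = 100000;
    std::map<int, int> ordered;
    std::unordered_map<int, int> hashed;
    for (int i = 0; i < N; ++i) { ordered[i] = i; hashed[i] = i; }

    auto time_lookups = [N](auto& container) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < N; ++i) sum += container.at(i);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << sum << ' ';   // keep the loop from being optimised away
        return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    };

    std::cout << "map: "           << time_lookups(ordered) << " us, "
              << "unordered_map: " << time_lookups(hashed)  << " us\n";
}
[/code]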
    Cheers, D Drmmr

    Please put [code][/code] tags around your code to preserve indentation and make it more readable.

    As long as man ascribes to himself what is merely a possibility, he will not work for the attainment of it. - P. D. Ouspensky

  7. #7
    Join Date
    Jul 2013
    Posts
    576

    Re: Hash table efficiency

    Quote Originally Posted by primem0ver View Post
    So let us assume that I DO have a good hash table AND a good hash function
    All hash-based data structures I've come across (in Java and C++) base relocation on the load factor. Relocation takes place when the load factor reaches a certain level (specified by default or by you).

    But the load factor and the number of collisions are intimately linked. In fact, if you assume a perfect (uniformly distributed) hash function they carry the same information: if you know the current load factor you can calculate the expected number of collisions, and the other way around. So in the world of perfect hash functions the load factor is as good as the number of collisions.
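
    As a rough, standard estimate (assuming a uniform hash over m buckets): after inserting n keys, the expected number of inserts that landed on an already-occupied bucket is about n - m*(1 - (1 - 1/m)^n), which is approximately n - m*(1 - e^(-n/m)) for large tables. Both quantities are effectively functions of the load factor n/m alone, which is why monitoring the load factor is as good as counting collisions under this idealised assumption.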

    Some language standards (C++11) allow you to force a rehash, specifying the new size. This means that you can implement your own growth model: just monitor the load factor as items are inserted and, when it reaches a certain level, rehash to the new size you want. But although you could do this, I would first and foremost rely on the default growth model. It's usually well tuned, based on long experience.

    Best of course would be if you knew the expected total number of items beforehand. Then you could set a big enough initial size and there would be no rehashing at all.
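
    For example, with std::unordered_map the C++11 bucket interface lets you pick any of these policies (a sketch; the thresholds and sizes are not tuned values, just placeholders):

[code]
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> table;

    // Default policy: the table rehashes automatically whenever
    // load_factor() would exceed max_load_factor() (1.0 by default).
    table.max_load_factor(0.7f);

    // Manual policy: monitor the load factor and force a rehash to a
    // size of your own choosing (here simply doubling the buckets).
    auto insert_monitored = [&](const std::string& key, int value) {
        table[key] = value;
        if (table.load_factor() > 0.6f)
            table.rehash(table.bucket_count() * 2);
    };
    insert_monitored("example", 1);

    // If the total element count is known up front, reserve() sizes the
    // bucket array once and avoids rehashing during insertion entirely.
    std::unordered_map<std::string, int> sized;
    sized.reserve(100000);
}
[/code]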
    Last edited by razzle; November 6th, 2014 at 05:17 AM.

  8. #8
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Hash table efficiency

    Quote Originally Posted by D_Drmmr View Post
    When you say 'hashtable size' do you mean 'number of buckets'?
    I think OP meant not knowing the maximum number of data elements to be stored ahead of time.
    In a 'true' hash table, the size is the maximum number of elements it can possibly hold.
    If you're talking about buckets, then you have a hybrid solution where the hash table is a first subdivision and the "buckets" are secondary structures that can hold an 'unlimited' number of values.


    Quote Originally Posted by D_Drmmr View Post
    What makes you say this is a bad idea?
    If an insert can cause a reallocation and rebuild of the whole table, then there can be no guarantee about the amount of time an insert takes. A single insert could take long enough to make your program appear to be hung.

    That's bad.
    In some use cases, it's catastrophic.


    100K items is peanuts. :-)

    The problem with 'map' is that programmers get lazy and just put their key (whatever it is) and value in,
    then get disappointed that their 300-character-long strings don't work as fast as they expect.
    If you say map is slow, are you sure that's because map is slow, or because you took an easy way out in handling your keys?
    Sometimes the easy approach is "good enough", sometimes it isn't. Don't blame map for using it badly.

  9. #9
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Hash table efficiency

    I've implemented a couple of dynamic hash tables for different applications in the past. They needed O(x) guarantees, so they were implemented by hand and there was no rehashing. In other words, the worst-case scenario is everything falling into the same bucket, degrading the structure into a linked list.

    However, in both cases the data input values and ranges were from a fixed domain. This allowed me to model different hash functions and table sizes before writing any code (I like to use Excel).

    gg
