  1. #1
    Join Date
    Sep 2014
    Posts
    7

    Hash table efficiency

    I am wondering if anyone has experimented with a dynamic hash table class that reallocates its buckets based on the number of collisions encountered while storing new key-value pairs. If the efficiency were to fall below a certain rating, the hash table would be rebuilt using a larger number of buckets, with that number chosen according to the growth model the programmer expects (for example linear, exponential, or logarithmic).

    I know that if the hash table is reallocated with a new number of buckets, the data in the table must be redistributed, so there is a time tradeoff whenever the table is rebuilt. The question is: would the tradeoff be worth it in the long run?
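
    A minimal sketch of what such a self-resizing table could look like, using separate chaining and a simple doubling policy (the class name CollisionAwareTable and the rebuild threshold are invented for illustration; this is not production code):

[code]
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Minimal separate-chaining table that rebuilds itself when the ratio of
// collisions to stored elements crosses a threshold. Purely illustrative.
template <typename K, typename V, typename Hash = std::hash<K>>
class CollisionAwareTable {
public:
    explicit CollisionAwareTable(std::size_t buckets = 16)
        : buckets_(buckets ? buckets : 16) {}

    void insert(const K& key, const V& value) {
        auto& chain = buckets_[index(key)];
        if (!chain.empty())
            ++collisions_;                       // bucket already occupied
        chain.emplace_back(key, value);
        ++size_;
        // Rebuild with twice the buckets once collisions dominate.
        if (collisions_ > size_ / 2)
            rehash(buckets_.size() * 2);
    }

    const V* find(const K& key) const {
        for (const auto& kv : buckets_[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

private:
    std::size_t index(const K& key) const {
        return Hash{}(key) % buckets_.size();
    }

    void rehash(std::size_t new_bucket_count) {
        std::vector<std::list<std::pair<K, V>>> fresh(new_bucket_count);
        for (auto& chain : buckets_)
            for (auto& kv : chain)
                fresh[Hash{}(kv.first) % new_bucket_count].push_back(std::move(kv));
        buckets_.swap(fresh);
        collisions_ = 0;                         // statistics restart after a rebuild
    }

    std::vector<std::list<std::pair<K, V>>> buckets_;
    std::size_t size_ = 0;
    std::size_t collisions_ = 0;
};
[/code]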

  2. #2
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Hash table efficiency

    A good hash table (properly sized) has a good hash function tailored to that table.
    If you have neither a good hash table nor a good hash function: "who cares".

    The whole point of a hash table is to find an acceptable balance between size, the number of collisions, and the collision-resolution methodology.

    If you have issues with growth, then a hash table isn't the ideal container for you.

  3. #3
    Join Date
    Sep 2014
    Posts
    7

    Re: Hash table efficiency

    Quote Originally Posted by OReubens View Post
    If you have neither a good hash table nor a good hash function: "who cares".
    I DO understand the point of a hash table. So let us assume that I DO have a good hash table AND a good hash function, and answer the question (if possible). My point isn't to resolve issues with growth. My point is to come up (if possible) with an improved hash table that can accommodate anywhere from a few dozen to tens of thousands (or in the future possibly hundreds of thousands) of key-value pairs with equal efficiency, without knowing the size of the table ahead of time, since my solution must deal with both scenarios. The only thing I do know with certainty is that the table will be built before a significant number of look-ups occur. A red-black tree will NOT work with the potentially vast numbers of key-value pairs that will need to be dealt with in a good number of situations. (My project is module based, and each module will potentially be vastly different.)
    Last edited by primem0ver; November 5th, 2014 at 12:34 PM.

  4. #4
    2kaud (Super Moderator)
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,824

    Re: Hash table efficiency

    The issue with hash tables is clashes and the cost of dealing with a clash - both in terms of writes and of reads. In my experience I haven't done real-time dynamic hash table re-allocation, but rather 'off-line' hash table re-allocation as required, based upon collected clash statistics.
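
    For the 'collected clash statistics' approach, something like the following can be run against a populated table to gather the numbers for an off-line sizing decision (a small sketch using the standard std::unordered_map bucket interface; the key/value types are just examples):

[code]
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

// Dump per-bucket occupancy of an existing unordered_map so the numbers can
// be analysed off-line and a better bucket count chosen for the next run.
void dump_bucket_stats(const std::unordered_map<std::string, int>& table) {
    std::size_t worst = 0, collided = 0;
    for (std::size_t b = 0; b < table.bucket_count(); ++b) {
        std::size_t n = table.bucket_size(b);
        if (n > 1) collided += n - 1;   // entries sharing a bucket with another key
        if (n > worst) worst = n;
    }
    std::cout << "buckets: "            << table.bucket_count()
              << "  load factor: "      << table.load_factor()
              << "  colliding entries: " << collided
              << "  longest chain: "     << worst << '\n';
}
[/code]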
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  5. #5
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Hash table efficiency

    Quote Originally Posted by primem0ver View Post
    I DO understand the point of a hash table.
    It would appear you don't. Or rather, not well enough.

    If you have a true hash table, it needs to be big enough to hold all possible values plus some extra room to accommodate collisions. How much extra room is where the "art" lies in getting a good hash table with a good hash function. A perfect hash function would cause no collisions and need no extra room. Obviously this is unlikely to happen, so you find a balance between sacrificing extra memory and sacrificing performance by having more collisions to deal with.

    So the mere fact of "not knowing ahead of time the size of the table" fundamentally violates the basic premise of a hash table.
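
    To make the "true hash table" point concrete, here is a minimal sketch of a fixed-capacity, open-addressing table with linear probing (the name FixedHashTable is invented for illustration): the capacity is chosen once, up front, and when the slack reserved for collisions runs out, insertion simply fails rather than growing.

[code]
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Fixed-capacity, open-addressing table: capacity is fixed at construction,
// lookups would probe the same linear sequence. Illustrative only.
template <typename K, typename V>
class FixedHashTable {
public:
    explicit FixedHashTable(std::size_t capacity) : slots_(capacity) {}

    bool insert(const K& key, const V& value) {
        std::size_t start = std::hash<K>{}(key) % slots_.size();
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            auto& slot = slots_[(start + i) % slots_.size()];   // linear probe
            if (!slot) { slot.emplace(key, value); return true; }
        }
        return false;   // table full - no room left to absorb collisions
    }

private:
    std::vector<std::optional<std::pair<K, V>>> slots_;
};
[/code]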


    Now... if you are talking about a hash table combined with a secondary container to store collisions (typically a linked list, but it could be something else too), that's a somewhat different matter.
    In that case, resizing and rebuilding can technically be done, but it will produce a noticeable hiccup in your software, over which neither the user nor the consumer of your class has any control. That could be anything from mildly annoying to catastrophic in cases where the consumers expect O(x)-type behaviour for an insert. And the problem only gets worse the bigger the table gets.
    So technically possible: yes.
    Good idea: no.


    Hash tables don't make sense unless they're at least big enough to warrant the calculation of a hash key and collision resolution, and big enough that they're going to be faster than something comparable like a binary search tree (or, more appropriately, a red/black tree). That means a few thousand elements at least. Again, your better bet is making it big enough that it never deteriorates so badly that a bigger hash table would pay off, and hoping that you don't waste too much memory in the process of making it that big.
    Hash tables are awesome when used right. More often than not, if you don't really know what you're doing, you're better off with a std::map.

  6. #6
    Join Date
    Jul 2005
    Location
    Netherlands
    Posts
    2,042

    Re: Hash table efficiency

    Quote Originally Posted by OReubens View Post
    So the mere fact of "not knowing ahead of time the size of the table" fundamentally violates the basic premise of a hash table.
    When you say 'hashtable size' do you mean 'number of buckets'?
    I think OP meant not knowing the maximum number of data elements to be stored ahead of time.
    Quote Originally Posted by OReubens View Post
    Now... if you are talking about a hash table combined with a secondary container to store collisions (typically a linked list, but it could be something else too), that's a somewhat different matter.
    In that case, resizing and rebuilding can technically be done, but it will produce a noticeable hiccup in your software, over which neither the user nor the consumer of your class has any control. That could be anything from mildly annoying to catastrophic in cases where the consumers expect O(x)-type behaviour for an insert. And the problem only gets worse the bigger the table gets.
    So technically possible: yes.
    Good idea: no.
    What makes you say this is a bad idea?
    IIRC that's pretty much what the VC++ implementation of std::unordered_map uses. Last time I measured, lookup was about 5x faster using std::unordered_map than using std::map, with somewhere in the order of 10,000 to 100,000 elements.
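
    For anyone who wants to reproduce that kind of comparison, a rough timing sketch along these lines works (the exact ratio depends heavily on key type, hardware and optimisation level; integer keys are used here just to keep it short):

[code]
#include <chrono>
#include <iostream>
#include <map>
#include <unordered_map>

// Rough lookup comparison between std::map and std::unordered_map.
int main() {
    constexpr int N = 100000;
    std::map<int, int> ordered;
    std::unordered_map<int, int> hashed;
    for (int i = 0; i < N; ++i) { ordered[i] = i; hashed[i] = i; }

    auto time_lookups = [N](auto& container) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < N; ++i) sum += container.at(i);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << sum << ' ';   // keep the loop from being optimised away
        return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    };

    std::cout << "map: "           << time_lookups(ordered) << " us, "
              << "unordered_map: " << time_lookups(hashed)  << " us\n";
}
[/code]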
    Cheers, D Drmmr

    Please put [code][/code] tags around your code to preserve indentation and make it more readable.

    As long as man ascribes to himself what is merely a possibility, he will not work for the attainment of it. - P. D. Ouspensky

  7. #7
    Join Date
    Jul 2013
    Posts
    576

    Re: Hash table efficiency

    Quote Originally Posted by primem0ver View Post
    So let us assume that I DO have a good hash table AND a good hash function
    All hash-based data structures I've come across (in Java and C++) base relocation on the load factor. Relocation takes place when the load factor reaches a certain level (specified by default or by you).

    But the load factor and the number of collisions are intimately linked. In fact, if you assume a perfect (uniformly distributed) hash function they carry the same information: if you know the current load factor you can calculate the expected number of collisions, and the other way around. So in the world of perfect hash functions the load factor is as good as the number of collisions.
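
    As a rough, standard estimate (assuming a uniform hash over m buckets): after inserting n keys, the expected number of inserts that landed on an already-occupied bucket is about n - m*(1 - (1 - 1/m)^n), which is approximately n - m*(1 - e^(-n/m)) for large tables. Both quantities are effectively functions of the load factor n/m alone, which is why monitoring the load factor is as good as counting collisions under this idealised assumption.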

    Some language standards (C++11) allow you to force a rehash, specifying the new size. This means that you can implement your own growth model: just monitor the load factor as items are inserted and, when it reaches a certain level, rehash to the new size you want. But although you could do this, I would first and foremost rely on the default growth model. It's usually well tuned, based on long experience.

    Best of course would be if you knew the expected total number of items beforehand. Then you could set a big enough initial size and there would be no rehashing at all.
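
    For example, with std::unordered_map the C++11 bucket interface lets you pick any of these policies (a sketch; the thresholds and sizes are not tuned values, just placeholders):

[code]
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> table;

    // Default policy: the table rehashes automatically whenever
    // load_factor() would exceed max_load_factor() (1.0 by default).
    table.max_load_factor(0.7f);

    // Manual policy: monitor the load factor and force a rehash to a
    // size of your own choosing (here simply doubling the buckets).
    auto insert_monitored = [&](const std::string& key, int value) {
        table[key] = value;
        if (table.load_factor() > 0.6f)
            table.rehash(table.bucket_count() * 2);
    };
    insert_monitored("example", 1);

    // If the total element count is known up front, reserve() sizes the
    // bucket array once and avoids rehashing during insertion entirely.
    std::unordered_map<std::string, int> sized;
    sized.reserve(100000);
}
[/code]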
    Last edited by razzle; November 6th, 2014 at 05:17 AM.

  8. #8
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: Hash table efficiency

    Quote Originally Posted by D_Drmmr View Post
    When you say 'hashtable size' do you mean 'number of buckets'?
    I think OP meant not knowing the maximum number of data elements to be stored ahead of time.
    In a 'true' hash table, the size is the maximum number of elements it can possibly hold.
    If you're talking about buckets, then you have a hybrid solution where the hash table is a first subdivision and the "buckets" are secondary structures that can hold an 'unlimited' number of values.


    Quote Originally Posted by D_Drmmr View Post
    What makes you say this is a bad idea?
    If an insert can cause a reallocation and rebuild of the whole table, then there can be no guarantee about the amount of time an insert takes. A single insert could take long enough to make your program appear to be hung.

    That's bad.
    In some use cases, it's catastrophic.


    100K items is peanuts. :-)

    The problem with 'map' is that programmers get lazy and just put their key (whatever it is) and value in,
    then get disappointed that their 300-character-long strings don't work as fast as they expect.
    If you say map is slow, are you sure that's because map is slow, or because you took an easy way out in handling your keys?
    Sometimes the easy approach is "good enough", sometimes it isn't. Don't blame map for using it badly.

  9. #9
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Hash table efficiency

    I've implemented a couple of dynamic hash tables for different applications in the past. They needed O(x) guarantees, so they were implemented by hand and there was no rehashing. In other words, the worst-case scenario is everything falling into the same bucket, degrading the structure into a linked list.

    However, in both cases the data input values and ranges were from a fixed domain. This allowed me to model different hash functions and table sizes before writing any code (I like to use Excel).

    gg
