Click to See Complete Forum and Search --> : I need a good string hashing algorithm


DeepT
October 21st, 2008, 05:04 PM
Preferably fast, preferably a very even distribution. I did a forum search on this topic expecting to find pages and pages of people asking about this, but surprisingly, there were only 13 hits and no answers (14 hits now).

Ill be doing this in c++, but as long as I can understand the general algorithm the language doesn't matter that much.

ProgramThis
October 22nd, 2008, 08:04 AM
Can you be a little more specific about your needs? Are you just trying to find an algorithm that turns strings into hash keys?

You definitely need to be careful when doing this as palindromes will hash to the same value if you use a straight forward approach. As well as words like god and dog.

There are two things that it must be: accurate (not creating duplicate keys) , and repeatable. Also to note, if you are using integers to hash these to you will not be able to capture all of the words/strings possible. The number of strings/words exceeds the max value of an int.

You cal always do something like:
string[0]^2
string[1]^3
string[2]^5
...

And so on using prime numbers to raise each letter to (or multiply by, either way). You then would use % the size of the array/hash set.

You could also use bit shifting:
for(char c: str)
n += (c<<13)^(int)c;

Or something similar. Remember to use prime numbers that are different from the size of your array/hash set.

TheCPUWizard
October 22nd, 2008, 08:12 AM
<snip>There are two things that it must be: accurate (not creating duplicate keys) and repeatable. Also to note, if you are using integers to hash these to you will not be able to capture all of the words/strings possible. The number of strings/words exceeds the max value of an int.</snip>

ANY hash will create duplicate keys. What is important is the Distribution of these duplications.

As you pointed out the "available space" of a hash (typically 16,32,64 bits) is much smaller than the source domain (e.g. all possible strings).

Even the issue of repeatability is a bit questionable, depending on how the hash is utilized. There are a whole set of situations where the hash for a given input value may legitimately vary, provided that the same value in the same context, doe not vary.

DeepT
October 22nd, 2008, 08:28 AM
Bit shifting would fail after 32, and using prime numbers only, you couldn't have very long strings.

So I suppose from the first example "n" should be a double so it can handle huge numbers, however it seems to me that wouldn't buy me that much because, take for example "@" which I think is 65 and lets say I am up to prime number 101. 65^101 is pretty **** big. I do not know how large doubles can be.

You also loose resolution of numbers so the mod's will overlap.
IE: 65^101 mod 100 = 25. 65 ^ 103 mod 100 = 25. Even 65^11 mod 100 = 25.

Once you get beyond int range, it all comes up as 25, so that method will not work.

I do not see how raising to powers will work for hashing strings due to the rapid rise in values.

Any other suggestions?

TheCPUWizard
October 22nd, 2008, 08:50 AM
You do NOT want to use a double or any floating point.

What is wrong with any of the standard type hashing algorithms???

I have gone through this thread a few times, and dont see anything really "Special". :confused: :confused:

The purpose (as I am sure you already know, but other readers may not) of a hash is to provide a quick means of determining if:

1) String are different - if the hash is different the string must be different
2) Strings may be the same - if the hash is identical, you MUST compare the strings.

Therefore a good hash must an even spread of hash values - for the typical data.

It is usually a good idea if the overhead of calculating the hash is relatively small. However this is not always the case if it impeeds the even distribution. Consider a pre-build "dictionary" of a few million items. When the program is run, a single string will be input and found in the dictionary.

Since only a single string is processed at runtime, the performance of the hash calculation is insignificant compared to the effectiveness of the distribution. [The time to build the hashs for the dicionary is impaterial since it is not built at runtime]

DeepT
October 22nd, 2008, 09:36 AM
What is wrong with any of the standard type hashing algorithms???

First post, I am looking for a "standard type of hashing algorithm". That is the point of the post.

Consider a pre-build "dictionary" of a few million items.

What prebuild dictionary?
1. This is platform independent C++. I do not know of a "Dictionary" class that is platform independent.

2. This data is dynamic on both ends.

Here is the gist of the actual problem.

I am given a giant list of process names from a server. I have no idea how long this list will be until I get it.

I then need to scan the process on the local machine, which is another long list (in most cases).

I then need to report a True or False for each process name in the first list that is found in the 2nd list.

That is a lot of string compares.

How do I speed that up? Make hash codes for the first list, make hash codes for the 2nd list. Fortunately once I have the first list, it rarely changes. The 2nd list can change somewhat frequently depending on what the user is doing.

Anyway, to get this off the ground, I need a good hashing function, which is why I posted here. I do not know of any. If I was using C#, Id use a dictionary class and be done with it. It would be a cake walk. The problem is that C# doesn't run on Mac's or Linux or whatever else.

MikeAThon
October 22nd, 2008, 10:34 AM
I have no idea of whether this is implementation has a good distribution, but here is a widely circulated hashing function for use with the MFC classes of CString and CMap:
// implementation of hash function

template< > UINT AFXAPI HashKey( CString& key )
{
LPCTSTR pp = (LPCTSTR)(key);
UINT uiRet = 0;
while (*pp)
{
uiRet = (uiRet<<5) + uiRet + *pp++;
}

return uiRet;
}

DeepT
October 22nd, 2008, 11:06 AM
That is pretty simple. Does it have a name? Like one I could search for and get some statistics on?

MikeAThon
October 22nd, 2008, 01:06 PM
No, I don't have a name for it. I believe its source is another MFC class, called CMapStringToOb. The hash function for that class is identical to the code above:
inline UINT CMapStringToOb::HashKey(LPCTSTR key) const
{
UINT nHash = 0;
if (key == NULL)
{
AfxThrowInvalidArgException();
}

while (*key)
nHash = (nHash<<5) + nHash + *key++;
return nHash;
}
The last component of the key (i.e., the "+ *key" part) provides a decent guarantee that strings that are close but differ only in the last character will generate different keys, and this seems to be a desirable property.

If you need statistics on the distribution, maybe you could derive them for yourself? It would be interesting to see your results.

TheCPUWizard
October 22nd, 2008, 04:08 PM
The last component of the key (i.e., the "+ *key" part) provides a decent guarantee that strings that are close but differ only in the last character will generate different keys, and this seems to be a desirable property.

If you need statistics on the distribution, maybe you could derive them for yourself? It would be interesting to see your results.

1) "Suffix differentiation" is an important charastic in most cases

2) As I stated, the distribution will vary heavily based on the source data. What provides a even spread for "common first names in america during the 20th century" may provide a completely different spread than "names of major sidies in europe".

As an (very trivial) example. Consider the text "The Quick Brown Fox Jumped Over " (just those 6 unique words). We now want to have a hash based on this data. A 3 bit hash is sufficient; one for each of the unique words we care about (6 values), and a single value for every other word or character pattern in the universe.

This would be the ideal has for that data set, but is clearly useless on a more complex (or even slightly different) data set.

DeepT
October 23rd, 2008, 08:00 AM
I went with the one that TheCPUWizard posted in this thread (http://www.codeguru.com/forum/showthread.php?p=1774288#post1774288) in the C# forums. It is definitely faster as it works on 4 byte chunks rather then 1 byte chunks. It is also more complex and from that I would guess provides a better distribution.

It is the first one he posted, I do not know how to link specific posts within threads.

ProgramThis
October 23rd, 2008, 08:40 AM
ANY hash will create duplicate keys. What is important is the Distribution of these duplications.

As you pointed out the "available space" of a hash (typically 16,32,64 bits) is much smaller than the source domain (e.g. all possible strings).

Even the issue of repeatability is a bit questionable, depending on how the hash is utilized. There are a whole set of situations where the hash for a given input value may legitimately vary, provided that the same value in the same context, doe not vary.
Not true at all. Your "original" key, being the actual value calculated, should be different. AFTER you perform your %sizeofhash THEN you get duplicate keys. If your generator creates like keys before the % then your algorithm is not effective.

As for repeatability, you CANNOT hash without this... If the word "dog" hashes to 13 one day and 74 the next how do you ever expect to retrieve this data? Your algorithm HAS to be repeatable, no exceptions, or you cannot retrieve your data.

TheCPUWizard
October 23rd, 2008, 08:49 AM
Not true at all. Your "original" key, being the actual value calculated, should be different. AFTER you perform your %sizeofhash THEN you get duplicate keys. If your generator creates like keys before the % then your algorithm is not effective.

I do not follow your statement. If your hash is an int (32 bit), then there are 2^32 possible hash values. For a 10 character (ascii) string there are 2^80 possible values. It is impossible for them to generate unique values.


As for repeatability, you CANNOT hash without this... If the word "dog" hashes to 13 one day and 74 the next how do you ever expect to retrieve this data? Your algorithm HAS to be repeatable, no exceptions, or you cannot retrieve your data.

Sure you can....provided your program does not run for more than one day. :D

Seriously, consider an application which is going to process documents, one at a time. It can be useful to reduce each word to a single integer value (aka a Hash Key). Provided this information is used only when processing the given document, there is NO requirement that a given word produces the same keyt when processing a different document. Nor is there any requirement that if you process a document again next week, that the hash values are the same.

In fact, you can reduce the hash size to Log2(NumberOfWords) [actually Log2(NumberOfUniqueWords) but that would take more calculation overhead]. Now you have a document processing specific hashcode that is 100% unique for a given document (of reasonable size).

DeepT
October 23rd, 2008, 11:35 AM
I hate to interrupt, but here is my C++ implementation of CPU wizards hash algorithm. There is a definite bug in it somewhere. Probably a very simple stupid error, but I can't find it.
void HashElement::Hash()
{
int Length = TheString.GetLength();

// Abort on empty strings. They do not hash.
if ( 0 == Length )
{
return;
}


int BuffLen = 0;
int Mod = Length % 4;

// Is the string divisible by 4?
// If it is, but that would leave no room for null termination so we must
// increase the buffer by 4.
// If not, we must pad it.
BuffLen = Length + 4 - Mod;

char *Buffer = (char *)malloc(BuffLen);

// Make sure we got our buffer
if ( NULL == Buffer )
{
return;
}

ZeroMemory(Buffer, BuffLen);

memcpy(Buffer, TheString.c_str(), Length);

int num = 0x15051505;
int num2 = num;
int* numPtr = (int*) Buffer;
for (int i = Length; i > 0; i -= 4)
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
if (i <= 2)
{
break;
}
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
numPtr += 2;
}



free(Buffer);
HashCode = (num + (num2 * 0x5d588b65));
PrimeHashIndex = HashCode % HASH_TABLE_SIZE;
return;

}The bug is, hashing the same string gives different hash values.

I read in a text file and chop it up by lines. I feed text lines into this. Now on the very first run, on the first line it will give some hash value. Stop the program, rerun it and it gives a different hash value.

In this case, I am using the standard readme.txt file DevStudio produces. The first like is 72 "=" signs. That what the line says in the string, thats what it says in Buffer (via the debugger) yet each time, the first line has a different hash code. I look at the hex value of numPtr[0] and numPtr[1] and it is 0x3d3d3d3d or "====".

I can't see any reason this is coming up with different hash values each time I run. It must be some address issue with pointers being different values since that is the only thing that is different each time, yet I can't see how that would mess up the algorithm.

DeepT
October 23rd, 2008, 01:12 PM
Figured it out this is the problem linefor (int i = Length; i > 0; i -= 4)


The i -= 4 needs to be i -= 8.

The Length is in characters. CPUWizard had mentioned C# used all Unicode. So the two numPtr[] things were covering 2 Unicode characters each, hence the character count was 4. In normal single-byte ANSI, the two ints were covering 8 characters instead of 4. I also think Ill need to change the "if (i <= 2)" to "if (i <= 4)".

TheCPUWizard
October 23rd, 2008, 01:17 PM
You are on the right track :wave: :thumb:

(You also need to make sure the length is a multipleof 8 rather than 4..... ;) )