-
October 1st, 2015, 02:40 AM
#16
Re: How to design unordered_map's hash function for this case and another one?
Originally Posted by tiliavirga
I've pointed at the relevant section in the C++ 11 standard
no you didn't
firstly, the excerpt about the probability of collisions is in a note, and notes are non-normative ( they are never requirements )
second, even if we take that note literally, given a pair of uniformly picked long longs, the probability that their low 32 bits are equal is EXACTLY 1 / (1 + the max of size_t).
>> But you are suggesting that std::hash<long long> could produce equal hash values with probability 1 for certain numbers in the long long range
they're called collisions, and they're unavoidable whenever you hash a type wider than std::size_t; the standard certainly doesn't forbid them ...
third, not only gcc but also (AFAIR) clang and boost::hash have implementations like this.
the moral of the story is monarch's comment in post #3: people should not write hashes themselves unless they really know what they're doing ( there's a reason even hash_combine was not included in the final standard ... ).
-
October 1st, 2015, 03:07 AM
#17
Re: How to design unordered_map's hash function for this case and another one?
Originally Posted by laserlight
Agreed: unless it can be shown to be a mistake in the standard or is an acceptable language/library extension, where a compiler differs from the standard, it is a bug in the compiler.
I would agree too, were the compiler to differ from the standard. But truncating a long long to size_t *is* a standard-compliant hash. Take any two 64-bit numbers. The probability of a collision when truncated to 32 bits is:
2^32 / 2^64 == 1 / 2^32 ≈ 1.0 / numeric_limits<size_t>::max()
Originally Posted by laserlight
I tried monarch_dodra's test program from post #13 with g++ 4.8.4 with -std=c++11 and received as the program's output:
4294967297
8589934593
53021371269121
53021371269127
The issue (I think) is that you did not pass the -m32 option to compile for 32 bits?
In any case, something is off, as 53021371269127 does not fit in a 32-bit size_t (nor do any of the other values, for that matter).
Originally Posted by laserlight
It seems that there was a bug here in the standard library implementation, but it has since been fixed, though monarch_dodra is presumably still using an older version. This can hardly be included in the "experimental" disclaimer of g++ that is meant to excuse implementations of components following drafts of C++11 before publication: a bad implementation is a bad implementation.
I'm using 4.8.4. There is not a bug (again, the hash is standard compliant), and nothing has been fixed. The relevant source code for 4.8.4 is here:
https://github.com/gcc-mirror/gcc/blob/gcc-4_8-branch/libstdc++-v3/include/bits/functional_hash.h
Code:
// Explicit specializations for integer types.
#define _Cxx_hashtable_define_trivial_hash(_Tp)          \
  template<>                                             \
    struct hash<_Tp> : public __hash_base<size_t, _Tp>   \
    {                                                    \
      size_t                                             \
      operator()(_Tp __val) const noexcept               \
      { return static_cast<size_t>(__val); }             \
    };
There have been changes to the file since, but none affecting integral hashing.
Last edited by monarch_dodra; October 1st, 2015 at 03:09 AM.
-
October 1st, 2015, 03:40 AM
#18
Re: How to design unordered_map's hash function for this case and another one?
Originally Posted by monarch_dodra
The issue (I think) is that you did not pass the -m32 option to compile for 32 bits?
In any case, something is off, as 53021371269127 does not fit in a 32-bit size_t (nor do any of the other values, for that matter).
Ah yes, my bad.
-
October 1st, 2015, 07:30 AM
#19
Re: How to design unordered_map's hash function for this case and another one?
Originally Posted by superbonzo
no you didn't
Yes I did. 17.6.3.4 describes the requirements on hash function objects. No one, including you, has pointed at anything more relevant.
firstly, the excerpt about the probability of collisions is in a note and notes are non normative ( they are never requirements )
Both normative and informative sections are part of the standard. The normative sections alone would constitute a pretty useless standard; both are required for compliance.
In this case, for example, every hash function object could always return 0 and still be normatively compliant. The informative note prevents this and prescribes what kind of value should be returned for compliance.
second, even if we take that note literally, given a a pair of uniformly picked long long's, the probability that their first 32 bits to be equal is EXACTLY 1 / (1+max of size_t).
But you have not taken the note literally. It states that h(t1) and h(t2) should be equal only with a very small probability. Then h(t1) and h(t1+k) should be too, but with the suggested truncation of long long they won't be. On the contrary, for a specific k they will be equal with the highest possible probability, namely for certain.
So just truncating away the upper part of a long long won't produce a compliant hash code.
the moral of the story is monarch's comment in post #3, people should not write hash by themselves unless they really know what they're doing
That also applies to people who merely think they really know what they're doing.
No one knows every corner of C++ (except maybe Stroustrup). This means you must trust your compiler to implement C++ according to the principle of least surprise. In this case VS 2015 looks like the winner, and if you find attention to detail in one corner it's likely to be present in the next.
Last edited by tiliavirga; October 1st, 2015 at 08:09 AM.
-
October 1st, 2015, 08:02 AM
#20
Re: How to design unordered_map's hash function for this case and another one?
... please refer to the ISO documentation for what 'normative' means
... please refer to any elementary probability book for what 'probability' means
just joking, I know there's no hope, as always ...
-
October 2nd, 2015, 03:20 AM
#21
Re: How to design unordered_map's hash function for this case and another one?
Originally Posted by superbonzo
... please refer to the ISO documentation for what 'normative' means
... please refer to any elementary probability book for what 'probability' means
just joking, I know there's no hope, as always ...
Well, it's never fun to watch someone cave in like this. You insinuate that the reason is that I don't know what I'm talking about, but it may also be that the impossible has happened: it may have dawned on you that you might actually be wrong! And indeed you are. To really rub it in, I'm going to summarise my points and then leave you in peace to mourn your lost besserwisser innocence.
It's true that C++ compiler compliance means that the normative part of the C++ standard is met. But if the informative part is disregarded, the normative part may be misunderstood, and the compiler still won't be compliant. Having every hash function object always return 0 would be fine normatively, but there's an informative note in effect stating that doing so would defeat the purpose of hashing and thus (now that you know better) would not be compliant with the C++ standard.
You may have missed that the note in question,
"For two different values t1 and t2, the probability that h(t1) and h(t2) compare equal should be very small, approaching 1.0 / numeric_limits<size_t>::max()."
is in fact making two propositions,
P1: h(t1) and h(t2) should very seldom be equal.
P2: P1 should hold for every different t1 and t2.
In our case t1 and t2 are long longs and h is std::hash<long long> returning size_t.
Now, implementing h by just truncating the long long parameter to a size_t return value won't violate P1. That's why this implementation will work fine in practice with, say, std::unordered_set et cetera. But P2 is another story. There is a known set of different t1 and t2 for which P1 doesn't hold and for which h(t1) and h(t2) are in fact always equal (namely when t1 = t2 + a specific constant). This violates P2, making the discussed implementation of h non-compliant with the C++ standard.
Does it matter? Well, at least Microsoft thought so and acted accordingly when VC++ 2015 was implemented.
Last edited by tiliavirga; October 2nd, 2015 at 04:05 AM.
-
October 2nd, 2015, 04:02 AM
#22
Re: How to design unordered_map's hash function for this case and another one?
Originally Posted by tiliavirga
But P2 is another story. There is a known set of different t1 and t2 for which P1 doesn't hold and for which h(t1) and h(t2) are in fact always equal (namely when t1 = t2 + a specific constant). This violates P2 and makes the discussed implementation of h non-compliant with the C++ standard.
oh dear, ignoring your imaginative interpretation of the wording,
given ANY surjective hash function h:[0,M]->[0,N] with M > N, there ALWAYS exist "known" x != y such that h(x) = h(y): just take any other y in the preimage of some h(x). So, if your funny requirement were true, NO compliant hash function for long long would ever exist on 32-bit systems.
So, along with elementary probability you may also take a look at elementary set theory as well ...
regarding what 'normative' means, I could quote ISO wording but sadly I suppose it wouldn't help much.
regarding why an implementation may choose more "random"-looking hashes: there can be many reasons, including security ( simple hashes can be attacked ), implementation internals, or quality-of-implementation concerns ... nothing to do with compliance anyway
Last edited by superbonzo; October 2nd, 2015 at 04:11 AM.
Reason: typos
-
October 4th, 2015, 06:03 AM
#23
Re: How to design unordered_map's hash function for this case and another one?
Originally Posted by superbonzo
oh dear
Reiterating definitions, suggesting literature to read up on, and holding lectures in besserwisser-ish make you look more like a schmuck than the genius you probably have in mind. And they are not arguments. Still, I'm sure we agree on the basics: a big peg doesn't fit a small hole, so hashing involves collisions, and hashing is fully deterministic, so everything is known beforehand.
What we don't agree on is the general significance of notes in the standard, and the significance of one specific note concerning hash function objects. Without going into details again: in my view, the standard requires a compliant compiler to implement "good hashing" in the consensus sense. Always returning 0, which you claim would be fine, won't do.
We don't seem to be getting anywhere so I quit here, thank you.