
> LLVM's libc++

Great find!

So the library solutions are still often suboptimal, and it's even easier to hide bad decisions in the C++ sources, so whoever takes the "just use the default library" approach should be aware of that when performance matters.

Yes, even a simple multiplicative constant can significantly improve the hash if it does nothing with the input by default! The libraries should definitely be fixed, and adding the multiplication step is a really simple and fast change for a great benefit.

As an inspiration, Kernighan and Ritchie in their book about C used the simple constant 31, and that simple hash still holds up quite well against much more complex and more recent solutions. K&R also didn't use the (I'd say misleadingly named) "open addressing" for their hash table. Their solution is amazingly minimalistic and, in that context, amazingly good for chained hash tables. I wouldn't be surprised if just changing

    return __c;
to

    return __c * 31;
in the functions discovered would result in a great improvement. The good side of such a constant is that it gives fast and small code even on old architectures where "normal" multiplication is slow: even without a fast multiplier, the result can be obtained with one shift and one subtraction, since c * 31 == (c << 5) - c. On modern architectures this constant can't cause any performance degradation, while the improved hash behavior of these formerly unprocessed inputs guarantees a speedup. And there are surely use cases where more complex functions are much better, e.g. those suggested by aappleby:

https://news.ycombinator.com/item?id=17330787
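To make the idea concrete, here is a K&R-style multiplicative string hash, sketched in Rust (the constant 31 and the h*31 + byte structure come from the K&R book; the function name and types are mine):

```rust
// K&R-style multiplicative hash: h = h*31 + byte for each input byte.
// wrapping_mul/wrapping_add make the intended modular overflow explicit.
fn kr_hash(s: &str) -> u64 {
    s.bytes()
        .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(b as u64))
}

fn main() {
    // Even short, similar keys end up well separated.
    println!("{}", kr_hash("hello"));
    println!("{}", kr_hash("hellp")); // one byte differs, hash differs a lot
}
```

On hardware without a fast multiplier a compiler can lower the multiply to the one-shift-one-subtract form mentioned above.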

Back to "open addressing": if you are rolling your own hash table and don't expect too many hash tables to be in memory at once, it's often much faster to use chains (as in the K&R C book) than to store everything only in the table itself (which is misleadingly often called "open addressing", even though "closed hashing" is a better term) and jump through the table on collisions. Maintaining a list per entry is typically much faster once the table gets fuller, provided the allocation routines are fast. A comparison of hash functions:

https://www.strchr.com/hash_functions
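A minimal sketch of such a separate-chaining table in Rust (the type name, the fixed bucket count, and the i32 values are all illustrative, not from any of the linked code):

```rust
// Minimal separate-chaining hash table: each bucket is a Vec of pairs,
// so a collision just appends to that bucket's chain.
struct ChainedMap {
    buckets: Vec<Vec<(String, i32)>>,
}

impl ChainedMap {
    fn new(n_buckets: usize) -> Self {
        ChainedMap { buckets: vec![Vec::new(); n_buckets] }
    }

    fn bucket_of(&self, key: &str) -> usize {
        // K&R-style h*31 + byte, reduced to a bucket index.
        let h = key.bytes()
            .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(b as u64));
        (h % self.buckets.len() as u64) as usize
    }

    fn insert(&mut self, key: String, val: i32) {
        let i = self.bucket_of(&key);
        // Update in place if the key is already chained here.
        for pair in self.buckets[i].iter_mut() {
            if pair.0 == key {
                pair.1 = val;
                return;
            }
        }
        self.buckets[i].push((key, val));
    }

    fn get(&self, key: &str) -> Option<i32> {
        let i = self.bucket_of(key);
        self.buckets[i].iter().find(|p| p.0 == key).map(|p| p.1)
    }
}

fn main() {
    let mut m = ChainedMap::new(8); // tiny table: collisions are guaranteed
    for (i, k) in ["one", "two", "three", "four", "five"].iter().enumerate() {
        m.insert(k.to_string(), i as i32);
    }
    assert_eq!(m.get("three"), Some(2));
    assert_eq!(m.get("missing"), None);
}
```

Unlike closed hashing, a deleted entry is just removed from its chain; there are no tombstones, and the table degrades gracefully as it fills.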

By the way, MurmurHash2/3 and CityHash are definitely very good functions; the problem is when they aren't used in the library, as seems to be the case in libcxx. And where simpler code is needed, even a simple * 31 is much, much better than nothing!

And note, there are apparently even problems with these good functions, security-wise: language implementations and services accepting uncontrolled inputs also have to care about the security aspects of their hash functions:

https://131002.net/siphash/

"Jointly with Martin Boßlet, we demonstrated weaknesses in MurmurHash (used in Ruby, Java, etc.), CityHash (used in Google), and in Python's hash. Some of the technologies affected have switched to SipHash."

"SipHash was designed as a mitigation to hash-flooding DoS attacks. It is now used in the hash tables implementation of Python, Ruby, Perl 5, etc."

"SipHash was designed by Jean-Philippe Aumasson and Daniel J. Bernstein."



> "SipHash was designed as a mitigation to hash-flooding DoS attacks. It is now used in the hash tables implementation of Python, Ruby, Perl 5, etc."

It's also the default hasher in Rust.

Rust's hashing is interesting. Types that want to be hashable implement the Hash trait. The Hash trait only requires that a type know how to feed its fields to a Hasher as a sequence of primitives; it doesn't require that the type actually compute a hash itself. It's the Hasher that computes the hash. This is nice because it makes Hash very easy to implement; so easy, in fact, that it can be done automatically using a derive macro. The downside is that a type can't implement a custom hash that takes particular advantage of its own structure, and so get a particularly good tradeoff of distribution against performance. The only place to make that tradeoff is in the choice of Hasher, where it has to be made generically across all types.
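The split described above looks like this (the Point and Span types are made up for illustration; Hash, Hasher, and DefaultHasher are the real std items):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The derive just feeds each field to the Hasher, in order.
#[derive(Hash)]
struct Point {
    x: i32,
    y: i32,
}

// A manual impl is equally mechanical: push primitives into the
// Hasher's stream; the Hasher, not the type, computes the hash.
struct Span {
    start: u32,
    len: u32,
}

impl Hash for Span {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.start.hash(state);
        self.len.hash(state);
    }
}

fn main() {
    let mut h = DefaultHasher::new(); // SipHash under the hood
    Point { x: 1, y: 2 }.hash(&mut h);
    println!("{:x}", h.finish());
}
```

Note that neither impl mixes any bits itself; swap in a different Hasher and both types are hashed by the new algorithm with no changes to their code.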

That said, you can choose the hasher used by individual HashMaps, so if you have a HashMap whose keys you know are integers, you can use a Hasher that just does a Fibonacci hash.
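A sketch of that, assuming u64 keys (the FibHasher name and the fallback byte loop are mine; the constant 0x9E3779B97F4A7C15 is 2^64 divided by the golden ratio):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Fibonacci hashing: multiply by 2^64/phi so the high bits mix well.
#[derive(Default)]
struct FibHasher(u64);

impl Hasher for FibHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    // u64 keys arrive here via <u64 as Hash>::hash.
    fn write_u64(&mut self, n: u64) {
        self.0 = n.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    }

    // Fallback for any other field types fed to this hasher.
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.wrapping_mul(31).wrapping_add(b as u64);
        }
    }
}

fn main() {
    // Swap the default SipHash out for the integer-oriented hasher.
    let mut m: HashMap<u64, &str, BuildHasherDefault<FibHasher>> =
        HashMap::default();
    m.insert(42, "answer");
    assert_eq!(m.get(&42), Some(&"answer"));
}
```

Of course, this gives up SipHash's hash-flooding resistance, so it only makes sense when the keys aren't attacker-controlled.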



