Added a hash combiner for performance improvement. #250
Hi there! Thank you for creating your first pull-request on the Graaf library :)
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #250   +/-   ##
=======================================
  Coverage   99.59%   99.59%
=======================================
  Files          56       56
  Lines        2738     2748   +10
  Branches      135      135
=======================================
+ Hits         2727     2737   +10
  Misses         11       11
template <class T>
inline void hash_combine(std::size_t& seed, const T& v) {
  std::hash<T> hasher;
  seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
If I understand correctly, this is hash_combine of boost v1.33.
A reference to the source as a comment would be good here. The magic number and the shifts otherwise seem a little arbitrary.
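A minimal sketch of what such an attribution comment could look like, together with one possible way to use the combiner for pair keys. The names `pair_hash` and `edge_weights` are hypothetical and are not taken from the Graaf source; the wording of the comment is only a suggestion.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Mixing step modelled on boost::hash_combine from older Boost releases
// (Boost ContainerHash, distributed under the Boost Software License 1.0).
// 0x9e3779b9 is the 32-bit golden-ratio constant; the shifts fold the
// previous seed back in, so the order of the combined values matters.
template <class T>
inline void hash_combine(std::size_t& seed, const T& v) {
  std::hash<T> hasher;
  seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical hasher for (source, target) vertex-id pairs.
struct pair_hash {
  std::size_t operator()(const std::pair<std::size_t, std::size_t>& p) const {
    std::size_t seed = 0;
    hash_combine(seed, p.first);
    hash_combine(seed, p.second);
    return seed;
  }
};

// Hypothetical edge-weight table keyed by vertex-id pairs.
using edge_weights =
    std::unordered_map<std::pair<std::size_t, std::size_t>, int, pair_hash>;
```

With this in place, `edge_weights w; w[{1, 2}] = 7;` stores a weight under the ordered pair (1, 2), and (2, 1) is treated as a separate key.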
Ah, it appears that is the source of the hash combine function. I was only familiar with the function based on seeing it used in multiple contexts in production. I hadn't tracked down the source.
It appears that, like many magic constants, 0x9e3779b9 was chosen somewhat arbitrarily: it is the fractional part of the golden ratio multiplied by 2^32. Perhaps on modern 64-bit hardware, using the fractional part multiplied by 2^64 instead would improve performance.
Sources:
http://burtleburtle.net/bob/hash/doobs.html
https://softwareengineering.stackexchange.com/questions/63595/tea-algorithm-constant-0x9e3779b9-said-to-be-derived-from-golden-ratio-but-the/63599#63599
Perhaps it's better in that case to directly use the Boost hash combiner?
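As a quick check of where the constant comes from, the 32-bit value can be reproduced from the golden ratio in a few lines. This is only an illustrative sketch: the function name `golden_ratio_constant_32` is made up here, and the 64-bit value is the commonly quoted analogue, written out as a literal rather than derived, because a `double` does not carry enough precision for 64 bits.

```cpp
#include <cmath>
#include <cstdint>

// floor(2^32 * (phi - 1)), where phi - 1 = (sqrt(5) - 1) / 2 ~ 0.618.
// A double has a 53-bit significand, which is enough for a 32-bit result.
inline std::uint64_t golden_ratio_constant_32() {
  const double frac = (std::sqrt(5.0) - 1.0) / 2.0;
  return static_cast<std::uint64_t>(std::floor(std::ldexp(frac, 32)));
}

// Commonly used 64-bit analogue, stated as a literal (not computed here).
constexpr std::uint64_t golden_ratio_64 = 0x9e3779b97f4a7c15ULL;

// Its upper 32 bits are the 32-bit constant used in hash_combine.
static_assert((golden_ratio_64 >> 32) == 0x9e3779b9ULL,
              "upper half of the 64-bit constant matches the 32-bit one");
```

Whether swapping in the 64-bit constant changes performance or distribution for a given workload would still have to be measured.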
The bit shifts are probably chosen arbitrarily too.
To be clear, hash functions are outside my field of expertise; my goal here is to improve performance on very large graphs.
I'll speak with my team at Atomic Industries about making a novel hash combiner function, so we don't accidentally end up plagiarizing another project. We have a few mathematicians on the team with a background in combinatorics.
+1, I agree that referencing Boost would be helpful here. Looking at the Boost Software License, under which the ContainerHash library is licensed, this may also be a requirement. No need to push changes to the PR, I can add this before merging :)
The alternative would be to add Boost as a dependency to Graaf, but I would like to avoid this as I would like Graaf to be a standalone Boost::Graph alternative.
Thanks for opening this PR @laurence-atomic! And wow, that is some graph with 50M edges 😅 Glad you were able to find such a great performance win (and thanks @joweich for providing insights into our benchmarks)!
If I understand correctly, Boost switched to a slightly different implementation as of Boost 1.81 which improves the distribution over the output domain. Source code here. I am wondering whether we could gain anything by using this newer implementation, but I would leave this as a potential follow up.
LGTM! I just added the Boost license and am more than happy to include this optimization in the next release of Graaf 🚀
Co-authored-by: Bob Luppes <bobluppes@gmail.com>
Glad to have this performance improvement merged, thanks again @laurence-atomic 🎉 I just wanted to take the opportunity to ask if you have any feedback regarding the library as we would love to hear from you and your team! In particular, we are currently trying to streamline the interface of the |
In our testing at Atomic Industries, we discovered that the performance of Graaf suffers on graphs with more than 50 million edges.
We use a hash combiner at Atomic to improve performance when hashing tuples. Implementing this in Graaf resulted in substantial speed-ups in multiple test cases. Unfortunately, I cannot share internal performance data, but it should be possible to replicate the speed-up on large (>50 million edge) public datasets.
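To illustrate why the way two hashes are combined can matter for tuple keys, here is a small sketch comparing a plain XOR of the two element hashes with the shift-and-add combiner. The XOR baseline is an assumption chosen for illustration; the description above does not say what the previous hashing looked like, and `xor_hash` and `combined_hash` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>

// Assumed baseline: XOR of the two element hashes. XOR is symmetric, so
// (a, b) and (b, a) always collide, and every (a, a) hashes to 0.
inline std::size_t xor_hash(std::size_t a, std::size_t b) {
  const std::hash<std::size_t> h;
  return h(a) ^ h(b);
}

// Shift-and-add combiner: each step folds the running seed back in,
// so the result depends on the order of the elements.
inline std::size_t combined_hash(std::size_t a, std::size_t b) {
  const std::hash<std::size_t> h;
  std::size_t seed = 0;
  seed ^= h(a) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  seed ^= h(b) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  return seed;
}
```

In a hash map keyed by vertex-id pairs, systematic collisions like those of the XOR variant lengthen bucket chains, which is one plausible way lookups could degrade as the number of edges grows. Actual speed-ups would need to be confirmed with benchmarks on a large dataset.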