Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a hash combiner for performance improvement. #250

Merged

Conversation

laurence-atomic
Copy link
Contributor

In our testing at Atomic Industries, we discovered that the performance of graaf suffers on graphs with >50 million edges.

We use a hash-combiner at Atomic to improve performance on hashing tuples. Implementing this in graaf resulted in substantial speed-ups in multiple test cases. Unfortunately I can not share internal performance data, but it should be possible to replicate the speed-up on large (>50 million edge) public datasets.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there! Thank you for creating your first pull-request on the Graaf library :)

Copy link

codecov bot commented Nov 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.59%. Comparing base (4dbde3d) to head (301664a).
Report is 8 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #250   +/-   ##
=======================================
  Coverage   99.59%   99.59%           
=======================================
  Files          56       56           
  Lines        2738     2748   +10     
  Branches      135      135           
=======================================
+ Hits         2727     2737   +10     
  Misses         11       11           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@joweich joweich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmarks do indeed indicate a significant leap in performance:

bm_results

template <class T>
inline void hash_combine(std::size_t& seed, const T& v) {
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this is hash_combine of boost v1.33.
A reference to the source as a comment would be good here. The magic number and the shifts otherwise seem a little arbitrary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it appears that is the source of the hash combine function. I was only familiar with the function based on seeing it used in multiple contexts in production. I hadn't tracked down the source.

It appears that like many magic constants, the value of 0x9e3779b9 was chosen arbitrarily as the fractional component of the golden ratio multiplied by 2^32. Perhaps on modern hardware, using the golden ratio multiplied by 2^64 would show improved performance.

Sources:

http://burtleburtle.net/bob/hash/doobs.html
https://softwareengineering.stackexchange.com/questions/63595/tea-algorithm-constant-0x9e3779b9-said-to-be-derived-from-golden-ratio-but-the/63599#63599

Perhaps it's better in that case to directly use the Boost hash combiner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bit shifts are probably chosen arbitrarily too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, hash functions are outside of my field of expertise, my goal here is to improve performance on very large graphs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll speak with my team at Atomic Industries about making a novel hash combiner function, so we don't accidentally end up plagiarizing another project. We have a few mathematicians on the team with a background in combinatorics.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I agree that referencing Boost would be helpful here. Looking at the Boost Software License under which the ContainerHash library is licensed this may also be a requirement. No need to push changes to the PR, I can add this before merging :)

The alternative would be to add Boost as a dependency to Graaf, but I would like to avoid this as I would like Graaf to be a standalone Boost::Graph alternative.

Copy link
Owner

@bobluppes bobluppes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this PR @laurence-atomic! And wow, that is some graph with 50M edges 😅 Glad you were able to find such a great performance win (and thanks @joweich for providing insights into our benchmarks)!

If I understand correctly, Boost switched to a slightly different implementation as of Boost 1.81 which improves the distribution over the output domain. Source code here. I am wondering whether we could gain anything by using this newer implementation, but I would leave this as a potential follow up.

LGTM! I just added the Boost license and am more than happy to include this optimization in the next release of Graaf 🚀

include/graaflib/types.h Outdated Show resolved Hide resolved
Co-authored-by: Bob Luppes <bobluppes@gmail.com>
@bobluppes bobluppes added the performance Benchmarks or performance improvements label Nov 26, 2024
@bobluppes bobluppes merged commit ec9e0b0 into bobluppes:main Nov 26, 2024
8 checks passed
@bobluppes
Copy link
Owner

Glad to have this performance improvement merged, thanks again @laurence-atomic 🎉

I just wanted to take the opportunity to ask if you have any feedback regarding the library as we would love to hear from you and your team! In particular, we are currently trying to streamline the interface of the graph class and corresponding algorithms. So if there is anything in particular that does or doesn't work well for you then you are more than welcome to hop in our discord and we can discuss 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Benchmarks or performance improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants