-
Notifications
You must be signed in to change notification settings - Fork 447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use better maps to store visitor state #4459
Conversation
For the record, I also tried https://github.com/Tessil/robin-map – it was slower than phmap for visitors. And it is not a drop-in replacement, so code changed are required. |
8c17523
to
e1b2302
Compare
Just two notes:
|
Is there an ETA for when Protobuf 22.x upgrade is likely to occur? Wondering whether it makes sense for this change to wait so that it can avoid adding and later removing the Abseil files? Edit: Rereading the PR comment again ("Essentially it's a hash map extracted from Abseil plus some additional stuff on top of that"), perhaps waiting for Abseil as part of a Protobuf upgrade is insufficient. |
It might be as simple as bumping the version here: https://github.com/p4lang/p4c/blob/main/cmake/Protobuf.cmake#L35 Main reason I have not done that is because, well, we have a lot of boost code flying around and adding another dependency to a utility framework before cleaning the other one up seemed excessive. I am tracking this in an issue (#3898), the hard part is replacing boost::format with std::format. |
Another thing to do is to break down our boost dependencies into the independent modules which are now available. For example: That would simplify things instead of pulling in |
Note that it started as a library from there. But since that time it had some changes here and there and also there are some notable differences. Certainly it would worth checking and either keeping it or replacing with abseil maps.
Nothing wrt multiple dispatch here :) We are reducing malloc / GC traffic here and speeding up basic operations (insert / find). |
ir/visitor.cpp
Outdated
visited_t visited; | ||
|
||
public: | ||
Tracker() : visited(16) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a very brief comment explaining the number 16 🙂 E.g., "pre-allocating 16 slots yields a performance improvement and is sufficient for most cases".
I want to avoid a "huh?" response when people see this code later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah! Likely we will change with inlined pre-allocated storage in the future, but one step at a time.
All right. I went ahead and checked with abseil maps. Looks like they added some improvements there since phmap was forked. I would definitely say we'd need to upgrade ASAP:
I have some other PRs that replace maps on hot codepaths, so I'd prefer us to choose map before submitting it. |
If abseil is a dependency, then you could use |
Doesn't this change speed up traversal for Transform and Inspector, too? I'd assume it would help with interpreters like P4Testgen.
Well one thing we could try is to bump the Protobuf version and see what happens. If nothing breaks we could add a PR for it, then rebase this one top and link against the absl map. In parallel, we can try to clean up the other dependencies. |
Yes, it speeds-up the traverse a lot. But not the multiple-dispatch itself.
Let me give it a try and see what will happen :) |
Iirc the problem is the difference in format specifiers. Boost uses |
Yes. I do not recall how well |
I'd be interested in seeing why hvec_map is so much slower than this, given it is using essentially the same algorithms. You could try giving it an initial capacity of 16 as well, and maybe a different hash function (what is the default has function for pointers -- just identity? That would seem to be pretty good.) |
The benchmarking above already do this (pre-allocation of capacity plus better hash function, we do compare apples-to-apples). I tried |
6ae6c0d
to
b0f076a
Compare
Rebased on top of abseil. Here are the updated results (the differences with tables above is the non-unity build here):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but please please give the other reviewers a chance to provide feedback before merging.
(The speedups sound great!)
Absolutely! |
56dd486
to
4a152b3
Compare
4a152b3
to
e910044
Compare
One follow-up: it would be good to include a small section in the developer docs (e.g., docs/README.md -- maybe a "performance considerations" section?) as to why some collections are now Abseil. You could also consider adding guidance on when code might benefit from Abseil collections and how to switch. Feel free to do this as a separate PR or part of #4473. |
Good idea. I think I'd also mention |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The speedup does look very interesting, thank you.
We could probably get even more by replacing standard maps sprinkled around the code base, but that would be significantly more work (and we would also need a better ordered_map
/ordered_set
).
See #4473 just in case :) Some care should be taken here as these maps have a bi different semantics about iterator invalidation and pointers stability, so we cannot just automatically replaces ones with others. However, it is definitely beneficial in few hot code paths |
ir/CMakeLists.txt
Outdated
@@ -64,6 +64,7 @@ set (BASE_IR_DEF_FILES | |||
set (IR_DEF_FILES ${IR_DEF_FILES} ${BASE_IR_DEF_FILES} PARENT_SCOPE) | |||
|
|||
add_library (ir STATIC ${IR_SRCS}) | |||
target_link_libraries(ir absl::flat_hash_map) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how we want to manage linking... Should each dependent target link absl::flat_hash_map, should ir
export it? Also unclear to me what the best approach is to keep binary size small.
I am guess we should make PUBLIC
explicit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good question. Abseil is quite fine grained. So, all small pieces are exposed as small library.
What I can see from the present cmakefiles:
- Order dependencies (
add_dependency
) are used instead of proper use dependencies (target_link_libraries
). Therefore interface dependencies (include paths, etc.) are not properly exposed. In some cases this is alleviated via explicit adding of include paths - Some dependencies are clearly missed. E.g. backends / midend definitely pull pieces from frontend. However, this is only addressed during the link time, not during the build time. Likely this required explicit order dependencies mentioned above to fix highly-parallel builds.
I'd not bother about "binary size" here as these are libraries, so all unused code would be removed by linker. I am seeing few possibilities:
- Make explicit global abseil dependencies. Via e.g.
P4_LIB_DEPS
. Still, it is not used consistently and usually only for final binaries, not for libraries - Go for fine grained approach
I'd probably try to use the second one:
- If abseil is exposed via headers, then do
PUBLIC
dependency - If abseil is only used in cpp as implementation, fine – it's
PRIVATE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll revise this PR wrt dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add_dependency
was used because historically all the libs were built independently and linked together by each back end. However, to ensure proper build order in parallel builds, add_dependency
was needed.
The fine-grained approach works for me. #4474 implements some of this as I ran into some issues downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add_dependency
was used because historically all the libs were built independently and linked together by each back end. However, to ensure proper build order in parallel builds,add_dependency
was needed.
Yes. But this is not how the dependencies should be done. Things are pulling headers, so this is a build-time dependency, not a link-time... #4473 also has some... we'd need to unify all somehow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With binary size I meant the final compiler binaries, which can be huge. I believe this is primarily caused by all the templating and ir-generated header, but also by the chaotic linking, which also seems to affect linking time. It can be excruciatingly slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But this is not how the dependencies should be done. Things are pulling headers, so this is a build-time dependency, not a link-time... #4473 also has some... we'd need to unify all somehow
Yeah I agree with you, the compiler infrastructure has been going through various phases of structuring with each back end doing their own thing.
Co-authored-by: Glen Gibb <glen.gibb@alumni.stanford.edu>
e910044
to
16f06a3
Compare
16f06a3
to
def2fed
Compare
This PR imports modern hash map implementation from https://github.com/greg7mdp/parallel-hashmap Essentially it's a hash map extracted from Abseil plus some additional stuff on top of that. Note that while it is a drop-in replacement of
std::unordered_map
it has different guarantees wrt iterator invalidation as it uses open-addressing scheme.Advantages of this map is:
On our downstream codebase just this switch alone yields ~2.1x compile time reduction on some large P4C apps.
The PR contains several commits to measure effects:
gtestp4c-phmap
below)gtestp4c-phmap-16
below)std::hash
toUtils::Hash
(gtestp4c-phmap-16-hash
below)To compare with
hvec_map
I also benchmarked wrt it (gtestp4c-hmap
). The benchmarking results are:gtestp4c-baseline --gtest_filter=P4CParserUnroll.switch_20160512
gtestp4c-phmap --gtest_filter=P4CParserUnroll.switch_20160512
gtestp4c-phmap-16 --gtest_filter=P4CParserUnroll.switch_20160512
gtestp4c-phmap-16-hash --gtest_filter=P4CParserUnroll.switch_20160512
gtestp4c-hmap --gtest_filter=P4CParserUnroll.switch_20160512
To summarize: it yields 25% improvements against the baseline
std::unordered_map
and 13% improvements towardshvec_map
. As I already said, for some of our downstream apps the difference is more than two-fold, the compilation time reduced from 5m 14s down to 2m 29s (hvec_map
was not that attractive there sadly). I also saw some marginal reduction of peak RSS from 15.3 GiB down to 14.87 GiB.To give more benchmarking results, here are the whole gtest times (here the proportion of
Visitor
code is much less):gtestp4c-baseline
gtestp4c-hmap
gtestp4c-phmap-16-hash