Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use better maps to store visitor state #4459

Merged
merged 9 commits into from
Feb 27, 2024
Merged

Conversation

asl
Copy link
Contributor

@asl asl commented Feb 23, 2024

This PR imports modern hash map implementation from https://github.com/greg7mdp/parallel-hashmap Essentially it's a hash map extracted from Abseil plus some additional stuff on top of that. Note that while it is a drop-in replacement of std::unordered_map it has different guarantees wrt iterator invalidation as it uses open-addressing scheme.

Advantages of this map is:

  • Much faster lookups
  • Much less malloc traffic
  • Better memory usage

On our downstream codebase just this switch alone yields ~2.1x compile time reduction on some large P4C apps.

The PR contains several commits to measure effects:

  • Use of map itself (gtestp4c-phmap below)
  • Preallocation of 16 slots (we know that in majority of cases this would be enough) (gtestp4c-phmap-16 below)
  • Switch from std::hash to Utils::Hash (gtestp4c-phmap-16-hash below)

To compare with hvec_map I also benchmarked wrt it (gtestp4c-hmap). The benchmarking results are:

Command Mean [s] Min [s] Max [s] Relative
gtestp4c-baseline --gtest_filter=P4CParserUnroll.switch_20160512 8.375 ± 0.177 8.118 8.648 1.25 ± 0.04
gtestp4c-phmap --gtest_filter=P4CParserUnroll.switch_20160512 6.886 ± 0.173 6.563 7.061 1.03 ± 0.03
gtestp4c-phmap-16 --gtest_filter=P4CParserUnroll.switch_20160512 6.722 ± 0.118 6.539 6.900 1.00 ± 0.03
gtestp4c-phmap-16-hash --gtest_filter=P4CParserUnroll.switch_20160512 6.705 ± 0.147 6.546 7.038 1.00
gtestp4c-hmap --gtest_filter=P4CParserUnroll.switch_20160512 7.547 ± 0.099 7.356 7.689 1.13 ± 0.03

To summarize: it yields 25% improvements against the baseline std::unordered_map and 13% improvements towards hvec_map. As I already said, for some of our downstream apps the difference is more than two-fold, the compilation time reduced from 5m 14s down to 2m 29s (hvec_map was not that attractive there sadly). I also saw some marginal reduction of peak RSS from 15.3 GiB down to 14.87 GiB.

To give more benchmarking results, here are the whole gtest times (here the proportion of Visitor code is much less):

Command Mean [s] Min [s] Max [s] Relative
gtestp4c-baseline 14.398 ± 0.310 13.920 14.923 1.17 ± 0.03
gtestp4c-hmap 13.083 ± 0.280 12.683 13.496 1.06 ± 0.03
gtestp4c-phmap-16-hash 12.338 ± 0.256 12.003 12.877 1.00

@asl asl requested review from fruffy, grg, vlstill and ChrisDodd February 23, 2024 06:21
@asl
Copy link
Contributor Author

asl commented Feb 23, 2024

For the record, I also tried https://github.com/Tessil/robin-map – it was slower than phmap for visitors. And it is not a drop-in replacement, so code changed are required.

@asl asl force-pushed the visitor-improve-maps branch from 8c17523 to e1b2302 Compare February 23, 2024 06:35
@fruffy
Copy link
Collaborator

fruffy commented Feb 23, 2024

Just two notes:

  • We may soon get Abseil "for free" because the compiler's Protobuf dependency will pull it in with Protobuf 22.x: https://protobuf.dev/news/2022-08-03/#abseil-dep. We simply haven't pulled the trigger on an upgrade.

  • P4Testgen heavily depends on the visitors' multiple dispatch so it should be affected by these changes. I will try to add a benchmark. We could also write up independent visitor microbenchmarks.

@fruffy fruffy added the core Topics concerning the core segments of the compiler (frontend, midend, parser) label Feb 23, 2024
@grg
Copy link
Contributor

grg commented Feb 23, 2024

Is there an ETA for when Protobuf 22.x upgrade is likely to occur? Wondering whether it makes sense for this change to wait so that it can avoid adding and later removing the Abseil files?

Edit: Rereading the PR comment again ("Essentially it's a hash map extracted from Abseil plus some additional stuff on top of that"), perhaps waiting for Abseil as part of a Protobuf upgrade is insufficient.

@fruffy
Copy link
Collaborator

fruffy commented Feb 23, 2024

Is there an ETA for when Protobuf 22.x upgrade is likely to occur? Wondering whether it makes sense for this change to wait so that it can avoid adding and later removing the Abseil files?

Edit: Rereading the PR comment again ("Essentially it's a hash map extracted from Abseil plus some additional stuff on top of that"), perhaps waiting for Abseil as part of a Protobuf upgrade is insufficient.

It might be as simple as bumping the version here: https://github.com/p4lang/p4c/blob/main/cmake/Protobuf.cmake#L35

Main reason I have not done that is because, well, we have a lot of boost code flying around and adding another dependency to a utility framework before cleaning the other one up seemed excessive.

I am tracking this in an issue (#3898), the hard part is replacing boost::format with std::format.

@fruffy
Copy link
Collaborator

fruffy commented Feb 23, 2024

Another thing to do is to break down our boost dependencies into the independent modules which are now available. For example:
https://github.com/boostorg/format
https://github.com/boostorg/multiprecision

That would simplify things instead of pulling in libboost-all-dev always.

@asl
Copy link
Contributor Author

asl commented Feb 23, 2024

Note that it started as a library from there. But since that time it had some changes here and there and also there are some notable differences. Certainly it would worth checking and either keeping it or replacing with abseil maps.

P4Testgen heavily depends on the visitors' multiple dispatch so it should be affected by these changes. I will try to add a benchmark. We could also write up independent visitor microbenchmarks.

Nothing wrt multiple dispatch here :) We are reducing malloc / GC traffic here and speeding up basic operations (insert / find).

ir/visitor.cpp Outdated
visited_t visited;

public:
Tracker() : visited(16) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a very brief comment explaining the number 16 🙂 E.g., "pre-allocating 16 slots yields a performance improvement and is sufficient for most cases".

I want to avoid a "huh?" response when people see this code later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah! Likely we will change with inlined pre-allocated storage in the future, but one step at a time.

@asl
Copy link
Contributor Author

asl commented Feb 23, 2024

All right. I went ahead and checked with abseil maps. Looks like they added some improvements there since phmap was forked. I would definitely say we'd need to upgrade ASAP:

Command Mean [s] Min [s] Max [s] Relative
gtestp4c-baseline --gtest_filter=P4CParserUnroll.switch_20160512 8.381 ± 0.187 8.155 8.652 1.41 ± 0.05
gtestp4c-phmap-16-hash --gtest_filter=P4CParserUnroll.switch_20160512 6.611 ± 0.107 6.490 6.840 1.11 ± 0.04
gtestp4c-abseil --gtest_filter=P4CParserUnroll.switch_20160512 5.935 ± 0.160 5.712 6.278 1.00
test/gtestp4c-hmap --gtest_filter=P4CParserUnroll.switch_20160512 7.541 ± 0.195 7.233 7.907 1.27 ± 0.05

I have some other PRs that replace maps on hot codepaths, so I'd prefer us to choose map before submitting it.

@asl
Copy link
Contributor Author

asl commented Feb 23, 2024

I am tracking this in an issue (#3898), the hard part is replacing boost::format with std::format.

If abseil is a dependency, then you could use absl::format as well :)

@fruffy
Copy link
Collaborator

fruffy commented Feb 23, 2024

Nothing wrt multiple dispatch here :) We are reducing malloc / GC traffic here and speeding up basic operations (insert / find).

Doesn't this change speed up traversal for Transform and Inspector, too? I'd assume it would help with interpreters like P4Testgen.

All right. I went ahead and checked with abseil maps. Looks like they added some improvements there since phmap was forked. I would definitely say we'd need to upgrade ASAP:

Well one thing we could try is to bump the Protobuf version and see what happens. If nothing breaks we could add a PR for it, then rebase this one top and link against the absl map. In parallel, we can try to clean up the other dependencies.

@asl
Copy link
Contributor Author

asl commented Feb 23, 2024

Doesn't this change speed up traversal for Transform and Inspector, too? I'd assume it would help with interpreters like P4Testgen.

Yes, it speeds-up the traverse a lot. But not the multiple-dispatch itself.

Well one thing we could try is to bump the Protobuf version and see what happens. If nothing breaks we could add a PR for it, then rebase this one top and link against the absl map. In parallel, we can try to clean up the other dependencies.

Let me give it a try and see what will happen :)

@fruffy
Copy link
Collaborator

fruffy commented Feb 23, 2024

If abseil is a dependency, then you could use absl::format as well :)

Iirc the problem is the difference in format specifiers. Boost uses %1%, %2%,... which may not be supported in the other frameworks. The migration is finicky.

@asl
Copy link
Contributor Author

asl commented Feb 23, 2024

Iirc the problem is the difference in format specifiers. Boost uses %1%, %2%,... which may not be supported in the other frameworks. The migration is finicky.

Yes. I do not recall how well std::format is supported across the compilers off-hand. Migration of specifiers is a pain, I confirm (I once migrated large codebase to https://github.com/fmtlib/fmt)

@ChrisDodd
Copy link
Contributor

I'd be interested in seeing why hvec_map is so much slower than this, given it is using essentially the same algorithms. You could try giving it an initial capacity of 16 as well, and maybe a different hash function (what is the default has function for pointers -- just identity? That would seem to be pretty good.)

@asl
Copy link
Contributor Author

asl commented Feb 26, 2024

You could try giving it an initial capacity of 16 as well, and maybe a different hash function (what is the default has function for pointers -- just identity? That would seem to be pretty good.)

The benchmarking above already do this (pre-allocation of capacity plus better hash function, we do compare apples-to-apples). I tried hvec_map in other cases as well (PRs pending) and the overall outcome is the same: hvec_map might be faster than unordered_map, but always slower than these "swiss hash maps" (from abseil or separate phmap).

@asl
Copy link
Contributor Author

asl commented Feb 27, 2024

Rebased on top of abseil. Here are the updated results (the differences with tables above is the non-unity build here):

Command Mean [s] Min [s] Max [s] Relative
gtestp4c-baseline --gtest_filter=P4CParserUnroll.switch_20160512 8.467 ± 0.164 8.150 8.682 1.38 ± 0.06
gtestp4c --gtest_filter=P4CParserUnroll.switch_20160512 6.134 ± 0.214 5.861 6.446 1.00

@asl asl requested a review from grg February 27, 2024 00:20
@asl asl mentioned this pull request Feb 27, 2024
Copy link
Contributor

@grg grg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please please give the other reviewers a chance to provide feedback before merging.

(The speedups sound great!)

ir/visitor.cpp Show resolved Hide resolved
ir/visitor.cpp Outdated Show resolved Hide resolved
@asl
Copy link
Contributor Author

asl commented Feb 27, 2024

LGTM, but please please give the other reviewers a chance to provide feedback before merging.

(The speedups sound great!)

Absolutely!

@asl asl force-pushed the visitor-improve-maps branch from 56dd486 to 4a152b3 Compare February 27, 2024 01:55
@asl asl added the run-validation Use this tag to trigger a Validation CI run. label Feb 27, 2024
@asl asl force-pushed the visitor-improve-maps branch from 4a152b3 to e910044 Compare February 27, 2024 01:55
@grg
Copy link
Contributor

grg commented Feb 27, 2024

One follow-up: it would be good to include a small section in the developer docs (e.g., docs/README.md -- maybe a "performance considerations" section?) as to why some collections are now Abseil. You could also consider adding guidance on when code might benefit from Abseil collections and how to switch.

Feel free to do this as a separate PR or part of #4473.

@asl
Copy link
Contributor Author

asl commented Feb 27, 2024

One follow-up: it would be good to include a small section in the developer docs (e.g., docs/README.md -- maybe a "performance considerations" section?) as to why some collections are now Abseil. You could also consider adding guidance on when code might benefit from Abseil collections and how to switch.

Good idea. I think I'd also mention InlinedVector

Copy link
Contributor

@vlstill vlstill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The speedup does look very interesting, thank you.

We could probably get even more by replacing standard maps sprinkled around the code base, but that would be significantly more work (and we would also need a better ordered_map/ordered_set).

ir/visitor.cpp Show resolved Hide resolved
ir/visitor.cpp Show resolved Hide resolved
@asl
Copy link
Contributor Author

asl commented Feb 27, 2024

We could probably get even more by replacing standard maps sprinkled around the code base, but that would be significantly more work (and we would also need a better ordered_map/ordered_set

See #4473 just in case :) Some care should be taken here as these maps have a bi different semantics about iterator invalidation and pointers stability, so we cannot just automatically replaces ones with others. However, it is definitely beneficial in few hot code paths

@@ -64,6 +64,7 @@ set (BASE_IR_DEF_FILES
set (IR_DEF_FILES ${IR_DEF_FILES} ${BASE_IR_DEF_FILES} PARENT_SCOPE)

add_library (ir STATIC ${IR_SRCS})
target_link_libraries(ir absl::flat_hash_map)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how we want to manage linking... Should each dependent target link absl::flat_hash_map, should ir export it? Also unclear to me what the best approach is to keep binary size small.
I am guess we should make PUBLIC explicit here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good question. Abseil is quite fine grained. So, all small pieces are exposed as small library.

What I can see from the present cmakefiles:

  • Order dependencies (add_dependency) are used instead of proper use dependencies (target_link_libraries). Therefore interface dependencies (include paths, etc.) are not properly exposed. In some cases this is alleviated via explicit adding of include paths
  • Some dependencies are clearly missed. E.g. backends / midend definitely pull pieces from frontend. However, this is only addressed during the link time, not during the build time. Likely this required explicit order dependencies mentioned above to fix highly-parallel builds.

I'd not bother about "binary size" here as these are libraries, so all unused code would be removed by linker. I am seeing few possibilities:

  • Make explicit global abseil dependencies. Via e.g. P4_LIB_DEPS. Still, it is not used consistently and usually only for final binaries, not for libraries
  • Go for fine grained approach

I'd probably try to use the second one:

  • If abseil is exposed via headers, then do PUBLIC dependency
  • If abseil is only used in cpp as implementation, fine – it's PRIVATE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revise this PR wrt dependencies

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_dependency was used because historically all the libs were built independently and linked together by each back end. However, to ensure proper build order in parallel builds, add_dependency was needed.

The fine-grained approach works for me. #4474 implements some of this as I ran into some issues downstream.

Copy link
Contributor Author

@asl asl Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_dependency was used because historically all the libs were built independently and linked together by each back end. However, to ensure proper build order in parallel builds, add_dependency was needed.

Yes. But this is not how the dependencies should be done. Things are pulling headers, so this is a build-time dependency, not a link-time... #4473 also has some... we'd need to unify all somehow

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With binary size I meant the final compiler binaries, which can be huge. I believe this is primarily caused by all the templating and ir-generated header, but also by the chaotic linking, which also seems to affect linking time. It can be excruciatingly slow.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But this is not how the dependencies should be done. Things are pulling headers, so this is a build-time dependency, not a link-time... #4473 also has some... we'd need to unify all somehow

Yeah I agree with you, the compiler infrastructure has been going through various phases of structuring with each back end doing their own thing.

@asl asl force-pushed the visitor-improve-maps branch from e910044 to 16f06a3 Compare February 27, 2024 16:33
@asl asl requested review from vlstill and fruffy February 27, 2024 16:33
@asl asl force-pushed the visitor-improve-maps branch from 16f06a3 to def2fed Compare February 27, 2024 16:34
@asl asl enabled auto-merge February 27, 2024 16:55
@asl asl added this pull request to the merge queue Feb 27, 2024
@fruffy fruffy removed this pull request from the merge queue due to a manual request Feb 27, 2024
@fruffy
Copy link
Collaborator

fruffy commented Feb 27, 2024

@asl I removed this from the merge queue because @vlstill and @grg should still give their final OK. I believe @grg already gave the go.

@asl
Copy link
Contributor Author

asl commented Feb 27, 2024

@asl I removed this from the merge queue because @vlstill and @grg should still give their final OK.

Ok, no problem

@fruffy fruffy added this pull request to the merge queue Feb 27, 2024
Merged via the queue into p4lang:main with commit bfd65e1 Feb 27, 2024
16 checks passed
@asl asl deleted the visitor-improve-maps branch February 27, 2024 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Topics concerning the core segments of the compiler (frontend, midend, parser) run-validation Use this tag to trigger a Validation CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants