Reduce p4c compile time #4674

Open
asl opened this issue May 23, 2024 · 15 comments
Labels
  • compiler-performance: Topics on improving the performance of the compiler core.
  • core: Topics concerning the core segments of the compiler (frontend, midend, parser)
  • enhancement: This topic discusses an improvement to existing compiler code.

Comments

@asl
Contributor

asl commented May 23, 2024

Currently p4c compile times are quite large compared to other compilers / projects, given the codebase size. Likely this question / issue has already been raised before (otherwise, why would there be unity builds here?), but I was unable to find the corresponding issue.

I tried to check whether there are any low-hanging fruits here. It looks like there are not, but a few cases are still interesting. Attached is the time trace report as generated by clang. It can be visualized via the Chrome tracing backend or, better, via Speedscope or Perfetto. The file in question is helpers.cpp from the gtest testsuite, though similar patterns are everywhere. This file has the longest frontend parse time across the codebase (13 seconds for me).

Interesting observations:

  • Overall, the C++ frontend is 98%; nothing heavy from the midend / backend perspective
  • 5% of compile time is big_int.h, which is included via big_int_utils.h, which in turn is included via stringify.h almost everywhere (as it is subsequently included via error_handler.h used by exceptions.h, the latter providing BUG_CHECK, etc). Maybe this could be refactored a bit better. Though big ints are everywhere, so it will still be here.
  • 30% is ir-generated.h (actually more due to later template instantiations)
  • 6% is frontends/common/parseInput.h, with the majority of time spent in the v1.0 converters. This likely could be refactored

There are some expensive template instantiations here

Template sets that took longest to instantiate:

  1277 ms: IR::Vector<$>::visit_children (64 times, avg 19 ms)
   866 ms: IR::Vector<$>::insert<$> (32 times, avg 27 ms)
   848 ms: std::vector<$>::insert<$> (33 times, avg 25 ms)
   749 ms: std::__dispatch_copy_or_move<$> (154 times, avg 4 ms)
   662 ms: std::__unwrap_and_dispatch<$> (215 times, avg 3 ms)
   589 ms: std::map<$> (70 times, avg 8 ms)
   549 ms: P4V1::ProgramStructure::NamedObjectInfo<$>::emplace (18 times, avg 30 ms)
   536 ms: std::map<$>::emplace<$> (54 times, avg 9 ms)
   430 ms: std::copy<$> (48 times, avg 8 ms)
   425 ms: std::__copy<$> (48 times, avg 8 ms)
   408 ms: std::__tree<$>::__emplace_unique<$> (56 times, avg 7 ms)
   400 ms: std::move<$> (72 times, avg 5 ms)
   386 ms: std::__move<$> (72 times, avg 5 ms)
   377 ms: std::pair<$> (150 times, avg 2 ms)
   371 ms: std::__tree<$> (75 times, avg 4 ms)
   294 ms: P4V1::ProgramStructure::NamedObjectInfo<$> (18 times, avg 16 ms)
   286 ms: std::vector<$>::__swap_out_circular_buffer (42 times, avg 6 ms)
   283 ms: RTTI::TypeInfo<$>::dyn_cast<$> (189 times, avg 1 ms)
   272 ms: std::allocator_traits<$> (123 times, avg 2 ms)
   251 ms: std::__tree<$>::__emplace_unique_key_args<$> (35 times, avg 7 ms)
   250 ms: std::__uninitialized_allocator_move_if_noexcept<$> (42 times, avg 5 ms)
   237 ms: std::vector<$>::vector (58 times, avg 4 ms)
   231 ms: IR::Vector<$>::erase (32 times, avg 7 ms)

Here are the stats across the whole codebase:

Templates that took longest to instantiate:

 16023 ms: boost::basic_format<char>::basic_format (455 times, avg 35 ms)
 15929 ms: Util::CompilerBug::CompilerBug<> (440 times, avg 36 ms)
 15903 ms: IR::Vector<IR::Node>::visit_children (866 times, avg 18 ms)
 15844 ms: Util::P4CExceptionBase::P4CExceptionBase<> (440 times, avg 36 ms)
 14351 ms: boost::basic_format<char>::parse (455 times, avg 31 ms)
 13624 ms: IR::Vector<IR::DpdkSelector>::visit_children (866 times, avg 15 ms)
 13551 ms: IR::Vector<IR::DpdkAction>::visit_children (866 times, avg 15 ms)
 13485 ms: IR::Vector<IR::DpdkHeaderType>::visit_children (866 times, avg 15 ms)
 13393 ms: IR::Vector<IR::Declaration_ID>::visit_children (866 times, avg 15 ms)
 13386 ms: IR::Vector<IR::DpdkExternDeclaration>::visit_children (866 times, avg 15 ms)
 13360 ms: IR::Vector<IR::Type_Var>::visit_children (866 times, avg 15 ms)
 13329 ms: IR::Vector<IR::DpdkAsmStatement>::visit_children (866 times, avg 15 ms)
 13308 ms: IR::Vector<IR::Property>::visit_children (866 times, avg 15 ms)
 13214 ms: IR::Vector<IR::DpdkTable>::visit_children (866 times, avg 15 ms)
 13187 ms: IR::Vector<IR::DpdkStructType>::visit_children (866 times, avg 15 ms)
 13063 ms: IR::Vector<IR::Entry>::visit_children (866 times, avg 15 ms)
 13061 ms: IR::Vector<IR::StatOrDecl>::visit_children (866 times, avg 15 ms)
 13013 ms: IR::Vector<IR::DpdkLearner>::visit_children (866 times, avg 15 ms)
 12911 ms: IR::Vector<IR::ParserState>::visit_children (866 times, avg 14 ms)
 12878 ms: IR::Vector<IR::Declaration>::visit_children (866 times, avg 14 ms)
 12838 ms: IR::Vector<IR::Parameter>::visit_children (866 times, avg 14 ms)
 12771 ms: IR::Vector<IR::KeyElement>::visit_children (866 times, avg 14 ms)
 12757 ms: IR::Vector<IR::ActionListElement>::visit_children (866 times, avg 14 ms)
 12731 ms: IR::Vector<IR::NamedExpression>::visit_children (866 times, avg 14 ms)
 12719 ms: IR::Vector<IR::DpdkDeclaration>::visit_children (866 times, avg 14 ms)
 12569 ms: IR::Vector<IR::Method>::visit_children (866 times, avg 14 ms)
 12548 ms: IR::Vector<IR::StructField>::visit_children (866 times, avg 14 ms)
 12354 ms: IR::Vector<IR::SerEnumMember>::visit_children (866 times, avg 14 ms)
 12329 ms: IR::Vector<IR::SelectCase>::visit_children (866 times, avg 14 ms)
 12231 ms: IR::Vector<IR::AnnotationToken>::visit_children (866 times, avg 14 ms)

Template sets that took longest to instantiate:

411170 ms: IR::Vector<$>::visit_children (27718 times, avg 14 ms)
287032 ms: IR::Vector<$>::insert<$> (13866 times, avg 20 ms)
279278 ms: std::vector<$>::insert<$> (14158 times, avg 19 ms)
219502 ms: std::__dispatch_copy_or_move<$> (64652 times, avg 3 ms)
184992 ms: std::__unwrap_and_dispatch<$> (83282 times, avg 2 ms)
127186 ms: std::move<$> (30874 times, avg 4 ms)
121353 ms: std::__move<$> (30850 times, avg 3 ms)
119899 ms: std::copy<$> (19052 times, avg 6 ms)
119576 ms: RTTI::TypeInfo<$>::dyn_cast<$> (84718 times, avg 1 ms)
117637 ms: std::__copy<$> (19047 times, avg 6 ms)
109498 ms: std::vector<$>::__swap_out_circular_buffer (17921 times, avg 6 ms)
 92251 ms: std::__uninitialized_allocator_move_if_noexcept<$> (18086 times, avg 5 ms)
 80784 ms: std::vector<$>::vector (22750 times, avg 3 ms)
 65467 ms: IR::Vector<$> (14722 times, avg 4 ms)
 64696 ms: std::vector<$>::__construct_at_end<$> (24919 times, avg 2 ms)
 64212 ms: ordered_map<$>::ordered_map (4627 times, avg 13 ms)
 62657 ms: IR::Vector<$>::erase (13853 times, avg 4 ms)
 62412 ms: std::pair<$> (31028 times, avg 2 ms)
 62006 ms: std::map<$> (11351 times, avg 5 ms)
 61258 ms: std::vector<$>::erase (13902 times, avg 4 ms)
 59134 ms: std::vector<$> (28239 times, avg 2 ms)
 53943 ms: std::__uninitialized_allocator_copy<$> (23446 times, avg 2 ms)
 52458 ms: Util::CompilerBug::CompilerBug<$> (12253 times, avg 4 ms)
 51391 ms: Util::P4CExceptionBase::P4CExceptionBase<$> (12397 times, avg 4 ms)
 49646 ms: IR::Vector<$>::parallel_visit_children (14010 times, avg 3 ms)
 48877 ms: std::__unwrap_range<$> (37549 times, avg 1 ms)
 48375 ms: std::map<$>::emplace<$> (6672 times, avg 7 ms)
 47952 ms: std::map<$>::map (8962 times, avg 5 ms)
 47945 ms: IR::IndexedVector<$>::visit_children (18185 times, avg 2 ms)
 44757 ms: safe_vector<$> (20185 times, avg 2 ms)

The file with the time profile:
helpers.cpp.json.gz

asl added the core (Topics concerning the core segments of the compiler (frontend, midend, parser)) label on May 23, 2024
@vlstill
Contributor

vlstill commented May 23, 2024

Thanks for this! I haven't looked into this for some time, but from my time at Tofino I remember some investigations (the deeper ones mostly led nowhere because of what happened to Tofino) and some conclusions.

30% is ir-generated.h (actually more due to later template instantiations)

Yes, that matches our observations. We were discussing if precompiled headers would be something that could help with this, but I don't think we got to a serious try. Might be worth investigating.

On top of that, I was concerned with the build process and linking, as that was low-hanging fruit for us. This will likely work for others too (but a lot probably use it already):

  • ninja builds the compiler significantly faster than make.
  • lld (the LLVM linker) is much faster than both GNU gold and GNU ld (with gold being faster than ld). mold was even faster, but was not stable enough for my liking at that point; this might be different now and for others.
  • If reasonable, replacing static libraries with CMake's object libraries (which are just semantic collections of object files) often speeds up compilation. The disadvantage is that it will then link all objects from the archive, not just the needed ones (probably not much of a concern for P4C though). Another disadvantage is that it is harder to work with them in CMake.

6% is frontends/common/parseInput.h, with the majority of time spent in the v1.0 converters. This likely could be refactored

Maybe we should isolate the P4-14 input support and make it optional. Not all backends need it.

Templates that took longest to instantiate:

 16023 ms: boost::basic_format<char>::basic_format (455 times, avg 35 ms)
 15929 ms: Util::CompilerBug::CompilerBug<> (440 times, avg 36 ms)
 15903 ms: IR::Vector<IR::Node>::visit_children (866 times, avg 18 ms)
[...]

Template sets that took longest to instantiate:

411170 ms: IR::Vector<$>::visit_children (27718 times, avg 14 ms)
287032 ms: IR::Vector<$>::insert<$> (13866 times, avg 20 ms)
[...]

Now, we actually have a lot of control over many of these. We can add explicit instantiation declarations for selected instances (e.g. the Vector and IndexedVector combos used in the IR) and instantiate those explicitly in a .cpp file. Similarly for CompilerBug<>. Given the numbers, this could speed up compilation a lot. A lot of the rest are standard library things, so there I don't expect we could do much better (until we can make use of modules in the far future).
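
A minimal sketch of what such explicit instantiations could look like; the file names, include paths, and the particular element types below are illustrative assumptions, not the actual p4c layout:

// ir/vector_instantiations.h -- included wherever ir-generated.h is used
// (assumed file; the include paths below are also assumptions)
#include "ir/vector.h"
#include "ir/indexed_vector.h"

// Explicit instantiation declarations: tell every translation unit that
// these specializations are instantiated elsewhere, so their members are
// not re-instantiated in each .cpp that includes the IR headers.
extern template class IR::Vector<IR::Node>;
extern template class IR::IndexedVector<IR::Declaration>;

// ir/vector_instantiations.cpp -- compiled exactly once
#include "ir/vector_instantiations.h"

// Explicit instantiation definitions: the single place where the members
// of these specializations are actually emitted.
template class IR::Vector<IR::Node>;
template class IR::IndexedVector<IR::Declaration>;

The same pattern would extend to the other Vector / IndexedVector element types that ir-generated.h uses.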

@asl
Contributor Author

asl commented May 23, 2024

Yes, that matches our observations. We were discussing if precompiled headers would be something that could help with this, but I don't think we got to a serious try. Might be worth investigating.

I think it's mostly because we have 200+ IR::Node descendants with lots of virtual methods and wide inheritance. Lots of small things that add up...

On top of that, I was concerned with the build process and linking, as that was low-hanging fruit for us. This will likely work for others too (but a lot probably use it already):

Yeah. Though this was a ninja build. And I'm using Apple's ld64, which is reasonably fast :)

A lot of the rest are standard library things, so there I don't expect we could do much better (until we can make use of modules in the far future).

Actually, they could be, as both Vector and IndexedVector use lots of STL stuff under the hood. But worth investigating more, yes. The whole compiler is just a pile of visit calls :)

@asl
Contributor Author

asl commented May 23, 2024

And this one is interesting:

 15929 ms: Util::CompilerBug::CompilerBug<> (440 times, avg 36 ms)

Actually, every BUG_CHECK leads to such an instantiation, depending on the types of the arguments. Maybe there should be some other solution here.

@vlstill
Contributor

vlstill commented May 23, 2024

A lot of the error handling code is from C++11 times; we now have more ways of handling variadics, so it might be worth refactoring it. I can even see a way where we just take the argument pack and project it either directly to a string, or to a type-erased object (more likely, as we still need to handle source info), and then, instead of doing the recursion (if we can't get rid of it) over all the type combinations, we just instantiate it based on all the lengths of the error parametrization. Then we could again apply explicit instantiations if needed.
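
A minimal sketch of the projection idea, assuming a plain string projection; the names (ErasedArg, formatError) are illustrative, and the real code would also have to carry source info alongside the text:

#include <sstream>
#include <string>
#include <vector>

// Type-erased argument: here just formatted text; the compiler version
// would likely also hold an optional source-info reference.
struct ErasedArg {
    std::string text;
};

// Cheap per-argument projection: the only template instantiated per
// argument type, and it does not recurse over the whole pack.
template <typename T>
ErasedArg eraseArg(const T &value) {
    std::ostringstream os;
    os << value;
    return ErasedArg{os.str()};
}

// Non-template formatting core; a trivial stand-in that just appends the
// erased arguments after the format string.
std::string formatErased(const std::string &fmt, const std::vector<ErasedArg> &args) {
    std::string out = fmt;
    for (const auto &a : args) out += " " + a.text;
    return out;
}

// User-facing entry point: expanding the pack only produces N calls to
// eraseArg, so the heavy lifting lives in one non-template function.
template <typename... Args>
std::string formatError(const std::string &fmt, const Args &...args) {
    std::vector<ErasedArg> erased = {eraseArg(args)...};
    return formatErased(fmt, erased);
}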

@fruffy
Collaborator

fruffy commented May 23, 2024

Thanks for the analysis. Yes, compile time has been a constant pain point. The Tofino compiler is much worse because it adds a lot more IR nodes, which blows up compile time and linking even further.

The only other open-source issue I remember is #3980, which is concerned with splitting ir-generated.h. We only need a few IR nodes most of the time but have to pull in this header every time. The problem is that the visitor infrastructure, coupled with the ir-inline.h and ir-tree-macros.h macros, seems to depend on everything being in one place. It is really difficult to untangle or change this.

Other than that, I have periodically been running IWYU to get a better sense of what is being included/used in some classes: #3767

5% of compile time is big_int.h, which is included via big_int_utils.h, which in turn is included via stringify.h almost everywhere (as it is subsequently included via error_handler.h used by exceptions.h, the latter providing BUG_CHECK, etc). Maybe this could be refactored a bit better. Though big ints are everywhere, so it will still be here.

We could move the toString to big_int_utils here? big_int is primarily (only?) used for constants. Maybe we can do something there...

A lot of the error handling code is from C++11 times; we now have more ways of handling variadics, so it might be worth refactoring it. I can even see a way where we just take the argument pack and project it either directly to a string, or to a type-erased object (more likely, as we still need to handle source info), and then, instead of doing the recursion (if we can't get rid of it) over all the type combinations, we just instantiate it based on all the lengths of the error parametrization. Then we could again apply explicit instantiations if needed.

I tried some refactoring in #3774 once. The challenge is to preserve the source location accurately, which may require some C++20 features. If you wanted to get rid of the macro setup..

@asl
Contributor Author

asl commented May 23, 2024

We could move the toString to big_int_utils here? big_int is primarily (only?) used for constants. Maybe we can do something there...

The problem is the boost header dependency. So, moving it out will not help a lot :)

@vlstill
Contributor

vlstill commented May 23, 2024

I tried some refactoring in #3774 once. The challenge is to preserve the source location accurately, which may require some C++20 features. If you wanted to get rid of the macro setup..

That would require not only C++20 but, more importantly, GCC 11... Maybe once Ubuntu 20 goes out of support we can consider updating? Other popular old systems (RHEL 8) already have to depend on a non-default GCC now.

EDIT: to get source_location, that is... There could probably be some workaround that keeps the macros but makes compilation faster.

@fruffy
Collaborator

fruffy commented May 23, 2024

The problem is the boost header dependency. So, moving it out will not help a lot :)

Not sure I follow. If we move out toString for big_int we can at least get rid of the big_int include there. We could also try to forward declare big_int in ir.h. But that may break things.

@asl
Contributor Author

asl commented May 23, 2024

Not sure I follow. If we move out toString for big_int we can at least get rid of the big_int include there. We could also try to forward declare big_int in ir.h. But that may break things.

Right. But big_int is everywhere in the compiler. So it will simply be included in another place; we'd need to parse it somewhere in any case.

@fruffy
Collaborator

fruffy commented May 23, 2024

Right. But big_int is everywhere in the compiler. So it will simply be included in another place; we'd need to parse it somewhere in any case.

I am thinking it is not that widely used. Usually only in places where we work with constants. It might be possible to factor it out.

Second, we actually used to have GMP support: #3485. But it was removed because of licensing concerns/CI waste. It could be brought back if there is really a need.

@asl
Contributor Author

asl commented May 23, 2024

I am thinking it is not that widely used. Usually only in places where we work with constants. It might be possible to factor it out.

Ok, maybe good to go then, yes.

fruffy added the enhancement (This topic discusses an improvement to existing compiler code.) and compiler-performance (Topics on improving the performance of the compiler core.) labels on May 23, 2024
@vlstill
Contributor

vlstill commented May 24, 2024

Building a debug version of a downstream compiler with hyperfine -p 'make clean' -m 5 -- make. Using GCC 11 and the LLD linker, ccache disabled, on a dedicated machine with enough CPUs and RAM, and with storage on a local SSD.

original:

Benchmark 1: make
  Time (mean ± σ):     414.572 s ±  5.070 s    [User: 9543.740 s, System: 825.331 s]
  Range (min … max):   411.245 s … 423.311 s    5 runs

With explicit instantiations of Vector and IndexedVector for the types used with them in ir-generated.h:

Benchmark 1: make
  Time (mean ± σ):     386.579 s ±  5.057 s    [User: 9083.205 s, System: 815.379 s]
  Range (min … max):   382.127 s … 393.961 s    5 runs

... so something like almost 7 % in this case. Not a huge difference, but noticeable for such an easy change, I'd say. I'll try a few more template-instantiation tricks and then open a PR with this.

@asl
Contributor Author

asl commented May 24, 2024

@vlstill One step at a time. There is no single place that dominates the compile time, but a 7% improvement is already a lot.

@vlstill
Contributor

vlstill commented May 24, 2024

For release, the story is similar. Also, I forgot to mention that this is a non-unity build. A release build this time, otherwise the same comparison:

# original
  Time (mean ± σ):     588.054 s ±  5.932 s    [User: 12029.151 s, System: 615.580 s]
  Range (min … max):   580.889 s … 593.137 s    5 runs

# Vector + IndexedVector:
  Time (mean ± σ):     533.384 s ±  8.694 s    [User: 11267.462 s, System: 597.603 s]
  Range (min … max):   523.313 s … 544.670 s    5 runs

So a little over 9 % for release.

@asl
Contributor Author

asl commented May 24, 2024

I am thinking it is not that widely used. Usually only in places where we work with constants. It might be possible to factor it out.

Well, the thing is: it is part of IR::Constant. So, essentially, it is everywhere we include IR headers. We may factor it out using pimpl or a similar approach though.
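
A minimal pimpl sketch of that idea; the ConstantValue wrapper, the file split, and the exact big_int header name are all illustrative assumptions rather than a proposal for the actual IR::Constant interface:

// constant_value.h -- includes no boost / big_int headers
#include <memory>
#include <string>

struct BigIntImpl;  // opaque; defined only in the .cpp below

class ConstantValue {
 public:
    explicit ConstantValue(long v);
    ConstantValue(const ConstantValue &other);
    ConstantValue &operator=(ConstantValue other);
    ~ConstantValue();
    std::string toString() const;

 private:
    std::unique_ptr<BigIntImpl> impl;  // the big_int lives behind this pointer
};

// constant_value.cpp -- the only translation unit that pays for big_int.h
#include "constant_value.h"
#include "lib/big_int.h"  // assumed header providing the big_int typedef

struct BigIntImpl {
    big_int value;
};

ConstantValue::ConstantValue(long v) : impl(new BigIntImpl{big_int(v)}) {}
ConstantValue::ConstantValue(const ConstantValue &other) : impl(new BigIntImpl(*other.impl)) {}
ConstantValue &ConstantValue::operator=(ConstantValue other) { impl.swap(other.impl); return *this; }
ConstantValue::~ConstantValue() = default;
std::string ConstantValue::toString() const { return impl->value.str(); }

The obvious downside is an extra allocation and indirection per constant, so whether this is a net win for the IR would need measuring.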
