-
Notifications
You must be signed in to change notification settings - Fork 12.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CMake][PGO] Build Sema.cpp to generate profdata for PGO builds #77347
base: main
Are you sure you want to change the base?
Conversation
When doing a multi-stage PGO build of clang, run the check-clang and check-llvm targets using the instrumented clang and use that profile data for building the final stage2 clang. This is what is recommended by our official documentation: https://llvm.org/docs/HowToBuildWithPGO.html#building-clang-with-pgo I benchmarked this change by compiling the SemaChecking.cpp file from clang. Using check-clang/check-llvm to generate the profile data gives a 25% speedup in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO). Prior to this change, I was only seeing ~5% speedup when comparing the stage2 and stage1 builds.
@kwk fyi |
@llvm/pr-subscribers-clang Author: Tom Stellard (tstellar) ChangesWhen doing a multi-stage PGO build of clang, run the check-clang and check-llvm targets using the instrumented clang and use that profile data for building the final stage2 clang. This is what is recommended by our official documentation: https://llvm.org/docs/HowToBuildWithPGO.html#building-clang-with-pgo I benchmarked this change by compiling the SemaChecking.cpp file from clang. Using check-clang/check-llvm to generate the profile data gives a 25% speedup in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO). Prior to this change, I was only seeing ~5% speedup when comparing the stage2 and stage1 builds. Full diff: https://github.com/llvm/llvm-project/pull/77347.diff 2 Files Affected:
diff --git a/clang/utils/perf-training/CMakeLists.txt b/clang/utils/perf-training/CMakeLists.txt
index c6d51863fb1b5c..95ff8115aa538b 100644
--- a/clang/utils/perf-training/CMakeLists.txt
+++ b/clang/utils/perf-training/CMakeLists.txt
@@ -15,7 +15,7 @@ if(LLVM_BUILD_INSTRUMENTED)
)
add_custom_target(clear-profraw
- COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py clean ${CMAKE_CURRENT_BINARY_DIR} profraw
+ COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py clean ${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_BINARY_DIR}/profiles/ profraw
COMMENT "Clearing old profraw data")
if(NOT LLVM_PROFDATA)
@@ -26,9 +26,9 @@ if(LLVM_BUILD_INSTRUMENTED)
message(STATUS "To enable merging PGO data LLVM_PROFDATA has to point to llvm-profdata")
else()
add_custom_target(generate-profdata
- COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py merge ${LLVM_PROFDATA} ${CMAKE_CURRENT_BINARY_DIR}/clang.profdata ${CMAKE_CURRENT_BINARY_DIR}
+ COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py merge ${LLVM_PROFDATA} ${CMAKE_CURRENT_BINARY_DIR}/clang.profdata ${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_BINARY_DIR}/profiles/
COMMENT "Merging profdata"
- DEPENDS generate-profraw)
+ DEPENDS generate-profraw check-clang check-llvm)
endif()
endif()
diff --git a/clang/utils/perf-training/perf-helper.py b/clang/utils/perf-training/perf-helper.py
index 99d6a3333b6ef0..bd8f74c9c2e129 100644
--- a/clang/utils/perf-training/perf-helper.py
+++ b/clang/utils/perf-training/perf-helper.py
@@ -30,26 +30,28 @@ def findFilesWithExtension(path, extension):
def clean(args):
- if len(args) != 2:
+ if len(args) < 2:
print(
- "Usage: %s clean <path> <extension>\n" % __file__
+ "Usage: %s clean <paths> <extension>\n" % __file__
+ "\tRemoves all files with extension from <path>."
)
return 1
- for filename in findFilesWithExtension(args[0], args[1]):
- os.remove(filename)
+ for path in args[1:-1]:
+ for filename in findFilesWithExtension(path, args[-1]):
+ os.remove(filename)
return 0
def merge(args):
- if len(args) != 3:
+ if len(args) < 3:
print(
- "Usage: %s merge <llvm-profdata> <output> <path>\n" % __file__
+ "Usage: %s merge <llvm-profdata> <output> <paths>\n" % __file__
+ "\tMerges all profraw files from path into output."
)
return 1
cmd = [args[0], "merge", "-o", args[1]]
- cmd.extend(findFilesWithExtension(args[2], "profraw"))
+ for i in range(2, len(args)):
+ cmd.extend(findFilesWithExtension(args[i], "profraw"))
subprocess.check_call(cmd)
return 0
|
✅ With the latest revision this PR passed the Python code formatter. |
Looping in some Apple people. I'm unsure if Apple is still using this infrastructure for generating PGO data. If so this change will impact them. When I wrote this, Apple had some additional test cases that got layered on top of the publicly available tests. It might make sense to make this configurable by having a user-definable (advanced) variable that specifies which targets to run to generate PGO data. cc: @ributzka, @Bigcheese |
I'd also prefer to make this configurable, we're using our own corpus which in my experiments both produces better results and takes less time than We should also consider updating the documentation since I don't think that I think that LLVM test-suite is actually a better fit since more representative of real code. In our training corpus, I for example use CppPerformanceBenchmarks (an older version is also included in LLVM test suite). |
My goal right now is to make this simple and easy to use. I'm not opposed to making something configurable, but it would be nice to have a default that actually helped improve performance, so that a new user or a distribution maintainer could just use the cache file as is. What about if we tried to improve the existing in tree corpus (which only has hello-world). @petrhosek Is there anything in your corpus that you can contribute upstream? Any suggestions for what to put in there? Maybe a few of the bigger c++ files from in the tree? |
check-llvm Also add an option to supply a cmake project to use to generate profile data.
I was able to figure out how to generate profile data by compiling Sema.cpp from the clang source tree, so I switched that to the default and added an option so that users can override this with an external CMake project, like the LLVM Test Suite. |
@tstellar thank you for looping me in. When I did my experiments I've used the llvm-test-suite and |
set(generate_profraw_clang_sema tools/clang/lib/Sema/CMakeFiles/obj.clangSema.dir/Sema.cpp.o) | ||
llvm_ExternalProject_Add(generate-profraw-clang ${CMAKE_CURRENT_SOURCE_DIR}/../../../llvm | ||
USE_TOOLCHAIN EXLUDE_FROM_ALL NO_INSTALL DEPENDS generate-profraw | ||
EXTRA_TARGETS generate_profraw_clang_sema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be really fragile. The paths of object file outputs are an implementation detail of CMake that has changed in the past and could change in the future, also I think only the makefile and ninja generator actually export those as buildable targets.
Beyond that I also have a philosophical problem with the approach of using living clang sources as training data. I think this can result in unexplainable or difficult to diagnose differences in the efficacy of PGO because as you change the compiler you're also innately changing the training data. A potential better solution would be to have a preprocessed Sema.cpp from a recent LLVM release that we check in as a stable copy. This would be similar to how the GCC compile-time benchmark works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be really fragile. The paths of object file outputs are an implementation detail of CMake that has changed in the past and could change in the future, also I think only the makefile and ninja generator actually export those as buildable targets.
What about using the library target clangSema
instead of the individual file?
Beyond that I also have a philosophical problem with the approach of using living clang sources as training data. I think this can result in unexplainable or difficult to diagnose differences in the efficacy of PGO because as you change the compiler you're also innately changing the training data. A potential better solution would be to have a preprocessed Sema.cpp from a recent LLVM release that we check in as a stable copy. This would be similar to how the GCC compile-time benchmark works.
The first thing I tried to do was add a preprocessed version of CommentSema.cpp, but the file was 10MB, and I thought that might be too big. Sema.cpp preprocessed is 5MB, which is smaller but still pretty big. I would be OK with this approach if we think it's OK to add a big file like this to the repo.
I'm also not concerned about efficacy of PGO, because building Sema.cpp is just the default. I really want there to be a good default experience for users, so that they can use the PGO cache files and things will 'just work'. This patch also adds a configuration option for advanced users that need something more consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, the PGO use case is already an advanced-user feature. It's even documented in the "Advanced Builds" documentation.
If we want some more complex C++ sources there are some samples we could look at, like Kaleidoscope. The Kaleidoscope chapter 2 source builds without dependencies, and the other ones are also pretty simple to just compile.
If you're not concerned about the efficacy of the PGO build, I'm not sure why the existing single hello_world is a bad default. It isn't great but it is stable and something. If you want something that gives better performance then you must care about the efficacy, and I think you need to be somewhat concerned about the stability of those results, otherwise what's the point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're not concerned about the efficacy of the PGO build, I'm not sure why the existing single hello_world is a bad default. It isn't great but it is stable and something. If you want something that gives better performance then you must care about the efficacy, and I think you need to be somewhat concerned about the stability of those results, otherwise what's the point?
I'm concerned about the efficacy if it is 0%, but not really concerned if it bounces around between say 10 and 25 % based on updates to the file we are using to train.
I tested with the existing default (hello_world) and I didn't see any performance improvements from that. We also did some manual builds (not using the 2-stage PGO cache) and could not get any performance gains for using a hello world source. How certain are we that hello_world is enough to get performance improvements from PGO? Has anyone tested this recently? I will re-run my tests just to double check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw small gains when I added it, but that was years ago. My confidence in it having any meaningful gain today is close to 0. I still think using training sources that have no stability is a bad idea. I'd feel much more comfortable using the LLVM examples or some other coding samples as testing sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.
I've updated the patch to use the check-cxx target to generate the profile data. The gains (~20%) are less than with Sema.cpp (~34%) but still pretty good. The only issue is that the profile data comes from building libcxx as well, so it's not just from the test suite.
It may be possible to get test data from just running the test suite, but that might be a lot more complicated to set up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.
@petrhosek are you suggesting to use libc++ from in-tree or from a fixed and already released version as @llvm-beanz has suggested here? He hasn't suggested to use libc++ but a recent LLVM release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tstellar For the numbers you reported (~34% and ~20%), what did you use to collect those numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was compiling just a single file (SemaChecking.cpp).
I split the new configuration option out into a separate PR: #78879, because it seems useful regardless of what the default is. We can continue to discuss the best default on this PR. |
I want to add something as a side note because it is PGO related. Last year I've experiment with a PGO enabled LLVM toolchain on Fedora (1, 2). Summary of my experimentFor training data I've used a set of packages that we build on Fedora and I've smuggled in my PGO toolchain. The RPM packaging was also tweaked to pick up the profile data and create a sub packaging automatically just like we do for debug information. This part was rather tricky and RPM related but it worked and we ended up having a profiledata-sub-package for any given project in Fedora that uses LLVM. Like I've said I've only used it for a handful of project as demonstration. Encountered problemBut back to the point I wanted to make: disk space for profile data can quickly increase. I've compiled chromium in my tests and our builders exploded because of diskspace just for profile data:
QuestionIs it possible to have CMake collect and merge the profile data in the background? I've used tools like I understand this is a special use case but one should not underestimate the size of profile data, that's all I'm saying. |
Also, add a new option CLANG_PGO_TRAINING_DATA_SOURCE_DIR which allows users to specify a CMake project to use for generating the profile data.
I benchmarked this change by compiling the SemaChecking.cpp file from clang. Using Sema.cpp to generate the profile data gives a 34% speedup in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO).
Prior to this change, the profile data was being generated only by compiling a hello world program, and I was seeing ~5% speedup when comparing the stage2 and stage1 builds.
I choose the Sema.cpp file for generating the profile data, because this is the same thing that is done for the Windows release builds. See 060af26.