Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CMake][PGO] Build Sema.cpp to generate profdata for PGO builds #77347

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

tstellar
Copy link
Collaborator

@tstellar tstellar commented Jan 8, 2024

Also, add a new option CLANG_PGO_TRAINING_DATA_SOURCE_DIR which allows users to specify a CMake project to use for generating the profile data.

I benchmarked this change by compiling the SemaChecking.cpp file from clang. Using Sema.cpp to generate the profile data gives a 34% speedup in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO).

Prior to this change, the profile data was being generated only by compiling a hello world program, and I was seeing ~5% speedup when comparing the stage2 and stage1 builds.

I choose the Sema.cpp file for generating the profile data, because this is the same thing that is done for the Windows release builds. See 060af26.

When doing a multi-stage PGO build of clang, run the check-clang and
check-llvm targets using the instrumented clang and use that profile
data for building the final stage2 clang.  This is what is recommended
by our official documentation: https://llvm.org/docs/HowToBuildWithPGO.html#building-clang-with-pgo

I benchmarked this change by compiling the SemaChecking.cpp file from
clang.  Using check-clang/check-llvm to generate the profile data gives a 25% speedup
in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO).

Prior to this change, I was only seeing ~5% speedup when comparing the
stage2 and stage1 builds.
@tstellar
Copy link
Collaborator Author

tstellar commented Jan 8, 2024

@kwk fyi

@llvmbot llvmbot added the clang Clang issues not falling into any other category label Jan 8, 2024
@llvmbot
Copy link
Member

llvmbot commented Jan 8, 2024

@llvm/pr-subscribers-clang

Author: Tom Stellard (tstellar)

Changes

When doing a multi-stage PGO build of clang, run the check-clang and check-llvm targets using the instrumented clang and use that profile data for building the final stage2 clang. This is what is recommended by our official documentation: https://llvm.org/docs/HowToBuildWithPGO.html#building-clang-with-pgo

I benchmarked this change by compiling the SemaChecking.cpp file from clang. Using check-clang/check-llvm to generate the profile data gives a 25% speedup in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO).

Prior to this change, I was only seeing ~5% speedup when comparing the stage2 and stage1 builds.


Full diff: https://github.com/llvm/llvm-project/pull/77347.diff

2 Files Affected:

  • (modified) clang/utils/perf-training/CMakeLists.txt (+3-3)
  • (modified) clang/utils/perf-training/perf-helper.py (+9-7)
diff --git a/clang/utils/perf-training/CMakeLists.txt b/clang/utils/perf-training/CMakeLists.txt
index c6d51863fb1b5c..95ff8115aa538b 100644
--- a/clang/utils/perf-training/CMakeLists.txt
+++ b/clang/utils/perf-training/CMakeLists.txt
@@ -15,7 +15,7 @@ if(LLVM_BUILD_INSTRUMENTED)
     )
 
   add_custom_target(clear-profraw
-    COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py clean ${CMAKE_CURRENT_BINARY_DIR} profraw
+    COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py clean ${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_BINARY_DIR}/profiles/ profraw
     COMMENT "Clearing old profraw data")
 
   if(NOT LLVM_PROFDATA)
@@ -26,9 +26,9 @@ if(LLVM_BUILD_INSTRUMENTED)
     message(STATUS "To enable merging PGO data LLVM_PROFDATA has to point to llvm-profdata")
   else()
     add_custom_target(generate-profdata
-      COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py merge ${LLVM_PROFDATA} ${CMAKE_CURRENT_BINARY_DIR}/clang.profdata ${CMAKE_CURRENT_BINARY_DIR}
+      COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py merge ${LLVM_PROFDATA} ${CMAKE_CURRENT_BINARY_DIR}/clang.profdata ${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_BINARY_DIR}/profiles/
       COMMENT "Merging profdata"
-      DEPENDS generate-profraw)
+      DEPENDS generate-profraw check-clang check-llvm)
   endif()
 endif()
 
diff --git a/clang/utils/perf-training/perf-helper.py b/clang/utils/perf-training/perf-helper.py
index 99d6a3333b6ef0..bd8f74c9c2e129 100644
--- a/clang/utils/perf-training/perf-helper.py
+++ b/clang/utils/perf-training/perf-helper.py
@@ -30,26 +30,28 @@ def findFilesWithExtension(path, extension):
 
 
 def clean(args):
-    if len(args) != 2:
+    if len(args) < 2:
         print(
-            "Usage: %s clean <path> <extension>\n" % __file__
+            "Usage: %s clean <paths> <extension>\n" % __file__
             + "\tRemoves all files with extension from <path>."
         )
         return 1
-    for filename in findFilesWithExtension(args[0], args[1]):
-        os.remove(filename)
+    for path in args[1:-1]:
+       for filename in findFilesWithExtension(path, args[-1]):
+           os.remove(filename)
     return 0
 
 
 def merge(args):
-    if len(args) != 3:
+    if len(args) < 3:
         print(
-            "Usage: %s merge <llvm-profdata> <output> <path>\n" % __file__
+            "Usage: %s merge <llvm-profdata> <output> <paths>\n" % __file__
             + "\tMerges all profraw files from path into output."
         )
         return 1
     cmd = [args[0], "merge", "-o", args[1]]
-    cmd.extend(findFilesWithExtension(args[2], "profraw"))
+    for i in range(2, len(args)):
+      cmd.extend(findFilesWithExtension(args[i], "profraw"))
     subprocess.check_call(cmd)
     return 0
 

Copy link

github-actions bot commented Jan 8, 2024

✅ With the latest revision this PR passed the Python code formatter.

@tstellar tstellar changed the title [CMake][PGO] Use check-clang target to generate profdata for PGO builds [CMake][PGO] Use check targets to generate profdata for PGO builds Jan 8, 2024
@llvm-beanz
Copy link
Collaborator

Looping in some Apple people. I'm unsure if Apple is still using this infrastructure for generating PGO data. If so this change will impact them. When I wrote this, Apple had some additional test cases that got layered on top of the publicly available tests.

It might make sense to make this configurable by having a user-definable (advanced) variable that specifies which targets to run to generate PGO data.

cc: @ributzka, @Bigcheese

@petrhosek
Copy link
Member

I'd also prefer to make this configurable, we're using our own corpus which in my experiments both produces better results and takes less time than check-llvm and check-clang.

We should also consider updating the documentation since I don't think that check-llvm and check-clang is what we should be recommending; LLVM and Clang tests have different goal, that is to provide maximum coverage which often means exercising various corner cases and error conditions, but these are not helpful when collecting profiles, in fact they can be harmful.

I think that LLVM test-suite is actually a better fit since more representative of real code. In our training corpus, I for example use CppPerformanceBenchmarks (an older version is also included in LLVM test suite).

@tstellar
Copy link
Collaborator Author

tstellar commented Jan 8, 2024

My goal right now is to make this simple and easy to use. I'm not opposed to making something configurable, but it would be nice to have a default that actually helped improve performance, so that a new user or a distribution maintainer could just use the cache file as is.

What about if we tried to improve the existing in tree corpus (which only has hello-world). @petrhosek Is there anything in your corpus that you can contribute upstream? Any suggestions for what to put in there? Maybe a few of the bigger c++ files from in the tree?

check-llvm

Also add an option to supply a cmake project to use to generate profile
data.
@tstellar tstellar changed the title [CMake][PGO] Use check targets to generate profdata for PGO builds [CMake][PGO] Build Sema.cpp to generate profdata for PGO builds Jan 12, 2024
@tstellar
Copy link
Collaborator Author

I was able to figure out how to generate profile data by compiling Sema.cpp from the clang source tree, so I switched that to the default and added an option so that users can override this with an external CMake project, like the LLVM Test Suite.

@tstellar
Copy link
Collaborator Author

cc @zmodem Since I borrowed your idea from #71067.

@kwk
Copy link
Contributor

kwk commented Jan 12, 2024

@tstellar thank you for looping me in. When I did my experiments I've used the llvm-test-suite and test-suite/utils/compare.py --metric exec_time --metric compile_time --metric link_time --lhs-name 16.0.3 --rhs-name 16.0.2-pgo /root/rawhide/results.json vs /root/pgo/results.json to get the geomean difference. How do you measure performance? Maybe the answer is in your code but I haven't checked your code yet because I am reading this on my phone.

set(generate_profraw_clang_sema tools/clang/lib/Sema/CMakeFiles/obj.clangSema.dir/Sema.cpp.o)
llvm_ExternalProject_Add(generate-profraw-clang ${CMAKE_CURRENT_SOURCE_DIR}/../../../llvm
USE_TOOLCHAIN EXLUDE_FROM_ALL NO_INSTALL DEPENDS generate-profraw
EXTRA_TARGETS generate_profraw_clang_sema
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be really fragile. The paths of object file outputs are an implementation detail of CMake that has changed in the past and could change in the future, also I think only the makefile and ninja generator actually export those as buildable targets.

Beyond that I also have a philosophical problem with the approach of using living clang sources as training data. I think this can result in unexplainable or difficult to diagnose differences in the efficacy of PGO because as you change the compiler you're also innately changing the training data. A potential better solution would be to have a preprocessed Sema.cpp from a recent LLVM release that we check in as a stable copy. This would be similar to how the GCC compile-time benchmark works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be really fragile. The paths of object file outputs are an implementation detail of CMake that has changed in the past and could change in the future, also I think only the makefile and ninja generator actually export those as buildable targets.

What about using the library target clangSema instead of the individual file?

Beyond that I also have a philosophical problem with the approach of using living clang sources as training data. I think this can result in unexplainable or difficult to diagnose differences in the efficacy of PGO because as you change the compiler you're also innately changing the training data. A potential better solution would be to have a preprocessed Sema.cpp from a recent LLVM release that we check in as a stable copy. This would be similar to how the GCC compile-time benchmark works.

The first thing I tried to do was add a preprocessed version of CommentSema.cpp, but the file was 10MB, and I thought that might be too big. Sema.cpp preprocessed is 5MB, which is smaller but still pretty big. I would be OK with this approach if we think it's OK to add a big file like this to the repo.

I'm also not concerned about efficacy of PGO, because building Sema.cpp is just the default. I really want there to be a good default experience for users, so that they can use the PGO cache files and things will 'just work'. This patch also adds a configuration option for advanced users that need something more consistent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, the PGO use case is already an advanced-user feature. It's even documented in the "Advanced Builds" documentation.

If we want some more complex C++ sources there are some samples we could look at, like Kaleidoscope. The Kaleidoscope chapter 2 source builds without dependencies, and the other ones are also pretty simple to just compile.

If you're not concerned about the efficacy of the PGO build, I'm not sure why the existing single hello_world is a bad default. It isn't great but it is stable and something. If you want something that gives better performance then you must care about the efficacy, and I think you need to be somewhat concerned about the stability of those results, otherwise what's the point?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're not concerned about the efficacy of the PGO build, I'm not sure why the existing single hello_world is a bad default. It isn't great but it is stable and something. If you want something that gives better performance then you must care about the efficacy, and I think you need to be somewhat concerned about the stability of those results, otherwise what's the point?

I'm concerned about the efficacy if it is 0%, but not really concerned if it bounces around between say 10 and 25 % based on updates to the file we are using to train.

I tested with the existing default (hello_world) and I didn't see any performance improvements from that. We also did some manual builds (not using the 2-stage PGO cache) and could not get any performance gains for using a hello world source. How certain are we that hello_world is enough to get performance improvements from PGO? Has anyone tested this recently? I will re-run my tests just to double check.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw small gains when I added it, but that was years ago. My confidence in it having any meaningful gain today is close to 0. I still think using training sources that have no stability is a bad idea. I'd feel much more comfortable using the LLVM examples or some other coding samples as testing sources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.

I've updated the patch to use the check-cxx target to generate the profile data. The gains (~20%) are less than with Sema.cpp (~34%) but still pretty good. The only issue is that the profile data comes from building libcxx as well, so it's not just from the test suite.

It may be possible to get test data from just running the test suite, but that might be a lot more complicated to set up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.

@petrhosek are you suggesting to use libc++ from in-tree or from a fixed and already released version as @llvm-beanz has suggested here? He hasn't suggested to use libc++ but a recent LLVM release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tstellar For the numbers you reported (~34% and ~20%), what did you use to collect those numbers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was compiling just a single file (SemaChecking.cpp).

@tstellar
Copy link
Collaborator Author

I split the new configuration option out into a separate PR: #78879, because it seems useful regardless of what the default is. We can continue to discuss the best default on this PR.

@kwk
Copy link
Contributor

kwk commented Jan 23, 2024

I want to add something as a side note because it is PGO related. Last year I've experiment with a PGO enabled LLVM toolchain on Fedora (1, 2).

Summary of my experiment

For training data I've used a set of packages that we build on Fedora and I've smuggled in my PGO toolchain. The RPM packaging was also tweaked to pick up the profile data and create a sub packaging automatically just like we do for debug information. This part was rather tricky and RPM related but it worked and we ended up having a profiledata-sub-package for any given project in Fedora that uses LLVM.

Like I've said I've only used it for a handful of project as demonstration.

Encountered problem

But back to the point I wanted to make: disk space for profile data can quickly increase. I've compiled chromium in my tests and our builders exploded because of diskspace just for profile data:

LLVM Profile Error: Failed to write file "/builddir/build/BUILD/raw-pgo-profdata
//chromium.llvm.1970228969820616430_0.24617.profraw": No space left on device

Question

Is it possible to have CMake collect and merge the profile data in the background? I've used tools like inotifywait in my PoC to look for close_write events to any file under a build directory that matches .*\.profraw$. When an entry was found, I've stored it into a file to be consumed and all profiles in it merged, once there are 10 files. Afterwards plain profiles were also deleted to free up disk space. This continuous merge helped me reduce the size of total profile data from ~1,4GB (unmerged) to 1,6MB (merged). This was for the retsnoop project. Chromium simply took to long to build for my experiment.

I understand this is a special use case but one should not underestimate the size of profile data, that's all I'm saying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants