[CMake][PGO] Build Sema.cpp to generate profdata for PGO builds #77347

tstellar · 2024-01-08T17:39:44Z

Also, add a new option CLANG_PGO_TRAINING_DATA_SOURCE_DIR which allows users to specify a CMake project to use for generating the profile data.

I benchmarked this change by compiling the SemaChecking.cpp file from clang. Using Sema.cpp to generate the profile data gives a 34% speedup in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO).

Prior to this change, the profile data was being generated only by compiling a hello world program, and I was seeing ~5% speedup when comparing the stage2 and stage1 builds.

I choose the Sema.cpp file for generating the profile data, because this is the same thing that is done for the Windows release builds. See 060af26.

When doing a multi-stage PGO build of clang, run the check-clang and check-llvm targets using the instrumented clang and use that profile data for building the final stage2 clang. This is what is recommended by our official documentation: https://llvm.org/docs/HowToBuildWithPGO.html#building-clang-with-pgo I benchmarked this change by compiling the SemaChecking.cpp file from clang. Using check-clang/check-llvm to generate the profile data gives a 25% speedup in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO). Prior to this change, I was only seeing ~5% speedup when comparing the stage2 and stage1 builds.

tstellar · 2024-01-08T17:39:53Z

@kwk fyi

llvmbot · 2024-01-08T17:40:18Z

@llvm/pr-subscribers-clang

Author: Tom Stellard (tstellar)

Changes

When doing a multi-stage PGO build of clang, run the check-clang and check-llvm targets using the instrumented clang and use that profile data for building the final stage2 clang. This is what is recommended by our official documentation: https://llvm.org/docs/HowToBuildWithPGO.html#building-clang-with-pgo

I benchmarked this change by compiling the SemaChecking.cpp file from clang. Using check-clang/check-llvm to generate the profile data gives a 25% speedup in the PGO+LTO stage2 clang when compared to the stage1 clang (no-LTO).

Prior to this change, I was only seeing ~5% speedup when comparing the stage2 and stage1 builds.

Full diff: https://github.com/llvm/llvm-project/pull/77347.diff

2 Files Affected:

(modified) clang/utils/perf-training/CMakeLists.txt (+3-3)
(modified) clang/utils/perf-training/perf-helper.py (+9-7)

diff --git a/clang/utils/perf-training/CMakeLists.txt b/clang/utils/perf-training/CMakeLists.txt
index c6d51863fb1b5c..95ff8115aa538b 100644
--- a/clang/utils/perf-training/CMakeLists.txt
+++ b/clang/utils/perf-training/CMakeLists.txt
@@ -15,7 +15,7 @@ if(LLVM_BUILD_INSTRUMENTED)
     )
 
   add_custom_target(clear-profraw
-    COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py clean ${CMAKE_CURRENT_BINARY_DIR} profraw
+    COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py clean ${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_BINARY_DIR}/profiles/ profraw
     COMMENT "Clearing old profraw data")
 
   if(NOT LLVM_PROFDATA)
@@ -26,9 +26,9 @@ if(LLVM_BUILD_INSTRUMENTED)
     message(STATUS "To enable merging PGO data LLVM_PROFDATA has to point to llvm-profdata")
   else()
     add_custom_target(generate-profdata
-      COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py merge ${LLVM_PROFDATA} ${CMAKE_CURRENT_BINARY_DIR}/clang.profdata ${CMAKE_CURRENT_BINARY_DIR}
+      COMMAND "${Python3_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/perf-helper.py merge ${LLVM_PROFDATA} ${CMAKE_CURRENT_BINARY_DIR}/clang.profdata ${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_BINARY_DIR}/profiles/
       COMMENT "Merging profdata"
-      DEPENDS generate-profraw)
+      DEPENDS generate-profraw check-clang check-llvm)
   endif()
 endif()
 
diff --git a/clang/utils/perf-training/perf-helper.py b/clang/utils/perf-training/perf-helper.py
index 99d6a3333b6ef0..bd8f74c9c2e129 100644
--- a/clang/utils/perf-training/perf-helper.py
+++ b/clang/utils/perf-training/perf-helper.py
@@ -30,26 +30,28 @@ def findFilesWithExtension(path, extension):
 
 
 def clean(args):
-    if len(args) != 2:
+    if len(args) < 2:
         print(
-            "Usage: %s clean <path> <extension>\n" % __file__
+            "Usage: %s clean <paths> <extension>\n" % __file__
             + "\tRemoves all files with extension from <path>."
         )
         return 1
-    for filename in findFilesWithExtension(args[0], args[1]):
-        os.remove(filename)
+    for path in args[1:-1]:
+       for filename in findFilesWithExtension(path, args[-1]):
+           os.remove(filename)
     return 0
 
 
 def merge(args):
-    if len(args) != 3:
+    if len(args) < 3:
         print(
-            "Usage: %s merge <llvm-profdata> <output> <path>\n" % __file__
+            "Usage: %s merge <llvm-profdata> <output> <paths>\n" % __file__
             + "\tMerges all profraw files from path into output."
         )
         return 1
     cmd = [args[0], "merge", "-o", args[1]]
-    cmd.extend(findFilesWithExtension(args[2], "profraw"))
+    for i in range(2, len(args)):
+      cmd.extend(findFilesWithExtension(args[i], "profraw"))
     subprocess.check_call(cmd)
     return 0

github-actions · 2024-01-08T17:42:05Z

✅ With the latest revision this PR passed the Python code formatter.

llvm-beanz · 2024-01-08T18:15:24Z

Looping in some Apple people. I'm unsure if Apple is still using this infrastructure for generating PGO data. If so this change will impact them. When I wrote this, Apple had some additional test cases that got layered on top of the publicly available tests.

It might make sense to make this configurable by having a user-definable (advanced) variable that specifies which targets to run to generate PGO data.

cc: @ributzka, @Bigcheese

petrhosek · 2024-01-08T18:40:31Z

I'd also prefer to make this configurable, we're using our own corpus which in my experiments both produces better results and takes less time than check-llvm and check-clang.

We should also consider updating the documentation since I don't think that check-llvm and check-clang is what we should be recommending; LLVM and Clang tests have different goal, that is to provide maximum coverage which often means exercising various corner cases and error conditions, but these are not helpful when collecting profiles, in fact they can be harmful.

I think that LLVM test-suite is actually a better fit since more representative of real code. In our training corpus, I for example use CppPerformanceBenchmarks (an older version is also included in LLVM test suite).

tstellar · 2024-01-08T19:46:52Z

My goal right now is to make this simple and easy to use. I'm not opposed to making something configurable, but it would be nice to have a default that actually helped improve performance, so that a new user or a distribution maintainer could just use the cache file as is.

What about if we tried to improve the existing in tree corpus (which only has hello-world). @petrhosek Is there anything in your corpus that you can contribute upstream? Any suggestions for what to put in there? Maybe a few of the bigger c++ files from in the tree?

check-llvm Also add an option to supply a cmake project to use to generate profile data.

tstellar · 2024-01-12T00:08:16Z

I was able to figure out how to generate profile data by compiling Sema.cpp from the clang source tree, so I switched that to the default and added an option so that users can override this with an external CMake project, like the LLVM Test Suite.

tstellar · 2024-01-12T00:39:54Z

cc @zmodem Since I borrowed your idea from #71067.

kwk · 2024-01-12T07:23:18Z

@tstellar thank you for looping me in. When I did my experiments I've used the llvm-test-suite and test-suite/utils/compare.py --metric exec_time --metric compile_time --metric link_time --lhs-name 16.0.3 --rhs-name 16.0.2-pgo /root/rawhide/results.json vs /root/pgo/results.json to get the geomean difference. How do you measure performance? Maybe the answer is in your code but I haven't checked your code yet because I am reading this on my phone.

llvm-beanz · 2024-01-12T16:22:58Z

clang/utils/perf-training/CMakeLists.txt

+      set(generate_profraw_clang_sema tools/clang/lib/Sema/CMakeFiles/obj.clangSema.dir/Sema.cpp.o)
+      llvm_ExternalProject_Add(generate-profraw-clang ${CMAKE_CURRENT_SOURCE_DIR}/../../../llvm
+              USE_TOOLCHAIN EXLUDE_FROM_ALL NO_INSTALL DEPENDS generate-profraw
+              EXTRA_TARGETS generate_profraw_clang_sema


This is going to be really fragile. The paths of object file outputs are an implementation detail of CMake that has changed in the past and could change in the future, also I think only the makefile and ninja generator actually export those as buildable targets.

Beyond that I also have a philosophical problem with the approach of using living clang sources as training data. I think this can result in unexplainable or difficult to diagnose differences in the efficacy of PGO because as you change the compiler you're also innately changing the training data. A potential better solution would be to have a preprocessed Sema.cpp from a recent LLVM release that we check in as a stable copy. This would be similar to how the GCC compile-time benchmark works.

This is going to be really fragile. The paths of object file outputs are an implementation detail of CMake that has changed in the past and could change in the future, also I think only the makefile and ninja generator actually export those as buildable targets.

What about using the library target clangSema instead of the individual file?

Beyond that I also have a philosophical problem with the approach of using living clang sources as training data. I think this can result in unexplainable or difficult to diagnose differences in the efficacy of PGO because as you change the compiler you're also innately changing the training data. A potential better solution would be to have a preprocessed Sema.cpp from a recent LLVM release that we check in as a stable copy. This would be similar to how the GCC compile-time benchmark works.

The first thing I tried to do was add a preprocessed version of CommentSema.cpp, but the file was 10MB, and I thought that might be too big. Sema.cpp preprocessed is 5MB, which is smaller but still pretty big. I would be OK with this approach if we think it's OK to add a big file like this to the repo.

I'm also not concerned about efficacy of PGO, because building Sema.cpp is just the default. I really want there to be a good default experience for users, so that they can use the PGO cache files and things will 'just work'. This patch also adds a configuration option for advanced users that need something more consistent.

IMO, the PGO use case is already an advanced-user feature. It's even documented in the "Advanced Builds" documentation.

If we want some more complex C++ sources there are some samples we could look at, like Kaleidoscope. The Kaleidoscope chapter 2 source builds without dependencies, and the other ones are also pretty simple to just compile.

If you're not concerned about the efficacy of the PGO build, I'm not sure why the existing single hello_world is a bad default. It isn't great but it is stable and something. If you want something that gives better performance then you must care about the efficacy, and I think you need to be somewhat concerned about the stability of those results, otherwise what's the point?

If you're not concerned about the efficacy of the PGO build, I'm not sure why the existing single hello_world is a bad default. It isn't great but it is stable and something. If you want something that gives better performance then you must care about the efficacy, and I think you need to be somewhat concerned about the stability of those results, otherwise what's the point?

I'm concerned about the efficacy if it is 0%, but not really concerned if it bounces around between say 10 and 25 % based on updates to the file we are using to train.

I tested with the existing default (hello_world) and I didn't see any performance improvements from that. We also did some manual builds (not using the 2-stage PGO cache) and could not get any performance gains for using a hello world source. How certain are we that hello_world is enough to get performance improvements from PGO? Has anyone tested this recently? I will re-run my tests just to double check.

I saw small gains when I added it, but that was years ago. My confidence in it having any meaningful gain today is close to 0. I still think using training sources that have no stability is a bad idea. I'd feel much more comfortable using the LLVM examples or some other coding samples as testing sources.

The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.

The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.

I've updated the patch to use the check-cxx target to generate the profile data. The gains (~20%) are less than with Sema.cpp (~34%) but still pretty good. The only issue is that the profile data comes from building libcxx as well, so it's not just from the test suite.

It may be possible to get test data from just running the test suite, but that might be a lot more complicated to set up.

The idea I'd like explore is using libc++ (and perhaps also libc) test suite for training. It consists of self-contained test programs which should sufficiently exercise various aspects of C/C++ frontend.

@petrhosek are you suggesting to use libc++ from in-tree or from a fixed and already released version as @llvm-beanz has suggested here? He hasn't suggested to use libc++ but a recent LLVM release.

@tstellar For the numbers you reported (~34% and ~20%), what did you use to collect those numbers?

I was compiling just a single file (SemaChecking.cpp).

tstellar · 2024-01-21T05:53:28Z

I split the new configuration option out into a separate PR: #78879, because it seems useful regardless of what the default is. We can continue to discuss the best default on this PR.

kwk · 2024-01-23T14:30:50Z

I want to add something as a side note because it is PGO related. Last year I've experiment with a PGO enabled LLVM toolchain on Fedora (1, 2).

Summary of my experiment

For training data I've used a set of packages that we build on Fedora and I've smuggled in my PGO toolchain. The RPM packaging was also tweaked to pick up the profile data and create a sub packaging automatically just like we do for debug information. This part was rather tricky and RPM related but it worked and we ended up having a profiledata-sub-package for any given project in Fedora that uses LLVM.

Like I've said I've only used it for a handful of project as demonstration.

Encountered problem

But back to the point I wanted to make: disk space for profile data can quickly increase. I've compiled chromium in my tests and our builders exploded because of diskspace just for profile data:

LLVM Profile Error: Failed to write file "/builddir/build/BUILD/raw-pgo-profdata
//chromium.llvm.1970228969820616430_0.24617.profraw": No space left on device

Question

Is it possible to have CMake collect and merge the profile data in the background? I've used tools like inotifywait in my PoC to look for close_write events to any file under a build directory that matches .*\.profraw$. When an entry was found, I've stored it into a file to be consumed and all profiles in it merged, once there are 10 files. Afterwards plain profiles were also deleted to free up disk space. This continuous merge helped me reduce the size of total profile data from ~1,4GB (unmerged) to 1,6MB (merged). This was for the retsnoop project. Chromium simply took to long to build for my experiment.

I understand this is a special use case but one should not underestimate the size of profile data, that's all I'm saying.

tstellar requested review from petrhosek and llvm-beanz January 8, 2024 17:39

llvmbot added the clang Clang issues not falling into any other category label Jan 8, 2024

Fix python formatting

4f77345

tstellar changed the title ~~[CMake][PGO] Use check-clang target to generate profdata for PGO builds~~ [CMake][PGO] Use check targets to generate profdata for PGO builds Jan 8, 2024

tstellar added 2 commits January 11, 2024 23:39

Build Sema.cpp to generate profile data instead of using check-clang and

d330bff

check-llvm Also add an option to supply a cmake project to use to generate profile data.

Update documentatin

9b6543d

tstellar changed the title ~~[CMake][PGO] Use check targets to generate profdata for PGO builds~~ [CMake][PGO] Build Sema.cpp to generate profdata for PGO builds Jan 12, 2024

tstellar added 2 commits January 12, 2024 00:42

Fix documentation build

adcbfc0

Fix documentation

8b8837a

llvm-beanz reviewed Jan 12, 2024

View reviewed changes

Use check-cxx to generate profile data instead of Sema.cpp

44aca59

boomanaiden154 mentioned this pull request Jan 30, 2024

[Clang][CMake] Add CSSPGO support to LLVM_BUILD_INSTRUMENTED #79942

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CMake][PGO] Build Sema.cpp to generate profdata for PGO builds #77347

[CMake][PGO] Build Sema.cpp to generate profdata for PGO builds #77347

tstellar commented Jan 8, 2024 •

edited

Loading

tstellar commented Jan 8, 2024

llvmbot commented Jan 8, 2024

github-actions bot commented Jan 8, 2024 •

edited

Loading

llvm-beanz commented Jan 8, 2024

petrhosek commented Jan 8, 2024

tstellar commented Jan 8, 2024

tstellar commented Jan 12, 2024

tstellar commented Jan 12, 2024

kwk commented Jan 12, 2024

llvm-beanz Jan 12, 2024

tstellar Jan 12, 2024

llvm-beanz Jan 12, 2024

tstellar Jan 13, 2024

llvm-beanz Jan 13, 2024

petrhosek Jan 16, 2024

tstellar Jan 17, 2024

kwk Jan 23, 2024

petrhosek Feb 13, 2024

tstellar Feb 14, 2024

tstellar commented Jan 21, 2024

kwk commented Jan 23, 2024 •

edited

Loading

[CMake][PGO] Build Sema.cpp to generate profdata for PGO builds #77347

Are you sure you want to change the base?

[CMake][PGO] Build Sema.cpp to generate profdata for PGO builds #77347

Conversation

tstellar commented Jan 8, 2024 • edited Loading

tstellar commented Jan 8, 2024

llvmbot commented Jan 8, 2024

github-actions bot commented Jan 8, 2024 • edited Loading

llvm-beanz commented Jan 8, 2024

petrhosek commented Jan 8, 2024

tstellar commented Jan 8, 2024

tstellar commented Jan 12, 2024

tstellar commented Jan 12, 2024

kwk commented Jan 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tstellar commented Jan 21, 2024

kwk commented Jan 23, 2024 • edited Loading

Summary of my experiment

Encountered problem

Question

tstellar commented Jan 8, 2024 •

edited

Loading

github-actions bot commented Jan 8, 2024 •

edited

Loading

kwk commented Jan 23, 2024 •

edited

Loading