Update gcc version in CI #5297
Conversation
Hit error …
Hi @stgpetrovic, do you know how to clean the remote Bazel cache? I added … The context is that I'm trying to update the CI base image to the CUDA development container. The motivation is that we need to patch OpenXLA to make it compile in the current CI container, which is fairly old, and the patches grow over time, making each pin update more difficult than the last. The PR for updating the CI container is #5290, but I'm hitting a GPU test failure in the new container there. As a step back, I'm updating GCC in the current CI container instead; this way we can also get rid of the patches.
Hey there, you can change the cache key; if it is cache-key-ci you can append -1 or the like. Generally there should be one key per machine environment. I've even seen Docker hashes used as keys. There is no issue in bumping the key now (it has a TTL). Hope that helps.
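A minimal sketch of what bumping the key can look like, assuming the CI build script passes a cache silo key to Bazel; the config name, key string, and script shape here are illustrative, not necessarily what pytorch/xla's scripts actually contain:

```sh
#!/usr/bin/env bash
# Hypothetical CI build step: the remote cache is shared across runs, keyed by
# a "silo" string. Appending a suffix (e.g. "-1") forces a cold cache without
# deleting anything server side; old entries simply age out via the TTL.
CACHE_SILO_KEY="cache-key-ci-1"   # was "cache-key-ci"; bump the suffix to invalidate

bazel build //... \
  --config=remote_cache \
  --remote_default_exec_properties="cache-silo-key=${CACHE_SILO_KEY}"
```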
Force-pushed from 74a9a48 to cd22869.
Hi @stgpetrovic, I found the remote cache key is set here. I appended '-1' to see if it works. Also, a follow-up question for my own education: is there a TTL for all build caches? Does that mean one CI run will periodically be slower because of the cache TTL? Another question: if we know the cache key, is there a way to manually remove the remote cache via a command? Thanks!
Good news: CI passed with the upgraded GCC and without the statusor patch. Let me try to remove more patches before merging. We can get rid of the CUDA patch and the Triton patch, which didn't compile earlier.
nice!
CI failed again due to the remote cache key; we need to use a new cache key for a different GCC version.
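One way to avoid this class of failure is to derive the silo key from the toolchain itself, so a GCC upgrade automatically lands in a fresh cache. A rough sketch under that assumption; the flag usage is real Bazel, but the key format and target pattern are illustrative:

```sh
#!/usr/bin/env bash
# Hypothetical: fold the compiler version into the remote cache silo key so
# artifacts built with gcc-8 and gcc-9 never collide in the shared cache.
GCC_VERSION="$(gcc -dumpfullversion)"
CACHE_SILO_KEY="cache-key-ci-gcc-${GCC_VERSION}"

bazel test //test/... \
  --remote_default_exec_properties="cache-silo-key=${CACHE_SILO_KEY}"
```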
Even though the CI passed earlier, it hits an error …
It should be the same issue as in pytorch/pytorch#105248. However, only removing … The root cause is that when installing …
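Judging from the commit list below ("hack libstdc++ version", "rm all libstdc++ reference in conda"), the failure appears related to which libstdc++ gets picked up at runtime; that reading is an assumption, not something stated explicitly above. A hedged way to diagnose that kind of mismatch, with illustrative paths and package names:

```sh
#!/usr/bin/env bash
# Compare the GLIBCXX symbol versions provided by the conda copy of libstdc++
# with the system copy; an older conda copy shadowing the system one typically
# surfaces as "version `GLIBCXX_x.y.z' not found" after upgrading GCC.
# (Paths are illustrative; adjust CONDA_PREFIX / the distro lib path as needed.)
strings "${CONDA_PREFIX}/lib/libstdc++.so.6" | grep '^GLIBCXX_' | sort -V | tail -n 3
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep '^GLIBCXX_' | sort -V | tail -n 3

# One mitigation suggested by the commit list: drop the conda-provided libstdc++
# so the loader falls back to the system library.
# conda remove --force libstdcxx-ng   # hypothetical package name; verify first
```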
After fixing the build failure, the GPU test …
Triggering TPU CI to see whether upgrading GCC/G++ affects TPU, since TPU CI failed in #5290; this rules the newer gcc/g++ in or out as the cause.
@lsy323 Is this PR ready for review?
@JackCaoG Yes, please review.
Mostly LGTM. Can you open a pytorch PR like https://github.com/pytorch/pytorch/commits/main/.github/ci_commit_pins/xla.txt and update the xla pin to this PR's latest commit? I want to make sure this change does not break upstream CI.
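For context, a rough sketch of what that pin update involves. The file path is taken from the link above; the checkout locations and branch name are illustrative assumptions:

```sh
#!/usr/bin/env bash
# Assume side-by-side checkouts of pytorch/xla and pytorch/pytorch.
XLA_REPO=~/xla
PYTORCH_REPO=~/pytorch

# The head commit of this PR's branch becomes the new pin.
XLA_PIN="$(git -C "${XLA_REPO}" rev-parse HEAD)"

# Point upstream's xla commit pin at that hash and commit it on a new branch;
# a PR opened from this branch exercises upstream CI against the pinned commit.
echo "${XLA_PIN}" > "${PYTORCH_REPO}/.github/ci_commit_pins/xla.txt"
git -C "${PYTORCH_REPO}" checkout -b update-xla-pin   # illustrative branch name
git -C "${PYTORCH_REPO}" commit -am "Update xla commit pin"
```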
Created upstream PR pytorch/pytorch#105360 for testing.
Upstream CI passed with the pin; we can merge.
* update gcc version in ci image
* use gcc 9
* run bazel clean before compiling xla
* avoid remove nvcc
* do not remove old gcc which also removes nvcc
* change bazel cache key
* remove abslor patch
* update remote cache key for test script
* remove cuda graph patch and triton patch
* use gcc-8 to align with dev container
* update remote cache key for gcc-8
* Revert "update remote cache key for gcc-8" (reverts commit ecd2964)
* Revert "use gcc-8 to align with dev container" (reverts commit a93402f)
* try new cache key
* hack libstdc++ version
* hack libstdc++ version in test.sh
* Revert "hack libstdc++ version in test.sh" (reverts commit dae89e8)
* rm all libstdc++ reference in conda
* Revert "remove cuda graph patch and triton patch" (reverts commit f5015ff)
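The "avoid remove nvcc" and "do not remove old gcc which also removes nvcc" commits point at a common pitfall: purging the distro gcc can drag the CUDA/nvcc packages out with it via apt dependencies. A hedged sketch of the safer pattern, assuming an Ubuntu-based CI image; the exact Dockerfile and chosen GCC version may differ:

```sh
#!/usr/bin/env bash
# Install the newer toolchain alongside the old one instead of purging gcc
# (purging can remove cuda/nvcc through apt dependencies), then switch the
# default compiler with update-alternatives.
apt-get update
apt-get install -y gcc-9 g++-9

update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90 \
                    --slave   /usr/bin/g++ g++ /usr/bin/g++-9
update-alternatives --set gcc /usr/bin/gcc-9
```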
Before #5290 (update the CI image to be the same as the dev container) lands, we can upgrade the GCC version in the current CI image to get rid of some patches. The reason for upgrading GCC rather than Clang is that we currently force GCC in Bazel at
xla/.bazelrc, line 27 (at commit 6fe5cb9).
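For reference, a sketch of the kind of setting that pins the compiler. The actual line 27 of .bazelrc is not reproduced here; the flag shown is an assumption about how a Bazel build is typically forced onto GCC, written as a command-line invocation so it stands alone:

```sh
#!/usr/bin/env bash
# Illustrative only: forcing a specific host compiler for a Bazel build.
# Passing CC/CXX through --action_env makes Bazel's autoconfigured C++
# toolchain pick up gcc/g++ instead of whatever clang is on the PATH.
bazel build //... \
  --action_env=CC=/usr/bin/gcc \
  --action_env=CXX=/usr/bin/g++
```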