
Update gcc version in CI #5297

Merged: 19 commits into master from lsiyuan/update-ci-gcc on Jul 18, 2023

Conversation

lsy323
Collaborator

@lsy323 lsy323 commented Jul 12, 2023

Before #5290 (Update CI image to be the same as dev container) lands, we can upgrade the GCC version in the current CI image to get rid of some patches. The reason for upgrading GCC rather than Clang is that we currently force GCC in Bazel at

xla/.bazelrc, line 27 (at commit 6fe5cb9):

# Force GCC because clang/bazel has issues.
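
For illustration, pinning Bazel's auto-configured C++ toolchain to GCC is usually done by overriding the CC/CXX environment it sees; a minimal command-line sketch (the actual mechanism in xla/.bazelrc may differ):

    # Hypothetical: point toolchain autodetection at GCC instead of any clang on PATH.
    bazel build //... --repo_env=CC=/usr/bin/gcc --repo_env=CXX=/usr/bin/g++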

@lsy323
Collaborator Author

lsy323 commented Jul 12, 2023

Hit the error this rule is missing dependency declarations. According to https://stackoverflow.com/questions/48155976/bazel-undeclared-inclusions-errors-after-updating-gcc, it is caused by not cleaning up the Bazel cache after upgrading GCC. Adding bazel clean --expunge still doesn't fix it, so I'm starting #5298 from a fork to avoid using the remote cache and see if that works.

@lsy323
Collaborator Author

lsy323 commented Jul 12, 2023

Hit the error this rule is missing dependency declarations. According to https://stackoverflow.com/questions/48155976/bazel-undeclared-inclusions-errors-after-updating-gcc, it is caused by not cleaning up the Bazel cache after upgrading GCC.

Hi @stgpetrovic, do you know how to clean the remote Bazel cache? I added bazel clean --expunge but it doesn't help. I started another PR, #5298, from a fork branch so that the remote cache is disabled, and the build passes without the remote cache.
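
For reference, a build can also skip the remote cache without forking; a minimal sketch, assuming the cache is wired through Bazel's standard remote-cache flags:

    # Do not read cached results and do not upload new ones for this build.
    bazel build //... --noremote_accept_cached --remote_upload_local_results=false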

For context, I'm trying to update the CI base image to the CUDA development container. The motivation is that we need to patch OpenXLA to compile in the current CI container, which is fairly old, and the patches grow over time, making each pin update more difficult than the last.

The PR for updating the CI container is #5290, but I'm hitting a GPU test failure in the new container there. As a smaller step, I'm updating GCC in the current CI container; this way we can also get rid of the patches.

cc @JackCaoG @will-cromar

@lsy323 lsy323 added the DO_NOT_MERGE_YET (For PRs which cannot be merged, despite tests passing) and DO_NOT_REVIEW_YET labels Jul 12, 2023
@stgpetrovic
Collaborator

Hey there,

you can change the cache key, so if it is cache-key-ci you can append -1 or the like. Generally there should be one per machine environment; I've even seen Docker hashes being used as keys. There is no issue in bumping the key now (it has a TTL).

Hope that helps

@lsy323
Collaborator Author

lsy323 commented Jul 13, 2023

Hey there,

you can change the cache key, so if it is cache-key-ci you can append -1 or the like. Generally there should be one per machine environment; I've even seen Docker hashes being used as keys. There is no issue in bumping the key now (it has a TTL).

Hope that helps

Hi @stgpetrovic, I found that the remote cache key is set here. I appended '-1' to it to see if that works.
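
For illustration, this is roughly what bumping a cache silo key looks like on the Bazel command line; the flag, key name, and endpoint here are assumptions, and the CI scripts may wire it differently:

    # Hypothetical: artifacts built under the old key (pre-GCC-upgrade) are simply never matched.
    bazel build //... \
      --remote_cache=https://remote-cache.example.com \
      --remote_default_exec_properties=cache-silo-key=cache-key-ci-1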

Also, a follow-up question for my own education: is there a TTL on all build caches? Does that mean one CI run will periodically be slower because the cache has expired?

Another question: if we know the cache key, is there a way to manually remove the remote cache via a command?

Thanks!

@lsy323 lsy323 marked this pull request as ready for review July 13, 2023 15:02
@lsy323
Collaborator Author

lsy323 commented Jul 13, 2023

Good news: CI passed with the upgraded GCC and without the statusor patch. Let me try to remove more patches before merging. We can get rid of the CUDA patch and the Triton patch, which didn't compile earlier.
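
As a sketch of what removing a patch involves (the directory and file name below are assumptions, not necessarily the repository's actual layout):

    # Hypothetical: drop the patch file and the workspace entry that applies it.
    git rm openxla_patches/statusor.diff
    # ...then delete the matching entry from the patches list in the Bazel workspace.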

Collaborator

@JackCaoG JackCaoG left a comment

nice!

@lsy323
Collaborator Author

lsy323 commented Jul 13, 2023

CI failed again due to the remote cache key; we need to use a new cache key for each GCC version.

@lsy323 lsy323 added the CI (CI related change) label and removed the DO_NOT_MERGE_YET (For PRs which cannot be merged, despite tests passing) and DO_NOT_REVIEW_YET labels Jul 13, 2023
@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

Even though the CI passed earlier, it hits an error:

2023-07-16T16:26:35.0012661Z Traceback (most recent call last):
2023-07-16T16:26:35.0013104Z   File "setup.py", line 53, in <module>
2023-07-16T16:26:35.0013524Z     from torch.utils.cpp_extension import BuildExtension
2023-07-16T16:26:35.0014179Z   File "/opt/conda/lib/python3.8/site-packages/torch/__init__.py", line 236, in <module>
2023-07-16T16:26:35.0014597Z     from torch._C import *  # noqa: F403
2023-07-16T16:26:35.0015271Z ImportError: /opt/conda/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

It should be the same issue as pytorch/pytorch#105248. However, removing only /opt/conda/lib/libstdc++.so.6 doesn't work in our case; we need to remove all libstdc++ shared library references in conda. Then the build passes.
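
A hedged sketch of that workaround, with illustrative paths (the actual CI script may do this differently):

    # Remove conda's bundled libstdc++ copies so the loader falls back to the system
    # libstdc++ that ships with the newer GCC.
    find /opt/conda -name 'libstdc++.so*' -exec rm -f {} +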

The root cause is that installing gcc-11 also pulls in gcc-13-base, which introduces a mismatch between the libstdc++ installed with gcc-11 and the one in the conda environment. Not sure if this is expected; it may be a quirk of the Ubuntu package dependency graph.
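
A quick way to confirm the mismatch is to check which GLIBCXX symbol versions each libstdc++ copy provides (system path shown for Ubuntu x86_64; adjust as needed):

    strings /opt/conda/lib/libstdc++.so.6 | grep -c GLIBCXX_3.4.30              # 0 means the conda copy is too old
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep -c GLIBCXX_3.4.30   # copy installed with the new GCC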

@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

After fixing the build failure, the GPU test TestPjRtDistributedDataParallel.test_ddp_correctness_env_init hangs. I reverted the removal of the CUDA graph patch and the Triton patch, and then the GPU tests passed.
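
For reference, the restore is a plain revert of the removal commit (the sha appears in the commit list at the end of this PR):

    git revert f5015ff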

@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

Triggering TPU CI to see if upgrading GCC/G++ affects TPU, since TPU CI failed in #5290; running it here rules out the newer gcc/g++ as the cause.

@JackCaoG
Collaborator

@lsy323 Is this PR ready for review?

@lsy323 lsy323 requested a review from JackCaoG July 17, 2023 17:17
@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

@JackCaoG Yes, please review.

@JackCaoG
Collaborator

Mostly LGTM. Can you open a PyTorch PR like https://github.com/pytorch/pytorch/commits/main/.github/ci_commit_pins/xla.txt and update the XLA pin to this PR's latest commit? I want to make sure this change does not break upstream CI.
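
For illustration, the pin bump itself is a one-line file change on a pytorch/pytorch branch; a minimal sketch, with a placeholder standing in for this PR's latest commit sha:

    # Hypothetical commands in a pytorch/pytorch checkout; the placeholder is not a real sha.
    echo "<latest-commit-sha-from-this-PR>" > .github/ci_commit_pins/xla.txt
    git commit -am "Update xla commit pin for CI testing"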

@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

Mostly LGTM. Can you open a PyTorch PR like https://github.com/pytorch/pytorch/commits/main/.github/ci_commit_pins/xla.txt and update the XLA pin to this PR's latest commit? I want to make sure this change does not break upstream CI.

Created upstream PR pytorch/pytorch#105360 for testing

@lsy323
Collaborator Author

lsy323 commented Jul 18, 2023

Waiting for https://github.com/pytorch/pytorch/actions/runs/5580888287/jobs/10200378755?pr=105360 to finish.

Upstream CI passed with the pin, so we can merge.

@JackCaoG JackCaoG merged commit e7e1899 into master Jul 18, 2023
8 checks passed
khatwanimohit pushed a commit that referenced this pull request Jul 20, 2023
* update gcc version in ci image

* use gcc 9

* run bazel clean before compiling xla

* avoid remove nvcc

* do not remove old gcc which also removes nvcc

* change bazel cache key

* remove abslor patch

* update remote cache key for test script

* remove cuda graph patch and triton patch

* use gcc-8 to align with dev container

* update remote cache key for gcc-8

* Revert "update remote cache key for gcc-8"

This reverts commit ecd2964.

* Revert "use gcc-8 to align with dev container"

This reverts commit a93402f.

* try new cache key

* hack libstdc++ version

* hack libstdc++ version in test.sh

* Revert "hack libstdc++ version in test.sh"

This reverts commit dae89e8.

* rm all libstdc++ reference in conda

* Revert "remove cuda graph patch and triton patch"

This reverts commit f5015ff.
@lsy323 lsy323 self-assigned this Jul 25, 2023
@lsy323 lsy323 deleted the lsiyuan/update-ci-gcc branch July 28, 2023 00:21