
Update gcc version in CI #5297

Merged: 19 commits into master from lsiyuan/update-ci-gcc on Jul 18, 2023

Conversation

lsy323
Collaborator

@lsy323 lsy323 commented Jul 12, 2023

Before #5290 (Update CI image to be the same as dev container) lands, we can upgrade the GCC version in the current CI image to get rid of some patches. The reason for upgrading GCC rather than Clang is that we currently force GCC in Bazel at

xla/.bazelrc, line 27 (at commit 6fe5cb9):

# Force GCC because clang/bazel has issues.
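
For illustration, pinning Bazel's auto-configured C++ toolchain to GCC is usually done by overriding the CC/CXX environment it sees; a minimal command-line sketch (the actual mechanism in xla/.bazelrc may differ):

    # Hypothetical: point toolchain autodetection at GCC instead of any clang on PATH.
    bazel build //... --repo_env=CC=/usr/bin/gcc --repo_env=CXX=/usr/bin/g++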

@lsy323
Collaborator Author

lsy323 commented Jul 12, 2023

Hit the error this rule is missing dependency declarations. According to https://stackoverflow.com/questions/48155976/bazel-undeclared-inclusions-errors-after-updating-gcc, it is caused by not cleaning up the Bazel cache after upgrading GCC. Adding bazel clean --expunge still doesn't fix it, so I'm starting #5298 from a fork to avoid using the remote cache and see if that works.

@lsy323
Collaborator Author

lsy323 commented Jul 12, 2023

Hit the error this rule is missing dependency declarations. According to https://stackoverflow.com/questions/48155976/bazel-undeclared-inclusions-errors-after-updating-gcc, it is caused by not cleaning up the Bazel cache after upgrading GCC.

Hi @stgpetrovic, do you know how to clean the remote Bazel cache? I added bazel clean --expunge but it doesn't help. I started another PR, #5298, from a fork branch so that the remote cache is disabled, and the build passes without the remote cache.
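
For reference, a build can also skip the remote cache without forking; a minimal sketch, assuming the cache is wired through Bazel's standard remote-cache flags:

    # Do not read cached results and do not upload new ones for this build.
    bazel build //... --noremote_accept_cached --remote_upload_local_results=false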

For context, I'm trying to update the CI base image to the CUDA development container. The motivation is that we need to patch OpenXLA to compile in the current CI container, which is fairly old, and the patches grow over time, making each pin update more difficult than the last.

The PR for updating the CI container is #5290, but I'm hitting a GPU test failure in the new container there. As a smaller step, I'm updating GCC in the current CI container; this way we can also get rid of the patches.

cc @JackCaoG @will-cromar

@lsy323 lsy323 added the DO_NOT_MERGE_YET (For PRs which cannot be merged, despite tests passing) and DO_NOT_REVIEW_YET labels Jul 12, 2023
@stgpetrovic
Collaborator

Hey there,

you can change the cache key, so if it is cache-key-ci you can append -1 or the like. Generally there should be one per machine environment; I've even seen Docker hashes being used as keys. There is no issue in bumping the key now (it has a TTL).

Hope that helps

@lsy323
Collaborator Author

lsy323 commented Jul 13, 2023

Hey there,

you can change the cache key, so if it is cache-key-ci you can append -1 or the like. Generally there should be one per machine environment; I've even seen Docker hashes being used as keys. There is no issue in bumping the key now (it has a TTL).

Hope that helps

Hi @stgpetrovic, I found that the remote cache key is set here. I appended '-1' to it to see if that works.
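
For illustration, this is roughly what bumping a cache silo key looks like on the Bazel command line; the flag, key name, and endpoint here are assumptions, and the CI scripts may wire it differently:

    # Hypothetical: artifacts built under the old key (pre-GCC-upgrade) are simply never matched.
    bazel build //... \
      --remote_cache=https://remote-cache.example.com \
      --remote_default_exec_properties=cache-silo-key=cache-key-ci-1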

Also, a follow-up question for my own education: is there a TTL on all build caches? Does that mean one CI run will periodically be slower because the cache has expired?

Another question: if we know the cache key, is there a way to manually remove the remote cache via a command?

Thanks!

@lsy323 lsy323 marked this pull request as ready for review July 13, 2023 15:02
@lsy323
Collaborator Author

lsy323 commented Jul 13, 2023

Good news: CI passed with the upgraded GCC and without the statusor patch. Let me try to remove more patches before merging. We can get rid of the CUDA patch and the Triton patch, which didn't compile earlier.
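
As a sketch of what removing a patch involves (the directory and file name below are assumptions, not necessarily the repository's actual layout):

    # Hypothetical: drop the patch file and the workspace entry that applies it.
    git rm openxla_patches/statusor.diff
    # ...then delete the matching entry from the patches list in the Bazel workspace.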

Collaborator

@JackCaoG JackCaoG left a comment

nice!

@lsy323
Collaborator Author

lsy323 commented Jul 13, 2023

CI failed again due to the remote cache key; we need to use a new cache key for each GCC version.

@lsy323 lsy323 added the CI (CI related change) label and removed the DO_NOT_MERGE_YET (For PRs which cannot be merged, despite tests passing) and DO_NOT_REVIEW_YET labels Jul 13, 2023
@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

Even though the CI passed earlier, it hits an error:

2023-07-16T16:26:35.0012661Z Traceback (most recent call last):
2023-07-16T16:26:35.0013104Z   File "setup.py", line 53, in <module>
2023-07-16T16:26:35.0013524Z     from torch.utils.cpp_extension import BuildExtension
2023-07-16T16:26:35.0014179Z   File "/opt/conda/lib/python3.8/site-packages/torch/__init__.py", line 236, in <module>
2023-07-16T16:26:35.0014597Z     from torch._C import *  # noqa: F403
2023-07-16T16:26:35.0015271Z ImportError: /opt/conda/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

It should be the same issue as pytorch/pytorch#105248. However, removing only /opt/conda/lib/libstdc++.so.6 doesn't work in our case; we need to remove all libstdc++ shared library references in conda. Then the build passes.
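
A hedged sketch of that workaround, with illustrative paths (the actual CI script may do this differently):

    # Remove conda's bundled libstdc++ copies so the loader falls back to the system
    # libstdc++ that ships with the newer GCC.
    find /opt/conda -name 'libstdc++.so*' -exec rm -f {} +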

The root cause is that installing gcc-11 also pulls in gcc-13-base, which introduces a mismatch between the libstdc++ installed with gcc-11 and the one in the conda environment. Not sure if this is expected; it may be a quirk of the Ubuntu package dependency graph.
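
A quick way to confirm the mismatch is to check which GLIBCXX symbol versions each libstdc++ copy provides (system path shown for Ubuntu x86_64; adjust as needed):

    strings /opt/conda/lib/libstdc++.so.6 | grep -c GLIBCXX_3.4.30              # 0 means the conda copy is too old
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep -c GLIBCXX_3.4.30   # copy installed with the new GCC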

@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

After fixing the build failure, the GPU test TestPjRtDistributedDataParallel.test_ddp_correctness_env_init hangs. I reverted the removal of the CUDA graph patch and the Triton patch, and then the GPU tests passed.
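
For reference, the restore is a plain revert of the removal commit (the sha appears in the commit list at the end of this PR):

    git revert f5015ff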

@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

Triggering TPU CI to see if upgrading GCC/G++ affects TPU, since TPU CI failed in #5290; running it here rules out the newer gcc/g++ as the cause.

@JackCaoG
Collaborator

@lsy323 Is this PR ready for review?

@lsy323 lsy323 requested a review from JackCaoG July 17, 2023 17:17
@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

@JackCaoG Yes, please review.

@JackCaoG
Collaborator

Mostly LGTM. Can you open a PyTorch PR like https://github.com/pytorch/pytorch/commits/main/.github/ci_commit_pins/xla.txt and update the XLA pin to this PR's latest commit? I want to make sure this change does not break upstream CI.
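
For illustration, the pin bump itself is a one-line file change on a pytorch/pytorch branch; a minimal sketch, with a placeholder standing in for this PR's latest commit sha:

    # Hypothetical commands in a pytorch/pytorch checkout; the placeholder is not a real sha.
    echo "<latest-commit-sha-from-this-PR>" > .github/ci_commit_pins/xla.txt
    git commit -am "Update xla commit pin for CI testing"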

@lsy323
Collaborator Author

lsy323 commented Jul 17, 2023

Mostly LGTM. Can you open a PyTorch PR like https://github.com/pytorch/pytorch/commits/main/.github/ci_commit_pins/xla.txt and update the XLA pin to this PR's latest commit? I want to make sure this change does not break upstream CI.

Created upstream PR pytorch/pytorch#105360 for testing

@lsy323
Collaborator Author

lsy323 commented Jul 18, 2023

Waiting for https://github.com/pytorch/pytorch/actions/runs/5580888287/jobs/10200378755?pr=105360 to finish.

Upstream CI passed with the pin, so we can merge.

@JackCaoG JackCaoG merged commit e7e1899 into master Jul 18, 2023
8 checks passed
khatwanimohit pushed a commit that referenced this pull request Jul 20, 2023
* update gcc version in ci image

* use gcc 9

* run bazel clean before compiling xla

* avoid remove nvcc

* do not remove old gcc which also removes nvcc

* change bazel cache key

* remove abslor patch

* update remote cache key for test script

* remove cuda graph patch and triton patch

* use gcc-8 to align with dev container

* update remote cache key for gcc-8

* Revert "update remote cache key for gcc-8"

This reverts commit ecd2964.

* Revert "use gcc-8 to align with dev container"

This reverts commit a93402f.

* try new cache key

* hack libstdc++ version

* hack libstdc++ version in test.sh

* Revert "hack libstdc++ version in test.sh"

This reverts commit dae89e8.

* rm all libstdc++ reference in conda

* Revert "remove cuda graph patch and triton patch"

This reverts commit f5015ff.
@lsy323 lsy323 self-assigned this Jul 25, 2023
@lsy323 lsy323 deleted the lsiyuan/update-ci-gcc branch July 28, 2023 00:21