
tpu ci module refactor #7

Merged
286 commits, merged on Nov 16, 2023
Commits
65a43e4
Test commit
mbzomowski Sep 26, 2023
47025b3
Test commit
mbzomowski Sep 26, 2023
45ddc66
Test commit
mbzomowski Sep 26, 2023
a7cef9a
Test commit
mbzomowski Sep 26, 2023
8c07126
Add terraform config files to manage tpu-ci infrastructure
mbzomowski Sep 27, 2023
75441f6
Moved PAT to GH secrets
mbzomowski Sep 27, 2023
b352ecf
Testing flux
mbzomowski Sep 27, 2023
c6ded80
Merge pull request #1 from mbzomowski/test_branch
mbzomowski Sep 27, 2023
eb64c4d
Testing tf-controller
mbzomowski Sep 27, 2023
9da183e
Test tf-controller
mbzomowski Sep 27, 2023
957999a
Merge pull request #3 from mbzomowski/test_branch
mbzomowski Sep 27, 2023
cc868af
Test new config
mbzomowski Sep 27, 2023
e8820bd
Test reduction in nodes
mbzomowski Sep 27, 2023
bd992eb
Merge pull request #4 from mbzomowski/test_branch
mbzomowski Sep 27, 2023
bd5621a
Update tf-controller
mbzomowski Sep 28, 2023
a39cf6e
Remove branch planner from tf-controller
mbzomowski Sep 29, 2023
0ba0850
Merge pull request #5 from mbzomowski/test_branch
mbzomowski Sep 29, 2023
7b7fdaa
Fix kubernetes provider
mbzomowski Oct 3, 2023
a28a9d9
Merge pull request #6 from mbzomowski/test_branch
mbzomowski Oct 3, 2023
0498545
Fix kubernetes provider
mbzomowski Oct 3, 2023
921ba77
Add rolebinding and role for tf-runner user
mbzomowski Oct 3, 2023
fa8a7d6
Fix kubernetes provider
mbzomowski Oct 3, 2023
7183c4e
Fix kubernetes provider
mbzomowski Oct 3, 2023
f8f13f4
Fix kubernetes provider
mbzomowski Oct 3, 2023
80ba824
Add debug logging to tf-controller pods
mbzomowski Oct 3, 2023
4cdc8f2
Remove env var from tf-controller
mbzomowski Oct 3, 2023
d56cdcc
Fix kubernetes provider
mbzomowski Oct 3, 2023
b9029bb
Fix kubernetes provider
mbzomowski Oct 3, 2023
be51922
Fix kubernetes provider
mbzomowski Oct 3, 2023
067ac68
Fix kubernetes provider
mbzomowski Oct 3, 2023
a260990
Fix kubernetes provider
mbzomowski Oct 3, 2023
34226ea
Add workload identity for cluster
mbzomowski Oct 4, 2023
0491610
Remove rolebinding
mbzomowski Oct 4, 2023
2b2e666
Testing to see if gitops finally works
mbzomowski Oct 4, 2023
d5a553d
Fix service account name in runnerpod
mbzomowski Oct 4, 2023
4ecff99
Fix ksa in runnerpod
mbzomowski Oct 4, 2023
1a859ef
Fixed ksa name again
mbzomowski Oct 4, 2023
04bea37
Added role and rolebinding for ksa secret access
mbzomowski Oct 4, 2023
5e81fab
Please let this work
mbzomowski Oct 4, 2023
57e4271
Added namespace to role
mbzomowski Oct 4, 2023
d866794
Testing gitops
mbzomowski Oct 4, 2023
de5e3be
Add leases permissions for ksa
mbzomowski Oct 4, 2023
be42716
Add delete secrets permission for ksa
mbzomowski Oct 4, 2023
61f458f
Add refresh before apply for tf object
mbzomowski Oct 4, 2023
41308fa
Migrate tf state to gcs
mbzomowski Oct 4, 2023
8f30c08
Add backend config for tf object
mbzomowski Oct 4, 2023
1022fc2
Fix typo in backendconfig
mbzomowski Oct 4, 2023
3fd28b1
Remove ignore for secrets
mbzomowski Oct 4, 2023
b1dafe7
Add back secrets to ignore and set destroy to false in tf-object
mbzomowski Oct 5, 2023
fb2c826
Removed secret.tf file
mbzomowski Oct 5, 2023
8553f58
Set tf-controller pods to 2
mbzomowski Oct 5, 2023
9085244
Change workflow to setup pytorch/xla from most recent commit
mbzomowski Oct 5, 2023
e7be023
Change workflow
mbzomowski Oct 5, 2023
9218c49
Testing workflow
mbzomowski Oct 5, 2023
c16f075
Testing workflow
mbzomowski Oct 5, 2023
7f7cf4b
Testing workflow
mbzomowski Oct 5, 2023
c9b1221
Test workflow
mbzomowski Oct 5, 2023
184ec17
Test workflow
mbzomowski Oct 5, 2023
bf75e0d
Test workflow
mbzomowski Oct 5, 2023
721827c
Test workflow
mbzomowski Oct 5, 2023
d245af0
Test workflow
mbzomowski Oct 5, 2023
85e3969
Test workflow
mbzomowski Oct 5, 2023
33e379b
Test workflow
mbzomowski Oct 5, 2023
c2d8707
Test workflow
mbzomowski Oct 5, 2023
95a3d27
Test workflow
mbzomowski Oct 5, 2023
38998db
Test workflow
mbzomowski Oct 5, 2023
802975e
Test workflow
mbzomowski Oct 5, 2023
5ee1a55
Test workflow
mbzomowski Oct 5, 2023
190d311
Test workflow
mbzomowski Oct 5, 2023
0ab30ee
Test workflow
mbzomowski Oct 5, 2023
e53136d
Test workflow
mbzomowski Oct 5, 2023
4f35ede
Test workflow
mbzomowski Oct 5, 2023
21b2777
Test workflow; sleeping runner
mbzomowski Oct 6, 2023
a873caf
Test workflow
mbzomowski Oct 6, 2023
505755e
Test workflow
mbzomowski Oct 6, 2023
8ff9f72
Test workflow
mbzomowski Oct 6, 2023
c51ccbc
Test workflow
mbzomowski Oct 6, 2023
1370e7e
Test workflow
mbzomowski Oct 6, 2023
99a022a
Test workflow
mbzomowski Oct 6, 2023
3e7c78c
Test workflow
mbzomowski Oct 6, 2023
d92a1ae
Test workflow
mbzomowski Oct 6, 2023
5d606aa
Test workflow
mbzomowski Oct 6, 2023
62ebef3
Test workflow
mbzomowski Oct 6, 2023
7b4dad5
Test workflow
mbzomowski Oct 6, 2023
c279670
Test workflow
mbzomowski Oct 6, 2023
c87a8c1
Test workflow
mbzomowski Oct 6, 2023
35b34f5
Test workflow
mbzomowski Oct 6, 2023
c254705
Test workflow
mbzomowski Oct 6, 2023
fac0a7c
Test workflow
mbzomowski Oct 6, 2023
8c96311
Test workflow
mbzomowski Oct 6, 2023
470d600
Test workflow
mbzomowski Oct 6, 2023
1e93431
Test workflow
mbzomowski Oct 6, 2023
2f8d26f
Test workflow
mbzomowski Oct 6, 2023
97da938
Test workflow
mbzomowski Oct 6, 2023
b52d0b6
Test workflow
mbzomowski Oct 6, 2023
a7bffcb
Test workflow
mbzomowski Oct 6, 2023
dab54b3
Test workflow
mbzomowski Oct 6, 2023
e8828f4
Changed runner image to custom image
mbzomowski Oct 6, 2023
85b0c7b
Test workflow
mbzomowski Oct 6, 2023
eeb3f9f
Test workflow
mbzomowski Oct 6, 2023
8b12b18
Test workflow
mbzomowski Oct 6, 2023
c541585
Test workflow
mbzomowski Oct 6, 2023
07209a6
Test workflow
mbzomowski Oct 6, 2023
75c7959
Test workflow
mbzomowski Oct 6, 2023
dabef74
Test workflow
mbzomowski Oct 7, 2023
5daf390
Test workflow
mbzomowski Oct 7, 2023
0a52f7b
Test workflow
mbzomowski Oct 7, 2023
b2c85f1
Test workflow
mbzomowski Oct 7, 2023
ca08638
Test workflow
mbzomowski Oct 7, 2023
f50f9c6
Test workflow
mbzomowski Oct 7, 2023
e31ae23
Test workflow
mbzomowski Oct 9, 2023
deecc31
Test workflow
mbzomowski Oct 9, 2023
833be5c
Test workflow
mbzomowski Oct 9, 2023
ed7400c
Test workflow
mbzomowski Oct 9, 2023
85f5365
Merge remote-tracking branch 'upstream/master'
mbzomowski Oct 9, 2023
08c3ce5
Test workflow
mbzomowski Oct 9, 2023
60e489b
Test workflow
mbzomowski Oct 9, 2023
d09b702
Test workflow
mbzomowski Oct 9, 2023
7a9e901
Test workflow
mbzomowski Oct 9, 2023
b4b723d
Test workflow
mbzomowski Oct 9, 2023
ba20f0a
Test workflow
mbzomowski Oct 9, 2023
44dedbf
Test workflow
mbzomowski Oct 9, 2023
388d0cb
Test workflow
mbzomowski Oct 9, 2023
5b3e916
Test workflow
mbzomowski Oct 10, 2023
84adb2e
Test workflow
mbzomowski Oct 10, 2023
8681a4e
Test workflow
mbzomowski Oct 10, 2023
ae3ecfd
Test workflow
mbzomowski Oct 10, 2023
9fb105d
Test workflow
mbzomowski Oct 10, 2023
f9d41d8
Test workflow
mbzomowski Oct 10, 2023
4c43385
Test workflow
mbzomowski Oct 10, 2023
3623a13
Test workflow
mbzomowski Oct 10, 2023
4cd6e1e
Test workflow
mbzomowski Oct 10, 2023
fa8516b
Test workflow
mbzomowski Oct 10, 2023
0eb2da3
Test workflow
mbzomowski Oct 10, 2023
c707fb1
Test workflow
mbzomowski Oct 10, 2023
648f50b
Test workflow
mbzomowski Oct 11, 2023
16f2e91
Test workflow
mbzomowski Oct 11, 2023
962bfad
Test workflow
mbzomowski Oct 11, 2023
509bb8c
Test workflow
mbzomowski Oct 11, 2023
59beccf
Test workflow
mbzomowski Oct 11, 2023
df47486
Test workflow
mbzomowski Oct 11, 2023
79f221f
Test workflow
mbzomowski Oct 11, 2023
388d856
Test workflow
mbzomowski Oct 11, 2023
ec958dc
Test workflow
mbzomowski Oct 11, 2023
61daabb
Test workflow
mbzomowski Oct 11, 2023
6b848fb
Test workflow
mbzomowski Oct 11, 2023
6f38d32
Test workflow
mbzomowski Oct 11, 2023
3156fb2
Test workflow
mbzomowski Oct 11, 2023
0c0e4c8
Test workflow
mbzomowski Oct 11, 2023
d0a8b2f
Test workflow
mbzomowski Oct 11, 2023
73c8715
Test workflow
mbzomowski Oct 11, 2023
cc08c80
Test workflow
mbzomowski Oct 11, 2023
6c8aa54
Test workflow
mbzomowski Oct 11, 2023
87a6f8d
Test workflow
mbzomowski Oct 11, 2023
4a02373
Test workflow
mbzomowski Oct 11, 2023
800f515
Test workflow
mbzomowski Oct 11, 2023
b8a71df
Test workflow
mbzomowski Oct 11, 2023
4920b4d
Test workflow
mbzomowski Oct 11, 2023
2f4e096
Test workflow
mbzomowski Oct 11, 2023
fa19f00
Test workflow
mbzomowski Oct 11, 2023
db4aaea
Test workflow
mbzomowski Oct 11, 2023
9c37b4d
Test workflow
mbzomowski Oct 11, 2023
652c1ba
Cleanup files from test runs
mbzomowski Oct 12, 2023
127b311
Test
mbzomowski Oct 13, 2023
ae8e2de
Test
mbzomowski Oct 13, 2023
342d184
Testing workflow
mbzomowski Oct 16, 2023
0e457ec
Clean up unused TPU CI files
mbzomowski Oct 20, 2023
e03ae3a
Scale tf-controller down to 1 pod
mbzomowski Oct 20, 2023
693b6cb
Refactored TPU CI into an ARC module
mbzomowski Nov 14, 2023
cfec375
Add second workflow job and fix repo url error
mbzomowski Nov 14, 2023
338b1ce
Small change to TF config formatting
mbzomowski Nov 15, 2023
571b180
Testing the speed of TPU node autoscaling
mbzomowski Nov 15, 2023
8419234
Update artifacts.auto.tfvars for CUDA 12.1 (#5683)
ManfeiBai Oct 9, 2023
4d8850d
Add API to assemble CPU shards to a sharded tensor (#5681)
jonb377 Oct 9, 2023
fe337ad
Initial commit for CheckpointManager (#5678)
jonb377 Oct 10, 2023
5cfe6f1
Fix `masked_fill` broadcasting. (#5688)
ysiraichi Oct 10, 2023
9f4235a
Conditionally set default TPU settings in `__init__.py` (#5696)
will-cromar Oct 10, 2023
572daad
Disable xla backend for SPMD (#5690)
jonb377 Oct 11, 2023
734315c
Add support for unused params (#5694)
qihqi Oct 11, 2023
642e026
Open XLA pin update (#5675)
qihqi Oct 11, 2023
6ca267a
Update CI image with dev container image (#5290)
lsy323 Oct 12, 2023
80cda6c
Support synchronous saving and loading in CheckpointManager (#5693)
jonb377 Oct 13, 2023
6209e06
Reduce unnecessary tensor allocation in Adam and AdamW (#5700)
baoleai Oct 13, 2023
fb53d22
Support async checkpointing through CheckpointManager (#5697)
jonb377 Oct 13, 2023
2930e46
Add --net=host and --shm-size=16g flag to the docker run command in G…
vanbasten23 Oct 13, 2023
919b348
Don't set $TPU_LIBRARY_PATH during import (#5698)
will-cromar Oct 16, 2023
8ffd7bd
pass bundle object to make_tf_function (#5708)
haozha111 Oct 16, 2023
67d00ad
Filter tensor arguments from traced model. (#5689)
ysiraichi Oct 16, 2023
7c01f2a
Update Troubleshooting with some sanity check example (#5705)
JackCaoG Oct 17, 2023
fab3072
Add multi-host GPU support (#5657)
vanbasten23 Oct 19, 2023
ce55d10
Add support for `_unsafe_index`. (#5707)
ysiraichi Oct 19, 2023
1a8ba58
Run decomp before processing (#5713)
qihqi Oct 20, 2023
2ea21d8
fix typo in TROUBLESHOOTING (#5716)
JackCaoG Oct 20, 2023
108e6fc
Remove GetTensorsFused since we don't have op-by-op anymore (#5718)
JackCaoG Oct 21, 2023
ec92c8c
When exporting StableHLO to SavedModel, also include the original var…
haozha111 Oct 22, 2023
e49176e
Update OpenXLA pin to 20231022 (#5720)
alanwaketan Oct 23, 2023
014e73f
Make sure printing XLA tensor only execute the HLO once (#5721)
JackCaoG Oct 24, 2023
445e85d
add op lowering for _prelu_kernel_backward (#5724)
zpcore Oct 24, 2023
16418fd
Patch RNG for better memory utilization in dropout layers (#5710)
yeounoh Oct 24, 2023
f418091
Ensure dist runtime is in sync before shutting down. (#5714)
vanbasten23 Oct 24, 2023
9c8108c
Make the git version appears in pip (#5728)
alanwaketan Oct 25, 2023
e9ccdc2
Fix the missing parameter error when running mp_imagenet with torchru…
vanbasten23 Oct 25, 2023
c7c2001
Promote in convolution (#5727)
qihqi Oct 25, 2023
92932d6
Revert "Don't set $TPU_LIBRARY_PATH during import (#5698)" (#5731)
alanwaketan Oct 25, 2023
b752c66
Set --xla_latency_hiding_scheduler_rerun to 1 (#5736)
alanwaketan Oct 26, 2023
b254f8d
Merge `--pjrt_distributed` flag with `--ddp` flag. (#5732)
will-cromar Oct 26, 2023
51f2b58
Mangle root scope TF variable name during tf.saved_model export (#5738)
lsy323 Oct 26, 2023
34ed6d4
Add Python 3.9 build for 2.1 release (#5744)
will-cromar Oct 26, 2023
fcb6323
Add doc for multinode GPU training. (#5704)
vanbasten23 Oct 27, 2023
63b2a47
Add information about on-going DTensor API in spmd.md (#5735)
yeounoh Oct 27, 2023
fedacf9
Update set_default_tensor_type to set_default_dtype (#5734)
JackCaoG Oct 30, 2023
722c06a
Support SPMD through the xla:// init_method (#5706)
jonb377 Oct 30, 2023
0955634
Remove master IP discovery test for MP (#5748)
jonb377 Oct 30, 2023
1a215e5
Add tooling to explain why a graph execution happens (#5723)
JackCaoG Oct 31, 2023
80e7d87
Fix XLA tensor storage device by using `XlaDeviceToAtenDevice`. (#5743)
ysiraichi Oct 31, 2023
50fd007
Destroy the ComputationClient when the program exits (#5750)
will-cromar Oct 31, 2023
6a08f88
Correct the multinode training doc (#5747)
vanbasten23 Oct 31, 2023
1386459
Support PreemptionSyncManager in XlaCoordinator (#5733)
jonb377 Nov 1, 2023
6460448
Transfer data directly to the device (#5752)
will-cromar Nov 2, 2023
898b36e
Revert "Transfer data directly to the device (#5752)" (#5765)
will-cromar Nov 2, 2023
51fca39
update doc to use PJRT_DEVICE=CUDA instead of PJRT_DEVICE=GPU (#5754)
vanbasten23 Nov 4, 2023
abb34a3
Remove `_unsafe_index` implementation. (#5769)
ysiraichi Nov 6, 2023
7ccc70f
Include Dynamo execution in the execution cause analysis (#5758)
JackCaoG Nov 7, 2023
c17f1f6
fix squeeze op lowering issue when dim is not in sorted order (#5751)
zpcore Nov 7, 2023
254c283
Support autocheckpointing (#5753)
jonb377 Nov 8, 2023
93f849a
fixing neuron import (#5775)
aws-kingrj Nov 8, 2023
83aef4b
Disable flaky cpp test (#5779)
JackCaoG Nov 8, 2023
03d902a
Refactor type conversion functions out of `//torch_xla/csrc:tensor` (…
will-cromar Nov 9, 2023
74f263a
Disable C++ test correctly (#5783)
JackCaoG Nov 9, 2023
25535ce
Use default method in stablehlo txt/bytecode getter of StableHLOGraph…
lsy323 Nov 10, 2023
452665b
Implement mark_sharding as a custom op to support dynamo spmd activat…
wonjoolee95 Nov 11, 2023
8285dfc
Remove some unused code from `csrc/runtime` (#5785)
will-cromar Nov 13, 2023
897e3b5
Make the pjrt gpu allocator configurable (#5759)
anw90 Nov 13, 2023
b9a045f
disable test_set temporarily (#5792)
yeounoh Nov 13, 2023
1d3c5b9
[SPMD] support DTensor API integration (#5776)
yeounoh Nov 13, 2023
6a39d05
Transfer data directly to the device (#5772)
will-cromar Nov 13, 2023
61bf7d4
Lower AtenFull op (#5781)
danielvegamyhre Nov 14, 2023
0845e97
Add GKE support and various usability improvements in CheckpointManag…
jonb377 Nov 14, 2023
cbdb18f
Use TPU profiler plugin (#5793)
will-cromar Nov 14, 2023
e795565
Record the lazy tracing time(C++) in metrics (#5757)
JackCaoG Nov 14, 2023
d16f8c1
Don't set $TPU_LIBRARY_PATH during import
will-cromar Nov 14, 2023
7fe9f76
Enable passing down dynamic dimensions from torch to XLA (#5790)
lsy323 Nov 15, 2023
6c5abef
Add python 3.11 trigger for 2.1 release build (#5810)
JackCaoG Nov 15, 2023
6e107d6
Use upstream XLA concurrency utilities (#5799)
will-cromar Nov 15, 2023
aabeb25
[SPMD] fix bug with XLAShardedTensor.__repr__ (#5807)
yeounoh Nov 16, 2023
559c0c8
Clean up tpu ci files
mbzomowski Nov 16, 2023
0aa0f11
Move tpu ci files to infra/
mbzomowski Nov 16, 2023
a324f07
Change workflow trigger to workflow_dispatch and schedule
mbzomowski Nov 16, 2023
45ed387
Fix workflow
mbzomowski Nov 16, 2023
ec05372
Testing workflow manual trigger
mbzomowski Nov 16, 2023
5 changes: 4 additions & 1 deletion .circleci/build.sh
@@ -41,14 +41,17 @@ apply_patches

python -c "import fcntl; fcntl.fcntl(1, fcntl.F_SETFL, 0)"

# We always build PyTorch without CUDA support.
export USE_CUDA=0
python setup.py install

sccache --show-stats

source $XLA_DIR/xla_env
export GCLOUD_SERVICE_KEY_FILE="$XLA_DIR/default_credentials.json"
export SILO_NAME='cache-silo-ci-gcc-11' # cache bucket for CI
export SILO_NAME='cache-silo-ci-dev-3.8_cuda_12.1' # cache bucket for CI
export BUILD_CPP_TESTS='1'
export TF_CUDA_COMPUTE_CAPABILITIES="sm_50,sm_70,sm_75,compute_80,$TF_CUDA_COMPUTE_CAPABILITIES"
build_torch_xla $XLA_DIR

popd
41 changes: 11 additions & 30 deletions .circleci/common.sh
@@ -92,27 +92,6 @@ function install_deps_pytorch_xla() {

sudo ln -s "$(command -v bazelisk)" /usr/bin/bazel

# Install gcc-11
sudo apt-get update
# Update ppa for GCC
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt update -y
sudo apt install -y gcc-11
sudo apt install -y g++-11
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-11 100

export NVCC_PREPEND_FLAGS='-ccbin /usr/bin/g++-11'

# Hack similar to https://github.com/pytorch/pytorch/pull/105227/files#diff-9e59213240d3b55d2ddc53c8c096db9eece0665d64f46473454f9dc0c10fd804
sudo rm /opt/conda/lib/libstdc++.so*

# Update gcov for test coverage
sudo update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-11 100
sudo update-alternatives --install /usr/bin/gcov-dump gcov-dump /usr/bin/gcov-dump-11 100
sudo update-alternatives --install /usr/bin/gcov-tool gcov-tool /usr/bin/gcov-tool-11 100

# Symnlink the missing cuda headers if exists
CUBLAS_PATTERN="/usr/include/cublas*"
if ls $CUBLAS_PATTERN 1> /dev/null 2>&1; then
@@ -148,16 +127,18 @@ function run_torch_xla_python_tests() {
else
./test/run_tests.sh

# GPU tests
# CUDA tests
if [ -x "$(command -v nvidia-smi)" ]; then
# These tests fail on GPU with 03/30 TF-pin update (https://github.com/pytorch/xla/pull/4840)
# These tests fail on CUDA with 03/30 TF-pin update (https://github.com/pytorch/xla/pull/4840)
PJRT_DEVICE=CUDA python test/test_train_mp_imagenet_fsdp.py --fake_data --use_nested_fsdp --use_small_fake_sample --num_epochs=1
# TODO(xiowei replace gpu with cuda): remove the test below with PJRT_DEVICE=GPU because PJRT_DEVICE=GPU is being deprecated.
PJRT_DEVICE=GPU python test/test_train_mp_imagenet_fsdp.py --fake_data --use_nested_fsdp --use_small_fake_sample --num_epochs=1
PJRT_DEVICE=GPU python test/test_train_mp_imagenet_fsdp.py --fake_data --auto_wrap_policy type_based --use_small_fake_sample --num_epochs=1
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=GPU python test/test_train_mp_imagenet_fsdp.py --fake_data --use_nested_fsdp --use_small_fake_sample --num_epochs=1
PJRT_DEVICE=CUDA python test/test_train_mp_imagenet_fsdp.py --fake_data --auto_wrap_policy type_based --use_small_fake_sample --num_epochs=1
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=CUDA python test/test_train_mp_imagenet_fsdp.py --fake_data --use_nested_fsdp --use_small_fake_sample --num_epochs=1
# Syncfree SGD optimizer tests
if [ -d ./torch_xla/amp/syncfree ]; then
echo "Running Syncfree Optimizer Test"
PJRT_DEVICE=GPU python test/test_syncfree_optimizers.py
PJRT_DEVICE=CUDA python test/test_syncfree_optimizers.py

# Following test scripts are mainly useful for
# performance evaluation & comparison among different
@@ -192,9 +173,9 @@ function run_torch_xla_cpp_tests() {
if [ "$USE_COVERAGE" != "0" ]; then
# TODO(yeounoh) shard the coverage testing
if [ -x "$(command -v nvidia-smi)" ]; then
PJRT_DEVICE=GPU test/cpp/run_tests.sh $EXTRA_ARGS -L""
PJRT_DEVICE=CUDA test/cpp/run_tests.sh $EXTRA_ARGS -L""
cp $XLA_DIR/bazel-out/_coverage/_coverage_report.dat /tmp/cov1.dat
PJRT_DEVICE=GPU test/cpp/run_tests.sh -X early_sync -F AtenXlaTensorTest.TestEarlySyncLiveTensors -L"" $EXTRA_ARGS
PJRT_DEVICE=CUDA test/cpp/run_tests.sh -X early_sync -F AtenXlaTensorTest.TestEarlySyncLiveTensors -L"" $EXTRA_ARGS
cp $XLA_DIR/bazel-out/_coverage/_coverage_report.dat /tmp/cov2.dat
lcov --add-tracefile /tmp/cov1.dat -a /tmp/cov2.dat -o /tmp/merged.dat
else
else
# Shard GPU testing
if [ -x "$(command -v nvidia-smi)" ]; then
PJRT_DEVICE=GPU test/cpp/run_tests.sh $EXTRA_ARGS -L""
PJRT_DEVICE=GPU test/cpp/run_tests.sh -X early_sync -F AtenXlaTensorTest.TestEarlySyncLiveTensors -L"" $EXTRA_ARGS
PJRT_DEVICE=CUDA test/cpp/run_tests.sh $EXTRA_ARGS -L""
PJRT_DEVICE=CUDA test/cpp/run_tests.sh -X early_sync -F AtenXlaTensorTest.TestEarlySyncLiveTensors -L"" $EXTRA_ARGS
else
PJRT_DEVICE=CPU test/cpp/run_tests.sh $EXTRA_ARGS -L""
fi
59 changes: 28 additions & 31 deletions .circleci/docker/Dockerfile
@@ -1,13 +1,13 @@
# This requires cuda & cudnn packages pre-installed in the base image.
# Other available cuda images are listed at https://hub.docker.com/r/nvidia/cuda
ARG base_image="nvidia/cuda:11.7.0-cudnn8-devel-ubuntu18.04"
ARG base_image="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.8_cuda_12.1"
FROM "${base_image}"

ARG python_version="3.8"
ARG cuda="1"
ARG cuda_compute="5.2,7.5"
ARG cc="clang-8"
ARG cxx="clang++-8"
ARG cc="clang"
ARG cxx="clang++"
ARG cxx_abi="1"
ARG tpuvm=""

@@ -37,38 +37,15 @@ ENV CXX "${cxx}"
# Whether to build for TPUVM mode
ENV TPUVM_MODE "${tpuvm}"

# Rotate nvidia repo public key (last updated: 04/27/2022)
# Unfortunately, nvidia/cuda image is shipped with invalid public key
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# Install base system packages
RUN apt-get clean && apt-get update
RUN apt-get upgrade -y
RUN apt-get install --fix-missing -y python-pip python3-pip git curl libopenblas-dev vim jq \
apt-transport-https ca-certificates procps openssl sudo wget libssl-dev libc6-dbg

# Install clang & llvm
ADD ./install_llvm_clang.sh install_llvm_clang.sh
RUN bash ./install_llvm_clang.sh

# Install clang as upstream CI forces clang
RUN apt-get install -y clang
# Install valgrind
ADD ./install_valgrind.sh install_valgrind.sh
COPY ./install_valgrind.sh install_valgrind.sh
RUN bash ./install_valgrind.sh

# Sets up jenkins user.
RUN useradd jenkins && \
mkdir /home/jenkins && \
chown jenkins /home/jenkins
RUN echo 'jenkins ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers

RUN mkdir -p /opt/conda /opt/cargo /opt/rustup /workspace /var/lib/jenkins && \
chown jenkins /opt/conda /opt/cargo /opt/rustup /workspace /var/lib/jenkins
USER jenkins
WORKDIR /workspace

# Install openmpi for CUDA
run sudo apt-get install -y ssh
run sudo apt-get install -y --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev
run apt-get install -y ssh
run apt-get install -y --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev

# Builds and configure sccache
ENV OPENSSL_INCLUDE_DIR /usr/include/openssl

ENV PATH $CARGO_HOME/bin:$PATH

# Upstream CI requires jq
RUN apt-get install -y jq

# TODO: Add exec permisson for all users in base image.
RUN chmod a+x /usr/local/bin/bazel
# TODO: move sudo installation in base image.
RUN apt-get install -y sudo

RUN useradd jenkins && \
mkdir /home/jenkins && \
chown jenkins /home/jenkins
RUN echo 'jenkins ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers

RUN mkdir -p /opt/conda /opt/cargo /opt/rustup /workspace /var/lib/jenkins && \
chown jenkins /opt/conda /opt/cargo /opt/rustup /workspace /var/lib/jenkins
ENV PATH /home/jenkins/.local/bin:$PATH
USER jenkins
WORKDIR /workspace

# Installs and configures Conda.
ADD ./install_conda.sh install_conda.sh
RUN sudo chown jenkins ./install_conda.sh
RUN echo "conda activate base" >> ~/.bashrc
RUN echo "export TF_CPP_LOG_THREAD_ID=1" >> ~/.bashrc
ENV PATH /opt/conda/bin:$PATH
ENV LD_LIBRARY_PATH /lib/x86_64-linux-gnu/:/usr/lib/x86_64-linux-gnu/:/opt/conda/lib/:$LD_LIBRARY_PATH

RUN bash -c "source ~/.bashrc"
CMD ["bash"]
7 changes: 2 additions & 5 deletions .circleci/docker/install_conda.sh
@@ -4,7 +4,7 @@ set -ex

PYTHON_VERSION=$1
CONDA_PREFIX=$2
DEFAULT_PYTHON_VERSION=3.7
DEFAULT_PYTHON_VERSION=3.8


function install_and_setup_conda() {
conda update -y -n base conda
conda install -y python=$PYTHON_VERSION

conda install -y nomkl numpy=1.18.5 pyyaml setuptools cmake \
conda install -y nomkl numpy=1.18.5 pyyaml setuptools \
cffi typing tqdm coverage hypothesis dataclasses cython

/usr/bin/yes | pip install mkl==2022.2.1
/usr/bin/yes | pip install --upgrade numba
/usr/bin/yes | pip install cloud-tpu-client
/usr/bin/yes | pip install expecttest==0.1.3
/usr/bin/yes | pip install ninja # Install ninja to speedup the build
# Using Ninja requires CMake>=3.13, PyTorch requires CMake>=3.18
/usr/bin/yes | pip install "cmake>=3.18" --upgrade
/usr/bin/yes | pip install absl-py
# Additional PyTorch requirements
/usr/bin/yes | pip install scikit-image scipy==1.6.3
2 changes: 1 addition & 1 deletion .circleci/docker/install_valgrind.sh
100644 → 100755
@@ -9,7 +9,7 @@ tar -xjf valgrind-${VALGRIND_VERSION}.tar.bz2
cd valgrind-${VALGRIND_VERSION}
./configure --prefix=/usr/local
make -j6
sudo make install
make install
cd ../../
rm -rf valgrind_build
alias valgrind="/usr/local/bin/valgrind"
2 changes: 1 addition & 1 deletion .circleci/test.sh
@@ -26,5 +26,5 @@ function install_torchvision() {
install_torchvision

export GCLOUD_SERVICE_KEY_FILE="$XLA_DIR/default_credentials.json"
export SILO_NAME='cache-silo-ci-gcc-11' # cache bucket for CI
export SILO_NAME='cache-silo-ci-dev-3.8_cuda_12.1' # cache bucket for CI
run_torch_xla_tests $PYTORCH_DIR $XLA_DIR $USE_COVERAGE
10 changes: 4 additions & 6 deletions .github/workflows/_build.yml
@@ -73,12 +73,12 @@ jobs:
# if image layers are not present in the repo.
# Note: disable the following 2 lines while testing a new image, so we do not
# push to the upstream.
docker tag "${GCR_DOCKER_IMAGE}" "${ECR_DOCKER_IMAGE_BASE}:v1.0" >/dev/null
docker push "${ECR_DOCKER_IMAGE_BASE}:v1.0" >/dev/null
docker tag "${GCR_DOCKER_IMAGE}" "${ECR_DOCKER_IMAGE_BASE}:v1.1-lite" >/dev/null
docker push "${ECR_DOCKER_IMAGE_BASE}:v1.1-lite" >/dev/null
- name: Start the container
shell: bash
run: |
pid=$(docker run -t -d -w "$WORKDIR" "${GCR_DOCKER_IMAGE}")
pid=$(docker run --privileged -t -d -w "$WORKDIR" "${GCR_DOCKER_IMAGE}")
docker exec -u jenkins "${pid}" sudo chown -R jenkins "${WORKDIR}"
docker cp "${GITHUB_WORKSPACE}/." "$pid:$WORKDIR"
echo "pid=${pid}" >> "${GITHUB_ENV}"
shell: bash
run: |
echo "declare -x SCCACHE_BUCKET=${SCCACHE_BUCKET}" | docker exec -i "${pid}" sh -c "cat >> env"
echo "declare -x CC=clang-8 CXX=clang++-8" | docker exec -i "${pid}" sh -c "cat >> xla_env"
echo "declare -x DISABLE_XRT=${DISABLE_XRT}" | docker exec -i "${pid}" sh -c "cat >> xla_env"
echo "declare -x XLA_CUDA=${XLA_CUDA}" | docker exec -i "${pid}" sh -c "cat >> xla_env"
echo "declare -x BAZEL_REMOTE_CACHE=1" | docker exec -i "${pid}" sh -c "cat >> xla_env"
- name: Build
shell: bash
run: |
docker exec -u jenkins "${pid}" bash -c ". ~/.bashrc && .circleci/build.sh"

docker exec --privileged -u jenkins "${pid}" bash -c ".circleci/build.sh"
- name: Cleanup build env
shell: bash
run: |
2 changes: 1 addition & 1 deletion .github/workflows/_coverage.yml
@@ -94,7 +94,7 @@ jobs:
- name: Test
shell: bash
run: |
docker exec -u jenkins "${pid}" bash -c '. ~/.bashrc && .circleci/${{ inputs.test-script }}'
docker exec -u jenkins "${pid}" bash -c '.circleci/${{ inputs.test-script }}'
- name: Upload coverage results
if: ${{ inputs.collect-coverage }}
shell: bash
2 changes: 1 addition & 1 deletion .github/workflows/_docs.yml
@@ -43,7 +43,7 @@ jobs:
echo "pid=${pid}" >> "${GITHUB_ENV}"
- name: Build & publish docs
shell: bash
run: docker exec -u jenkins "${pid}" bash -c '. ~/.bashrc && .circleci/doc_push.sh'
run: docker exec -u jenkins "${pid}" bash -c '.circleci/doc_push.sh'
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
2 changes: 1 addition & 1 deletion .github/workflows/_test.yml
@@ -116,7 +116,7 @@ jobs:
- name: Test
shell: bash
run: |
docker exec -u jenkins "${pid}" bash -c '. ~/.bashrc && .circleci/${{ inputs.test-script }}'
docker exec --privileged -u jenkins "${pid}" bash -c '.circleci/${{ inputs.test-script }}'
- name: Upload coverage results
if: ${{ inputs.collect-coverage }}
shell: bash
5 changes: 2 additions & 3 deletions .github/workflows/build_and_test.yml
@@ -19,8 +19,7 @@ jobs:
uses: ./.github/workflows/_build.yml
with:
ecr-docker-image-base: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base
gcr-docker-image: gcr.io/tpu-pytorch/xla_base:latest
disable_xrt: 1
gcr-docker-image: gcr.io/tpu-pytorch/xla_base:dev-3.8_cuda_12.1
cuda: 1
secrets:
gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
with:
docker-image: ${{ needs.build.outputs.docker-image }}
runner: linux.8xlarge.nvidia.gpu
timeout-minutes: 300
timeout-minutes: 180
disable-xrt: 1
secrets:
gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
4 changes: 2 additions & 2 deletions .github/workflows/build_and_test_xrt.yml
@@ -18,7 +18,7 @@ jobs:
uses: ./.github/workflows/_build.yml
with:
ecr-docker-image-base: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base
gcr-docker-image: gcr.io/tpu-pytorch/xla_base:latest
gcr-docker-image: gcr.io/tpu-pytorch/xla_base:dev-3.8_cuda_12.1
disable_xrt: 0
cuda: 1
secrets:
with:
docker-image: ${{ needs.build.outputs.docker-image }}
runner: linux.8xlarge.nvidia.gpu
timeout-minutes: 300
timeout-minutes: 180
disable-xrt: 0
secrets:
gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
27 changes: 27 additions & 0 deletions .github/workflows/tpu-ci.yml
@@ -0,0 +1,27 @@
name: TPU Test
run-name: CI Testing
on:
  workflow_dispatch:
  schedule:
    - cron: '0 16,20,0 * * 1-5'
jobs:
  tpu-test:
    runs-on: v4-runner-set
    steps:
      - run: |
          git clone --recursive https://github.com/pytorch/pytorch
          cd pytorch/
          python3 setup.py install --user
          git clone --recursive https://github.com/mbzomowski/xla.git
      - env:
          BAZEL_VERBOSE: 1
          BUNDLE_LIBTPU: 1
          TPUVM_MODE: 1
        run: |
          cd pytorch/xla
          python3 setup.py install --user
      - env:
          PJRT_DEVICE: TPU
        run: |
          cd pytorch/xla
          python3 -u test/test_operations.py -v
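The `schedule` trigger in this workflow uses cron syntax that GitHub Actions evaluates in UTC, so `'0 16,20,0 * * 1-5'` fires at 16:00, 20:00, and 00:00 UTC on weekdays. A small sketch converting those trigger hours to US Pacific time (the sample date and target timezone are illustrative assumptions, not part of the workflow):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The cron '0 16,20,0 * * 1-5' fires at minute 0 of hours 16, 20, and 0 UTC, Mon-Fri.
pacific = ZoneInfo("America/Los_Angeles")
local_times = []
for hour in (16, 20, 0):
    # Use an arbitrary weekday in November 2023 (PST, UTC-8) as the sample date.
    utc = datetime(2023, 11, 16, hour, tzinfo=timezone.utc)
    local_times.append(utc.astimezone(pacific).strftime("%H:%M"))
print(local_times)  # ['08:00', '12:00', '16:00']
```

On a Pacific-standard-time date, the three runs land at 08:00, 12:00, and 16:00 local time.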
2 changes: 1 addition & 1 deletion .kokoro/Dockerfile
@@ -47,7 +47,7 @@ ARG SCCACHE="$(which sccache)"

WORKDIR /pytorch/xla
ARG GCLOUD_SERVICE_KEY_FILE="/pytorch/xla/default_credentials.json"
ARG SILO_NAME='cache-silo-ci-gcc-11' # cache bucket for CI
ARG SILO_NAME='cache-silo-ci-dev-3.8_cuda_12.1' # cache bucket for CI
RUN time pip install -e .

# Run tests
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -94,7 +94,7 @@ To run the tests, follow __one__ of the options below:
* Run on GPU:

```Shell
export PJRT_DEVICE=GPU GPU_NUM_DEVICES=${NUM_GPU}
export PJRT_DEVICE=CUDA GPU_NUM_DEVICES=${NUM_GPU}
```

For more detail on configuring the runtime, please refer to [this doc](https://github.com/pytorch/xla/blob/master/docs/pjrt.md#quickstart)
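Runtime selection in the snippet above is driven entirely by environment variables. The helper below is a hypothetical sketch of that lookup for illustration only; it is not part of the torch_xla API:

```python
import os

def pjrt_device(default: str = "CPU") -> str:
    # Hypothetical helper: torch_xla consults PJRT_DEVICE at import time;
    # this merely mirrors the environment-variable lookup.
    return os.environ.get("PJRT_DEVICE", default)

# Equivalent of: export PJRT_DEVICE=CUDA GPU_NUM_DEVICES=4
os.environ["PJRT_DEVICE"] = "CUDA"
os.environ["GPU_NUM_DEVICES"] = "4"
print(pjrt_device())  # CUDA
```

Unsetting `PJRT_DEVICE` makes the helper fall back to its default, mirroring how an unconfigured shell would behave.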
2 changes: 1 addition & 1 deletion README.md
@@ -111,7 +111,7 @@ If you're using `DistributedDataParallel`, make the following changes:
Additional information on PyTorch/XLA, including a description of its semantics
and functions, is available at [PyTorch.org](http://pytorch.org/xla/). See the
[API Guide](API_GUIDE.md) for best practices when writing networks that run on
XLA devices (TPU, GPU, CPU and...).
XLA devices (TPU, CUDA, CPU and...).

Our comprehensive user guides are available at:
