
tpu ci module refactor #7

Merged
merged 286 commits into from
Nov 16, 2023
Conversation

mbzomowski
Collaborator

ysiraichi and others added 29 commits November 16, 2023 21:47
…h#5751)

* fix squeeze op lowering issue when dim is not in sorted order

* remove debug info

* remove debug info

* refactor BuildSqueezedDimensions
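The squeeze fix above concerns lowering when multiple dims arrive unsorted. A minimal Python sketch of why ordering matters when removing several dimensions (illustrative only, not the actual C++ `BuildSqueezedDimensions` lowering):

```python
def squeeze_dims(shape, dims):
    """Remove the size-1 dimensions listed in `dims` (given in any order)."""
    result = list(shape)
    # Delete in descending index order so earlier indices stay valid
    # after each removal; an unsorted pass would shift later indices.
    for d in sorted(dims, reverse=True):
        if result[d] == 1:
            del result[d]
    return result
```

With this ordering, `squeeze_dims((1, 3, 1, 5), [2, 0])` and `squeeze_dims((1, 3, 1, 5), [0, 2])` both yield `[3, 5]`.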
…ytorch#5777)

* Move pure dtype conversion functions to `dtype.cpp`

* remove comments

* better names

* fix includes

* formatting

* consolidate

* fix test build

* more explicit names

* remove extra line
…Module (pytorch#5745)

Co-authored-by: Siyuan Liu <lsiyuan@google.coim>
* delete nccl_distributed

* remove async_task

* remove unique

* Remove hashing

* more random cleanup

* formatting

* remove util.cc

* Revert "remove unique"

This reverts commit ebe4567.

* Use upstream Unique
* Make the pjrt gpu allocator configurable

* the default value changed from 0.9 to 0.75

* return default GpuAllocatorConfig

---------

Co-authored-by: wangang.wa <wangang.wa@alibaba-inc.com>
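The allocator commit above makes the PJRT GPU allocator configurable and lowers the default memory fraction from 0.9 to 0.75. A hedged sketch of what such a config might look like (field and function names are illustrative, not the actual C++ `GpuAllocatorConfig`):

```python
from dataclasses import dataclass

@dataclass
class GpuAllocatorConfig:
    """Illustrative stand-in for a configurable PJRT GPU allocator config."""
    preallocate: bool = True
    # Per the commit notes, the default fraction of GPU memory the
    # allocator may claim changed from 0.9 to 0.75.
    memory_fraction: float = 0.75

def default_allocator_config() -> GpuAllocatorConfig:
    # "return default GpuAllocatorConfig": callers who pass no overrides
    # get the new defaults.
    return GpuAllocatorConfig()
```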
* [SPMD] move SPMD package to torch_xla/experimental/spmd, introduce shadow xla DTensor API.

* support backward compatibility of the old imports

* Move spmd out of experimental

* Update spmd.md for distributed/spmd
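Moving the SPMD package while keeping the old imports working is a standard deprecation-shim pattern. A minimal sketch, assuming a wrapper at the old path that forwards to the new one and warns (names stand in for the torch_xla modules, they are not the exact API):

```python
import warnings

# New home (stand-in for torch_xla.distributed.spmd).
def mark_sharding(tensor, mesh, partition_spec):
    return (tensor, mesh, partition_spec)

# Old home (stand-in for the experimental path) keeps a thin wrapper so
# existing user code keeps working while nudging callers to migrate.
def mark_sharding_deprecated(*args, **kwargs):
    warnings.warn(
        "this module moved to torch_xla.distributed.spmd; update your imports",
        DeprecationWarning,
        stacklevel=2,
    )
    return mark_sharding(*args, **kwargs)
```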
* Transfer data directly to the device (pytorch#5752)

* Remove `populate_fn` from `TensorSource`

* Make TensorSource an interface

* Re-enable pjrt_computation_client_test

* server -> device

* add comment

* fix outbound data metric

* formatting

* implement byte_strides in TensorSource

* more formatting

* remove extra deps

* add missing deps

* Revert "server -> device"

This reverts commit 6384516.

* Use `at::Tensor`'s layout for byte strides

* Downcast at::Tensor if required

* formatting

* Simplify AtenSource

* fix build

* formatting

* fix typo that makes us ignore input type

* Revert "Simplify AtenSource"

This reverts commit 4225deb.

* Skip hanging test

* fix gil deadlock

* formatting
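The commits above replace `populate_fn` with a `TensorSource` interface exposing raw bytes and byte strides, so the runtime can transfer data directly to the device. A hedged Python sketch of the shape of that interface (the real one is C++; names and the row-major stride math are illustrative):

```python
from abc import ABC, abstractmethod

class TensorSource(ABC):
    """Interface-style source: exposes raw data plus byte strides."""

    @abstractmethod
    def data(self) -> bytes: ...

    @abstractmethod
    def byte_strides(self) -> list: ...

class AtenSource(TensorSource):
    """Stand-in for a source backed by an at::Tensor's own layout."""

    def __init__(self, shape, element_size, raw: bytes):
        self._shape = shape
        self._element_size = element_size
        self._raw = raw

    def data(self) -> bytes:
        return self._raw

    def byte_strides(self) -> list:
        # Dense row-major strides in bytes: innermost dim moves by one
        # element, each outer dim by the product of the inner extents.
        strides = []
        stride = self._element_size
        for dim in reversed(self._shape):
            strides.append(stride)
            stride *= dim
        return list(reversed(strides))
```

For a float32 tensor of shape `(2, 3)`, `byte_strides()` gives `[12, 4]`.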
* lower full

* update test for full op

* formatting
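"Lower full" above refers to lowering the `full` op (a tensor filled with a constant). Its semantics reduce to broadcasting a scalar to a shape; a pure-Python sketch of that meaning, not the XLA lowering itself:

```python
def full(shape, fill_value):
    """Build a nested list of the given shape filled with fill_value."""
    if not shape:
        return fill_value
    return [full(shape[1:], fill_value) for _ in range(shape[0])]
```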
…er (pytorch#5770)

* Add GKE support and various usability improvements in CheckpointManager

* Bug fix for async checkpointing fully sharded state dicts
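The async-checkpointing fix above touches the usual hazard in that pattern: the state dict must be snapshotted before the background write, or training updates race with serialization. A minimal sketch of the idea, with illustrative names rather than the CheckpointManager API:

```python
import copy
import threading

def async_save(state_dict, save_fn):
    """Snapshot state on the caller's thread, write it out in the background."""
    # Deep-copy first so later training-step mutations cannot corrupt
    # the bytes being written.
    snapshot = copy.deepcopy(state_dict)
    writer = threading.Thread(target=save_fn, args=(snapshot,))
    writer.start()
    return writer  # caller can join() before the next checkpoint
```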
* Record the lazy tracing time (C++) in metrics

* Delete torch_patches/.torch_pin
* port sandeep unbounded dynamism change
* Enable unbounded dynamism using env var, add more guards for unbounded dynamism code path

---------

Co-authored-by: Siyuan Liu <lsiyuan@google.coim>
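Gating the unbounded-dynamism code path behind an environment variable, as the commit above describes, is typically a read-once boolean flag. A sketch under that assumption; the variable name here is illustrative, not the one torch_xla actually reads:

```python
import os

def unbounded_dynamism_enabled() -> bool:
    """Illustrative env-var feature gate for an experimental code path."""
    return os.environ.get("EXPERIMENTAL_UNBOUNDED_DYNAMISM", "0") == "1"
```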
* Use TSL threadpool

* remove multiwait

* fix test build

* Move threadpool namespace

* formatting

* fix test build

* Use BlockingCounter
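The multiwait removal above replaces per-task waiting with a `BlockingCounter`, mirroring the absl/TSL primitive: schedule N tasks, block until all N report done. A Python sketch of that synchronization pattern (the project's actual code uses the C++ TSL types):

```python
import threading

class BlockingCounter:
    """Blocks wait() until decrement_count() has been called `count` times."""

    def __init__(self, count: int):
        self._count = count
        self._cv = threading.Condition()

    def decrement_count(self):
        with self._cv:
            self._count -= 1
            if self._count == 0:
                self._cv.notify_all()

    def wait(self):
        with self._cv:
            while self._count > 0:
                self._cv.wait()
```

Typical use: hand each pooled task a reference to the counter, have it call `decrement_count()` on completion, and `wait()` on the scheduling thread.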
@mbzomowski mbzomowski merged commit 26b52c3 into master Nov 16, 2023
1 of 2 checks passed