
Open XLA pin update #5675

Merged: 7 commits from hanq/pin_update into master on Oct 11, 2023
Conversation

@qihqi (Collaborator) commented Oct 4, 2023

No description provided.

@@ -1,19 +0,0 @@
upstream CI will fail without this

Collaborator:

Do you know why we were able to remove this patch? Is it because we updated the compiler in the CI?

Collaborator:

I think we need to kick off an upstream CI build targeting this branch and see whether CI passes.

Collaborator (Author):

Yeah, it turns out I do still need those patches; otherwise the training job hangs.

@@ -1,14 +0,0 @@
diff --git a/xla/service/gpu/gpu_executable.cc b/xla/service/gpu/gpu_executable.cc

Collaborator:

Same question as above.

WORKSPACE (outdated)

  ],
- strip_prefix = "xla-97a5f819faf9ff793b7ba68ff1f31f74f9459c18",
+ strip_prefix = "xla-7a19856d74569fd1f765cd03bdee84e3b1fdc579",
Collaborator:

Can you also update the libtpu dependency in setup.py to the same date as this commit?

Collaborator (Author):

Done.

@qihqi (Collaborator, Author) commented Oct 5, 2023

Tested on a v4-8 with the following command:

LD_LIBRARY_PATH=/home/hanq/miniconda3/envs/py310/lib python3 test/test_train_mp_imagenet.py --model=resnet50 --fake_data --num_epochs=10 --log_steps=300 --profile --use_optimized_kwargs=tpuv4 --drop_last

Results:

Old:
| Training Device=xla:0/3 Epoch=1 Step=1800 Loss=0.00135 Rate=1833.71 GlobalRate=918.89 Time=17:20:14
| Training Device=xla:0/1 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.82 GlobalRate=986.79 Time=17:20:35
| Training Device=xla:0/3 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.82 GlobalRate=990.06 Time=17:20:35
| Training Device=xla:0/0 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.81 GlobalRate=982.20 Time=17:20:35
| Training Device=xla:0/2 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.79 GlobalRate=989.61 Time=17:20:35

===
New (throughput comparable to the old pin):
| Training Device=xla:0/3 Epoch=1 Step=1500 Loss=0.00138 Rate=1803.73 GlobalRate=822.80 Time=18:09:52
| Training Device=xla:0/2 Epoch=1 Step=1500 Loss=0.00138 Rate=1803.72 GlobalRate=821.27 Time=18:09:52
| Training Device=xla:0/1 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=911.50 Time=18:10:12
| Training Device=xla:0/3 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=906.47 Time=18:10:12
| Training Device=xla:0/0 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=910.19 Time=18:10:12
| Training Device=xla:0/2 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.63 GlobalRate=904.92 Time=18:10:12
| Training Device=xla:0/3 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.96 GlobalRate=977.43 Time=18:10:33
| Training Device=xla:0/0 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.97 GlobalRate=981.14 Time=18:10:33
| Training Device=xla:0/2 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.97 GlobalRate=975.89 Time=18:10:33
| Training Device=xla:0/1 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.96 GlobalRate=982.45 Time=18:10:33

"@tsl//tsl/platform:casts",
"@tsl//tsl/platform:errors",
- ] + if_cuda([
+ ] + if_cuda_or_rocm([
@ManfeiBai (Collaborator) commented Oct 5, 2023

Thanks!

This patch looks like it corresponds to openxla/xla@9938bdb, so I'm curious why the modification to load("//xla/stream_executor:build_defs.bzl", "if_cuda_or_rocm", "if_gpu_is_configured") was skipped.

Also, GPU CI failed with the same issue (RuntimeError: torch_xla/csrc/device.cpp:72 : Invalid device specification: CUDA:0); is that related too?

Collaborator (Author):

No particular reason. I started the import on Oct 3, and that change landed on Oct 4.
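To make the shape of this change concrete: if_cuda_or_rocm comes from the build_defs.bzl load mentioned above and wraps GPU-only deps so they are included when either a CUDA or a ROCm toolchain is configured, rather than only CUDA. A rough sketch, where the target name, source file, and the GPU dep label are illustrative placeholders rather than the actual BUILD contents:

load("//xla/stream_executor:build_defs.bzl", "if_cuda_or_rocm")

cc_library(
    name = "example_gpu_helper",        # placeholder target name
    srcs = ["example_gpu_helper.cc"],   # placeholder source
    deps = [
        "@tsl//tsl/platform:casts",
        "@tsl//tsl/platform:errors",
    ] + if_cuda_or_rocm([
        # Added only when a CUDA or ROCm toolchain is configured.
        "//xla/stream_executor/gpu:example_gpu_dep",  # placeholder label
    ]),
)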

@qihqi force-pushed the hanq/pin_update branch 5 times, most recently from 6c59c2c to 3f57cd1, on October 6, 2023 02:46
@alanwaketan self-requested a review on October 9, 2023 17:52
@qihqi force-pushed the hanq/pin_update branch 2 times, most recently from b97aa10 to 2dc72ab, on October 10, 2023 20:24
WORKSPACE

  ],
- strip_prefix = "xla-97a5f819faf9ff793b7ba68ff1f31f74f9459c18",
+ strip_prefix = "xla-51b59cfb1999c6f1b3ec59851675044b2c502aae",

Collaborator:

Thanks for moving the head to this commit!

setup.py (outdated)

@@ -72,7 +72,7 @@

  base_dir = os.path.dirname(os.path.abspath(__file__))

- _libtpu_version = '0.1.dev20230825'
+ _libtpu_version = '0.1.dev20231009'

Collaborator:

I suspect this should be 0.1.dev20231010 in order to include the OpenXLA commit you specified.

Collaborator (Author):

Done.
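For reference, the libtpu pin is a dated nightly build, so the date in _libtpu_version has to be late enough that the nightly actually contains the pinned OpenXLA commit; hence the suggestion to bump from dev20231009 to dev20231010. A minimal Python sketch of how such a version string typically maps onto a pip requirement (the libtpu-nightly requirement format here is an assumption about the packaging, not a quote from this setup.py):

# Sketch only: assumes the nightly is published as the pip package "libtpu-nightly".
_libtpu_version = '0.1.dev20231010'
_libtpu_requirement = 'libtpu-nightly==' + _libtpu_version

print(_libtpu_requirement)  # -> libtpu-nightly==0.1.dev20231010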

@alanwaketan (Collaborator) left a comment

LGTM. Let me enable TPU CI and wait until it finishes.

@qihqi merged commit 418c751 into master on Oct 11, 2023
19 checks passed
zpcore pushed a commit that referenced this pull request Oct 19, 2023
Open XLA pin update - updated to 20231010
ghpvnist pushed a commit to ghpvnist/xla that referenced this pull request Oct 31, 2023
Open XLA pin update - updated to 20231010
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
Open XLA pin update - updated to 20231010
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
Open XLA pin update - updated to 20231010
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
Open XLA pin update - updated to 20231010
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
Open XLA pin update - updated to 20231010
@qihqi deleted the hanq/pin_update branch on April 29, 2024 21:18
Labels: none
Projects: none
5 participants