
Pin update March 2024 #6677

Merged: lsy323 merged 11 commits into master from lsiyuan/pin-update on Mar 13, 2024
Conversation

lsy323 (Collaborator) commented on Mar 6, 2024

Update the XLA pin to HEAD (see the sketch after the summary for the general shape of such a pin).

Summary:

  • Update Bazel to 6.5.0.
  • Rename PJRT_Structure_Base to PJRT_Extension_Base to accommodate a change in XLA.
  • Generate custom_call to mhlo.uniform_quantize/dequantize to accommodate the change in the HLO->MHLO converter patch.
  • Reduce the batch size of several GPU tests to unblock the pin update from the OOM issue in the GPU tests (details in a comment below).
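For context, this is roughly what an OpenXLA pin looks like in the WORKSPACE file: a minimal Starlark sketch, with a placeholder commit hash rather than the actual pin from this PR.

```python
# Starlark (WORKSPACE) sketch. The hash below is a placeholder, shown only to
# illustrate the shape of a pin update, not the commit used in this PR.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

xla_hash = "0000000000000000000000000000000000000000"  # hypothetical pin

http_archive(
    name = "xla",
    strip_prefix = "xla-" + xla_hash,
    urls = ["https://github.com/openxla/xla/archive/" + xla_hash + ".tar.gz"],
)
```

Updating the pin means bumping `xla_hash` (plus the patch set under `openxla_patches/`) and then fixing whatever breaks downstream, which is what the rest of this thread tracks.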

lsy323 marked this pull request as draft on March 6, 2024
lsy323 self-assigned this on Mar 6, 2024
lsy323 (Collaborator, Author) commented on Mar 6, 2024

Hit the following error:

 File "/home/lsiyuan/.cache/bazel/_bazel_lsiyuan/9d8c0c9d904275861907f86bf4a21dbc/external/llvm-project/mlir/BUILD.bazel", line 40, column 7, in <toplevel>
                } | if_cuda_available(
Error: unsupported binary operation: dict | select

Need to upgrade the Bazel version to above 6.0.0; a sketch of the offending Starlark pattern is below.
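For reference, here is a minimal Starlark sketch of the pattern that trips older Bazel. It is a simplified stand-in for the llvm-project BUILD.bazel code, not the actual file contents:

```python
# Starlark sketch: merging a plain dict with a select() using the `|`
# operator. Older Bazel versions reject this with:
#   Error: unsupported binary operation: dict | select
defines = {
    "ENABLE_FOO": "1",
} | select({
    "//some:cuda_enabled": {"ENABLE_CUDA": "1"},
    "//conditions:default": {},
})
```

Hence the Bazel bump to 6.5.0 in this PR.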

lsy323 (Collaborator, Author) commented on Mar 6, 2024

Testing performance with the following command on a v4-8 TPU:

python test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1 --metrics_debug

After pin update

| Training Device=xla:0/1 Epoch=1 Step=2280 Loss=0.00135 Rate=425.92 GlobalRate=370.64 Time=23:07:35
| Training Device=xla:0/2 Epoch=1 Step=2280 Loss=0.00135 Rate=425.92 GlobalRate=371.41 Time=23:07:35
| Training Device=xla:0/3 Epoch=1 Step=2280 Loss=0.00135 Rate=425.92 GlobalRate=371.07 Time=23:07:35
| Training Device=xla:0/1 Epoch=1 Step=2300 Loss=0.00135 Rate=425.11 GlobalRate=371.04 Time=23:07:41
| Training Device=xla:0/2 Epoch=1 Step=2300 Loss=0.00135 Rate=425.12 GlobalRate=371.81 Time=23:07:41
| Training Device=xla:0/0 Epoch=1 Step=2300 Loss=0.00135 Rate=425.12 GlobalRate=371.80 Time=23:07:41
| Training Device=xla:0/3 Epoch=1 Step=2300 Loss=0.00135 Rate=425.12 GlobalRate=371.48 Time=23:07:41

Before pin update

| Training Device=xla:0/1 Epoch=1 Step=2260 Loss=0.00135 Rate=453.15 GlobalRate=401.45 Time=00:43:45
| Training Device=xla:0/0 Epoch=1 Step=2260 Loss=0.00135 Rate=453.15 GlobalRate=400.79 Time=00:43:45
| Training Device=xla:0/2 Epoch=1 Step=2260 Loss=0.00135 Rate=453.14 GlobalRate=400.77 Time=00:43:45
| Training Device=xla:0/1 Epoch=1 Step=2280 Loss=0.00135 Rate=456.66 GlobalRate=401.89 Time=00:43:50
| Training Device=xla:0/0 Epoch=1 Step=2280 Loss=0.00135 Rate=456.66 GlobalRate=401.23 Time=00:43:50
| Training Device=xla:0/2 Epoch=1 Step=2280 Loss=0.00135 Rate=456.67 GlobalRate=401.21 Time=00:43:50
| Training Device=xla:0/3 Epoch=1 Step=2280 Loss=0.00135 Rate=456.65 GlobalRate=401.81 Time=00:43:50
| Training Device=xla:0/3 Epoch=1 Step=2300 Loss=0.00135 Rate=458.70 GlobalRate=402.25 Time=00:43:56
| Training Device=xla:0/2 Epoch=1 Step=2300 Loss=0.00135 Rate=458.70 GlobalRate=401.66 Time=00:43:56
| Training Device=xla:0/1 Epoch=1 Step=2300 Loss=0.00135 Rate=458.69 GlobalRate=402.34 Time=00:43:56
| Training Device=xla:0/0 Epoch=1 Step=2300 Loss=0.00135 Rate=458.68 GlobalRate=401.68 Time=00:43:56

At first glance there is a perf regression after the pin update (Rate ~425 vs. ~456 before).

Update: the results above were from a debug build. Redone with a release build below, throughput is essentially identical before and after the pin update (Rate ~1794 in both runs), so there is no real regression.

After pin update

| Training Device=xla:0/1 Epoch=1 Step=2300 Loss=0.00135 Rate=1792.04 GlobalRate=1229.30 Time=02:04:04
| Training Device=xla:0/1 Epoch=1 Step=2320 Loss=0.00135 Rate=1794.55 GlobalRate=1232.65 Time=02:04:06
| Training Device=xla:0/2 Epoch=1 Step=2320 Loss=0.00135 Rate=1794.40 GlobalRate=1229.71 Time=02:04:06
| Training Device=xla:0/3 Epoch=1 Step=2320 Loss=0.00135 Rate=1794.37 GlobalRate=1239.40 Time=02:04:06
| Training Device=xla:0/0 Epoch=1 Step=2320 Loss=0.00135 Rate=1794.41 GlobalRate=1237.04 Time=02:04:06
| Training Device=xla:0/1 Epoch=1 Step=2340 Loss=0.00135 Rate=1795.36 GlobalRate=1235.96 Time=02:04:07
| Training Device=xla:0/2 Epoch=1 Step=2340 Loss=0.00135 Rate=1795.27 GlobalRate=1233.04 Time=02:04:07
| Training Device=xla:0/3 Epoch=1 Step=2340 Loss=0.00135 Rate=1795.29 GlobalRate=1242.69 Time=02:04:07
| Training Device=xla:0/0 Epoch=1 Step=2340 Loss=0.00135 Rate=1795.25 GlobalRate=1240.33 Time=02:04:07

Before pin update

| Training Device=xla:0/0 Epoch=1 Step=2300 Loss=0.00135 Rate=1792.56 GlobalRate=1229.77 Time=04:31:01
| Training Device=xla:0/1 Epoch=1 Step=2300 Loss=0.00135 Rate=1792.61 GlobalRate=1225.71 Time=04:31:01
| Training Device=xla:0/2 Epoch=1 Step=2300 Loss=0.00135 Rate=1792.51 GlobalRate=1235.31 Time=04:31:01
| Training Device=xla:0/3 Epoch=1 Step=2300 Loss=0.00135 Rate=1792.52 GlobalRate=1234.37 Time=04:31:01
| Training Device=xla:0/0 Epoch=1 Step=2320 Loss=0.00135 Rate=1794.64 GlobalRate=1233.12 Time=04:31:02
| Training Device=xla:0/2 Epoch=1 Step=2320 Loss=0.00135 Rate=1794.66 GlobalRate=1238.64 Time=04:31:02
| Training Device=xla:0/1 Epoch=1 Step=2320 Loss=0.00135 Rate=1794.68 GlobalRate=1229.07 Time=04:31:02
| Training Device=xla:0/3 Epoch=1 Step=2320 Loss=0.00135 Rate=1794.65 GlobalRate=1237.70 Time=04:31:02
| Training Device=xla:0/3 Epoch=1 Step=2340 Loss=0.00135 Rate=1794.95 GlobalRate=1240.99 Time=04:31:04
| Training Device=xla:0/2 Epoch=1 Step=2340 Loss=0.00135 Rate=1794.83 GlobalRate=1241.93 Time=04:31:04
| Training Device=xla:0/0 Epoch=1 Step=2340 Loss=0.00135 Rate=1794.70 GlobalRate=1236.43 Time=04:31:04
| Training Device=xla:0/1 Epoch=1 Step=2340 Loss=0.00135 Rate=1794.77 GlobalRate=1232.39 Time=04:31:04

lsy323 (Collaborator, Author) commented on Mar 7, 2024

The PT2E test failed because the converter patch is commented out for now. Move the XLA pin again after https://github.com/pytorch/xla/blob/master/openxla_patches/quant_dequant_converter.diff is upstreamed.

lsy323 (Collaborator, Author) commented on Mar 7, 2024

The following GPU tests hit OOM in CI after the pin update:

PJRT_DEVICE=CUDA torchrun --nnodes=1 --node_rank=0 --nproc_per_node=2 test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=16 --num_epochs=1 --num_steps=25 --model=resnet18

PJRT_DEVICE=CUDA python test/test_train_mp_imagenet_fsdp.py --fake_data --auto_wrap_policy type_based --use_small_fake_sample --num_epochs=1

PJRT_DEVICE=CUDA python test/test_train_mp_imagenet_fsdp.py --fake_data --use_nested_fsdp --use_small_fake_sample --num_epochs=1

Example error message:

E0000 00:00:1709794596.611335   87900 pjrt_stream_executor_client.cc:2804] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 5571021088 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:  134.46MiB
              constant allocation:         4B
        maybe_live_out allocation:  183.47MiB
     preallocated temp allocation:    5.19GiB
  preallocated temp fragmentation:   31.49MiB (0.59%)
                 total allocation:    5.42GiB
              total fragmentation:  119.12MiB (2.15%)
Peak buffers:
	Buffer 1:
		Size: 196.00MiB
		XLA Label: custom-call
		Shape: f32[64,256,56,56]
		==========================
...
	Buffer 15:
		Size: 98.00MiB
		XLA Label: fusion
		Shape: f32[64,128,56,56]
		==========================

[Review thread on WORKSPACE: resolved]
lsy323 marked this pull request as ready for review on March 13, 2024
lsy323 (Collaborator, Author) commented on Mar 13, 2024

cc @will-cromar for some PJRT changes that accommodate the changed PJRT interface in upstream XLA.

lsy323 (Collaborator, Author) commented on Mar 13, 2024

Thanks @sdasgup3 for pointing out that we need to generate custom calls to mhlo.uniform_quantize/dequantize to accommodate the incoming stablehlo.uniform_quantize/dequantize handling in the HLO->MHLO converter. Noting it here for reference; a sketch of the PT2E flow that exercises this path is below.
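For anyone reproducing the PT2E path, here is a minimal sketch of the kind of flow that exercises this lowering. The model, quantizer config, and shapes are illustrative assumptions, not the actual CI test:

```python
# Hypothetical PT2E quantization sketch (PyTorch ~2.2-era APIs). TinyModel and
# the shapes are placeholders, not the real PT2E test in CI.
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# Capture, annotate, calibrate, convert. convert_pt2e inserts the
# quantize/dequantize ops that torch_xla then lowers to custom calls
# targeting mhlo.uniform_quantize / mhlo.uniform_dequantize.
captured = capture_pre_autograd_graph(model, example_inputs)
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # one calibration pass
quantized = convert_pt2e(prepared)

# Export to StableHLO via torch_xla and inspect the text; the quantized ops
# should show up as custom calls with the mhlo.uniform_* targets.
from torch_xla.stablehlo import exported_program_to_stablehlo

exported = torch.export.export(quantized, example_inputs)
shlo = exported_program_to_stablehlo(exported)
print(shlo.get_stablehlo_text())
```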

yeounoh (Contributor) left a comment
LGTM

lsy323 merged commit 82b5ed3 into master on Mar 13, 2024, with 19 checks passed.
lsy323 deleted the lsiyuan/pin-update branch on March 13, 2024.
lsy323 added a commit that referenced this pull request on Mar 13, 2024
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
lsy323 added a commit that referenced this pull request on Mar 13, 2024
Co-authored-by: Siyuan Liu <lsiyuan@google.com>