Set --xla_latency_hiding_scheduler_rerun to 1 #5736
Conversation
Summary: This flag reruns the latency hiding scheduler if the default shared memory limit of 95% leads to OOM. Each rerun chooses a value 0.9x of the previous run, and the number of reruns is set to 1 for now. Shared memory limit refers to --xla_tpu_scheduler_percent_shared_memory_limit. A lower shared memory limit means less communication and computation overlapping, and thus worse performance. Test Plan: Tested on Llama 2 7B on v4-32.
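To make the rerun behavior concrete, here is a small illustrative sketch of the limit schedule described above: starting from the default 95%, each rerun scales the limit by 0.9x. This is only the arithmetic from the summary, not the XLA scheduler's actual code.

```python
# Illustrative sketch of the limit schedule described in the summary;
# this is not the XLA scheduler's actual implementation.
DEFAULT_LIMIT = 0.95  # default --xla_tpu_scheduler_percent_shared_memory_limit
RERUN_SCALE = 0.9     # each rerun uses 0.9x of the previous limit


def limit_after_reruns(num_reruns: int) -> float:
    """Shared memory limit after a given number of scheduler reruns."""
    return DEFAULT_LIMIT * (RERUN_SCALE ** num_reruns)


print(limit_after_reruns(0))  # 0.95  (initial attempt)
print(limit_after_reruns(1))  # 0.855 (the single rerun allowed by this PR)
```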
lgtm, let's wait for the manual run to pass and then merge this PR.
# Lower shared memory limit means less communication and computation overlapping,
# and thus worse performance.
flags = _set_missing_flags(flags,
                           (('xla_latency_hiding_scheduler_rerun', '1'),))
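For context, a helper like `_set_missing_flags` presumably only adds a flag when the user has not already set it. The sketch below is an assumption about that behavior, not the actual torch_xla implementation.

```python
# Hypothetical sketch of a helper like _set_missing_flags; the real
# torch_xla implementation may differ. It appends a default --name=value
# flag only when the user has not already set that flag.
def _set_missing_flags(flags, sets):
    for name, value in sets:
        if not any(f'--{name}' in flag for flag in flags):
            flags.append(f'--{name}={value}')
    return flags


# Usage mirroring the diff above: the rerun flag is filled in only if absent.
flags = ['--xla_tpu_scheduler_percent_shared_memory_limit=90']
flags = _set_missing_flags(flags,
                           (('xla_latency_hiding_scheduler_rerun', '1'),))
print(flags)
```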
Nice! I was actually wondering about this: we set XLA_FLAGS here, but in other cases we pass the flags through LIBTPU_INIT_ARGS. Do you know if there's a difference? If you saw the appropriate log output from the test, it seems both work...
Oh right, sorry I missed this. For everything in compiler/xla/xla.proto we use XLA_FLAGS. xla_latency_hiding_scheduler_rerun is one of those TPU-specific flags that needs to be passed in with LIBTPU_INIT_ARGS.
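Based on that split, a minimal sketch of wiring both environment variables before the runtime starts might look like the following. The flag names come from this thread; --xla_dump_to is used only as an illustrative xla.proto flag, and whether these particular flags suit your workload is an assumption.

```python
import os

# TPU-specific flags, like the one added in this PR, go through LIBTPU_INIT_ARGS.
os.environ['LIBTPU_INIT_ARGS'] = '--xla_latency_hiding_scheduler_rerun=1'

# Flags defined in compiler/xla/xla.proto go through XLA_FLAGS;
# --xla_dump_to is shown here only as an example of an xla.proto flag.
os.environ['XLA_FLAGS'] = '--xla_dump_to=/tmp/xla_dump'

# Both variables must be set before torch_xla initializes the TPU runtime.
import torch_xla.core.xla_model as xm  # noqa: E402

device = xm.xla_device()
```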
Good catch. I missed this, lol.
Ah, makes sense that LIBTPU_INIT_ARGS would be TPU-specific, lol. Thanks @JackCaoG. Is there a rule of thumb to tell which flag goes where? I'm thinking in terms of hashing the compilation environment; I suppose we'll just need to ensure both env vars are included.
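For the compilation-environment hashing mentioned here, a minimal sketch that folds both environment variables into a cache key might look like this. The helper name and key layout are assumptions, not an existing torch_xla API.

```python
import hashlib
import os


def compilation_env_fingerprint() -> str:
    """Hypothetical helper: hash both env vars that carry compiler flags."""
    parts = [
        os.environ.get('XLA_FLAGS', ''),
        os.environ.get('LIBTPU_INIT_ARGS', ''),
    ]
    return hashlib.sha256('\x00'.join(parts).encode()).hexdigest()


print(compilation_env_fingerprint())
```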
LGTM, thanks Jiewen!
Thanks for the review, Jon.
Summary:
This flag reruns the latency hiding scheduler if the default
shared memory limit of 95% leads to OOM. Each rerun chooses a value
0.9x of the previous run, and the number of reruns is set to 1 for now.
Shared memory limit refers to --xla_tpu_scheduler_percent_shared_memory_limit.
A lower shared memory limit means less communication and computation overlapping,
and thus worse performance.

Test Plan:
Tested on Llama 2 7B on v4-32.