
Use upstream XLA concurrency utilities #5799

Merged: 7 commits merged into master on Nov 15, 2023

Conversation

@will-cromar (Collaborator) commented on Nov 14, 2023

@JackCaoG and I have both run into cases where our existing concurrency utilities add significant overhead (e.g. waiting >1 ms to acquire a lock when completing a MultiWait task), which effectively limits the number of threads we can spawn. This PR replaces two custom implementations of common utilities (a thread pool and a latch/MultiWait) with more optimized upstream equivalents.

  • Use tsl::thread::ThreadPool. The underlying implementation reuses threads more carefully and handles NUMA affinity to reduce context-switching costs.
    • Simplified the interface for starting threads. There used to be two thread pools, one for low-latency ops and one for IO ops, but there was no difference between their implementations, so I replaced them with a single thread pool with the maximum number of threads. I expect the IO thread pool mattered more back when we had extremely high-latency calls through XRT.
    • A nice side benefit is that reusing threads more often makes our profiles much more readable:

Before:

[profile trace image]

After:

[profile trace image]

  • Mostly replace MultiWait with absl::BlockingCounter. Completing a MultiWait task requires acquiring a lock, which can cost >1 ms in practice with enough threads. BlockingCounter instead uses a lock-free atomic counter, significantly reducing the latency of decrementing the remaining task count (see the sketch after this list).
    • MultiWait still exists upstream in PyTorch if we need it for anything; I left usages of the upstream MultiWait alone.
    • Removed cases where we used MultiWait to launch one operation that we never waited for.
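
To make the pattern concrete, here is a minimal sketch (not code from this PR) of dispatching work on a tsl::thread::ThreadPool and waiting with an absl::BlockingCounter. The pool name, thread count, and the RunAndWait helper are hypothetical; only the ThreadPool and BlockingCounter APIs come from TSL and absl, with header paths per the current OpenXLA/TSL layout.

```cpp
#include <functional>
#include <utility>
#include <vector>

#include "absl/synchronization/blocking_counter.h"
#include "tsl/platform/env.h"
#include "tsl/platform/threadpool.h"

// Sketch: run `tasks` on a shared TSL thread pool and block until all of
// them finish. BlockingCounter decrements with a lock-free atomic, unlike
// the old MultiWait, which acquired a lock to complete each task.
void RunAndWait(std::vector<std::function<void()>> tasks) {
  // One pool with a generous (hypothetical) thread count; the old code
  // kept two pools (default and IO) with identical implementations.
  static tsl::thread::ThreadPool* pool = new tsl::thread::ThreadPool(
      tsl::Env::Default(), "pytorch_xla", /*num_threads=*/128);

  absl::BlockingCounter counter(static_cast<int>(tasks.size()));
  for (auto& task : tasks) {
    pool->Schedule([&counter, task = std::move(task)] {
      task();
      counter.DecrementCount();  // Last decrement wakes the waiter.
    });
  }
  counter.Wait();  // Blocks until the count reaches zero.
}
```

Because Wait() returns only after every DecrementCount() call has happened, the counter can safely live on the stack even though the scheduled closures reference it.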

We already depend on absl and TSL through OpenXLA, so this PR adds no new dependencies.

Tested with SPMD llama inference to confirm no regression in performance:

Baseline:

Totally decoded 1007 tokens in 7.11816 seconds

With this PR:

Totally decoded 1007 tokens in 7.06291 seconds

I found these performance benefits while working on #5737, which spawns significantly more threads and potentially reduces the synchronous time spent in ExecuteReplicated (see my comment there about performance). I don't expect this PR to provide significant performance benefits on its own. What it does provide by itself:

  • Clearer profile traces, by grouping operations into fewer threads and actually labeling which threads belong to PyTorch/XLA.
  • Deleting code

Future work:

  • Check whether TSL's low_latency_hint makes a difference in performance. Operations submitted through the thread pool are assumed to be low latency by default.
  • See if upstream TSL can remove the tf_ prefix from thread names.
  • Use differently labeled thread pools for different purposes to make profiles more readable (a hypothetical sketch follows below). Example of separating "framework" ops from "runtime" ops in a WIP PR:

[profile trace image from the WIP PR]
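
As a purely hypothetical sketch of the first and last future-work items, assuming the ThreadPool constructor overload that takes ThreadOptions and a low_latency_hint (the pool names and thread counts below are illustrative, not from any PR):

```cpp
#include "tsl/platform/env.h"
#include "tsl/platform/threadpool.h"

// Hypothetical: separately named pools give profiler threads distinct
// labels, separating "framework" work from "runtime" work, and the
// explicit low_latency_hint overrides TSL's default assumption that
// scheduled work is low latency.
tsl::thread::ThreadPool framework_pool(
    tsl::Env::Default(), tsl::ThreadOptions(), "framework",
    /*num_threads=*/64, /*low_latency_hint=*/true);
tsl::thread::ThreadPool runtime_pool(
    tsl::Env::Default(), tsl::ThreadOptions(), "runtime",
    /*num_threads=*/64, /*low_latency_hint=*/false);
```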

@will-cromar changed the title from "Use upstream concurrency utilities" to "Use upstream XLA concurrency utilities" on Nov 14, 2023
@will-cromar marked this pull request as ready for review on November 14, 2023 at 21:09
@alanwaketan (Collaborator) left a comment:

LGTM.

@@ -57,7 +57,7 @@ void TestSingleReplication(

   std::vector<std::vector<torch_xla::runtime::ComputationClient::DataPtr>>
       results(device_strings.size());
-  torch_xla::runtime::util::MultiWait mwait(device_strings.size());
+  absl::BlockingCounter mwait(device_strings.size());
A collaborator commented on this line:

Let's rename it to bc or something; mwait would be confusing if the type is not MultiWait.
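
For illustration, a self-contained version of the test's fan-out/fan-in with the suggested rename applied might look like the following sketch (the function name and the use of std::thread are hypothetical; the real test dispatches a replicated computation per device instead):

```cpp
#include <string>
#include <thread>
#include <vector>

#include "absl/synchronization/blocking_counter.h"

// Hypothetical sketch of the test pattern above with the reviewer's
// rename applied: the BlockingCounter is no longer called mwait.
void RunAllReplicas(const std::vector<std::string>& device_strings) {
  absl::BlockingCounter counter(static_cast<int>(device_strings.size()));
  std::vector<std::thread> threads;
  for (size_t i = 0; i < device_strings.size(); ++i) {
    threads.emplace_back([&counter] {
      // The real test executes one replicated computation per device here.
      counter.DecrementCount();
    });
  }
  counter.Wait();  // Returns once every replica has decremented.
  for (std::thread& t : threads) t.join();
}
```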

@JackCaoG (Collaborator) left a comment:

lgtm, minor nits

@will-cromar merged commit 6e66130 into master on Nov 15, 2023. 18 checks passed.
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
* Use TSL threadpool

* remove multiwait

* fix test build

* Move threadpool namespace

* formatting

* fix test build

* Use BlockingCounter
jonb377 added a commit that referenced this pull request Nov 20, 2023
zpcore pushed a commit that referenced this pull request Nov 21, 2023
lsy323 pushed a commit to lsy323/xla that referenced this pull request Nov 28, 2023
jonb377 added a commit that referenced this pull request Nov 29, 2023
jonb377 added a commit that referenced this pull request Nov 30, 2023
jonb377 added a commit that referenced this pull request Dec 1, 2023

* Distribute Literal->Tensor copies across thread pool

* Update for #5799

ManfeiBai pushed a commit to ManfeiBai/PyTorchXLA that referenced this pull request Dec 1, 2023
ManfeiBai pushed a commit to ManfeiBai/PyTorchXLA that referenced this pull request Dec 1, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024