
Make CUDA OpenXLA fallback the default. #7630

Merged
merged 7 commits into master from ysiraichi/make-cuda-fallback-default
Jul 24, 2024

Conversation

ysiraichi (Collaborator):

Partially fixes: #7342

This PR changes the default device for running OpenXLA fallback operations from CPU to CUDA. So, instead of specifying XLA_FALLBACK_CUDA=1, to run fallback operations on CUDA the user must make sure that all of the following hold (see the sketch after this list):

  1. XLA_FALLBACK_CPU (newly introduced) is not set
  2. The DeviceType of the current ComputationClient is CUDA
  3. PyTorch was compiled with CUDA support
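For illustration, the decision roughly amounts to a predicate like the following. This is a minimal sketch, not the PR's actual code: ShouldFallbackOnCUDA, IsCurrentDeviceCUDA, and IsPyTorchCompiledWithCUDA are hypothetical names, and only the XLA_FALLBACK_CPU check mirrors what the PR describes.

// Minimal sketch of the fallback-device decision described above.
// IsCurrentDeviceCUDA() and IsPyTorchCompiledWithCUDA() are assumed
// helpers standing in for the real checks.
bool ShouldFallbackOnCUDA() {
  // (1) Setting XLA_FALLBACK_CPU forces fallback operations onto CPU.
  if (runtime::sys_util::GetEnvBool("XLA_FALLBACK_CPU", false)) {
    return false;
  }
  // (2) The current ComputationClient's DeviceType must be CUDA, and
  // (3) PyTorch must have been compiled with CUDA support.
  return IsCurrentDeviceCUDA() && IsPyTorchCompiledWithCUDA();
}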

I have also changed the test_fallback function so that it no longer sets or unsets the XLA_FALLBACK_CPU flag; it just runs the fallback operation. From this PR onwards, PyTorch/XLA should automatically detect when it cannot run the CUDA OpenXLA fallback.

That said, one can force CPU fallback execution with the newly introduced XLA_FALLBACK_CPU environment variable.

cc @miladm @vanbasten23 @JackCaoG

miladm (Collaborator) left a comment:


@ysiraichi

Approving to unblock. Let's confirm the above items, please.

@@ -11,6 +11,8 @@ static void fail(const char* name) {

namespace c10::cuda {

DeviceIndex device_count() noexcept { return 0; }
Collaborator:


IIRC, we unit-test this function in the Python layer. Do we need a unit test for the C++ layer?

ysiraichi (Collaborator, Author):


Do we? Can you point me to the source location? These functions are supposed to be implemented by PyTorch when it is compiled with CUDA support. This file provides implementations for the case where PyTorch is compiled without CUDA support; they need to be supplied, since we would otherwise get an undefined reference.
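For illustration, such a stub file looks roughly like this. This is a hedged sketch: the device_count definition mirrors the diff above, while the include and the comments are assumptions about how the file is organized.

// Stub ("phony") implementations of c10::cuda symbols. When PyTorch is
// built without CUDA support, these symbols are still referenced at link
// time, so no-op definitions must be supplied to avoid undefined
// references. A CUDA-enabled PyTorch build provides the real ones.
#include <c10/core/Device.h>  // c10::DeviceIndex

namespace c10::cuda {

// No CUDA support was compiled in, so report zero visible devices.
DeviceIndex device_count() noexcept { return 0; }

}  // namespace c10::cuda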

@@ -55,6 +55,10 @@ std::string DeviceType::toString() const {
return absl::StrCat(type_name_, ":");
}

XlaDeviceType DeviceType::getType() const {
Collaborator:


ditto

ysiraichi (Collaborator, Author) commented Jul 8, 2024:


Do we need a unit test for this function? It only casts an integer into an enum.
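For reference, the body is essentially a one-line cast. A sketch, assuming the device type is stored as an integer member; the accessor name type() here is hypothetical:

XlaDeviceType DeviceType::getType() const {
  // The device type is stored as a plain integer; getType() simply
  // reinterprets it as the XlaDeviceType enum.
  return static_cast<XlaDeviceType>(type());
}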

miladm (Collaborator) commented Jul 8, 2024:

cc @zpcore

@@ -51,8 +52,33 @@ std::vector<std::string> GetFallbackOperations() {
// Before each modified function below, we shall specify what has changed,
// if there was any.

Collaborator:


Nit (we can fix in a follow-up): we should rename this file to aten_fallback.cpp.

zpcore (Collaborator) commented Jul 8, 2024:

> cc @zpcore

Thanks, looks like the fallback is enabled by default. We should see the performance update tomorrow.

ysiraichi (Collaborator, Author):

As @miladm pointed out, I will still run benchmarks with this PR, so it might only land tomorrow.

ysiraichi (Collaborator, Author):

I confirmed this PR doesn't introduce any new regressions.

@miladm I will merge this PR and add the XLA_FALLBACK_CPU description, along with the file renaming suggested by @JackCaoG, in a follow-up PR.

ysiraichi (Collaborator, Author):

Actually, I think I will wait for #7647. It looks like a relevant issue.

-bool UseCUDAFallback() {
-  return runtime::sys_util::GetEnvBool("XLA_FALLBACK_CUDA", false);
+// Decide whether to run OpenXLA fallback operations on CUDA.
+bool UseCUDAFallback(const c10::OperatorHandle& op) {
Collaborator:


Since we have another type of CUDA fallback, how about the name "OpenXLAFallbackOnCUDA"?

@@ -11,6 +11,8 @@ static void fail(const char* name) {

namespace c10::cuda {

DeviceIndex device_count() noexcept { return 0; }
Collaborator:


It returns 0 because it's a phony implementation?
Also, could you add a comment to this file describing that it is a phony implementation and what the purpose of the file is?

vanbasten23 (Collaborator) left a comment:


LGTM with minor comments

ysiraichi force-pushed the ysiraichi/make-cuda-fallback-default branch from 462a877 to 34a3ac1 on July 23, 2024 at 13:27.
Comment on lines +24 to +34
// List of operations that should fall back to CPU instead of GPU.
static std::unordered_set<std::string> _force_fallback_on_cpu{
// This operation is a simple memory access that transforms the given
// 1-element tensor into a Scalar.
//
// Although it makes sense to run this operation on CPU (since the
// output will get copied back to CPU anyway), this also fixes a
// particular issue with moco benchmark.
// More details: https://github.com/pytorch/xla/issues/7647
"aten::_local_scalar_dense",
};
ysiraichi (Collaborator, Author):


Just to be completely transparent: aten::_local_scalar_dense is here for two reasons (see the sketch after this list):

  1. It just makes sense to run it on CPU, since the output (a Scalar) also lives on CPU
  2. As a temporary fix for #7647 ([torchbench] moco fails to run with CUDA OpenXLA fallback).
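A sketch of how the set could gate the decision inside UseCUDAFallback(op). Only the set contents come from the PR; the control flow below is an illustration, and the device-type and CUDA-build checks from the PR description are elided.

// Hypothetical sketch: consult the force-on-CPU list before choosing
// the fallback device for a given operator.
bool UseCUDAFallback(const c10::OperatorHandle& op) {
  // Operators on the force-CPU list never fall back to CUDA.
  // op.operator_name().name yields names such as
  // "aten::_local_scalar_dense".
  if (_force_fallback_on_cpu.count(op.operator_name().name) > 0) {
    return false;
  }
  // Otherwise, fall back to CUDA unless XLA_FALLBACK_CPU is set
  // (device-type and build checks elided in this sketch).
  return !runtime::sys_util::GetEnvBool("XLA_FALLBACK_CPU", false);
}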

ysiraichi (Collaborator, Author):

@miladm @JackCaoG @vanbasten23 @zpcore In case you want to take another look at this PR, I added the following changes:

I will merge this one tomorrow.

ysiraichi merged commit 806de83 into master on Jul 24, 2024
23 checks passed