Introduce CUDA OpenXLA fallback. #7318
Conversation
I'm still running torchbench. Will report back when it is over.
test/test_ops.py (Outdated)
@dataclass
class AllowedFallbackOpInfoEntry(AllowedOpInfoEntry):
  fallback_ops: List[str] = field(default_factory=list)
  allow_sample: Optional[Callable[[SampleInput], bool]] = None
What does allow_sample mean?
It filters the sample list, looking for a specific one. I will leave a comment there.
SampleInput is the sample list that you are referring to?
SampleInput is the class that represents one sample.
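For context, here is a minimal sketch of how such a predicate could narrow the generated samples; the filter_samples helper below is only illustrative, not the actual test_ops.py plumbing:

from typing import Callable, Iterable, List, Optional

from torch.testing._internal.common_methods_invocations import SampleInput


def filter_samples(
    samples: Iterable[SampleInput],
    allow_sample: Optional[Callable[[SampleInput], bool]] = None,
) -> List[SampleInput]:
  # No predicate: keep every sample the OpInfo generates.
  if allow_sample is None:
    return list(samples)
  # Otherwise, keep only the specific samples the fallback entry cares about.
  return [s for s in samples if allow_sample(s)]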
@@ -211,18 +213,6 @@ cc_library(
    ],
)

ptxla_cc_library(
What's this change for?
Now, aten_cpu_fallback.cpp needs functions from:
- aten_xla_bridge.cpp
- xla_graph_executor.cpp
- dl_convertor.cpp
So, I thought it would be easier to merge it into the main library.
  return runtime::sys_util::GetEnvBool("XLA_FALLBACK_CUDA", false);
}

// Change: use of std::any_of instead of iterating with a for-loop.
nit: the comment is stale now?
No. This is the change that I applied to that function.
nit: maybe rephrase "Change:" to "Change made:"? "Change:" sounds like it is something we want to change next lol
torch_xla/csrc/aten_cpu_fallback.cpp (Outdated)
}

// Synchronizes the CUDA device being used by PyTorch.
static void torch_cuda_synchronize(at::DeviceIndex common_device) {
What does the param common_device mean? The device to be synchronized?
Yes.
// 1. Track the device index being used. Rationale: we synchronize the device
// before crossing device borders for correctness.
//
void cuda_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack,
I found similar implementations in PyTorch, one in pytorch/aten/src/ATen/native/CPUFallback.cpp and the other one in pytorch/torch/csrc/lazy/ts_backend/ts_eager_fallback.cpp. How is this implementation different from those?
The main difference is that we (a) are falling back to CUDA. Here are more details regarding these 3 implementations:

- CPUFallback.cpp (b) looks more up-to-date than ts_eager_fallback.cpp (c), e.g. it handles mutation in tensor lists
- Device support: even though (c) does support other devices, its conversion uses tensor.to(device), which copies the tensor. In contrast, in (a) we simply share the storage of the tensors
- Device synchronization: (a) needs further device synchronization, since we are not calling the tensor.to method

With all that said, I do agree the functions are all similar to each other. A better approach would be to generalize (b), adding support for some customization.
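As a small illustration of the copy-vs-share distinction (this is plain PyTorch behavior for intuition only, not the PR's conversion code, which goes through dl_convertor.cpp):

import torch

t = torch.randn(4)

# (c)-style conversion: tensor.to(...) with copy=True materializes a new buffer.
copied = t.to("cpu", copy=True)
assert copied.data_ptr() != t.data_ptr()

# Closer to (a): a DLPack round-trip yields a tensor over the same storage.
shared = torch.from_dlpack(t)
assert shared.data_ptr() == t.data_ptr()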
The CI error is a bit tricky to solve.

Problem: I'm using some CUDA functions defined inside PyTorch, which requires linking libc10_cuda.so to the test binaries. However, since (in CI) PyTorch isn't being compiled with CUDA support, that won't work. While I could condition compilation of that code with C++ macros (e.g. using ...), here is another option.

Possible Solution: create a phony implementation for the CUDA functions I'm using, and compile it into another library. Then, if we don't find libc10_cuda.so, we link this other library instead. Notice that this is only needed for the test binaries.

@JackCaoG @vanbasten23 @lezcano
We could also always compile PyTorch with CUDA support in CI.
If it's only the test binary that requires pytorch built with CUDA, there is a way to achieve it. In our CI, there is a workflow that builds pytorch with CUDA, builds torch_xla with CUDA, and runs only those tests that require pytorch with CUDA:
@@ -31,3 +31,4 @@
WORLD_SIZE = 'WORLD_SIZE'
LOCAL_WORLD_SIZE = 'LOCAL_WORLD_SIZE'
ZERO_COPY_ENABLED = 'ZERO_COPY_ENABLED'
XLA_FALLBACK_CUDA = 'XLA_FALLBACK_CUDA'
I suggest we define a plan to make this feature default (e.g. for 2.5 release) and remove the env variable. Wdyt @ysiraichi?
Environment variables need a description here: https://github.com/pytorch/xla/blob/master/configuration.yaml
I think it makes sense to default XLA:CUDA executions to CUDA fallback, while XLA:CPU and XLA:TPU remain on CPU fallback.
As discussed offline, more specifically, after this PR it would be good to:

Step 1: Reverse the function of XLA_FALLBACK_CUDA so that the default is CUDA fallback enabled (e.g. define DISABLE_XLA_FALLBACK_CUDA).

Step 2: Define a mechanism to remove the env variable and have a native solution that just works and avoids user-experience hiccups.
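A rough sketch of what the step 1 flag flip could look like from the environment side (the Python helper is hypothetical; the actual check is done in C++ via runtime::sys_util::GetEnvBool):

import os


def cuda_fallback_enabled() -> bool:
  # Today: opt-in via XLA_FALLBACK_CUDA (default: disabled).
  # Proposed step 1: opt-out via DISABLE_XLA_FALLBACK_CUDA (default: enabled).
  return os.environ.get("DISABLE_XLA_FALLBACK_CUDA", "0") not in ("1", "true")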
For problem 1 ("Problem 1: C++ test binaries need all references to be resolved"), you mentioned the "Solution: Create a fallback implementation of the CUDA functions". Could you point me to where the fallback implementation of the CUDA functions is?
@zpcore to upgrade the XLA:GPU benchmarking to adopt the CUDA fallback setting after this PR lands. cc @will-cromar for visibility re: comment #7318 (comment)
#include "torch_xla/csrc/aten_cuda_functions.h" | ||
|
||
#include <c10/util/Exception.h> | ||
|
||
static void fail(const char* name) { | ||
TORCH_CHECK(false, "Could not call the CUDA function: ", name, | ||
". PyTorch was compiled without CUDA support."); | ||
} | ||
|
||
namespace c10::cuda { | ||
|
||
c10::DeviceIndex current_device() noexcept { return -1; } | ||
|
||
void set_device(c10::DeviceIndex) { fail("c10::cuda::set_device()"); } | ||
|
||
void device_synchronize() { fail("c10::cuda::device_synchronize()"); } | ||
|
||
} // namespace c10::cuda |
@vanbasten23 this is the fallback implementation.
@@ -137,6 +142,7 @@ ptxla_cc_test(
        ":torch_xla_test",
        "//torch_xla/csrc/runtime:metrics",
        "//torch_xla/csrc:tensor",
        "//torch_xla/csrc:aten_cuda_functions",
I'm using the fallback implementation for solving the undefined references in C++ tests. I think this should be reasonable, since we don't test fallback on C++ tests.
@will-cromar I'm having a hard time figuring out how to make this PR work with CI. Specifically: compile + run the fallback operations test (at test_ops.py).

Context: I'm calling a few PyTorch CUDA functions inside a function in aten_cpu_fallback.cpp. The implementation of these functions lives in libc10_cuda.so.

Problem: in the CI action where we compile PyTorch/XLA, we actually compile PyTorch and PyTorch/XLA without CUDA support. In other words, libc10_cuda.so isn't available to link against.

Proposed Solution: have 2 libraries: the real libc10_cuda.so when PyTorch is built with CUDA, and a phony implementation of those functions otherwise.

I know this is not a pretty solution, so do you have any suggestions?
Hey @ysiraichi, I'll spend some more time going over this PR tomorrow to try to understand it better.

We were just preparing to remove the separate GPU variant of the main ... Most of the team that is building from source is doing so on TPUs, realistically, so it is a nice convenience to not have to build the CUDA version of PyTorch first. Obviously adding the CUDA ...

I don't fully understand, after skimming the PR, why we need ...
It can be loaded at runtime. However, it can't be loaded conditionally. At least, not like this.

Loading conditionally ("as needed") was, in fact, the solution that I was proposing: we could have a separate library with a phony implementation of these CUDA functions, and then import it only if we are in an environment where PyTorch has no CUDA support.

Let me write a first implementation. We can remove it if that's not what we want.
I have worked on this for a while now, trying a bunch of things. Unfortunately, none of them worked. Here's the current state of things.

What I tried: ...

What is happening: ...
I'm not sure why this is not working, given that:

$ nm -CD _XLAC.cpython-310-x86_64-linux-gnu.so | grep c10::cuda
                 U c10::cuda::set_device(signed char)
                 U c10::cuda::current_device()
                 U c10::cuda::device_synchronize()

$ nm -CD _XLAC_cuda_functions.cpython-310-x86_64-linux-gnu.so | grep c10::cuda
000000000002c0b1 T c10::cuda::set_device(signed char)
000000000002c09e T c10::cuda::current_device()
000000000002c0cd T c10::cuda::device_synchronize()

# This works!
$ LD_PRELOAD=./_XLAC_cuda_functions.cpython-310-x86_64-linux-gnu.so python -c "import torch_xla"

# This doesn't work...
$ python -c "import _XLAC_cuda_functions; import torch_xla"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "xla/torch_xla/__init__.py", line 11, in <module>
    import _XLAC
ImportError: xla/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14current_deviceEv

@JackCaoG @vanbasten23 @lezcano @will-cromar
python imports the libraries with:

import sys, os
prev = sys.getdlopenflags()
sys.setdlopenflags(prev | os.RTLD_GLOBAL)
import _XLAC_cuda_functions
sys.setdlopenflags(prev)
import torch_xla
namespace c10::cuda {

c10::DeviceIndex current_device() { fail("c10::cuda::current_device()"); }
This means: if pytorch is built without CUDA, then these phony definitions will be used. Without them, doing import torch_xla will fail with something like ImportError: xla/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14current_deviceEv?
Exactly.
cc_binary(
    name = "_XLAC_cuda_functions.so",
    copts = [
        "-fopenmp",
I wonder how the copts and linkopts are determined
To be honest, I just copied them from _XLAC. I guess I could get rid of them, though. What do you think?
Well, if it works now, feel free to keep it :p
if not torch.cuda.is_available():
  # Load _XLAC_cuda_functions to RTLD_GLOBAL, so that it can be used by _XLAC.
  flags = sys.getdlopenflags()
  sys.setdlopenflags(flags | os.RTLD_NOW | os.RTLD_GLOBAL)
Why use os.RTLD_NOW to perform all necessary relocations when dlopen is called? I don't see it in isuruf's example.
Internally we discussed that it would be better, just to be safe. That's because dlopen needs one of RTLD_NOW or RTLD_LAZY.
import tempfile
import warnings

import torch

if not torch.cuda.is_available():
  # Load _XLAC_cuda_functions to RTLD_GLOBAL, so that it can be used by _XLAC.
can you point me to the place where _XLAC_cuda_functions is used by _XLAC?
Oh, I guess you mean the phony definitions will be made available when we do import _XLAC below.
Yes. Exactly.
  import _XLAC_cuda_functions

  # Then, restore the original flags.
  sys.setdlopenflags(flags)
Do you know why we need to restore the original flags?
I'd guess that, in general, we don't really want to load things and make them available globally (i.e. some encapsulation for loaded functions).
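For reference, a standalone sketch of that save/restore pattern using only the stdlib dlopen-flag API (the commented-out import is a placeholder, not a real module):

import os
import sys

prev = sys.getdlopenflags()
try:
  # RTLD_GLOBAL makes the extension's symbols visible to libraries loaded
  # later; RTLD_NOW resolves all of its symbols eagerly at load time.
  sys.setdlopenflags(prev | os.RTLD_NOW | os.RTLD_GLOBAL)
  # import _extension_exporting_global_symbols  # placeholder
finally:
  # Restore the default flags so later extension imports keep the usual,
  # more encapsulated (RTLD_LOCAL) behavior.
  sys.setdlopenflags(prev)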
torch_xla/csrc/aten_cpu_fallback.cpp (Outdated)
// device.
//
// This variable is updated over the course of 'to_cuda' calls.
c10::DeviceIndex common_device = -1;
I'm confused about the common_device. Do we ever change the value after this line?
Would the original tgt_device be easier to understand?
We do change it inside the to_cuda function. There, we check the device of every XLA tensor, so that we are able to synchronize the computation later.

As for "Would the original tgt_device be easier to understand?": since they have different types, and are used in different ways, I thought that it would be a bit confusing to name it tgt_device.
I can see your point. Maybe add some comment such as: common_device refers to the device that all tensors should be on; Ideally, all the tensors should be on the same device. Wdyt?
Just did that.
// Common device for all XLA tensors.
//
// CUDA OpenXLA fallback is supported only when all XLA tensors live in
// the same XLA device. This field should be updated and checked every
// time we convert an XLA tensor argument into a CUDA tensor.
c10::Device common_device;
      opt_tensors[idx] = cuda_tensors[i];
    }
    (*stack)[arguments_begin + idx] = c10::IValue(opt_tensors);
  }
In the cpu implementation, there is a

else if (ivalue.isDevice()) {
  tgt_device = ivalue.toDevice();
  (*stack)[arguments_begin + idx] = c10::IValue(c10::Device(kCPU));
}

Why don't we need it?
Right. I thought we didn't need it, since we would always be on XLA. But, I guess it's important to have it just to be safe.
Done. Let me know what you think.
  // If any input tensors are mutable aliases, we need to
  // directly copy the updated data on the CUDA tensors back to the original
  // inputs.
  for (const auto i : c10::irange(tensor_args_indices.size())) {
Does torch_xla have a concept of mutable aliases?
Also, do you know if there is a specific test case for step 3?
It doesn't. The functional layer abstracts it for PyTorch/XLA. That's why we have to propagate the results back to the input arguments that were mutated.
As for "do you know if there is a specific test case for step 3?": not sure we have those in PyTorch/XLA. For that, we would need an operation that mutates at least one of its inputs, e.g. operations that have the out parameter.
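For illustration, a minimal sketch of the kind of test that would exercise step 3; torch.add is just a stand-in (it has an XLA lowering, so a real test would use an op that actually falls back):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(3, device=device)
b = torch.randn(3, device=device)
out = torch.empty(3, device=device)

# `out` is a mutated input: after the fallback kernel runs, its updated data
# must be copied back into the original XLA tensor (step 3 of the fallback).
torch.add(a, b, out=out)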
Mostly LGTM with minor comments. Amazing work!
torch_xla/csrc/aten_cpu_fallback.cpp (Outdated)
struct DeviceInfo {
  DeviceInfo(c10::Device device, c10::DeviceIndex i = -1)
      : common_device(device), index(i) {}

  // Synchronizes the CUDA device being used by PyTorch.
  void synchronize() {
    TORCH_CHECK(index != -1, "No defined XLA tensors found for CUDA fallback: ",
                op.operator_name());

    // Save the current PyTorch device, in case it's not the same as the
    // recorded tensor device.
    c10::DeviceIndex current = c10::cuda::current_device();
    c10::cuda::set_device(index);
    c10::cuda::device_synchronize();
    c10::cuda::set_device(current);
  }

  // Common device for all XLA tensors.
  //
  // CUDA OpenXLA fallback is supported only when all XLA tensors live in
  // the same XLA device. This field should be updated and checked every
  // time we convert an XLA tensor argument into a CUDA tensor.
  c10::Device common_device;

  // CUDA device index where the tensors live in.
  //
  // This is used for synchronizing the device where the fallback operation
  // was called. This should ensure completion of the CUDA computation, in
  // order to be used by another XLA computation.
  c10::DeviceIndex index;
};
This struct helps make sure only one device is used in the CUDA OpenXLA fallback:
- XLA tensor devices are checked against common_device
- The CUDA device index is also checked, just to be safe
Running TorchBench with ...
This PR introduces OpenXLA fallback on PyTorch GPU eager mode. Instead of running fallback operations (i.e. whenever an operation has no lowering implemented) on CPU, we now make it possible to run them on GPU. This makes sense especially when using XLA:CUDA devices.

In summary, this PR introduces the following changes:
- Rename xla_cpu_fallback into xla_fallback
- New cuda_fallback function
  - Based on at::native::cpu_fallback, but with a few changes (called out before each function)
  - Ideally, we would reuse the at::native::cpu_fallback implementation inside PyTorch, though
- New XLA_FALLBACK_CUDA flag for using this feature

cc @miladm @JackCaoG @vanbasten23
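Finally, a usage sketch of the opt-in flow described above (assumes an XLA:CUDA setup; the sin() call is only a stand-in for whichever operation lacks a lowering):

import os

# Set the flag before importing torch_xla, so the fallback picks it up.
os.environ["XLA_FALLBACK_CUDA"] = "1"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(8, 8, device=device)
# Ops without an XLA lowering now fall back to CUDA eager instead of CPU.
y = x.sin()
print(y.device)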