-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PJRT] Release the GIL during TransferFromServer
#4504
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Will for the investigation. @ronghanghu FYI
This looks like a real error from the CPU test: The linter is just getting confused. I'll expand those |
I'm not sure where this test actually comes from: I get this error:
Also nothing if I search this test name: https://github.com/pytorch/pytorch/search?q=TestViewOpsXLA |
I can reproduce the error locally like this:
|
202c765
to
1b00a10
Compare
The next step is to rebase onto #4503 and confirm performance on v4 and to double-check this on XRT. I don't expect either to be impacted since both cases have replicas in separate processes. |
Tested this with PJRT and XRT on both v3 and v4. I don't see any noticeable impact on performance in any case.* * For some reason, the performance is lower with XRT than it is in our automated test, but that's true with or without this change. I'll keep an eye on the benchmarks after this is merged. |
The CI test failure looks real. I can reproduce it with |
I have the CPP tests passing locally again. 🤞 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do NoGilSection in the python binding layer? Just doesn't feel right to have code below the python binding layer to deal with any python concepts. wdyt?
I agree, and this was my first thought as well. But I traced the offending call back all the way, and it didn't pass through our python binding layer ( Attaching the full gdb output for reference. Full `gdb` trace
|
Yea, after eyeballing the trace. It looks like the GIL is held by the PyTorch layer. I guess it's okay then. Did you consider that maybe we can move the NoGilSection section out from pybind layer to a shared header, and reuse that code here? |
I thought about bringing in NoGilSection, but in this case we have to explicitly check if Python is initialized and that the GIL is not already released before we save the thread state. What do you think of modifying |
It looks like the upstream is using |
Good find! Skimming the code, I'm not sure how it will handle our case where this function is called from a non-python context (C++ tests mainly) or the GIL has already been released. I found that |
Haha, not even pybind people has thought about our use cases... Let's leave it as it is now. But probably add some notes suggesting that this code is more like an exception. I just try to prevent people overusing them in the cpp layer. |
1ea78b7
to
798dca5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Confirmed performance after rebasing and added a warning in the comments. We have other risky changes going into the next nightly build (especially the TF pin update), so I'm going to hold off on merging this until tests run again. |
798dca5
to
fbbc6c6
Compare
Implemented offline request from @yeounoh to add an easy switch to disable this change |
* Release the GIL during `TransferFromServer` * Check for GIL before releasing it * Add comment * formatting * Check for Python intitialization before GIL status * Add explanation and warning * Add temporary switch to hold onto GIL * Formatting
Because execution in PJRT is asynchronous, if we call
TransferFromServer
on a buffer produced by a computation that is in progress,TransferToServer
will block until the underlying buffer is ready. However, in this case, the thread callingTransferFromServer
is also holding the GIL. That means that in a multithreaded context, we can end up with a deadlock where one thread is holding the GIL indefinitely while it waits for another replica to participate in a computation, and the thread that must run that computation is blocked because the first thread is holding the GIL.This started happening consistently with PJRT on TPU v3 some time in mid-December, triggered by an unrelated commit. I confirmed the GIL deadlock by inspecting hanging threads with
gdb
:`gdb` output
This problem was irrelevant to XRT for two reasons:
TransferFromServer
will never block and wait for a result.We have to do one of the following to prevent this deadlock:
TransferFromServer
to allow other threads to move forward.This PR implements (2) to preserve performance.