Use spawn as the fork method for the profiler test. #6302
Conversation
@JackCaoG this is a friendly reminder. Do you have some time to review this PR?
lol totally forgot, let me take a look.
# Create a new context for starting processes with the spawn method.
# This is necessary to avoid CUDA initialization issues when
# both PyTorch and PyTorch/XLA are compiled with CUDA support.
context = multiprocessing.get_context("spawn")
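The diff above can be exercised end-to-end with plain `multiprocessing`; the `worker` function below is a hypothetical stand-in for the profiler test's worker. Unlike the default "fork" method on Linux, "spawn" launches a fresh interpreter for each child, so a CUDA context initialized in the parent is never copied into it:

```python
import multiprocessing


def worker(queue):
    # The spawned child starts from a clean interpreter: no parent
    # state (e.g. an already-initialized CUDA context) is inherited.
    queue.put("started fresh")


# A context scopes the start method to these objects instead of
# changing the global default with multiprocessing.set_start_method.
ctx = multiprocessing.get_context("spawn")

if __name__ == "__main__":
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```

Using `get_context` rather than `set_start_method` keeps the change local to this test and avoids affecting other code that may rely on the process-wide default.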
@ysiraichi IIUC, the failure happens when we initialize CUDA in the parent process and use CUDA in the child process. I wonder where we initialize CUDA in the parent process before your change in this PR.
After some investigation, I believe it comes from importing torch_xla. Specifically, the following import chain:
torch_xla -> stablehlo -> dynamo_bridge -> torch._inductor.fx_passes.post_grad
I guess one way to solve this issue is to move ConstructorMoverPass out of the inductor tree.
Thanks for the reply.
Curious: how do you know that torch._inductor.fx_passes.post_grad initializes a CUDA context?
Also, what do you mean by "move ConstructorMoverPass out of the inductor tree"?
how do you know that torch._inductor.fx_passes.post_grad initializes a CUDA context?
Just by commenting out that import, the problem goes away.
what do you mean by "move ConstructorMoverPass out of the inductor tree"?
This class is declared under the inductor module. Importing it means we have to load the inductor module itself, which initializes a CUDA context. If the class were declared somewhere else (which is possible, since it doesn't really depend on anything in inductor), that initialization would go away.
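The effect described above can be simulated with two throwaway modules (`heavy` and `light` are made-up names standing in for inductor and a hypothetical new home for the class). Importing a class from a module that does work at import time triggers that work; importing the same class from a lightweight module does not:

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# "heavy" stands in for the inductor module: importing it has a
# side effect (here, setting a flag; in reality, CUDA initialization).
# "light" holds only the class, with no import-time side effects.
pkg_dir = Path(tempfile.mkdtemp())

(pkg_dir / "heavy.py").write_text(textwrap.dedent("""
    INITIALIZED = True  # import-time side effect

    class ConstructorMoverPass:
        pass
"""))

(pkg_dir / "light.py").write_text(textwrap.dedent("""
    class ConstructorMoverPass:
        pass
"""))

sys.path.insert(0, str(pkg_dir))

import light  # pulls in only the class, not the heavy module

# The heavy module was never imported, so its side effect never ran.
print("heavy" in sys.modules)  # False
```

This is the same argument as the comment above: relocating ConstructorMoverPass to a module without import-time side effects would let torch_xla import it without initializing CUDA.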
Fix: #6292
Summary of the changes:
- The train_worker function was made top-level: the spawn method requires it to be picklable.

cc @miladm @JackCaoG
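The picklability requirement mentioned in the summary can be demonstrated directly; `train_worker` and `make_nested` below are illustrative stand-ins, not the PR's actual functions. Module-level functions pickle by qualified name, which is how "spawn" sends the target function to the child process, whereas nested functions cannot be pickled at all:

```python
import pickle


def train_worker(rank):
    # A top-level function is pickled by reference to its module and
    # name, so a spawned child can import and call it.
    return rank * 2


def make_nested():
    def nested_worker(rank):
        return rank * 2
    return nested_worker


# A module-level function round-trips through pickle...
restored = pickle.loads(pickle.dumps(train_worker))
assert restored(3) == 6

# ...while a nested (local) function does not, so "spawn" cannot use
# it as a Process target.
try:
    pickle.dumps(make_nested())
    picklable = True
except (AttributeError, pickle.PicklingError):
    picklable = False
print(picklable)  # False
```

This is why the fix moves train_worker to the top level: under "fork" the child inherits the parent's memory and never needs to pickle the target, but under "spawn" pickling is mandatory.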