
Use spawn as the fork method for the profiler test. #6302

Merged
merged 1 commit on Jan 18, 2024

Conversation

ysiraichi (Collaborator):

Fix: #6292

Summary of the changes:

  • The train_worker function was made top-level: the spawn start method requires the target function to be picklable.
  • A new multiprocessing context whose start method is spawn is created: this is needed to avoid CUDA initialization issues.
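A minimal stdlib-only sketch of why the first change is needed. The spawn start method sends the target function to the child by pickling it, and pickle can only serialize functions importable at module top level; train_worker is the name used in this PR, while make_local_worker is a hypothetical counter-example:

```python
import pickle

def train_worker():
    # Top-level: pickle records it by module path, so spawn can rebuild
    # it inside the child process.
    return "ok"

def make_local_worker():
    def local_worker():  # nested: not importable, hence not picklable
        return "ok"
    return local_worker

# A top-level function round-trips through pickle without issue.
assert pickle.loads(pickle.dumps(train_worker))() == "ok"

# A nested function does not: under spawn, handing it to Process(target=...)
# would fail when the parent tries to serialize it for the child.
try:
    pickle.dumps(make_local_worker())
except Exception as e:
    print("not picklable:", type(e).__name__)
```

The fork start method has no such restriction (the child inherits the parent's memory), which is why the test only broke once spawn was required.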

cc @miladm @JackCaoG

@ysiraichi (Collaborator, Author):

@JackCaoG this is a friendly reminder. Do you have some time to review this PR?

@JackCaoG (Collaborator):

lol totally forgot, let me take a look.

@ysiraichi ysiraichi merged commit 52ef8ef into master Jan 18, 2024
19 checks passed
Review thread on the following hunk from the diff:

# Create a new context for forking processes with the spawn method.
# This is necessary so as to avoid CUDA initialization issues when
# both PyTorch and PyTorch/XLA were compiled with CUDA support.
context = multiprocessing.get_context("spawn")
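A hedged sketch of how the reviewed context would typically be used; train_worker and its lack of arguments are assumptions for illustration, not the PR's exact signature:

```python
import multiprocessing

def train_worker():
    # Stand-in for the PR's profiling worker; top-level so the spawn
    # child can import and unpickle it.
    print("child ran")

if __name__ == "__main__":
    # get_context returns a private context object, so the global start
    # method is left untouched. Spawn starts the child from a fresh
    # interpreter, with no CUDA state inherited from the parent.
    context = multiprocessing.get_context("spawn")
    process = context.Process(target=train_worker)
    process.start()
    process.join()
    assert process.exitcode == 0
```

Using a context object rather than multiprocessing.set_start_method("spawn") keeps the change local to this test instead of affecting every other process started in the same interpreter.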
(Collaborator):

@ysiraichi IIUC, the failure happens when we initialize CUDA in the parent process and use CUDA in the child process. I wonder where we initialize CUDA in the parent process before your change in this PR.

ysiraichi (Collaborator, Author):

After some investigation, I believe it comes from importing torch_xla. Specifically, the following chain:

  • torch_xla
  • stablehlo
  • dynamo_bridge
  • torch._inductor.fx_passes.post_grad

I guess one way to solve this issue is to move ConstructorMoverPass out of the inductor tree.

(Collaborator):

Thanks for the reply.

Curious how you know that torch._inductor.fx_passes.post_grad initializes a CUDA context.
Also, what do you mean by moving ConstructorMoverPass out of the inductor tree?

ysiraichi (Collaborator, Author):

> how do you know torch._inductor.fx_passes.post_grad initializes a CUDA context.

Just by commenting it out: the problem goes away.

> what do you mean by move ConstructorMoverPass out of inductor tree?

This class is declared under the inductor module, so importing it means loading the inductor module itself, which initializes a CUDA context. If the class were declared somewhere else (which is possible, since it does not really depend on anything in inductor), that initialization would go away.
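The kind of import-chain detective work described above can be done with the stdlib alone: inspect sys.modules before and after the suspect import to see what it drags in (in the real case one would also compare torch.cuda.is_initialized() before and after importing torch_xla). This toy uses the stdlib wave module as a stand-in for the heavy dependency:

```python
import sys

probe = "wave"  # stand-in for torch._inductor; rarely loaded at startup
sys.modules.pop(probe, None)  # ensure a clean slate for the demo
assert probe not in sys.modules

import wave  # the import whose side effects we are investigating

# After the import, the module (and anything it pulled in transitively)
# shows up in sys.modules — which is how an innocent-looking import of
# torch_xla can end up loading inductor and touching CUDA.
assert probe in sys.modules
print(probe, "was pulled in by the import")
```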

Successfully merging this pull request may close these issues.

test_trace_and_metrics fail if PyTorch has CUDA support.