
Use spawn as the fork method for the profiler test. #6302

Merged
merged 1 commit on Jan 18, 2024

Conversation

ysiraichi (Collaborator):

Fix: #6292

Summary of the changes:

  • The train_worker function was made top-level: the spawn start method requires the target function to be picklable.
  • A new multiprocessing context whose start method is spawn is created: this is needed to avoid CUDA initialization issues.
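A minimal stdlib-only sketch of why the first change is needed. The spawn start method sends the target function to the child by pickling it, and pickle can only serialize functions importable at module top level; train_worker is the name used in this PR, while make_local_worker is a hypothetical counter-example:

```python
import pickle

def train_worker():
    # Top-level: pickle records it by module path, so spawn can rebuild
    # it inside the child process.
    return "ok"

def make_local_worker():
    def local_worker():  # nested: not importable, hence not picklable
        return "ok"
    return local_worker

# A top-level function round-trips through pickle without issue.
assert pickle.loads(pickle.dumps(train_worker))() == "ok"

# A nested function does not: under spawn, handing it to Process(target=...)
# would fail when the parent tries to serialize it for the child.
try:
    pickle.dumps(make_local_worker())
except Exception as e:
    print("not picklable:", type(e).__name__)
```

The fork start method has no such restriction (the child inherits the parent's memory), which is why the test only broke once spawn was required.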

cc @miladm @JackCaoG

@ysiraichi (Collaborator, Author):

@JackCaoG this is a friendly reminder. Do you have some time to review this PR?

@JackCaoG (Collaborator):

lol totally forgot, let me take a look.

@ysiraichi ysiraichi merged commit 52ef8ef into master Jan 18, 2024
19 checks passed
Review thread on the following hunk from the diff:

# Create a new context for forking processes with the spawn method.
# This is necessary so as to avoid CUDA initialization issues when
# both PyTorch and PyTorch/XLA were compiled with CUDA support.
context = multiprocessing.get_context("spawn")
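A hedged sketch of how the reviewed context would typically be used; train_worker and its lack of arguments are assumptions for illustration, not the PR's exact signature:

```python
import multiprocessing

def train_worker():
    # Stand-in for the PR's profiling worker; top-level so the spawn
    # child can import and unpickle it.
    print("child ran")

if __name__ == "__main__":
    # get_context returns a private context object, so the global start
    # method is left untouched. Spawn starts the child from a fresh
    # interpreter, with no CUDA state inherited from the parent.
    context = multiprocessing.get_context("spawn")
    process = context.Process(target=train_worker)
    process.start()
    process.join()
    assert process.exitcode == 0
```

Using a context object rather than multiprocessing.set_start_method("spawn") keeps the change local to this test instead of affecting every other process started in the same interpreter.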
(Collaborator):

@ysiraichi IIUC, the failure happens when we initialize CUDA in the parent process and use CUDA in the child process. I wonder where we initialize CUDA in the parent process before your change in this PR.

ysiraichi (Collaborator, Author):

After some investigation, I believe it comes from importing torch_xla. Specifically, the following chain:

  • torch_xla
  • stablehlo
  • dynamo_bridge
  • torch._inductor.fx_passes.post_grad

I guess one way to solve this issue is to move ConstructorMoverPass out of the inductor tree.

(Collaborator):

Thanks for the reply.

Curious how you know that torch._inductor.fx_passes.post_grad initializes a CUDA context.
Also, what do you mean by moving ConstructorMoverPass out of the inductor tree?

ysiraichi (Collaborator, Author):

> how do you know torch._inductor.fx_passes.post_grad initializes a CUDA context.

Just by commenting it out: the problem goes away.

> what do you mean by move ConstructorMoverPass out of inductor tree?

This class is declared under the inductor module, so importing it means loading the inductor module itself, which initializes a CUDA context. If the class were declared somewhere else (which is possible, since it does not really depend on anything in inductor), that initialization would go away.
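The kind of import-chain detective work described above can be done with the stdlib alone: inspect sys.modules before and after the suspect import to see what it drags in (in the real case one would also compare torch.cuda.is_initialized() before and after importing torch_xla). This toy uses the stdlib wave module as a stand-in for the heavy dependency:

```python
import sys

probe = "wave"  # stand-in for torch._inductor; rarely loaded at startup
sys.modules.pop(probe, None)  # ensure a clean slate for the demo
assert probe not in sys.modules

import wave  # the import whose side effects we are investigating

# After the import, the module (and anything it pulled in transitively)
# shows up in sys.modules — which is how an innocent-looking import of
# torch_xla can end up loading inductor and touching CUDA.
assert probe in sys.modules
print(probe, "was pulled in by the import")
```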

Successfully merging this pull request may close these issues.

test_trace_and_metrics fail if PyTorch has CUDA support.