Optimize execution for ops that have multiple outputs in eager mode #7680

Merged
2 commits merged into master on Jul 16, 2024

Conversation

JackCaoG
Collaborator

In eager mode, execution happens when we create an XLATensor with an IR: we use that IR as the root to build and execute the graph.

This is mostly fine, but for ops that have multiple outputs (like native_batch_norm), the outputs share a good amount of common HLO, so it is much faster to execute all of them in a single graph. Eager mode in PyTorch/XLA can't really execute HLO instructions one by one, so the goal is to execute the graph once (ideally) for each PyTorch op.
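
For illustration only (this snippet is mine, not part of the PR), any multi-output op takes this path; torch.max along a dim is a simple example, reusing the eager-mode setup from the benchmark further down:

import torch
import torch_xla
import torch_xla.core.xla_model as xm

torch_xla.experimental.eager_mode(True)
device = torch_xla.device()

x = torch.randn(1024, 1024, device=device)
# torch.max along a dim is a multi-output op: it returns (values, indices),
# and both outputs are built from the same reduction over `x`, so executing
# them as one graph avoids redundant work.
values, indices = torch.max(x, dim=1)
xm.wait_device_ops()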

The change in this PR will:

  1. delay the eager execution for some ops when they create new XLATensors with IRs, and
  2. execute the HLO for all of those XLATensors after they are created (see the sketch below).
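
As a rough way to observe this from Python (a sketch of mine, not something this PR adds), you can clear the debug metrics, run a single multi-output op, and inspect the metrics report; picking 'ExecuteTime' as the metric to watch is my own assumption, not something the PR documents:

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

torch_xla.experimental.eager_mode(True)
device = torch_xla.device()

x = torch.randn(16, 16, 128, 128, device=device)
xm.wait_device_ops()
met.clear_all()

# F.batch_norm lowers to native_batch_norm, which produces multiple outputs.
out = torch.nn.functional.batch_norm(x, None, None, training=True)
xm.wait_device_ops()

# With this change the op should (ideally) show up as a single execution
# rather than one per output; the 'ExecuteTime' sample count is my guess
# at the metric to check.
print(met.metrics_report())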

I will take another pass to make sure I didn't mess anything up, but I would appreciate it if someone could look closely at my changes inside tensor_method.cpp.

@JackCaoG JackCaoG added the eager label Jul 12, 2024
@JackCaoG
Collaborator Author

I also intentionally didn't handle the collectives. Collectives return an all_reduce token, which we actually don't want to execute in the eager case. I will handle that in a separate PR.

@aws-rhsoln
Contributor

Curious how much of a perf boost we expect when we fuse them into a single graph?

@JackCaoG
Collaborator Author

JackCaoG commented Jul 15, 2024

Curious how much of a perf boost we expect when we fuse them into a single graph?

For this test code:

import time

import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

torch_xla.experimental.eager_mode(True)

device = torch_xla.device()
m = nn.BatchNorm2d(16).to(device)
m.train()
input = torch.randn(16, 16, 1024, 1024, device=device)

start = time.time()
for _ in range(20):
  input = m(input)
xm.wait_device_ops()
end = time.time()
duration = end - start
print(f"total time = {duration}")

With my change, total time = 0.46190381050109863; without this change, total time = 14.28174352645874. I actually don't know why it is ~31x faster, but I did verify that in the HLO without my change, BatchNorm2d computes its results one by one.

@JackCaoG JackCaoG marked this pull request as ready for review July 15, 2024 18:37
@JackCaoG
Collaborator Author

@alanwaketan @wonjoolee95 This one is ready for review.

@JackCaoG JackCaoG merged commit b2c7f65 into master Jul 16, 2024
23 checks passed