
[PoC] TorchDisc: accelerating PyTorch training via LTC + BladeDISC #156

Closed
4 tasks done
Yancey1989 opened this issue Mar 10, 2022 · 4 comments

Yancey1989 commented Mar 10, 2022

Background

BladeDISC is an end-to-end compiler that supports dynamic shapes, which are common in training workloads. This issue describes how to improve PyTorch training performance with DISC, based on the LazyTensorCore (LTC) mechanism.

feature branch: https://github.com/alibaba/BladeDISC/tree/features/torch_disc_devel

Design Overview

[Design overview diagram]

  1. Mark step: according to LTC, a mark-step API must be called manually at the end of each iteration to sync and execute the accumulated graph on a physical device.
  2. Lowering to TorchScript: LTC uses TorchScript as the backend engine (ref TSBackendImpl); we can use it to lower Lazy IR to TorchScript IR.
  3. Cluster DISC subgraphs: group the DISC-supported operations in the TorchScript IR into subgraphs for DISC compilation.
  4. DISC compilation stage:
    a. MHLO conversion: DISC uses MLIR::mhlo as the front end, so we must convert TorchScript IR to MHLO before compilation.
    b. Compiling to an executable program: call the DISC entry function to compile the MHLO IR into an executable file (a dynamic library).
    c. DISC execution: call the DISC RAL to execute the executable program with the input tensors.
  5. TorchScript execution: finally, call torch::jit::GraphExecutor to execute the TorchScript IR and return the result tensors.
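The lazy-recording-plus-mark-step flow behind these steps can be illustrated with a minimal, self-contained Python sketch. All class and function names here are hypothetical (not the real LTC or BladeDISC APIs): operations are only recorded into a graph, and nothing executes until the mark call forces compilation and execution of the whole graph.

```python
# Minimal sketch of the LazyTensor "mark step" mechanism.
# Hypothetical names; the real implementation lives in PyTorch LTC
# and the BladeDISC feature branch.

class LazyTensor:
    """Handle to a node in the recorded graph; no eager computation."""
    def __init__(self, graph, node_id):
        self.graph = graph
        self.node_id = node_id

class LazyGraph:
    def __init__(self):
        self.nodes = []  # (op, payload) recorded in program order

    def constant(self, value):
        self.nodes.append(("const", value))
        return LazyTensor(self, len(self.nodes) - 1)

    def add(self, a, b):
        self.nodes.append(("add", (a.node_id, b.node_id)))
        return LazyTensor(self, len(self.nodes) - 1)

    def mul(self, a, b):
        self.nodes.append(("mul", (a.node_id, b.node_id)))
        return LazyTensor(self, len(self.nodes) - 1)

    def step_mark(self, result):
        """Sync point: 'compile' and execute the recorded graph (step 1)."""
        values = []
        for op, arg in self.nodes:
            if op == "const":
                values.append(arg)
            elif op == "add":
                values.append(values[arg[0]] + values[arg[1]])
            elif op == "mul":
                values.append(values[arg[0]] * values[arg[1]])
        return values[result.node_id]

g = LazyGraph()
x = g.constant(2.0)
y = g.constant(3.0)
z = g.add(g.mul(x, y), y)   # nothing executed yet, only recorded
print(g.step_mark(z))       # graph executes at the mark point -> 9.0
```

In the real system, `step_mark` would hand the recorded graph to the lowering, clustering, and DISC compilation stages instead of interpreting it directly.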

Implementation and TODO Actions

To implement the features above, we will build a Pybind library, _torch_disc.so, that exposes a step_mark API backed by several C++ functions; the TODO actions are tracked as the task-list items on this issue.
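As a concrete illustration of the clustering stage (step 3 of the design), here is a hedged sketch that groups maximal consecutive runs of DISC-supported ops into clusters, leaving unsupported ops to the fallback TorchScript executor. It assumes a flat list of op names and a hypothetical supported-op set; the real pass operates on TorchScript IR nodes.

```python
# Illustrative sketch of DISC subgraph clustering (hypothetical op set).
DISC_SUPPORTED = {"aten::add", "aten::mul", "aten::relu"}

def cluster_disc_subgraphs(ops):
    """Return a list of clusters; each cluster is (is_disc, [ops])."""
    clusters = []
    for op in ops:
        supported = op in DISC_SUPPORTED
        if clusters and clusters[-1][0] == supported:
            clusters[-1][1].append(op)   # extend the current run
        else:
            clusters.append((supported, [op]))  # start a new cluster
    return clusters

ops = ["aten::add", "aten::mul", "aten::nonzero", "aten::relu"]
print(cluster_disc_subgraphs(ops))
# [(True, ['aten::add', 'aten::mul']), (False, ['aten::nonzero']), (True, ['aten::relu'])]
```

Each `True` cluster would then be converted to MHLO and compiled by DISC (step 4), while `False` clusters stay in TorchScript.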

Reference

  1. PyTorch LazyTensor branch: https://github.com/pytorch/pytorch/tree/lazy_tensor_staging/lazy_tensor_core
  2. PyTorch/XLA backend example: https://github.com/pytorch/xla/tree/asuhan/xla_ltc_plugin

tanyokwok commented Mar 10, 2022

library _torch_disc.so to expose step_mark API

What's the difference between our disc.step_mark() and PyTorch LTC's lazy_tensor_core.step_mark()? I think we can eventually reuse the API from PyTorch LTC.


tanyokwok commented Mar 10, 2022

Perhaps we could add MNIST and BERT as driving cases for our tasks?

Yancey1989 (Collaborator, Author) commented

> What's the difference between our disc.step_mark() and PyTorch LTC's lazy_tensor_core.step_mark()? I think we can eventually reuse the API from PyTorch LTC.

Not exactly the same: disc.step_mark() includes the additional clustering and MHLO conversion stages.

> Perhaps we could add MNIST and BERT as driving cases for our tasks?

That's a great idea; I will add a task to the TODO actions.
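The difference discussed above can be sketched as two pipelines. The stage functions below are trivial stubs that only tag an IR string, purely to make the pipeline shapes concrete; none of these names exist in the real code bases.

```python
# Hypothetical sketch contrasting the plain LTC mark-step pipeline
# with the TorchDisc one. Stage names mirror the design overview.

def lower_to_torchscript(lazy_ir):
    return lazy_ir + " -> ts_ir"

def cluster_disc_subgraphs(ts_ir):
    return ts_ir + " -> clustered"

def convert_to_mhlo(ir):
    return ir + " -> mhlo"

def disc_compile(mhlo):
    return mhlo + " -> binary"

def ltc_step_mark(lazy_ir):
    # Plain LTC: lower Lazy IR to TorchScript and execute it directly.
    return lower_to_torchscript(lazy_ir) + " -> executed"

def disc_step_mark(lazy_ir):
    # TorchDisc: insert clustering, MHLO conversion, and DISC compilation
    # between lowering and execution.
    ts_ir = lower_to_torchscript(lazy_ir)
    binary = disc_compile(convert_to_mhlo(cluster_disc_subgraphs(ts_ir)))
    return binary + " -> executed"

print(ltc_step_mark("lazy_ir"))
print(disc_step_mark("lazy_ir"))
```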

@Yancey1989 Yancey1989 changed the title [PoC] TorchDISC to improve training workload [PoC] TorchDISC to improve PyTorch training workload Mar 10, 2022
@Yancey1989 Yancey1989 changed the title [PoC] TorchDISC to improve PyTorch training workload [PoC] TorchDisc: accelerating PyTorch training via LTC + BladeDISC Apr 6, 2022

Yancey1989 commented Apr 27, 2022

I will close this issue; all the PoC tasks have been done and we merged the feature branch into master. We will keep tracking this feature with the project.

Please feel free to check the design doc about this PoC: #236 .
