Torch-MLIR LTC Backend #1001

Merged · 32 commits merged into main from torch_mlir_ltc_backend on Jul 30, 2022

Conversation

@antoniojkim (Collaborator)

The LTC staging branch is finally at a point where it is stable enough against the latest PyTorch master that we can merge the staging branch into main and continue working off of main instead of the staging branch.

All of the commits on this branch came from PRs that were individually reviewed internally at Cerebras and then approved for merging by someone else on the Torch-MLIR project (mainly @silvasean).

Open to any feedback as this is a fairly major change. Having said that, the LTC backend is fairly well decoupled from the rest of Torch-MLIR, so it shouldn't break anything.

CC: @ke1337 @henrytwo

@powderluv (Collaborator)

@antoniojkim once the CI is green, please do try to run the nightly build on the ltc branch too. Just choose your branch, edit oneshotSnapshotPackage.yml to point to your branch instead of 'main', and run it. This way we can test the nightly build too.

@silvasean (Contributor) left a comment

A few initial comments. I would like to do another round of review after the docs I mentioned in the other PR have been added so I can orient myself better.

Review threads (outdated, resolved): examples/ltc_backend/CMakeLists.txt · test/lit.cfg.py
@rdadolf (Collaborator) commented Jul 4, 2022

Not to be "that guy", but is there some way we can rename/disambiguate "backend"? I recognize that you're coming at this from a PyTorch angle where what you've built is called a "backend", but torch-mlir also already has a concept of "backend" that's used quite a bit, and it's unrelated.

It seems like maybe there's room for a new name here. @silvasean maybe you can weigh in?

@antoniojkim (Collaborator, Author)

Not to be "that guy", but is there some way we can rename/disambiguate "backend"? I recognize that you're coming at this from a PyTorch angle where what you've built is called a "backend", but torch-mlir also already has a concept of "backend" that's used quite a bit, and it's unrelated.

It seems like maybe there's room for a new name here. @silvasean maybe you can weigh in?

We've had this discussion before. This is why, in the directory structure, we've named it lazy_base_backend (from its original ltc_backend), and it's intended to be the base for a more concrete backend.

@rdadolf (Collaborator) commented Jul 5, 2022

> We've had this discussion before.

Apologies if I'm re-hashing things. If there's a record of the conversation you could point me to, I'll try to catch up from there.

> This is why, in the directory structure, we've named it lazy_base_backend (from its original ltc_backend), and it's intended to be the base for a more concrete backend.

So I think we may be talking about different things (or maybe I'm just misinterpreting you), because the lazy_base_ vs. ltc_ distinction doesn't really relate. I'm talking about the use of the term backend. In your case, you're talking about backend in the sense that PyTorch uses it, but torch-mlir also uses backend to describe the conversion paths from torch to various other MLIR dialects. E.g., everything in lib/Dialect/TorchConversion/Transforms. There doesn't seem to be much semantic overlap between torch->tosa, torch->linalg, and the LTC work.

To be clear, this isn't a criticism of LTC; the way you're using "backend" makes total sense. This is more of an issue that stems from two independent projects (PyTorch and torch-mlir) having a name collision. To avoid ambiguity and chaos, though, it's probably up to us to resolve that name collision in a sane way. And maybe that even means renaming all the existing things in torch-mlir instead of LTC. I'm not sure.

@antoniojkim (Collaborator, Author)

> > We've had this discussion before.
>
> Apologies if I'm re-hashing things. If there's a record of the conversation you could point me to, I'll try to catch up from there.

There is some discussion here, though mostly to do with making it clearer.

> > This is why, in the directory structure, we've named it lazy_base_backend (from its original ltc_backend), and it's intended to be the base for a more concrete backend.
>
> So I think we may be talking about different things (or maybe I'm just misinterpreting you), because the lazy_base_ vs. ltc_ distinction doesn't really relate. I'm talking about the use of the term backend. In your case, you're talking about backend in the sense that PyTorch uses it, but torch-mlir also uses backend to describe the conversion paths from torch to various other MLIR dialects. E.g., everything in lib/Dialect/TorchConversion/Transforms. There doesn't seem to be much semantic overlap between torch->tosa, torch->linalg, and the LTC work.

There isn't meant to be overlap, per se. This is why it's a "base" lazy backend. I can definitely see us adding a TOSA lazy backend and/or a LinAlg lazy backend that are both derived from the base lazy backend.

To that end, I can kind of see where you're coming from. It isn't an actual backend, only a "base", so including "backend" in the name might be a bit confusing. Having said that, I'm still in favour of keeping it as is. @henrytwo and I are planning on adding some documentation to make things clearer.

@silvasean (Contributor)

I think that the name "backend" here is fine. Yes, it's a little confusing, but "backend" and "frontend" are always terms fraught with peril; absent a very long description they have no super-specific meaning (at a minimum of confusion, one person's backend is usually another's frontend). For example, the thing we informally call "backend" in Torch-MLIR that Bob was referring to is really something like "the interface that Torch-MLIR provides downstream consumers who want to receive the torch dialect, with shape inference, dtype inference, value-semantic conversions, and select decompositions applied, in a flattened module graph using standard MLIR func/module abstractions".

@silvasean (Contributor) left a comment

Overall this looks good to me.

Review threads (outdated, resolved): python/torch_mlir/csrc/README.md · examples/ltc_backend/CMakeLists.txt
@henrytwo (Member) commented Jul 7, 2022

For anyone following along, we have some docs here to help understand the architecture: https://github.com/llvm/torch-mlir/blob/torch_mlir_ltc_backend/docs/ltc_backend.md
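
For a quick feel of what using the backend looks like, here is a minimal usage sketch. The module path `torch_mlir._mlir_libs._REFERENCE_LAZY_BACKEND` and its `_initialize()` entry point are assumptions pieced together from this PR, not verified API; see the linked docs for the authoritative examples.

```python
import torch
import torch._lazy  # standard PyTorch Lazy Tensor Core entry point

# Assumed import path and init call for the reference backend (see lead-in above).
import torch_mlir._mlir_libs._REFERENCE_LAZY_BACKEND as lazy_backend

lazy_backend._initialize()

model = torch.nn.Linear(4, 2).to("lazy")   # ops on "lazy" are traced, not run eagerly
x = torch.randn(3, 4, device="lazy")
y = model(x)

# Cut the trace: hand the accumulated graph to the backend for compilation/execution.
torch._lazy.mark_step()
print(y.cpu())
```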

@silvasean (Contributor)

Hey folks, is there anything blocking us from merging this?

@henrytwo (Member) commented Jul 14, 2022

> Hey folks, is there anything blocking us from merging this?

I think we're trying to fix CI first. Once we confirm that works, I think we're good.

@antoniojkim (Collaborator, Author)

> Hey folks, is there anything blocking us from merging this?

We may need to wait at least until tomorrow for the latest PyTorch changes to make it into the nightly build in order for CI to pass.

henrytwo and others added 21 commits July 29, 2022 14:32
* Assume zero rank tensors are scalar

* Run RefineTypes pass on JIT Graph

* Rollback assumption that zero rank tensors are scalar

* Set numSizes to -1 for non-ranked tensors

* Rename RefineTypes to RefineTupleTypes
* Update native function definitions

* Add ops to support bert lowering

- Add empty_strided and as_strided

- Restore zeros_like to op blacklist (without this, tensors will be unintentionally created with a CPU device rather than lazy; see the sketch after this commit note)

- Check for composite implicit ops and add device data IR

- Also fix codegen for functionalization
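
To illustrate the `zeros_like` blacklisting above: the observable contract is that factory-like ops on a lazy tensor keep the lazy device. A minimal check, assuming a lazy backend has already been initialized; this is plain PyTorch semantics, not backend-internal code.

```python
import torch

x = torch.randn(2, 2, device="lazy")  # assumes an initialized lazy backend
z = torch.zeros_like(x)

# With zeros_like on the blacklist (i.e. routed through device-aware decomposition),
# the result stays on the lazy device; the bug was that it came back on CPU.
assert z.device.type == "lazy"
```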

* Add autogen to CMakeList

* Remove PyTorch submodule

* Reduced BERT model size

* Print Mark Step status in Torch MLIR LTC debug string

* Apply fixes to work with latest upstream/main

- Pass importOptions into getMlirTypeFromTorchType during NodeImporter::importNode

  Without this, the tensor type created may have a mismatched type as ImportOptions may cause vtensor to be used instead of tensor

* Update shape inference functions

- Fixed compute_shape_native_batch_norm when mean and var are uninitialized (see the sketch after this commit note)

  Previously, fewer than 3 shapes would be returned if either mean or var didn't exist. Instead, we now initialize them with a vector matching the number of channels.

- Implemented compute_shape_mul

- Fixed bug in reshape shape inference error message
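
As an illustration of the batch-norm shape fix above, a small Python sketch of the intended logic. The real shape-inference function is C++ and takes tensors, not shape lists; the name and signature below are simplified assumptions.

```python
from typing import List, Optional

def compute_shape_native_batch_norm_sketch(
    input_shape: List[int],
    running_mean_shape: Optional[List[int]],
    running_var_shape: Optional[List[int]],
) -> List[List[int]]:
    """Always return 3 shapes: output, saved mean, saved var (simplified)."""
    num_channels = input_shape[1]
    # Before the fix, a missing running_mean/running_var meant fewer than
    # three shapes came back; now we fall back to a [num_channels] shape.
    mean_shape = running_mean_shape if running_mean_shape is not None else [num_channels]
    var_shape = running_var_shape if running_var_shape is not None else [num_channels]
    return [input_shape, mean_shape, var_shape]

print(compute_shape_native_batch_norm_sketch([2, 3, 8, 8], None, None))
# [[2, 3, 8, 8], [3], [3]]
```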

* Get MLIR backend more consistent with TS backend

- Remove LazyNativeFunctions::_unsafe_view from autogen

- Blacklist ops to make JIT graph more like output of TS backend

- Print graph when SSA value has mismatch of types and results

- Remove normalize_index from LazyShapeInference

- Fix seeds for LTC example models

* Update and clean up shape inference functions

- Prune shape inference functions

- Add shape inference function for GenerateSlice

- Add shape inference function for GenerateCopy

Co-authored-by: Henry Tu <henry.tu@cerebras.net>
* Added e2e LTC Torch MLIR tests

* Fix seed for reproducibility

* Check if computation is None before getting debug string

* Updated unit tests, and added numeric tests

* Print name of the model layer that fails numeric validation

* Run LTC e2e test with CI/CD

* Set seed in main function, instead of beginning of execution

* Add comment to specify number of digits of precision

* Fixed typo

* Remove tests for LTC example models

* Added LTC option to torchscript e2e

* Implement compile and run for LTC e2e test

* xfail all tests that use ops that aren't currently supported
…ence functions (#978)

* Removed compute_shape_native_batch_norm

* Removed compute_shape_native_batch_norm_backward
* Fix autogen build dir issue

* Got functionalization pass to compile

* Add slice/diagonal backwards functionalization

* Fix codegen invocation in CMakeLists.txt

* Add functionalization view ops

* Fix logsumexp out functionalization

* Fix ComputationPtr

* Blacklist new_empty op

* Add op comparison

* Remove unnecessary ops

Co-authored-by: Henry Tu <henry.tu@cerebras.net>
* Update README.md

* Update README.md

Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>

* Update README.md

Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>

* Update README.md

Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>
* Create ltc_backend.md

* Added introduction and examples

* Added descriptions for files from autogen

* Added remaining file descriptions

* architecture overview

* Added subheadings to architecture section

* Description for implementing custom backend

* Add graphics

* Future expansion

* Remove "location" in architecture section

* Updated LTC in readme

* Remove extra space in example

* Fix typo

* Reworded vendor compilation process

* Address PR review comments

* Updated diagrams

* Add kernel registration to diagram

* Address PR comments

* Create separate ltc_examples file

* PyBind -> pybind
- Pruned number of xfailed e2e LTC tests from 305 to 134
  - Reviewed every failure to ensure the error genuinely warrants an xfail
- Fixed bug where non-tensor outputs of LTC computation had `.to('cpu')` called, which caused a failure and inflated the xfail count
- Fixed bug with `HBC_basic` test where a constant tensor was created in its constructor without being declared as a buffer, which prevented the device from being updated when the parent `torch.nn.Module` got moved to the `lazy` device (see the buffer sketch after this commit note)
  - Note that this test is still xfail'd due to some unsupported ops. Left a comment about some potential issues that may arise if it gets re-enabled in the future
- Updated autogen `GeneratedTorchOps.td` to reflect the latest set of supported ops
- Renamed `aten.zero.functionalization` to `aten.zero` to reflect upstream PyTorch changes
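
To illustrate the `HBC_basic` buffer fix above, a minimal sketch of the pattern. The class and tensor names are made up for illustration; `register_buffer` is the standard PyTorch mechanism.

```python
import torch

class ConstantHolder(torch.nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # A plain attribute (self.offset = torch.ones(3)) would NOT move with
        # the module; registering it as a buffer makes Module.to(...) update it.
        self.register_buffer("offset", torch.ones(3))

    def forward(self, x):
        return x + self.offset

m = ConstantHolder().to("lazy")  # assumes an initialized lazy backend
print(m.offset.device.type)      # "lazy": the buffer followed the module
```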
* Changed Example MLIR backend to Reference MLIR backend

* Moved reference_ltc_backend into csrc

* Merged sys_utils.h

* Renamed reference_ltc_backend to reference_lazy_backend

* Addressed review comments

* Update docs with new library name

* Removed _REFERENCE_LAZY_BACKEND from .gitignore

* Added reference_lazy_backend to the TorchMLIRPythonModules dependency list

Fixed typo in `ltc_examples.md`

Missed instance where `ltc_backend` was used instead of `lazy_backend`.
* Blacklist _convolution op in LTC

* Removed duplicate Torch_AtenSelectScatterOp instance from autogen .td

* Removed duplicate Torch_AtenSliceScatterOp instance from autogen .td
- Update llvm-project pin to match main
* Add default device ordinal API

* Fix reference backend
* Xfail unsupported ops

* Register FuncDialect

* Include dynamic_ir in build

* Code reformat

* Enable LTC tests for macOS and Source Build
* Remove unnecessary sed in autogen

* Remove .pyc files from VCS
@henrytwo (Member)

@powderluv @silvasean Are there any additional changes we need in this PR before landing in main? If so, could we please get some approval stamps?

As soon as #1110 goes in, we expect CI to be all green here too, since the changes are identical.

@silvasean (Contributor) left a comment

Once CI is green, let's land this!

@powderluv (Collaborator) left a comment

Let's land once CI is green. Thanks for pushing through with this.

* Replace CHECK_EQ with TORCH_CHECK_EQ

* Check value of TORCH_MLIR_USE_INSTALLED_PYTORCH during LTC build

* Update LTC XFAIL with NewZerosModule ops

* Explicitly blacklist _like ops

* Automatically blacklist new_/_like ops

* Prune away unused Python dependencies from LTC

* Add flag to disable LTC

* Autogen dummy _REFERENCE_LAZY_BACKEND library when LTC is disabled

* Implement compute_shape_var

* Removed Var tests from XFAIL Set

* XFAIL tests using _local_scalar_dense or index.Tensor

* Add StdDim tests to XFAIL set

* Autogen aten::cat
@henrytwo henrytwo merged commit 2c3b360 into main Jul 30, 2022
@henrytwo henrytwo deleted the torch_mlir_ltc_backend branch July 30, 2022 13:40
@silvasean (Contributor)

Great follow-through, folks! I'm so excited to see this land!
