Torch-MLIR LTC Backend #1001

Merged · 32 commits merged into main from torch_mlir_ltc_backend on Jul 30, 2022

Conversation

@antoniojkim (Collaborator)

The LTC staging branch is finally at a point where it is stable enough against the latest PyTorch master that we can merge the staging branch into main and continue working off of main instead of the staging branch.

All of the commits on this branch came from PRs that were individually reviewed internally at Cerebras and then approved for merging by someone else on the Torch-MLIR project (mainly @silvasean).

Open to any feedback as this is a fairly major change. Having said that, the LTC backend is fairly well decoupled from the rest of Torch-MLIR, so it shouldn't break anything.

CC: @ke1337 @henrytwo

@powderluv (Collaborator)

@antoniojkim once the CI is green, please do try to run the nightly build on the ltc branch too. Just choose your branch, edit oneshotSnapshotPackage.yml to point to your branch instead of 'main', and run it. This way we can test the nightly build too.

@silvasean (Contributor) left a comment

A few initial comments. I would like to do another round of review after the docs I mentioned in the other PR have been added so I can orient myself better.

Review threads (outdated, resolved): examples/ltc_backend/CMakeLists.txt · test/lit.cfg.py
@rdadolf (Collaborator) commented Jul 4, 2022

Not to be "that guy", but is there some way we can rename/disambiguate "backend"? I recognize that you're coming at this from a PyTorch angle where what you've built is called a "backend", but torch-mlir also already has a concept of "backend" that's used quite a bit, and it's unrelated.

It seems like maybe there's room for a new name here. @silvasean maybe you can weigh in?

@antoniojkim (Collaborator, Author)

Not to be "that guy", but is there some way we can rename/disambiguate "backend"? I recognize that you're coming at this from a PyTorch angle where what you've built is called a "backend", but torch-mlir also already has a concept of "backend" that's used quite a bit, and it's unrelated.

It seems like maybe there's room for a new name here. @silvasean maybe you can weigh in?

We've had this discussion before. This is why, in the directory structure, we've named it lazy_base_backend (from its original ltc_backend), and it's intended to be the base for a more concrete backend.

@rdadolf (Collaborator) commented Jul 5, 2022

> We've had this discussion before.

Apologies if I'm re-hashing things. If there's a record of the conversation you could point me to, I'll try to catch up from there.

> This is why, in the directory structure, we've named it lazy_base_backend (from its original ltc_backend), and it's intended to be the base for a more concrete backend.

So I think we may be talking about different things (or maybe I'm just misinterpreting you), because the lazy_base_ vs. ltc_ distinction doesn't really relate. I'm talking about the use of the term backend. In your case, you're talking about backend in the sense that PyTorch uses it, but torch-mlir also uses backend to describe the conversion paths from torch to various other MLIR dialects. E.g., everything in lib/Dialect/TorchConversion/Transforms. There doesn't seem to be much semantic overlap between torch->tosa, torch->linalg, and the LTC work.

To be clear, this isn't a criticism of LTC; the way you're using "backend" makes total sense. This is more of an issue that stems from two independent projects (PyTorch and torch-mlir) having a name collision. To avoid ambiguity and chaos, though, it's probably up to us to resolve that name collision in a sane way. And maybe that even means renaming all the existing things in torch-mlir instead of LTC. I'm not sure.

@antoniojkim (Collaborator, Author)

> > We've had this discussion before.
>
> Apologies if I'm re-hashing things. If there's a record of the conversation you could point me to, I'll try to catch up from there.

There is some discussion here, though mostly to do with making it clearer.

> > This is why, in the directory structure, we've named it lazy_base_backend (from its original ltc_backend), and it's intended to be the base for a more concrete backend.
>
> So I think we may be talking about different things (or maybe I'm just misinterpreting you), because the lazy_base_ vs. ltc_ distinction doesn't really relate. I'm talking about the use of the term backend. In your case, you're talking about backend in the sense that PyTorch uses it, but torch-mlir also uses backend to describe the conversion paths from torch to various other MLIR dialects. E.g., everything in lib/Dialect/TorchConversion/Transforms. There doesn't seem to be much semantic overlap between torch->tosa, torch->linalg, and the LTC work.

There isn't meant to be overlap, per se. This is why it's a "base" lazy backend. I can definitely see us adding a TOSA lazy backend and/or a LinAlg lazy backend that are both derived from the base lazy backend.

To that end, I can kind of see where you're coming from. It isn't an actual backend, only a "base", so including "backend" in the name might be a bit confusing. Having said that, I'm still in favour of keeping it as is. @henrytwo and I are planning on adding some documentation to make things clearer.

@silvasean (Contributor)

I think that the name "backend" here is fine. Yes, it's a little confusing, but "backend" and "frontend" are always terms fraught with peril; absent a very long description they have no super-specific meaning (at a minimum of confusion, one person's backend is usually another's frontend). For example, the thing we informally call "backend" in Torch-MLIR that Bob was referring to is really something like "the interface that Torch-MLIR provides downstream consumers who want to receive the torch dialect, with shape inference, dtype inference, value-semantic conversions, and select decompositions applied, in a flattened module graph using standard MLIR func/module abstractions".

@silvasean (Contributor) left a comment

Overall this looks good to me.

Review threads (outdated, resolved): python/torch_mlir/csrc/README.md · examples/ltc_backend/CMakeLists.txt
@henrytwo (Member) commented Jul 7, 2022

For anyone following along, we have some docs here to help understand the architecture: https://github.com/llvm/torch-mlir/blob/torch_mlir_ltc_backend/docs/ltc_backend.md
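
For a quick feel of what using the backend looks like, here is a minimal usage sketch. The module path `torch_mlir._mlir_libs._REFERENCE_LAZY_BACKEND` and its `_initialize()` entry point are assumptions pieced together from this PR, not verified API; see the linked docs for the authoritative examples.

```python
import torch
import torch._lazy  # standard PyTorch Lazy Tensor Core entry point

# Assumed import path and init call for the reference backend (see lead-in above).
import torch_mlir._mlir_libs._REFERENCE_LAZY_BACKEND as lazy_backend

lazy_backend._initialize()

model = torch.nn.Linear(4, 2).to("lazy")   # ops on "lazy" are traced, not run eagerly
x = torch.randn(3, 4, device="lazy")
y = model(x)

# Cut the trace: hand the accumulated graph to the backend for compilation/execution.
torch._lazy.mark_step()
print(y.cpu())
```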

@silvasean (Contributor)

Hey folks, is there anything blocking us from merging this?

@henrytwo (Member) commented Jul 14, 2022

> Hey folks, is there anything blocking us from merging this?

I think we're trying to fix CI first. Once we confirm that works, I think we're good.

@antoniojkim (Collaborator, Author)

> Hey folks, is there anything blocking us from merging this?

We may need to wait at least until tomorrow for the latest PyTorch changes to make it into the nightly build in order for CI to pass.

henrytwo and others added 21 commits July 29, 2022 14:32
* Assume zero rank tensors are scalar

* Run RefineTypes pass on JIT Graph

* Rollback assumption that zero rank tensors are scalar

* Set numSizes to -1 for non-ranked tensors

* Rename RefineTypes to RefineTupleTypes
* Update native function definitions

* Add ops to support bert lowering

- Add empty_strided and as_strided

- Restore zeros_like to op blacklist (without this, tensors will be unintentionally created with a CPU device rather than lazy; see the sketch after this commit note)

- Check for composite implicit ops and add device data IR

- Also fix codegen for functionalization
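
To illustrate the `zeros_like` blacklisting above: the observable contract is that factory-like ops on a lazy tensor keep the lazy device. A minimal check, assuming a lazy backend has already been initialized; this is plain PyTorch semantics, not backend-internal code.

```python
import torch

x = torch.randn(2, 2, device="lazy")  # assumes an initialized lazy backend
z = torch.zeros_like(x)

# With zeros_like on the blacklist (i.e. routed through device-aware decomposition),
# the result stays on the lazy device; the bug was that it came back on CPU.
assert z.device.type == "lazy"
```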

* Add autogen to CMakeList

* Remove PyTorch submodule

* Reduced BERT model size

* Print Mark Step status in Torch MLIR LTC debug string

* Apply fixes to work with latest upstream/main

- Pass importOptions into getMlirTypeFromTorchType during NodeImporter::importNode

  Without this, the tensor type created may have a mismatched type as ImportOptions may cause vtensor to be used instead of tensor

* Update shape inference functions

- Fixed compute_shape_native_batch_norm when mean and var are uninitialized (see the sketch after this commit note)

  Previously, fewer than 3 shapes would be returned if either mean or var didn't exist. Instead, we now initialize them with a vector matching the number of channels.

- Implemented compute_shape_mul

- Fixed bug in reshape shape inference error message
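
As an illustration of the batch-norm shape fix above, a small Python sketch of the intended logic. The real shape-inference function is C++ and takes tensors, not shape lists; the name and signature below are simplified assumptions.

```python
from typing import List, Optional

def compute_shape_native_batch_norm_sketch(
    input_shape: List[int],
    running_mean_shape: Optional[List[int]],
    running_var_shape: Optional[List[int]],
) -> List[List[int]]:
    """Always return 3 shapes: output, saved mean, saved var (simplified)."""
    num_channels = input_shape[1]
    # Before the fix, a missing running_mean/running_var meant fewer than
    # three shapes came back; now we fall back to a [num_channels] shape.
    mean_shape = running_mean_shape if running_mean_shape is not None else [num_channels]
    var_shape = running_var_shape if running_var_shape is not None else [num_channels]
    return [input_shape, mean_shape, var_shape]

print(compute_shape_native_batch_norm_sketch([2, 3, 8, 8], None, None))
# [[2, 3, 8, 8], [3], [3]]
```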

* Get MLIR backend more consistent with TS backend

- Remove LazyNativeFunctions::_unsafe_view from autogen

- Blacklist ops to make JIT graph more like output of TS backend

- Print graph when SSA value has mismatch of types and results

- Remove normalize_index from LazyShapeInference

- Fix seeds for LTC example models

* Update and clean up shape inference functions

- Prune shape inference functions

- Add shape inference function for GenerateSlice

- Add shape inference function for GenerateCopy

Co-authored-by: Henry Tu <henry.tu@cerebras.net>
* Added e2e LTC Torch MLIR tests

* Fix seed for reproducibility

* Check if computation is None before getting debug string

* Updated unit tests, and added numeric tests

* Print name of the model layer that fails numeric validation

* Run LTC e2e test with CI/CD

* Set seed in main function, instead of beginning of execution

* Add comment to specify number of digits of precision

* Fixed typo

* Remove tests for LTC example models

* Added LTC option to torchscript e2e

* Implement compile and run for LTC e2e test

* xfail all tests that use ops that aren't currently supported
…ence functions (#978)

* Removed compute_shape_native_batch_norm

* Removed compute_shape_native_batch_norm_backward
* Fix autogen build dir issue

* Got functionalization pass to compile

* Add slice/diagonal backwards functionalization

* Fix codegen invocation in CMakeLists.txt

* Add functionalization view ops

* Fix logsumexp out functionalization

* Fix ComputationPtr

* Blacklist new_empty op

* Add op comparison

* Remove unnecessary ops

Co-authored-by: Henry Tu <henry.tu@cerebras.net>
* Update README.md

* Update README.md

Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>

* Update README.md

Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>

* Update README.md

Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>
* Create ltc_backend.md

* Added introduction and examples

* Added descriptions for files from autogen

* Added remaining file descriptions

* architecture overview

* Added subheadings to architecture section

* Description for implementing custom backend

* Add graphics

* Future expansion

* Remove "location" in architecture section

* Updated LTC in readme

* Remove extra space in example

* Fix typo

* Reworded vendor compilation process

* Address PR review comments

* Updated diagrams

* Add kernel registration to diagram

* Address PR comments

* Create separate ltc_examples file

* PyBind -> pybind
- Pruned number of xfailed e2e LTC tests from 305 to 134
  - Reviewed every failure to ensure the error genuinely warrants an xfail
- Fixed bug where non-tensor outputs of LTC computation had `.to('cpu')` called, which caused a failure and inflated the xfail count
- Fixed bug with `HBC_basic` test where a constant tensor was created in its constructor without being declared as a buffer, which prevented the device from being updated when the parent `torch.nn.Module` got moved to the `lazy` device (see the buffer sketch after this commit note)
  - Note that this test is still xfail'd due to some unsupported ops. Left a comment about some potential issues that may arise if it gets re-enabled in the future
- Updated autogen `GeneratedTorchOps.td` to reflect the latest set of supported ops
- Renamed `aten.zero.functionalization` to `aten.zero` to reflect upstream PyTorch changes
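
To illustrate the `HBC_basic` buffer fix above, a minimal sketch of the pattern. The class and tensor names are made up for illustration; `register_buffer` is the standard PyTorch mechanism.

```python
import torch

class ConstantHolder(torch.nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # A plain attribute (self.offset = torch.ones(3)) would NOT move with
        # the module; registering it as a buffer makes Module.to(...) update it.
        self.register_buffer("offset", torch.ones(3))

    def forward(self, x):
        return x + self.offset

m = ConstantHolder().to("lazy")  # assumes an initialized lazy backend
print(m.offset.device.type)      # "lazy": the buffer followed the module
```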
* Changed Example MLIR backend to Reference MLIR backend

* Moved reference_ltc_backend into csrc

* Merged sys_utils.h

* Renamed reference_ltc_backend to reference_lazy_backend

* Addressed review comments

* Update docs with new library name

* Removed _REFERENCE_LAZY_BACKEND from .gitignore

* Added reference_lazy_backend to the TorchMLIRPythonModules dependency list

Fixed typo in `ltc_examples.md`

Missed instance where `ltc_backend` was used instead of `lazy_backend`.
* Blacklist _convolution op in LTC

* Removed duplicate Torch_AtenSelectScatterOp instance from autogen .td

* Removed duplicate Torch_AtenSliceScatterOp instance from autogen .td
- Update llvm-project pin to match main
* Add default device ordinal API

* Fix reference backend
* Xfail unsupported ops

* Register FuncDialect

* Include dynamic_ir in build

* Code reformat

* Enable LTC tests for macOS and Source Build
* Remove unnecessary sed in autogen

* Remove .pyc files from VCS
@henrytwo (Member)

@powderluv @silvasean Are there any additional changes we need in this PR before landing in main? If so, could we please get some approval stamps?

As soon as #1110 goes in, we expect CI to be all green here too, since the changes are identical.

@silvasean (Contributor) left a comment

Once CI is green, let's land this!

@powderluv (Collaborator) left a comment

Let's land once CI is green. Thanks for pushing through with this.

* Replace CHECK_EQ with TORCH_CHECK_EQ

* Check value of TORCH_MLIR_USE_INSTALLED_PYTORCH during LTC build

* Update LTC XFAIL with NewZerosModule ops

* Explicitly blacklist _like ops

* Automatically blacklist new_/_like ops

* Prune away unused Python dependencies from LTC

* Add flag to disable LTC

* Autogen dummy _REFERENCE_LAZY_BACKEND library when LTC is disabled

* Implement compute_shape_var

* Removed Var tests from XFAIL Set

* XFAIL tests using _local_scalar_dense or index.Tensor

* Add StdDim tests to XFAIL set

* Autogen aten::cat
@henrytwo henrytwo merged commit 2c3b360 into main Jul 30, 2022
@henrytwo henrytwo deleted the torch_mlir_ltc_backend branch July 30, 2022 13:40
@silvasean (Contributor)

Great follow-through, folks! I'm so excited to see this land!
