Torch-MLIR LTC Backend #1001
Force-pushed 14f5d50 to 26c32ad
@antoniojkim once the CI is green, please do try to run the nightly build on the ltc branch too. Just choose your branch, edit the oneshotSnapshotPackage.yml to point to your branch instead of 'main', and run it. This way we can test the nightly build too.
A few initial comments. I would like to do another round of review after the docs I mentioned in the other PR have been added so I can orient myself better.
Not to be "that guy", but is there some way we can rename/disambiguate "backend"? I recognize that you're coming at this from a PyTorch angle where what you've built is called a "backend", but torch-mlir also already has a concept of "backend" that's used quite a bit, and it's unrelated. It seems like maybe there's room for a new name here. @silvasean maybe you can weigh in?
We've had this discussion before. This is why, in the directory structure, we've named it `base_lazy_backend`.
Apologies if I'm re-hashing things. If there's a record of the conversation you could point me to, I'll try to catch up from there.
So I think we may be talking about different things (or maybe I'm just misinterpreting you).

To be clear, this isn't a criticism of LTC; the way you're using "backend" makes total sense. This is more of an issue that stems from two independent projects (PyTorch and torch-mlir) having a name collision. To avoid ambiguity and chaos, though, it's probably up to us to resolve that name collision in a sane way. And maybe that even means renaming all the existing things in torch-mlir instead of LTC. I'm not sure.
There is some discussion here, but mostly to do with making it clearer.
There isn't meant to be overlap per se. This is why it's a "base" lazy backend. I can definitely see us adding a TOSA lazy backend and/or a Linalg lazy backend that are both derived from the base lazy backend. To that end, I can kind of see where you're coming from: it isn't an actual backend, only a "base", so including "backend" in the name might be a bit confusing. Having said that, I'm still in favour of keeping it as is. @henrytwo and I are planning on adding some documentation to make things clearer.
Force-pushed 3099b0e to fa1615b
Force-pushed fa1615b to e2c7dfd
I think that the name "backend" here is fine. Yes, it's a little confusing, but "backend" and "frontend" are always terms fraught with peril, and they lack a super specific meaning without a very long description (at a minimum of confusion, one person's backend is usually another's frontend). For example, the thing we informally call "backend" in Torch-MLIR that Bob was referring to is really something like "the interface that Torch-MLIR provides downstream consumers who want to receive the …"
Overall this looks good to me.
For anyone following along, we have some docs here to help understand the architecture: https://github.com/llvm/torch-mlir/blob/torch_mlir_ltc_backend/docs/ltc_backend.md
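To give a flavor of what those docs cover, here is a rough usage sketch of the reference backend. This is a hedged illustration, not code from the PR: the module path `_REFERENCE_LAZY_BACKEND` and its `_initialize()` entry point are taken from names that appear elsewhere in this PR and may differ from the final API; see `ltc_examples.md` for the authoritative usage.

```python
import torch
import torch._lazy

# Assumed import path; the reference backend library is referred to
# as _REFERENCE_LAZY_BACKEND elsewhere in this PR.
import torch_mlir._mlir_libs._REFERENCE_LAZY_BACKEND as lazy_backend

lazy_backend._initialize()          # register the MLIR LTC backend

model = torch.nn.Linear(4, 4).to('lazy')
x = torch.randn(2, 4, device='lazy')
out = model(x)                      # ops are traced lazily, not executed
torch._lazy.mark_step()             # cut the trace; compile and run it
print(out.cpu())
```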
Force-pushed 0939a1b to a65857f
Force-pushed a65857f to 080777e
Force-pushed 9921d41 to 787924b
Hey folks, is there anything blocking us from merging this?
I think we're trying to fix CI first. Once we confirm that works, I think we're good.
We may need to wait until tomorrow at least for the latest PyTorch changes to make it into the nightly build in order for CI to pass.
Force-pushed 775bbf2 to 761469a
* Assume zero rank tensors are scalar
* Run RefineTypes pass on JIT Graph
* Rollback assumption that zero rank tensors are scalar
* Set numSizes to -1 for non-ranked tensors
* Rename RefineTypes to RefineTupleTypes
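For readers unfamiliar with the distinction the first bullet relies on, here is a minimal Python illustration (editor's sketch, not from the PR) of a zero-rank tensor versus a plain Python scalar:

```python
import torch

t = torch.tensor(7)   # zero-rank (0-dimensional) tensor
print(t.dim())        # 0
print(t.shape)        # torch.Size([])
print(t.item())       # 7 -- extracted as a plain Python scalar
```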
* Update native function definitions
* Add ops to support bert lowering
  - Add empty_strided and as_strided
  - Restore zeros_like to op blacklist (without this, tensors will be unintentionally created with a CPU device rather than lazy)
  - Check for composite implicit ops and add device data IR
  - Also fix codegen for functionalization
* Add autogen to CMakeList
* Remove PyTorch submodule
* Reduced BERT model size
* Print Mark Step status in Torch MLIR LTC debug string
* Apply fixes to work with latest upstream/main
  - Pass importOptions into getMlirTypeFromTorchType during NodeImporter::importNode. Without this, the tensor type created may have a mismatched type, as ImportOptions may cause vtensor to be used instead of tensor
* Update shape inference functions
  - Fixed compute_shape_native_batch_norm when mean and var are uninitialized. Previously, the number of shapes returned would be <3 if either mean or var didn't exist. Instead, we now initialize them with a vector matching the number of channels.
  - Implemented compute_shape_mul
  - Fixed bug in reshape shape inference error message
* Get MLIR backend more consistent with TS backend
  - Remove LazyNativeFunctions::_unsafe_view from autogen
  - Blacklist ops to make JIT graph more like output of TS backend
  - Print graph when SSA value has mismatch of types and results
  - Remove normalize_index from LazyShapeInference
  - Fix seeds for LTC example models
* Update and clean up shape inference functions
  - Prune shape inference functions
  - Add shape inference function for GenerateSlice
  - Add shape inference function for GenerateCopy

Co-authored-by: Henry Tu <henry.tu@cerebras.net>
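To make the compute_shape_native_batch_norm fix above concrete, here is a hedged Python rendering of the shape rule it describes. This is an illustrative helper written for this writeup, not the PR's actual C++ implementation: always return three shapes, substituting a per-channel shape when the running statistics are absent.

```python
import torch

def compute_shape_native_batch_norm_like(input, running_mean, running_var):
    """Illustrative rendering of the fixed shape rule: when mean/var are
    uninitialized, fall back to a shape matching the channel count so
    that three shapes are always returned."""
    num_channels = input.shape[1]
    mean_shape = tuple(running_mean.shape) if running_mean is not None else (num_channels,)
    var_shape = tuple(running_var.shape) if running_var is not None else (num_channels,)
    return [tuple(input.shape), mean_shape, var_shape]

# NCHW input with uninitialized running statistics:
print(compute_shape_native_batch_norm_like(torch.randn(8, 3, 4, 4), None, None))
# [(8, 3, 4, 4), (3,), (3,)]
```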
* Added e2e LTC Torch MLIR tests
* Fix seed for reproducibility
* Check if computation is None before getting debug string
* Updated unit tests, and added numeric tests
* Print name of the model layer that fails numeric validation
* Run LTC e2e test with CI/CD
* Set seed in main function, instead of beginning of execution
* Add comment to specify number of digits of precision
* Fixed typo
* Remove tests for LTC example models
* Added LTC option to torchscript e2e
* Implement compile and run for LTC e2e test
* xfail all tests that use ops that aren't currently supported
…ence functions (#978)

* Removed compute_shape_native_batch_norm
* Removed compute_shape_native_batch_norm_backward
* Fix autogen build dir issue
* Got functionalization pass to compile
* Add slice/diagonal backwards functionalization
* Fix codegen invocation in CMakeLists.txt
* Add functionalization view ops
* Fix logsumexp out functionalization
* Fix ComputationPtr
* Blacklist new_empty op
* Add op comparison
* Remove unnecessary ops

Co-authored-by: Henry Tu <henry.tu@cerebras.net>
* Update README.md
* Update README.md
  Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>
* Update README.md
  Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>
* Update README.md
  Co-authored-by: Jae Hoon (Antonio) Kim <17433012+antoniojkim@users.noreply.github.com>
* Create ltc_backend.md
* Added introduction and examples
* Added descriptions for files from autogen
* Added remaining file descriptions
* Architecture overview
* Added subheadings to architecture section
* Description for implementing custom backend
* Add graphics
* Future expansion
* Remove "location" in architecture section
* Updated LTC in readme
* Remove extra space in example
* Fix typo
* Reworded vendor compilation process
* Address PR review comments
* Updated diagrams
* Add kernel registration to diagram
* Address PR comments
* Create separate ltc_examples file
* PyBind -> pybind
- Pruned number of xfailed e2e LTC tests from 305 to 134
- Reviewed every failure to ensure the error genuinely warrants an xfail
- Fixed bug where non-tensor outputs of LTC computation had `.to('cpu')` called, which caused a failure and inflated the xfail count
- Fixed bug with `HBC_basic` test where a constant tensor was created in its constructor without being declared as a buffer, which prevented the device from being updated when the parent `torch.nn.Module` got moved to the `lazy` device
  - Note that this test is still xfail'd due to some unsupported ops. Left a comment about some potential issues that may arise if it gets reenabled in the future
- Updated autogen `GeneratedTorchOps.td` to reflect the latest set of supported ops
- Renamed `aten.zero.functionalization` to `aten.zero` to reflect upstream PyTorch changes
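For context on the `HBC_basic` fix above, here is a minimal sketch (with hypothetical module names, written for this writeup) of the buffer pattern involved: a plain tensor attribute does not move when the module is moved to a new device, while a registered buffer does. The always-available `meta` device is used here so the snippet runs without an LTC backend.

```python
import torch

class WithPlainAttribute(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Plain attribute: Module.to(device) will NOT move this tensor.
        self.const = torch.ones(3)

class WithBuffer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: moves along with the module's parameters.
        self.register_buffer("const", torch.ones(3))

plain = WithPlainAttribute().to('meta')
buffered = WithBuffer().to('meta')
print(plain.const.device)     # cpu  -- left behind, as in the HBC_basic bug
print(buffered.const.device)  # meta -- moved with the module
```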
* Changed Example MLIR backend to Reference MLIR backend
* Moved reference_ltc_backend into csrc
* Merged sys_utils.h
* Renamed reference_ltc_backend to reference_lazy_backend
* Addressed review comments
* Update docs with new library name
* Removed _REFERENCE_LAZY_BACKEND from .gitignore
* Added reference_lazy_backend to the TorchMLIRPythonModules dependency list

Fixed typo in `ltc_examples.md`

Missed instance where `ltc_backend` was used instead of `lazy_backend`.
* Blacklist _convolution op in LTC
* Removed duplicate Torch_AtenSelectScatterOp instance from autogen .td
* Removed duplicate Torch_AtenSliceScatterOp instance from autogen .td
- Update llvm-project pin to match main
* Add default device ordinal API
* Fix reference backend
* Xfail unsupported ops
* Register FuncDialect
* Include dynamic_ir in build
* Code reformat
* Enable LTC tests for macOS and Source Build
* Remove unnecessary sed in autogen
* Remove .pyc files from VCS
Force-pushed b149d13 to 89e14e8
@powderluv @silvasean Are there any additional changes we need in this PR before landing in main? If so, could we please get some approval stamps? As soon as #1110 goes in, we expect CI to be all green here too, since the changes are identical.
Once CI is green, let's land this!
Let's land once CI is green. Thanks for pushing through with this.
* Replace CHECK_EQ with TORCH_CHECK_EQ
* Check value of TORCH_MLIR_USE_INSTALLED_PYTORCH during LTC build
* Update LTC XFAIL with NewZerosModule ops
* Explicitly blacklist _like ops
* Automatically blacklist new_/_like ops
* Prune away unused Python dependencies from LTC
* Add flag to disable LTC
* Autogen dummy _REFERENCE_LAZY_BACKEND library when LTC is disabled
* Implement compute_shape_var
* Removed Var tests from XFAIL Set
* XFAIL tests using _local_scalar_dense or index.Tensor
* Add StdDim tests to XFAIL set
* Autogen aten::cat
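For readers wondering what kind of user code hits `_local_scalar_dense` (a hedged illustration, not from the PR): it is the op that tensor-to-Python-scalar conversion dispatches to, for example via `Tensor.item()`, which forces materialization of a lazy tensor and is presumably why such tests were xfail'd above.

```python
import torch

t = torch.tensor([1.5])
# .item() (and implicit bool()/int() conversions) dispatch to
# aten::_local_scalar_dense under the hood, forcing the value
# to be computed and copied back to the host.
x = t.item()
print(x)  # 1.5
```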
Great follow-through folks! I'm so excited to see this land!
The LTC staging branch is finally at a point where it is stable enough against the latest PyTorch `master` that we can merge the staging branch into `main` and continue working off of `main` instead of off the staging branch.

All of the commits on this branch came from PRs that were individually reviewed internally at Cerebras and then got the stamp of approval to merge in by someone else on the Torch-MLIR project (mainly @silvasean).
Open to any feedback as this is a fairly major change. Having said that, the LTC backend is fairly well decoupled from the rest of Torch-MLIR, so it shouldn't break anything.
CC: @ke1337 @henrytwo