Commit
On the Llama3 8B model with no activation checkpointing (AC), `compiled_rmsnorm` is ~9% faster than `rmsnorm` but ~2% slower than `fused_rmsnorm`. Details for each variant are shown below.

rmsnorm
<img width="757" alt="image" src="https://github.com/pytorch/torchtitan/assets/150487191/79645518-e38b-4ddb-b01d-b0c93ec27dd4">

compiled_rmsnorm
<img width="754" alt="image" src="https://github.com/pytorch/torchtitan/assets/150487191/c457b388-793f-452b-9bce-17bc1823df66">

fused_rmsnorm
<img width="753" alt="image" src="https://github.com/pytorch/torchtitan/assets/150487191/ea1db7ad-5887-4efa-9788-e708e4b40428">

[ghstack-poisoned]