Remove unnecessary unsqueeze - squeeze in rotary positional embedding #26162

fxmarty · 2023-09-14T13:15:50Z

As per title, removes unnecessary operations in the model initialization and forward.
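
To make the title concrete, here is a minimal sketch of the kind of round trip being removed. It is not the actual diff: NumPy stands in for PyTorch, and the function names, shapes and `base` value are assumptions chosen for illustration. One variant stores the cosine cache as `[1, 1, seq_len, dim]` and squeezes it back before indexing with `position_ids`; the other keeps the cache as `[seq_len, dim]` from the start.

```python
import numpy as np


def build_angles(seq_len, dim, base=10000.0):
    """Return rotary angles of shape [seq_len, dim] (non-interleaved layout)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float64) / dim))
    freqs = np.outer(np.arange(seq_len, dtype=np.float64), inv_freq)  # [seq_len, dim / 2]
    return np.concatenate([freqs, freqs], axis=-1)


def cos_with_round_trip(seq_len, dim, position_ids):
    # Hypothetical "before" variant: add two leading axes, then squeeze them away.
    cache = np.cos(build_angles(seq_len, dim))[None, None, :, :]  # [1, 1, seq_len, dim]
    cos = cache[:, :, :seq_len, :]
    cos = np.squeeze(np.squeeze(cos, axis=1), axis=0)  # back to [seq_len, dim]
    return cos[position_ids][:, None, :, :]  # [batch, 1, seq_len, dim]


def cos_direct(seq_len, dim, position_ids):
    # Hypothetical "after" variant: keep the cache two-dimensional throughout.
    cache = np.cos(build_angles(seq_len, dim))  # [seq_len, dim]
    return cache[:seq_len][position_ids][:, None, :, :]


position_ids = np.array([[0, 1, 2, 3], [2, 3, 4, 5]])
a = cos_with_round_trip(8, 4, position_ids)
b = cos_direct(8, 4, position_ids)
print(a.shape, np.array_equal(a, b))
```

Both variants produce the same tensor, so the extra axes only add work at initialization and in the forward pass.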

amyeroberts

LGTM - thanks for fixing!

ArthurZucker

Thanks! I now wonder why we had this in the first place.
Make sure to also apply this to GPTNeoX and anywhere else that has the same logic but no `# Copied from` marker 😉

ArthurZucker · 2023-09-14T14:33:03Z

FYI @gante if there is anything we are missing here?

fxmarty · 2023-09-14T14:34:48Z

There's the same unsqueeze - squeeze in falcon. Maybe I am misunderstanding something in rotary positional embedding.

Edit:

This test fails on this branch:

FAILED tests/models/llama/test_modeling_llama.py::CodeLlamaIntegrationTest::test_model_7b_logits - AssertionError: Lists differ: ['<s>▁<PRE> def remove_non_ascii(s: str) -> st[893 chars]ID>'] != ['<s> <PRE> def remove_non_ascii(s: str) -> st[893 chars...

However, it seems unrelated (it also fails on 866df66 and tokenizers==0.13.3).

ArthurZucker · 2023-09-14T15:27:20Z

Yep, it's unrelated. I have seen this failing; the fast tokenizer is not splitting properly.

fxmarty · 2023-09-16T08:08:27Z

Updated other affected archs.

PR #22785 was great, but incomplete. cc @fpgaminer

I ran the slow tests (`RUN_SLOW=1 pytest tests/models/gpt_neox/test_modeling_gpt_neox.py -s -vvvvv`, `RUN_SLOW=1 pytest tests/models/idefics/ -s -vvvvv`, `RUN_SLOW=1 pytest tests/models/falcon/ -s -vvvvv`); no more tests fail than on main.

Related PR: #25830, maybe @ArthurZucker you want to merge that first?

Note: some slow tests do not pass, but they don't pass on main 0a55d9f either:

FAILED tests/models/falcon/test_modeling_falcon.py::FalconModelTest::test_cpu_offload - AssertionError: False is not true
FAILED tests/models/falcon/test_modeling_falcon.py::FalconModelTest::test_disk_offload - AssertionError: False is not true
FAILED tests/models/falcon/test_modeling_falcon.py::FalconModelTest::test_feed_forward_chunking - AssertionError: False is not true
FAILED tests/models/falcon/test_modeling_falcon.py::FalconModelTest::test_left_padding_compatibility - AssertionError: False is not true

FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_cpu_offload - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_determinism - ValueError: zero-size array to reduction operation maximum which has no identity
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_disk_offload - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_feed_forward_chunking - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_model_parallelism - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_cpu_offload - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_determinism - ValueError: zero-size array to reduction operation maximum which has no identity
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_disk_offload - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_feed_forward_chunking - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_model_parallelism - AssertionError: False is not true
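
The "fails on main as well" comparison above can be automated. The following is only a sketch, assuming the short test summary format shown in this thread (`FAILED <node id> - <message>`); `failed_ids` and `new_failures` are hypothetical helper names, not part of pytest or transformers.

```python
def failed_ids(pytest_summary: str) -> set[str]:
    """Collect test node ids from the 'FAILED ...' lines of a pytest summary."""
    ids = set()
    for line in pytest_summary.splitlines():
        line = line.strip()
        if line.startswith("FAILED "):
            # "FAILED <node id> - <message>": keep only the node id.
            ids.add(line[len("FAILED "):].split(" - ", 1)[0].strip())
    return ids


def new_failures(branch_summary: str, main_summary: str) -> set[str]:
    """Return tests that fail on the branch but not on main."""
    return failed_ids(branch_summary) - failed_ids(main_summary)


main_summary = """
FAILED tests/models/falcon/test_modeling_falcon.py::FalconModelTest::test_cpu_offload - AssertionError: False is not true
FAILED tests/models/falcon/test_modeling_falcon.py::FalconModelTest::test_disk_offload - AssertionError: False is not true
"""
branch_summary = main_summary  # same failures on both sides

print(sorted(new_failures(branch_summary, main_summary)))  # prints []
```

An empty result means the branch introduces no failures beyond those already present on main. Node ids that themselves contain " - " would need a more careful parser.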

ArthurZucker

Good for me, but let's wait for #25830 to be merged first! Also run `make fix-copies` and `make fixup` 😉

fxmarty · 2023-09-28T16:45:34Z

Do you think this can make it into the release?

ArthurZucker · 2023-09-29T06:27:05Z

Actually let's wait a bit in case this breaks things!

ArthurZucker

Make sure to rebase to pick up the changes applied to GPTNeoX and Idefics.

fxmarty · 2023-10-05T08:33:34Z

sure

fxmarty · 2023-10-05T09:50:48Z

Should be in a good state now.

Summary of slow tests:

`RUN_SLOW=1 CUDA_VISIBLE_DEVICES=0 pytest tests/models/llama/ -s -vvvvv` fails on the following (same as main):

FAILED tests/models/llama/test_modeling_llama.py::CodeLlamaIntegrationTest::test_model_7b_logits
E       AssertionError: Lists differ: ['<s>▁<PRE> def remove_non_ascii(s: str) -> st[893 chars]ID>'] != ['<s> <PRE> def remove_non_ascii(s: str) -> st[893 chars]ID>']

`RUN_SLOW=1 CUDA_VISIBLE_DEVICES=0 pytest tests/models/idefics/ -s -vvvvv` fails on the following (same as main):

FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_cpu_offload - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_determinism - ValueError: zero-size array to reduction operation maximum which has no identity
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_disk_offload - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsModelTest::test_feed_forward_chunking - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_cpu_offload - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_determinism - ValueError: zero-size array to reduction operation maximum which has no identity
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_disk_offload - AssertionError: False is not true
FAILED tests/models/idefics/test_modeling_idefics.py::IdeficsForVisionText2TextTest::test_feed_forward_chunking - AssertionError: False is not true

`RUN_SLOW=1 CUDA_VISIBLE_DEVICES=0 pytest tests/models/mistral/ -s -vvvvv` fails on the following (same as main):

FAILED tests/models/mistral/test_modeling_mistral.py::MistralIntegrationTest::test_model_7b_generation - AssertionError: 'My f[17 chars]t is mayonnaise. I love it on sandwiches, in s[13 chars]gers' != 'My f[17 chars]t is 100% ketchup. I love it on everythin...
FAILED tests/models/mistral/test_modeling_mistral.py::MistralIntegrationTest::test_model_7b_logits - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index ...

`RUN_SLOW=1 CUDA_VISIBLE_DEVICES=0 pytest tests/models/falcon/ -s -vvvvv` fails on the following (same as main):

FAILED tests/models/falcon/test_modeling_falcon.py::FalconModelTest::test_left_padding_compatibility - AssertionError: False is not true

`RUN_SLOW=1 CUDA_VISIBLE_DEVICES=0 pytest tests/models/gpt_neox/ -s -vvvvv` passes.

fxmarty · 2023-10-05T12:13:32Z

This PR fixes tests/models/mistral/test_modeling_mistral.py::MistralIntegrationTest::test_model_7b_logits as well.

The test tests/models/mistral/test_modeling_mistral.py::MistralIntegrationTest::test_model_7b_generation (which uses slow tokenizers) seems to have never worked (on 72958fc), cc @Bam4d, so I will ignore it for now.

ArthurZucker

Thanks a lot! Good to go 😉

…huggingface#26162) * remove unnecessary unsqueeze-squeeze in llama * correct other models * fix * revert gpt_neox_japanese * fix copie * fix test

gante · 2023-10-17T11:42:59Z

@fxmarty thank you for the fix!

I suppose this redundant pattern got into gpt_neox, and we kept copying it over to other new models with RoPE :)

### Description

This PR contains fusion-level and kernel-level optimizations for [Meta's LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

Some of the added optimizations include:
- SimplifiedLayerNorm changes
  - Fusions for multiple variants
- SkipSimplifiedLayerNorm changes
  - Kernel support for CPU
- Rotary embeddings (previously did not exist)
  - Fusions for multiple variants
  - CPU and CUDA kernels
  - Supports interleaving and non-interleaving in the same kernels
  - Optimized cache that requires half of its originally exported sizes
    - Reduced from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)`
- Multi-head attention
  - Support for 2D and 3D attention masks
- Group query attention (for FP16 CUDA and INT4 CUDA)
  - Integration with flash attention v2 and past-present buffer sharing
  - Removes need for `attention_mask` input as it is supported in the kernel
- 4 bit quantization
  - `block_size` parameter is available for customizing
- Support the new changes for [Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
- Support combinations of the below variants (ex: export ORT version and run with Optimum)

Supported variants of LLaMA-2 include:
- [ORT version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
  - Produces one ONNX file that is already optimized (and quantized if requested)
  - Integrates with Optimum
- [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
  - Already exported and available off-the-shelf
  - Faster versions of those models will be uploaded there soon
- [Hugging Face version](https://huggingface.co/meta-llama)
  - Models that end with `-hf`
  - Some older and current versions of [`transformers`](https://github.com/huggingface/transformers) and [`optimum`](https://github.com/huggingface/optimum) that export the model to ONNX differently
    - Note that while some older versions are supported, it is recommended to use the latest package versions.

### Usage

To use the optimizations, please see `README.md` for details. Please note the various `requirements.txt` files for the package versions recommended in order to use these changes.

To run the ORT transformer optimizer separately, run the script as follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
```

### Motivation and Context

This PR helps the following issues:
- #14997
- #16254
- #17681
- #17925
- microsoft/onnxruntime-inference-examples#320

This PR uses changes from the following PRs:
- pytorch/pytorch#104468
- pytorch/pytorch#109759
- #17020
- #17674
- #17890
- #17920
- huggingface/transformers#26162
- huggingface/optimum#1257
- huggingface/optimum#1289
- huggingface/optimum#1462

### New TorchDynamo Exporter (experimental stage)

This PR uses changes from the following issues and PRs to begin supporting the [new TorchDynamo exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):
- huggingface/transformers#26307
- pytorch/pytorch#104903
- pytorch/pytorch#105040
- microsoft/onnxscript#847
- microsoft/onnxscript#862
- microsoft/onnxscript#493
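
The halved rotary cache mentioned in the description above, `(max_sequence_length, head_size)` reduced to `(max_sequence_length, head_size / 2)`, can be illustrated with a small sketch. It assumes the non-interleaved layout in which the angle table is built by concatenating the same frequencies twice, so the second half of each row repeats the first. NumPy stands in for the real kernels, and `rope_cos_cache` is a hypothetical helper, not an ONNX Runtime or transformers function.

```python
import numpy as np


def rope_cos_cache(max_sequence_length, head_size, base=10000.0):
    """Full cosine cache of shape (max_sequence_length, head_size)."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_size, 2, dtype=np.float64) / head_size))
    freqs = np.outer(np.arange(max_sequence_length, dtype=np.float64), inv_freq)
    # Assumed non-interleaved layout: the same frequencies appear twice per row.
    return np.cos(np.concatenate([freqs, freqs], axis=-1))


head_size = 8
full = rope_cos_cache(16, head_size)

# Keep only the first half of every row: (max_sequence_length, head_size / 2).
half = full[:, : head_size // 2]

# The full cache can be rebuilt from the half-sized one when needed.
rebuilt = np.concatenate([half, half], axis=-1)
print(half.shape, np.allclose(full, rebuilt))
```

Under that assumption, storing the half-sized table loses no information, which is one plausible reading of why the smaller cache suffices.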

…huggingface#26162)

* remove unnecessary unsqueeze-squeeze in llama
* correct other models
* fix
* revert gpt_neox_japanese
* fix copie
* fix test

### Description

This PR contains fusion-level and kernel-level optimizations for [Meta's LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

Some of the added optimizations include:

- SimplifiedLayerNorm changes
  - Fusions for multiple variants
- SkipSimplifiedLayerNorm changes
  - Kernel support for CPU
- Rotary embeddings (previously did not exist)
  - Fusions for multiple variants
  - CPU and CUDA kernels
  - Supports interleaving and non-interleaving in the same kernels
  - Optimized cache that requires half of its originally exported sizes
    - Reduced from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)`
- Multi-head attention
  - Support for 2D and 3D attention masks
- Group query attention (for FP16 CUDA and INT4 CUDA)
  - Integration with flash attention v2 and past-present buffer sharing
  - Removes need for `attention_mask` input as it is supported in the kernel
- 4 bit quantization
  - `block_size` parameter is available for customizing
- Support the new changes for [Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
- Support combinations of the below variants (ex: export ORT version and run with Optimum)

Supported variants of LLaMA-2 include:

- [ORT version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
  - Produces one ONNX file that is already optimized (and quantized if requested)
  - Integrates with Optimum
- [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
  - Already exported and available off-the-shelf
  - Faster versions of those models will be uploaded there soon
- [Hugging Face version](https://huggingface.co/meta-llama)
  - Models that end with `-hf`
  - Some older and current versions of [`transformers`](https://github.com/huggingface/transformers) and [`optimum`](https://github.com/huggingface/optimum) that export the model to ONNX differently
    - Note that while some older versions are supported, it is recommended to use the latest package versions.

### Usage

To use the optimizations, please see `README.md` for details. Please note the various `requirements.txt` files for the package versions recommended in order to use these changes.

To run the ORT transformer optimizer separately, run the script as follows:

```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
```

### Motivation and Context

This PR helps the following issues:

- microsoft#14997
- microsoft#16254
- microsoft#17681
- microsoft#17925
- microsoft/onnxruntime-inference-examples#320

This PR uses changes from the following PRs:

- pytorch/pytorch#104468
- pytorch/pytorch#109759
- microsoft#17020
- microsoft#17674
- microsoft#17890
- microsoft#17920
- huggingface/transformers#26162
- huggingface/optimum#1257
- huggingface/optimum#1289
- huggingface/optimum#1462

### New TorchDynamo Exporter (experimental stage)

This PR uses changes from the following issues and PRs to begin supporting the [new TorchDynamo exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):

- huggingface/transformers#26307
- pytorch/pytorch#104903
- pytorch/pytorch#105040
- microsoft/onnxscript#847
- microsoft/onnxscript#862
- microsoft/onnxscript#493
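
The commit message above says the rotary cache shrinks from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)`. A minimal sketch of why that can be lossless, assuming the exported cache follows the layout where the angle table is concatenated with itself along the last axis (as in the Hugging Face LLaMA rotary embedding). The function names and the `base` default are illustrative only and are not taken from ONNX Runtime:

```python
import math


def rotary_angles(max_sequence_length, head_size, base=10000.0):
    """angles[pos][i] = pos * base**(-2i / head_size), for i < head_size / 2."""
    half = head_size // 2
    inv_freq = [base ** (-2.0 * i / head_size) for i in range(half)]
    return [[pos * f for f in inv_freq] for pos in range(max_sequence_length)]


def full_cos_cache(max_sequence_length, head_size):
    # Assumed exported layout: each row is the angle row repeated twice,
    # giving head_size entries per position.
    return [[math.cos(a) for a in row + row]
            for row in rotary_angles(max_sequence_length, head_size)]


def half_cos_cache(max_sequence_length, head_size):
    # Half-width layout: head_size / 2 entries per position.
    return [[math.cos(a) for a in row]
            for row in rotary_angles(max_sequence_length, head_size)]


full = full_cos_cache(16, 8)
half = half_cos_cache(16, 8)

# Every full-width row is just the half-width row repeated twice,
# so storing only the first half loses no information.
assert all(rf == rh + rh for rf, rh in zip(full, half))
print(len(full[0]), len(half[0]))  # prints: 8 4
```

The same argument applies to the sine cache. A kernel that reads the half-width table presumably indexes it with the column index modulo `head_size / 2`.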

remove unnecessary unsqueeze-squeeze in llama

21675a8
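
A rough sketch of what "unnecessary unsqueeze-squeeze" refers to, inferred from the PR title and commit messages rather than copied from the diff. It assumes the cos/sin cache used to be stored with two leading singleton axes that were squeezed away again on every forward pass. NumPy stands in for PyTorch here (`[:, None]` plays the role of `unsqueeze(1)`), and all variable names are hypothetical:

```python
import numpy as np

seq_len, head_dim = 6, 4
rng = np.random.default_rng(0)
cos_table = rng.standard_normal((seq_len, head_dim))  # stand-in for cached cos values
position_ids = np.array([[0, 1, 2], [3, 4, 5]])       # [batch, q_len]

# Before (assumed): the cache carries two singleton axes that are
# squeezed away before it can be indexed by position.
cos_cached_4d = cos_table[None, None, :, :]            # [1, 1, seq_len, head_dim]
cos_before = cos_cached_4d.squeeze(1).squeeze(0)       # back to [seq_len, head_dim]
cos_before = cos_before[position_ids][:, None, :, :]   # [batch, 1, q_len, head_dim]

# After (assumed): the cache stays 2D and is indexed directly.
cos_after = cos_table[position_ids][:, None, :, :]     # [batch, 1, q_len, head_dim]

assert cos_before.shape == (2, 1, 3, 4)
assert np.array_equal(cos_before, cos_after)
```

Both paths yield the same array, so dropping the extra axes changes no outputs while removing a pair of operations from the forward pass and from any exported graph.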

fxmarty requested review from ArthurZucker, amyeroberts and LysandreJik September 14, 2023 13:16

amyeroberts approved these changes Sep 14, 2023

ArthurZucker approved these changes Sep 14, 2023

fxmarty added 2 commits September 16, 2023 09:58

correct other models

768b8b0

fix

4183f9f

fxmarty changed the title ~~Llama: remove unnecessary unsqueeze - squeeze~~ Remove unnecessary unsqueeze - squeeze in rotary positional embedding Sep 16, 2023

revert gpt_neox_japanese

e0bab59

fxmarty requested review from ArthurZucker and amyeroberts September 16, 2023 08:07

ArthurZucker approved these changes Sep 18, 2023

fxmarty mentioned this pull request Sep 26, 2023

Weirdness when ONNX optimize/exporting and quantizing Llama2 - fails on both, have tried python and CLI approaches huggingface/optimum#1409

Closed

ArthurZucker reviewed Oct 5, 2023

fxmarty added 2 commits October 5, 2023 11:07

Merge branch 'main' into remove-back-and-forth-unsqueeze-squeeze

5443100

fix copie

4f89883

fix test

69a4840

fxmarty requested a review from ArthurZucker October 5, 2023 12:14

ArthurZucker approved these changes Oct 6, 2023

fxmarty merged commit 6484530 into huggingface:main Oct 6, 2023
3 checks passed

kunal-vaishnavi mentioned this pull request Oct 18, 2023

LLaMA Model Optimization microsoft/onnxruntime#18021

Merged

younesbelkada mentioned this pull request Nov 3, 2023

[core / attention] Fix fused attention generation with newest transformers version casper-hansen/AutoAWQ#146

Merged

ArthurZucker mentioned this pull request Dec 3, 2023

Bug Fixed GPTNeoX Flax supports #25334

Closed

fxmarty commented Sep 14, 2023

HuggingFaceDocBuilderDev commented Sep 14, 2023

amyeroberts left a comment

ArthurZucker left a comment

ArthurZucker commented Sep 14, 2023

fxmarty commented Sep 14, 2023

ArthurZucker commented Sep 14, 2023

fxmarty commented Sep 16, 2023

ArthurZucker left a comment

fxmarty commented Sep 28, 2023

ArthurZucker commented Sep 29, 2023

ArthurZucker left a comment

fxmarty commented Oct 5, 2023

fxmarty commented Oct 5, 2023

fxmarty commented Oct 5, 2023

ArthurZucker left a comment

gante commented Oct 17, 2023
