Add llama onnx export & onnxruntime support #975

fxmarty · 2023-04-17T08:21:15Z

As per title

gjain7 · 2023-05-03T10:11:34Z

Hi i was trying out obtaining onnx of llama model using optimum library using the command below,
optimum-cli export onnx --model decapoda-research/llama-13b-hf --optimize O2 llama_13b_onnx.

transformer version = "4.28.1"
optimum version = 1.8.2

the path of model is from hugging face. but i was facing an issue which i was not getting when I working with HuggingFaceM4/tiny-random-LlamaForCausalLM.

Framework not specified. Using pt to export to ONNX.
Downloading (…)lve/main/config.json: 100%|█████| 427/427 [00:00<00:00, 1.95MB/s]
Traceback (most recent call last):
File "/opt/conda/bin/optimum-cli", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 163, in main
service.run()
File "/opt/conda/lib/python3.10/site-packages/optimum/commands/export/onnx.py", line 203, in run
main_export(
File "/opt/conda/lib/python3.10/site-packages/optimum/exporters/onnx/main.py", line 169, in main_export
model = TasksManager.get_model_from_task(
File "/opt/conda/lib/python3.10/site-packages/optimum/exporters/tasks.py", line 1367, in get_model_from_task
model_class = TasksManager.get_model_class_for_task(
File "/opt/conda/lib/python3.10/site-packages/optimum/exporters/tasks.py", line 1085, in get_model_class_for_task
return getattr(loaded_library, model_class_name)
File "/opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1150, in getattr
raise AttributeError(f"module {self.name} has no attribute {name}")
AttributeError: module transformers has no attribute LLaMAForCausalLM

Could you help me what might be the issue in this scenario . I

eric8607242 · 2023-05-03T10:18:41Z

Me too.

Is there any example command to export the LLaMA to fp16 onnx?

Thanks!

regisss · 2023-05-03T16:38:58Z

@gjain7 The problem is that in decapoda-research/llama-13b-hf the model class specified in the config.json file should be LlamaForCausalLM and not LLaMAForCausalLM. I see that several PRs were opened in the repo to correct this but they have not been merged so far.
edit: my immediate recommendation is to try with another 13B checkpoint, such as this one for instance.

regisss · 2023-05-03T16:50:51Z

@eric8607242 Could you try the following command please?

optimum-cli export onnx --model path_to_model --fp16 --optimize O2 output_dir

eric8607242 · 2023-05-05T02:01:25Z

@regisss Hi, thanks for your response. It is very helpful!

gjain7 · 2023-05-05T12:01:05Z

@regisss Thanks for the suggestion , it did work with the model you specified (y)

Unlike other models llama was giving 3 .onnx files as output , decoder_model_merged.onnx. decoder_model.onnx , decoder_with_past_model.onnx . Along with that decoder_model_merged.onnx_data (48 gb) , decoder_model.onnx_data(48 gb) , decoder_with_past_model.onnx_data (48gb) . Why is it actually giving like this and if i want to proceed to triton which .onnx file should i go with .

It would be really helpful if these queries are answered. Thanks

regisss · 2023-05-05T19:25:24Z

@gjain7 Quoting @echarlaix here:

The decoder can be used to perform inference (in which case the past_key_values will be computed at each generation step), the combination of the decoder and the decoder_with_past can be used to perform inference leveraging the pkv (decoder enabling the first generation step while the decoder_with_past will perform the rest). The merged_decoder was recently integrated (available since v1.7 for ORTModel) and is the combination of the decoder and decoder_with_past models to have one single ONNX model, which is interesting in terms of memory but after that we are no longer able to apply graph optimization / quantization (which needs to be done prior to merging).

So, in your case, I recommend that you use decoder_model_merged.onnx.

gjain7 · 2023-05-09T07:54:26Z

@regisss Thank you , it was very useful information .Was able to clear the doubts .

xijianlou1 · 2023-10-17T20:24:30Z

Hi @regisss :) I'm trying to export TinyLlama-1.1B-intermediate-step-480k-1T to ONNX (both with optimum.onnxruntime and optimum-cli) but there it failed with dimension mismatch errors. Since Llama is supported by onnx export now. Do you mind give some insight about why this llama model cannot be exported? Here's the script and corresponding error:

import os
from pathlib import Path
import transformers
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained("PY007/TinyLlama-1.1B-intermediate-step-480k-1T", from_transformers=True)

The argument `from_transformers` is deprecated, and will be removed in optimum 2.0.  Use `export` instead
Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
        - default: The default ONNX variant.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using framework PyTorch: 2.1.0+cu118
Overriding 1 configuration item(s)
        - use_cache -> True
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:808: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_shape[-1] > 1:
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:146: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_len > self.max_seq_len_cached:
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:375: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:382: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:392: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
Saving external data to one file...
Using framework PyTorch: 2.1.0+cu118
Overriding 1 configuration item(s)
        - use_cache -> True
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python311\Lib\site-packages\optimum\onnxruntime\modeling_ort.py", line 647, in from_pretrained
    return super().from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\modeling_base.py", line 372, in from_pretrained
    return from_pretrained_method(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\onnxruntime\modeling_decoder.py", line 574, in _from_transformers
    main_export(
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\__main__.py", line 505, in main_export
    _, onnx_outputs = export_models(
                      ^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 752, in export_models
    export(
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 855, in export
    export_output = export_pytorch(
                    ^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 572, in export_pytorch
    onnx_export(
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 516, in export
    _export(
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 1596, in _export
    graph, params_dict, torch_out = _model_to_graph(
                                    ^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 1135, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 1011, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 915, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\jit\_trace.py", line 1285, in _get_trace_graph
    outs = ONNXTracedModule(
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\jit\_trace.py", line 133, in forward
    graph, out = torch._C._create_graph_by_tracing(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\jit\_trace.py", line 124, in wrapper
    outs.append(self.inner(*trace_inputs))
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\model_patcher.py", line 112, in patched_forward
    outputs = self.orig_forward(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py", line 1038, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py", line 925, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py", line 635, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py", line 365, in forward
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 32 but got size 4 for tensor number 1 in the list.

fxmarty · 2023-10-18T08:12:59Z

Hi @xijianlou1, thank you for the report. Can you try on the main branch? This is likely to be the sale as #1399 & to have been fixed if you install from source. We'll have an upcoming release.

nenkoru and others added 14 commits March 25, 2023 01:20

Add config for Llama

0446874

Register Llama in tasks

168d4cb

Add llama and it's corresponding tiny-random model from hf into tests

ab4a8bf

Add tests for modeling and exporters

7e06da1

Add entry for a Llama

95a8810

Add llama into supported normalized configs

c669d55

Add optimization support for llama

a39fbed

Change tiny-llama source to trl-internal-testing

43e98b7

Merge branch 'main' into main

8d1ecf4

Change tiny-llama source to trl-internal-testing

e4898e7

can I push?

f4a7f55

fix tests

ea7bdec

Merge branch 'master' into add-llama-onnx

92538ee

fix task map

3eeed4f

fxmarty mentioned this pull request Apr 17, 2023

Add support for exporting LLaMA to ONNX format #922

Closed

2 tasks

fxmarty merged commit 22b10a4 into huggingface:main Apr 17, 2023

fxmarty mentioned this pull request Apr 17, 2023

Support for LLaMA #918

Closed

regisss mentioned this pull request Apr 28, 2023

Community contribution - optimum.exporters.onnx support for new models! #555

Open

tikikun mentioned this pull request Jul 11, 2023

trt give random output value, diffs with onnxruntime NVIDIA/TensorRT#2928

Open

fxmarty mentioned this pull request Oct 5, 2023

Fix llama ONNX export #1432

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add llama onnx export & onnxruntime support #975

Add llama onnx export & onnxruntime support #975

fxmarty commented Apr 17, 2023

gjain7 commented May 3, 2023

eric8607242 commented May 3, 2023

regisss commented May 3, 2023 •

edited

Loading

regisss commented May 3, 2023

eric8607242 commented May 5, 2023

gjain7 commented May 5, 2023 •

edited

Loading

regisss commented May 5, 2023

gjain7 commented May 9, 2023

xijianlou1 commented Oct 17, 2023

fxmarty commented Oct 18, 2023

Add llama onnx export & onnxruntime support #975

Add llama onnx export & onnxruntime support #975

Conversation

fxmarty commented Apr 17, 2023

gjain7 commented May 3, 2023

eric8607242 commented May 3, 2023

regisss commented May 3, 2023 • edited Loading

regisss commented May 3, 2023

eric8607242 commented May 5, 2023

gjain7 commented May 5, 2023 • edited Loading

regisss commented May 5, 2023

gjain7 commented May 9, 2023

xijianlou1 commented Oct 17, 2023

fxmarty commented Oct 18, 2023

regisss commented May 3, 2023 •

edited

Loading

gjain7 commented May 5, 2023 •

edited

Loading