Add llama onnx export & onnxruntime support #975

Merged
14 commits, merged Apr 17, 2023

Conversation

fxmarty
Contributor

@fxmarty fxmarty commented Apr 17, 2023

As per title
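
For reference, a minimal usage sketch of what this PR enables (the checkpoint name and output directory are placeholders, not taken from the PR, and the exact flags/arguments may vary across optimum versions):

optimum-cli export onnx --model huggyllama/llama-7b llama_7b_onnx/

or, exporting on the fly from Python and running generation through ONNX Runtime:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# export=True converts the PyTorch checkpoint to ONNX at load time
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = ORTModelForCausalLM.from_pretrained("huggyllama/llama-7b", export=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))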

@gjain7

gjain7 commented May 3, 2023

Hi, I was trying to obtain an ONNX export of a LLaMA model with the optimum library using the command below:
optimum-cli export onnx --model decapoda-research/llama-13b-hf --optimize O2 llama_13b_onnx

transformers version: 4.28.1
optimum version: 1.8.2

The model is pulled from the Hugging Face Hub, but I am hitting an issue that I did not get when working with HuggingFaceM4/tiny-random-LlamaForCausalLM.

Framework not specified. Using pt to export to ONNX.
Downloading (…)lve/main/config.json: 100%|█████| 427/427 [00:00<00:00, 1.95MB/s]
Traceback (most recent call last):
  File "/opt/conda/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/opt/conda/lib/python3.10/site-packages/optimum/commands/export/onnx.py", line 203, in run
    main_export(
  File "/opt/conda/lib/python3.10/site-packages/optimum/exporters/onnx/__main__.py", line 169, in main_export
    model = TasksManager.get_model_from_task(
  File "/opt/conda/lib/python3.10/site-packages/optimum/exporters/tasks.py", line 1367, in get_model_from_task
    model_class = TasksManager.get_model_class_for_task(
  File "/opt/conda/lib/python3.10/site-packages/optimum/exporters/tasks.py", line 1085, in get_model_class_for_task
    return getattr(loaded_library, model_class_name)
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1150, in __getattr__
    raise AttributeError(f"module {self.__name__} has no attribute {name}")
AttributeError: module transformers has no attribute LLaMAForCausalLM

Could you help me figure out what the issue might be in this scenario?

@eric8607242

Me too.

Is there an example command to export LLaMA to FP16 ONNX?

Thanks!

@regisss
Contributor

regisss commented May 3, 2023

@gjain7 The problem is that in decapoda-research/llama-13b-hf the model class specified in the config.json file should be LlamaForCausalLM and not LLaMAForCausalLM. I see that several PRs were opened in the repo to correct this but they have not been merged so far.
Edit: my immediate recommendation is to try another 13B checkpoint, such as this one for instance.
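
If you still want that exact checkpoint, a possible workaround (a rough, untested sketch; the local directory name is arbitrary, and note the checkpoint has other known quirks such as its tokenizer class) is to download it locally and fix the architectures entry in config.json before exporting:

import json
from huggingface_hub import snapshot_download

# Download the repo to a local folder that is safe to edit
local_dir = snapshot_download("decapoda-research/llama-13b-hf", local_dir="llama-13b-local")

# Replace the misspelled class name with the one transformers actually exposes
config_path = f"{local_dir}/config.json"
with open(config_path) as f:
    config = json.load(f)
config["architectures"] = ["LlamaForCausalLM"]
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Then export from the local path:
#   optimum-cli export onnx --model llama-13b-local --optimize O2 llama_13b_onnx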

@regisss
Contributor

regisss commented May 3, 2023

@eric8607242 Could you try the following command please?

optimum-cli export onnx --model path_to_model --fp16 --optimize O2 output_dir
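
The same export can also be triggered from Python. A rough sketch (the fp16, optimize, and device keyword names are assumptions based on the CLI flags and may differ across optimum versions):

from optimum.exporters.onnx import main_export

# Export the checkpoint to ONNX with fp16 weights and O2 graph optimizations.
# fp16 conversion is typically done on GPU, hence device="cuda".
main_export(
    model_name_or_path="path_to_model",
    output="output_dir",
    fp16=True,
    optimize="O2",
    device="cuda",
)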

@eric8607242

@regisss Hi, thanks for your response. It is very helpful!

@gjain7

gjain7 commented May 5, 2023

@regisss Thanks for the suggestion, it did work with the model you specified (y)

Unlike other models, LLaMA gives three .onnx files as output: decoder_model_merged.onnx, decoder_model.onnx, and decoder_with_past_model.onnx, along with decoder_model_merged.onnx_data (48 GB), decoder_model.onnx_data (48 GB), and decoder_with_past_model.onnx_data (48 GB). Why does it produce these files, and if I want to proceed to Triton, which .onnx file should I go with?

It would be really helpful if these queries were answered. Thanks!

@regisss
Contributor

regisss commented May 5, 2023

@gjain7 Quoting @echarlaix here:

The decoder can be used to perform inference (in which case the past_key_values will be computed at each generation step), the combination of the decoder and the decoder_with_past can be used to perform inference leveraging the pkv (decoder enabling the first generation step while the decoder_with_past will perform the rest). The merged_decoder was recently integrated (available since v1.7 for ORTModel) and is the combination of the decoder and decoder_with_past models to have one single ONNX model, which is interesting in terms of memory but after that we are no longer able to apply graph optimization / quantization (which needs to be done prior to merging).

So, in your case, I recommend that you use decoder_model_merged.onnx.
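
Once exported, a minimal loading sketch (the use_merged argument follows the ORTModelForCausalLM API around optimum 1.7/1.8 and is an assumption here; "output_dir" is whatever directory you exported to):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# use_merged=True selects decoder_model_merged.onnx, which handles both the first
# generation step and the following steps that reuse the past key/values.
model = ORTModelForCausalLM.from_pretrained("output_dir", use_merged=True)
tokenizer = AutoTokenizer.from_pretrained("output_dir")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))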

@gjain7

gjain7 commented May 9, 2023

@regisss Thank you, that was very useful information. It cleared up my doubts.

@xijianlou1

Hi @regisss :) I'm trying to export TinyLlama-1.1B-intermediate-step-480k-1T to ONNX (both with optimum.onnxruntime and optimum-cli), but it fails with dimension mismatch errors. Since LLaMA is now supported by the ONNX export, would you mind giving some insight into why this LLaMA model cannot be exported? Here's the script and the corresponding error:

import os
from pathlib import Path
import transformers
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained("PY007/TinyLlama-1.1B-intermediate-step-480k-1T", from_transformers=True)
The argument `from_transformers` is deprecated, and will be removed in optimum 2.0.  Use `export` instead
Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
        - default: The default ONNX variant.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using framework PyTorch: 2.1.0+cu118
Overriding 1 configuration item(s)
        - use_cache -> True
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:808: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_shape[-1] > 1:
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:146: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_len > self.max_seq_len_cached:
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:375: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:382: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py:392: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
Saving external data to one file...
Using framework PyTorch: 2.1.0+cu118
Overriding 1 configuration item(s)
        - use_cache -> True
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python311\Lib\site-packages\optimum\onnxruntime\modeling_ort.py", line 647, in from_pretrained
    return super().from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\modeling_base.py", line 372, in from_pretrained
    return from_pretrained_method(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\onnxruntime\modeling_decoder.py", line 574, in _from_transformers
    main_export(
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\__main__.py", line 505, in main_export
    _, onnx_outputs = export_models(
                      ^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 752, in export_models
    export(
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 855, in export
    export_output = export_pytorch(
                    ^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 572, in export_pytorch
    onnx_export(
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 516, in export
    _export(
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 1596, in _export
    graph, params_dict, torch_out = _model_to_graph(
                                    ^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 1135, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 1011, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\onnx\utils.py", line 915, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\jit\_trace.py", line 1285, in _get_trace_graph
    outs = ONNXTracedModule(
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\jit\_trace.py", line 133, in forward
    graph, out = torch._C._create_graph_by_tracing(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\jit\_trace.py", line 124, in wrapper
    outs.append(self.inner(*trace_inputs))
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\optimum\exporters\onnx\model_patcher.py", line 112, in patched_forward
    outputs = self.orig_forward(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py", line 1038, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py", line 925, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py", line 635, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xijianlou\AppData\Roaming\Python\Python311\site-packages\transformers\models\llama\modeling_llama.py", line 365, in forward
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 32 but got size 4 for tensor number 1 in the list.

@fxmarty
Contributor Author

fxmarty commented Oct 18, 2023

Hi @xijianlou1, thank you for the report. Can you try on the main branch? This is likely the same as #1399 and should have been fixed if you install from source. We'll have an upcoming release.
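
For reference, installing optimum from source is the usual pip-from-git command:

pip install --upgrade git+https://github.com/huggingface/optimum.git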
