
Failure to compile a model on Inf1 with optimum-cli due to lack of arguments #471

Closed
2 of 4 tasks
tagucci opened this issue Feb 9, 2024 · 2 comments


tagucci commented Feb 9, 2024

System Info

- `optimum` version: 1.16.2
- `transformers` version: 4.36.2
- Platform: Linux-5.15.0-1051-aws-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.20.1
- PyTorch version (GPU?): 1.13.1+cu117 (cuda availabe: False)
- Tensorflow version (GPU?): not installed (cuda availabe: NA)

Who can help?

@JingyaHuang

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

When I attempted to compile the bert-base-uncased model on an Inf1 instance following the official documentation, the following error occurred. I used the pre-built PyTorch environment for Inf1 provided by the "Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20240102".

$ source /opt/aws_neuron_venv_pytorch_inf1/bin/activate
$ pip install optimum[neuron]
$ optimum-cli export neuron \
  --model bert-base-uncased \
  --sequence_length 128 \
  --batch_size 1 \
  bert_neuron/

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/aws_neuron_venv_pytorch_inf1/lib/python3.8/site-packages/optimum/exporters/neuron/__main__.py", line 541, in <module>
    main()
  File "/opt/aws_neuron_venv_pytorch_inf1/lib/python3.8/site-packages/optimum/exporters/neuron/__main__.py", line 487, in main
    is_sentence_transformers = args.library_name == "sentence_transformers"
AttributeError: 'Namespace' object has no attribute 'library_name'
Traceback (most recent call last):
  File "/opt/aws_neuron_venv_pytorch_inf1/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/opt/aws_neuron_venv_pytorch_inf1/lib/python3.8/site-packages/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/opt/aws_neuron_venv_pytorch_inf1/lib/python3.8/site-packages/optimum/commands/export/neuron.py", line 137, in run
    subprocess.run(full_command, shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m optimum.exporters.neuron --model bert-base-uncased --sequence_length 128 --batch_size 1 bert_neuron/' returned non-zero exit status 1.
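For context on why two tracebacks appear: the `optimum-cli` wrapper shells out to `python3 -m optimum.exporters.neuron` via `subprocess.run(..., check=True)`, so the child process prints its own `AttributeError` traceback, exits non-zero, and the parent then raises `CalledProcessError`. A minimal sketch of that propagation (illustrative only, not the actual optimum code):

```python
# Sketch: an uncaught exception in a child process becomes a non-zero exit
# status, which check=True in the parent turns into CalledProcessError.
import subprocess
import sys

try:
    # Simulate the exporter child failing with an AttributeError.
    subprocess.run(
        [sys.executable, "-c", "raise AttributeError('library_name')"],
        check=True,
    )
except subprocess.CalledProcessError as e:
    print(e.returncode)  # 1
```

This is why the underlying bug is the `AttributeError` in the child, not the `CalledProcessError` shown last.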

Expected behavior

This error occurs because neuron.py does not register arguments such as --library_name, --subfolder, --compiler_workdir, --disable-weights-neff-inline, and the other arguments in the level_group category that neuronx.py defines. When I modified neuron.py to accept the same arguments as neuronx.py, the model compiled successfully. The output is as follows:
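A minimal sketch of the failure mode (names are illustrative, not the actual optimum parser): `args.library_name` raises `AttributeError` because the Inf1 parser never registers that flag, while the code path that reads it is shared with the neuronx parser. Registering the missing optional arguments, or reading them defensively with `getattr`, avoids the crash:

```python
# Sketch: a parser that lacks --library_name makes args.library_name raise
# AttributeError. Registering the flag (as neuronx.py does) or reading it
# with getattr() keeps the shared code path working.
import argparse

parser = argparse.ArgumentParser(prog="optimum-cli export neuron")
parser.add_argument("--model", required=True)
parser.add_argument("--sequence_length", type=int)
parser.add_argument("--batch_size", type=int)
# Flags that were missing on the Inf1 (neuron.py) side:
parser.add_argument("--library_name", type=str, default=None)
parser.add_argument("--subfolder", type=str, default="")

args = parser.parse_args(
    ["--model", "bert-base-uncased", "--sequence_length", "128", "--batch_size", "1"]
)
# Defensive access also works when a parser variant may lack the flag:
library_name = getattr(args, "library_name", None)
is_sentence_transformers = library_name == "sentence_transformers"
print(is_sentence_transformers)  # False
```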

$ optimum-cli export neuron \
  --model bert-base-uncased \
  --sequence_length 128 \
  --batch_size 1 \
  bert_neuron/

config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 570/570 [00:00<00:00, 91.6kB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████| 440M/440M [00:01<00:00, 245MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'cls.seq_relationship.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 4.94kB/s]
vocab.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 698kB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 42.9MB/s]
***** Compiling bert-base-uncased *****
INFO:Neuron:There are 3 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/compiler/neuron-cc/neuron-cc-ops/neuron-cc-ops-pytorch.html)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 563, fused = 546, percent fused = 96.98%
INFO:Neuron:Compiler args type is <class 'list'> value is ['--fast-math', 'none']
INFO:Neuron:Compiling function _NeuronGraph$704 with neuron-cc
INFO:Neuron:Compiling with command line: '/opt/aws_neuron_venv_pytorch_inf1/bin/neuron-cc compile /tmp/tmptgmdk1g3/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmptgmdk1g3/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"], "2:0": [[30522, 768], "float32"]}, "outputs": ["BertForMaskedLM_1/BertOnlyMLMHead_7/BertLMPredictionHead_1/Linear_4/aten_linear/Add:0"]} --fast-math none --verbose 35'
.......
Compiler status PASS
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 563, compiled = 546, percent compiled = 96.98%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 96
INFO:Neuron: => aten::add: 36
INFO:Neuron: => aten::contiguous: 12
INFO:Neuron: => aten::div: 12
INFO:Neuron: => aten::dropout: 37
INFO:Neuron: => aten::gelu: 13
INFO:Neuron: => aten::layer_norm: 26
INFO:Neuron: => aten::linear: 74
INFO:Neuron: => aten::matmul: 24
INFO:Neuron: => aten::permute: 48
INFO:Neuron: => aten::size: 96
INFO:Neuron: => aten::softmax: 12
INFO:Neuron: => aten::transpose: 12
INFO:Neuron: => aten::view: 48
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 1 [supported]
INFO:Neuron: => aten::add: 2 [supported]
INFO:Neuron: => aten::add_: 1 [supported]
INFO:Neuron: => aten::embedding: 3 [not supported]
INFO:Neuron: => aten::mul: 1 [supported]
INFO:Neuron: => aten::rsub: 1 [supported]
INFO:Neuron: => aten::size: 1 [supported]
INFO:Neuron: => aten::slice: 4 [supported]
INFO:Neuron: => aten::to: 1 [supported]
INFO:Neuron: => aten::unsqueeze: 2 [supported]
[Compilation Time] 237.75 seconds.
[Total compilation Time] 237.75 seconds.
Validating bert-base-uncased model...
	- Validating Neuron Model output "logits":
		-[✓] (1, 128, 30522) matches (1, 128, 30522)
		-[x] values not close enough, max diff: 0.28158092498779297 (atol: 0.001)
The maximum absolute difference between the output of the reference model and the Neuron exported model is not within the set tolerance 0.001:
- logits: max diff = 0.28158092498779297
The Neuron export succeeded and the exported model was saved at: bert_neuron
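The validation step above compares the reference model's logits against the Neuron model's logits element-wise against an absolute tolerance; the shapes match, but the max difference (~0.28) exceeds atol=0.001, hence the warning. A hedged sketch of that check (function names are illustrative, not the exporter's actual API):

```python
# Sketch of an element-wise absolute-tolerance check like the one the
# exporter logs: shapes can match while values still exceed atol.
def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

reference = [0.0, 1.5, -2.0]       # reference (CPU) logits
neuron_out = [0.2816, 1.5, -2.0]   # simulate the observed max diff

diff = max_abs_diff(reference, neuron_out)
atol = 1e-3
print(diff <= atol)  # False
```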
tagucci added the bug label on Feb 9, 2024
JingyaHuang (Collaborator) commented:

Hi @tagucci, thanks a lot for reporting! I can reproduce the issue; it seems that our CI disabled the CLI export test, so the bug was not detected. I just put up a fix at #474. Thanks again for catching it and reporting it to us!

JingyaHuang self-assigned this on Feb 9, 2024
JingyaHuang (Collaborator) commented:

#474 is merged; we will do a release this week to include it. Feel free to reopen the issue if there are any further questions!
