Support bigbird ONNX export with attention_type == "block_sparse" #754

harindercnvrg · 2023-02-07T07:31:51Z

System Info

CPU

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 57 bits virtual
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz
Stepping:            6
CPU MHz:             2200.000
CPU max MHz:         3500.0000
CPU min MHz:         800.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-31,64-95
NUMA node1 CPU(s):   32-63,96-127

python == 3.8.10

Installed packages:

absl-py==1.4.0
aiofiles==22.1.0
aiohttp==3.8.3
aiosignal==1.3.1
aiosqlite==0.18.0
anyio==3.6.2
argcomplete==1.10.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
async-timeout==4.0.2
attrs==22.2.0
autograd==1.5
azure-core==1.10.0
azure-storage-blob==12.6.0
Babel==2.11.0
backcall==0.2.0
backports.zoneinfo==0.2.1
beautifulsoup4==4.8.2
bleach==6.0.0
boto3==1.26.64
botocore==1.29.64
cachetools==5.3.0
certifi==2022.12.7
cffi==1.15.1
chardet==3.0.4
charset-normalizer==2.1.1
click==8.1.3
cma==2.7.0
cnvrg==0.7.54
colorama==0.4.6
coloredlogs==15.0.1
comm==0.1.2
compressed-rtf==1.0.6
contourpy==1.0.7
croniter==1.3.8
cryptography==39.0.0
cycler==0.11.0
datasets==2.9.0
debugpy==1.6.6
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.6
docx2txt==0.8
ebcdic==1.1.1
evaluate==0.4.0
executing==1.2.0
extract-msg==0.28.7
fastjsonschema==2.16.2
filelock==3.9.0
flatbuffers==23.1.21
fonttools==4.38.0
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.1.0
future==0.18.3
gitdb==4.0.10
GitPython==3.1.30
google-api-core==2.11.0
google-auth==2.16.0
google-auth-oauthlib==0.4.6
google-cloud-core==2.3.2
google-cloud-storage==2.7.0
google-crc32c==1.5.0
google-resumable-media==2.4.1
googleapis-common-protos==1.58.0
grpcio==1.51.1
huggingface-hub==0.12.0
humanfriendly==10.0
idna==2.10
IMAPClient==2.1.0
importlib-metadata==6.0.0
importlib-resources==5.10.2
ipykernel==6.21.1
ipython==8.9.0
ipython-genutils==0.2.0
isodate==0.6.1
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jstyleson==0.0.2
jupyter-client==8.0.2
jupyter-core==5.2.0
jupyter-events==0.5.0
jupyter-server==2.2.1
jupyter-server-fileid==0.6.0
jupyter-server-mathjax==0.2.6
jupyter-server-terminals==0.4.4
jupyter-server-ydoc==0.6.1
jupyter-ydoc==0.2.2
jupyterlab==3.6.1
jupyterlab-git==0.41.0
jupyterlab-pygments==0.2.2
jupyterlab-server==2.19.0
kiwisolver==1.4.4
lxml==4.9.2
Markdown==3.4.1
MarkupSafe==2.1.2
matplotlib==3.6.3
matplotlib-inline==0.1.6
mistune==2.0.4
mpmath==1.2.1
msrest==0.6.21
multidict==6.0.4
multiprocess==0.70.14
natsort==8.2.0
nbclassic==0.5.1
nbclient==0.7.2
nbconvert==7.2.9
nbdime==3.1.1
nbformat==5.7.3
nest-asyncio==1.5.6
networkx==2.8.2
ninja==1.10.2.4
nncf==2.4.0
notebook==6.5.2
notebook-shim==0.2.2
numpy==1.23.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.2
olefile==0.46
onnx==1.12.0
onnxruntime==1.12.1
openvino==2022.3.0
openvino-telemetry==2022.3.0
optimum==1.6.3
optimum-intel==1.6.1
packaging==23.0
pandas==1.5.2
pandocfilters==1.5.0
parso==0.8.3
pdfminer.six==20191110
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
pkgutil-resolve-name==1.3.10
platformdirs==2.6.2
progress==1.6
prometheus-client==0.16.0
prompt-toolkit==3.0.36
protobuf==3.20.1
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyaml==21.10.1
pyarrow==11.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pycryptodome==3.17
pydot==1.4.2
Pygments==2.14.0
pymoo==0.5.0
pyparsing==2.4.7
pyrsistent==0.19.3
python-dateutil==2.8.2
python-json-logger==2.0.4
python-pptx==0.6.21
pytz==2022.7.1
pytz-deprecation-shim==0.1.0.post0
PyYAML==6.0
pyzmq==25.0.0
regex==2022.10.31
requests==2.28.2
requests-oauthlib==1.3.1
responses==0.18.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
s3transfer==0.6.0
scikit-learn==1.2.1
scipy==1.10.0
Send2Trash==1.8.0
sentencepiece==0.1.97
six==1.12.0
smmap==5.0.0
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.3.2.post1
SpeechRecognition==3.8.1
stack-data==0.6.2
sympy==1.11.1
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
terminado==0.17.1
textract==1.6.5
texttable==1.6.7
threadpoolctl==3.1.0
tinycss2==1.2.1
tinynetrc==1.3.1
tokenizers==0.13.2
tomli==2.0.1
torch==1.13.1
torchvision==0.14.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
transformers==4.26.0
typing-extensions==4.4.0
tzdata==2022.7
tzlocal==4.2
uri-template==1.2.0
urllib3==1.25.11
wcwidth==0.2.6
webcolors==1.12
webencodings==0.5.1
websocket-client==1.5.1
Werkzeug==2.2.2
xlrd==1.2.0
XlsxWriter==3.0.8
xxhash==3.2.0
y-py==0.5.5
yarl==1.8.2
ypy-websocket==0.8.2
zipp==3.12.1

Who can help?

@lewtun @michaelbenayoun

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

I converted the summarizer model to onnx and then ran it:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

model_id = "google/pegasus-pubmed"
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
prediction = pipe(to_summarize)

I also tried the openvino runtime

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import OVModelForSeq2SeqLM
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

model_id = "google/pegasus-pubmed"
model = OVModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
prediction = pipe(to_summarize)

Expected behavior

This is supposed to provide faster inference than the original pytorch model. Neither the onnx and nor the openvino runtime improve speed, in fact the inference time increases by manifold.

The text was updated successfully, but these errors were encountered:

fxmarty · 2023-02-07T15:54:19Z

Thank you, for OpenVINO, could you open an issue in https://github.com/huggingface/optimum-intel ?

For ONNX Runtime, I suspect it is the same issue as #753 .

harindercnvrg · 2023-02-09T06:45:17Z

I have raised the issue.
huggingface/optimum-intel#188

fxmarty · 2023-02-11T00:04:45Z

@harindercnvrg Could you provide a reproduction script and the result of lscpu?

I would recommend as well to try on Optimum main as a critical bug due to the transformers 4.26.0 release was fixed recently: #756 . The fix will be included in the next release next week.

harindercnvrg · 2023-02-13T05:53:42Z

@fxmarty the system info provided at the top of the issue is from the lscpu command. I have also provided reproduction script at the bottom of the issue raised.

fxmarty · 2023-02-13T08:06:21Z

Thanks @harindercnvrg my bad I missed the lscpu. I meant a reproduction script with time measured, the scripts above only run inference. So that I can try to reproduce the issue on my side.

harindercnvrg · 2023-02-13T08:38:38Z

Orginal code:

from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer
from breakup import breaker
from datasets import load_dataset
import time

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-pubmed")

model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-pubmed")
tic=time.time()
inputs = tokenizer(to_summarize, return_tensors='pt')
prediction = model.generate(**inputs)
prediction = tokenizer.batch_decode(prediction)
toc=time.time()
print(" Time taken: ", toc-tic)

Using ONNX

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

model_id ="google/bigbird-pegasus-large-pubmed"
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
tic=time.time()
prediction = pipe(to_summarize)
toc=time.time()
print(" Time taken: ", toc-tic)

Using Openvino

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import OVModelForSeq2SeqLM
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

model_id = "google/bigbird-pegasus-large-pubmed"
model = OVModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tic=time.time()
pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
prediction = pipe(to_summarize)
toc=time.time()
print(" Time taken: ", toc-tic)

fxmarty · 2023-02-13T10:11:21Z

Thank you! I can reproduce the issue on main. Averaging over 5 runs for each, I get:

Average time (PyTorch eager): 29.747
Average time (ORT): 58.940

Note that when using ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True), the PyTorch model is exported to ONNX on the fly. A log we get during the export is the following:

Attention type 'block_sparse' is not possible if sequence_length: 16 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...

This issue likely comes from there: the example input provided during the ONNX export is too short, hence registering the wrong controlflows that are slow for long sequences (as the one in the benchmark).

Thank you for notifying, will fix!

fxmarty · 2023-02-14T14:25:24Z

Hi @harindercnvrg , I investigated the issue a bit, and there is a critical issue for the ONNX export of BigBird, given that part of BigBird block sparse attention is written in numpy and pure python.

Up to now, BigBird was solely exported using attention_type == "original_full" (as the sequence length was too short) and not attention_type == "block_sparse" (which is arguably the interesting case). That being said, I understand this influences only the encoder, and it'd be worth checking the decoder only.

I worked a bit on rewritting BigBird to be pure PyTorch, which goes fine, but I am now hitting the issue that torch.onnx.export being extremely slow exporting in the block sparse attention case.

For now, I would recommend you to stick with the PyTorch implementation, or maybe Tensorflow XLA one if you find.

fxmarty · 2023-02-17T14:00:56Z

Hi @harindercnvrg , we will remove the support of bigbird and bigbird-pegasus in the ONNX export in #778 due to this issue.

A large chunk of bigbird's implementation in transformers is written in numpy and pure pytorch that makes it unfit for the ONNX export. I tried to rewrite it as pure PyTorch, which succeeded, but then the export becomes prohibitively slow.

If you would like to have a look and manage to solve the issue, you can start from: huggingface/transformers@main...fxmarty:transformers:use-torch-bigbird

harindercnvrg added the bug Something isn't working label Feb 7, 2023

fxmarty mentioned this issue Feb 7, 2023

Fix past key values usage huggingface/optimum-intel#187

Merged

This was referenced Feb 14, 2023

Remove BigBird ONNX support #778

Merged

Community contribution - optimum.exporters.onnx support for new models! #555

Open

fxmarty closed this as completed Feb 20, 2023

fxmarty reopened this Feb 20, 2023

fxmarty changed the title ~~Inference worse on onnx runtime and openvino runtime for converted seq2seq models on CPU~~ Support bigbird ONNX export with attention_type == "block_sparse" Feb 20, 2023

fxmarty added feature-request New feature or request onnx Related to the ONNX export and removed bug Something isn't working labels Feb 20, 2023

echarlaix mentioned this issue Mar 2, 2023

Inference worse on onnx runtime and openvino runtime for converted seq2seq models on CPU #754 huggingface/optimum-intel#188

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support bigbird ONNX export with attention_type == "block_sparse" #754

Support bigbird ONNX export with attention_type == "block_sparse" #754

harindercnvrg commented Feb 7, 2023 •

edited

Loading

fxmarty commented Feb 7, 2023

harindercnvrg commented Feb 9, 2023

fxmarty commented Feb 11, 2023 •

edited

Loading

harindercnvrg commented Feb 13, 2023

fxmarty commented Feb 13, 2023

harindercnvrg commented Feb 13, 2023

fxmarty commented Feb 13, 2023

fxmarty commented Feb 14, 2023 •

edited

Loading

fxmarty commented Feb 17, 2023

Support bigbird ONNX export with attention_type == "block_sparse" #754

Support bigbird ONNX export with attention_type == "block_sparse" #754

Comments

harindercnvrg commented Feb 7, 2023 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

fxmarty commented Feb 7, 2023

harindercnvrg commented Feb 9, 2023

fxmarty commented Feb 11, 2023 • edited Loading

harindercnvrg commented Feb 13, 2023

fxmarty commented Feb 13, 2023

harindercnvrg commented Feb 13, 2023

fxmarty commented Feb 13, 2023

fxmarty commented Feb 14, 2023 • edited Loading

fxmarty commented Feb 17, 2023

harindercnvrg commented Feb 7, 2023 •

edited

Loading

fxmarty commented Feb 11, 2023 •

edited

Loading

fxmarty commented Feb 14, 2023 •

edited

Loading