Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
I was testing ORTModelForCausalLM and compared it against a PyTorch baseline (plain transformers API, no Optimum). For a fixed number of generated tokens, say 16, the inference latency of ORTModelForCausalLM roughly doubles when the input sequence length doubles. The PyTorch baseline, in contrast, shows a much smaller latency increase.
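Roughly, the comparison looked like the sketch below (the checkpoint, prompt lengths, and timing loop are placeholders, not my exact script; the export/cache flags passed to from_pretrained may differ depending on the Optimum version):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # placeholder checkpoint; the exact model does not matter here
tokenizer = AutoTokenizer.from_pretrained(model_id)

pt_model = AutoModelForCausalLM.from_pretrained(model_id)
# Export flag name depends on the Optimum version (from_transformers / export).
ort_model = ORTModelForCausalLM.from_pretrained(model_id, from_transformers=True, use_cache=True)

def time_generate(model, prompt, new_tokens=16):
    # Measure end-to-end latency of generating a fixed number of new tokens.
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    return time.perf_counter() - start

short_prompt = "hello " * 64
long_prompt = "hello " * 128  # 2x the input sequence length

for name, model in [("pytorch", pt_model), ("onnxruntime", ort_model)]:
    print(name, time_generate(model, short_prompt), time_generate(model, long_prompt))
```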
This result from the ORT model is unexpected: with use_cache=True, only the latency of the first output token should grow significantly with the input sequence length, not that of the subsequent output tokens. Looking at the Nsight Systems timelines, the ORT model produces every output token with the same latency and identical compute patterns. Stepping through the Optimum ORT model with the Python debugger, I saw that the decoder_with_past ONNX inference session is created correctly, but the past_key_values argument of the forward function is always None.
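A quick way to see this without a debugger is to patch the forward method and log what arrives (a rough sketch; it assumes generate() hands past_key_values over as a keyword argument):

```python
from optimum.onnxruntime import ORTModelForCausalLM

_original_forward = ORTModelForCausalLM.forward

def _logging_forward(self, *args, **kwargs):
    # With a working cache, past_key_values should be non-None for every token after the first.
    print("past_key_values is None:", kwargs.get("past_key_values") is None)
    return _original_forward(self, *args, **kwargs)

ORTModelForCausalLM.forward = _logging_forward
```

In my runs this prints True on every generation step, i.e. the cache is never passed back in.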
I also looked at the current Optimum decoder implementation and realized it is very different from the GPT2 implementation it refers to. Replacing the Optimum code with the GPT2 implementation seems to produce the expected results. Can someone help? Is this a feature that is not implemented correctly yet?
Expected behavior
I expect that when the use_cache flag is set, the decoder model reuses the cached key/value states to speed up inference.
Thanks, it is very worrisome that we had this kind of bug; it means our tests are not good enough. It seems the issue is just using past instead of past_key_values as the argument name in prepare_inputs_for_generation.
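For illustration, the change being described would look roughly like this (a sketch, not the actual patch; it assumes forward() expects the cache under the name past_key_values and that the exact keyword handed over by generate depends on the transformers version):

```python
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
    # Sketch only: the cache from the previous forward pass has to be returned
    # under the key forward() expects, otherwise it silently stays None and
    # every generation step re-processes the full prompt.
    if past_key_values is not None:
        # Once a cache exists, only the newest token needs to be fed to the decoder.
        input_ids = input_ids[:, -1:]
    return {
        "input_ids": input_ids,
        "attention_mask": kwargs.get("attention_mask"),
        "past_key_values": past_key_values,  # previously returned as "past"
        "use_cache": kwargs.get("use_cache"),
    }
```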
I think there is indeed an issue: the prepare_inputs_for_generation methods are not shared class methods in transformers and may differ slightly from one model to another. I'll extend the tests and check whether our support is sufficient.