use_cache no effect for decoder #753

Closed

cyang49 opened this issue Feb 6, 2023 · 1 comment
Labels
bug Something isn't working

Comments


cyang49 commented Feb 6, 2023

System Info

optimum 1.6.3 (the latest code also looks the same)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I was testing ORTModelForCausalLM and comparing it with a PyTorch baseline (plain transformers API, not Optimum). I saw that for ORTModelForCausalLM, with a fixed number of tokens to generate (say 16), the inference latency doubles when the input sequence length doubles. The PyTorch baseline, in contrast, shows a much smaller latency increase.
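For reference, here is a minimal sketch of the kind of benchmark I ran (the model id, prompt lengths, and timing loop below are illustrative, not my exact script; it assumes optimum 1.6.x, where `from_transformers=True` triggers the ONNX export):

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # placeholder model; the pattern was the same on the model I actually used
tokenizer = AutoTokenizer.from_pretrained(model_id)

ort_model = ORTModelForCausalLM.from_pretrained(model_id, from_transformers=True, use_cache=True)
pt_model = AutoModelForCausalLM.from_pretrained(model_id)


def time_generate(model, prompt, new_tokens=16):
    """Return the wall-clock time to generate `new_tokens` tokens for `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    return time.perf_counter() - start


short_prompt = "hello " * 32
long_prompt = "hello " * 64  # roughly 2x the input sequence length

for name, model in [("ORT", ort_model), ("PyTorch", pt_model)]:
    print(name, time_generate(model, short_prompt), time_generate(model, long_prompt))
```

With `use_cache=True` I would expect the longer prompt to cost more only in the prefill step, but for the ORT model the whole generation scales with the prompt length.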

The result from the ORT model is unexpected because, when use_cache=True, only the first output token's latency should increase significantly with input sequence length, not that of the subsequent output tokens. I looked at the timelines from Nsight Systems and saw that the ORT model produces every output token with the same latency and the same compute pattern. I also stepped through the Optimum ORT model execution with the Python debugger and saw that, while the decoder_with_past ONNX inference session is created correctly, the past_key_values argument of the forward function is always None.

I looked at the current Optimum decoder implementation and realized that it is very different from the GPT2 implementation it refers to. Replacing the Optimum code with the GPT2 implementation seems to produce the expected results. Can someone help? Is this a feature that is not yet implemented correctly?

Expected behavior

I expect that when the use_cache flag is set, the decoder model uses cached data to speed up inference.

cyang49 added the bug (Something isn't working) label on Feb 6, 2023

fxmarty (Contributor) commented Feb 7, 2023

Thanks, it is very worrisome that we had this kind of bug; it means our tests are not good enough. It seems the issue is just that past is used instead of past_key_values as the argument to prepare_inputs_for_generation.

I think there is indeed an issue with prepare_inputs_for_generation: it is not a class method in transformers and may differ slightly from one model to another. I'll extend the tests and see whether our support is still sufficient.
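For concreteness, a rough sketch of what the fix might look like, following the standard GenerationMixin contract (this is not the actual Optimum code, just the shape of the change):

```python
# Sketch only: the keyword that GenerationMixin passes back between generation steps is
# `past_key_values`, so the override must accept (and forward) that exact name; a `past`
# parameter would silently stay None and disable the cache.
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
    if past_key_values is not None:
        # With a cache present, only the last generated token needs to be fed back.
        input_ids = input_ids[:, -1:]
    return {
        "input_ids": input_ids,
        "past_key_values": past_key_values,
        "attention_mask": kwargs.get("attention_mask"),
        "use_cache": kwargs.get("use_cache"),
    }
```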
