Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
I was testing ORTModelForCausalLM and compared it against a PyTorch baseline (plain transformers API, no Optimum). For a fixed number of generated tokens, say 16, the inference latency of ORTModelForCausalLM roughly doubles when the input sequence length doubles. The PyTorch baseline, in contrast, shows a much smaller latency increase.
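Roughly, the comparison looked like the sketch below (the checkpoint, prompt lengths, and timing loop are placeholders, not my exact script; the export/cache flags passed to from_pretrained may differ depending on the Optimum version):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # placeholder checkpoint; the exact model does not matter here
tokenizer = AutoTokenizer.from_pretrained(model_id)

pt_model = AutoModelForCausalLM.from_pretrained(model_id)
# Export flag name depends on the Optimum version (from_transformers / export).
ort_model = ORTModelForCausalLM.from_pretrained(model_id, from_transformers=True, use_cache=True)

def time_generate(model, prompt, new_tokens=16):
    # Measure end-to-end latency of generating a fixed number of new tokens.
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    return time.perf_counter() - start

short_prompt = "hello " * 64
long_prompt = "hello " * 128  # 2x the input sequence length

for name, model in [("pytorch", pt_model), ("onnxruntime", ort_model)]:
    print(name, time_generate(model, short_prompt), time_generate(model, long_prompt))
```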
This result from the ORT model is unexpected: with use_cache=True, only the latency of the first output token should grow significantly with the input sequence length, not that of the subsequent output tokens. Looking at the Nsight Systems timelines, the ORT model produces every output token with the same latency and identical compute patterns. Stepping through the Optimum ORT model with the Python debugger, I saw that the decoder_with_past ONNX inference session is created correctly, but the past_key_values argument of the forward function is always None.
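A quick way to see this without a debugger is to patch the forward method and log what arrives (a rough sketch; it assumes generate() hands past_key_values over as a keyword argument):

```python
from optimum.onnxruntime import ORTModelForCausalLM

_original_forward = ORTModelForCausalLM.forward

def _logging_forward(self, *args, **kwargs):
    # With a working cache, past_key_values should be non-None for every token after the first.
    print("past_key_values is None:", kwargs.get("past_key_values") is None)
    return _original_forward(self, *args, **kwargs)

ORTModelForCausalLM.forward = _logging_forward
```

In my runs this prints True on every generation step, i.e. the cache is never passed back in.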
I also looked at the current Optimum decoder implementation and realized it is very different from the GPT2 implementation it refers to. Replacing the Optimum code with the GPT2 implementation seems to produce the expected results. Can someone help? Is this a feature that is not implemented correctly yet?
Expected behavior
I expect that when the use_cache flag is set, the decoder model reuses the cached key/value states to speed up inference.
Thanks, it is very worrisome that we had this kind of bug; it means our tests are not good enough. It seems the issue is just using past instead of past_key_values as the argument name in prepare_inputs_for_generation.
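For illustration, the change being described would look roughly like this (a sketch, not the actual patch; it assumes forward() expects the cache under the name past_key_values and that the exact keyword handed over by generate depends on the transformers version):

```python
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
    # Sketch only: the cache from the previous forward pass has to be returned
    # under the key forward() expects, otherwise it silently stays None and
    # every generation step re-processes the full prompt.
    if past_key_values is not None:
        # Once a cache exists, only the newest token needs to be fed to the decoder.
        input_ids = input_ids[:, -1:]
    return {
        "input_ids": input_ids,
        "attention_mask": kwargs.get("attention_mask"),
        "past_key_values": past_key_values,  # previously returned as "past"
        "use_cache": kwargs.get("use_cache"),
    }
```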
I think there is indeed an issue: the prepare_inputs_for_generation methods are not shared class methods in transformers and may differ slightly from one model to another. I'll extend the tests and check whether our support is sufficient.