[Performance]: High Latency in Stateful Inference for Obtaining State (get_state()) in OpenVINO #28474
Comments
Hi, may we ask for an isolated reproducer so we can debug it?
Also, I have checked the code, and it looks like get_state for the GPU plugin returns a copy of its internal memory buffer: openvino/src/plugins/intel_gpu/src/plugin/variable_state.cpp, lines 125 to 137 in 2088b8f.
That may explain why it is slow, but it is hard for me to tell without broader context.
Instead, if m_memory or the layout is all you need, you can use the accessors in openvino/src/plugins/intel_gpu/src/plugin/variable_state.cpp, lines 33 to 39 in 2088b8f.
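The copy semantics can also be probed from the Python side: mutate the buffer that get_state hands back, then read the state again; if the second read is unaffected, the first buffer was a detached copy rather than a view of device memory. A minimal sketch under that assumption (the model path and token values are placeholders taken from the reproducer below):

```python
import numpy as np
import openvino as ov

# Placeholder path; any stateful (KV-cache) LLM IR shows the same behavior.
compiled = ov.Core().compile_model("path\\llama\\openvino_model.xml", "GPU")
request = compiled.create_infer_request()
request.infer({
    "input_ids": np.array([[13127, 10, 2467]], dtype=np.int64),
    "attention_mask": np.array([[1, 1, 1]], dtype=np.int64),
    "position_ids": np.array([[1, 2, 3]], dtype=np.int64),
    "beam_idx": np.array([0], dtype=np.int64),
})

state = request.query_state()[0]
first = state.state.data   # on GPU this should materialize a host-side copy
first[:] = 0               # scribble over the returned buffer...
second = state.state.data  # ...then fetch the state again
# If the plugin handed back a detached copy, the second read is unaffected.
print("detached copy:", not np.array_equal(first, second))
```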
Thanks for your reply. Here is the reproducer:

```python
import time

import numpy as np
import torch

import openvino as ov

input_ids = torch.from_numpy(np.array([[13127, 10, 2467]], dtype=np.int64))
position_ids = torch.from_numpy(np.array([[1, 2, 3]], dtype=np.int64))
attn_one = torch.from_numpy(np.array([[1, 1, 1]], dtype=np.int64))
beam_idx = torch.from_numpy(np.array([0], dtype=int))

inputs_1 = {}
inputs_1["input_ids"] = input_ids
inputs_1["attention_mask"] = attn_one
inputs_1["position_ids"] = position_ids
inputs_1["beam_idx"] = beam_idx

core = ov.Core()
model_1 = core.compile_model("path\\llama\\openvino_model.xml", "GPU")

infer_request_1 = model_1.create_infer_request()
infer_request_1.infer(inputs_1)
infer_request_1.infer(inputs_1)

states = infer_request_1.query_state()
for state in states:
    start_time = time.time()
    state_buf = state.state.data  # this read is what triggers the copy
    end_time = time.time()
    elapsed_time = (end_time - start_time) * 1000  # milliseconds
    print(f"Elapsed time: {elapsed_time} ms")
```

The output on LNL Ultra 9:

The output on MTL Ultra 5 is over 40 ms; we cannot afford this latency.
When we use get_state(), it automatically invokes convert_and_copy_padded_source() because of the padded layout, which is very time-consuming. How can we configure it to use a no_pad layout?
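I am not aware of a public knob to switch the GPU plugin's internal state to an unpadded layout, but if the per-token read is the bottleneck, one workaround is to restructure the loop so the padded copy is paid only at the points where the state is actually consumed, not after every token. A minimal sketch under that assumption (placeholder model path, single-beam greedy decode, simplified attention-mask handling):

```python
import time

import numpy as np
import openvino as ov

compiled = ov.Core().compile_model("path\\llama\\openvino_model.xml", "GPU")  # placeholder
request = compiled.create_infer_request()

def decode_step(request, token_id, position):
    # One single-token step on the stateful model (beam_idx pinned to 0).
    request.infer({
        "input_ids": np.array([[token_id]], dtype=np.int64),
        "attention_mask": np.ones((1, position + 1), dtype=np.int64),
        "position_ids": np.array([[position]], dtype=np.int64),
        "beam_idx": np.array([0], dtype=np.int64),
    })

# Run several steps WITHOUT touching the states in between...
for position, token in enumerate([13127, 10, 2467]):
    decode_step(request, token, position)

# ...and pay the padded-copy cost once, only where the state is needed,
# e.g. at a speculative-decoding verification or rollback point.
start = time.time()
snapshot = {s.name: np.array(s.state.data, copy=True) for s in request.query_state()}
print(f"one-off state read: {(time.time() - start) * 1000:.1f} ms")
```

This does not remove the copy itself; it only amortizes it across the number of tokens decoded between state reads.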
OpenVINO Version: 2024.4.6
Operating System: Windows System
Device used for inference: iGPU
OpenVINO installation: PyPI
Programming Language: C++
Hardware Architecture: x86 (64 bits)
Model used: llama
Model quantization: No
Target Platform: No response
Performance issue description
We have identified a significant latency issue when obtaining the state during stateful inference in OpenVINO. The delay is excessively high, which impacts the overall performance of our application. The sample code in the comments above illustrates the problem.
In our tests, the latency for obtaining the state is consistently around 40 ms, which is unacceptable for our real-time application.
Are there any good suggestions for optimization? Being able to modify the state is crucial for optimizing LLMs (such as Medusa and similar methods).
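For the modify-the-state use case, a write path does exist at the API level: a variable can be overwritten by assigning a tensor to it. A minimal rollback sketch, assuming the Python binding's writable state property (the set_state path) accepts the assignment and that the snapshot shape still matches what the variable expects; the model path and tokens are placeholders:

```python
import numpy as np
import openvino as ov

compiled = ov.Core().compile_model("path\\llama\\openvino_model.xml", "GPU")  # placeholder
request = compiled.create_infer_request()
request.infer({
    "input_ids": np.array([[13127, 10, 2467]], dtype=np.int64),
    "attention_mask": np.array([[1, 1, 1]], dtype=np.int64),
    "position_ids": np.array([[1, 2, 3]], dtype=np.int64),
    "beam_idx": np.array([0], dtype=np.int64),
})

# Snapshot the KV-cache state; today this is where the expensive copy happens.
snapshot = {s.name: np.array(s.state.data, copy=True) for s in request.query_state()}

# ... speculative/draft steps that pollute the KV cache would run here ...

# Roll back by writing the snapshot into each variable.
for s in request.query_state():
    s.state = ov.Tensor(snapshot[s.name])
```

With the current GPU behavior, both the snapshot and the write-back cross the padded-copy path, so the ~40 ms cost applies in each direction.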
Step-by-step reproduction: No response