Prerequisites
Please answer the following questions for yourself before submitting an issue.
I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
The one below is the last chunk coming from the streaming response generator when using the llama.cpp server directly. This is the expected behaviour, since it contains usage info (CompletionUsage) about the tokens consumed:
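Roughly like this (field values here are illustrative, not a verbatim capture; the relevant part is the populated usage field):

ChatCompletionChunk(id='chatcmpl-...', choices=[Choice(delta=ChoiceDelta(content=None, role=None), finish_reason='stop', index=0)], created=..., model='...', object='chat.completion.chunk', usage=CompletionUsage(completion_tokens=9, prompt_tokens=21, total_tokens=30))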
Current Behavior
When I use llama-cpp-python to start the server, keeping the same settings (stream=True), and then ask the client for the stream generator, the last chunk looks like this one:
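Again with illustrative values, the shape is the same but usage is never populated:

ChatCompletionChunk(id='chatcmpl-...', choices=[Choice(delta=ChoiceDelta(content=None, role=None), finish_reason='stop', index=0)], created=..., model='...', object='chat.completion.chunk', usage=None)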
As you can see, the last chunk does not contain info about the usage (it is always None), which is not expected. If I turn off streaming (stream=False), the bug is not there.
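For comparison, a minimal non-streaming sketch (the api_key value is a placeholder; llama-cpp-python's server does not check it by default, but the openai client requires one to be set) where usage comes back populated:

import openai

client = openai.OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0,
    stream=False,
)
# With stream=False the usage field is filled in as expected
print(response.usage)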
Environment and Context
I'm using:
Windows 11
python 3.11.9
llama-cpp-python 0.2.84
llama.cpp b3486
Failure Information (for bugs)
Lack of usage information when stream is True.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
Step 1: run any supported GGUF model on the llama-cpp-python 0.2.84 server.
Step 2: on the client side, run the following code, making sure stream=True:
import openai

# Placeholder key: llama-cpp-python's server does not validate it by default,
# but the openai client refuses to construct without one.
client = openai.OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-required")

m = client.models.list()

output_generator = client.chat.completions.create(
    model=m.data[0].id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    temperature=0,
    stream=True,
)

# Collect and concatenate responses from the generator
response_text = ""
for output in output_generator:
    print(output)
    new_content = output.choices[0].delta.content
    if new_content:
        print(new_content, end="", flush=True)
        response_text += new_content
Step 3: the last print(output) shows the snippet above.
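Possibly relevant: the OpenAI API spec only sends usage in streams when the client passes stream_options={"include_usage": True}, in which case the final chunk carries usage and an empty choices list. A sketch of requesting it that way is below; whether llama-cpp-python 0.2.84 honors the option is exactly what is in question here:

import openai

client = openai.OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-required")
m = client.models.list()

output_generator = client.chat.completions.create(
    model=m.data[0].id,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0,
    stream=True,
    # Per the OpenAI spec, this requests a final usage-bearing chunk
    stream_options={"include_usage": True},
)

usage = None
for output in output_generator:
    if output.usage is not None:
        usage = output.usage
# Expected: CompletionUsage(...); observed with llama-cpp-python: None
print(usage)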
Failure Logs
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.