
[Bug:Server] Lack of usage information on streaming response #1640

Closed
zm0n3 opened this issue Jul 30, 2024 · 1 comment
Labels
enhancement New feature or request

zm0n3 commented Jul 30, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

The chunk below is the last one returned by the streaming response generator when using the llama.cpp server. This is the expected behaviour, since it contains usage info (CompletionUsage) about the tokens consumed:

ChatCompletionChunk(
id='chatcmpl-<id>',
choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)],
created=1722354477,
model='my_model.gguf',
object='chat.completion.chunk',
service_tier=None,
system_fingerprint=None,
usage=CompletionUsage(completion_tokens=10, prompt_tokens=22, total_tokens=32))

Current Behavior

When I start the server with llama-cpp-python instead, keeping the same settings (stream=True), and then request the stream generator from the client, the last chunk looks like this:

ChatCompletionChunk(
id='chatcmpl-<id>',
choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)],
created=1722354657,
model='my_model.gguf',
object='chat.completion.chunk',
service_tier=None,
system_fingerprint=None,
usage=None)

As you can see, the last chunk does not contain usage info (usage is always None), which is not expected. If I turn streaming off (stream=False), the bug does not occur.
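
For comparison, here is a minimal non-streaming check, assuming the same local server; the api_key value is a placeholder that the local server does not validate. In this configuration usage is populated as expected:

import openai

client = openai.OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-required")

m = client.models.list()
response = client.chat.completions.create(
    model=m.data[0].id,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0,
    stream=False,  # non-streaming: usage is returned
)
# Prints a CompletionUsage object, e.g. CompletionUsage(completion_tokens=..., prompt_tokens=..., total_tokens=...)
print(response.usage)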

Environment and Context

I'm using:

  • Windows 11
  • python 3.11.9
  • llama-cpp-python 0.2.84
  • llama.cpp b3486

Failure Information (for bugs)

Usage information is missing from the final streaming chunk when stream=True.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Run any supported GGUF model on the llama-cpp-python 0.2.84 server.
  2. On the client side, run the following code, ensuring stream=True:
import openai

# Point the OpenAI client at the local llama-cpp-python server
# (the api_key is a placeholder; the local server does not check it).
client = openai.OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-required")

m = client.models.list()
output_generator = client.chat.completions.create(
    model=m.data[0].id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    temperature=0,
    stream=True,
)

# Collect and concatenate responses from the generator
response_text = ""
for output in output_generator:
    print(output)
    new_content = output.choices[0].delta.content
    if new_content:
        print(new_content, end="", flush=True)
        response_text += new_content
  3. The last print(output) shows the snippet above (a variant that explicitly requests streaming usage is sketched after this list).
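
For reference, the OpenAI chat completions API also exposes streaming usage through the stream_options parameter. The sketch below reuses the client and m from step 2; whether the llama-cpp-python server honours stream_options is an assumption to verify for the version in use:

# Sketch only: stream_options={"include_usage": True} is part of the OpenAI
# streaming API; server-side support in llama-cpp-python may vary by version.
output_generator = client.chat.completions.create(
    model=m.data[0].id,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0,
    stream=True,
    stream_options={"include_usage": True},
)

last_chunk = None
for output in output_generator:
    last_chunk = output

# With include_usage, the final chunk is expected to carry CompletionUsage.
print(last_chunk.usage)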

Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Environment info:

llama-cpp-python$ python3 --version
Python 3.11.9

llama-cpp-python$ pip list | egrep "uvicorn|fastapi|sse-starlette|numpy"
fastapi                                  0.111.1
numpy                                    1.26.4
sse-starlette                            2.1.2
zm0n3 changed the title from "Lack of usage information on streaming response" to "[server] Lack of usage information on streaming response" on Jul 30, 2024
zm0n3 changed the title from "[server] Lack of usage information on streaming response" to "[Bug:Server] Lack of usage information on streaming response" on Jul 30, 2024
abetlen added the bug (Something isn't working) and enhancement (New feature or request) labels and removed the bug label on Aug 7, 2024
zm0n3 commented Aug 11, 2024

Tested now on llama-cpp-python==0.2.87 and the issue appears to have been resolved.

@zm0n3 zm0n3 closed this as completed Aug 11, 2024