
[Bug:Server] Lack of usage information on streaming response #1640

Closed
zm0n3 opened this issue Jul 30, 2024 · 1 comment
Labels
enhancement New feature or request

zm0n3 commented Jul 30, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

The chunk below is the last one returned by the streaming response generator when using the llama.cpp server. This is the expected behaviour, since it contains usage info (CompletionUsage) about the tokens consumed:

ChatCompletionChunk(
id='chatcmpl-<id>',
choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)],
created=1722354477,
model='my_model.gguf',
object='chat.completion.chunk',
service_tier=None,
system_fingerprint=None,
usage=CompletionUsage(completion_tokens=10, prompt_tokens=22, total_tokens=32))

Current Behavior

When I start the server with llama-cpp-python instead, keeping the same settings (stream=True), and then request the stream generator from the client, the last chunk looks like this:

ChatCompletionChunk(
id='chatcmpl-<id>',
choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)],
created=1722354657,
model='my_model.gguf',
object='chat.completion.chunk',
service_tier=None,
system_fingerprint=None,
usage=None)

As you can see, the last chunk does not contain usage info (usage is always None), which is not expected. If I turn streaming off (stream=False), the bug does not occur.
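
For comparison, here is a minimal non-streaming check, assuming the same local server; the api_key value is a placeholder that the local server does not validate. In this configuration usage is populated as expected:

import openai

client = openai.OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-required")

m = client.models.list()
response = client.chat.completions.create(
    model=m.data[0].id,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0,
    stream=False,  # non-streaming: usage is returned
)
# Prints a CompletionUsage object, e.g. CompletionUsage(completion_tokens=..., prompt_tokens=..., total_tokens=...)
print(response.usage)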

Environment and Context

I'm using:

  • Windows 11
  • python 3.11.9
  • llama-cpp-python 0.2.84
  • llama.cpp b3486

Failure Information (for bugs)

Usage information is missing from the final streaming chunk when stream=True.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Run any supported GGUF model on the llama-cpp-python 0.2.84 server.
  2. On the client side, run the following code, ensuring stream=True:
import openai

# Point the OpenAI client at the local llama-cpp-python server
# (the api_key is a placeholder; the local server does not check it).
client = openai.OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-required")

m = client.models.list()
output_generator = client.chat.completions.create(
    model=m.data[0].id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    temperature=0,
    stream=True,
)

# Collect and concatenate responses from the generator
response_text = ""
for output in output_generator:
    print(output)
    new_content = output.choices[0].delta.content
    if new_content:
        print(new_content, end="", flush=True)
        response_text += new_content
  3. The last print(output) shows the snippet above (a variant that explicitly requests streaming usage is sketched after this list).
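
For reference, the OpenAI chat completions API also exposes streaming usage through the stream_options parameter. The sketch below reuses the client and m from step 2; whether the llama-cpp-python server honours stream_options is an assumption to verify for the version in use:

# Sketch only: stream_options={"include_usage": True} is part of the OpenAI
# streaming API; server-side support in llama-cpp-python may vary by version.
output_generator = client.chat.completions.create(
    model=m.data[0].id,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0,
    stream=True,
    stream_options={"include_usage": True},
)

last_chunk = None
for output in output_generator:
    last_chunk = output

# With include_usage, the final chunk is expected to carry CompletionUsage.
print(last_chunk.usage)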

Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Environment info:

llama-cpp-python$ python3 --version
Python 3.11.9

llama-cpp-python$ pip list | egrep "uvicorn|fastapi|sse-starlette|numpy"
fastapi                                  0.111.1
numpy                                    1.26.4
sse-starlette                            2.1.2
zm0n3 changed the title from "Lack of usage information on streaming response" to "[server] Lack of usage information on streaming response" on Jul 30, 2024
zm0n3 changed the title from "[server] Lack of usage information on streaming response" to "[Bug:Server] Lack of usage information on streaming response" on Jul 30, 2024
abetlen added the bug (Something isn't working) and enhancement (New feature or request) labels and removed the bug label on Aug 7, 2024
zm0n3 commented Aug 11, 2024

Tested now on llama-cpp-python==0.2.87 and the issue appears to have been resolved.

@zm0n3 zm0n3 closed this as completed Aug 11, 2024