Streaming assistant responses #185
Comments
It makes sense to me! Streaming is definitely hard. My initial thought is to use it as an additional check against the timeout, maybe at the request level, so that the client doesn't shut down the connection while it is still fetching the response. But I will revisit this once I'm more familiar with the code base. Is there any issue you think I could help with?
If we just want to circumvent the timeout on the worker side, we could still implement streaming there. It wouldn't make a difference to the user, as we would only return once we have the full response, but since we should get an initial chunk very quickly, we could maybe lower the timeout again.
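To make that idea concrete, here is a minimal sketch of what worker-side streaming could look like, with the stream consumed entirely on the worker and only the joined text returned. The `stream_from_llm` helper and the heartbeat callback are hypothetical stand-ins, not part of Ragna:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_from_llm(prompt: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for an assistant that yields response chunks.
    for chunk in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0.1)
        yield chunk


async def answer_on_worker(
    prompt: str, heartbeat: Callable[[], Awaitable[None]]
) -> str:
    # Consume the stream on the worker; each chunk acts as a liveness signal,
    # so the task timeout can be kept short while the caller still receives
    # the complete response in one piece.
    chunks = []
    async for chunk in stream_from_llm(prompt):
        await heartbeat()  # e.g. extend the task deadline
        chunks.append(chunk)
    return "".join(chunks)
```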
Does communicating with the LLM really need to go through the task queue? This would then make it a lot easier to handle the streaming use-case (which I agree is absolutely essential for RAG). There might need to be a few more changes to deal with the fact that the streamed response will return an iterator rather than a single string.
I assume that's because there can also be self-hosted LLMs running on the same machine? But I agree: since the final output of RAG is a synthesis step, it should be able to support streaming if the underlying LLM supports it.
@petrpan26 is right about the reason. We use the same abstraction for local LLMs as for hitting an external API. Thus, everything goes through the task queue.
I disagree here. Streaming is nice to have and not essential. Everything works perfectly fine without streaming.
Could one of you explain to me why we need a worker/task queue specifically for a local LLM? I don't follow the logic here. If there's any interest, I've added a minimal PoC of streaming here. Rather than calling the assistant through the task queue, the PoC streams the response directly. Although I don't think any of this is particularly complex, it would likely imply some fairly major restructuring of the code, e.g. making the assistants truly async, so I'm not too sure how realistic it is.
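As a rough illustration of what "truly async" assistants could look like, here is a minimal sketch in which the answer method is an async generator that yields chunks as they arrive. The class and method names are hypothetical and not Ragna's actual interface:

```python
import asyncio
from typing import AsyncIterator


class StreamingAssistant:
    """Hypothetical assistant whose answer method is an async generator."""

    async def answer(self, prompt: str, sources: list[str]) -> AsyncIterator[str]:
        # A real assistant would forward chunks from the LLM API here
        # (e.g. an SSE stream) instead of a hard-coded list.
        for chunk in ["The ", "answer ", "is ", "42."]:
            yield chunk


async def main() -> None:
    assistant = StreamingAssistant()
    async for chunk in assistant.answer("What is the answer?", sources=[]):
        print(chunk, end="", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())
```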
Apologies, I might have come across a bit strong here in my excitement! What I really meant was that for slower connections, longer responses, or self-hosting compute constraints, it's not unrealistic to be waiting beyond 30 s for the full response to be generated. I personally find this to be a real productivity sink, as it's enough time for me to get distracted or go elsewhere. Streaming is great because it keeps my engagement the entire time, even on slow connections. So everything still works, but I think it's a question of how much value is lost.
Making some progress on this in #215. One thing that I'm not super sure about yet is the type of the individual "chunk" of the message that we return. We cannot just return a str here, but somehow also need to include the sources. For the Python API, we could just return multiple `Message`s:

```python
from ragna.core import Message

chunks = []
async for message_chunk in chat.answer(prompt):
    chunks.append(message_chunk)
    # do something with the chunk here

message = Message(
    content="".join(chunk.content for chunk in chunks),
    role=chunks[0].role,
    sources=chunks[0].sources,
)
```

Not too happy with it. Feels kinda clunky. I thought about having a class method for this, similar to ragna/source_storages/_vector_database.py, lines 44 to 48 at 0e1c35a.
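For illustration, such a class method might look roughly like the sketch below; the name `from_chunks` and the simplified `Message` fields are hypothetical, not Ragna's actual API:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Message:
    # Simplified stand-in for ragna.core.Message.
    content: str
    role: str
    sources: list[str] = field(default_factory=list)

    @classmethod
    def from_chunks(cls, chunks: list[Message]) -> Message:
        # Join the streamed content; role and sources come from the first chunk,
        # mirroring the manual construction in the snippet above.
        return cls(
            content="".join(chunk.content for chunk in chunks),
            role=chunks[0].role,
            sources=chunks[0].sources,
        )
```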
For the REST API, the situation is similar. My current thought is to return something like

```python
class AnswerOutput(BaseModel):
    content: Optional[str]
    message: Optional[Message]
```

While streaming, we would have only `content` set; once the answer is complete, the full `message` would be included. Any input appreciated.
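To sketch how such objects could be served, here is a hypothetical streaming endpoint that emits one `AnswerOutput` per line as newline-delimited JSON. The route path, the NDJSON framing, the simplified `message` field, and pydantic v2's `model_dump()` are all assumptions for illustration, not Ragna's actual REST API:

```python
import json
from typing import AsyncIterator, Optional

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class AnswerOutput(BaseModel):
    content: Optional[str] = None
    message: Optional[dict] = None  # full message, simplified to a dict here


async def generate_chunks(prompt: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for the streamed assistant response.
    chunks = ["Streaming ", "is ", "nice."]
    for chunk in chunks:
        yield json.dumps(AnswerOutput(content=chunk).model_dump()) + "\n"
    # The final item carries the assembled message, including its sources.
    yield json.dumps(
        AnswerOutput(message={"content": "".join(chunks), "sources": []}).model_dump()
    ) + "\n"


@app.post("/answer")
async def answer(prompt: str) -> StreamingResponse:
    # Newline-delimited JSON: one AnswerOutput object per line.
    return StreamingResponse(
        generate_chunks(prompt), media_type="application/x-ndjson"
    )
```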
Two other OSS packages that I am reasonably familiar with (and have reasonably wide impact based on stars and social media discussion) have adopted similar designs to what you have just described.
I personally prefer the second approach. Asking a user to assemble the final message from chunks themselves feels like unnecessary boilerplate. About terminology: my two cents is that I agree with you about trying to avoid using "chunks" in multiple places, but this is already widely done (see the two previous packages for examples), so it's probably swimming against the tide. (#215 looking good!)
Indeed. I think a good blueprint that we can use here is how HTTP client libraries handle responses: the response object gives access to the full body as well as an iterator over its chunks.
With that, one can do

```python
message = chat.answer(...)
print(message)
```

or

```python
message = chat.answer(...)
for chunk in message:
    print(chunk)
```

In both cases the user still has access to the other attributes. As for the REST API: I kinda like a similar approach, which is what the LLM API providers do as well. You always return the full JSON object, but if one hits the endpoint with streaming enabled, the same objects are sent incrementally.
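A minimal sketch of how a message object could support both consumption patterns, assuming a simplified synchronous design; the attribute names mirror the snippets above, but the class itself is hypothetical:

```python
from typing import Iterable, Iterator


class StreamedMessage:
    """Hypothetical message that can be printed whole or iterated chunk by chunk."""

    def __init__(self, chunks: Iterable[str], role: str, sources: list[str]) -> None:
        self._chunks = iter(chunks)
        self._consumed: list[str] = []
        self.role = role
        self.sources = sources

    def __iter__(self) -> Iterator[str]:
        # Yield chunks as they arrive, remembering them for later access.
        for chunk in self._chunks:
            self._consumed.append(chunk)
            yield chunk

    def __str__(self) -> str:
        # Drain any remaining chunks so the full content is available.
        self._consumed.extend(self._chunks)
        return "".join(self._consumed)


# Usage: either print(message) for the full text, or iterate for chunks.
message = StreamedMessage(["Hello", ", ", "world!"], role="assistant", sources=[])
for chunk in message:
    print(chunk, end="", flush=True)
print()
print(message.role, message.sources)
```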
Closed in #215.
Some APIs like OpenAI and Anthropic support streaming the generated response. This is a really nice UX, because the user gets instant feedback and doesn't have to wait for a long time to receive a large blob of text.
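For reference, this is roughly what consuming a streamed response looks like with the OpenAI Python client (assuming the v1 `openai` package; the model name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the assistant's message.
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```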
However, due to Ragna's backend structure, this is either "impossible" (read: infeasible given the upside) or really hard. The problem is that we receive the response on the worker, but consume it on the API side, with our result storage sitting in between. That means we would need to somehow implement a streaming approach for the results and finally also implement streaming on our own API.
I would really like to have this feature and have toyed with it in the past. Unfortunately, I gave up at some point for the reasons stated above. If someone wants to have a go at this, be my guest. But be warned that this will likely turn really ugly.