Streaming assistant responses #185
Comments
It makes sense to me! Streaming is definitely hard. My initial thought is to use it as an additional check against the timeout, maybe at the request level, so that the client doesn't shut down the connection while it is still fetching the response. But I will revisit this once I'm more familiar with the code base. Is there any issue you think I could help with?
If we just want to circumvent the timeout on the worker side, we could still implement streaming there. It wouldn't make a difference to the user, as we would only return once we have the full response, but since we should get an initial chunk very quickly, we could maybe lower the timeout again.
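To make that idea concrete, here is a minimal sketch of what worker-side streaming could look like, with the stream consumed entirely on the worker and only the joined text returned. The `stream_from_llm` helper and the heartbeat callback are hypothetical stand-ins, not part of Ragna:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_from_llm(prompt: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for an assistant that yields response chunks.
    for chunk in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0.1)
        yield chunk


async def answer_on_worker(
    prompt: str, heartbeat: Callable[[], Awaitable[None]]
) -> str:
    # Consume the stream on the worker; each chunk acts as a liveness signal,
    # so the task timeout can be kept short while the caller still receives
    # the complete response in one piece.
    chunks = []
    async for chunk in stream_from_llm(prompt):
        await heartbeat()  # e.g. extend the task deadline
        chunks.append(chunk)
    return "".join(chunks)
```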
Does communicating with the LLM really need to go through the task queue? This would then make it a lot easier to handle the streaming use-case (which I agree is absolutely essential for RAG). There might need to be a few more changes to deal with the fact that the streamed response will return an iterator rather than a single string.
I assume that's because there can also be self-hosted LLMs running on the same machine? But I agree: since the final output of RAG is a synthesis step, it should be able to support streaming if the underlying LLM supports it.
@petrpan26 is right about the reason. We use the same abstraction for local LLMs as for hitting an external API. Thus, everything goes through the task queue.
I disagree here. Streaming is nice to have and not essential. Everything works perfectly fine without streaming.
Could one of you explain to me why we need a worker/task queue specifically for a local LLM? I don't follow the logic here. If there's any interest, I've added a minimal PoC of streaming here. Rather than calling the assistant through the task queue, the PoC streams the response directly. Although I don't think any of this is particularly complex, it would likely imply some fairly major restructuring of the code, e.g. making the assistants truly async, so I'm not too sure how realistic it is.
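As a rough illustration of what "truly async" assistants could look like, here is a minimal sketch in which the answer method is an async generator that yields chunks as they arrive. The class and method names are hypothetical and not Ragna's actual interface:

```python
import asyncio
from typing import AsyncIterator


class StreamingAssistant:
    """Hypothetical assistant whose answer method is an async generator."""

    async def answer(self, prompt: str, sources: list[str]) -> AsyncIterator[str]:
        # A real assistant would forward chunks from the LLM API here
        # (e.g. an SSE stream) instead of a hard-coded list.
        for chunk in ["The ", "answer ", "is ", "42."]:
            yield chunk


async def main() -> None:
    assistant = StreamingAssistant()
    async for chunk in assistant.answer("What is the answer?", sources=[]):
        print(chunk, end="", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())
```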
Apologies, I might have come across a bit strong here in my excitement! What I really meant was that for slower connections, longer responses, or self-hosting compute constraints, it's not unrealistic to be waiting beyond 30 s for the full response to be generated. I personally find this to be a real productivity sink, as it's enough time for me to get distracted or go elsewhere. Streaming is great because it keeps my engagement the entire time, even on slow connections. So everything still works, but I think it's a question of how much value is lost.
Making some progress on this in #215. One thing that I'm not super sure about yet is the type of the individual "chunk" of the message that we return. We cannot just return a str here, but somehow also need to include the sources. For the Python API, we could just return multiple `Message`s:

```python
from ragna.core import Message

chunks = []
async for message_chunk in chat.answer(prompt):
    chunks.append(message_chunk)
    # do something with the chunk here

message = Message(
    content="".join(chunk.content for chunk in chunks),
    role=chunks[0].role,
    sources=chunks[0].sources,
)
```

Not too happy with it. Feels kinda clunky. I thought about having a class method for this, similar to ragna/source_storages/_vector_database.py, lines 44 to 48 at 0e1c35a.
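For illustration, such a class method might look roughly like the sketch below; the name `from_chunks` and the simplified `Message` fields are hypothetical, not Ragna's actual API:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Message:
    # Simplified stand-in for ragna.core.Message.
    content: str
    role: str
    sources: list[str] = field(default_factory=list)

    @classmethod
    def from_chunks(cls, chunks: list[Message]) -> Message:
        # Join the streamed content; role and sources come from the first chunk,
        # mirroring the manual construction in the snippet above.
        return cls(
            content="".join(chunk.content for chunk in chunks),
            role=chunks[0].role,
            sources=chunks[0].sources,
        )
```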
For the REST API, the situation is similar. My current thought is to return something like

```python
class AnswerOutput(BaseModel):
    content: Optional[str]
    message: Optional[Message]
```

While streaming, we would have only `content` set; once the answer is complete, the full `message` would be included. Any input appreciated.
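To sketch how such objects could be served, here is a hypothetical streaming endpoint that emits one `AnswerOutput` per line as newline-delimited JSON. The route path, the NDJSON framing, the simplified `message` field, and pydantic v2's `model_dump()` are all assumptions for illustration, not Ragna's actual REST API:

```python
import json
from typing import AsyncIterator, Optional

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class AnswerOutput(BaseModel):
    content: Optional[str] = None
    message: Optional[dict] = None  # full message, simplified to a dict here


async def generate_chunks(prompt: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for the streamed assistant response.
    chunks = ["Streaming ", "is ", "nice."]
    for chunk in chunks:
        yield json.dumps(AnswerOutput(content=chunk).model_dump()) + "\n"
    # The final item carries the assembled message, including its sources.
    yield json.dumps(
        AnswerOutput(message={"content": "".join(chunks), "sources": []}).model_dump()
    ) + "\n"


@app.post("/answer")
async def answer(prompt: str) -> StreamingResponse:
    # Newline-delimited JSON: one AnswerOutput object per line.
    return StreamingResponse(
        generate_chunks(prompt), media_type="application/x-ndjson"
    )
```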
Two other OSS packages that I am reasonably familiar with (and have reasonably wide impact based on stars and social media discussion) have adopted similar designs to what you have just described.
I personally prefer the second approach. Asking a user to assemble the final message from chunks themselves feels like unnecessary boilerplate. About terminology: my two cents is that I agree with you about trying to avoid using "chunks" in multiple places, but this is already widely done (see the two previous packages for examples), so it's probably swimming against the tide. (#215 looking good!)
Indeed. I think a good blueprint that we can use here is how HTTP client libraries handle responses: the response object gives access to the full body as well as an iterator over its chunks.
With that, one can do

```python
message = chat.answer(...)
print(message)
```

or

```python
message = chat.answer(...)
for chunk in message:
    print(chunk)
```

In both cases the user still has access to the other attributes. As for the REST API: I kinda like a similar approach, which is what the LLM API providers do as well. You always return the full JSON object, but if one hits the endpoint with streaming enabled, the same objects are sent incrementally.
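A minimal sketch of how a message object could support both consumption patterns, assuming a simplified synchronous design; the attribute names mirror the snippets above, but the class itself is hypothetical:

```python
from typing import Iterable, Iterator


class StreamedMessage:
    """Hypothetical message that can be printed whole or iterated chunk by chunk."""

    def __init__(self, chunks: Iterable[str], role: str, sources: list[str]) -> None:
        self._chunks = iter(chunks)
        self._consumed: list[str] = []
        self.role = role
        self.sources = sources

    def __iter__(self) -> Iterator[str]:
        # Yield chunks as they arrive, remembering them for later access.
        for chunk in self._chunks:
            self._consumed.append(chunk)
            yield chunk

    def __str__(self) -> str:
        # Drain any remaining chunks so the full content is available.
        self._consumed.extend(self._chunks)
        return "".join(self._consumed)


# Usage: either print(message) for the full text, or iterate for chunks.
message = StreamedMessage(["Hello", ", ", "world!"], role="assistant", sources=[])
for chunk in message:
    print(chunk, end="", flush=True)
print()
print(message.role, message.sources)
```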
Closed in #215.
Some APIs like OpenAI and Anthropic support streaming the generated response. This is a really nice UX, because the user gets instant feedback and doesn't have to wait for a long time to receive a large blob of text.
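For reference, this is roughly what consuming a streamed response looks like with the OpenAI Python client (assuming the v1 `openai` package; the model name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the assistant's message.
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```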
However, due to Ragna's backend structure, this is either "impossible" (read: infeasible given the upside) or really hard. The problem is that we receive the response on the worker, but consume it on the API side, with our result storage sitting in between. That means we would need to somehow implement a streaming approach for the results and finally also implement streaming on our own API.
I would really like to have this feature and have toyed with it in the past. Unfortunately, I gave up at some point for the reasons stated above. If someone wants to have a go at this, be my guest. But be warned that this will likely turn really ugly.