Using an Agent and wanted to stream just the final response #2483
Comments
Me too.
Ticked the top reply, but commenting to bump the issue.
Me too.
Do you have any solution for this?
Do you want to just get the final response, without the intermediate output? If you're looking for something else, feel free to describe it. I'd be happy to implement it :)
I think the main question here is not just getting the final answer, but the streaming part: the user still has to wait 20-30 seconds for the full answer, but it arrives word by word, so the response feels faster.
Is there a solution to this?
As far as I'm concerned, no.
Done ✅ - see #4630 (demo: Recording.mp4)
Can we have agent.run (or any other method) return a generator, so that we can do some custom processing on the outputs? @UmerHA Thanks!
You can subclass FinalStreamingStdOutCallbackHandler.
In fact, I need to return each char just like the OpenAI stream: I need to wrap each char in JSON with some additional information and pass it to a downstream service. @UmerHA
LLMs return the output at a token level, not a character level; making the characters appear one after another is a UI trick. Still, you can get your desired behavior by subclassing FinalStreamingStdOutCallbackHandler. You can use it like this:
Edit: formatting
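(The code block from that comment didn't survive in this copy of the thread. As a rough illustration rather than the original snippet, a subclass that wraps each final-answer token in JSON could look like the sketch below; it assumes the FinalStreamingStdOutCallbackHandler from #4630 and its append_to_last_tokens / check_if_answer_reached helpers, which is how the handler detects the answer prefix internally.)

```python
import json
import sys

from langchain.callbacks.streaming_stdout_final_only import (
    FinalStreamingStdOutCallbackHandler,
)


class JsonFinalStreamingHandler(FinalStreamingStdOutCallbackHandler):
    """Emit each token of the final answer as a small JSON payload."""

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Keep feeding the rolling window the parent class uses to detect
        # the answer prefix (e.g. "Final Answer:").
        self.append_to_last_tokens(token)

        if self.check_if_answer_reached():
            self.answer_reached = True
            return

        if self.answer_reached:
            # Wrap the token in JSON with whatever extra fields the
            # downstream service expects (placeholder field names here).
            payload = json.dumps({"token": token, "meta": "additional info"})
            sys.stdout.write(payload + "\n")
            sys.stdout.flush()
```

Instead of writing to stdout, the payload could just as well be pushed onto a queue or sent over a websocket, depending on how the downstream service consumes it.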
I updated LangChain to the latest version (v0.0.169), but I didn't find the FinalStreamingStdOutCallbackHandler.
The PR (#4630) hasn't been merged yet, so it's not part of the code base. Waiting for the LangChain team to merge it.
Have been facing the same challenge! Eagerly waiting to benefit from this PR. Hoping they'll merge it into master soon.
I have created a PR #5937 to support returning a generator using simple syntax.

Basic setup:

```python
from langchain.agents import load_tools, initialize_agent, AgentType
from langchain.llms import OpenAI
from langchain.callbacks import StreamingLastResponseCallbackHandler

llm = OpenAI(temperature=0, streaming=True)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False)
```

Then this is all you need to do:

```python
import threading

stream = StreamingLastResponseCallbackHandler.from_agent_type(agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

def _run():
    agent.run("Who is Leo DiCaprio's girlfriend? What is her current age raised to the 0.43 power?", callbacks=[stream])

threading.Thread(target=_run).start()

for token in stream:  # <-- generator here
    print(token, end="", flush=True)
```

And if you want to do it the decorator way instead:

```python
@stream.on_last_response_new_token()
def on_new_token(token: str):
    print(f"Next token: '{token}'")
```
Have you noticed that not all types of agents can stream the response? Or is it just me???
I posted a similar question here: https://stackoverflow.com/questions/76663943/how-to-use-openai-in-streaming-mode-with-gradio
Me too.
For what it's worth, if you're using AgentType.CONVERSATIONAL_REACT_DESCRIPTION, this is the callback setup you need: `callbacks=[FinalStreamingStdOutCallbackHandler(answer_prefix_tokens=['AI', ':'])]`
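For reference, a minimal end-to-end sketch of that setup (the tool list and question are just placeholders; the conversational agent needs a memory keyed on chat_history):

```python
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.callbacks.streaming_stdout_final_only import (
    FinalStreamingStdOutCallbackHandler,
)
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory

llm = OpenAI(temperature=0, streaming=True)
tools = load_tools(["llm-math"], llm=llm)
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    memory=ConversationBufferMemory(memory_key="chat_history"),
)

# Only tokens that come after the "AI:" prefix (the conversational agent's
# final answer) are written to stdout; intermediate thoughts are suppressed.
agent.run(
    "What is 2 raised to the 10th power?",
    callbacks=[FinalStreamingStdOutCallbackHandler(answer_prefix_tokens=["AI", ":"])],
)
```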
I am not a fan of the approaches above, so here is what I use instead: a streaming callback handler that exposes the tokens as an async iterator.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List, Literal, Union, cast

from langchain.callbacks.base import AsyncCallbackHandler
from langchain.schema.agent import AgentAction, AgentFinish
from langchain.schema.output import LLMResult


class CustomAsyncCallbackHandler(AsyncCallbackHandler):
    """
    Streaming callback handler that returns an async iterator. This supports
    both streaming LLMs and agent executors.

    :param is_agent: Whether this is an agent executor.
    """

    queue: asyncio.Queue[str]
    done: asyncio.Event

    @property
    def always_verbose(self) -> bool:
        return True

    def __init__(self, is_agent: bool = False) -> None:
        self.queue = asyncio.Queue()
        self.done = asyncio.Event()
        self.is_agent = is_agent

    async def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        print("================== LLM Start! ==========================")
        self.done.clear()

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        if token is not None and token != "":
            self.queue.put_nowait(token)

    async def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        """Do not close the queue here, as it may still be used by the agent."""
        if not self.is_agent:
            self.done.set()
        else:
            print("================== LLM finished! ==========================")
            print(response)
            generation_info = response.generations[-1][-1].generation_info
            if generation_info is not None:
                print(
                    "================== LLM finish reason! =========================="
                )
                # Figured out through trial and error
                if generation_info.get("finish_reason") == "stop":
                    self.done.set()

    async def on_llm_error(self, error: BaseException, **kwargs: Any) -> None:
        if not self.is_agent:
            self.done.set()

    # async def on_agent_action(self, action: AgentAction, **kwargs: Any) -> None:
    #     """Run on agent action."""

    # async def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> None:
    #     """Run on agent end."""
    #     self.done.set()

    # async def on_chain_start(
    #     self, serialized: Dict[str, Any], inputs: Dict[str, Any], **kwargs: Any
    # ) -> None:
    #     self.done.clear()

    # async def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> None:
    #     """Run when chain ends running."""
    #     self.done.set()

    async def aiter(self) -> AsyncIterator[str]:
        """Return an async iterator that yields tokens from the LLM."""
        while not self.queue.empty() or not self.done.is_set():
            # Wait for the next token in the queue or for the done event to be set
            done, other = await asyncio.wait(
                [
                    asyncio.ensure_future(self.queue.get()),
                    asyncio.ensure_future(self.done.wait()),
                ],
                return_when=asyncio.FIRST_COMPLETED,
            )
            # Cancel the other task
            if other:
                other.pop().cancel()
            # Extract the value of the first completed task
            token_or_done = cast(Union[str, Literal[True]], done.pop().result())
            # If the extracted value is the boolean True, the done event was set
            if token_or_done is True:
                break
            # Otherwise, the extracted value is a token, which we yield
            yield token_or_done
```

Edit: did some more detailed testing. For completeness, here's a FastAPI route:

```python
# Request, get_chat_memory and router are defined elsewhere in the application;
# json, StreamingResponse, OpenAI, ConversationChain and
# ConversationSummaryBufferMemory need to be imported as well.
@router.post("/")
async def general(input: Request):
    # Create the memory out of the provided messages list
    prompt = input.prompt
    memory = ConversationSummaryBufferMemory(
        llm=OpenAI(model="gpt-3.5-turbo-instruct"),
        chat_memory=get_chat_memory(input.messages or []),
    )

    # Create the conversation and the async callback handler
    handler = CustomAsyncCallbackHandler()
    conversation = ConversationChain(
        llm=OpenAI(model="gpt-3.5-turbo-instruct", streaming=True, callbacks=[handler]),
        memory=memory,
    )

    async def ask_question_async():
        asyncio.create_task(conversation.apredict(input=prompt))
        async for chunk in handler.aiter():
            yield f"data: {json.dumps({'content': chunk, 'tokens': 0})}\n\n"

    return StreamingResponse(ask_question_async(), media_type="text/plain")
```
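To round this out, here is a hedged sketch of consuming that route from a client; it assumes the app runs locally on port 8000 and that the Request model has the prompt and messages fields used above.

```python
import json

import httpx

# Stream the "data: {...}" lines the FastAPI route yields and print the
# content of each chunk as it arrives.
with httpx.stream(
    "POST",
    "http://localhost:8000/",
    json={"prompt": "Tell me a short joke", "messages": []},
    timeout=None,
) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            chunk = json.loads(line[len("data: "):])
            print(chunk["content"], end="", flush=True)
```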
Trying the above, but I want to include the intermediate steps in terms of tool usage. It's really difficult, as I don't have a way to distinguish between the final output and the intermediate logs / thoughts...
You might consider using the new LangChain Expression Language. I also recommend the new take on Agents from LangChain 0.1.0 onwards. You can do so much more now, from streaming the output out of the box to even streaming the intermediate steps themselves. I might follow up with a project on GitHub (not in the near future though).
@usersina can you point to the documentation explaining that you can directly stream the final output? I have only found the same stream method that streams intermediate steps and actions.
This page has it all: https://python.langchain.com/docs/expression_language/streaming#using-stream-events
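For a rough idea of what that looks like with an agent (not taken from the docs verbatim): astream_events emits a typed event for every step of the run, so you can keep only the chat-model token events and drop the tool and chain events.

```python
from langchain.agents import AgentExecutor


async def stream_final_answer(agent_executor: AgentExecutor, question: str) -> None:
    # Requires LangChain >= 0.1 and a streaming-capable chat model in the agent.
    async for event in agent_executor.astream_events({"input": question}, version="v1"):
        # "on_chat_model_stream" events carry the streamed token chunks; tool
        # and chain events are the intermediate steps we skip here. With a
        # tool-calling agent the intermediate chunks have empty content, so
        # effectively only the final answer is printed.
        if event["event"] == "on_chat_model_stream":
            print(event["data"]["chunk"].content, end="", flush=True)
```

Call it with any agent_executor built from a streaming chat model, e.g. asyncio.run(stream_final_answer(agent_executor, "Where is the Eiffel Tower?")).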
I successfully resolved this issue using the langchain-openai-api-bridge library. You can find the library here: https://github.com/samuelint/langchain-openai-api-bridge

Example usage:

```python
# Client
openai_client = OpenAI(
    base_url="http://my-langchain-server/my-custom-path/openai/v1",
)

chat_completion = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": 'Say "This is a test"',
        }
    ],
)

print(chat_completion.choices[0].message.content)
#> "This is a test"
```

Hope this helps :)
This is how I solved this issue using Streamlit and GPT-4o
Why am I passing the callback through the
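The Streamlit snippet itself didn't make it into this copy of the thread, so as a generic illustration (not the poster's code): the usual pattern is a callback handler that appends each new token to a Streamlit placeholder, passed to the model when it is constructed.

```python
import streamlit as st
from langchain.callbacks.base import BaseCallbackHandler
from langchain_openai import ChatOpenAI


class StreamToContainer(BaseCallbackHandler):
    """Render each new LLM token into a Streamlit placeholder as it arrives."""

    def __init__(self, container) -> None:
        self.container = container
        self.text = ""

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.text += token
        self.container.markdown(self.text)


prompt = st.chat_input("Ask something")
if prompt:
    with st.chat_message("assistant"):
        placeholder = st.empty()
        llm = ChatOpenAI(
            model="gpt-4o",
            streaming=True,
            callbacks=[StreamToContainer(placeholder)],
        )
        llm.invoke(prompt)
```

Passing the handler via the constructor works for a bare model; when wrapping the model in an agent, passing callbacks at invocation time (e.g. agent.run(..., callbacks=[handler])) combined with a final-answer filter like the ones above keeps the intermediate steps out of the UI.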
I am using an Agent and wanted to stream just the final response. Do you know if that is supported already, and how to do it?