Issue: Stream a response from LangChain's OpenAI with Python Flask API #4945
Comments
You could use the following:

```python
@app.route("/collection/<int:collection_id>/ask_question", methods=["POST"])
def ask_question(collection_id):
    question = request.form["question"]
    # response_generator = document_thread.askQuestion(collection_id, question)
    # return jsonify(response_generator)

    def stream(question):
        completion = document_thread.askQuestion(collection_id, question)
        for line in completion['answer']:
            yield line

    return app.response_class(stream_with_context(stream(question)))
```
Sadly it doesn't work, and I did exactly as you described.

I'm also wondering how this is done. I tried stream_template and stream_with_context, but my server only sends the response once it has finished loading, not while it is streaming. I also tried different callback handlers, to no avail.
I am still playing around and trying to solve it, but without any success. @agola11 For now, my code looks like this:
What you need is to override the StreamingStdOutCallbackHandler's on_llm_new_token method. I realized that the method only prints the token as it streams and does nothing with the output. So I put the token into a Queue from one thread, then read it from another thread. It works for me.
Switched from Flask to FastAPI. Moved to: #5409

Working on a similar implementation but can't get it to work.

Wait, never mind, got it to work! Thanks for the first answer.

You should be careful with the AsyncIteratorCallbackHandler: it stops iterating when the stream completes, so you need to account for the remaining tokens and return them as a last data event.
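For what it's worth, a rough sketch of that caveat (the helper name and SSE framing are my own assumptions, not code from this thread): once aiter() stops, drain anything still sitting in the handler's queue and emit it as a final event.

```python
from langchain.callbacks.streaming_aiter import AsyncIteratorCallbackHandler


async def stream_with_flush(handler: AsyncIteratorCallbackHandler):
    # Yield tokens as they arrive from the LLM's async iterator callback.
    async for token in handler.aiter():
        yield f"data: {token}\n\n"
    # aiter() may stop as soon as the LLM signals completion, so flush whatever
    # tokens are still queued and send them as one last data event.
    leftover = []
    while not handler.queue.empty():
        leftover.append(handler.queue.get_nowait())
    if leftover:
        yield f"data: {''.join(leftover)}\n\n"
```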
In the Flask API, you can create a queue and register tokens to it through LangChain's callback:

```python
class StreamingHandler(BaseCallbackHandler):
    ...
    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.queue.put(token)
```

You can then read tokens from the same queue in your Flask route:

```python
from queue import Queue
import threading

from flask import Response, stream_with_context

@app.route(...)
def stream_output():
    q = Queue()

    def generate(rq: Queue):
        ...
        # add your logic to prevent the while loop
        # from running indefinitely
        while ...:
            yield rq.get()

    callback_fn = StreamingHandler(q)
    threading.Thread(target=askQuestion, args=(collection_id, question, callback_fn)).start()
    return Response(stream_with_context(generate(q)))
```

In your LangChain ChatOpenAI, add the custom StreamingHandler callback:

```python
self.llm = ChatOpenAI(
    model_name=self.model_name,
    temperature=self.temperature,
    openai_api_key=os.environ.get('OPENAI_API_KEY'),
    streaming=True,
    callbacks=[callback_fn],
)
```

For reference: https://python.langchain.com/en/latest/modules/callbacks/getting_started.html#creating-a-custom-handler
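The askQuestion worker passed to the thread above is not shown; here is a hedged sketch of what it might look like (the QA chain choice and the get_documents helper are assumptions, not part of the original answer):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain


def askQuestion(collection_id: int, question: str, callback_fn: StreamingHandler):
    # The streaming LLM is built per request with that request's callback handler,
    # so every generated token ends up in the queue the Flask route is reading from.
    llm = ChatOpenAI(streaming=True, callbacks=[callback_fn], temperature=0)
    chain = load_qa_chain(llm, chain_type="stuff")
    docs = get_documents(collection_id)  # hypothetical helper that fetches the collection's documents
    return chain.run(input_documents=docs, question=question)
```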
@varunsinghal @longmans nice work! I am building Flask-Langchain and want to include streaming functionality. Have you tested this approach with multiple concurrent requests? It would be fantastic if one of you could open a PR to add an extension-based callback handler and route class (or decorator?) to handle streaming responses to the Flask-Langchain project; this probably isn't functionality that belongs in the main LangChain library, as it is Flask-specific.

@varunsinghal Thank you for the great answer! Could you elaborate more on the implementation of your method? I couldn't reproduce your method in code and get it to work. Thanks in advance!

Working on the same problem. No success at the moment... @varunsinghal I don't quite get your solution, to be honest.

Hi @VionaWang @riccardolinares, can you please share your code samples so that I can make suggestions and debug what could be going wrong?
Managed to get streaming to work, BUT with a

Solved it in a very hacky way (which can of course be improved): if the prompt comes from the question condenser, the streamed tokens are discarded, so the final stream contains only the answer, without the condensed question.
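For what it's worth, a sketch of a less hacky way to get the same effect (this is not the commenter's code; it assumes a ConversationalRetrievalChain, with callback_fn and retriever already defined): give the question-condensing step its own non-streaming LLM, so only the answering LLM's tokens ever reach the streaming callback.

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

# Streaming callback only on the LLM that writes the final answer.
streaming_llm = ChatOpenAI(streaming=True, callbacks=[callback_fn], temperature=0)
# No streaming callback here, so the rewritten follow-up question is never streamed.
condense_llm = ChatOpenAI(temperature=0)

chain = ConversationalRetrievalChain.from_llm(
    llm=streaming_llm,
    condense_question_llm=condense_llm,
    retriever=retriever,  # assumed: an existing vector-store retriever
)
```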
How did you make it work? It's been bugging me.

It would be great if you showed the whole code.

Please, I can't see the code of the working solution. Can you show it?
Here's a full minimal working example, taken from all of the answers above (with typings, modularity using Blueprints, and minimal error handling as a bonus). To explain how it all works, the relevant files are below.

src/routes/stream.py

```python
import os
import threading
from queue import Queue
from typing import TypedDict

from flask import Blueprint, Response, request
from langchain.llms import OpenAI

from utils.streaming import StreamingStdOutCallbackHandlerYield, generate

page = Blueprint(os.path.splitext(os.path.basename(__file__))[0], __name__)


# Define the expected input type
class Input(TypedDict):
    prompt: str


@page.route("/", methods=["POST"])
def stream_text() -> Response:
    data: Input = request.get_json()
    prompt = data["prompt"]
    q = Queue()

    def ask_question(callback_fn: StreamingStdOutCallbackHandlerYield):
        # Note that a try/catch is not needed here. The callback takes care of all errors in `on_llm_error`
        llm = OpenAI(streaming=True, callbacks=[callback_fn])
        return llm(prompt=prompt)

    callback_fn = StreamingStdOutCallbackHandlerYield(q)
    threading.Thread(target=ask_question, args=(callback_fn,)).start()
    return Response(generate(q), mimetype="text/event-stream")
```
src/utils/streaming.py

````python
import queue
from typing import Any, Dict, List, Union

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.schema import LLMResult

STOP_ITEM = "[END]"
"""
This is a special item that is used to signal the end of the stream.
"""


class StreamingStdOutCallbackHandlerYield(StreamingStdOutCallbackHandler):
    """
    This is a callback handler that yields the tokens as they are generated.
    For a usage example, see the :func:`generate` function below.
    """

    q: queue.Queue
    """
    The queue to write the tokens to as they are generated.
    """

    def __init__(self, q: queue.Queue) -> None:
        """
        Initialize the callback handler.
        q: The queue to write the tokens to as they are generated.
        """
        super().__init__()
        self.q = q

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        """Run when LLM starts running."""
        with self.q.mutex:
            self.q.queue.clear()

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """Run on new LLM token. Only available when streaming is enabled."""
        # Writes to stdout
        # sys.stdout.write(token)
        # sys.stdout.flush()
        # Pass the token to the generator
        self.q.put(token)

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        """Run when LLM ends running."""
        self.q.put(STOP_ITEM)

    def on_llm_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> None:
        """Run when LLM errors."""
        self.q.put("%s: %s" % (type(error).__name__, str(error)))
        self.q.put(STOP_ITEM)


def generate(rq: queue.Queue):
    """
    This is a generator that yields the items in the queue until it reaches the stop item.

    Usage example:
    ```
    def askQuestion(callback_fn: StreamingStdOutCallbackHandlerYield):
        llm = OpenAI(streaming=True, callbacks=[callback_fn])
        return llm(prompt="Write a poem about a tree.")

    @app.route("/", methods=["GET"])
    def generate_output():
        q = Queue()
        callback_fn = StreamingStdOutCallbackHandlerYield(q)
        threading.Thread(target=askQuestion, args=(callback_fn,)).start()
        return Response(generate(q), mimetype="text/event-stream")
    ```
    """
    while True:
        result: str = rq.get()
        if result == STOP_ITEM or result is None:
            break
        yield result
````

Complete folder structure

Here's the working tree, if you're struggling with where the files are located:

```
.
├── README.md
├── requirements.txt
└── src
    ├── main.py
    ├── routes
    │   └── stream.py
    └── utils
        └── streaming.py
```

src/main.py

```python
from dotenv import load_dotenv
from flask import Flask
from flask_cors import CORS

from routes.stream import page as stream_route

# Load environment variables
load_dotenv(
    dotenv_path=".env",  # Relative to where the script is running from
)

app = Flask(__name__)

# See https://github.com/corydolphin/flask-cors/issues/257
app.url_map.strict_slashes = False
CORS(app)

app.register_blueprint(stream_route, url_prefix="/api/chat")

if __name__ == "__main__":
    app.run()
```

I will soon follow with a full repository (probably).
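If it helps, here is a small client sketch for watching the tokens arrive incrementally (not part of the example above; the URL assumes the default Flask port and the /api/chat prefix from main.py):

```python
import requests

# Stream the response from the example's /api/chat endpoint and print tokens as they arrive.
with requests.post(
    "http://localhost:5000/api/chat/",
    json={"prompt": "Write a poem about a tree."},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```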
My previous solution is a performance killer, so here's a better, more concise one:

```python
import asyncio
import json

from langchain.callbacks.streaming_aiter import AsyncIteratorCallbackHandler
from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain
from langchain.llms.openai import OpenAI


@page.route("/general", methods=["POST"])
async def general_chat():
    try:
        memory = ConversationSummaryBufferMemory(
            llm=OpenAI(), chat_memory=[]
        )
        handler = AsyncIteratorCallbackHandler()
        conversation = ConversationChain(
            llm=OpenAI(streaming=True, callbacks=[handler]), memory=memory
        )

        async def ask_question_async():
            asyncio.create_task(conversation.apredict(input="Hello, how are you?"))
            async for chunk in handler.aiter():
                yield f"data: {json.dumps({'content': chunk, 'tokens': 0})}\n\n"

        return ask_question_async(), {"Content-Type": "text/event-stream"}
    except Exception as e:
        return {"error": "{}: {}".format(type(e).__name__, str(e))}, 500
```

Note that
What led you to choose conversation.apredict instead of the standard method of directly passing the user query to the created chain?

Because apredict is asynchronous. In fact, you might also be able to directly call
How about doing this using a Retrieval chain? I'm trying to, but getting errors.
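In case it helps, a hedged sketch of wiring the same queue-based callback to a retrieval chain (the route name is arbitrary; it assumes an existing retriever, plus the StreamingStdOutCallbackHandlerYield and generate helpers from the example above):

```python
import threading
from queue import Queue

from flask import Response, request
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI


@app.route("/retrieval", methods=["POST"])
def retrieval_stream():
    question = request.json["question"]
    q = Queue()
    callback_fn = StreamingStdOutCallbackHandlerYield(q)

    def work():
        # Streaming LLM pushes tokens to this request's queue via the callback.
        llm = ChatOpenAI(streaming=True, callbacks=[callback_fn], temperature=0)
        qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
        return qa.run(question)

    threading.Thread(target=work).start()
    return Response(generate(q), mimetype="text/event-stream")
```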
@usersina thanks for providing your code. I've tried what you recommended in your comment, and it works, except I do not get the final output from the agent. I get the chain's thought process returned in my Flask app, but it stops short of returning the final answer. What am I missing?

````python
import sys
import queue
from typing import Any, Dict, List, Optional, Union

from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import AgentAction, AgentFinish, LLMResult
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

STOP_ITEM = "[END]"
"""
This is a special item that is used to signal the end of the stream.
"""


class StreamingStdOutCallbackHandlerYield(StreamingStdOutCallbackHandler):
    """
    This is a callback handler that yields the tokens as they are generated.
    For a usage example, see the :func:`generate` function below.
    """

    q: queue.Queue
    """
    The queue to write the tokens to as they are generated.
    """

    def __init__(self, q: queue.Queue) -> None:
        """
        Initialize the callback handler.
        q: The queue to write the tokens to as they are generated.
        """
        super().__init__()
        self.q = q

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        """Run when LLM starts running."""
        with self.q.mutex:
            self.q.queue.clear()

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """Run on new LLM token. Only available when streaming is enabled."""
        # Writes to stdout
        sys.stdout.write(token)
        sys.stdout.flush()
        # Pass the token to the generator
        self.q.put(token)

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        """Run when LLM ends running."""
        sys.stdout.write("THE END!!!")
        self.q.put(response.output)
        self.q.put(STOP_ITEM)

    def on_llm_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> None:
        """Run when LLM errors."""
        sys.stdout.write(f"LLM Error: {error}\n")
        self.q.put("%s: %s" % (type(error).__name__, str(error)))
        self.q.put(STOP_ITEM)

    def on_chain_start(self, serialized: Dict[str, Any], inputs: Dict[str, Any], **kwargs: Any) -> Any:
        """Print out that we are entering a chain."""
        self.q.put("Entering the chain...\n\n")

    def on_tool_start(self, serialized: Dict[str, Any], input_str: str, **kwargs: Any) -> Any:
        sys.stdout.write(f"Tool: {serialized['name']}\n")
        self.q.put(f"Tool: {serialized['name']}\n")

    def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any:
        sys.stdout.write(f"{action.log}\n")
        self.q.put(f"{action.log}\n")


def generate(rq: queue.Queue):
    """
    This is a generator that yields the items in the queue until it reaches the stop item.

    Usage example:
    ```
    def askQuestion(callback_fn: StreamingStdOutCallbackHandlerYield):
        llm = OpenAI(streaming=True, callbacks=[callback_fn])
        return llm(prompt="Write a poem about a tree.")

    @app.route("/", methods=["GET"])
    def generate_output():
        q = Queue()
        callback_fn = StreamingStdOutCallbackHandlerYield(q)
        threading.Thread(target=askQuestion, args=(callback_fn,)).start()
        return Response(generate(q), mimetype="text/event-stream")
    ```
    """
    while True:
        result: str = rq.get()
        if result == STOP_ITEM or result is None:
            break
        yield result
````
```python
@app.route('/chat', methods=['POST'])
@auth.secured()
def chat():
    message = request.json['messages']
    chat_message_history = CustomChatMessageHistory(
        session_id=session['conversation_id'], connection_string="sqlite:///chat_history.db"
    )
    q = Queue()
    callback_fn = StreamingStdOutCallbackHandlerYield(q)

    def ask_question(callback_fn: StreamingStdOutCallbackHandlerYield):
        # Callback manager
        cb_manager = CallbackManager(handlers=[callback_fn])
        ## SQLDbAgent is a custom Tool class created to Q&A over a MS SQL Database
        sql_search = SQLSearchAgent(llm=llm, k=30, callback_manager=cb_manager, return_direct=True)
        ## ChatGPTTool is a custom Tool class created to talk to ChatGPT knowledge
        chatgpt_search = ChatGPTTool(llm=llm, callback_manager=cb_manager, return_direct=True)
        tools = [sql_search, chatgpt_search]
        agent = ConversationalChatAgent.from_llm_and_tools(llm=llm, tools=tools, system_message=CUSTOM_CHATBOT_PREFIX, human_message=CUSTOM_CHATBOT_SUFFIX)
        memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True, k=10, chat_memory=chat_message_history)
        brain_agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, memory=memory, handle_parsing_errors=True, streaming=True)
        return brain_agent_executor.run(message['content'])

    threading.Thread(target=ask_question, args=(callback_fn,)).start()
    return Response(generate(q), mimetype="text/event-stream")
```
@mmoore7 there might have been a change to the stop condition, or the tool / train-of-thought end event is getting called. I can't say for sure, since I have long moved from Flask and classic LangChain to LangChain Expression Language and FastAPI for better streaming.
LangServe has a number of examples that get streaming working out of the box with FastAPI: https://github.com/langchain-ai/langserve/tree/main?tab=readme-ov-file#examples We strongly recommend using LCEL and, depending on what you're doing, using the

I am marking this issue as closed, as there are enough examples and documentation for folks to solve this without much difficulty. LangServe will provide streaming that is available to the RemoteRunnable JS client in just a few lines of code!
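For anyone landing here, a minimal LCEL sketch of the kind of streaming the comment above refers to (not code from this thread): LCEL chains expose stream()/astream() directly, so no custom callback handler or queue is needed.

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# Build a simple LCEL chain: prompt -> chat model -> string output parser.
prompt = ChatPromptTemplate.from_template("Write a short poem about {topic}.")
chain = prompt | ChatOpenAI(temperature=0) | StrOutputParser()

# stream() yields string chunks as they are generated.
for chunk in chain.stream({"topic": "a tree"}):
    print(chunk, end="", flush=True)
```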
Flask needs an equivalent of the StreamingResponse that FastAPI has. I too switched to FastAPI, since the streaming support in Flask is lacking.
Issue you'd like to raise.
I am using a Python Flask app for chat over data. In the console I get a streamed response directly from OpenAI, since I can enable streaming with the flag streaming=True.
The problem is that I can't “forward” the stream, or “show” the stream, in my API call.
The code for processing OpenAI and the chain is:
and the API route code:
I am testing my endpoint with curl, passing the -N flag, so I should get a streamed response if it is possible.
When I make the API call, the endpoint first waits to process the data (I can see the streamed answer in my terminal in VS Code), and when it finishes, I get everything displayed in one go.
Thanks
Suggestion:
No response