Llama2 Chatbot on Mac #2618

Merged: 24 commits, merged on Sep 30, 2023

Commits (24)
- `bc6413d` Llama2 Chat app (agunapal, Sep 27, 2023)
- `4c09344` fix lint (agunapal, Sep 27, 2023)
- `9cbd9b0` Merge branch 'master' into examples/llama2_app (agunapal, Sep 27, 2023)
- `5c70f46` fix lint (agunapal, Sep 27, 2023)
- `14e8ad8` Merge branch 'examples/llama2_app' of https://github.com/agunapal/ser… (agunapal, Sep 27, 2023)
- `e82eab0` Added system architecture diagram (agunapal, Sep 27, 2023)
- `043630b` Merge branch 'master' into examples/llama2_app (agunapal, Sep 27, 2023)
- `cbea556` Added system architecture diagram (agunapal, Sep 27, 2023)
- `ae7dadd` Merge branch 'examples/llama2_app' of https://github.com/agunapal/ser… (agunapal, Sep 27, 2023)
- `9d106a8` Added system architecture diagram (agunapal, Sep 27, 2023)
- `f71007b` Added system architecture diagram (agunapal, Sep 27, 2023)
- `7e24125` Added system architecture diagram (agunapal, Sep 27, 2023)
- `e7f87f9` handler code (agunapal, Sep 28, 2023)
- `041d85b` lint (agunapal, Sep 28, 2023)
- `ff675c2` review comments (agunapal, Sep 29, 2023)
- `ed081da` added f strings (agunapal, Sep 29, 2023)
- `ddbe473` review comments (agunapal, Sep 29, 2023)
- `0956d20` Merge branch 'master' into examples/llama2_app (agunapal, Sep 29, 2023)
- `738e6db` review comments (agunapal, Sep 29, 2023)
- `bc97b8c` Merge branch 'examples/llama2_app' of https://github.com/agunapal/ser… (agunapal, Sep 29, 2023)
- `7b4cfbf` lint failure (agunapal, Sep 29, 2023)
- `fadb7b8` Merge branch 'master' into examples/llama2_app (agunapal, Sep 30, 2023)
- `0807fb8` Merge branch 'master' into examples/llama2_app (agunapal, Sep 30, 2023)
- `6eb0d65` Merge branch 'master' into examples/llama2_app (agunapal, Sep 30, 2023)
125 changes: 125 additions & 0 deletions examples/LLM/llama2/chat_app/Readme.md
@@ -0,0 +1,125 @@

# TorchServe Llama 2 Chat App

This is an example showing how to deploy a Llama 2 chat app using TorchServe.
We use [streamlit](https://github.com/streamlit/streamlit) to create the app.

We use [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) in this example.

You can run this example on your laptop to understand how to use TorchServe.


## Architecture

![Chatbot Architecture](./screenshots/architecture.png)


## Pre-requisites

The following example has been tested on an M1 Mac.
Before you install TorchServe, make sure you have the following installed:
1) JDK 17

Make sure your `javac` version is `17.x.x`:
```
javac --version
javac 17.0.8
```
You can download it from [java](https://www.oracle.com/java/technologies/downloads/#jdk17-mac)
2) Install conda with support for arm64

3) Since we are running this example on a Mac, we will use the 7B Llama 2 model.
Download the Llama 2 7B weights by following the instructions [here](https://github.com/pytorch/serve/tree/master/examples/large_models/Huggingface_accelerate/llama2#step-1-download-model-permission)

4) Install streamlit with

```
python -m pip install -r requirements.txt
```


### Steps

#### Install TorchServe
Install TorchServe with the following steps:

```
python ts_scripts/install_dependencies.py
pip install torchserve torch-model-archiver torch-workflow-archiver
```

#### Package model for TorchServe

Run this script to create `llamacpp.tar.gz`, which will be loaded in TorchServe:

```
source package_llama.sh <path to llama2 snapshot folder>
```
This creates the quantized weights and exports their location as `$LLAMA2_Q4_MODEL`.

For subsequent runs, we don't need to regenerate these weights. We only need to package the handler and model-config.yaml in the tar file.

Hence, you can skip the model generation by running the script as follows:

```
source package_llama.sh <path to llama2 snapshot folder> false
```

You might need to run the command below if the script output indicates it:
```
sudo xcodebuild -license
```
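
Optionally, you can sanity-check the quantized model with llama-cpp-python before starting TorchServe. A minimal sketch, assuming `$LLAMA2_Q4_MODEL` points at the `.gguf` file produced above (it uses the same `Llama` API as the handler):

```
import os

from llama_cpp import Llama

# Load the 4-bit quantized GGUF model produced by package_llama.sh
llm = Llama(model_path=os.environ["LLAMA2_Q4_MODEL"])

output = llm.create_completion(
    "Question: What is the closest star to Earth ? Answer:", max_tokens=64
)
print(output["choices"][0]["text"])
```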

#### Start TorchServe

We launch a streamlit app to configure TorchServe. This opens a UI in your browser, which you can use to start/stop TorchServe, register the model, and change some of the TorchServe parameters.

```
streamlit run torchserve_server_app.py
```

You can check the model status in the app to make sure the model is ready to receive requests.

![Server](./screenshots/Server.png)
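
If you prefer to manage the server from the command line instead of the Server app, a rough equivalent is sketched below (assuming the `llamacpp.tar.gz` archive was placed in `model_store/` by `package_llama.sh`):

```
torchserve --start --ncs --model-store model_store --models llamacpp=llamacpp.tar.gz
```

To stop the server, run `torchserve --stop`.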

#### Client Chat App

We launch a streamlit app from which a client can send requests to TorchServe. The reference app used is [here](https://blog.streamlit.io/how-to-build-a-llama-2-chatbot/).

```
streamlit run client_app.py
```

You can change the model parameters and ask the server questions in the following format:

```
Question: What is the closest star to Earth ? Answer:
```
which results in:

```
Question: What is the closest star to Earth ? Answer: The closest star to Earth is Proxima Centauri, which is located about 4. nobody knows if there is other life out there similar to ours or not, but it's pretty cool that we know of a star so close to us!
```

![Client](./screenshots/Client.png)
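
You can also bypass the Streamlit client and send a request directly to TorchServe's inference API. A minimal `curl` sketch (the JSON fields match what `llama_cpp_handler.py` reads from the request body; adjust the values as needed):

```
curl -X POST http://localhost:8080/predictions/llamacpp \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Question: What is the closest star to Earth ? Answer:", "max_tokens": 128, "top_p": 0.95, "temperature": 0.8}'
```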


### Experiments

You can launch a second client app from another terminal.

You can send requests simultaneously to see how quickly TorchServe responds.

#### Dynamic Batching

You can make use of dynamic batching by configuring the `batch_size` and `max_batch_delay` parameters in TorchServe. You can do this from the Server app.

![Batch Size](./screenshots/batch_size.png)
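
Under the hood, these settings map to the `batch_size` and `max_batch_delay` options of TorchServe's model registration API. As a sketch, you could also set them through the management API (assuming the archive name `llamacpp.tar.gz` and the default management port 8081):

```
curl -X DELETE "http://localhost:8081/models/llamacpp"
curl -X POST "http://localhost:8081/models?url=llamacpp.tar.gz&batch_size=4&max_batch_delay=5000&initial_workers=1"
```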

#### Backend Workers

You can increase the number of backend workers by configuring the `min_workers` parameter in TorchServe. You can do this from the Server app.

The number of workers can be autoscaled based on traffic and usage patterns.

![Workers](./screenshots/Workers.png)
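
Equivalently, the workers can be scaled through TorchServe's management API; a sketch, assuming the default management port 8081:

```
curl -X PUT "http://localhost:8081/models/llamacpp?min_worker=2&synchronous=true"
```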
97 changes: 97 additions & 0 deletions examples/LLM/llama2/chat_app/client_app.py
@@ -0,0 +1,97 @@
import json

import requests
import streamlit as st

# App title
st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")

# Sidebar: TorchServe status check and model parameters
with st.sidebar:
    st.title("🦙💬 Llama 2 Chatbot")

    try:
        # Liveness check, then query the worker status of the registered model
        res = requests.get(url="http://localhost:8080/ping")
        res = requests.get(url="http://localhost:8081/models/llamacpp")
        status = json.loads(res.text)[0]["workers"][0]["status"]

        if status == "READY":
            st.success("Proceed to entering your prompt message!", icon="👉")
        else:
            st.warning("Model not loaded in TorchServe", icon="⚠️")

    except requests.ConnectionError:
        st.warning("TorchServe is not up. Try again", icon="⚠️")

    st.subheader("Model parameters")
    temperature = st.sidebar.slider(
        "temperature", min_value=0.01, max_value=5.0, value=0.8, step=0.01
    )
    top_p = st.sidebar.slider(
        "top_p", min_value=0.01, max_value=1.0, value=0.95, step=0.01
    )
    # The default value must lie within [min_value, max_value]
    max_tokens = st.sidebar.slider(
        "max_tokens", min_value=128, max_value=512, value=256, step=8
    )

# Store LLM generated responses
if "messages" not in st.session_state.keys():
    st.session_state.messages = [
        {"role": "assistant", "content": "How may I assist you today?"}
    ]

# Display or clear chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])


def clear_chat_history():
    st.session_state.messages = [
        {"role": "assistant", "content": "How may I assist you today?"}
    ]


st.sidebar.button("Clear Chat History", on_click=clear_chat_history)


# Function for generating LLaMA2 response. Refactored from https://github.com/a16z-infra/llama2-chatbot
def generate_llama2_response(prompt_input):
    string_dialogue = (
        "Question: What are the names of the planets in the solar system? Answer: "
    )
    headers = {"Content-type": "application/json", "Accept": "text/plain"}
    url = "http://127.0.0.1:8080/predictions/llamacpp"
    data = json.dumps(
        {
            "prompt": prompt_input,
            "max_tokens": max_tokens,
            "top_p": top_p,
            "temperature": temperature,
        }
    )

    res = requests.post(url=url, data=data, headers=headers)

    return res.text


# User-provided prompt
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

# Generate a new response if last message is not from assistant
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_llama2_response(prompt)
            placeholder = st.empty()
            full_response = ""
            for item in response:
                full_response += item
                placeholder.markdown(full_response)
            placeholder.markdown(full_response)
    message = {"role": "assistant", "content": full_response}
    st.session_state.messages.append(message)
57 changes: 57 additions & 0 deletions examples/LLM/llama2/chat_app/llama_cpp_handler.py
@@ -0,0 +1,57 @@
import logging
import os
from abc import ABC

import torch
from llama_cpp import Llama

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class LlamaCppHandler(BaseHandler, ABC):
    def __init__(self):
        super(LlamaCppHandler, self).__init__()
        self.initialized = False
        logger.info("Init done")

    def initialize(self, ctx):
        """Load the quantized Llama 2 GGUF model with llama-cpp-python.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artifacts parameters.
        """
        logger.info("Start initialize")
        model_name = ctx.model_yaml_config["handler"]["model_name"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]
        # Fall back to the path exported by package_llama.sh if the configured path is absent
        if not os.path.exists(model_path):
            model_path = os.environ["LLAMA2_Q4_MODEL"]
        seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
        torch.manual_seed(seed)

        self.model = Llama(model_path=model_path)

    def preprocess(self, data):
        # This example serves one request at a time; return the body of the first item
        for row in data:
            item = row.get("body")
        return item

    def inference(self, data):
        result = self.model.create_completion(
            data["prompt"],
            max_tokens=data["max_tokens"],
            top_p=data["top_p"],
            temperature=data["temperature"],
            stop=["Q:", "\n"],
            echo=True,
        )
        return result

    def postprocess(self, output):
        logger.info(output)
        result = []
        result.append(output["choices"][0]["text"])
        return result
7 changes: 7 additions & 0 deletions examples/LLM/llama2/chat_app/model-config.yaml
@@ -0,0 +1,7 @@
# TorchServe frontend parameters
responseTimeout: 1200

handler:
    model_name: "llama-cpp"
    model_path: "/Users/agunapal/Documents/experiments/llama/ggml-model-q4_0.gguf"
    manual_seed: 40
46 changes: 46 additions & 0 deletions examples/LLM/llama2/chat_app/package_llama.sh
@@ -0,0 +1,46 @@

# Check if the argument is empty or unset
if [ -z "$1" ]; then
    echo "Missing Mandatory argument: Path to llama weights"
    echo "Usage: ./package_llama.sh ./model/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235"
    exit 1
fi

MODEL_GENERATION="true"
LLAMA2_WEIGHTS="$1"

if [ -n "$2" ]; then
    MODEL_GENERATION="$2"
fi

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

if [ "$MODEL_GENERATION" = "true" ]; then
    echo "Cleaning up previous build of llama-cpp"
    rm -rf build
    git clone https://github.com/ggerganov/llama.cpp.git build
    cd build
    make
    python -m pip install -r requirements.txt

    echo "Convert the 7B model to ggml FP16 format"
    python convert.py $LLAMA2_WEIGHTS --outfile ggml-model-f16.gguf

    echo "Quantize the model to 4-bits (using q4_0 method)"
    ./quantize ggml-model-f16.gguf ../ggml-model-q4_0.gguf q4_0

    cd ..
    export LLAMA2_Q4_MODEL=$PWD/ggml-model-q4_0.gguf
    echo "Saved quantized model weights to $LLAMA2_Q4_MODEL"
fi

echo "Creating torchserve model archive"
torch-model-archiver --model-name llamacpp --version 1.0 --handler llama_cpp_handler.py --config-file model-config.yaml --archive-format tgz

mkdir -p model_store
mv llamacpp.tar.gz model_store/.
if [ "$MODEL_GENERATION" = "true" ]; then
    echo "Cleaning up build of llama-cpp"
    rm -rf build
fi

1 change: 1 addition & 0 deletions examples/LLM/llama2/chat_app/requirements.txt
@@ -0,0 +1 @@
streamlit>=1.26.0
Binary image files (the screenshots referenced in the README) are included in this PR but cannot be displayed in the diff view.