Llama2 Chatbot on Mac #2618

Merged: 24 commits, merged on Sep 30, 2023

Commits (24)
- `bc6413d` Llama2 Chat app (agunapal, Sep 27, 2023)
- `4c09344` fix lint (agunapal, Sep 27, 2023)
- `9cbd9b0` Merge branch 'master' into examples/llama2_app (agunapal, Sep 27, 2023)
- `5c70f46` fix lint (agunapal, Sep 27, 2023)
- `14e8ad8` Merge branch 'examples/llama2_app' of https://github.com/agunapal/ser… (agunapal, Sep 27, 2023)
- `e82eab0` Added system architecture diagram (agunapal, Sep 27, 2023)
- `043630b` Merge branch 'master' into examples/llama2_app (agunapal, Sep 27, 2023)
- `cbea556` Added system architecture diagram (agunapal, Sep 27, 2023)
- `ae7dadd` Merge branch 'examples/llama2_app' of https://github.com/agunapal/ser… (agunapal, Sep 27, 2023)
- `9d106a8` Added system architecture diagram (agunapal, Sep 27, 2023)
- `f71007b` Added system architecture diagram (agunapal, Sep 27, 2023)
- `7e24125` Added system architecture diagram (agunapal, Sep 27, 2023)
- `e7f87f9` handler code (agunapal, Sep 28, 2023)
- `041d85b` lint (agunapal, Sep 28, 2023)
- `ff675c2` review comments (agunapal, Sep 29, 2023)
- `ed081da` added f strings (agunapal, Sep 29, 2023)
- `ddbe473` review comments (agunapal, Sep 29, 2023)
- `0956d20` Merge branch 'master' into examples/llama2_app (agunapal, Sep 29, 2023)
- `738e6db` review comments (agunapal, Sep 29, 2023)
- `bc97b8c` Merge branch 'examples/llama2_app' of https://github.com/agunapal/ser… (agunapal, Sep 29, 2023)
- `7b4cfbf` lint failure (agunapal, Sep 29, 2023)
- `fadb7b8` Merge branch 'master' into examples/llama2_app (agunapal, Sep 30, 2023)
- `0807fb8` Merge branch 'master' into examples/llama2_app (agunapal, Sep 30, 2023)
- `6eb0d65` Merge branch 'master' into examples/llama2_app (agunapal, Sep 30, 2023)
125 changes: 125 additions & 0 deletions examples/LLM/llama2/chat_app/Readme.md
@@ -0,0 +1,125 @@

# TorchServe Llama 2 Chat App

This is an example showing how to deploy a Llama 2 chat app using TorchServe.
We use [streamlit](https://github.com/streamlit/streamlit) to create the app.

We use [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) in this example.

You can run this example on your laptop to understand how to use TorchServe.


## Architecture

![Chatbot Architecture](./screenshots/architecture.png)


## Pre-requisites

The following example has been tested on an M1 Mac.
Before you install TorchServe, make sure you have the following installed:
1) JDK 17

Make sure your `javac` version is `17.x.x`:
```
javac --version
javac 17.0.8
```
You can download it from [java](https://www.oracle.com/java/technologies/downloads/#jdk17-mac)
2) Install conda with support for arm64

3) Since we are running this example on a Mac, we will use the 7B Llama 2 model.
Download the Llama 2 7B weights by following the instructions [here](https://github.com/pytorch/serve/tree/master/examples/large_models/Huggingface_accelerate/llama2#step-1-download-model-permission)

4) Install streamlit with

```
python -m pip install -r requirements.txt
```


### Steps

#### Install TorchServe
Install TorchServe with the following steps:

```
python ts_scripts/install_dependencies.py
pip install torchserve torch-model-archiver torch-workflow-archiver
```

#### Package model for TorchServe

Run this script to create `llamacpp.tar.gz`, which will be loaded in TorchServe:

```
source package_llama.sh <path to llama2 snapshot folder>
```
This creates the quantized weights and exports their location as `$LLAMA2_Q4_MODEL`.

For subsequent runs, we don't need to regenerate these weights. We only need to package the handler and model-config.yaml in the tar file.

Hence, you can skip the model generation by running the script as follows:

```
source package_llama.sh <path to llama2 snapshot folder> false
```

You might need to run the command below if the script output indicates it:
```
sudo xcodebuild -license
```
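
Optionally, you can sanity-check the quantized model with llama-cpp-python before starting TorchServe. A minimal sketch, assuming `$LLAMA2_Q4_MODEL` points at the `.gguf` file produced above (it uses the same `Llama` API as the handler):

```
import os

from llama_cpp import Llama

# Load the 4-bit quantized GGUF model produced by package_llama.sh
llm = Llama(model_path=os.environ["LLAMA2_Q4_MODEL"])

output = llm.create_completion(
    "Question: What is the closest star to Earth ? Answer:", max_tokens=64
)
print(output["choices"][0]["text"])
```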

#### Start TorchServe

We launch a streamlit app to configure TorchServe. This opens a UI in your browser, which you can use to start/stop TorchServe, register the model, and change some of the TorchServe parameters.

```
streamlit run torchserve_server_app.py
```

You can check the model status in the app to make sure the model is ready to receive requests.

![Server](./screenshots/Server.png)
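
If you prefer to manage the server from the command line instead of the Server app, a rough equivalent is sketched below (assuming the `llamacpp.tar.gz` archive was placed in `model_store/` by `package_llama.sh`):

```
torchserve --start --ncs --model-store model_store --models llamacpp=llamacpp.tar.gz
```

To stop the server, run `torchserve --stop`.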

#### Client Chat App

We launch a streamlit app from which a client can send requests to TorchServe. The reference app used is [here](https://blog.streamlit.io/how-to-build-a-llama-2-chatbot/).

```
streamlit run client_app.py
```

You can change the model parameters and ask the server questions in the following format:

```
Question: What is the closest star to Earth ? Answer:
```
which results in:

```
Question: What is the closest star to Earth ? Answer: The closest star to Earth is Proxima Centauri, which is located about 4. nobody knows if there is other life out there similar to ours or not, but it's pretty cool that we know of a star so close to us!
```

![Client](./screenshots/Client.png)
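
You can also bypass the Streamlit client and send a request directly to TorchServe's inference API. A minimal `curl` sketch (the JSON fields match what `llama_cpp_handler.py` reads from the request body; adjust the values as needed):

```
curl -X POST http://localhost:8080/predictions/llamacpp \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Question: What is the closest star to Earth ? Answer:", "max_tokens": 128, "top_p": 0.95, "temperature": 0.8}'
```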


### Experiments

You can launch a second client app from another terminal.

You can send requests simultaneously to see how quickly TorchServe responds.

#### Dynamic Batching

You can make use of dynamic batching by configuring the `batch_size` and `max_batch_delay` parameters in TorchServe. You can do this from the Server app.

![Batch Size](./screenshots/batch_size.png)
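
Under the hood, these settings map to the `batch_size` and `max_batch_delay` options of TorchServe's model registration API. As a sketch, you could also set them through the management API (assuming the archive name `llamacpp.tar.gz` and the default management port 8081):

```
curl -X DELETE "http://localhost:8081/models/llamacpp"
curl -X POST "http://localhost:8081/models?url=llamacpp.tar.gz&batch_size=4&max_batch_delay=5000&initial_workers=1"
```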

#### Backend Workers

You can increase the number of backend workers by configuring the `min_workers` parameter in TorchServe. You can do this from the Server app.

The number of workers can be autoscaled based on traffic and usage patterns.

![Workers](./screenshots/Workers.png)
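
Equivalently, the workers can be scaled through TorchServe's management API; a sketch, assuming the default management port 8081:

```
curl -X PUT "http://localhost:8081/models/llamacpp?min_worker=2&synchronous=true"
```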
97 changes: 97 additions & 0 deletions examples/LLM/llama2/chat_app/client_app.py
@@ -0,0 +1,97 @@
import json

import requests
import streamlit as st

# App title
st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")

# Sidebar: TorchServe status check and model parameters
with st.sidebar:
    st.title("🦙💬 Llama 2 Chatbot")

    try:
        # Liveness check, then query the worker status of the registered model
        res = requests.get(url="http://localhost:8080/ping")
        res = requests.get(url="http://localhost:8081/models/llamacpp")
        status = json.loads(res.text)[0]["workers"][0]["status"]

        if status == "READY":
            st.success("Proceed to entering your prompt message!", icon="👉")
        else:
            st.warning("Model not loaded in TorchServe", icon="⚠️")

    except requests.ConnectionError:
        st.warning("TorchServe is not up. Try again", icon="⚠️")

    st.subheader("Model parameters")
    temperature = st.sidebar.slider(
        "temperature", min_value=0.01, max_value=5.0, value=0.8, step=0.01
    )
    top_p = st.sidebar.slider(
        "top_p", min_value=0.01, max_value=1.0, value=0.95, step=0.01
    )
    # The default value must lie within [min_value, max_value]
    max_tokens = st.sidebar.slider(
        "max_tokens", min_value=128, max_value=512, value=256, step=8
    )

# Store LLM generated responses
if "messages" not in st.session_state.keys():
    st.session_state.messages = [
        {"role": "assistant", "content": "How may I assist you today?"}
    ]

# Display or clear chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])


def clear_chat_history():
    st.session_state.messages = [
        {"role": "assistant", "content": "How may I assist you today?"}
    ]


st.sidebar.button("Clear Chat History", on_click=clear_chat_history)


# Function for generating LLaMA2 response. Refactored from https://github.com/a16z-infra/llama2-chatbot
def generate_llama2_response(prompt_input):
    string_dialogue = (
        "Question: What are the names of the planets in the solar system? Answer: "
    )
    headers = {"Content-type": "application/json", "Accept": "text/plain"}
    url = "http://127.0.0.1:8080/predictions/llamacpp"
    data = json.dumps(
        {
            "prompt": prompt_input,
            "max_tokens": max_tokens,
            "top_p": top_p,
            "temperature": temperature,
        }
    )

    res = requests.post(url=url, data=data, headers=headers)

    return res.text


# User-provided prompt
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

# Generate a new response if last message is not from assistant
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_llama2_response(prompt)
            placeholder = st.empty()
            full_response = ""
            for item in response:
                full_response += item
                placeholder.markdown(full_response)
            placeholder.markdown(full_response)
    message = {"role": "assistant", "content": full_response}
    st.session_state.messages.append(message)
57 changes: 57 additions & 0 deletions examples/LLM/llama2/chat_app/llama_cpp_handler.py
@@ -0,0 +1,57 @@
import logging
import os
from abc import ABC

import torch
from llama_cpp import Llama

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class LlamaCppHandler(BaseHandler, ABC):
    def __init__(self):
        super(LlamaCppHandler, self).__init__()
        self.initialized = False
        logger.info("Init done")

    def initialize(self, ctx):
        """Load the quantized Llama 2 GGUF model with llama-cpp-python.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artifacts parameters.
        """
        logger.info("Start initialize")
        model_name = ctx.model_yaml_config["handler"]["model_name"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]
        # Fall back to the path exported by package_llama.sh if the configured path is absent
        if not os.path.exists(model_path):
            model_path = os.environ["LLAMA2_Q4_MODEL"]
        seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
        torch.manual_seed(seed)

        self.model = Llama(model_path=model_path)

    def preprocess(self, data):
        # This example serves one request at a time; return the body of the first item
        for row in data:
            item = row.get("body")
        return item

    def inference(self, data):
        result = self.model.create_completion(
            data["prompt"],
            max_tokens=data["max_tokens"],
            top_p=data["top_p"],
            temperature=data["temperature"],
            stop=["Q:", "\n"],
            echo=True,
        )
        return result

    def postprocess(self, output):
        logger.info(output)
        result = []
        result.append(output["choices"][0]["text"])
        return result
7 changes: 7 additions & 0 deletions examples/LLM/llama2/chat_app/model-config.yaml
@@ -0,0 +1,7 @@
# TorchServe frontend parameters
responseTimeout: 1200

handler:
    model_name: "llama-cpp"
    model_path: "/Users/agunapal/Documents/experiments/llama/ggml-model-q4_0.gguf"
    manual_seed: 40
46 changes: 46 additions & 0 deletions examples/LLM/llama2/chat_app/package_llama.sh
@@ -0,0 +1,46 @@

# Check if the argument is empty or unset
if [ -z "$1" ]; then
    echo "Missing Mandatory argument: Path to llama weights"
    echo "Usage: ./package_llama.sh ./model/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235"
    exit 1
fi

MODEL_GENERATION="true"
LLAMA2_WEIGHTS="$1"

if [ -n "$2" ]; then
    MODEL_GENERATION="$2"
fi

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

if [ "$MODEL_GENERATION" = "true" ]; then
    echo "Cleaning up previous build of llama-cpp"
    rm -rf build
    git clone https://github.com/ggerganov/llama.cpp.git build
    cd build
    make
    python -m pip install -r requirements.txt

    echo "Convert the 7B model to ggml FP16 format"
    python convert.py $LLAMA2_WEIGHTS --outfile ggml-model-f16.gguf

    echo "Quantize the model to 4-bits (using q4_0 method)"
    ./quantize ggml-model-f16.gguf ../ggml-model-q4_0.gguf q4_0

    cd ..
    export LLAMA2_Q4_MODEL=$PWD/ggml-model-q4_0.gguf
    echo "Saved quantized model weights to $LLAMA2_Q4_MODEL"
fi

echo "Creating torchserve model archive"
torch-model-archiver --model-name llamacpp --version 1.0 --handler llama_cpp_handler.py --config-file model-config.yaml --archive-format tgz

mkdir -p model_store
mv llamacpp.tar.gz model_store/.
if [ "$MODEL_GENERATION" = "true" ]; then
    echo "Cleaning up build of llama-cpp"
    rm -rf build
fi

1 change: 1 addition & 0 deletions examples/LLM/llama2/chat_app/requirements.txt
@@ -0,0 +1 @@
streamlit>=1.26.0
Binary image files (the screenshots referenced in the README) are included in this PR but cannot be displayed in the diff view.