support inf2 neuronx transformer continuous batching #2803
@@ -0,0 +1,21 @@
# Demo2: Llama-2 Using TorchServe continuous batching on inf2

This document describes serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) with transformers-neuronx continuous batching.

This example can also be extended to support the models listed below; a handler sketch follows the table.

| Model | Model Class |
| :--- | :----: |
| opt | opt.model.OPTForSampling |
| gpt2 | gpt2.model.GPT2ForSampling |
| gptj | gptj.model.GPTJForSampling |
| gpt_neox | gptneox.model.GPTNeoXForSampling |
| llama | llama.model.LlamaForSampling |
| mistral | mistral.model.MistralForSampling |
| bloom | bloom.model.BloomForSampling |
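Only the model class (and, where needed, the tokenizer class) changes between models. Below is a minimal sketch that mirrors the Llama handler added in this PR but points at the Mistral classes from the table; the file name and the use of `transformers.AutoTokenizer` are assumptions for illustration, not something this PR ships.

```python
# mistral_handler.py (hypothetical): extend the base continuous batching
# handler by swapping in the model/tokenizer classes for another model.
from ts.handler_utils.utils import import_class
from ts.torch_handler.distributed.base_neuronx_continuous_batching_handler import (
    BaseNeuronXContinuousBatchingHandler,
)


class MistralContinuousBatchingHandler(BaseNeuronXContinuousBatchingHandler):
    def __init__(self):
        super().__init__()
        # transformers_neuronx model class, taken from the table above
        self.model_class = import_class(
            class_name="mistral.model.MistralForSampling",
            module_prefix="transformers_neuronx",
        )
        # Assumption: AutoTokenizer resolves the right tokenizer for the checkpoint
        self.tokenizer_class = import_class(
            class_name="transformers.AutoTokenizer",
        )
```

Such a handler would then be passed to `torch-model-archiver --handler ...` in the same way the Llama handler is used in the notebook.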
The batch size is defined in [model-config.yaml](model-config.yaml). The batch size indicates the maximum number of requests TorchServe will aggregate and send to the custom handler within the batch delay. It is also the batch size used for the Inf2 model compilation.
Since the compilation batch size can influence compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively small value, say 4.

`inf2-llama-2-continuous-batching.ipynb` is the notebook example.
@@ -0,0 +1,128 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "## TorchServe Continuous Batching Serve Llama-2 on Inferentia-2\n",
    "This notebook demonstrates TorchServe continuous batching serving Llama-2-13b on Inferentia-2 `inf2.24xlarge` with DLAMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231226"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Installation\n",
    "Note: This section can be skipped once [Neuron DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) releases the latest TorchServe version."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Install Python venv\n",
    "!sudo apt-get install -y python3.9-venv g++\n",
    "\n",
    "# Create Python venv\n",
    "!python3.9 -m venv aws_neuron_venv_pytorch\n",
    "\n",
    "# Activate Python venv\n",
    "!source aws_neuron_venv_pytorch/bin/activate\n",
    "!python -m pip install -U pip\n",
    "\n",
    "# Clone TorchServe git repository\n",
    "!git clone https://github.com/pytorch/serve.git\n",
    "\n",
    "# Install dependencies\n",
    "!python ~/serve/ts_scripts/install_dependencies.py --neuronx --environment=dev\n",
    "\n",
    "# Install torchserve and torch-model-archiver\n",
"python ts_scripts/install_from_src.py" | ||
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Create model artifacts\n",
    "\n",
    "Note: run `mv model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55/model.safetensors.index.json model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55/model.safetensors.index.json.bkp`\n",
    " if the Neuron SDK does not support safetensors"
   ],

Review comment: "if" neuron sdk ...? On what does this depend?
Reply: Neuron SDK support for the model safetensors format is still in beta.

"metadata": { | ||
"collapsed": false | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"outputs": [], | ||
"source": [ | ||
"# login in Hugginface hub\n", | ||
"!huggingface-cli login --token $HUGGINGFACE_TOKEN\n", | ||
"!python ~/serve/examples/large_models/utils/Download_model.py --model_path model --model_name meta-llama/Llama-2-13b-hf --use_auth_token True\n", | ||
"\n", | ||
"# Create TorchServe model artifacts\n", | ||
"!torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive\n", | ||
"!mv model llama-2-13b\n", | ||
"!mkdir -p ~/serve/model_store\n", | ||
"!mv ~/serve/llama-2-13b /home/model-server/model_store\n", | ||
"\n", | ||
"# Precompile complete once the log \"Model llama-2-13b loaded successfully\"\n", | ||
"torchserve --ncs --start --model-store /home/model-server/model_store --models llama-2-13b --ts-config ../config.properties" | ||
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Run inference"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Run single inference request\n",
    "!python ~/serve/examples/large_models/utils/test_llm_streaming_response.py -m llama-2-13b -o 50 -t 2 -n 4 --prompt-text \"Today the weather is really nice and I am planning on \" --prompt-randomize"
   ],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
@@ -0,0 +1,17 @@
from ts.handler_utils.utils import import_class
from ts.torch_handler.distributed.base_neuronx_continuous_batching_handler import (
    BaseNeuronXContinuousBatchingHandler,
)


class LlamaContinuousBatchingHandler(BaseNeuronXContinuousBatchingHandler):
    def __init__(self):
        super(LlamaContinuousBatchingHandler, self).__init__()
        self.model_class = import_class(
            class_name="llama.model.LlamaForSampling",
            module_prefix="transformers_neuronx",
        )

        self.tokenizer_class = import_class(
            class_name="transformers.LlamaTokenizer",
        )
@@ -0,0 +1,15 @@
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 1
responseTimeout: 10800
batchSize: 8
continuousBatching: true

handler:
    model_path: "model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55"
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 12
    max_length: 100
    max_new_tokens: 50
    batch_size: 8
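
For orientation, the frontend `batchSize` and the handler-level `batch_size` appear intended to agree (both are 8 above), since both reflect the batch size the Neuron model is compiled with. A small, hypothetical sanity check, not part of this PR and assuming PyYAML is installed:

```python
# check_model_config.py (hypothetical): verify that the frontend batch size
# matches the batch size the handler will use for Neuron model compilation.
import yaml

with open("model-config.yaml") as f:
    cfg = yaml.safe_load(f)

handler_cfg = cfg["handler"]
assert cfg["batchSize"] == handler_cfg["batch_size"], (
    "frontend batchSize and handler batch_size should both equal the compiled batch size"
)
print(
    f"tp_degree={handler_cfg['tp_degree']}, amp={handler_cfg['amp']}, "
    f"max_length={handler_cfg['max_length']}, max_new_tokens={handler_cfg['max_new_tokens']}"
)
```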
@@ -0,0 +1 @@
sentencepiece |
@@ -0,0 +1,9 @@
# Demo1: Llama-2 Using TorchServe micro-batching and Streamer on inf2

This document describes serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with TorchServe [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1) support.

**Note**: To run the model on an Inf2 instance, the model gets compiled as a preprocessing step. As part of the compilation process, a specific batch size is used to generate the model graph. When running inference, the input then needs to match the batch size that was used during compilation. Model compilation and input padding to match the compiled model batch size are taken care of by the [custom handler](inf2_handler.py) in this example; the sketch below illustrates the padding idea.
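
A minimal, standalone sketch of the padding idea (illustrative only; the function name and the pad token choice are assumptions, and the real logic lives in [inf2_handler.py](inf2_handler.py)):

```python
# Illustrative only: pad a partial batch of token ids up to the compiled batch size.
import torch


def pad_to_compiled_batch_size(
    input_ids: torch.Tensor, compiled_batch_size: int, pad_token_id: int = 0
) -> torch.Tensor:
    """Pad a [n, seq_len] batch with dummy rows so n matches the compiled batch size."""
    n, seq_len = input_ids.shape
    if n > compiled_batch_size:
        raise ValueError(f"{n} requests exceed compiled batch size {compiled_batch_size}")
    if n == compiled_batch_size:
        return input_ids
    padding = torch.full(
        (compiled_batch_size - n, seq_len), pad_token_id, dtype=input_ids.dtype
    )
    return torch.cat([input_ids, padding], dim=0)


# Example: 3 live requests padded up to a compiled (micro) batch size of 4.
batch = torch.randint(0, 32000, (3, 16))
print(pad_to_compiled_batch_size(batch, 4).shape)  # torch.Size([4, 16])
```
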
The batch size and micro batch size configurations are present in [model-config.yaml](model-config.yaml). The batch size indicates the maximum number of requests TorchServe will aggregate and send to the custom handler within the batch delay.
The batch size is chosen to be a relatively large value, say 16, since micro batching enables running the preprocess (tokenization) and inference steps in parallel on the micro batches. The micro batch size is the batch size used for the Inf2 model compilation.
Since the compilation batch size can influence compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively small value, say 4.
@@ -1,3 +1,4 @@
+import orjson
 import requests

 response = requests.post(

Review comment: Why is orjson used here? I've found it only in the dev dependencies. Would normal json be sufficient here?
Reply: According to the benchmark report, orjson performs much better than the standard json module.

@@ -9,6 +10,7 @@
 for chunk in response.iter_content(chunk_size=None):
     if chunk:
         data = chunk.decode("utf-8")
-        print(data, end="", flush=True)
+        data = orjson.loads(data)
+        print(data["text"], end=" ", flush=True)

 print("")
Review comment: Better to set an env variable where users switch between the two choices in a single place and then just copy the commands.