support inf2 neuronx transformer continuous batching #2803
@@ -0,0 +1,21 @@
# Demo2: Llama-2 Using TorchServe continuous batching on inf2

This document describes serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) with transformers-neuronx continuous batching.

This example can also be extended to support the models listed below; a handler sketch follows the table.

| Model | Model Class |
| :--- | :----: |
| opt | opt.model.OPTForSampling |
| gpt2 | gpt2.model.GPT2ForSampling |
| gptj | gptj.model.GPTJForSampling |
| gpt_neox | gptneox.model.GPTNeoXForSampling |
| llama | llama.model.LlamaForSampling |
| mistral | mistral.model.MistralForSampling |
| bloom | bloom.model.BloomForSampling |
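Only the model class (and, where needed, the tokenizer class) changes between models. Below is a minimal sketch that mirrors the Llama handler added in this PR but points at the Mistral classes from the table; the file name and the use of `transformers.AutoTokenizer` are assumptions for illustration, not something this PR ships.

```python
# mistral_handler.py (hypothetical): extend the base continuous batching
# handler by swapping in the model/tokenizer classes for another model.
from ts.handler_utils.utils import import_class
from ts.torch_handler.distributed.base_neuronx_continuous_batching_handler import (
    BaseNeuronXContinuousBatchingHandler,
)


class MistralContinuousBatchingHandler(BaseNeuronXContinuousBatchingHandler):
    def __init__(self):
        super().__init__()
        # transformers_neuronx model class, taken from the table above
        self.model_class = import_class(
            class_name="mistral.model.MistralForSampling",
            module_prefix="transformers_neuronx",
        )
        # Assumption: AutoTokenizer resolves the right tokenizer for the checkpoint
        self.tokenizer_class = import_class(
            class_name="transformers.AutoTokenizer",
        )
```

Such a handler would then be passed to `torch-model-archiver --handler ...` in the same way the Llama handler is used in the notebook.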
The batch size is defined in [model-config.yaml](model-config.yaml). The batch size indicates the maximum number of requests TorchServe will aggregate and send to the custom handler within the batch delay. It is also the batch size used for the Inf2 model compilation.
Since the compilation batch size can influence compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively small value, say 4.

`inf2-llama-2-continuous-batching.ipynb` is the notebook example.
@@ -0,0 +1,128 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "## TorchServe Continuous Batching Serve Llama-2 on Inferentia-2\n",
    "This notebook demonstrates TorchServe continuous batching serving Llama-2-13b on Inferentia-2 `inf2.24xlarge` with DLAMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231226"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Installation\n",
    "Note: This section can be skipped once [Neuron DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) releases the latest TorchServe version."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Install Python venv\n",
    "!sudo apt-get install -y python3.9-venv g++\n",
    "\n",
    "# Create Python venv\n",
    "!python3.9 -m venv aws_neuron_venv_pytorch\n",
    "\n",
    "# Activate Python venv\n",
    "!source aws_neuron_venv_pytorch/bin/activate\n",
    "!python -m pip install -U pip\n",
    "\n",
    "# Clone TorchServe git repository\n",
    "!git clone https://github.com/pytorch/serve.git\n",
    "\n",
    "# Install dependencies\n",
    "!python ~/serve/ts_scripts/install_dependencies.py --neuronx --environment=dev\n",
    "\n",
    "# Install torchserve and torch-model-archiver\n",
"python ts_scripts/install_from_src.py" | ||
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Create model artifacts\n",
    "\n",
    "Note: run `mv model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55/model.safetensors.index.json model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55/model.safetensors.index.json.bkp`\n",
    " if the Neuron SDK does not support safetensors"
   ],

Review comment: "if" neuron sdk ...? On what does this depend?
Reply: Neuron SDK support for the model safetensors format is still in beta.

"metadata": { | ||
"collapsed": false | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"outputs": [], | ||
"source": [ | ||
"# login in Hugginface hub\n", | ||
"!huggingface-cli login --token $HUGGINGFACE_TOKEN\n", | ||
"!python ~/serve/examples/large_models/utils/Download_model.py --model_path model --model_name meta-llama/Llama-2-13b-hf --use_auth_token True\n", | ||
"\n", | ||
"# Create TorchServe model artifacts\n", | ||
"!torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive\n", | ||
"!mv model llama-2-13b\n", | ||
"!mkdir -p ~/serve/model_store\n", | ||
"!mv ~/serve/llama-2-13b /home/model-server/model_store\n", | ||
"\n", | ||
"# Precompile complete once the log \"Model llama-2-13b loaded successfully\"\n", | ||
"torchserve --ncs --start --model-store /home/model-server/model_store --models llama-2-13b --ts-config ../config.properties" | ||
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Run inference"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Run single inference request\n",
    "!python ~/serve/examples/large_models/utils/test_llm_streaming_response.py -m llama-2-13b -o 50 -t 2 -n 4 --prompt-text \"Today the weather is really nice and I am planning on \" --prompt-randomize"
   ],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
@@ -0,0 +1,17 @@
from ts.handler_utils.utils import import_class
from ts.torch_handler.distributed.base_neuronx_continuous_batching_handler import (
    BaseNeuronXContinuousBatchingHandler,
)


class LlamaContinuousBatchingHandler(BaseNeuronXContinuousBatchingHandler):
    def __init__(self):
        super(LlamaContinuousBatchingHandler, self).__init__()
        self.model_class = import_class(
            class_name="llama.model.LlamaForSampling",
            module_prefix="transformers_neuronx",
        )

        self.tokenizer_class = import_class(
            class_name="transformers.LlamaTokenizer",
        )
@@ -0,0 +1,15 @@
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 1
responseTimeout: 10800
batchSize: 8
continuousBatching: true

handler:
    model_path: "model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55"
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 12
    max_length: 100
    max_new_tokens: 50
    batch_size: 8
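
For orientation, the frontend `batchSize` and the handler-level `batch_size` appear intended to agree (both are 8 above), since both reflect the batch size the Neuron model is compiled with. A small, hypothetical sanity check, not part of this PR and assuming PyYAML is installed:

```python
# check_model_config.py (hypothetical): verify that the frontend batch size
# matches the batch size the handler will use for Neuron model compilation.
import yaml

with open("model-config.yaml") as f:
    cfg = yaml.safe_load(f)

handler_cfg = cfg["handler"]
assert cfg["batchSize"] == handler_cfg["batch_size"], (
    "frontend batchSize and handler batch_size should both equal the compiled batch size"
)
print(
    f"tp_degree={handler_cfg['tp_degree']}, amp={handler_cfg['amp']}, "
    f"max_length={handler_cfg['max_length']}, max_new_tokens={handler_cfg['max_new_tokens']}"
)
```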
@@ -0,0 +1 @@
sentencepiece |
@@ -0,0 +1,9 @@
# Demo1: Llama-2 Using TorchServe micro-batching and Streamer on inf2

This document describes serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with TorchServe [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1) support.

**Note**: To run the model on an Inf2 instance, the model gets compiled as a preprocessing step. As part of the compilation process, a specific batch size is used to generate the model graph. When running inference, the input then needs to match the batch size that was used during compilation. Model compilation and input padding to match the compiled model batch size are taken care of by the [custom handler](inf2_handler.py) in this example; the sketch below illustrates the padding idea.
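
A minimal, standalone sketch of the padding idea (illustrative only; the function name and the pad token choice are assumptions, and the real logic lives in [inf2_handler.py](inf2_handler.py)):

```python
# Illustrative only: pad a partial batch of token ids up to the compiled batch size.
import torch


def pad_to_compiled_batch_size(
    input_ids: torch.Tensor, compiled_batch_size: int, pad_token_id: int = 0
) -> torch.Tensor:
    """Pad a [n, seq_len] batch with dummy rows so n matches the compiled batch size."""
    n, seq_len = input_ids.shape
    if n > compiled_batch_size:
        raise ValueError(f"{n} requests exceed compiled batch size {compiled_batch_size}")
    if n == compiled_batch_size:
        return input_ids
    padding = torch.full(
        (compiled_batch_size - n, seq_len), pad_token_id, dtype=input_ids.dtype
    )
    return torch.cat([input_ids, padding], dim=0)


# Example: 3 live requests padded up to a compiled (micro) batch size of 4.
batch = torch.randint(0, 32000, (3, 16))
print(pad_to_compiled_batch_size(batch, 4).shape)  # torch.Size([4, 16])
```
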
The batch size and micro batch size configurations are present in [model-config.yaml](model-config.yaml). The batch size indicates the maximum number of requests TorchServe will aggregate and send to the custom handler within the batch delay.
The batch size is chosen to be a relatively large value, say 16, since micro batching enables running the preprocess (tokenization) and inference steps in parallel on the micro batches. The micro batch size is the batch size used for the Inf2 model compilation.
Since the compilation batch size can influence compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively small value, say 4.
@@ -1,3 +1,4 @@
+import orjson
 import requests

 response = requests.post(

Review comment: Why is orjson used here? I've found it only in the dev dependencies. Would normal json be sufficient here?
Reply: According to the benchmark report, orjson performs much better than the standard json module.

@@ -9,6 +10,7 @@
 for chunk in response.iter_content(chunk_size=None):
     if chunk:
         data = chunk.decode("utf-8")
-        print(data, end="", flush=True)
+        data = orjson.loads(data)
+        print(data["text"], end=" ", flush=True)

 print("")
Review comment: Better to set an env variable where users switch between the two choices in a single place and then just copy the commands.