support inf2 neuronx transformer continuous batching #2803

Merged: 55 commits from `feat/inf2_cb` into `master` on Feb 27, 2024
Commits
6b2a07c
fmt
lxning Nov 16, 2023
6e27cbe
fmt
lxning Nov 19, 2023
7dcfffd
fmt
lxning Nov 21, 2023
5eed61e
add space
lxning Nov 21, 2023
c81320a
fmt
lxning Nov 21, 2023
426e930
fmt
lxning Nov 21, 2023
eb98816
fmt
lxning Nov 22, 2023
8d1251f
fmt
lxning Nov 22, 2023
285b1a6
Merge branch 'master' into feat/inf2_cb
lxning Nov 22, 2023
81c4532
fix regression test
lxning Nov 22, 2023
9f2e450
check key result
lxning Nov 23, 2023
687a1f5
fmt
lxning Nov 23, 2023
632e896
update folder
lxning Nov 23, 2023
f6f6df1
fmt
lxning Nov 25, 2023
31446cf
update key name
lxning Nov 26, 2023
60f8a4c
add orjson
lxning Nov 26, 2023
7cee167
update streamer
lxning Nov 27, 2023
540115d
add key text for streamer iterator
lxning Nov 27, 2023
63f42b5
update test_hf_batch_streamer output
lxning Nov 28, 2023
42d4719
integrate split checkpoint in handler
lxning Dec 2, 2023
5a5252e
fmt
lxning Dec 3, 2023
dd42d7c
fmt
lxning Dec 3, 2023
100927e
fmt
lxning Dec 11, 2023
06e8417
fmt
lxning Dec 14, 2023
607a349
fmt
lxning Dec 19, 2023
ea28f27
fmt
lxning Jan 3, 2024
e54e853
update notebook
lxning Jan 5, 2024
31af681
fmt
lxning Jan 5, 2024
a3ad43a
add handler utils
lxning Jan 6, 2024
ae0e7d3
fix typo
lxning Jan 6, 2024
12b34b6
fmt
lxning Jan 6, 2024
c07e8a8
fmt
lxning Jan 6, 2024
6a5867a
fmt
lxning Jan 7, 2024
e0e8bae
fmt
lxning Jan 8, 2024
32adb90
fmt
lxning Jan 9, 2024
df7268e
merge master
lxning Jan 9, 2024
e83d58c
Fix lint
lxning Jan 9, 2024
3669a0d
fix typo in notebook example
lxning Jan 9, 2024
642d59a
enable authentication
lxning Jan 9, 2024
1c6b211
fmt
lxning Jan 10, 2024
78ef0ae
fmt
lxning Jan 22, 2024
a3cbd77
Merge branch 'master' into feat/inf2_cb
lxning Jan 23, 2024
414dcd5
Merge branch 'master' into feat/inf2_cb
lxning Jan 23, 2024
ded1c26
fmt
lxning Jan 23, 2024
4bd9b8e
update readme
lxning Jan 24, 2024
932e7ac
fix lint
lxning Jan 24, 2024
f7a5531
fmt
lxning Feb 19, 2024
db4566e
Merge branch 'master' into feat/inf2_cb
lxning Feb 19, 2024
cbfcec4
update test data
lxning Feb 19, 2024
da34b53
update test
lxning Feb 20, 2024
b373077
update test
lxning Feb 20, 2024
aa3eafe
replace os.path with pathlib
lxning Feb 20, 2024
2cb2229
update test
lxning Feb 20, 2024
253882c
Merge branch 'master' into feat/inf2_cb
lxning Feb 20, 2024
a2ba124
fmt
lxning Feb 22, 2024
17 changes: 7 additions & 10 deletions examples/large_models/inferentia2/llama2/Readme.md
@@ -1,16 +1,13 @@
# Large model inference on Inferentia2

This document briefs on serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1) support.
This document briefs on serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with TorchServe's features:

Inferentia2 uses [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/) which is built on top of PyTorch XLA stack. For large model inference [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used that takes care of model partitioning and running inference.

**Note**: To run the model on an Inf2 instance, the model gets compiled as a preprocessing step. As part of the compilation process, to generate the model graph, a specific batch size is used. Following this, when running inference, we need to pass input which matches the batch size that was used during compilation. Model compilation and input padding to match compiled model batch size is taken care of by the [custom handler](inf2_handler.py) in this example.
* demo1: [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1) support, in the folder `streamer`.
* demo2: continuous batching support, in the folder `continuous_batching`.

The batch size and micro batch size configurations are present in [model-config.yaml](model-config.yaml). The batch size indicates the maximum number of requests torchserve will aggregate and send to the custom handler within the batch delay.
The batch size is chosen to be a relatively large value, say 16 since micro batching enables running the preprocess(tokenization) and inference steps in parallel on the micro batches. The micro batch size is the batch size used for the Inf2 model compilation.
Since compilation batch size can influence compile time and also constrained by the Inf2 instance type, this is chosen to be a relatively smaller value, say 4.
Inferentia2 uses [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/) which is built on top of PyTorch XLA stack. For large model inference [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used that takes care of model partitioning and running inference.

This example also demonstrates the utilization of neuronx cache to store inf2 model compilation artifacts using the `NEURONX_CACHE` and `NEURONX_DUMP_TO` environment variables in the custom handler.
This example folder demonstrates the utilization of neuronx cache to store inf2 model compilation artifacts using the `NEURONX_CACHE` and `NEURON_COMPILE_CACHE_URL` environment variables in the custom handler.
When the model is loaded for the first time, the model is compiled for the configured micro batch size and the compilation artifacts are saved to the neuronx cache.
On subsequent model loads, the compilation artifacts in the neuronx cache serve as `Ahead of Time (AOT)` compilation artifacts and significantly reduce the model load time.
For convenience, the compiled model artifacts for this example are made available on the Torchserve model zoo: `s3://torchserve/mar_files/llama-2-13b-neuronx-b4`\
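
For illustration, here is a minimal sketch of how a handler can wire up these cache variables before the model is compiled. The function name, the paths, and the choice to derive the cache directory from the model directory are assumptions for this sketch, not the exact code in the example handler:

```python
import os
from pathlib import Path


def enable_neuronx_cache(model_dir: str) -> None:
    """Hypothetical helper: turn on the neuronx compile cache so that later
    model loads can reuse the Ahead-of-Time compilation artifacts."""
    cache_dir = Path(model_dir) / "neuron_cache"
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["NEURONX_CACHE"] = "on"                       # enable the cache
    os.environ["NEURON_COMPILE_CACHE_URL"] = str(cache_dir)  # where artifacts are stored
```

Calling a helper like this at the top of the handler's `initialize()`, before the `transformers-neuronx` model is constructed, is what allows the second model load to skip recompilation.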
@@ -22,7 +19,7 @@
Get an Inf2 instance (Note: This example was tested on instance type: `inf2.24xlarge`).
DLAMI Name: ` Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720 Amazon Machine Image (AMI)` or higher.

**Note**: The `inf2.24xlarge` instance consists of 6 neuron chips with 2 neuron cores each. The total accelerator memory is 192GB.
Based on the configuration used in [model-config.yaml](model-config.yaml), with `tp_degree` set to 6, 3 of the 6 neuron chips are used, i.e 6 neuron cores.
Based on the configuration used in [model-config.yaml](streamer/model-config.yaml), with `tp_degree` set to 6, 3 of the 6 neuron chips are used, i.e 6 neuron cores.
On loading the model, the accelerator memory consumed is 38.1GB (12.7GB per chip).

### Step 2: Package Installations
@@ -85,7 +82,7 @@
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13
### Step 4: Package model artifacts

```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler /PATH/TO/inf2_handler.py -r requirements.txt --config-file /PATH/TO/model-config.yaml --archive-format no-archive
Collaborator: Better to set an env variable where users switch between the two choices in a single place and then just copy the commands.

mv llama-2-13b-split llama-2-13b
```
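
As a hedged illustration of that suggestion (the variable name `EXAMPLE_DIR` is hypothetical, and this assumes each demo folder ships its own handler and `model-config.yaml`): set `EXAMPLE_DIR=streamer` or `EXAMPLE_DIR=continuous_batching` once, then reuse the same command, e.g. `torch-model-archiver --model-name llama-2-13b --version 1.0 --handler $EXAMPLE_DIR/inf2_handler.py -r requirements.txt --config-file $EXAMPLE_DIR/model-config.yaml --archive-format no-archive`.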

@@ -0,0 +1,21 @@
# Demo2: Llama-2 Using TorchServe continuous batching on inf2

This document briefs on serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) with [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) continuous batching.

This example can also be extended to support the following models.


| Model | Model Class |
| :--- | :----: |
| opt | opt.model.OPTForSampling |
| gpt2 | gpt2.model.GPT2ForSampling |
| gptj | gptj.model.GPTJForSampling |
| gpt_neox | gptneox.model.GPTNeoXForSampling |
| llama | llama.model.LlamaForSampling |
| mistral | mistral.model.MistralForSampling |
| bloom | bloom.model.BloomForSampling |
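
As a hedged sketch of what that extension looks like in practice, a Mistral variant of the Llama handler shipped in this example could be written as follows. The class name and the `AutoTokenizer` choice are assumptions for illustration; only the Llama handler (shown later in this PR) is actually part of the example:

```python
from ts.handler_utils.utils import import_class
from ts.torch_handler.distributed.base_neuronx_continuous_batching_handler import (
    BaseNeuronXContinuousBatchingHandler,
)


class MistralContinuousBatchingHandler(BaseNeuronXContinuousBatchingHandler):
    """Hypothetical handler: swap in the model class from the table above."""

    def __init__(self):
        super().__init__()
        # transformers-neuronx sampling class for Mistral (see the table above)
        self.model_class = import_class(
            class_name="mistral.model.MistralForSampling",
            module_prefix="transformers_neuronx",
        )
        # Hugging Face tokenizer class matching the checkpoint
        self.tokenizer_class = import_class(
            class_name="transformers.AutoTokenizer",
        )
```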

The batch size is configured in [model-config.yaml](model-config.yaml). It indicates the maximum number of requests TorchServe will aggregate and send to the custom handler within the batch delay, and it is also the batch size used for the Inf2 model compilation.
Since the compilation batch size can influence compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively small value (8 in this example's config).

`inf2-llama-2-continuous-batching.ipynb` is the notebook example.
@@ -0,0 +1,128 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"## TorchServe Continuous Batching Serve Llama-2 on Inferentia-2\n",
"This notebook demonstrates TorchServe continuous batching serving Llama-2-13b on Inferentia-2 `inf2.24xlarge` with DLAMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231226"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Installation\n",
"Note: This section can be skipped once [Neuron DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) release TorchServe latest version."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Install Python venv\n",
"!sudo apt-get install -y python3.9-venv g++\n",
"\n",
"# Create Python venv\n",
"!python3.9 -m venv aws_neuron_venv_pytorch\n",
"\n",
"# Activate Python venv\n",
"!source aws_neuron_venv_pytorch/bin/activate\n",
"!python -m pip install -U pip\n",
"\n",
"# Clone Torchserve git repository\n",
"!git clone https://github.com/pytorch/serve.git\n",
"\n",
"# Install dependencies\n",
"!python ~/serve/ts_scripts/install_dependencies.py --neuronx --environment=dev\n",
"\n",
"# Install torchserve and torch-model-archiver\n",
"!python ~/serve/ts_scripts/install_from_src.py"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Create model artifacts\n",
"\n",
"Note: run `mv model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55/model.safetensors.index.json model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55/model.safetensors.index.json.bkp`\n",
" if neuron sdk does not support safetensors"
Collaborator: "if" neuron sdk ...? On what does this depend?

lxning (Author), Jan 30, 2024: The Neuron SDK's support for the safetensors format is still in beta.

],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# login in Hugginface hub\n",
"!huggingface-cli login --token $HUGGINGFACE_TOKEN\n",
"!python ~/serve/examples/large_models/utils/Download_model.py --model_path model --model_name meta-llama/Llama-2-13b-hf --use_auth_token True\n",
"\n",
"# Create TorchServe model artifacts\n",
"!torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive\n",
"!mv model llama-2-13b\n",
"!mkdir -p ~/serve/model_store\n",
"!mv ~/serve/llama-2-13b /home/model-server/model_store\n",
"\n",
"# Precompile complete once the log \"Model llama-2-13b loaded successfully\"\n",
"torchserve --ncs --start --model-store /home/model-server/model_store --models llama-2-13b --ts-config ../config.properties"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Run inference"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Run single inference request\n",
"!python ~/serve/examples/large_models/utils/test_llm_streaming_response.py -m llama-2-13b -o 50 -t 2 -n 4 --prompt-text \"Today the weather is really nice and I am planning on \" --prompt-randomize"
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@@ -0,0 +1,17 @@
from ts.handler_utils.utils import import_class
from ts.torch_handler.distributed.base_neuronx_continuous_batching_handler import (
BaseNeuronXContinuousBatchingHandler,
)


class LlamaContinuousBatchingHandler(BaseNeuronXContinuousBatchingHandler):
def __init__(self):
super(LlamaContinuousBatchingHandler, self).__init__()
self.model_class = import_class(
class_name="llama.model.LlamaForSampling",
module_prefix="transformers_neuronx",
)

self.tokenizer_class = import_class(
class_name="transformers.LlamaTokenizer",
)
@@ -0,0 +1,15 @@
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 1
responseTimeout: 10800
batchSize: 8
continuousBatching: true

handler:
model_path: "model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55"
model_checkpoint_dir: "llama-2-13b-split"
amp: "bf16"
tp_degree: 12
max_length: 100
max_new_tokens: 50
batch_size: 8
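
For orientation, a hedged sketch of how a handler typically reads the `handler` section above during `initialize`. It relies on TorchServe exposing the parsed model config as `ctx.model_yaml_config`; the exact fields consumed by the base continuous-batching handler may differ:

```python
# Hypothetical excerpt from a handler's initialize(); field names mirror the
# handler section of the model-config.yaml above.
def initialize(self, ctx):
    cfg = ctx.model_yaml_config.get("handler", {})
    model_path = cfg.get("model_path")
    checkpoint_dir = cfg.get("model_checkpoint_dir")
    amp = cfg.get("amp", "bf16")                        # numeric precision for weights
    tp_degree = int(cfg.get("tp_degree", 1))            # tensor-parallel degree across neuron cores
    batch_size = int(cfg.get("batch_size", 1))          # compile-time batch size
    max_new_tokens = int(cfg.get("max_new_tokens", 50))
```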
@@ -0,0 +1 @@
sentencepiece
9 changes: 9 additions & 0 deletions examples/large_models/inferentia2/llama2/streamer/Readme.md
@@ -0,0 +1,9 @@
# Demo1: Llama-2 Using TorchServe micro-batching and Streamer on inf2

This document briefs on serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with TorchServe [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1) support.

**Note**: To run the model on an Inf2 instance, the model gets compiled as a preprocessing step. As part of the compilation process, to generate the model graph, a specific batch size is used. Following this, when running inference, we need to pass input which matches the batch size that was used during compilation. Model compilation and input padding to match compiled model batch size is taken care of by the [custom handler](inf2_handler.py) in this example.

The batch size and micro batch size configurations are present in [model-config.yaml](model-config.yaml). The batch size indicates the maximum number of requests TorchServe will aggregate and send to the custom handler within the batch delay.
The batch size is chosen to be a relatively large value, say 16, since micro batching enables running the preprocess (tokenization) and inference steps in parallel on the micro batches. The micro batch size is the batch size used for the Inf2 model compilation.
Since the compilation batch size can influence compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively small value, say 4.
@@ -1,3 +1,4 @@
import orjson
Collaborator: Why is orjson used here? I've found it only in the dev dependencies. Would normal json be sufficient here?

lxning (Author): According to its benchmark report, orjson performs much better than the standard json module.

import requests

response = requests.post(
@@ -9,6 +10,7 @@
for chunk in response.iter_content(chunk_size=None):
if chunk:
data = chunk.decode("utf-8")
print(data, end="", flush=True)
data = orjson.loads(data)
print(data["text"], end=" ", flush=True)

print("")
9 changes: 8 additions & 1 deletion examples/large_models/utils/Download_model.py
@@ -39,6 +39,13 @@ def hf_model(model_str):
parser.add_argument(
"--model_name", "-m", type=hf_model, required=True, help="HuggingFace model name"
)
parser.add_argument(
"--use_auth_token",
"-t",
type=bool,
default=False,
help="Use HF authentication token",
)
parser.add_argument("--revision", "-r", type=str, default="main", help="Revision")
args = parser.parse_args()
# Only download pytorch checkpoint files
@@ -49,6 +56,6 @@ def hf_model(model_str):
revision=args.revision,
allow_patterns=allow_patterns,
cache_dir=args.model_path,
use_auth_token=False,
use_auth_token=args.use_auth_token,
)
print(f"Files for '{args.model_name}' is downloaded to '{snapshot_path}'")