
Commit

Update Readme to include details about AOT compilation
Naman Nandan committed Sep 13, 2023
1 parent 3da4a78 commit b747cd3
Showing 1 changed file with 17 additions and 24 deletions.
41 changes: 17 additions & 24 deletions examples/large_models/inferentia2/llama2/Readme.md
This document describes serving the [Llama 2](https://huggingface.co/meta-llama) model on AWS Inferentia2 with TorchServe.

Inferentia2 uses the [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/), which is built on top of the PyTorch XLA stack. For large model inference, the [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used, which takes care of model partitioning and running inference.

**Note**: To run the model on an Inf2 instance, the model gets compiled as a preprocessing step. The compilation process generates the model graph for a specific batch size, so the input passed at inference time must match the batch size used during compilation. Model compilation and input padding to match the compiled batch size are taken care of by the custom handler in this example.

The batch size and micro batch size configurations are present in `model-config.yaml`. The batch size indicates the maximum number of requests TorchServe will aggregate and send to the custom handler within the batch delay.
The batch size is chosen to be a relatively large value, say 16, since micro batching enables running the preprocess (tokenization) and inference steps in parallel on the micro batches. The micro batch size is the batch size used for the Inf2 model compilation.
Since the compilation batch size influences compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively small value, say 4.
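
As an illustration of the micro batching and padding behaviour described above, the sketch below splits an aggregated batch into micro batches and pads the last one to the compiled batch size. It is a simplified stand-in for the example's custom handler, and the names in it are hypothetical.

```python
# Illustrative only: split an aggregated TorchServe batch into micro batches and
# pad the last one to the compiled batch size. The real logic lives in the
# example's custom handler (inf2_handler.py); names here are hypothetical.
from typing import List

MICRO_BATCH_SIZE = 4  # must match the batch size the Inf2 model was compiled with

def to_micro_batches(requests: List[str]) -> List[List[str]]:
    micro_batches = []
    for start in range(0, len(requests), MICRO_BATCH_SIZE):
        chunk = requests[start:start + MICRO_BATCH_SIZE]
        if len(chunk) < MICRO_BATCH_SIZE:
            # Pad with empty prompts so the micro batch matches the compiled size
            chunk = chunk + [""] * (MICRO_BATCH_SIZE - len(chunk))
        micro_batches.append(chunk)
    return micro_batches

# 6 requests with a micro batch size of 4 -> two micro batches, the second padded
print(to_micro_batches([f"prompt {i}" for i in range(6)]))
```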

This example also demonstrates the use of the neuronx cache to store Inf2 model compilation artifacts, via the `NEURONX_CACHE` and `NEURONX_DUMP_TO` environment variables set in the custom handler.
When the model is loaded for the first time, it is compiled for the configured micro batch size and the compilation artifacts are saved to the neuronx cache.
On subsequent model loads, the compilation artifacts in the neuronx cache serve as Ahead-of-Time (AOT) compilation artifacts and significantly reduce the model load time.
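
As an illustration of this mechanism, the following sketch shows how a handler could enable the neuronx cache by setting these environment variables before the model is compiled and loaded. It is not the example's actual handler code, and the cache directory path is an assumption.

```python
# Minimal sketch (not the example's handler): enable the neuronx cache so that
# compilation artifacts are written to, and later reused from, a local directory.
# The directory name below is an assumption.
import os

def enable_neuronx_cache(cache_dir: str = "./neuron_cache") -> None:
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["NEURONX_CACHE"] = "on"         # turn the compilation cache on
    os.environ["NEURONX_DUMP_TO"] = cache_dir  # where compilation artifacts are stored

# Call this before compiling/loading the transformers-neuronx model, so that a
# subsequent model load can reuse the cached (AOT) compilation artifacts.
enable_neuronx_cache()
```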

### Step 1: Inf2 instance

Get an Inf2 instance (note: this example was tested on the `inf2.24xlarge` instance type), SSH to it, and make sure to use the following DLAMI, as it comes with PyTorch and the necessary packages for the AWS Neuron SDK pre-installed.
DLAMI name: `Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720 Amazon Machine Image (AMI)` or higher.

### Step 2: Package Installations

Follow the steps below to complete the package installations:

```bash
# Clone the TorchServe repository
git clone https://github.com/pytorch/serve.git
cd serve

# Install dependencies
python ts_scripts/install_dependencies.py --neuronx --environment=dev

# Install torchserve and torch-model-archiver
python ts_scripts/install_from_src.py

# Navigate to `examples/large_models/inferentia2/llama2` directory
cd examples/large_models/inferentia2/llama2/

# Install additional necessary packages
python -m pip install -r requirements.txt
```



### Step 3: Save the model split checkpoints compatible with `transformers-neuronx`
Log in to Hugging Face
```bash
huggingface-cli login
```

Run the `inf2_save_split_checkpoints.py` script
```bash
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
```


### Step 4: Package model artifacts

```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py --extra-files ./llama-2-13b-split -r requirements.txt --config-file model-config.yaml --archive-format no-archive
```

### Step 5: Add the model artifacts to model store

```bash
mkdir model_store
mv llama-2-13b model_store
```

### Step 6: Start torchserve

```bash
torchserve --ncs --start --model-store model_store --ts-config config.properties
```
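
Optionally, before registering the model, you can confirm the frontend is healthy; a minimal check assuming the default inference port 8080 (configurable in `config.properties`):

```python
# Optional health check against TorchServe's ping endpoint.
# Assumes the default inference port 8080 (adjust if config.properties changes it).
import requests

resp = requests.get("http://localhost:8080/ping", timeout=5)
print(resp.status_code, resp.text)  # expect HTTP 200 and a "Healthy" status
```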

### Step 7: Register model

```bash
curl -X POST "http://localhost:8081/models?url=llama-2-13b"
```
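
Optionally, you can confirm the model registered successfully through the management API; a minimal sketch assuming the default management port 8081:

```python
# Optional: describe the registered model via the TorchServe management API.
# Assumes the default management port 8081.
import json
import requests

resp = requests.get("http://localhost:8081/models/llama-2-13b", timeout=5)
print(json.dumps(resp.json(), indent=2))  # worker status, batch size, etc.
```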

### Step 8: Run inference

```bash
python test_stream_response.py
```
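
The sketch below is an illustrative streaming client of the kind `test_stream_response.py` provides, not the actual script; it assumes the default inference port 8080, the standard `predictions/<model_name>` endpoint, and a plain-text prompt body.

```python
# Illustrative streaming client (not the actual test_stream_response.py).
# Assumes the default inference port 8080 and a plain-text prompt payload.
import requests

response = requests.post(
    "http://localhost:8080/predictions/llama-2-13b",
    data="Today the weather is really nice and I am planning on",
    stream=True,
    timeout=300,
)
for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    if chunk:
        print(chunk, end="", flush=True)
print()
```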

### Step 9: Stop torchserve

```bash
torchserve --stop
```
