[WIP] Enable generation of AOT compiled artifacts for llama2 on inf2 example #2733
Changes from all commits: 7cf2937, 6495fc3, 9a53fb4, 4a240ff, aeae4f6
Example README:
@@ -10,9 +10,9 @@ The batch size and micro batch size configurations are present in [model-config.
The batch size is chosen to be a relatively large value, say 16, since micro batching enables running the preprocess (tokenization) and inference steps in parallel on the micro batches. The micro batch size is the batch size used for the Inf2 model compilation.
Since the compilation batch size can influence compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively smaller value, say 4.
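For illustration only, here is a minimal Python sketch (not TorchServe's micro-batching implementation) of the idea described above: a frontend batch of 16 is split into micro batches of 4, and tokenization of later micro batches can overlap with inference on earlier ones. The `tokenize` and `infer` functions are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 16        # frontend batch size
MICRO_BATCH_SIZE = 4   # compilation batch size for the Inf2 model


def tokenize(micro_batch):
    # Placeholder for the preprocess (tokenization) step.
    return [f"tokens({item})" for item in micro_batch]


def infer(token_batch):
    # Placeholder for the Inf2 inference step.
    return [f"generated({tokens})" for tokens in token_batch]


def micro_batched_inference(batch):
    micro_batches = [
        batch[i : i + MICRO_BATCH_SIZE] for i in range(0, len(batch), MICRO_BATCH_SIZE)
    ]
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Tokenization of later micro batches runs in the pool while inference on
        # earlier micro batches proceeds on the main thread.
        token_futures = [pool.submit(tokenize, mb) for mb in micro_batches]
        for future in token_futures:
            results.extend(infer(future.result()))
    return results


print(micro_batched_inference([f"prompt-{i}" for i in range(BATCH_SIZE)]))
```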
-This example also demonstrates the utilization of neuronx cache to store inf2 model compilation artifacts using the `NEURONX_CACHE` and `NEURONX_DUMP_TO` environment variables in the custom handler.
-When the model is loaded for the first time, the model is compiled for the configured micro batch size and the compilation artifacts are saved to the neuronx cache.
-On subsequent model load, the compilation artifacts in the neuronx cache serves as `Ahead of Time(AOT)` compilation artifacts and significantly reduces the model load time.
+This example also demonstrates the utilization of [Neuron Persistent Cache](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html) for inf2 model compilation artifacts using the `NEURONX_CACHE` and `NEURONX_DUMP_TO` environment variables.
+When the model is loaded for the first time, the model is compiled for the configured micro batch size and the compilation artifacts are saved to the Neuron persistent cache.
+On subsequent model loads, the compilation artifacts in the Neuron persistent cache serve as `Ahead of Time (AOT)` compilation artifacts and significantly reduce the model load time.
For convenience, the compiled model artifacts for this example are made available in the TorchServe model zoo: `s3://torchserve/mar_files/llama-2-13b-neuronx-b4`\
Instructions on how to use the AOT compiled model artifacts are shown below.
@@ -78,7 +78,7 @@ huggingface-cli login
Run the `inf2_save_split_checkpoints.py` script
```bash
-python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
+python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split' generate_neuron_cache --neuron_cache_dir './neuron_cache' --batch_size 4 --amp 'bf16' --tp_degree 6
```
Comment: nit: can you link to some official docs describing what the tp degree means?

Reply: This section of the Neuron documentation has a description of what `tp_degree` means.
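As background on that parameter, here is a small illustrative Python sketch, not part of the PR and based on the Neuron documentation referenced in the reply above: `tp_degree` is the tensor-parallelism degree, i.e. the number of NeuronCores the model weights are sharded across, and it cannot exceed the NeuronCores available on the instance. The instance figures below are assumptions about inf2.24xlarge.

```python
# tp_degree relative to available NeuronCores (illustrative numbers).
TP_DEGREE = 6               # value passed via --tp_degree in the command above
NEURON_CORES_PER_CHIP = 2   # each Inferentia2 chip exposes two NeuronCore-v2 cores
INFERENTIA2_CHIPS = 6       # assumed: an inf2.24xlarge carries 6 Inferentia2 chips

available_cores = NEURON_CORES_PER_CHIP * INFERENTIA2_CHIPS
assert TP_DEGREE <= available_cores, "tp_degree cannot exceed the available NeuronCores"
print(f"Sharding model weights across {TP_DEGREE} of {available_cores} NeuronCores")
```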
@@ -87,6 +87,7 @@ python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13
```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
mv llama-2-13b-split llama-2-13b
+mv neuron_cache llama-2-13b
```
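For clarity, a hypothetical sanity check (not part of the example) of the layout those `mv` commands produce: both the split checkpoint and the neuron cache are expected to end up inside the `llama-2-13b` directory created by `torch-model-archiver` with `--archive-format no-archive`.

```python
import os

# Expected directory layout after the archiver and mv steps above.
for path in ("llama-2-13b/llama-2-13b-split", "llama-2-13b/neuron_cache"):
    print(path, "found" if os.path.isdir(path) else "missing")
```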
### Step 5: Add the model artifacts to model store
`inf2_save_split_checkpoints.py`:
@@ -4,6 +4,7 @@
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.models.opt import OPTForCausalLM
+from transformers_neuronx.llama.model import LlamaForSampling
from transformers_neuronx.module import save_pretrained_split

os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"
@@ -40,6 +41,26 @@ def opt_amp_callback(model: OPTForCausalLM, dtype: torch.dtype) -> None:
    default="./model-splits",
    help="Output directory for downloaded model files",
)
+subparsers = parser.add_subparsers(dest="subparser")
+parser_neuron_cache = subparsers.add_parser("generate_neuron_cache")
Comment: these don't feel like they should be required?

Reply: I've included the
+parser_neuron_cache.add_argument(
+    "--neuron_cache_dir",
+    type=str,
+    required=True,
+    help="Target directory to store neuronx-cc compiled model",
+)
+parser_neuron_cache.add_argument(
+    "--batch_size", type=int, required=True, help="Batch size for the compiled model"
+)
+parser_neuron_cache.add_argument(
+    "--amp", type=str, required=True, help="Automatic mixed precision"
+)
+parser_neuron_cache.add_argument(
+    "--tp_degree",
+    type=int,
+    required=True,
+    help="Tensor parallelism degree for the compiled model",
+)
args = parser.parse_args()

save_path = create_directory_if_not_exists(args.save_path)
@@ -62,3 +83,26 @@ def opt_amp_callback(model: OPTForCausalLM, dtype: torch.dtype) -> None:
tokenizer.save_pretrained(args.save_path)

print(f"Files for '{args.model_name}' have been downloaded to '{args.save_path}'.")

+if args.subparser == "generate_neuron_cache":
+    os.environ["NEURONX_CACHE"] = "on"
+    os.environ["NEURONX_DUMP_TO"] = create_directory_if_not_exists(
+        args.neuron_cache_dir
+    )
Comment on lines +88 to +91:

Comment: According to Mike's update, NEURON_COMPILE_CACHE_URL is the official setting. Check https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html?highlight=NEURON_COMPILE_CACHE_URL#neuron-persistent-cache

Reply: I tested this and it seems to work with the latest SDK versions

Reply: Also tested loading. So, will retain the example to use SDK

Reply: The root cause of this error is a bug with the 2.14 compiler. "NEURON_COMPILE_CACHE_URL" is still recommended by the inf2 team since the old way generates too much debug log. Please keep this PR open until the bug fix is verified with the Neuron SDK 2.15 compiler.
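For reference, a minimal sketch of the alternative raised in the thread above, not what this PR implements: newer Neuron SDK releases configure the persistent cache location through the `NEURON_COMPILE_CACHE_URL` variable described in the Neuron caching docs linked above. The cache path below is illustrative.

```python
import os

# Point the Neuron persistent cache at a local directory (illustrative path).
os.environ["NEURON_COMPILE_CACHE_URL"] = "./neuron_cache"
# With this set, neuronx-cc writes compiled artifacts to and reads them from ./neuron_cache,
# replacing the NEURONX_CACHE / NEURONX_DUMP_TO pair used in this example.
```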
+    os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"
+
+    if hf_model_config.model_type == "llama":
+        model = LlamaForSampling.from_pretrained(
+            args.save_path,
+            batch_size=args.batch_size,
+            amp=args.amp,
+            tp_degree=args.tp_degree,
+        )
+    else:
+        raise RuntimeError(
+            f"Neuron cache generation for model {args.model_name} not supported"
+        )
+
+    print(f"Compiling '{args.model_name}'")
+    model.to_neuron()
+    print(f"Neuron cache for '{args.model_name}' saved to {args.neuron_cache_dir}")
Comment: Some docs feel missing, so presumably Neuron is a JIT compiler and you're warming up a cache? Or is it an AOT compiler and you are saving the compiled artifacts, in which case it's not really a cache but a serialized compiled model?

Reply: Here we actually have both. The `neuronx-cc` JIT cache is used if the compilation artifacts are present there, and if not, the Neuron persistent cache is checked to see if the compiled artifacts are present. The contents of the Neuron persistent cache are what we are generating here to speed up the first model load. More documentation is available here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html Will include these details in the Readme as well.
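To round out that explanation, here is a rough sketch (not part of the PR) of the consumer side of the persistent cache, reusing the same calls and values shown in this example (`batch_size 4`, `amp 'bf16'`, `tp_degree 6`, checkpoint at `./llama-2-13b-split`). With the cache generated by `inf2_save_split_checkpoints.py` in place, `to_neuron()` should pick up the precompiled artifacts instead of recompiling, which is where the AOT load-time saving comes from. It requires an Inf2 host with the Neuron SDK installed.

```python
import os
import time

from transformers_neuronx.llama.model import LlamaForSampling

os.environ["NEURONX_CACHE"] = "on"
os.environ["NEURONX_DUMP_TO"] = "./neuron_cache"  # cache produced ahead of time

start = time.time()
model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split", batch_size=4, amp="bf16", tp_degree=6
)
model.to_neuron()  # loads cached compilation artifacts when present
print(f"Model load took {time.time() - start:.1f}s")
```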