[WIP] Enable generation of AOT compiled artifacts for llama2 on inf2 example #2733

Closed

examples/large_models/inferentia2/llama2/Readme.md (9 changes: 5 additions & 4 deletions)
@@ -10,9 +10,9 @@ The batch size and micro batch size configurations are present in [model-config.
The batch size is chosen to be a relatively large value, say 16, since micro batching enables running the preprocess (tokenization) and inference steps in parallel on the micro batches. The micro batch size is the batch size used for the Inf2 model compilation.
Since the compilation batch size can influence compile time and is also constrained by the Inf2 instance type, it is chosen to be a relatively small value, say 4.

- This example also demonstrates the utilization of neuronx cache to store inf2 model compilation artifacts using the `NEURONX_CACHE` and `NEURONX_DUMP_TO` environment variables in the custom handler.
- When the model is loaded for the first time, the model is compiled for the configured micro batch size and the compilation artifacts are saved to the neuronx cache.
- On subsequent model load, the compilation artifacts in the neuronx cache serves as `Ahead of Time(AOT)` compilation artifacts and significantly reduces the model load time.
+ This example also demonstrates the utilization of [Neuron Persistent Cache](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html) for inf2 model compilation artifacts using the `NEURONX_CACHE` and `NEURONX_DUMP_TO` environment variables.
+ When the model is loaded for the first time, the model is compiled for the configured micro batch size and the compilation artifacts are saved to the Neuron persistent cache.
+ On subsequent model loads, the compilation artifacts in the Neuron persistent cache serve as `Ahead of Time (AOT)` compilation artifacts and significantly reduce the model load time.
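
As a rough illustration of the mechanism described above, here is a minimal sketch of how a custom handler's load step could wire up the persistent cache. This is not code from this PR; the directory names and compilation parameters are assumptions taken from the steps below, and the actual `inf2_handler.py` may differ.

```python
import os

from transformers_neuronx.llama.model import LlamaForSampling

# Assumed layout: the model directory contains the split checkpoint in
# "llama-2-13b-split" and the pre-compiled artifacts in "neuron_cache".
model_dir = "."  # in a real handler this would come from the TorchServe context
os.environ["NEURONX_CACHE"] = "on"
os.environ["NEURONX_DUMP_TO"] = os.path.join(model_dir, "neuron_cache")
os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"

# Compile for the configured micro batch size; if matching artifacts are
# already present in the persistent cache, recompilation is skipped and the
# load completes much faster.
model = LlamaForSampling.from_pretrained(
    os.path.join(model_dir, "llama-2-13b-split"),
    batch_size=4,  # micro batch size used for compilation
    amp="bf16",
    tp_degree=6,
)
model.to_neuron()
```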
For convenience, the compiled model artifacts for this example are made available on the TorchServe model zoo: `s3://torchserve/mar_files/llama-2-13b-neuronx-b4`\
Instructions on how to use the AOT compiled model artifacts are shown below.

@@ -78,7 +78,7 @@ huggingface-cli login

Run the `inf2_save_split_checkpoints.py` script
```bash
- python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
+ python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split' generate_neuron_cache --neuron_cache_dir './neuron_cache' --batch_size 4 --amp 'bf16' --tp_degree 6
```

**Member:**
Some docs feel missing, so presumably Neuron is a JIT compiler and you're warming up a cache? Or is it an AOT compiler and you are saving the compiled artifacts, in which case it's not really a cache but a serialized compiled model?

**Collaborator Author:**
Here we actually have both. The neuronx-cc JIT cache is used if the compilation artifacts are present there; if not, the Neuron persistent cache is checked to see if the compiled artifacts are present. The contents of the Neuron persistent cache are what we are generating here, to speed up the first model load. More documentation is available here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html
Will include these details in the Readme as well.

**Member:**
nit: can you link to some official docs describing what the tp degree means?


@@ -87,6 +87,7 @@ python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13
```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
mv llama-2-13b-split llama-2-13b
mv neuron_cache llama-2-13b
```
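
As a quick sanity check (not part of the example), the resulting no-archive model directory would be expected to look roughly as follows; the `MAR-INF/MANIFEST.json` entry is the usual `torch-model-archiver` output and the other entries follow the `mv` commands above, so treat this as an assumption rather than a definitive listing.

```python
from pathlib import Path

# Expected contents of the no-archive model directory after the steps above.
model_dir = Path("llama-2-13b")
expected = [
    model_dir / "MAR-INF" / "MANIFEST.json",  # written by torch-model-archiver
    model_dir / "llama-2-13b-split",          # split checkpoint from the previous step
    model_dir / "neuron_cache",               # AOT compilation artifacts
]
for path in expected:
    print(f"{path}: {'present' if path.exists() else 'missing'}")
```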

### Step 5: Add the model artifacts to model store
examples/large_models/inferentia2/util/inf2_save_split_checkpoints.py

@@ -4,6 +4,7 @@
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.models.opt import OPTForCausalLM
from transformers_neuronx.llama.model import LlamaForSampling
from transformers_neuronx.module import save_pretrained_split

os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"
@@ -40,6 +41,26 @@ def opt_amp_callback(model: OPTForCausalLM, dtype: torch.dtype) -> None:
default="./model-splits",
help="Output directory for downloaded model files",
)
subparsers = parser.add_subparsers(dest="subparser")
parser_neuron_cache = subparsers.add_parser("generate_neuron_cache")
**Member:**
These don't feel like they should be required?

**Collaborator Author (@namannandan), Oct 25, 2023:**
I've included `generate_neuron_cache` as an optional subcommand of `inf2_save_split_checkpoints.py`, so that the following arguments such as `--neuron_cache_dir`, `--batch_size`, etc. are required only if `generate_neuron_cache` is specified. So, if we require only the model checkpoint, we can just run `python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'` as before.
parser_neuron_cache.add_argument(
    "--neuron_cache_dir",
    type=str,
    required=True,
    help="Target directory to store neuronx-cc compiled model",
)
parser_neuron_cache.add_argument(
    "--batch_size", type=int, required=True, help="Batch size for the compiled model"
)
parser_neuron_cache.add_argument(
    "--amp", type=str, required=True, help="Automatic mixed precision"
)
parser_neuron_cache.add_argument(
    "--tp_degree",
    type=int,
    required=True,
    help="Tensor parallelism degree for the compiled model",
)
args = parser.parse_args()

save_path = create_directory_if_not_exists(args.save_path)
@@ -62,3 +83,26 @@ def opt_amp_callback(model: OPTForCausalLM, dtype: torch.dtype) -> None:
tokenizer.save_pretrained(args.save_path)

print(f"Files for '{args.model_name}' have been downloaded to '{args.save_path}'.")

if args.subparser == "generate_neuron_cache":
    os.environ["NEURONX_CACHE"] = "on"
    os.environ["NEURONX_DUMP_TO"] = create_directory_if_not_exists(
        args.neuron_cache_dir
    )
Comment on lines +88 to +91

**Collaborator Author:**
I tested this and it seems to work with the latest SDK versions (2.14.*) but not with prior versions (2.12.* and lower). Since this example is based on Neuron SDK version 2.12, I believe we can retain `NEURONX_DUMP_TO`.

**Collaborator Author (@namannandan), Oct 25, 2023:**
Also tested loading llama2 using `LlamaForSampling` with Neuron SDK 2.14, and it fails with the following error:
`RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/cc7c41fb-c8bf-4adf-b91f-6bb8bf5d7c04/model.MODULE_14cd248f5b2ed6a11af6+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/cc7c41fb-c8bf-4adf-b91f-6bb8bf5d7c04/model.MODULE_14cd248f5b2ed6a11af6+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-10-25T17:36:47Z Too many instructions after unroll for function sg0000 !`

So, will retain the example to use SDK 2.12 for now.

**Collaborator (@lxning), Oct 26, 2023:**
The root cause of this error is a bug in the 2.14 compiler. `NEURON_COMPILE_CACHE_URL` is still recommended by the inf2 team, since the old way generates too much debug log. Please keep this PR open until the bug fix is verified with the Neuron SDK 2.15 compiler.

os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"

if hf_model_config.model_type == "llama":
model = LlamaForSampling.from_pretrained(
args.save_path,
batch_size=args.batch_size,
amp=args.amp,
tp_degree=args.tp_degree,
)
else:
raise RuntimeError(
f"Neuron cache generation for model {args.model_name} not supported"
)

print(f"Compiling '{args.model_name}'")
model.to_neuron()
print(f"Neuron cache for '{args.model_name}' saved to {args.neuron_cache_dir}")