[WIP] Enable generation of AOT compiled artifacts for llama2 on inf2 example #2733
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2733 +/- ##
=======================================
Coverage 72.44% 72.44%
=======================================
Files 85 85
Lines 3963 3963
Branches 58 58
=======================================
Hits 2871 2871
Misses 1088 1088
Partials 4 4
a few minor nits
@@ -78,7 +78,7 @@ huggingface-cli login

Run the `inf2_save_split_checkpoints.py` script
```bash
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
```
Some docs feel missing, so presumably neuron is a JIT compiler and you're warming up a cache? Or is it an AOT compiler and you are saving the compiled artifacts, in which case it's not really a cache but a serialized compiled model?
Here we actually have both. The `neuronx-cc` JIT cache is used if the compilation artifacts are present there; if not, the Neuron persistent cache is checked for the compiled artifacts. The contents of the Neuron persistent cache are what we are generating here, to speed up the first model load. More documentation is available here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html
Will include these details in the README as well.
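For illustration, a minimal sketch of how an inference process could point the persistent cache at the pre-generated directory before loading the model, so that the first load finds the compiled artifacts instead of invoking `neuronx-cc`. The path is the one from the example command, and this simply mirrors the environment variables set in the script rather than an official API:

```python
import os

# Must be set before any Neuron model is loaded. With a warm cache the
# first model load reuses the pre-compiled artifacts instead of running
# the neuronx-cc compiler.
os.environ["NEURONX_CACHE"] = "on"
os.environ["NEURONX_DUMP_TO"] = "./neuron_cache"  # directory produced by generate_neuron_cache
```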
@@ -78,7 +78,7 @@ huggingface-cli login

Run the `inf2_save_split_checkpoints.py` script
```bash
-python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
+python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split' generate_neuron_cache --neuron_cache_dir './neuron_cache' --batch_size 4 --amp 'bf16' --tp_degree 6
```
nit: can you link to some official docs describing what the tp degree means?
This section of the Neuron documentation has a description of what `tp_degree` means (it is the tensor-parallelism degree, i.e. the number of NeuronCores the model weights are sharded across): https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html?highlight=tp_degree#model-trace
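For context, a minimal sketch of where `tp_degree` enters this example, assuming the `transformers_neuronx` `LlamaForSampling` API used elsewhere in the handler; parameter values are taken from the command above, so treat this as illustrative rather than the exact handler code:

```python
from transformers_neuronx.llama.model import LlamaForSampling

# tp_degree is the tensor-parallelism degree: the number of NeuronCores the
# model weights are sharded across. The compiled artifacts are specific to
# this (batch_size, amp, tp_degree) combination.
model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split",  # split checkpoint produced by inf2_save_split_checkpoints.py
    batch_size=4,
    amp="bf16",
    tp_degree=6,
)
model.to_neuron()  # triggers neuronx-cc compilation, or a cache hit if artifacts already exist
```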
@@ -40,6 +41,12 @@ def opt_amp_callback(model: OPTForCausalLM, dtype: torch.dtype) -> None:
    default="./model-splits",
    help="Output directory for downloaded model files",
)
subparsers = parser.add_subparsers(dest="subparser")
parser_neuron_cache = subparsers.add_parser("generate_neuron_cache")
these don't feel like they should be required?
I've included `generate_neuron_cache` as an optional step when running `inf2_save_split_checkpoints`, so that the following arguments, such as `neuron_cache_dir`, `batch_size`, etc., are required only if `generate_neuron_cache` is specified. So, if we need only the model checkpoint, we can still run `python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'` as before.
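A self-contained sketch of the argparse subcommand pattern this relies on, reduced to a couple of the arguments for illustration (not the full script):

```python
import argparse

# generate_neuron_cache is an optional subcommand; its arguments are only
# required when the subcommand itself is specified.
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str, required=True)
parser.add_argument("--save_path", type=str, default="./model-splits")

subparsers = parser.add_subparsers(dest="subparser")
cache_parser = subparsers.add_parser("generate_neuron_cache")
cache_parser.add_argument("--neuron_cache_dir", type=str, required=True)
cache_parser.add_argument("--tp_degree", type=int, required=True)

# Works without the subcommand, as before:
args = parser.parse_args(["--model_name", "meta-llama/Llama-2-13b-hf"])
assert args.subparser is None

# With the subcommand, --neuron_cache_dir and --tp_degree become required:
args = parser.parse_args(
    ["--model_name", "meta-llama/Llama-2-13b-hf",
     "generate_neuron_cache", "--neuron_cache_dir", "./neuron_cache",
     "--tp_degree", "6"]
)
assert args.subparser == "generate_neuron_cache"
```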
parser_neuron_cache.add_argument("--neuron_cache_dir", type=str, required=True) | ||
parser_neuron_cache.add_argument("--batch_size", type=int, required=True) | ||
parser_neuron_cache.add_argument("--amp", type=str, required=True) | ||
parser_neuron_cache.add_argument("--tp_degree", type=int, required=True) |
`tp_degree` and `neuron_cache_dir` could use some help statements as well
os.environ["NEURONX_CACHE"] = "on" | ||
os.environ["NEURONX_DUMP_TO"] = create_directory_if_not_exists( | ||
args.neuron_cache_dir | ||
) |
According to Mike's update, `NEURON_COMPILE_CACHE_URL` is the official setting; see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html?highlight=NEURON_COMPILE_CACHE_URL#neuron-persistent-cache
I tested this and it seems to work with the latest SDK versions `2.14.*` but not with prior versions `2.12.*` and lower. Since this example is based on Neuron SDK version `2.12`, I believe we can retain `NEURONX_DUMP_TO`.
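If the example is later moved to a newer SDK, both variants could be covered with something like the following sketch; the `configure_neuron_cache` helper and the version check are hypothetical, for illustration only:

```python
import os


def configure_neuron_cache(cache_dir: str, sdk_version: tuple) -> None:
    """Hypothetical helper: point the Neuron persistent cache at cache_dir.

    On SDK 2.14+ the documented NEURON_COMPILE_CACHE_URL variable is used;
    on 2.12 and earlier this falls back to NEURONX_CACHE / NEURONX_DUMP_TO
    as in this PR.
    """
    os.makedirs(cache_dir, exist_ok=True)
    if sdk_version >= (2, 14):
        os.environ["NEURON_COMPILE_CACHE_URL"] = cache_dir
    else:
        os.environ["NEURONX_CACHE"] = "on"
        os.environ["NEURONX_DUMP_TO"] = cache_dir


configure_neuron_cache("./neuron_cache", sdk_version=(2, 12))
```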
Also tested loading `llama2` using `LlamaForSampling` with Neuron SDK `2.14`, and it fails with the following error:
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/cc7c41fb-c8bf-4adf-b91f-6bb8bf5d7c04/model.MODULE_14cd248f5b2ed6a11af6+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/cc7c41fb-c8bf-4adf-b91f-6bb8bf5d7c04/model.MODULE_14cd248f5b2ed6a11af6+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-10-25T17:36:47Z Too many instructions after unroll for function sg0000 !
So, will retain the example to use SDK `2.12` for now.
The root cause of this error is a bug in the 2.14 compiler. `NEURON_COMPILE_CACHE_URL` is still recommended by the inf2 team, since the old way generates too much debug log. Please keep this PR open until the bug fix is verified with the Neuron SDK 2.15 compiler.
This PR will replace the inf2 streamer. Closing this PR.
Description
Add support for generation of model compiled artifacts ahead of time for the llama2 on inf2 example.
Type of change
Feature/Issue validation/testing