
[WIP] Enable generation of AOT compiled artifacts for llama2 on inf2 example #2733

Closed

Conversation

@namannandan namannandan (Collaborator) commented Oct 23, 2023

Description

Add support for ahead-of-time (AOT) generation of compiled model artifacts for the llama2 on inf2 example.

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

```
$ torchserve --ncs --start --model-store model_store --ts-config config.properties

$ curl -X POST "http://localhost:8081/models?url=llama-2-13b"
{
  "status": "Model \"llama-2-13b\" Version: 1.0 registered with 1 initial workers"
}

$ python test_stream_response.py
Today the weather is really nice and I am planning on going to the beach. I am going to take my camera and take some pictures. I am going to take pictures of the beach and the ocean. I am also going to take pictures of the people who are at the beach. I am going to take pictures of the people who are swimming in the ocean. I am going to take pictures of the people who are sunbathing on the beach. I am going to take pictures of the people who are playing in the sand. I am
```

codecov bot commented Oct 23, 2023

Codecov Report

Merging #2733 (1f771f3) into master (45d1bed) will not change coverage.
The diff coverage is n/a.

❗ Current head 1f771f3 differs from pull request most recent head aeae4f6. Consider uploading reports for the commit aeae4f6 to get more accurate results

```
@@           Coverage Diff           @@
##           master    #2733   +/-   ##
=======================================
  Coverage   72.44%   72.44%           
=======================================
  Files          85       85           
  Lines        3963     3963           
  Branches       58       58           
=======================================
  Hits         2871     2871           
  Misses       1088     1088           
  Partials        4        4           
```


@namannandan namannandan changed the title Enable generation of ahead of time compiled artifacts for llama2 on inf2 example Enable generation of AOT compiled artifacts for llama2 on inf2 example Oct 23, 2023
@namannandan namannandan marked this pull request as ready for review October 23, 2023 22:14
@msaroufim msaroufim (Member) left a comment

a few minor nits

@@ -78,7 +78,7 @@ huggingface-cli login

Run the `inf2_save_split_checkpoints.py` script
```bash
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
```

Member

Some docs feel missing, so presumably neuron is a JIT compiler and you're warming up a cache? Or is it an AOT compiler and you are saving the compiled artifacts, in which case it's not really a cache but a serialized compiled model?

Collaborator Author

Here we actually have both. The neuronx-cc JIT cache is used if the compilation artifacts are present there; if not, the Neuron persistent cache is checked for the compiled artifacts. The contents of the Neuron persistent cache are what we generate here, to speed up the first model load. More documentation is available here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html
I will include these details in the README as well.
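For reference, a minimal sketch of what the cache-generation step boils down to, assuming the `NEURONX_CACHE`/`NEURONX_DUMP_TO` environment variables used in this PR (Neuron SDK 2.12); the actual model load/trace call that triggers neuronx-cc is elided since it is SDK-specific:

```python
import os


def generate_neuron_cache(neuron_cache_dir: str) -> None:
    """Point the Neuron persistent cache at a known directory before compiling."""
    os.makedirs(neuron_cache_dir, exist_ok=True)
    # Enable caching and dump the compiled NEFF artifacts into neuron_cache_dir
    # so they can be packaged alongside the model and reused on first load.
    os.environ["NEURONX_CACHE"] = "on"
    os.environ["NEURONX_DUMP_TO"] = neuron_cache_dir
    # ... load and trace the model here; compilation results land in
    # neuron_cache_dir instead of only the on-host JIT cache.
```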

@@ -78,7 +78,7 @@ huggingface-cli login

Run the `inf2_save_split_checkpoints.py` script
```bash
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split' generate_neuron_cache --neuron_cache_dir './neuron_cache' --batch_size 4 --amp 'bf16' --tp_degree 6
```

Member

nit: can you link to some official docs describing what the tp degree means?


@@ -40,6 +41,12 @@ def opt_amp_callback(model: OPTForCausalLM, dtype: torch.dtype) -> None:
default="./model-splits",
help="Output directory for downloaded model files",
)
subparsers = parser.add_subparsers(dest="subparser")
parser_neuron_cache = subparsers.add_parser("generate_neuron_cache")
Member

these don't feel like they should be required?

@namannandan namannandan (Collaborator Author) commented Oct 25, 2023

I've made `generate_neuron_cache` an optional sub-command of `inf2_save_split_checkpoints.py`, so the following arguments such as `neuron_cache_dir`, `batch_size`, etc. are required only if `generate_neuron_cache` is specified. If we only need the model checkpoint, we can still run `python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'` as before. A minimal sketch of this argparse layout follows.
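An illustrative sketch (not the exact PR code) of the sub-command arrangement described above; the flag names are taken from the diff, everything else is assumed:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str, required=True)
parser.add_argument(
    "--save_path",
    type=str,
    default="./model-splits",
    help="Output directory for downloaded model files",
)

# generate_neuron_cache is an optional sub-command; its flags are parsed
# (and required) only when the sub-command appears on the command line.
subparsers = parser.add_subparsers(dest="subparser")
parser_neuron_cache = subparsers.add_parser("generate_neuron_cache")
parser_neuron_cache.add_argument("--neuron_cache_dir", type=str, required=True)
parser_neuron_cache.add_argument("--batch_size", type=int, required=True)
parser_neuron_cache.add_argument("--amp", type=str, required=True)
parser_neuron_cache.add_argument("--tp_degree", type=int, required=True)

args = parser.parse_args()
if args.subparser == "generate_neuron_cache":
    ...  # compile the model and populate the persistent cache
```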

parser_neuron_cache.add_argument("--neuron_cache_dir", type=str, required=True)
parser_neuron_cache.add_argument("--batch_size", type=int, required=True)
parser_neuron_cache.add_argument("--amp", type=str, required=True)
parser_neuron_cache.add_argument("--tp_degree", type=int, required=True)
Member

tp degree and neuron cache dir could use some help statements as well

Comment on lines +74 to +77
os.environ["NEURONX_CACHE"] = "on"
os.environ["NEURONX_DUMP_TO"] = create_directory_if_not_exists(
args.neuron_cache_dir
)

Collaborator Author

I tested this and it seems to work with the latest SDK versions (2.14.*) but not with prior versions (2.12.* and lower). Since this example is based on Neuron SDK version 2.12, I believe we can retain NEURONX_DUMP_TO.

@namannandan namannandan (Collaborator Author) commented Oct 25, 2023

I also tested loading llama2 using LlamaForSampling with Neuron SDK 2.14, and it fails with the following error:
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/cc7c41fb-c8bf-4adf-b91f-6bb8bf5d7c04/model.MODULE_14cd248f5b2ed6a11af6+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/cc7c41fb-c8bf-4adf-b91f-6bb8bf5d7c04/model.MODULE_14cd248f5b2ed6a11af6+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-10-25T17:36:47Z Too many instructions after unroll for function sg0000 !

So, I will keep the example on SDK 2.12 for now.

@lxning lxning (Collaborator) commented Oct 26, 2023

The root cause of this error is a bug in the 2.14 compiler. "NEURON_COMPILE_CACHE_URL" is still recommended by the inf2 team, since the old way generates too much debug log. Please keep this PR open until the bug fix is verified with the Neuron SDK 2.15 compiler.
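For clarity, a hedged sketch of the two cache configurations discussed in this thread; the environment variable names are taken from the diff and the comment above, and their exact behavior depends on the Neuron SDK release (2.12 vs 2.14+):

```python
import os

use_legacy_dump = True  # the Neuron SDK 2.12.x path this example currently uses

if use_legacy_dump:
    # Older SDKs: dump compiled artifacts directly into a directory
    # (works on 2.12.x but generates a lot of debug output).
    os.environ["NEURONX_CACHE"] = "on"
    os.environ["NEURONX_DUMP_TO"] = "./neuron_cache"
else:
    # Newer SDKs: point the persistent compile cache at a directory,
    # as recommended by the inf2 team above.
    os.environ["NEURON_COMPILE_CACHE_URL"] = "./neuron_cache"
```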

@lxning lxning (Collaborator) left a comment

The request is to remove this inf2_save_split_checkpoints tool; please refer to #2803. I moved all of the existing inf2 llama2 example into the streamer folder in #2803.
Please update the code in the streamer folder.

@namannandan namannandan changed the title Enable generation of AOT compiled artifacts for llama2 on inf2 example [WIP] Enable generation of AOT compiled artifacts for llama2 on inf2 example Feb 7, 2024
@lxning lxning added the documentation, example, python, optimization, and llm labels Mar 11, 2024
@lxning lxning added this to the v0.10.1 milestone Mar 11, 2024
@msaroufim msaroufim requested review from msaroufim and removed request for msaroufim March 18, 2024 02:49
@lxning lxning (Collaborator) commented Mar 27, 2024

This PR will replace inf2 streamer. Closing this PR.

@lxning lxning closed this Mar 27, 2024