
Example for Llama2 on Inf2 #2458

Merged
merged 27 commits into pytorch:master from naman-inf2-example-refactor on Sep 19, 2023

Conversation

@namannandan namannandan (Collaborator) commented Jul 12, 2023

Description

This PR adds an example that details the steps to compile and run the Llama2 model on Inferentia2 for text completion with micro batching and response streaming support.

Model: https://huggingface.co/meta-llama/Llama-2-13b-hf
Instance type: inf2.24xlarge

Type of change

  • New feature (non-breaking change which adds functionality)

Test

$ torchserve --ncs --start --model-store model_store
$ curl -X POST "http://localhost:8081/models?url=llama-2-13b" 
$ python test_stream_response.py 
Today the weather is really nice and I am planning on going to the beach. I am going to take my camera and take some pictures. I am going to take pictures of the beach and the ocean. I am going to 
take pictures of the people and the animals. I am going to take pictures of the sun and the sky. I am going to take pictures of the sand and the water. I am going to take pictures of the waves and 
the birds. I am going to take pictures of the shells and the rocks. I am going to
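
For reference, a minimal sketch of what a streaming client such as test_stream_response.py might look like is shown below. This is an illustration under assumptions, not the script from the PR: the inference port (8080), model name (llama-2-13b), prompt text, and use of the requests library are placeholders.

import requests

# Hypothetical streaming client: POST a prompt to the TorchServe inference API
# and print response chunks as they arrive from the streaming handler.
response = requests.post(
    "http://localhost:8080/predictions/llama-2-13b",
    data="Today the weather is really nice and I am planning on",
    stream=True,
)
for chunk in response.iter_content(chunk_size=None):
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
print()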

@codecov codecov bot commented Jul 12, 2023

Codecov Report

Merging #2458 (bbcdaf2) into master (80b1679) will decrease coverage by 0.59%.
The diff coverage is 12.12%.

❗ Current head bbcdaf2 differs from pull request most recent head cfaf385. Consider uploading reports for the commit cfaf385 to get more accurate results

@@            Coverage Diff             @@
##           master    #2458      +/-   ##
==========================================
- Coverage   70.87%   70.29%   -0.59%     
==========================================
  Files          83       84       +1     
  Lines        3839     3871      +32     
  Branches       58       58              
==========================================
  Hits         2721     2721              
- Misses       1114     1146      +32     
  Partials        4        4              
Files Changed Coverage Δ
ts/handler_utils/hf_batch_streamer.py 0.00% <0.00%> (ø)
ts/handler_utils/micro_batching.py 90.29% <80.00%> (-0.62%) ⬇️

... and 1 file with indirect coverage changes


@namannandan namannandan marked this pull request as ready for review July 14, 2023 18:19
@namannandan namannandan changed the title Llama on Inf2 example Example for Llama on Inf2 Jul 14, 2023
@lxning lxning (Collaborator) left a comment

save_split_checkpoints.py is a common utility that other examples can reuse. Please move it to large_models/util/ and rename it to inf2_save_split_checkpoints.py.
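
For context, a rough sketch of what such a checkpoint-splitting utility typically does with transformers-neuronx; the model name and output directory are placeholders, and this shows the commonly documented save_pretrained_split pattern rather than the exact contents of the script in this PR.

from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split

# Load the Hugging Face checkpoint on CPU without materializing duplicate copies.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf", low_cpu_mem_usage=True
)

# Save the weights as per-layer shards so transformers-neuronx can load them
# efficiently when compiling for Inferentia2.
save_pretrained_split(model, "./llama-2-13b-split")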

@namannandan namannandan force-pushed the naman-inf2-example-refactor branch 2 times, most recently from 0dd2d87 to b9f2654 Compare July 27, 2023 21:42
@namannandan namannandan changed the title Example for Llama on Inf2 Example for Llama2 on Inf2 Jul 27, 2023
@namannandan namannandan force-pushed the naman-inf2-example-refactor branch 2 times, most recently from 631b253 to 5e8b713 Compare July 28, 2023 17:46
@namannandan namannandan requested a review from lxning July 28, 2023 17:57
@namannandan namannandan requested a review from lxning August 1, 2023 06:19
@namannandan namannandan changed the title Example for Llama2 on Inf2 [WIP] Example for Llama2 on Inf2 Aug 24, 2023
@lxning lxning (Collaborator) commented Sep 5, 2023

Please add the AOT precompile feature and store the compiled model in a cache in this example:

  • update the README
  • replace inf2_handler.py with the new one

@namannandan namannandan changed the title [WIP] Example for Llama2 on Inf2 Example for Llama2 on Inf2 Sep 18, 2023
@namannandan namannandan (Collaborator, Author) commented Sep 18, 2023

Issue to track follow-up tasks on this PR: #2600

ts/handler_utils/hf_batch_streamer.py (resolved)
examples/large_models/inferentia2/llama2/Readme.md (outdated, resolved)
examples/large_models/inferentia2/llama2/Readme.md (outdated, resolved)
tp_degree=tp_degree,
)
logger.info("Starting to compile the model")
self.model.to_neuron()
A collaborator commented on the handler code above:

@namannandan I am wondering if compilation can be done ahead of time so that we just load the compiled graphs here, the way it worked for inf1?

Collaborator reply:
I tested _save_compiled_artifacts. It is able to generate a Neuron model; however, transformers_neuronx still needs to recompile. I have already let the Neuron team know that the experimental _save_compiled_artifacts feature needs more work.
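
For context, a rough sketch of the ahead-of-time compile-and-cache pattern being discussed, assuming transformers-neuronx's LlamaForSampling API; the batch size, tp_degree, and paths are illustrative, and the experimental _save_compiled_artifacts call is the one mentioned above, whose exact signature and behavior may differ across Neuron SDK releases.

from transformers_neuronx.llama.model import LlamaForSampling

# Load the split checkpoint and configure tensor parallelism
# (illustrative values for an inf2.24xlarge with 12 NeuronCores).
model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split", batch_size=1, tp_degree=12, amp="f16"
)

# Compile the model graphs for the NeuronCores; this is the expensive step
# that the handler currently performs at load time via to_neuron().
model.to_neuron()

# Experimental: persist the compiled artifacts so a later worker could try to
# reuse them instead of recompiling (as noted above, reuse did not yet work).
model._save_compiled_artifacts("./neuron_artifacts")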

@namannandan namannandan dismissed stale reviews from mreso and chauhang September 19, 2023 21:55

Addressed review comments. Follow-up tasks tracked here: #2600

@lxning lxning added this pull request to the merge queue Sep 19, 2023
Merged via the queue into pytorch:master with commit d0ae857 Sep 19, 2023
11 of 12 checks passed

7 participants