docs: fine tune llama with trainium (ray-project#48768)
Introduce a new Ray Train example for AWS Trainium. 

![CleanShot 2024-11-16 at 12 48 57@2x](https://github.com/user-attachments/assets/8b7d12d8-846f-497f-ba25-fd8a613f9007)

Marked it as a community example because it is a collaboration with the AWS Neuron team.

![CleanShot 2024-11-16 at 12 48 37@2x](https://github.com/user-attachments/assets/589d8ff3-fcb6-4b90-865d-006bcb4815a3)

Docs screenshots

<img width="1142" alt="Screenshot 2024-11-20 at 11 19 39 AM"
src="https://github.com/user-attachments/assets/aa3dadf7-96b9-46cc-8b6d-44c3e3bc3e1e">
<img width="1161" alt="Screenshot 2024-11-20 at 11 19 47 AM"
src="https://github.com/user-attachments/assets/859508fd-e47e-4758-a4c7-f15a749ece82">
<img width="1149" alt="Screenshot 2024-11-20 at 11 19 54 AM"
src="https://github.com/user-attachments/assets/28858f36-8cca-4eaa-a8ec-a1f7dda899d0">

---------

Signed-off-by: Saihajpreet Singh <c-saihajpreet.singh@anyscale.com>
Co-authored-by: Chris Zhang <chris@anyscale.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
2 people authored and dentiny committed Dec 7, 2024
1 parent ff3d11c commit 978e406
Showing 3 changed files with 115 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/custom_directives.py
@@ -481,6 +481,7 @@ def key(cls: type) -> str:
 class Framework(ExampleEnum):
     """Framework type for example metadata."""

+    AWSNEURON = "AWS Neuron"
     PYTORCH = "PyTorch"
     LIGHTNING = "Lightning"
     TRANSFORMERS = "Transformers"
12 changes: 11 additions & 1 deletion doc/source/train/examples.yml
@@ -119,7 +119,17 @@ examples:
     contributor: community
     link: examples/intel_gaudi/llama_pretrain

-  - title: Fine-tune a Llama-2 text generation models with DeepSpeed and Hugging Face Accelerate
+  - title: Fine-tune Llama3.1 with AWS Trainium
+    frameworks:
+      - pytorch
+      - aws neuron
+    skill_level: advanced
+    use_cases:
+      - natural language processing
+      - large language models
+    contributor: community
+    link: examples/aws-trainium/llama3
+  - title: Fine-tune a Llama-2 text generation model with DeepSpeed and Hugging Face Accelerate
     frameworks:
       - accelerate
       - deepspeed
103 changes: 103 additions & 0 deletions doc/source/train/examples/aws-trainium/llama3.rst
@@ -0,0 +1,103 @@
:orphan:

Distributed fine-tuning of Llama 3.1 8B on AWS Trainium with Ray and PyTorch Lightning
======================================================================================


This example demonstrates how to fine-tune the `Llama 3.1 8B <https://huggingface.co/NousResearch/Meta-Llama-3.1-8B/>`__ model on `AWS
Trainium <https://aws.amazon.com/ai/machine-learning/trainium/>`__ instances using Ray Train, PyTorch Lightning, and the AWS Neuron SDK.

AWS Trainium is the machine learning (ML) chip that AWS purpose-built for deep
learning (DL) training of models with 100B+ parameters. The `AWS Neuron
SDK <https://aws.amazon.com/machine-learning/neuron/>`__ helps
developers train models on Trainium accelerators.
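
At its core, the example hands a training function to Ray Train's ``TorchTrainer``, which schedules it onto NeuronCores. The following is a minimal sketch of that wiring, not the sample's actual code; it assumes Ray with Neuron support and ``torch-xla`` installed on the trn1 nodes, and ``train_func`` and the worker counts are illustrative placeholders:

::

# Minimal sketch, not the sample's exact code.
import torch_xla.core.xla_model as xm

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # Each Ray worker sees its NeuronCores as XLA devices through torch-xla.
    device = xm.xla_device()
    # ... build the Llama 3.1 model, move it to `device`, and run the
    # PyTorch Lightning training loop here ...

trainer = TorchTrainer(
    train_func,
    # Ray exposes Trainium capacity as the "neuron_cores" resource.
    scaling_config=ScalingConfig(
        num_workers=2,
        resources_per_worker={"neuron_cores": 32},  # e.g. one trn1.32xlarge per worker
    ),
)
result = trainer.fit()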

Prepare the environment
-----------------------

See `Setup EKS cluster and tools <https://github.com/aws-neuron/aws-neuron-eks-samples/tree/master/llama3.1_8B_finetune_ray_ptl_neuron#setupeksclusterandtools>`__ for setting up an Amazon EKS cluster with AWS Trainium instances.

Create a Docker image
---------------------
When the EKS cluster is ready, create an Amazon ECR repository, then build and upload the Docker image containing the artifacts for fine-tuning a Llama 3.1 8B model:

1. Clone the repo.

::

git clone https://github.com/aws-neuron/aws-neuron-eks-samples.git

2. Go to the ``llama3.1_8B_finetune_ray_ptl_neuron`` directory.

::

cd aws-neuron-eks-samples/llama3.1_8B_finetune_ray_ptl_neuron

3. Run the build script.

::

chmod +x 0-kuberay-trn1-llama3-finetune-build-image.sh
./0-kuberay-trn1-llama3-finetune-build-image.sh

4. When prompted, enter the AWS Region your cluster runs in, for example: us-east-2.

5. Verify in the AWS console that the Amazon ECR service has the newly
created ``kuberay_trn1_llama3.1_pytorch2`` repository, or check it programmatically as sketched after this list.

6. Update the ECR image ARN in the manifest file used for creating the Ray cluster.

Using the commands below, replace the <AWS_ACCOUNT_ID> and <REGION> placeholders in the ``1-llama3-finetune-trn1-create-raycluster.yaml`` file with actual values so that the manifest references the ECR image created above:

::

export AWS_ACCOUNT_ID=<enter_your_aws_account_id> # for ex: 111222333444
export REGION=<enter_your_aws_region> # for ex: us-east-2
sed -i "s/<AWS_ACCOUNT_ID>/$AWS_ACCOUNT_ID/g" 1-llama3-finetune-trn1-create-raycluster.yaml
sed -i "s/<REGION>/$REGION/g" 1-llama3-finetune-trn1-create-raycluster.yaml

Configure the Ray cluster
-------------------------

The ``llama3.1_8B_finetune_ray_ptl_neuron`` directory in the AWS Neuron samples repository simplifies the
Ray configuration: it provides a KubeRay manifest that you apply
to the cluster to set up the head and worker pods.

Run the following command to set up the Ray cluster:

::

kubectl apply -f 1-llama3-finetune-trn1-create-raycluster.yaml
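
To confirm that the head and worker pods are running, ``kubectl get pods`` is enough; as a Python alternative, this illustrative sketch assumes the ``kubernetes`` client package and a RayCluster named ``kuberay-trn1``, inferred from the head service name used below:

::

# Illustrative pod check; KubeRay labels each pod with its cluster name.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(
    namespace="default",
    label_selector="ray.io/cluster=kuberay-trn1",  # assumed cluster name
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)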


Access the Ray Dashboard
------------------------
Port-forward from the cluster to reach the Ray Dashboard, then view it
at `http://localhost:8265 <http://localhost:8265/>`__.
Run the port-forward in the background with the following command:

::

kubectl port-forward service/kuberay-trn1-head-svc 8265:8265 &
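
The forwarded port also serves the Ray Job API, so a quick Python connectivity check (an illustration, not part of the sample) looks like this:

::

# Quick connectivity check: the dashboard port also serves the Ray Job API.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")
print(client.list_jobs())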

Launch the Ray jobs
-------------------

The Ray cluster is now ready to handle workloads. Initiate the data-preparation and fine-tuning Ray jobs:

1. Launch the Ray job for downloading the dolly-15k dataset and the Llama3.1 8B model artifacts:

::

kubectl apply -f 2-llama3-finetune-trn1-rayjob-create-data.yaml

2. When that job completes successfully, run the fine-tuning job:

::

kubectl apply -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml

3. Monitor the jobs via the Ray Dashboard, or poll their status programmatically as sketched below.
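
An illustrative polling loop over the Ray Job API (the job to watch is taken from ``list_jobs()``; adjust the selection to your submission):

::

# Illustrative monitoring loop; RayJob-submitted jobs appear in the same listing.
import time

from ray.job_submission import JobStatus, JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")
job = client.list_jobs()[0]  # pick the job you want to watch
while True:
    status = client.get_job_status(job.submission_id)
    print(status)
    if status in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
        break
    time.sleep(30)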


For detailed information on each of the steps above, see the `AWS Neuron sample's README <https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/llama3.1_8B_finetune_ray_ptl_neuron/README.md/>`__.
