This GitHub repository demonstrates how to fine-tune large language models (LLMs) in a distributed manner using Valohai. It provides detailed guides and code examples to help you get started with distributed training and showcases how Valohai can be used for efficient model fine-tuning.
Learn how to distribute the fine-tuning of large language models across multiple GPUs or machines for improved training efficiency.
This section walks through the approaches to distributed LLM training included in this repository. Distributed training is crucial for computationally intensive tasks, and the repository provides several methods to achieve it.
Note: To use the distributed training features outlined below, you will need a machine equipped with at least two GPUs. For setup assistance, please contact our support team.
The first approach leverages `torchrun` (Elastic Launch), which extends the capabilities of `torch.distributed.launch`. We use the Hugging Face Transformers Trainer to fine-tune the language model on our dataset. With `torchrun`, you can distribute the training process without making any modifications to your existing code, which makes this method a straightforward introduction to distributed training for LLMs. A minimal sketch of such a script follows the list below.
- Utilizes Transformers Trainer for model fine-tuning.
- No code changes needed for distributed training.
- Easy setup for those new to distributed training.
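As a rough illustration, the sketch below shows a Trainer-based fine-tuning script of the kind `torchrun` can replicate across GPUs without code changes. The model checkpoint, dataset, and hyperparameters are placeholders chosen for illustration, not the ones used by this repository's `train-torchrun` step.

```python
# Minimal sketch: a Trainer-based fine-tuning script that torchrun can launch
# on every GPU unchanged (model, dataset, and settings are placeholders).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize a small public dataset and drop empty rows
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)
dataset = dataset.filter(lambda example: len(example["input_ids"]) > 0)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="/valohai/outputs/checkpoints",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the Trainer picks up the process group that torchrun sets up
```

You would then launch it with something like `torchrun --nproc_per_node=2 train.py`, where the script name and GPU count are assumptions about your setup.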
The second approach employs the Accelerate library to facilitate distributed training. The `train-accelerator.py` script is based on the Hugging Face summarization example that does not use the Trainer class, which gives you complete control over the training loop and more flexibility to customize the training process. Meanwhile, the Accelerate library takes care of the distribution aspects, making it an efficient choice for distributed training. A minimal sketch of such a loop follows the list below.
- Fine-grained control over the training loop.
- Accelerate library handles distribution seamlessly.
- Ideal for custom training approaches.
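For orientation, here is a minimal sketch of a custom training loop driven by Accelerate. It uses a toy model and random tensors rather than the summarization setup in `train-accelerator.py`, just to show where the Accelerator object plugs into the loop.

```python
# Minimal sketch: a hand-written training loop where Accelerate handles
# device placement and gradient synchronization (toy model and data).
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # reads the distributed environment for you

model = torch.nn.Linear(128, 2)  # stand-in for the language model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Accelerate wraps the model, optimizer, and dataloader for the current device
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for epoch in range(3):
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
    accelerator.print(f"epoch {epoch} done")  # prints only on the main process
```

Launching with `accelerate launch train.py` (script name assumed) runs one copy of the loop per GPU, with Accelerate keeping the gradients in sync.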
Note: you need to run `train-task` as a Valohai Task, NOT an Execution.
The third approach, currently under development in this repository, tackles distributed training across multiple machines simultaneously. To achieve this, we employ Valohai's `valohai.distributed` helper, along with `torch.distributed` and `torch.multiprocessing`, to establish communication between the machines during training. While still in development, this approach aims to provide a robust solution for training large language models across a distributed infrastructure. A rough sketch of the coordination logic follows the list below.
- Distributes training across multiple machines.
- Uses Valohai's distributed capabilities.
- Enables efficient scaling for demanding training workloads.
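Since this approach is still in development, the following is only a rough sketch of the multi-machine coordination it relies on: `torch.multiprocessing` spawns one worker per local GPU, and each worker joins a `torch.distributed` process group. The environment variable names (`MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, `NNODES`) are assumptions about how the master address and node ranks could be passed in from Valohai's distributed configuration; the repository's actual wiring may differ.

```python
# Rough sketch: multi-machine coordination with torch.distributed and
# torch.multiprocessing. The environment variables below are assumed to be
# populated from Valohai's distributed configuration, not guaranteed names.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

MASTER_ADDR = os.environ.get("MASTER_ADDR", "127.0.0.1")  # assumption: set by the launcher
MASTER_PORT = os.environ.get("MASTER_PORT", "29500")
NODE_RANK = int(os.environ.get("NODE_RANK", "0"))   # this machine's index in the task
NNODES = int(os.environ.get("NNODES", "1"))         # total machines in the task
GPUS_PER_NODE = torch.cuda.device_count()


def worker(local_rank: int) -> None:
    # Ranks must be unique across all machines, so offset by this node's index
    global_rank = NODE_RANK * GPUS_PER_NODE + local_rank
    world_size = NNODES * GPUS_PER_NODE
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{MASTER_ADDR}:{MASTER_PORT}",
        rank=global_rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DistributedDataParallel, and train ...
    dist.barrier()  # wait for every process before shutting down
    dist.destroy_process_group()


if __name__ == "__main__":
    # One process per local GPU on every machine participating in the task
    mp.spawn(worker, nprocs=GPUS_PER_NODE)
```

Computing the global rank as `node_rank * gpus_per_node + local_rank` keeps ranks unique across machines, which is what the NCCL process group requires.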
To get started, log in to the Valohai app and create a new project.
Using UI
Configure this repository as the project's repository by following these steps:
- Go to your project's page.
- Navigate to the Settings tab.
- Under the Repository section, locate the URL field.
- Enter the URL of this repository.
- Click on the Save button to save the changes.
Using terminal
To run the code on Valohai using the terminal, follow these steps:
- Install Valohai on your machine by running the following command:
pip install valohai-cli valohai-utils
- Log in to Valohai from the terminal using the command:
vh login
- Create a project for your Valohai workflow. Start by creating a directory for your project:
mkdir valohai-distributed-llms
cd valohai-distributed-llms
Then, create the Valohai project:
vh project create
- Clone the repository to your local machine:
git clone https://github.com/valohai/distributed-llms-example.git .
Using UI
- Go to the Executions tab in your project.
- Create a new execution by selecting the predefined step.
- Customize the execution parameters if needed.
- Start the execution to run the selected step.
Using terminal
To run individual steps, execute the following command:
vh execution run <step-name> --adhoc
For example, to run the train-torchrun step, use the command:
vh execution run train-torchrun --adhoc
Using UI
- Go to the Tasks tab in your project.
- Create a new task by selecting the predefined step `train-task`.
- Choose one of the following options:
- Navigate to Task type, and opt for Distributed. Adjust the execution count.
- Utilize the blueprint by clicking "Select Task blueprint" in the upper right corner.
- Customize the task parameters if needed.
- Start the task to run it.
For bug reports and feature requests, please visit GitHub Issues.
If you need any help, feel free to contact our support team!