[PoC] accelerate support for trl #50

Closed
wants to merge 5 commits

Conversation

@younesbelkada (Contributor) commented on Dec 14, 2022

Hey there!

Let's see how we can integrate accelerate to enable training larger models (up to the 10B scale). Here is a draft version of how we could potentially approach that.

Firstly, I haven't written tests yet; for now, having example scripts in examples/ and running them while developing seems to be the right approach. In the near future we'll need to add tests. I also think that we should develop the code outside of nbdev.

Secondly, I think that we should be able to support more models instead of writing a new file per architecture. This will be addressed too.

To summarize, this PR addresses the following:

1- accelerate integration for naive training using accelerate

For now this only supports the native accelerate integration. It includes automatic mixed precision training and GPU device assignment. I still need to test the multi-GPU setup & Data Parallelism.
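Roughly, the naive integration boils down to something like the following (a minimal sketch, not the exact trainer code in this PR; the PPO-specific parts are omitted):

```python
from accelerate import Accelerator
from torch.optim import Adam
from transformers import AutoModelForCausalLM

# Minimal sketch: Accelerator handles device placement and the mixed-precision
# settings picked up from `accelerate launch` / `accelerate config`.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
optimizer = Adam(model.parameters(), lr=1e-5)

accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

# inside the training loop, `loss.backward()` becomes:
#   accelerator.backward(loss)
```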

2- Support for xxxForCausalLM architecture

Anyone can start experimenting with any of the xxxForCausalLM architectures; this includes Bloom, OPT, etc.
The API looks like the following for now and can be adapted:

from transformers import AutoModelForCausalLM
from trl.models import AutoRegressiveLMWithValueHead

# load any xxxForCausalLM checkpoint from the Hub
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
# wrap it to add a value head on top of the language model
ref_model = AutoRegressiveLMWithValueHead(model)

3- Support latest version of transformers

This PR also adds slight modifications to support the latest transformers version.

4- Community orientation

I believe this repo should be a central point for anyone in the community who wants to experiment with hfrl. I think that we should educate users and contributors on how to implement a new training method. For now, to add a new training method a user should:

  • Implement their own xxxTrainer class that inherits from the BaseTrainer class
  • Add a clear example in examples/

That's it! (A minimal sketch follows below.)
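For illustration, a new trainer could look roughly like this (the import path and the methods required by BaseTrainer are placeholders here, not the actual interface in this PR):

```python
from trl import BaseTrainer  # import path assumed for illustration

class MyCustomTrainer(BaseTrainer):
    """Hypothetical trainer showing the intended extension pattern."""

    def __init__(self, model, tokenizer, **config):
        super().__init__()
        self.model = model
        self.tokenizer = tokenizer
        self.config = config

    def step(self, queries, responses, scores):
        # one optimization step of the custom training method
        ...

    def generate(self, query_tensors, **gen_kwargs):
        # produce responses for the given queries
        ...
```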

5- How am I testing the implementation?

For now I am running accelerate launch example/accelerate-ppo.py, observing the training dynamics of the model, and asserting that it behaves more or less the same as the reference run: https://wandb.ai/lvwerra/trl-showcase/runs/1jtvxb1m?workspace=
Usually after 6-7 steps the reward starts to skyrocket.

What is next?

  • discuss how we can improve the API
  • remove the dependency on nbdev
  • broader support for more architectures
  • test multi-GPU setup (DP)
  • Implement tests

cc @lvwerra

@younesbelkada (Contributor, Author) commented:

For now I am testing my implementation with accelerate launch example/ppo-accelerate.py

- add `make style` support
- add `accelerate` example that runs fine
- refactor `gpt2` to support latest version of `transformers`
- add `base` model to support `xxxForCausalLM` support
- refactor `accelerate` trainer
- add `utils` file
@younesbelkada (Contributor, Author) commented:

Regarding tests, this is tricky, but from what I can see we can for now:

  • Test that all trainers respect the inheritance from BaseTrainer (by checking that all the needed functions are implemented) - see the sketch after this list
  • Test that all models work as expected (thinking of the generate method) and that we can in fact support all xxxForCausalLM architectures as claimed above. From what I can see, as long as the model has a proper generate method, the PPOTrainer should work
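For the first point, a rough pytest-style sketch could look like this (the trainer list and required method names are placeholders):

```python
import pytest

# Placeholder list of trainers to check, e.g. [PPOTrainer, ...]
ALL_TRAINERS = []
# Placeholder for the methods every trainer is expected to implement
REQUIRED_METHODS = ["step", "generate"]

@pytest.mark.parametrize("trainer_cls", ALL_TRAINERS)
def test_trainer_respects_base_interface(trainer_cls):
    for method in REQUIRED_METHODS:
        assert callable(getattr(trainer_cls, method, None)), (
            f"{trainer_cls.__name__} is missing `{method}`"
        )
```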

@lvwerra (Member) left a comment

Awesome work @younesbelkada! I left a few comments, but I think figuring out all the details for the proposed changes will take some time, so what do you think about breaking it down a bit:

  • remove nbdev and set up a CI
  • a PR for the AutoModelForCausalLMWithValueHead (we can easily test whether this works as expected)
  • I think we only want the Accelerate trainer (which should be equivalent to what's there now if you just run accelerate on a single machine).
  • The two places where we would benefit the most from data parallelism are generation and step - let's think about how we can achieve that in the most elegant way.

I also need to read up a bit on current strategies. Maybe there are other ways to optimize the training loop (e.g. do the ref and trained models need to be different, or could it be the same model with two heads?).

How do I imagine the API?
from transformers import BloomForCausalLM, BloomTokenizer
from trl import AutoRegressiveLMWithValueHead
lvwerra (Member) commented:

What do you think about sticking to the transformers naming?

Suggested change:
- from trl import AutoRegressiveLMWithValueHead
+ from trl import AutoModelForCausalLMWithValueHead


#### Get response from gpt2
t = time.time()
response_tensors = ppo_trainer.get_response(query_tensors, **gen_kwargs)
lvwerra (Member) commented:

What do you think about using familiar naming here:

Suggested change:
- response_tensors = ppo_trainer.get_response(query_tensors, **gen_kwargs)
+ response_tensors = ppo_trainer.generate(query_tensors, **gen_kwargs)

Comment on lines +88 to +93
logs.update({'game_log': wandb.Table(columns=['query', 'response', 'reward'], rows=table_rows)})
logs.update(timing)
logs.update(stats)
logs['env/reward_mean'] = torch.mean(rewards).cpu().numpy()
logs['env/reward_std'] = torch.std(rewards).cpu().numpy()
logs['env/reward_dist'] = rewards.cpu().numpy()
lvwerra (Member) commented:

We should find a better way of logging things, maybe inside the trainer, similar to how the transformers Trainer does it.
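For example, the trainer could own a small logging helper so the example script doesn't have to assemble the logs dict by hand (a sketch only; method and key names are made up):

```python
import torch

class PPOTrainer:
    ...

    def log_stats(self, stats, timing, rewards):
        """Hypothetical helper that gathers everything currently logged in the example."""
        logs = dict(stats)
        logs.update(timing)
        logs["env/reward_mean"] = torch.mean(rewards).item()
        logs["env/reward_std"] = torch.std(rewards).item()
        # assumes the Accelerator was created with a tracker (see the logger comment further down)
        self.accelerator.log(logs)
```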


self.model, self.ref_model, self.optimizer, self.data_collator, self.dataloader = self.accelerator.prepare(self.model, self.ref_model, self.optimizer, self.data_collator, self.dataloader)

def _build_dataset(self, dataset_name="imdb"):
lvwerra (Member) commented:

I would keep this outside the Trainer; that's the user's responsibility, and inside we have a very specific toy example.

# HACK: do we really need this?
self.tokenizer.pad_token = self.tokenizer.eos_token

self.model = AutoRegressiveLMWithValueHead(base_model)
lvwerra (Member) commented:

Maybe we can write this class in a way that we can call just:

self.model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

instead of having the two-step process?
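A rough sketch of how the one-step API could be implemented (class internals are illustrative, not a final design):

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class AutoModelForCausalLMWithValueHead(nn.Module):
    """Illustrative wrapper: load the base LM and attach a scalar value head."""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        # hidden size lookup may need per-architecture handling (n_embd, d_model, ...)
        self.value_head = nn.Linear(base_model.config.hidden_size, 1)

    @classmethod
    def from_pretrained(cls, model_name, **kwargs):
        base_model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
        return cls(base_model)
```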

wandb.watch(self.model, log='all')


def step(self, queries, responses, scores):
lvwerra (Member) commented:

The step call is by far the slowest part of the whole training process:
[screenshot: timing breakdown showing the step call dominating the training loop]

This means that we could profit a lot from parallelism (if it works). However, that would mean wrapping the data into a dataloader and doing minibatches in parallel. We would need to experiment and check the literature on how much we can actually do that with PPO.
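As a sketch of the idea only (whether PPO tolerates this kind of sharding still needs to be checked), the rollout could be wrapped in a DataLoader and prepared by accelerate so that each process runs the inner PPO epochs on its own shard of minibatches:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in rollout data; in practice queries/responses would first need to be
# padded into fixed-size tensors, which the current example does not do.
queries = torch.randint(0, 50_000, (256, 16))
responses = torch.randint(0, 50_000, (256, 24))
rewards = torch.randn(256)

rollout = TensorDataset(queries, responses, rewards)
minibatch_loader = DataLoader(rollout, batch_size=32, shuffle=True)
# minibatch_loader = accelerator.prepare(minibatch_loader)  # each process then sees its own shard

for batch_queries, batch_responses, batch_rewards in minibatch_loader:
    pass  # run the PPO minibatch update here
```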

self.config.update(config)

# Step 2: Initialize model, tokenizer and dataset
self._build_models_and_tokenizer()
lvwerra (Member) commented:

I think we can follow a logic similar to what we do in the Trainer/Evaluator classes: the models can either be provided as torch modules or as strings, in which case we load them.
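A small sketch of that pattern (the helper name is made up; only the argument handling is shown):

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

def resolve_model(model_or_name):
    """Accept either an instantiated model or a checkpoint name, as Trainer/Evaluator do."""
    if isinstance(model_or_name, str):
        return AutoModelForCausalLM.from_pretrained(model_or_name)
    if isinstance(model_or_name, nn.Module):
        return model_or_name
    raise ValueError(f"expected a model or a string, got {type(model_or_name)}")
```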


timing['time/ppo/total'] = time.time()-t0
stats.update(timing)
return stats
lvwerra (Member) commented:

Since you initialize the W&B logger with accelerate, we can probably just pass all the stats to it internally and not put the burden on the user.

Also, I think we can make the type of logger a kwarg passed to the PPO trainer, so we don't require users to use W&B and they could also set up other loggers.
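One way this could look using accelerate's built-in tracker support (`log_with`, `init_trackers`, and `log` are existing accelerate APIs; how the kwarg is exposed on the PPO trainer is an assumption):

```python
from accelerate import Accelerator

# Hypothetical trainer-side wiring: the tracker type comes from a trainer kwarg,
# so users can pick "wandb", "tensorboard", or no tracker at all.
accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers(project_name="trl-ppo", config={"ppo_epochs": 4})

# later, from the trainer's logging code:
accelerator.log({"env/reward_mean": 0.0}, step=0)
accelerator.end_training()
```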

@younesbelkada (Contributor, Author) commented:

Closing in favor of #58
