
[feat] Ray train integration #312

Merged · 67 commits · merged into main on Mar 31, 2023

Conversation

@maxreciprocate (Collaborator) commented Feb 16, 2023:

This PR lets Ray Tune use Ray's AccelerateTrainer, contributed by @Yard1.

Example of usage:
python -m trlx.sweep -y --config configs/sweeps/ppo_sweep.yml --accelerate_config configs/accelerate/zero2-bf16.yaml --num_gpus 4 examples/ppo_sentiments.py

Example of multi-node usage:
https://github.com/CarperAI/trlx/blob/b3664b61407cfc211b7baf2dd2c98c14a55f6ad0/scripts/sweep-cw.sh
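For orientation, a sweep config of the kind passed via --config might look something like the following. This is purely a hypothetical sketch of the shape (the keys `tune_config`, `strategy`, and `values` here are illustrative, not confirmed); see configs/sweeps/ppo_sweep.yml in the repo for the real schema:

```yaml
# Hypothetical shape only -- consult configs/sweeps/ppo_sweep.yml for the real keys.
tune_config:
  mode: max                 # maximize the chosen metric
  metric: reward/mean       # illustrative metric name
  num_samples: 32           # number of trials in the sweep
lr:
  strategy: loguniform      # sample learning rates on a log scale
  values: [1.0e-6, 1.0e-4]
```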

ppo_sentiments/1gpu: https://wandb.ai/sorry/sweep_ppo_sentiments/reports/Hyperparameter-Optimization-Report-sweep_ppo_sentiments--VmlldzozOTI2MTA2
ppo_sentiments/12gpus: https://wandb.ai/sorry/sweep_ppo_sentiments/reports/Hyperparameter-Optimization-Report-sweep_ppo_sentiments--VmlldzozOTM0NzA0

https://wandb.ai/sorry/trlx-references/reports/ray-train-integration-v-main--VmlldzozOTQ0NzI4

Yard1 and others added 25 commits January 6, 2023 23:49
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@ayulockin (Contributor) commented:

Hey @reciprocated, do you need any help with this PR?

@Yard1 (Contributor) commented Mar 23, 2023:

Hi, we have merged ray-project/ray#33269.

This PR can be updated to remove AccelerateTrainer from TRLX and instead use ray.train.huggingface.accelerate.AccelerateTrainer, after pinning to (tomorrow's) Ray nightly wheel.
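For illustration, here is a rough sketch of what moving to the upstream trainer might look like. The module path `ray.train.huggingface.accelerate.AccelerateTrainer` comes from the comment above; `make_trainer_kwargs` and `train_fn` are hypothetical names introduced here, and the exact trainer signature may differ between Ray versions:

```python
# Hypothetical sketch: assemble arguments for Ray's upstream AccelerateTrainer.
# `make_trainer_kwargs` is an illustrative helper, not part of trlx or Ray.

def make_trainer_kwargs(num_workers, accelerate_config_path, use_gpu=True):
    """Bundle the pieces a sweep trial would hand to the upstream trainer."""
    return {
        "accelerate_config": accelerate_config_path,  # e.g. configs/accelerate/zero2-bf16.yaml
        "scaling_config": {"num_workers": num_workers, "use_gpu": use_gpu},
    }

# With a pinned Ray nightly, the actual construction would then look roughly like:
#   from ray.train.huggingface.accelerate import AccelerateTrainer
#   from ray.air.config import ScalingConfig
#   trainer = AccelerateTrainer(
#       train_loop_per_worker=train_fn,  # hypothetical per-worker training loop
#       accelerate_config=kwargs["accelerate_config"],
#       scaling_config=ScalingConfig(**kwargs["scaling_config"]),
#   )

kwargs = make_trainer_kwargs(4, "configs/accelerate/zero2-bf16.yaml")
```

The point of the swap is that trlx no longer has to vendor its own copy of the trainer once the upstream one ships.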

Yard1 and others added 8 commits March 23, 2023 23:59
* Use `AccelerateTrainer` from Ray

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

* fix(sweep): `accelerate_config_path` -> `accelerate_config`

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: reciprocated <56548574+reciprocated@users.noreply.github.com>
@maxreciprocate maxreciprocate requested a review from jon-tow March 30, 2023 20:03
@jon-tow (Collaborator) left a comment:

Great to finally be able to scale these sweeps! Collapsing the ray_tune module into sweep.py is very clean 🤗

I left a small concern on the README example if y'all could take a look when possible.


# Initialize Ray.
if args.server_address:
    ray.init(address=f"ray://{args.server_address}")
else:
    ray.init()
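The address selection above can be captured as a small pure helper (a sketch only; `resolve_ray_address` is a hypothetical name, not part of the PR):

```python
from typing import Optional

def resolve_ray_address(server_address: Optional[str]) -> Optional[str]:
    """Mirror the branching above: return a Ray client address when a remote
    head node is given, or None so ray.init() starts or attaches to a local
    cluster."""
    if server_address:
        return f"ray://{server_address}"
    return None

# ray.init(address=...) would then accept either result:
print(resolve_ray_address("head-node:10001"))  # ray://head-node:10001
print(resolve_ray_address(None))               # None
```

The `ray://` scheme routes the call through the Ray Client to a remote head node, while `None` keeps everything on the current machine.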
@jon-tow (Collaborator) commented:

Running the example script from the README.md on a single-node 8xA100 instance (CoreWeave), I get ConnectionErrors of the kind:

ConnectionError: Could not read 'session_name' from GCS. Did GCS start successfully?

(obviously not using GCS)

Passing the local flag to this init (ray.init("local")) seems to address the issue. So I'm wondering whether we should add a CLI arg for such use cases, or simply update the README to tell users to start a Ray instance first and then run the command. Has any PR author run into the ConnectionError on the most recent commit?

Note: This is in a fresh env from this branch.

@Yard1 (Contributor) commented Mar 31, 2023:

ray.init() just by itself should work fine, and it will automatically start a Ray cluster composed of just the current instance if one is not running already. Are you setting the server_address arg?

Setting local should be equivalent to the default value of None, so this may be a bug on our side. Is there anything else in your stderr? FYI, GCS in this context refers to a Ray concept: https://docs.ray.io/en/latest/ray-core/scheduling/memory-management.html#concepts

@jon-tow (Collaborator) replied:

I no longer have this issue. There have been some network issues with our cluster recently, so I'm going to chalk it up to that. Note that I did not set the server_address arg, for what it's worth.
Re GCS: I had quickly assumed it was Google Cloud-related; thank you for clarifying 🙏

@maxreciprocate (Collaborator, Author) commented Mar 31, 2023:

@jon-tow How was it resolved for you? I'm experiencing the same thing right now, having never encountered it before. Something must be interfering with Ray's autodiscovery; however, either starting a Ray cluster by hand or forcing local does work.

@jon-tow (Collaborator) replied:

I'm not quite sure. I tried a new node and it was then able to auto-launch locally 😅

@jon-tow (Collaborator) left a comment:

Looks good on my end! (@reciprocated cleared up an issue I ran into with disk syncing, so my concerns have been addressed outside this thread.)

@maxreciprocate maxreciprocate merged commit c9ab683 into main Mar 31, 2023
@ayulockin (Contributor) commented:

Hey @reciprocated now that this PR is merged we will have to update this W&B report: https://wandb.ai/ayut/trlx-ppo-sentiments-hyperopt/reports/RLHF-Hyperparameter-Optimization-for-trlX--VmlldzoyOTgxMTQ2

Do we have a tutorial or example I can refer to for the changes? :)

@maxreciprocate (Collaborator, Author) replied:

@ayulockin Sure thing! However, I don't think anything has changed functionality-wise, except for this remark from the report:

Currently distributed training is not supported and for that reason using only one GPU per trial is recommended

which has now become outdated :)

@maxreciprocate maxreciprocate deleted the ray-train-integration branch April 6, 2023 21:04