
[feat] Ray train integration #312

Merged · 67 commits · merged into main on Mar 31, 2023

Conversation

@maxreciprocate (Collaborator) commented Feb 16, 2023:

This PR lets Ray Tune use Ray's AccelerateTrainer, contributed by @Yard1.

Example of usage:
python -m trlx.sweep -y --config configs/sweeps/ppo_sweep.yml --accelerate_config configs/accelerate/zero2-bf16.yaml --num_gpus 4 examples/ppo_sentiments.py

Example of multi-node usage:
https://github.com/CarperAI/trlx/blob/b3664b61407cfc211b7baf2dd2c98c14a55f6ad0/scripts/sweep-cw.sh
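For orientation, a sweep config of the kind passed via --config might look something like the following. This is purely a hypothetical sketch of the shape (the keys `tune_config`, `strategy`, and `values` here are illustrative, not confirmed); see configs/sweeps/ppo_sweep.yml in the repo for the real schema:

```yaml
# Hypothetical shape only -- consult configs/sweeps/ppo_sweep.yml for the real keys.
tune_config:
  mode: max                 # maximize the chosen metric
  metric: reward/mean       # illustrative metric name
  num_samples: 32           # number of trials in the sweep
lr:
  strategy: loguniform      # sample learning rates on a log scale
  values: [1.0e-6, 1.0e-4]
```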

ppo_sentiments/1gpu: https://wandb.ai/sorry/sweep_ppo_sentiments/reports/Hyperparameter-Optimization-Report-sweep_ppo_sentiments--VmlldzozOTI2MTA2
ppo_sentiments/12gpus: https://wandb.ai/sorry/sweep_ppo_sentiments/reports/Hyperparameter-Optimization-Report-sweep_ppo_sentiments--VmlldzozOTM0NzA0

https://wandb.ai/sorry/trlx-references/reports/ray-train-integration-v-main--VmlldzozOTQ0NzI4

Yard1 and others added 25 commits January 6, 2023 23:49
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@ayulockin (Contributor) commented:

Hey @reciprocated, do you need any help with this PR?

@Yard1 (Contributor) commented Mar 23, 2023:

Hi, we have merged ray-project/ray#33269.

This PR can be updated to remove AccelerateTrainer from TRLX and instead use ray.train.huggingface.accelerate.AccelerateTrainer, after pinning to (tomorrow's) Ray nightly wheel.
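For illustration, here is a rough sketch of what moving to the upstream trainer might look like. The module path `ray.train.huggingface.accelerate.AccelerateTrainer` comes from the comment above; `make_trainer_kwargs` and `train_fn` are hypothetical names introduced here, and the exact trainer signature may differ between Ray versions:

```python
# Hypothetical sketch: assemble arguments for Ray's upstream AccelerateTrainer.
# `make_trainer_kwargs` is an illustrative helper, not part of trlx or Ray.

def make_trainer_kwargs(num_workers, accelerate_config_path, use_gpu=True):
    """Bundle the pieces a sweep trial would hand to the upstream trainer."""
    return {
        "accelerate_config": accelerate_config_path,  # e.g. configs/accelerate/zero2-bf16.yaml
        "scaling_config": {"num_workers": num_workers, "use_gpu": use_gpu},
    }

# With a pinned Ray nightly, the actual construction would then look roughly like:
#   from ray.train.huggingface.accelerate import AccelerateTrainer
#   from ray.air.config import ScalingConfig
#   trainer = AccelerateTrainer(
#       train_loop_per_worker=train_fn,  # hypothetical per-worker training loop
#       accelerate_config=kwargs["accelerate_config"],
#       scaling_config=ScalingConfig(**kwargs["scaling_config"]),
#   )

kwargs = make_trainer_kwargs(4, "configs/accelerate/zero2-bf16.yaml")
```

The point of the swap is that trlx no longer has to vendor its own copy of the trainer once the upstream one ships.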

Yard1 and others added 8 commits March 23, 2023 23:59
* Use `AccelerateTrainer` from Ray

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

* fix(sweep): `accelerate_config_path` -> `accelerate_config`

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: reciprocated <56548574+reciprocated@users.noreply.github.com>
@maxreciprocate maxreciprocate requested a review from jon-tow March 30, 2023 20:03
@jon-tow (Collaborator) left a comment:

Great to finally be able to scale these sweeps! Collapsing the ray_tune module into sweep.py is very clean 🤗

I left a small concern on the README example if y'all could take a look when possible.


# Initialize Ray.
if args.server_address:
    ray.init(address=f"ray://{args.server_address}")
else:
    ray.init()
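The address selection above can be captured as a small pure helper (a sketch only; `resolve_ray_address` is a hypothetical name, not part of the PR):

```python
from typing import Optional

def resolve_ray_address(server_address: Optional[str]) -> Optional[str]:
    """Mirror the branching above: return a Ray client address when a remote
    head node is given, or None so ray.init() starts or attaches to a local
    cluster."""
    if server_address:
        return f"ray://{server_address}"
    return None

# ray.init(address=...) would then accept either result:
print(resolve_ray_address("head-node:10001"))  # ray://head-node:10001
print(resolve_ray_address(None))               # None
```

The `ray://` scheme routes the call through the Ray Client to a remote head node, while `None` keeps everything on the current machine.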
@jon-tow (Collaborator) commented:

Running the example script from the README.md on a single-node 8xA100 instance (CoreWeave), I get ConnectionErrors of the kind:

ConnectionError: Could not read 'session_name' from GCS. Did GCS start successfully?

(obviously not using GCS)

Passing the local flag to this init (ray.init("local")) seems to address the issue. So I'm wondering whether we should add a CLI arg for such use cases, or simply update the README to tell users to start a Ray instance first and then run the command. Has any PR author run into the ConnectionError on the most recent commit?

Note: This is in a fresh env from this branch.

@Yard1 (Contributor) commented Mar 31, 2023:

ray.init() just by itself should work fine, and it will automatically start a Ray cluster composed of just the current instance if one is not running already. Are you setting the server_address arg?

Setting local should be equivalent to the default value of None, so this may be a bug on our side. Is there anything else in your stderr? FYI, GCS in this context refers to a Ray concept: https://docs.ray.io/en/latest/ray-core/scheduling/memory-management.html#concepts

@jon-tow (Collaborator) replied:

I no longer have this issue. There have been some network issues with our cluster recently, so I'm going to chalk it up to that. Note that I did not set the server_address arg, for what it's worth.
Re GCS: I had quickly assumed it was Google Cloud-related; thank you for clarifying 🙏

@maxreciprocate (Collaborator, Author) commented Mar 31, 2023:

@jon-tow How was it resolved for you? I'm experiencing the same thing right now, having never encountered it before. Something must be interfering with Ray's autodiscovery; however, either starting a Ray cluster by hand or forcing local does work.

@jon-tow (Collaborator) replied:

I'm not quite sure. I tried a new node and it was then able to auto-launch locally 😅

@jon-tow (Collaborator) left a comment:

Looks good on my end! (@reciprocated cleared up an issue I ran into with disk syncing, so my concerns have been addressed outside this thread.)

@maxreciprocate maxreciprocate merged commit c9ab683 into main Mar 31, 2023
@ayulockin (Contributor) commented:

Hey @reciprocated now that this PR is merged we will have to update this W&B report: https://wandb.ai/ayut/trlx-ppo-sentiments-hyperopt/reports/RLHF-Hyperparameter-Optimization-for-trlX--VmlldzoyOTgxMTQ2

Do we have a tutorial or example I can refer to for the changes? :)

@maxreciprocate (Collaborator, Author) replied:

@ayulockin Sure thing! However, I don't think anything has changed functionality-wise, except for this remark from the report:

Currently distributed training is not supported and for that reason using only one GPU per trial is recommended

which has now become outdated :)

@maxreciprocate maxreciprocate deleted the ray-train-integration branch April 6, 2023 21:04