[feat] Ray train integration #312
Conversation
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Hey @reciprocated, do you need any help with this PR?
Hi, we have merged ray-project/ray#33269. This PR can be updated to remove
* Use `AccelerateTrainer` from Ray
* fix(sweep): `accelerate_config_path` -> `accelerate_config`

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: reciprocated <56548574+reciprocated@users.noreply.github.com>
Great to finally be able to scale these sweeps! Collapsing the `ray_tune` module into `sweep.py` is very clean 🤗

I left a small concern on the README example if y'all could take a look when possible.
```python
# Initialize Ray.
if args.server_address:
    ray.init(address=f"ray://{args.server_address}")
else:
    ray.init()
```
Running the example script from the README.md within a single-node 8xA100 instance (CoreWeave), I get `ConnectionError`s of the kind:

```
ConnectionError: Could not read 'session_name' from GCS. Did GCS start successfully?
```

(obviously not using GCS)

Passing the `local` flag to this init (`ray.init("local")`) seems to address the issue. So I'm wondering if maybe we should include a CLI arg for such use cases, or simply update the README to tell users to start a Ray instance first and then run the command. Has any PR author run into the `ConnectionError` on the most recent commit?

Note: This is in a fresh env from this branch.
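Such a CLI arg could be a thin wrapper mapping flags onto `ray.init` keyword arguments. A minimal sketch, assuming a hypothetical `--local` flag and `build_init_kwargs` helper (neither is part of this PR):

```python
import argparse


def build_init_kwargs(server_address=None, local=False):
    """Map the (hypothetical) CLI options to ray.init keyword arguments."""
    if server_address:
        # Connect to an existing remote cluster through the Ray client.
        return {"address": f"ray://{server_address}"}
    if local:
        # Force a fresh single-node local cluster, skipping autodiscovery.
        return {"address": "local"}
    # Default: attach to a running cluster if one is found, else start one.
    return {}


parser = argparse.ArgumentParser()
parser.add_argument("--server_address", default=None)
parser.add_argument("--local", action="store_true")
args = parser.parse_args(["--local"])
print(build_init_kwargs(args.server_address, args.local))  # {'address': 'local'}
```

The call site would then be a single `ray.init(**build_init_kwargs(...))`, keeping the three connection modes (remote client, forced local, autodiscovery) in one place.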
`ray.init()` just by itself should work fine, and it will automatically start a Ray cluster composed of just the current instance if one is not running already. Are you setting the `server_address` arg?

Setting `local` should be equivalent to the default value of `None`, so this may be a bug on our side. Is there anything else in your stderr? FYI, GCS in this context refers to a Ray concept - https://docs.ray.io/en/latest/ray-core/scheduling/memory-management.html#concepts
I no longer have this issue. There have been some network issues with our cluster recently, so I'm going to chalk it up to that. Note that I did not set the `server_address` arg, for what it's worth.

Re GCS: I quickly assumed it was Google Cloud related, thank you for clarifying 🙏
@jon-tow How was it resolved for you? I'm experiencing the same thing right now, having never encountered it before. Something must be messing with Ray autodiscovery; either starting a Ray cluster by hand or forcing `local` does work, however.
I'm not quite sure. I tried a new node and it was then able to auto-launch locally 😅
Looks good on my end! (@reciprocated cleared up an issue I ran into with disk syncing so my concerns have been addressed outside here).
Hey @reciprocated, now that this PR is merged we will have to update this W&B report: https://wandb.ai/ayut/trlx-ppo-sentiments-hyperopt/reports/RLHF-Hyperparameter-Optimization-for-trlX--VmlldzoyOTgxMTQ2 Do we have a tutorial or example I can refer to for the changes? :)
@ayulockin Sure thing! However, I don't think anything has changed functionality-wise, except for this remark from the report:

which has now become outdated :)
This PR lets Ray Tune use Ray's `AccelerateTrainer` from @Yard1.

Example of usage:

```shell
python -m trlx.sweep -y --config configs/sweeps/ppo_sweep.yml --accelerate_config configs/accelerate/zero2-bf16.yaml --num_gpus 4 examples/ppo_sentiments.py
```
Example of multi-node usage:
https://github.com/CarperAI/trlx/blob/b3664b61407cfc211b7baf2dd2c98c14a55f6ad0/scripts/sweep-cw.sh
ppo_sentiments/1gpu: https://wandb.ai/sorry/sweep_ppo_sentiments/reports/Hyperparameter-Optimization-Report-sweep_ppo_sentiments--VmlldzozOTI2MTA2
ppo_sentiments/12gpus: https://wandb.ai/sorry/sweep_ppo_sentiments/reports/Hyperparameter-Optimization-Report-sweep_ppo_sentiments--VmlldzozOTM0NzA0
https://wandb.ai/sorry/trlx-references/reports/ray-train-integration-v-main--VmlldzozOTQ0NzI4