[RFC] Simplify accelerator API, add training type argument #6090
Comments
just a quick remark, is this a typo?
These are all the possibilities we currently have:

```python
training_types = [
    "single",
    "dp",
    "ddp",
    "ddp2",
    "ddp_spawn",
    "ddp_sharded",
    "ddp_sharded_spawn",
    "horovod",
]
accelerators = ["cpu", "gpu", "tpu"]

import itertools
list(itertools.product(accelerators, training_types))

# ✅/👎 marks whether the combination is supported,
# the comment indicates how it is currently set
[
    ✅ ("cpu", "single"),             # gpus=0
    👎 ("cpu", "ddp"),
    👎 ("cpu", "ddp2"),
    ✅ ("cpu", "ddp_spawn"),          # accelerator=ddp_cpu, num_processes=n
    👎 ("cpu", "ddp_sharded"),
    👎 ("cpu", "ddp_sharded_spawn"),
    👎 ("cpu", "dp"),
    ✅ ("cpu", "horovod"),            # accelerator=horovod
    ✅ ("gpu", "single"),             # gpus=n
    ✅ ("gpu", "ddp"),                # accelerator=ddp, gpus=n
    ✅ ("gpu", "ddp2"),               # accelerator=ddp2, gpus=n
    ✅ ("gpu", "ddp_spawn"),          # accelerator=ddp_spawn, gpus=n
    ✅ ("gpu", "ddp_sharded"),        # accelerator=ddp, gpus=n, plugins=ddp_sharded
    ✅ ("gpu", "ddp_sharded_spawn"),  # accelerator=ddp, gpus=n, plugins=ddp_sharded_spawn
    ✅ ("gpu", "dp"),                 # accelerator=dp, gpus=n
    ✅ ("gpu", "horovod"),            # accelerator=horovod, gpus=n
    ✅ ("tpu", "single"),             # tpu_cores=n
    👎 ("tpu", "ddp"),
    👎 ("tpu", "ddp2"),
    ✅ ("tpu", "ddp_spawn"),          # accelerator=ddp_spawn, tpu_cores=n
    👎 ("tpu", "ddp_sharded"),
    👎 ("tpu", "ddp_sharded_spawn"),
    👎 ("tpu", "dp"),
    👎 ("tpu", "horovod"),
]
```

(Might have made a mistake on some, feel free to edit and correct.) DeepSpeed, RPC, and RPC sequential are just plugins.
I don't think we have a choice. I wouldn't leave it to the user to understand how each of these combinations works. Also, we might want to evaluate standardizing how the number of devices is set. We currently have gpus=n, tpu_cores=n, and num_processes=n depending on the accelerator.
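For illustration, these are the current, accelerator-specific ways of setting the device count (taken from the annotations in the table above; the exact counts are just examples):

```python
from pytorch_lightning import Trainer

# Today the device count is set through a different flag per accelerator:
Trainer(gpus=2, accelerator="ddp")               # 2 GPUs, DDP
Trainer(tpu_cores=8)                             # 8 TPU cores
Trainer(num_processes=4, accelerator="ddp_cpu")  # 4 CPU processes, spawn-based DDP
```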
Thanks @carmocca! Just to play devils advocate, we've gotten this far without having to worry about blacklisting of sorts. Assuming the user knows what they are doing isn't as fatal as we're making it out to believe, as most of these flags requires checking docs to see what is supported. Regardless on the decision of blacklisting/whitelisting I think this should remain a separate endeavour to the API but it is an important extension that the API should support! I agree on standardising the devices! Definitely a separate case that can extend from this API |
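As a sketch of what that separate whitelisting effort could look like (purely illustrative; the names `SUPPORTED_COMBINATIONS` and `validate_combination` are assumptions, not anything proposed in this thread):

```python
# Whitelist built from the ✅ entries in the table above.
SUPPORTED_COMBINATIONS = {
    ("cpu", "single"), ("cpu", "ddp_spawn"), ("cpu", "horovod"),
    ("gpu", "single"), ("gpu", "dp"), ("gpu", "ddp"), ("gpu", "ddp2"),
    ("gpu", "ddp_spawn"), ("gpu", "ddp_sharded"), ("gpu", "ddp_sharded_spawn"),
    ("gpu", "horovod"),
    ("tpu", "single"), ("tpu", "ddp_spawn"),
}

def validate_combination(accelerator: str, training_type: str) -> None:
    """Fail fast with a clear message instead of erroring deep inside setup."""
    if (accelerator, training_type) not in SUPPORTED_COMBINATIONS:
        raise ValueError(
            f"training_type='{training_type}' is not supported "
            f"with accelerator='{accelerator}'"
        )
```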
IMO, it shouldn't be necessary for the user to pass the accelerator flag and be explicit, as in `Trainer(training_type='ddp', accelerator='gpu')` or `Trainer(training_type='ddp_spawn', accelerator='cpu')`. Just passing `training_type` should be enough.
The downside of what you are saying (the current implementation) is that if you want to switch between accelerators, you cannot do it by changing just one parameter. This can be confusing for users who don't understand the difference between a training type and an accelerator.
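To make that concrete (a sketch contrasting today's flags, taken from the table above, with the flags discussed here; the `training_type` argument does not exist yet):

```python
from pytorch_lightning import Trainer

# Current API: moving spawn-based DDP from GPU to CPU means swapping the
# accelerator string for an unrelated-looking one and changing the device flag.
Trainer(accelerator="ddp_spawn", gpus=2)
Trainer(accelerator="ddp_cpu", num_processes=2)

# Proposed split: the training type stays the same, only the accelerator changes.
Trainer(training_type="ddp_spawn", accelerator="gpu", gpus=2)
Trainer(training_type="ddp_spawn", accelerator="cpu", num_processes=2)
```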
I also agree with carmocca that having something like a separate `training_type` argument next to `accelerator` would be clearer.
That's one case. How about the cases where users define ...?
But if you have your trainer set up for GPUs like this: ... In the deprecation phase, I think we have to use whatever is provided by the current `accelerator` flag. Another thing this would enable would be something like ...
In case you are going for this approach, I vote for a ... Also, I vote for providing good defaults where possible, i.e., I don't see ...
@kaushikb11 can you comment on the status of this issue and what the next steps would be?
TODOs I could think of: ...
Closing this - tracking the previous comment in #9932
🚀 Feature
After the accelerator refactor, we've added some niceties around the `TrainingTypePlugin`, which now represents all the logic that sits on top of an accelerator. However, our API still seems to be based on the old paradigm where the accelerator and the distributed backend were tied together, and this has already bitten us in #6089.

What I suggest is we do the following:
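As a rough sketch of the direction, based on the flags discussed in the comments above (argument names are illustrative, not the final API):

```python
from pytorch_lightning import Trainer

# Training type and accelerator become two orthogonal arguments instead of
# being packed into a single `accelerator` string.
trainer = Trainer(
    training_type="ddp",  # how processes are launched/coordinated
    accelerator="gpu",    # which hardware each process runs on
    gpus=2,
)
```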
Incompatible Plugin/Accelerator
This is something I haven't hashed out but let's say the user does:
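For example, something along these lines (an illustrative pairing picked from the 👎 entries in the table above, not the exact snippet from the original issue):

```python
from pytorch_lightning import Trainer

# Sharded DDP on TPU cores: one of the combinations marked 👎 above.
trainer = Trainer(tpu_cores=8, plugins="ddp_sharded")
```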
This is currently not compatible. We should throw an exception of some sort, but this delves into whitelisting/blacklisting of supported plugins, which could get a bit unwieldy. An alternative is to just assume that the user knows enough about compatibility already.
Backwards compatibility

We allow the `TrainingTypePlugin` to be specified via the `accelerator` trainer flag, but throw a warning that this will be deprecated in the future.

cc @ananthsub @carmocca @awaelchli @justusschock @tchaton @Borda @kaushikb11 @williamFalcon
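A minimal sketch of what honoring the old `accelerator` flag during the deprecation phase could look like (the mapping follows the table above; the helper itself is a hypothetical illustration, not Lightning internals):

```python
import warnings
from typing import Tuple

# Hypothetical translation of legacy `accelerator` values to (accelerator, training_type).
LEGACY_ACCELERATOR_MAP = {
    "dp": ("gpu", "dp"),
    "ddp": ("gpu", "ddp"),
    "ddp2": ("gpu", "ddp2"),
    "ddp_spawn": ("gpu", "ddp_spawn"),
    "ddp_cpu": ("cpu", "ddp_spawn"),
}

def resolve_legacy_accelerator(value: str) -> Tuple[str, str]:
    """Translate an old-style accelerator string and warn that it is deprecated."""
    accelerator, training_type = LEGACY_ACCELERATOR_MAP[value]
    warnings.warn(
        f"Passing accelerator='{value}' is deprecated; use "
        f"accelerator='{accelerator}', training_type='{training_type}' instead.",
        DeprecationWarning,
    )
    return accelerator, training_type
```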