
[RFC] Simplify accelerator API, add training type argument #6090

Closed
SeanNaren opened this issue Feb 19, 2021 · 12 comments
Labels: design (Includes a design discussion), discussion (In a discussion stage), feature (Is an improvement or enhancement), help wanted (Open to be worked on)
Milestone: 1.3

Comments

SeanNaren (Contributor) commented Feb 19, 2021

🚀 Feature

After the accelerator refactor, we've added some niceties around the TrainingTypePlugin, which now represents all the logic that sits on top of an accelerator. However, our API still seems to be based on the old paradigm where the accelerator and distributed backend were tied together, and this has already bitten us in #6089.

What I suggest is we do the following:

# Correct
Trainer(gpus=1, accelerator='ddp')  # this is fine; the training type should be set to ddp
Trainer(gpus=1, plugins='ddp_sharded', accelerator='gpu')
Trainer(gpus=1, plugins='ddp_sharded')  # implicit; assume GPU for ddp_sharded as it is the only supported accelerator

# Incorrect
Trainer(gpus=1, accelerator='ddp_sharded')  # should throw a misconfiguration exception
Trainer(gpus=1, plugins='ddp_sharded', accelerator='ddp')  # should throw a deprecation warning

# The Future

# Plugins can be TrainingTypePlugins, PrecisionPlugins, or whatever the user has provided.
# For precision, we don't do `Trainer(plugins='16bit_precision')` but instead `Trainer(precision=16)` 
# so why do we use the plugins flag for the training type?

Trainer(training_type='ddp', accelerator='gpu')
Trainer(training_type='ddp_spawn', accelerator='cpu')

# so Trainer(plugins=...) is left for cluster_environments, SLURM, ...
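To illustrate what would be left for the plugins flag, a hedged sketch (training_type is the flag proposed in this RFC, not an existing argument; SLURMEnvironment is assumed to be importable from pytorch_lightning.plugins.environments):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# plugins= now only carries cluster environments and similar add-ons,
# while the training strategy and hardware get their own flags.
trainer = Trainer(
    training_type="ddp",          # proposed in this RFC
    accelerator="gpu",
    gpus=8,
    plugins=[SLURMEnvironment()],
)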

Incompatible Plugin/Accelerator

This is something I haven't hashed out but let's say the user does:

Trainer(training_type='ddp_sharded', accelerator='tpu')

This combination is currently not supported. We should throw an exception of some sort, but this delves into whitelisting/blacklisting supported plugins, which could get a bit unwieldy. An alternative is to just assume that the user already knows enough about compatibility.
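As a sketch of what the exception path could look like (hypothetical helper and table, not Lightning code; the supported combinations follow the matrix carmocca posts below):

from pytorch_lightning.utilities.exceptions import MisconfigurationException

# Hypothetical table of supported accelerator -> training type combinations.
SUPPORTED_TRAINING_TYPES = {
    "cpu": {"single", "ddp_spawn", "horovod"},
    "gpu": {"single", "dp", "ddp", "ddp2", "ddp_spawn", "ddp_sharded", "ddp_sharded_spawn", "horovod"},
    "tpu": {"single", "ddp_spawn"},
}

def validate_combination(training_type: str, accelerator: str) -> None:
    # Raise early instead of failing deep inside the training loop.
    if training_type not in SUPPORTED_TRAINING_TYPES.get(accelerator, set()):
        raise MisconfigurationException(
            f"training_type='{training_type}' is not supported with accelerator='{accelerator}'"
        )

validate_combination("ddp_sharded", "tpu")  # raises MisconfigurationException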

Backwards compatibility

We allow the TrainingTypePlugin to be specified via the accelerator Trainer flag, but throw a warning that this will be deprecated in the future.
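A minimal sketch of that deprecation path, assuming a hypothetical helper inside the accelerator connector (names and warning text are made up for illustration):

import warnings

# Legacy values of `accelerator=` that really describe a training type.
_LEGACY_ACCELERATORS = ("dp", "ddp", "ddp2", "ddp_spawn", "ddp_cpu", "horovod")

def resolve_training_type(accelerator=None, training_type=None):
    if accelerator in _LEGACY_ACCELERATORS:
        warnings.warn(
            f"Passing `accelerator='{accelerator}'` to select the training type is deprecated;"
            " pass it via `training_type=` instead.",
            DeprecationWarning,
        )
        return accelerator if training_type is None else training_type
    return training_type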

cc @ananthsub @carmocca @awaelchli @justusschock @tchaton @Borda @kaushikb11 @williamFalcon

SeanNaren added the feature, help wanted, and design labels Feb 19, 2021
SeanNaren self-assigned this Feb 19, 2021
Borda (Member) commented Feb 19, 2021

just a quick remark, is this a typo?

Trainer(gpus=1, training_type='ddp_spawn', accelerator='cpu')

Borda added the discussion and Important labels Feb 19, 2021
Borda added this to the 1.3 milestone Feb 19, 2021
carmocca (Contributor) commented Feb 19, 2021

These are all the possibilities we currently have:

training_types = [
    "single",
    "dp",
    "ddp",
    "ddp2",
    "ddp_spawn",
    "ddp_sharded",
    "ddp_sharded_spawn",
    "horovod",
]
accelerators = ["cpu", "gpu", "tpu"]

import itertools
list(itertools.product(accelerators, training_types))
# comment indicates how it is currently set
[
    ("cpu", "single"),             # ✅ gpus=0
    ("cpu", "ddp"),                # 👎
    ("cpu", "ddp2"),               # 👎
    ("cpu", "ddp_spawn"),          # ✅ accelerator=ddp_cpu, num_processes=n
    ("cpu", "ddp_sharded"),        # 👎
    ("cpu", "ddp_sharded_spawn"),  # 👎
    ("cpu", "dp"),                 # 👎
    ("cpu", "horovod"),            # ✅ accelerator=horovod

    ("gpu", "single"),             # ✅ gpus=n
    ("gpu", "ddp"),                # ✅ accelerator=ddp, gpus=n
    ("gpu", "ddp2"),               # ✅ accelerator=ddp2, gpus=n
    ("gpu", "ddp_spawn"),          # ✅ accelerator=ddp_spawn, gpus=n
    ("gpu", "ddp_sharded"),        # ✅ accelerator=ddp, gpus=n, plugins=ddp_sharded
    ("gpu", "ddp_sharded_spawn"),  # ✅ accelerator=ddp, gpus=n, plugins=ddp_sharded_spawn
    ("gpu", "dp"),                 # ✅ accelerator=dp, gpus=n
    ("gpu", "horovod"),            # ✅ accelerator=horovod, gpus=n

    ("tpu", "single"),             # ✅ tpu_cores=n
    ("tpu", "ddp"),                # 👎
    ("tpu", "ddp2"),               # 👎
    ("tpu", "ddp_spawn"),          # ✅ accelerator=ddp_spawn, tpu_cores=n
    ("tpu", "ddp_sharded"),        # 👎
    ("tpu", "ddp_sharded_spawn"),  # 👎
    ("tpu", "dp"),                 # 👎
    ("tpu", "horovod"),            # 👎
]

(Might have made a mistake on some, feel free to edit and correct)

DeepSpeed, RPC, and RPC sequential are just plugins.

> We should throw an exception of some sort, but this delves into whitelisting/blacklisting supported plugins which could get a bit unwieldy

I don't think we have a choice. I wouldn't leave it to the user to understand how each of these combinations works.
Also, no user will be implementing their own training_type or accelerator, so it should be safe for us to decide which of these are allowed.

Also, we might want to evaluate standardizing how the number of devices is set. We currently have num_processes, gpus, and tpu_cores. They could be merged into a single devices or num_processes argument.
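To make that suggestion concrete, a hedged before/after sketch (devices here is the proposed merged flag, not an existing Trainer argument at this point):

# Today: the device count lives in a different argument per hardware type
Trainer(num_processes=8)   # CPU
Trainer(gpus=8)            # GPU
Trainer(tpu_cores=8)       # TPU

# Proposed: one flag for the count, one for the hardware type
Trainer(accelerator="cpu", devices=8)
Trainer(accelerator="gpu", devices=8)
Trainer(accelerator="tpu", devices=8)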

SeanNaren (Contributor, Author) commented

Thanks @carmocca!

Just to play devil's advocate, we've gotten this far without having to worry about blacklisting of any sort. Assuming the user knows what they are doing isn't as fatal as we're making it out to be, as most of these flags require checking the docs to see what is supported. Regardless of the decision on blacklisting/whitelisting, I think it should remain a separate endeavour from this API, but it is an important extension that the API should support!

I agree on standardising the devices! Definitely a separate case that can extend from this API

kaushikb11 (Contributor) commented

IMO, it shouldn't be necessary for the user to pass the accelerator flag and be explicit.

Trainer(training_type='ddp', accelerator='gpu')
Trainer(training_type='ddp_spawn', accelerator='cpu')

Just passing gpus=n or tpu_cores=n should be sufficient, with the CPU accelerator being the default.
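Roughly, the inference being described could look like this (a hypothetical helper for illustration, not the actual accelerator connector logic):

def infer_accelerator(gpus=None, tpu_cores=None) -> str:
    # Pick the accelerator from whichever device-count flag was set,
    # falling back to CPU when nothing is passed.
    if tpu_cores:
        return "tpu"
    if gpus:
        return "gpu"
    return "cpu"

assert infer_accelerator(gpus=2) == "gpu"
assert infer_accelerator(tpu_cores=8) == "tpu"
assert infer_accelerator() == "cpu"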

carmocca (Contributor) commented

> IMO, it shouldn't be necessary for the user to pass the accelerator flag and be explicit.

The downside of what you are saying (the current implementation) is that if you want to switch between accelerators, you cannot do it by passing just a different parameter value (accelerator='cpu'|'gpu', num_processes=5); instead, you have to use a different argument entirely (num_processes=5 vs gpus=5), which is "harder" programmatically.

This can be confusing for users who don't understand the difference between gpus and num_processes or who might think they need to set both flags. Doing the proposed split would clear this up.
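For example, under the proposed split the same script could sweep hardware types by changing a single value (a sketch of the proposal, not current behaviour; model stands in for whatever LightningModule is being trained):

for accelerator in ("cpu", "gpu"):
    trainer = Trainer(accelerator=accelerator, num_processes=5)
    trainer.fit(model)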

justusschock (Member) commented

I also agree with @carmocca that having something like num_processes (or maybe num_devices, since not all backends use multiple processes) together with a flag that indicates the hardware type is cleaner.
This would be a major API change, though.

kaushikb11 (Contributor) commented

That's one case. What about the cases where users pass (accelerator='cpu'|'gpu'|'tpu', gpus=5), or where they only pass gpus=n or tpu_cores=n? Should those flags just be ignored? I think they are a good enough signal to select the accelerator. I feel it's already a bit messy with multiple flags.
Also, for example, to train on TPUs it would be unnecessary to pass both accelerator="tpu" and tpu_cores=n every time, rather than just the latter.

justusschock (Member) commented

But if you have your trainer for GPUs like this:
Trainer(gpus=8)
and want to switch to TPU, you have to make two changes: one to change the value of gpus and one to set tpu_cores. With the new API, you would only have to change accelerator from 'gpu' to 'tpu'.

In the deprecation phase, I think we have to use whatever is provided by the current gpus or tpu_cores arguments though.

Another thing this would enable is something like accelerator='auto', num_processes='auto', where we do smart device detection ourselves, i.e. we check what's available and use the most capable hardware we find (TPUs > GPUs > CPUs).
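A rough sketch of what that detection could look like (purely illustrative; accelerator='auto' does not exist yet, and the TPU check here just tests whether torch_xla is importable):

import torch

def detect_accelerator() -> str:
    # Prefer the most capable hardware that is available: TPU > GPU > CPU.
    try:
        import torch_xla  # noqa: F401 -- only importable in TPU environments
        return "tpu"
    except ImportError:
        pass
    if torch.cuda.is_available():
        return "gpu"
    return "cpu"

# Hypothetical usage: Trainer(accelerator=detect_accelerator(), num_processes='auto')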

kaushikb11 self-assigned this Jun 1, 2021
awaelchli (Contributor) commented Jun 1, 2021

In case you are going for this approach, I vote for a devices argument which works like the gpus argument today (int, list, string), and for removing num_processes instead.

Also, I vote for providing good defaults where possible; i.e., I don't see training_type as a mandatory field.

carmocca (Contributor) commented

@kaushikb11 can you comment on the status of this issue and what the next steps would be?

carmocca (Contributor) commented Oct 14, 2021

TODOs I could think of:

cc @kaushikb11 @tchaton @awaelchli

carmocca (Contributor) commented

Closing this - tracking the previous comment in #9932
