
[RFC] Simplify accelerator API, add training type argument #6090

Closed
SeanNaren opened this issue Feb 19, 2021 · 12 comments
Labels: design (Includes a design discussion), discussion (In a discussion stage), feature (Is an improvement or enhancement), help wanted (Open to be worked on)
Milestone: 1.3

Comments

SeanNaren (Contributor) commented Feb 19, 2021

🚀 Feature

After the accelerator refactor, we've added some niceties around the TrainingTypePlugin, which now represents all the logic that sits on top of an accelerator. However, our API still seems to be based on the old paradigm where the accelerator and distributed backend were tied together, and this has already bitten us in #6089.

What I suggest is we do the following:

# Correct
Trainer(gpus=1, accelerator='ddp')  # this is fine; the training type should be set to ddp
Trainer(gpus=1, plugins='ddp_sharded', accelerator='gpu')
Trainer(gpus=1, plugins='ddp_sharded')  # implicit; assume GPU for ddp_sharded as it is the only supported accelerator

# Incorrect
Trainer(gpus=1, accelerator='ddp_sharded')  # should throw a misconfiguration exception
Trainer(gpus=1, plugins='ddp_sharded', accelerator='ddp')  # should throw a deprecation warning

# The Future

# Plugins can be TrainingTypePlugins, PrecisionPlugins, or whatever the user has provided.
# For precision, we don't do `Trainer(plugins='16bit_precision')` but instead `Trainer(precision=16)` 
# so why do we use the plugins flag for the training type?

Trainer(training_type='ddp', accelerator='gpu')
Trainer(training_type='ddp_spawn', accelerator='cpu')

# so Trainer(plugins=...) is left for cluster_environments, SLURM, ...
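To illustrate what would be left for the plugins flag, a hedged sketch (training_type is the flag proposed in this RFC, not an existing argument; SLURMEnvironment is assumed to be importable from pytorch_lightning.plugins.environments):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# plugins= now only carries cluster environments and similar add-ons,
# while the training strategy and hardware get their own flags.
trainer = Trainer(
    training_type="ddp",          # proposed in this RFC
    accelerator="gpu",
    gpus=8,
    plugins=[SLURMEnvironment()],
)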

Incompatible Plugin/Accelerator

This is something I haven't hashed out but let's say the user does:

Trainer(training_type='ddp_sharded', accelerator='tpu')

This combination is currently not supported. We should throw an exception of some sort, but this delves into whitelisting/blacklisting supported plugins, which could get a bit unwieldy. An alternative is to just assume that the user already knows enough about compatibility.
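As a sketch of what the exception path could look like (hypothetical helper and table, not Lightning code; the supported combinations follow the matrix carmocca posts below):

from pytorch_lightning.utilities.exceptions import MisconfigurationException

# Hypothetical table of supported accelerator -> training type combinations.
SUPPORTED_TRAINING_TYPES = {
    "cpu": {"single", "ddp_spawn", "horovod"},
    "gpu": {"single", "dp", "ddp", "ddp2", "ddp_spawn", "ddp_sharded", "ddp_sharded_spawn", "horovod"},
    "tpu": {"single", "ddp_spawn"},
}

def validate_combination(training_type: str, accelerator: str) -> None:
    # Raise early instead of failing deep inside the training loop.
    if training_type not in SUPPORTED_TRAINING_TYPES.get(accelerator, set()):
        raise MisconfigurationException(
            f"training_type='{training_type}' is not supported with accelerator='{accelerator}'"
        )

validate_combination("ddp_sharded", "tpu")  # raises MisconfigurationException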

Backwards compatibility

We allow the TrainingTypePlugin to be specified via the accelerator Trainer flag, but throw a warning that this will be deprecated in the future.
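A minimal sketch of that deprecation path, assuming a hypothetical helper inside the accelerator connector (names and warning text are made up for illustration):

import warnings

# Legacy values of `accelerator=` that really describe a training type.
_LEGACY_ACCELERATORS = ("dp", "ddp", "ddp2", "ddp_spawn", "ddp_cpu", "horovod")

def resolve_training_type(accelerator=None, training_type=None):
    if accelerator in _LEGACY_ACCELERATORS:
        warnings.warn(
            f"Passing `accelerator='{accelerator}'` to select the training type is deprecated;"
            " pass it via `training_type=` instead.",
            DeprecationWarning,
        )
        return accelerator if training_type is None else training_type
    return training_type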

cc @ananthsub @carmocca @awaelchli @justusschock @tchaton @Borda @kaushikb11 @williamFalcon

SeanNaren added the feature, help wanted, and design labels Feb 19, 2021
SeanNaren self-assigned this Feb 19, 2021
Borda (Member) commented Feb 19, 2021

just a quick remark, is this a typo?

Trainer(gpus=1, training_type='ddp_spawn', accelerator='cpu')

Borda added the discussion and Important labels Feb 19, 2021
Borda added this to the 1.3 milestone Feb 19, 2021
carmocca (Contributor) commented Feb 19, 2021

These are all the possibilities we currently have:

training_types = [
    "single",
    "dp",
    "ddp",
    "ddp2",
    "ddp_spawn",
    "ddp_sharded",
    "ddp_sharded_spawn",
    "horovod",
]
accelerators = ["cpu", "gpu", "tpu"]

import itertools
list(itertools.product(accelerators, training_types))
# comment indicates how it is currently set
[
    ("cpu", "single"),             # ✅ gpus=0
    ("cpu", "ddp"),                # 👎
    ("cpu", "ddp2"),               # 👎
    ("cpu", "ddp_spawn"),          # ✅ accelerator=ddp_cpu, num_processes=n
    ("cpu", "ddp_sharded"),        # 👎
    ("cpu", "ddp_sharded_spawn"),  # 👎
    ("cpu", "dp"),                 # 👎
    ("cpu", "horovod"),            # ✅ accelerator=horovod

    ("gpu", "single"),             # ✅ gpus=n
    ("gpu", "ddp"),                # ✅ accelerator=ddp, gpus=n
    ("gpu", "ddp2"),               # ✅ accelerator=ddp2, gpus=n
    ("gpu", "ddp_spawn"),          # ✅ accelerator=ddp_spawn, gpus=n
    ("gpu", "ddp_sharded"),        # ✅ accelerator=ddp, gpus=n, plugins=ddp_sharded
    ("gpu", "ddp_sharded_spawn"),  # ✅ accelerator=ddp, gpus=n, plugins=ddp_sharded_spawn
    ("gpu", "dp"),                 # ✅ accelerator=dp, gpus=n
    ("gpu", "horovod"),            # ✅ accelerator=horovod, gpus=n

    ("tpu", "single"),             # ✅ tpu_cores=n
    ("tpu", "ddp"),                # 👎
    ("tpu", "ddp2"),               # 👎
    ("tpu", "ddp_spawn"),          # ✅ accelerator=ddp_spawn, tpu_cores=n
    ("tpu", "ddp_sharded"),        # 👎
    ("tpu", "ddp_sharded_spawn"),  # 👎
    ("tpu", "dp"),                 # 👎
    ("tpu", "horovod"),            # 👎
]

(Might have made a mistake on some, feel free to edit and correct)

DeepSpeed, RPC, and RPC sequential are just plugins.

> We should throw an exception of some sort, but this delves into whitelisting/blacklisting supported plugins which could get a bit unwieldy

I don't think we have a choice. I wouldn't leave it to the user to understand how each of these combinations works.
Also, no user will be implementing their own training_type or accelerator, so it should be safe for us to decide which of these are allowed.

Also, we might want to evaluate standardizing how the number of devices is set. We currently have num_processes, gpus, and tpu_cores. They could be merged into a single devices or num_processes argument.
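To make that suggestion concrete, a hedged before/after sketch (devices here is the proposed merged flag, not an existing Trainer argument at this point):

# Today: the device count lives in a different argument per hardware type
Trainer(num_processes=8)   # CPU
Trainer(gpus=8)            # GPU
Trainer(tpu_cores=8)       # TPU

# Proposed: one flag for the count, one for the hardware type
Trainer(accelerator="cpu", devices=8)
Trainer(accelerator="gpu", devices=8)
Trainer(accelerator="tpu", devices=8)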

SeanNaren (Contributor, Author) commented

Thanks @carmocca!

Just to play devil's advocate, we've gotten this far without having to worry about blacklisting of any sort. Assuming the user knows what they are doing isn't as fatal as we're making it out to be, as most of these flags require checking the docs to see what is supported. Regardless of the decision on blacklisting/whitelisting, I think it should remain a separate endeavour from this API, but it is an important extension that the API should support!

I agree on standardising the devices! Definitely a separate case that can extend from this API

kaushikb11 (Contributor) commented

IMO, it shouldn't be necessary for the user to pass the accelerator flag and be explicit.

Trainer(training_type='ddp', accelerator='gpu')
Trainer(training_type='ddp_spawn', accelerator='cpu')

Just passing gpus=n or tpu_cores=n should be sufficient, with the CPU accelerator being the default.
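Roughly, the inference being described could look like this (a hypothetical helper for illustration, not the actual accelerator connector logic):

def infer_accelerator(gpus=None, tpu_cores=None) -> str:
    # Pick the accelerator from whichever device-count flag was set,
    # falling back to CPU when nothing is passed.
    if tpu_cores:
        return "tpu"
    if gpus:
        return "gpu"
    return "cpu"

assert infer_accelerator(gpus=2) == "gpu"
assert infer_accelerator(tpu_cores=8) == "tpu"
assert infer_accelerator() == "cpu"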

carmocca (Contributor) commented

> IMO, it shouldn't be necessary for the user to pass the accelerator flag and be explicit.

The downside of what you are saying (the current implementation) is that if you want to switch between accelerators, you cannot do it by passing just a different parameter value (accelerator='cpu'|'gpu', num_processes=5); instead, you have to use a different argument entirely (num_processes=5 vs gpus=5), which is "harder" programmatically.

This can be confusing for users who don't understand the difference between gpus and num_processes or who might think they need to set both flags. Doing the proposed split would clear this up.
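For example, under the proposed split the same script could sweep hardware types by changing a single value (a sketch of the proposal, not current behaviour; model stands in for whatever LightningModule is being trained):

for accelerator in ("cpu", "gpu"):
    trainer = Trainer(accelerator=accelerator, num_processes=5)
    trainer.fit(model)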

justusschock (Member) commented

I also agree with @carmocca that having something like num_processes (or maybe num_devices, since not all backends use multiple processes) together with a flag that indicates the hardware type is cleaner.
This would be a major API change, though.

kaushikb11 (Contributor) commented

That's one case. What about the cases where users pass (accelerator='cpu'|'gpu'|'tpu', gpus=5), or where they only pass gpus=n or tpu_cores=n? Should those flags just be ignored? I think they are a good enough signal to select the accelerator. I feel it's already a bit messy with multiple flags.
Also, for example, to train on TPUs it would be unnecessary to pass both accelerator="tpu" and tpu_cores=n every time, rather than just the latter.

justusschock (Member) commented

But if you have your trainer for GPUs like this:
Trainer(gpus=8)
and want to switch to TPU, you have to make two changes: one to change the value of gpus and one to set tpu_cores. With the new API, you would only have to change accelerator from 'gpu' to 'tpu'.

In the deprecation phase, I think we have to use whatever is provided by the current gpus or tpu_cores arguments though.

Another thing this would enable is something like accelerator='auto', num_processes='auto', where we do smart device detection ourselves, i.e. we check what's available and use the most capable hardware we find (TPUs > GPUs > CPUs).
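A rough sketch of what that detection could look like (purely illustrative; accelerator='auto' does not exist yet, and the TPU check here just tests whether torch_xla is importable):

import torch

def detect_accelerator() -> str:
    # Prefer the most capable hardware that is available: TPU > GPU > CPU.
    try:
        import torch_xla  # noqa: F401 -- only importable in TPU environments
        return "tpu"
    except ImportError:
        pass
    if torch.cuda.is_available():
        return "gpu"
    return "cpu"

# Hypothetical usage: Trainer(accelerator=detect_accelerator(), num_processes='auto')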

kaushikb11 self-assigned this Jun 1, 2021
awaelchli (Contributor) commented Jun 1, 2021

In case you are going for this approach, I vote for a devices argument which works like the gpus argument today (int, list, string), and for removing num_processes instead.

Also, I vote for providing good defaults where possible; i.e., I don't see training_type as a mandatory field.

carmocca (Contributor) commented

@kaushikb11 can you comment on the status of this issue and what the next steps would be?

carmocca (Contributor) commented Oct 14, 2021

TODOs I could think of:

cc @kaushikb11 @tchaton @awaelchli

carmocca (Contributor) commented

Closing this - tracking the previous comment in #9932
