Maintaining the Trainer constructor over time #9006

ananthsub · 2021-08-20T02:20:12Z

I wanted to raise this discussion based on observations of going through the trainer constructor, the number of arguments it currently contains, and why this list has continually grown. I think this is representative of a number of other components in the framework as it has grown over time (hooks added, args added, etc).

I've now filed several issues around how we can deprecate arguments in the trainer constructor.

More often than not:

There's already an existing argument which these could be folded into (e.g. use the callbacks argument instead of creating a new flag). I think we should update the guidelines here to state that adding a flag to the trainer is not cheap: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CONTRIBUTING.md#force-user-decisions-to-best-practices
The argument could belong to another component (e.g. the LightningModule, DataModule, etc)

From discussions with @awaelchli , some patterns emerge:

The trainer is also partially a command line interface where users can interact with it without code changes. By supporting a flat list of primitive type arguments, it lends itself well to configuration tools like argparsers for command-line based invocation and script execution. By asking users to instantiate objects, we require them to write some additional code.
Psychologically, I think another reason this argument list grows is that it's already so big, so people feel less bad about adding one more line. What's 56 vs 57 arguments anyways?

My thoughts are:

Configuration should never be the determining factor for the API design. If it is, our API is limited by the bounds of the configuration system. We lose out on a lot of flexibility and features if we restrict how we can write code, or how we expect users to instantiate the Trainer. This is fundamentally a problem for configuration tools, such as the LightningCLI/jsonargparse and Hydra, to solve. Ideally, the trainer (or any Lightning component) should not take any dependence on any config system or assume that a user is even using a config system.
Unchecked, this list will continue to grow. The longer this list grows, the harder it is for users to onboard. Who has the patience to read through all these arguments anyway? If this is part of people's getting-started journey with Lightning, how many users do we lose here in the funnel because they decide it's easier to write their own training program?
The longer this list grows, the harder it is for us to make changes over time, especially with seemingly redundant flags, or with arguments which nobody has context over anymore. Deprecation cycles are painful for both developers & users. The project slows down in the meanwhile because the framework has to support both old and new paths at the same time, which makes the internal codebase much more complex.
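To make the first point concrete, here is a minimal sketch of keeping configuration at the edge of the program. `Trainer`, `EarlyStopping`, and `build_trainer_from_cli` are hypothetical stand-ins written for this example, not the Lightning classes. The only assumption is that the constructor accepts ordinary Python objects and that a separate adapter translates flat command-line flags into them.

```python
import argparse
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EarlyStopping:
    """Hypothetical stand-in for a callback object."""
    monitor: str = "val_loss"
    patience: int = 3


@dataclass
class Trainer:
    """Hypothetical trainer that knows nothing about argparse or config files."""
    max_epochs: Optional[int] = None
    callbacks: List[object] = field(default_factory=list)


def build_trainer_from_cli(argv: List[str]) -> Trainer:
    """Adapter layer: flat primitive flags in, composed objects out."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-epochs", type=int, default=None)
    parser.add_argument("--early-stopping-patience", type=int, default=None)
    args = parser.parse_args(argv)

    callbacks: List[object] = []
    if args.early_stopping_patience is not None:
        callbacks.append(EarlyStopping(patience=args.early_stopping_patience))
    return Trainer(max_epochs=args.max_epochs, callbacks=callbacks)


trainer = build_trainer_from_cli(
    ["--max-epochs", "5", "--early-stopping-patience", "2"]
)
```

In this arrangement the flat flag list lives in the adapter, so it can grow or change without touching the constructor signature.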

This is based on my prior experience working on PyTorch training frameworks. The codebase got so complex that the decision was made to do a full rewrite, which is the most painful project of all. One of my motivations for switching to lightning was its relative simplicity at the time. I'd really like to focus more on how we can simplify the framework!

@PyTorchLightning/core-contributors I'd love your input here as well

justusschock · 2021-08-20T08:56:42Z

justusschock
Aug 20, 2021
Maintainer

@ananthsub I fully agree here.

I think many of those arguments can actually be removed by forcing users to pass in the respective callback for that, since in many cases the argument is just a shortcut for the callback instantiation. This means we can unbloat the trainer interface and also make things more explicit.
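As a rough illustration of "the argument is just a shortcut for the callback instantiation", the sketch below contrasts the two styles. All classes here are hypothetical toys written for this example (they only borrow the `progress_bar_refresh_rate` name from the list below); they are not the Lightning implementation.

```python
from typing import List, Optional


class Callback:
    """Hypothetical base class for this sketch."""


class ProgressBar(Callback):
    def __init__(self, refresh_rate: int = 1) -> None:
        self.refresh_rate = refresh_rate


class FlagTrainer:
    """Shortcut style: the flag exists only to build a callback internally."""

    def __init__(self, progress_bar_refresh_rate: Optional[int] = None) -> None:
        self.callbacks: List[Callback] = []
        if progress_bar_refresh_rate is not None:
            self.callbacks.append(ProgressBar(refresh_rate=progress_bar_refresh_rate))


class ExplicitTrainer:
    """Explicit style: the constructor only accepts ready-made callbacks."""

    def __init__(self, callbacks: Optional[List[Callback]] = None) -> None:
        self.callbacks: List[Callback] = list(callbacks or [])


# Both trainers end up with the same callback; only the entry point differs.
implicit = FlagTrainer(progress_bar_refresh_rate=20)
explicit = ExplicitTrainer(callbacks=[ProgressBar(refresh_rate=20)])
```

The explicit style costs one import and one constructor call, but every new feature of this kind adds a class rather than another constructor argument.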

With the following approach (which is just my opinion) we would be down from 60 trainer init args to ~ 15 core args and some others added to the respective trainer entrypoints (since they are only valid for that specific function)

Detailed list of what should happen with each argument IMO

From the list of the current trainer init args:

logger: Union[LightningLoggerBase, Iterable[LightningLoggerBase], bool] = True -> should stay
checkpoint_callback: bool = True -> could possibly be removed. Are there cases one does not want to checkpoint? Would it be sufficient to just point it to /tmp in cases like that?
callbacks: Optional[Union[List[Callback], Callback]] = None -> definitely has to stay
default_root_dir: Optional[str] = None, -> Has to stay for implicit checkpointing; could be removed if we decide users always need to insert a checkpointing callback and a logger explicitly (not advisable IMO). Depending on arg handling we could remove weights_save_path for that.
gradient_clip_val: float = 0.0, -> Should this really be part of the trainer API? Maybe this could be realized as a callback as well?
gradient_clip_algorithm: str = "norm", -> same as above
process_position: int = 0, -> Can be removed by manually passing progressbar callback if necessary
num_nodes: int = 1, -> Has to stay
num_processes: int = 1, -> will be replaced by devices below?!
devices: Optional[Union[List[int], str, int]] = None, -> keep that
gpus: Optional[Union[List[int], str, int]] = None, -> remove it in favor of devices and accelerator='gpu' combination
auto_select_gpus: bool = False, -> Could be removed by accelerator='auto' and devices=N combination (would select the accelerator type and also choose gpus if necessary)
tpu_cores: Optional[Union[List[int], str, int]] = None, -> could be removed by accelerator='tpu' and devices=N combination
ipus: Optional[int] = None, -> could be removed by accelerator='ipu' and devices=N combination
log_gpu_memory: Optional[str] = None, -> can be removed by passing GpuStatsMonitor callback
progress_bar_refresh_rate: Optional[int] = None, -> can be removed by passing progress bar callback
overfit_batches: Union[int, float] = 0.0, -> remove in favor of explicit limit_batches
track_grad_norm: Union[int, float, str] = -1, -> Can this be realized by a callback (together with the clipping args)?
check_val_every_n_epoch: int = 1, -> has to stay
fast_dev_run: Union[int, bool] = False, -> has to stay
accumulate_grad_batches: Union[int, Dict[int, int], List[list]] = 1, -> Can be removed by passing in the scheduling callback explicitly
max_epochs: Optional[int] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
min_epochs: Optional[int] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
max_steps: Optional[int] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
min_steps: Optional[int] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
max_time: Optional[Union[str, timedelta, Dict[str, int]]] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
limit_train_batches: Union[int, float] = 1.0, -> Can this be moved to the fit function?
limit_val_batches: Union[int, float] = 1.0, -> can this be moved to the fit function? And as limit_batches to the validate?
limit_test_batches: Union[int, float] = 1.0, -> can this be moved to the test function as limit_batches?
limit_predict_batches: Union[int, float] = 1.0, -> can this be moved to the predict function as limit_batches?
val_check_interval: Union[int, float] = 1.0, -> Only relevant for fit -> move it there?
flush_logs_every_n_steps: int = 100, -> Move to loggers?
log_every_n_steps: int = 50, -> Move to loggers?
accelerator: Optional[Union[str, Accelerator]] = None, -> has to stay
sync_batchnorm: bool = False, -> pass in plugins explicitly?
precision: int = 32, -> should stay since plugin selection is highly coupled with other plugins
weights_summary: Optional[str] = "top", -> remove that in favor of utility function / add as a callback
weights_save_path: Optional[str] = None, -> can be removed (see above)
num_sanity_val_steps: int = 2, -> only relevant for fit -> move it there
resume_from_checkpoint: Optional[Union[Path, str]] = None, -> Should stay
profiler: Optional[Union[BaseProfiler, str]] = None, -> should stay
benchmark: bool = False, -> just sets a torch flag, can be removed here. Not sure if necessary to replace this at all
deterministic: bool = False, -> just sets a torch flag, can be removed here. Not sure if necessary to replace this at all
reload_dataloaders_every_n_epochs: int = 0, -> Only relevant for fit -> move it there?
reload_dataloaders_every_epoch: bool = False, -> can be removed in favor of reload_dataloaders_every_n_epochs
auto_lr_find: Union[bool, str] = False, -> can stay (not sure how much it is used though)
replace_sampler_ddp: bool = True, -> should stay
terminate_on_nan: bool = False, -> usually only relevant for training, move to fit?
auto_scale_batch_size: Union[str, bool] = False, -> should stay
prepare_data_per_node: bool = True, -> Should stay
plugins: Optional[Union[List[Union[Plugin, ClusterEnvironment, str]], Plugin, ClusterEnvironment, str]] = None, -> has to stay
amp_backend: str = "native", -> Can we remove this and always assume native amp backend? Since apex isn't developed any longer, it should be fine for us to always assume amp to be native if people don't pass in the apex plugin
amp_level: str = "O2", -> would be removed due to the logic above
distributed_backend: Optional[str] = None, -> Remove this in favor of explicit plugin passing
move_metrics_to_cpu: bool = False, -> Keep this
multiple_trainloader_mode: str = "max_size_cycle", -> Only for fit, move it there? @ananthsub mentioned something about other tools for loader combination they could eventually provide?
stochastic_weight_avg: bool = False, -> remove in favor of callback

On the other hand, I know that one of the main concerns was to have everything easily accessible for the user. However, I strongly think that it's more of a downside now since it becomes overwhelming.

Maybe we could have a function that takes all these arguments and provides condensed trainer arguments (including callbacks etc.) for users instead?

directly tagging @tchaton @awaelchli @SeanNaren @carmocca @kaushikb11 :)

EDIT:
I'm fully aware that those would include major breaking changes, which is why it is likely not to happen like this before a new major release. And even then we probably need to provide the mentioned function as a compatibility layer.
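A compatibility function of the kind mentioned above could look roughly like the following sketch. The function name `condense_trainer_kwargs` and the two callback classes are hypothetical and exist only for this example; the flag names are taken from the list above, and only three of them are handled to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ProgressBar:
    """Hypothetical stand-in for a progress bar callback."""
    refresh_rate: int = 1
    process_position: int = 0


@dataclass
class StochasticWeightAveraging:
    """Hypothetical stand-in for an SWA callback."""


def condense_trainer_kwargs(**legacy: Any) -> Dict[str, Any]:
    """Translate legacy flat flags into a reduced argument set plus callbacks."""
    kwargs = dict(legacy)
    callbacks: List[Any] = list(kwargs.pop("callbacks", None) or [])

    # Progress bar flags become one explicit callback.
    refresh_rate = kwargs.pop("progress_bar_refresh_rate", None)
    position = kwargs.pop("process_position", 0)
    if refresh_rate is not None:
        callbacks.append(ProgressBar(refresh_rate=refresh_rate, process_position=position))

    # A boolean feature flag becomes the presence of a callback.
    if kwargs.pop("stochastic_weight_avg", False):
        callbacks.append(StochasticWeightAveraging())

    kwargs["callbacks"] = callbacks
    return kwargs


condensed = condense_trainer_kwargs(
    max_epochs=3, progress_bar_refresh_rate=10, stochastic_weight_avg=True
)
```

Arguments the function does not recognize pass through unchanged, so existing call sites could be migrated one flag at a time.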

3 replies

carmocca Aug 20, 2021

checkpoint_callback: bool = True -> could possibly be removed. Are there cases one does not want to checkpoint? Would it be sufficient to just point it to /tmp in cases like that?

Any time you are trying things out, it's likely you don't want to checkpoint. In fact, this is part of what fast_dev_run does, but sometimes you are trying things out for longer.
If the argument was removed, we'd need an alternative way to disable checkpointing, which isn't so straightforward considering the callback gets added to callbacks dynamically.
Another option would be to remove it AND not create it automatically, thus requiring users to instantiate it and pass it as any other callback.
Personally leaning towards keeping it.

gradient_clip_val: float = 0.0, -> Should this really be part of the trainer API? Maybe this could be realized as a callback as well?
gradient_clip_algorithm: str = "norm", -> same as above

These should be part of the model: #6346

gpus: Optional[Union[List[int], str, int]] = None, -> remove it in favor of devices and accelerator='gpu' combination
tpu_cores: Optional[Union[List[int], str, int]] = None, -> could be removed by accelerator='tpu' and devices=N combination
...
Regarding all these, we might have to keep them forever and just map them to the associated accelerator/devices combination. Perhaps raising warnings if both are specified and mentioning it in the docs. This way both people familiar and unfamiliar with how accelerator/devices work are content.
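A minimal sketch of such a mapping is shown below. The helper name `resolve_accelerator` is hypothetical, and the precedence rule (prefer accelerator/devices and warn when both styles are given) is an assumption made for this example rather than agreed behaviour.

```python
import warnings
from typing import List, Optional, Tuple, Union

DeviceSpec = Union[int, str, List[int]]


def resolve_accelerator(
    accelerator: Optional[str] = None,
    devices: Optional[DeviceSpec] = None,
    gpus: Optional[DeviceSpec] = None,
    tpu_cores: Optional[DeviceSpec] = None,
    ipus: Optional[int] = None,
) -> Tuple[Optional[str], Optional[DeviceSpec]]:
    """Map legacy device flags onto an (accelerator, devices) pair."""
    legacy = {"gpu": gpus, "tpu": tpu_cores, "ipu": ipus}
    given = {name: value for name, value in legacy.items() if value is not None}

    if not given:
        return accelerator, devices
    if len(given) > 1:
        raise ValueError(f"Conflicting legacy device flags: {sorted(given)}")

    ((legacy_accelerator, legacy_devices),) = given.items()
    if accelerator is not None or devices is not None:
        # Assumed policy: the new-style arguments win, and the user is told so.
        warnings.warn(
            "Both a legacy device flag and accelerator/devices were given; "
            "preferring accelerator/devices.",
            UserWarning,
        )
        return (
            accelerator or legacy_accelerator,
            devices if devices is not None else legacy_devices,
        )
    return legacy_accelerator, legacy_devices
```

For example, `resolve_accelerator(gpus=2)` yields the same pair as `resolve_accelerator(accelerator="gpu", devices=2)`, so the rest of the trainer would only ever see the new-style arguments.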

accumulate_grad_batches: Union[int, Dict[int, int], List[list]] = 1, -> Can be removed by passing in the scheduling callback explicitly

I think this needs to stay - the fact that it is a callback internally is just how we implemented it. But doesn't necessarily mean they need to know about it.

overfit_batches: Union[int, float] = 0.0, -> remove in favor of explicit limit_batches

But overfit_batches works differently from limit_batches. If limit_batches was passed to fit/validate/test, setting shuffle=False automatically would still be the difference

max_epochs: Optional[int] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
min_epochs: Optional[int] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
max_steps: Optional[int] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
min_steps: Optional[int] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?
max_time: Optional[Union[str, timedelta, Dict[str, int]]] = None, -> Can we generalize the early-stopping callback to a general stopping condition and maybe move this there?

Now that we have support for multiple callbacks of the same type (cc: @awaelchli), this could be 3 different early-stopping callbacks (epochs, steps, time).

Although this is tricky because some of these refer to how loops should run and others to how early stopping should run - which isn't quite the same. For example, min_steps could make sense for the loops even if EarlyStopping is not present.

resume_from_checkpoint: Optional[Union[Path, str]] = None, -> Should stay

This should be moved to fit and be renamed ckpt_path, just as for validate/test/predict cc: @SeanNaren

All others I mostly agree with. Thanks for reviewing all @justusschock!

justusschock Aug 20, 2021
Maintainer

accumulate_grad_batches: Union[int, Dict[int, int], List[list]] = 1, -> Can be removed by passing in the scheduling callback explicitly

I think this needs to stay - the fact that it is a callback internally is just how we implemented it. But doesn't necessarily mean they need to know about it.

@carmocca For now that's just implemented like a callback internally. But if we decide to elevate callbacks to first-class objects in Lightning (currently they are mainly for advanced customization and users usually disregard them), I think most of the flags related to callbacks (e.g. the gradient accumulation ones) could be removed. I think it is more intuitive to explicitly opt in to things by providing the callbacks than to assume it all magically works and then have to dig into why it doesn't when it doesn't :D

gpus: Optional[Union[List[int], str, int]] = None, -> remove it in favor of devices and accelerator='gpu' combination
tpu_cores: Optional[Union[List[int], str, int]] = None, -> could be removed by accelerator='tpu' and devices=N combination
...

Regarding all these, we might have to keep them forever and just map them to the associated accelerator/devices combination. Perhaps raising warnings if both are specified and mentioning it in the docs. This way both people familiar and unfamiliar with how accelerator/devices work are content.

I disagree here. If new users are not familiar with it and they know they just need to provide those flags it will be fine, they don't need to know about internals. I just want to avoid that when we add something new (like we did with ipu) we will get yet another flag (or even multiple in the worst case) just due to that. Also it means more changes if you just want to switch between different kind of accelerators.

Now that we have support for multiple callbacks of the same type (cc: @awaelchli), this could be 3 different early-stopping callbacks (epochs, steps, time).

Although this is tricky because some of these refer to how loops should run and others to how early stopping should run - which isn't quite the same. For example, min_steps could make sense for the loops even if EarlyStopping is not present.

This is why I said, we could think about elevating it to a general Stopping Criterion (and early stopping just being a very specific implementation of that). So by default the trainer would just run infinitely and the loops could check the respective stopping criterion (which could also include things like not stopping until min_epochs is met)
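The general stopping criterion idea can be sketched as follows. Everything here (`LoopState`, `StoppingCriterion`, `MaxEpochs`, `MaxSteps`, `AfterMinSteps`, `run_loop`) is a hypothetical name invented for this example, and the toy loop only counts steps instead of training anything.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class LoopState:
    epoch: int = 0
    step: int = 0


class StoppingCriterion(Protocol):
    def should_stop(self, state: LoopState) -> bool: ...


@dataclass
class MaxEpochs:
    max_epochs: int

    def should_stop(self, state: LoopState) -> bool:
        return state.epoch >= self.max_epochs


@dataclass
class MaxSteps:
    max_steps: int

    def should_stop(self, state: LoopState) -> bool:
        return state.step >= self.max_steps


@dataclass
class AfterMinSteps:
    """Wraps another criterion so it cannot fire before `min_steps`."""
    min_steps: int
    inner: StoppingCriterion

    def should_stop(self, state: LoopState) -> bool:
        return state.step >= self.min_steps and self.inner.should_stop(state)


def run_loop(criteria: List[StoppingCriterion], steps_per_epoch: int) -> LoopState:
    """Run until any criterion fires; with no criteria this never returns."""
    state = LoopState()
    while not any(c.should_stop(state) for c in criteria):
        state.step += 1  # a real loop would run one training step here
        if state.step % steps_per_epoch == 0:
            state.epoch += 1
    return state


final = run_loop([MaxEpochs(2), MaxSteps(7)], steps_per_epoch=5)
```

In this sketch early stopping on a monitored metric would be one more class implementing `should_stop`, and a minimum such as `min_steps` is expressed by wrapping another criterion rather than by a separate trainer flag.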

ananthsub Aug 20, 2021
Author

checkpoint_callback: bool = True -> could possibly be removed. Are there cases one does not want to checkpoint? Would it be sufficient to just point it to /tmp in cases like that?

In addition to what @carmocca said, disabling this in unit tests is another use case. When Lightning is used in a large organization's codebase, forgetting to disable logging and checkpointing can result in lots of slow and flaky tests because of writing large models or saving metrics to disk. For the vast majority of tests, I do not want to enable TensorBoard or checkpointing, so I need to specify logger=False and checkpoint_callback=False everywhere.
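One way to avoid repeating those two flags in every test is a small shared helper, sketched below. The helper name `quiet_trainer_kwargs` is hypothetical; `logger`, `checkpoint_callback`, and `max_epochs` are the constructor arguments discussed in this thread.

```python
from typing import Any, Dict


def quiet_trainer_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Trainer keyword arguments for tests: nothing is written to disk by default."""
    kwargs: Dict[str, Any] = {"logger": False, "checkpoint_callback": False}
    kwargs.update(overrides)  # individual tests can still opt back in
    return kwargs


# A test would then build its trainer as Trainer(**quiet_trainer_kwargs(max_epochs=1)).
test_kwargs = quiet_trainer_kwargs(max_epochs=1)
```

This only centralizes the defaults; it does not remove the need for a way to switch checkpointing off in the first place.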

carmocca · 2021-08-20T12:26:04Z

Thanks for the thoughtful arguments and overview @ananthsub ❤️

To give some more context, the Trainer reached its current state due to the following decisions:

The users should only need to interact with one single object (the Trainer) to minimize the API overhead when adapting it to the user's needs.
The interaction should be easy and straightforward - available with Python primitives.
If a flag could be useful for 25-50% of the users, then it was good to be added. With "useful" I mean that they might want to use it, not that 25-50% of users will have modified it.

This number of Trainer's __init__ flags and LightningModule's hooks is the result of minimizing the number of classes instantiated/handled/learned by the users.

I see this as a trade-off between increased usability for general users vs good engineering design and practices. A side effect of the latter can actually be increased usability for power and experienced users (all of us having this discussion).

As a power user myself, I find your approach much more satisfying, but I want for us to see the argument both ways before moving forward.

2 replies

justusschock Aug 20, 2021
Maintainer

If a flag could be useful for 25-50% of the users, then it was good to be added. With "useful" I mean that they might want to use it, not that 25-50% of users will have modified it.

IMO due to the growing number of options that is kind of outdated. So at least (if we decide to not do anything about existing arguments) we have to reformulate that rule :)

The users should only need to interact with one single object (the Trainer) to minimize the API overhead when adapting it to the user's needs.

But if we provide sensible defaults, this would be easy. We could also allow the name of the callback inside the callback list when it is used with default arguments. With the class, the only overhead is the import, which is really minimal.
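Accepting callback names could work roughly as in the sketch below. The registry, the string keys, and `resolve_callbacks` are hypothetical and written only for this example; the classes are empty stand-ins rather than the Lightning callbacks.

```python
from typing import Callable, Dict, List, Union


class Callback:
    """Hypothetical base class for this sketch."""


class ProgressBar(Callback):
    pass


class ModelCheckpoint(Callback):
    pass


# Hypothetical registry: a name maps to a factory that uses default arguments.
CALLBACK_REGISTRY: Dict[str, Callable[[], Callback]] = {
    "progress_bar": ProgressBar,
    "model_checkpoint": ModelCheckpoint,
}


def resolve_callbacks(specs: List[Union[str, Callback]]) -> List[Callback]:
    """Turn a mixed list of names and instances into callback instances."""
    resolved: List[Callback] = []
    for spec in specs:
        if isinstance(spec, str):
            try:
                factory = CALLBACK_REGISTRY[spec]
            except KeyError:
                raise ValueError(f"Unknown callback name: {spec!r}") from None
            resolved.append(factory())
        else:
            resolved.append(spec)  # already an instance, keep the user's settings
    return resolved


custom = ModelCheckpoint()
callbacks = resolve_callbacks(["progress_bar", custom])
```

Names cover the default-argument case with primitives only, while anyone who needs non-default settings passes an instance instead.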

ananthsub Aug 22, 2021
Author

I see this as a trade-off between increased usability for general users vs good engineering design and practices. A side effect of the latter can actually be increased usability for power and experienced users (all of us having this discussion). As a power user myself, I find your approach much more satisfying, but I want for us to see the argument both ways before moving forward.

Expanding on my post above:
It feels like most of our feature requests come from existing users - this can turn into a cycle where we support existing users really well, and potentially overfit to these use cases. This is entirely natural because these are the use cases we have good visibility into already and where there's community contribution.

I am really curious about what might structurally or perceptually prevent new users, especially power users, from adopting Lightning today. @edenafek

I think the good engineering design & principle, especially at the interfaces users interact with, will ultimately enable new users who couldn't or didn't want to use the library before without sacrificing the core values of the project. If the implementation internal to the framework is not as clean in some spots, this is a secondary concern to me as we should be able to independently make improvements there over time. Anything that's end-user facing should take priority.

My aim with these re-organization projects is to:

Better prepare Lightning to enable those users who couldn't or didn't want to use the library before (without sacrificing the core values of the project!). IMO, we should prioritize the power users' experience because their use cases become the general users' use cases down the line.
Ensure the framework remains quickly adaptable if/when some new paradigm emerges. The more baggage we carry, the harder this becomes.

But I definitely want to see other arguments as well!

awaelchli · 2021-08-21T00:11:47Z

This is based on my prior experience working on PyTorch training frameworks. The codebase got so complex that the decision was made to do a full rewrite, which is the most painful project of all. One of my motivations for switching to lightning was its relative simplicity at the time. I'd really like to focus more on how we can simplify the framework!

@PyTorchLightning/core-contributors we should keep this in mind and learn from these experiences. If any of you made similar observations in past projects, share your learnings <3
