
[RFC] Move Trainer's loop-affecting arguments to fit, validate, test, and predict #10444

Closed
ananthsub opened this issue Nov 10, 2021 · 21 comments
Labels: deprecation · design · refactor

Comments

@ananthsub (Contributor) commented Nov 10, 2021

Proposed refactoring or deprecation

Background

We are auditing the Lightning components and APIs to assess opportunities for improvement.

This issue aims to address how loop conditions are specified when using the Lightning Trainer.

Motivation

  1. Address surprising effects when using the Lightning Trainer.
    Can a new user tell how long the test run below lasts? One could read this and naturally think it runs for 10 steps. Only by diving into the documentation does one realize that this argument doesn't affect the test run at all!
trainer = Trainer(max_steps=10)
trainer.test(model)
  2. Another example is calling fit repeatedly with the same Trainer object (Missing cleanup after trainer.fit() and trainer.test() #4385). This re-design would make the trainer more reusable: users shouldn't need to re-instantiate the trainer if they only want to kick off a new run with different stopping conditions or loop settings.

  3. Address maintenance concerns with the Trainer constructor over time: Maintaining the Trainer constructor over time #9006.

    What happens if we need to add a new trainer function? The current design & practice would mean replicating all of these args again on the Trainer. This makes discovering features of the core trainer harder with more flags added.

  4. We recently underwent a similar move for ckpt_path: resume_from_checkpoint took effect only during fit, whereas ckpt_path was a function argument to validate, test, and predict. Having two different ways to specify the path to load training state from was really confusing (sketched briefly below). [checkpoint] Resolve 2 different checkpoint loading paths across fit vs validate/test/predict #9405
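For reference, a minimal sketch of the two pre-#9405 code paths versus the unified one (illustrative only; the checkpoint paths are made up):

trainer = Trainer(resume_from_checkpoint="last.ckpt")   # old: only affected fit
trainer.fit(model)
trainer.validate(model, ckpt_path="best.ckpt")          # old: eval entrypoints took a function argument

# after #9405: ckpt_path is a function argument everywhere
trainer = Trainer()
trainer.fit(model, ckpt_path="last.ckpt")
trainer.test(model, ckpt_path="best.ckpt")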

I've grouped the constructor arguments here based on which trainer function/loop they take effect in. Critically, almost all of these arguments do not apply to all loops. (An illustrative example follows the lists below.)

  1. Fit loop
  • check_val_every_n_epoch: integer controlling how often validation runs: after every N training epochs
  • max_epochs: number of training epochs to run before stopping
  • min_epochs: force this number of training epochs to complete before stopping
  • max_steps: maximum number of training steps to run before stopping
  • min_steps: minimum number of training steps to run before stopping
  • max_time: stop training after the specified time has elapsed (in case the number of steps/epochs isn't known in advance)
  • num_sanity_val_steps: number of validation batches to run as a sanity check before starting the training routine
  • val_check_interval: how often to check the validation set; supports either a fraction of the training epoch or a fixed number of training batches
  • limit_train_batches: how much of the associated dataset to iterate through per epoch. If a float is passed, Lightning treats it as a fraction (use X% of the dataset); if an integer is passed, Lightning treats it as a fixed number of batches.
  • limit_val_batches: same as limit_train_batches but for val_dataloader
  • reload_dataloaders_every_n_epochs: reloads dataloaders during fitting (there's a separate issue for deprecating this off the Trainer entirely in favor of the data hooks): Move reload_dataloaders_every_n_epochs to the DataHooks class #8738
  2. Validate loop
  • limit_val_batches: same as limit_train_batches but for val_dataloader
  3. Test loop
  • limit_test_batches: same as limit_train_batches but for test_dataloader
  4. Predict loop
  • limit_predict_batches: same as limit_train_batches but for predict_dataloader

Applies to all

  • fast_dev_run: Runs n steps of train/val/test to find bugs. A form of canarying the run (fast e2e check)
  • overfit_batches: applies to fit/val/test but not predict (should this only apply to training?)
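For illustration, here is roughly how these flags are specified today: everything goes through the constructor, even though each flag affects only some of the entrypoints (values are arbitrary):

trainer = Trainer(
    max_epochs=10,            # only affects trainer.fit
    limit_val_batches=0.25,   # affects validation during fit and trainer.validate
    limit_test_batches=100,   # only affects trainer.test
)
trainer.fit(model)
trainer.test(model)   # max_epochs has no effect on this call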

Pitch

  1. Add the corresponding arguments to the trainer function they take effect in. Concretely:

fit

Before:

def fit(
    self,
    model: "pl.LightningModule",
    train_dataloaders: Optional[Union[TRAIN_DATALOADERS, LightningDataModule]] = None,
    val_dataloaders: Optional[EVAL_DATALOADERS] = None,
    datamodule: Optional[LightningDataModule] = None,
    ckpt_path: Optional[str] = None,
):

After:

def fit(
    self,
    model: "pl.LightningModule",
    train_dataloaders: Optional[Union[TRAIN_DATALOADERS, LightningDataModule]] = None,
    val_dataloaders: Optional[EVAL_DATALOADERS] = None,
    datamodule: Optional[LightningDataModule] = None,
    ckpt_path: Optional[str] = None,
    fast_dev_run: Union[int, bool] = False,
    max_epochs: Optional[int] = None,
    min_epochs: Optional[int] = None,
    max_steps: int = -1,
    min_steps: Optional[int] = None,
    max_time: Optional[Union[str, timedelta, Dict[str, int]]] = None,
    limit_train_batches: Union[int, float] = 1.0,
    limit_val_batches: Union[int, float] = 1.0,
    num_sanity_val_steps: int = 2,
    val_check_interval: Union[int, float] = 1.0,
    overfit_batches: Union[int, float] = 0.0,
):

validate

Before:

def validate(
    self,
    model: Optional["pl.LightningModule"] = None,
    dataloaders: Optional[Union[EVAL_DATALOADERS, LightningDataModule]] = None,
    ckpt_path: Optional[str] = None,
    verbose: bool = True,
    datamodule: Optional[LightningDataModule] = None,
) -> _EVALUATE_OUTPUT:

After:

def validate(
    self,
    model: Optional["pl.LightningModule"] = None,
    dataloaders: Optional[Union[EVAL_DATALOADERS, LightningDataModule]] = None,
    ckpt_path: Optional[str] = None,
    verbose: bool = True,
    datamodule: Optional[LightningDataModule] = None,
    fast_dev_run: Union[int, bool] = False,
    limit_batches: Union[int, float] = 1.0,
) -> _EVALUATE_OUTPUT:

test

Before:

def test(
    self,
    model: Optional["pl.LightningModule"] = None,
    dataloaders: Optional[Union[EVAL_DATALOADERS, LightningDataModule]] = None,
    ckpt_path: Optional[str] = None,
    verbose: bool = True,
    datamodule: Optional[LightningDataModule] = None,
) -> _EVALUATE_OUTPUT:

After:

def test(
    self,
    model: Optional["pl.LightningModule"] = None,
    dataloaders: Optional[Union[EVAL_DATALOADERS, LightningDataModule]] = None,
    ckpt_path: Optional[str] = None,
    verbose: bool = True,
    datamodule: Optional[LightningDataModule] = None,
    fast_dev_run: Union[int, bool] = False,
    limit_batches: Union[int, float] = 1.0,
) -> _EVALUATE_OUTPUT:

predict

Before:

def predict(
    self,
    model: Optional["pl.LightningModule"] = None,
    dataloaders: Optional[Union[EVAL_DATALOADERS, LightningDataModule]] = None,
    datamodule: Optional[LightningDataModule] = None,
    return_predictions: Optional[bool] = None,
    ckpt_path: Optional[str] = None,
) -> Optional[_PREDICT_OUTPUT]:

After:

def predict(
    self,
    model: Optional["pl.LightningModule"] = None,
    dataloaders: Optional[Union[EVAL_DATALOADERS, LightningDataModule]] = None,
    datamodule: Optional[LightningDataModule] = None,
    return_predictions: Optional[bool] = None,
    ckpt_path: Optional[str] = None,
    fast_dev_run: Union[int, bool] = False,
    limit_batches: Union[int, float] = 1.0,
) -> Optional[_PREDICT_OUTPUT]:
  2. Deprecate the corresponding arguments from the Trainer constructor (see the example below).
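For example, a typical run under this proposal might look like the following (a sketch of the proposed API, not the current one):

trainer = Trainer()  # engine-level configuration only
trainer.fit(model, max_epochs=10, limit_train_batches=0.5, val_check_interval=0.25)
trainer.validate(model, limit_batches=100)
trainer.test(model, limit_batches=100)
trainer.predict(model, limit_batches=10)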

Alternatives

Keep as is

Additional context

It is very tempting to make everything easily configurable. It's also how you end up with functions/classes that take 200+ arguments, which have pairwise conflicts, and which no one knows how to use them or what they work with. Config encourages monolithic designs over modularity. Coding by config has never worked with pytorch because configs are not as flexible as code, and PyTorch encourages writing code. I have seen this repeatedly (like way more than you'd imagine) with prior platforms I've worked on, and every single time, the whole project is either rewritten from scratch or someone adopts something more flexible because it allows people to write the code they want. That migration is much more painful for everyone. That is why I feel very strongly about this: because w/o more modular design, my concern is either we'd have to rewrite Lightning from first principles, or someone else will write a leaner version instead which users will adopt. @williamFalcon


@four4fish (Contributor) commented:

+1 love this idea!
It will be super helpful for custom loops in the future.

@ananthsub (Contributor, Author) commented:

@PyTorchLightning/core-contributors I'd like your feedback on this

@tchaton (Contributor) commented Nov 12, 2021

Dear @ananthsub,

Thanks for opening this issue. I have wanted to make this proposal for some time already!

This makes discovering features of the core trainer harder with more flags added.

I entirely agree with this, and I believe this change would improve Loop Customization if designed properly, as right now the Trainer arguments aren't passed to the new loops.

However, this would be one of our biggest breaking changes, and we should act on it only if the Lightning Team, Facebook, and the Lightning BDFL are all on board.

Best,
T.C

@ananthsub (Contributor, Author) commented:

@williamFalcon what are your thoughts on this?

@rohitgr7 (Contributor) commented:

I like this change.

But as @tchaton mentioned, this is going to be a big refactor with a lot of deprecations, at least for .fit, and we need to make sure all of them are done within the same release if we go with this.

Just some personal opinions
Pros:

  • Will help keep flags only where they are required.
  • Not many changes will be required since most of the flags hover around epochs and batch limitations.
  • Will make it easier to implement and integrate custom loops.

Possible Cons:
Can't find any.. Still thinking

Ref:
We have a similar suggestion for the tuner as well. Will help move 2 more hooks from Trainer.
#9103

@SeanNaren (Contributor) commented:

It does look nice from an API level; however, I'm wondering how this weighs against the already established familiarity of Lightning.

Things such as this Reddit comment make me think very hard about whether the value of changing the API is worth it.

I'm not against such a large API change if we think the overall value is worth it, just throwing some caution in there!

@carmocca (Contributor) commented Nov 15, 2021

I'm fully in favor of this. This is also something I have found confusing ever since I first discovered Lightning. However, it is a BIG design shift.

This will also make it clearer that this is not a trainer but an engine named "Trainer", as now the trainer-specific flags will be under trainer.fit().

As a side note, we would need to re-think LightningCLI(run=False) as the new entrypoint arguments would not get parsed.

@awaelchli (Contributor) commented:

This is not Lightning anymore. In Lightning, the Trainer arguments are in one place so they can be easily configured. This change will effectively make Trainer.from_argparse_args and the like no longer useful, and they will have to be removed. This is the same complaint I had about removing the resume_from_checkpoint argument from the Trainer constructor. How will this usability problem be addressed? You will have to recommend switching directly to LightningCLI, so LightningCLI will no longer really be optional but an integral part of configuration.

@Borda (Member) commented Nov 15, 2021

In a certain way, I like the argument: moving some arguments to particular methods makes it clearer which arguments are used in the fit/test phase. But:

This is not Lightning anymore. In Lightning, the Trainer arguments are in one place so they can be easily configured.

I fully agree with @awaelchli.

@tchaton (Contributor) commented Nov 15, 2021

Hey @ananthsub,

This is still open to debate, but here is a quick summary of the previous comments.

Pros:

  • Will help keep flags only where they are required. (@rohitgr7)
  • Clearer that this is not a trainer but an engine named "Trainer" (@carmocca)
  • Not many changes will be required since most of the flags hover around epochs and batch limitations. (@rohitgr7)
  • Will make it easier to implement and integrate custom loops. (@tchaton)
  • Avoids confusion about which flags are useful for which loops.

Cons:

  • Make Lightning API indecisive and unstable -> possible reddit, twitter storm (@SeanNaren)
  • Trainer arguments are in one place so they can be easily configured (@awaelchli)
  • Trainer.from_argparse_args would not be useful anymore (@awaelchli)
  • (minor): LightningCLI would need some refactoring (@carmocca)

Right now, it seems there are quite good arguments on both sides, and my conservative/stability side pushes me toward no change.

@ananthsub (Contributor, Author) commented Nov 16, 2021

First, does everyone agree that the pain points outlined in the issue are problems today? In particular, confusing/misleading trainer flags and not knowing when they take effect. For all of these flags, it is way too much work to expect users to read the implementations to see how/when they take effect. This behavior can surprise people. It should be immediately obvious from the name and from where the args are specified.

If you agree that these are problems, do you have alternate proposals, other than what's listed here, that would address those concerns? I am looking to address confusion over how to use the Lightning API for loop behaviors, and this way seemed most natural to me. If you had to re-design this today, knowing what features the Trainer has now, how would you do it?

Regarding the cons:

  1. This would be an intentional change. The API is not indecisive or unstable. It is exactly what @carmocca says: all the arguments for the Trainer as an engine and the trainer as a loop driver are in the same place. Those responsibilities grow and manifest in different ways. This is one proposal to isolate those features to the most localized areas possible. It can establish a pattern that anything touching the loop boundary conditions goes into fit/validate/test/predict, leaving the Trainer constructor for functionality used within those loops.

Furthermore, this would follow the documented API evolution process. This would not be a breaking change, and this issue is open for public comment. I don't anticipate a social media storm because I think that happens for silent/breaking changes that are not clearly communicated.

  2. Regarding configuration, my counter-argument is that the Python API has to be the topmost priority. If there's confusion around how to use this, it doesn't matter what the configuration story is. Trainer.fit is something all Lightning users have to go through; not all users have to use Trainer.from_argparse_args or the LightningCLI. The configuration choice should follow the pattern the code lays out. Moreover, users can continue using Trainer.from_argparse_args. When they call fit/validate/etc., those users would need to pass the arguments from their argparser that are specific to the trainer function being called. Therefore, they can still reuse the vast majority of their argparser code. This change would not require the LightningCLI to be used.
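A rough sketch of what that argparse-based workflow could look like if this proposal landed (which flags stay on the parser versus move to fit is hypothetical here):

from argparse import ArgumentParser

parser = ArgumentParser()
parser = Trainer.add_argparse_args(parser)                 # engine-level flags stay on the Trainer
parser.add_argument("--max_steps", type=int, default=-1)   # loop flag, now forwarded to fit
args = parser.parse_args()

trainer = Trainer.from_argparse_args(args)
trainer.fit(model, max_steps=args.max_steps)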

In Lightning, the Trainer arguments are in one place so they can be easily configured

This worked when the project was a lot smaller. There were far fewer features, and there weren't as many trainer functions or trainer features. But the project has grown a lot. I strongly believe we ought to prioritize API principles & correctness over convenient configuration. As stated above, there are users today who are confused by the flags offered and when they take effect. It doesn't matter if it's easy for those users to set values if they're using the flags incorrectly.

That being said, this is not saying configuration doesn't have a role in usability. I am not proposing something that would make instantiation much more difficult, such as relying on complex objects as input values to the Trainer constructor.

I'd argue that the clarity this provides is worth that change. This feels much cleaner to me, opens up more extension points, and it means we don't need to force all functionality through the Trainer constructor. And most critically, it changes how users should view the Trainer given the far expanded functionality it now has compared to the start of the project.

It is very tempting to make everything easily configurable. It's also how you end up with functions/classes that take 200+ arguments, which have pairwise conflicts, and which no one knows how to use them or what they work with. Config encourages monolithic designs over modularity. Coding by config has never worked with pytorch because configs are not as flexible as code, and PyTorch encourages writing code. I have seen this repeatedly (like way more than you'd imagine) with prior platforms I've worked on, and every single time, the whole project is either rewritten from scratch or someone adopts something more flexible because it allows people to write the code they want. That migration is much more painful for everyone. That is why I feel very strongly about this: because w/o more modular design, my concern is either we'd have to rewrite Lightning from first principles, or someone else will write a leaner version instead which users will adopt. @williamFalcon

@okuchaiev commented:

I'd like to provide some thoughts from the perspective of an outside product (NeMo) developed on top of Lightning. PTL has two major classes, Trainer and LightningModule, which (roughly) separate engineering and science in DL training. It has been convenient to build on top of these two abstractions. However, every time Trainer and/or LightningModule loses instance variables that configure the session, it breaks other products built on top of PTL (such as NeMo).

If this RFC is to pass, may we ask that neither Trainer nor LightningModule lose their members or constructor arguments? And that when the same name is passed to .fit(…)/.validate(…), the one passed to the member function overrides the one in the Trainer?

For example:

trainer = Trainer(max_steps=100)
trainer.fit(model, max_steps=200) # this one takes precedence and training happens over 200 steps
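A minimal sketch of how that precedence could be resolved inside fit (a toy stand-in, not pytorch_lightning.Trainer, and not current behavior):

from typing import Optional

class ToyTrainer:
    def __init__(self, max_steps: int = -1):
        self.max_steps = max_steps  # constructor value, kept as a member

    def fit(self, model, max_steps: Optional[int] = None):
        # hypothetical rule: a per-call value overrides the constructor value
        effective_max_steps = max_steps if max_steps is not None else self.max_steps
        print(f"training for {effective_max_steps} steps")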

Backwards compatibility is not the only reason behind this ask. Sometimes the model needs to be able to access Trainer instance before the model is even constructed. For example, for model and/or pipeline parallelism, the model needs to know world size. As another example, learning rate scheduler may (or may not) need to know number of GPUs, max_steps, batch size etc.

Additionally, we have concerns with setting ckpt_path in the .fit(…) method. There are two scenarios in which one might want to do that: (a) fine-tuning and (b) resuming training from a checkpoint. The second scenario is a must-have for training on shared clusters with strict time limits (universities and companies). But it would be very error prone, because users won't be able to re-submit the script as-is and have it do the right thing. Instead, they will need to adjust this parameter with every re-submission to make sure the right checkpoint is used.

@lantiga (Collaborator) commented Nov 23, 2021

Echoing @okuchaiev, there's value in knowing the value of parameters as soon as possible, prior to model creation. Which ones exactly is hard to tell; we don't have real visibility into what products built on top of Lightning might decide to do.

Knowing how fit will operate before it's called (so first configure, then let external code see the configuration, then run) is a great pattern, because it provides external visibility into what those configurations are, as in the NeMo example.

Following this logic, allowing Trainer arguments to eventually be overridden by the same arguments to fit (as in @okuchaiev's proposal) goes against that pattern, because at that point the configuration would cease to be authoritative with respect to what will happen during fit.

I'm wondering if there's a way to retain the current configure -> run pattern while at the same time making the effects of arguments easier to interpret (and ultimately more discoverable).

A couple of possibilities come to mind, but I may or may not agree with them 🙂:

  • standardize prefixing arguments with the scope (fit_, test_) where it makes sense. This could have an easier deprecation path and provide clarity on what happens. It wouldn't be too pretty and still wouldn't go in the direction of modularization
  • allow Trainer to be configured in a more articulate way after creation but before the call to fit. One would instantiate the trainer with trainer = Trainer(...) and then trainer.config.fit.max_iterations = 100 or otherwise call a function to set those. This option would allow for modularization, because one could register additional configuration scopes to the Trainer.
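A very rough sketch of what registering configuration scopes could look like (the ScopedConfig class and its names are hypothetical, just to make the second idea concrete):

from types import SimpleNamespace

class ScopedConfig:
    """Toy container of per-entrypoint configuration scopes."""

    def __init__(self):
        self._scopes = {}

    def register_scope(self, name: str) -> SimpleNamespace:
        # each entrypoint (fit, validate, test, predict) gets its own namespace
        self._scopes[name] = SimpleNamespace()
        return self._scopes[name]

    def __getattr__(self, name):
        return self._scopes[name]

config = ScopedConfig()
config.register_scope("fit")
config.fit.max_iterations = 100   # mirrors the trainer.config.fit.max_iterations example above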

@williamFalcon (Contributor) commented:

With Lightning's maturity comes the expectation of a stable API. Unfortunately, the changes proposed here are not backwards compatible and will jeopardize an otherwise stable API.

I do, however, agree with the sentiment that there might be some confusion. But let's take a step back and revisit the issues from the user perspective (and make sure they apply to the whole community) before jumping into any one particular decision.

So, for now, let's go ahead and close this while we identify the core usability issues here and brainstorm ways to address them that don't affect backward compatibility.

For all intents and purposes, consider the Trainer API core/foundational/stable.

@williamFalcon (Contributor) commented:

Btw... on the configuration combinatorics explosion point, it turns out we've solved a lot of those issues through our current CI/CD (kind of one of the very valuable points that PL adds). We have soooo many tests that test ALL possible combinations of things. In addition, any time something has been broken, we fixed it and added a test.

So, at this point this is not a top of mind worry that I would spend too much time concerned with.

The Lightning Trainer API is stable. It's very unlikely more flags will be added. It's 100% unlikely (zero probability) that more than 10 flags will EVER be added, so we don't need to solve for "scale" with 2000 flags, etc., because that's never going to happen.

The simplifications to the Trainer are places where we can remove certain flags in favor of more expressive flags (but not throw stuff into a callback hidden under a submodule).

We've been thinking about these flags for 2+ years at this point (with thousands of eyeballs), and we haven't had any new insights in a long time about flags that might not be necessary or flags that could be better named/more expressive. However, I am very hopeful that there will be more insights that help the Trainer become more expressive!

@okuchaiev commented:

@williamFalcon , thanks for this: "Lightning Trainer API is stable." :)
This RFC does raise some valid concerns.
I think the great strength of PTL is that it is used by a range of very different users, from students with one GPU to teams with supercomputer access.
IMHO, the solution proposed here didn't consider the whole range of users. While it is not possible to always satisfy everyone, we aren't 100% against any breaking changes. Let's just make them more gradually, on a stable foundation.

@carmocca (Contributor) commented:

I would argue this isn't so much a question of compatibility/stability but a question of vision.

We can maintain backwards-compatible support for existing flags for an arbitrary length of time, either by keeping the original flags or by adding a catch-all Trainer(..., **kwargs) until we decide to remove support. The original post does not mention it, but this deprecation could be kept until 2.0 is released (which does not have a set date).

This would allow our users to convert gradually and slowly. Drawing out this process also gives us time to gather reactions from the community, as these RFCs do not reach deep into our user base.

For this reason, what's important to me is knowing whether this proposed change is a better API had Lightning not been initially designed with most arguments grouped into the Trainer. Of course, hindsight is 20/20.

@ananthsub (Contributor, Author) commented Nov 24, 2021

Thanks all for the comments. The purpose of an RFC is to gather feedback, and I'm glad this discussion took place. And I agree 100% with what @carmocca said above^. Ignoring any backward compatibility concerns, is the API proposal better than what currently exists?

Addressing specific concerns raised:

@williamFalcon

So, for now, let's go ahead and close this while we identify the core usability issues here and brainstorm ways to address them that don't affect backward compatibility.

The core usability issue highlighted here is that the Trainer is both the runner and the engine. By this I mean:
Runner = defines the loop boundaries and overall control flow
Engine = all the engineering logic that happens inside of those loop bounds

We see circular references & ownership throughout the codebase as a result. The "runner" logic of the trainer ends up having a reference to all other components in Lightning. But with the trainer as an engine, all other components in Lightning end up with a reference to the Trainer (lightning module, data module, callbacks, even the lightning optimizer). I see this as a huge potential for misuse: #7315

It also lays out that different people have different definitions of "model code" vs "engineering code" which creates this ambiguity.

Btw... on the configuration combinatorics explosion point, it turns out we've solved a lot of those issues through our current CI/CD (kind of one of the very valuable points that PL adds). We have soooo many tests that test ALL possible combinations of things. In addition, any time something has been broken, we fixed it and added a test.

We rely a lot on mini end-to-end tests that leverage these flags. I cannot confidently say we test all possible combinations, but we do our best on this front. However, this reliance on end-to-end tests is tricky precisely because of the combinatorics presented, in addition to dev efficiency reasons: we consistently see slow & flaky tests. Keeping the components modular ensures we can provide better stability & a great dev experience without resorting to more extreme measures like fuzz testing all possible combinations of input types/values and waiting hours/days for test signals.

@okuchaiev

But it would be very error prone, because users won't be able to re-submit the script as-is and have it do the right thing. Instead, they will need to adjust this parameter with every re-submission to make sure the right checkpoint is used.

This also applies to needing to pass resume_from_checkpoint as an argument to the Trainer constructor, so it is not unique to passing arguments to trainer.fit.

Backwards compatibility is not the only reason behind this ask. Sometimes the model needs to be able to access Trainer instance before the model is even constructed. For example, for model and/or pipeline parallelism, the model needs to know world size. As another example, learning rate scheduler may (or may not) need to know number of GPUs, max_steps, batch size etc.

This is exactly the issue I mentioned above.
Similar issues show up here: #10430

It also puts much greater pressure on the initialization & hook order. The more hooks offered, the more potential paths through them, the more chance there's a conflict between what users want. Some will want A to happen before B and some will want B to happen before A. What happens then?

In a plain Python/PyTorch training script, this is entirely user-controlled, and these are likely parametrized in a way where dependencies are clearly laid out. But with Lightning now, users end up relying on the Trainer to access all these properties.

On a tactical level, guaranteeing every property's availability in every hook for all the different trainer configurations is a very tall order. From a user experience POV, Lightning's claim is that it is organized PyTorch, not layers over PyTorch. If users end up writing code that looks different with vs. without the framework, that is a concern.

@blisc commented Nov 29, 2021

From my point of view, Lightning is a training framework and fit() is the most important function. As @williamFalcon mentioned, the current design of Lightning is to create this Trainer object and define it from its kwargs and parameters. Breaking this design pattern for fit() is a massive change that would break projects like NeMo that depend on it. I would venture to say that NeMo is not the only one that will be affected by it.

I am strongly urging, as in #10573, that any existing arguments and properties in Trainer remain in Trainer's init if they affect fit().

Backwards compatibility is not the only reason behind this ask. Sometimes the model needs to be able to access Trainer instance before the model is even constructed. For example, for model and/or pipeline parallelism, the model needs to know world size. As another example, learning rate scheduler may (or may not) need to know number of GPUs, max_steps, batch size etc.

This is exactly the issue I mentioned above.
Similar issues show up here: #10430
It also puts much greater pressure on the initialization & hook order. The more hooks offered, the more potential paths through them, the more chance there's a conflict between what users want. Some will want A to happen before B and some will want B to happen before A. What happens then?

We are not interested in which hooks are run when. We are interested in being able to access the Trainer to get properties relevant to the training run prior to fit(), so we can do the relevant processing before ever entering fit(). For example, we want to get world_size prior to calling fit(). This requires us to define it when we initialize the Trainer, before we enter fit().

standardize prefixing arguments with the scope (fit_, test_) where it makes sense. This could have an easier deprecation path and provide clarity on what happens. It wouldn't be too pretty and still wouldn't go in the direction of modularization

allow Trainer to be configured in a more articulate way after creation but before the call to fit. One would instantiate the trainer with trainer = Trainer(...) and then trainer.config.fit.max_iterations = 100 or otherwise call a function to set those. This option would allow for modularization, because one could register additional configuration scopes to the Trainer.

Building on top of @lantiga's suggestions, I would suggest the following:

  • Add relevant arguments as mentioned in the beginning of this RFC to fit, validate, test, predict
  • Do not remove said arguments from Trainer
  • Prefix all such duplicated arguments with fit_
  • Explicitly note in the docs which Trainer arguments only affect fit
  • Add a property to Trainer, where it makes sense, that stores the arguments passed to Trainer
  • Allow users to overwrite this property in between Trainer's initialization and fit
  • In fit, if the relevant parameter was None, try to use the associated Trainer property (see the sketch after this list)
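Putting those bullets together, usage might look roughly like this (hypothetical: it assumes both the constructor argument and a same-named fit argument exist):

trainer = Trainer(max_epochs=10)    # stored on the trainer, as today
trainer.max_epochs = 20             # overwritten between __init__ and fit
trainer.fit(model)                  # no per-call value given, so the Trainer property (20) is used
trainer.fit(model, max_epochs=5)    # a per-call value takes precedence for this run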

@justusschock (Member) commented:

  • Add a property to Trainer, where it makes sense, that stores the arguments passed to Trainer
  • Allow users to overwrite this property in between Trainer's initialization and fit

I think that, even without any changes to where the arguments are located, this would already be a great enhancement, since it allows a more flexible configuration and also allows changing these arguments in between two fit calls.

In total: I understand both sides here, and I think that, for now, stability is way more important than changing the location. I also think, though, that this is something we should definitely revisit when considering Lightning 2.0 with breaking changes.
