
[RFC] Default to infinite epochs, not 1000 #10343

Open
zplizzi opened this issue Nov 3, 2021 · 18 comments
Labels
help wanted, let's do it!, trainer: argument
Comments

@zplizzi

zplizzi commented Nov 3, 2021

Currently max_epochs defaults to 1000:

If both max_epochs and max_steps aren't specified, max_epochs will default to 1000. To enable infinite training, set max_epochs = -1.

As a user, though, I would expect that if I don't specify an ending point, training continues indefinitely. In my own experiments, when training cut off at 999 epochs, I was confused, and googling the issue didn't readily turn up this line in the documentation. When I checked my logs of all the hyperparameters, max_epochs was set to None (I guess this override is applied internally). So as a user I feel this is bad UX: I can't see a reason to impose an arbitrary cutoff rather than defaulting to infinite training.
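To illustrate, the documented fallback boils down to something like this (a rough sketch; the helper name and logic are illustrative, not Lightning's actual internals):

```python
# Rough sketch of the documented fallback (illustrative, not the real internals).
def resolve_max_epochs(max_epochs=None, max_steps=-1):
    """If neither limit is given, fall back to 1000 epochs; -1 means no limit."""
    if max_epochs is None:
        # Only apply the 1000-epoch fallback when max_steps is also unset.
        return 1000 if max_steps == -1 else -1
    return max_epochs

print(resolve_max_epochs())               # 1000 -- the silent default at issue
print(resolve_max_epochs(max_epochs=-1))  # -1   -- explicit infinite training
```

Note that the user-facing value stays `None` while the effective value becomes 1000, which is exactly why logged hyperparameters don't reveal the cutoff.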

It's especially frustrating when you've invested significant time into a training run, only to have it prematurely cut off due to this unexpected max_epochs limit.

cc @Borda

@zplizzi zplizzi added the feature label Nov 3, 2021
@carmocca
Contributor

carmocca commented Nov 3, 2021

If we were to change this, we would have the opposite problem: the user forgot to set max_epochs and suddenly their energy (or AWS) bill is way too high because training has been running non-stop.

We have to set a default, which is why a fixed number of epochs was chosen. The specific value of 1000 is arbitrary and has been kept for backward compatibility.

To improve this, one thing we could do is print a message when no value is passed.

@carmocca carmocca added this to the v1.6 milestone Nov 3, 2021
@zplizzi
Author

zplizzi commented Nov 3, 2021

If the user is worried about their energy/AWS bill, they should explicitly set a termination condition appropriate to their workload. I would guess the majority of users aren't somehow forgetting about their training jobs and racking up huge bills. On AWS, the instance isn't gonna shut down when the job ends, so this doesn't even really help (unless they're running it in a container on EKS or something - which is advanced enough that I would expect they would consider setting a proper termination point). And on a personal computer, it's rather hard to forget that you have a training job running.

But if we must keep it, yes, printing a message on launch would be much better. And ideally also printing a message when training ends, saying "run stopped because max_epochs (set to 1000) was reached". As it is, I have no idea why my training ended abruptly.
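A sketch of what that end-of-run message could look like (hypothetical helper, not an existing Lightning API):

```python
def explain_stop(current_epoch, max_epochs):
    # Hypothetical end-of-run report along the lines suggested above.
    if max_epochs != -1 and current_epoch >= max_epochs:
        return f"Run stopped because max_epochs (set to {max_epochs}) was reached."
    return "Run stopped for another reason (e.g. early stopping or interruption)."

print(explain_stop(current_epoch=1000, max_epochs=1000))
```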

@carmocca carmocca added the docs and good first issue labels and removed the feature label Nov 4, 2021
@puhuk
Contributor

puhuk commented Nov 4, 2021

@zplizzi, @carmocca if you don't mind, I'd like to take this on. :)

@awaelchli
Contributor

I don't have any strong preferences or arguments for or against the change. Asking the user to set a limit themselves vs. asking them to set max_epochs=-1 for infinite training is about the same effort.

I definitely agree that we should improve our reporting for when the training loop stops; there can actually be a few reasons. This was definitely discussed before and I think we are all in favor of it (not sure where it was discussed; it's hard to find).

@puhuk This is in a discussion phase so we should wait and give more people the chance to leave their comments here.

@puhuk
Contributor

puhuk commented Nov 4, 2021

@awaelchli, got it !

@carmocca
Contributor

carmocca commented Nov 4, 2021

cc @ananthsub as I remember you commented once about the default epoch number

@zplizzi
Author

zplizzi commented Nov 4, 2021

Asking the user to set a limit themselves vs. asking them to set max_epochs=-1 for infinite training is about the same effort.

I don't think this is quite right - if someone wants to limit the duration of training, the default of 1000 epochs is almost never going to be right/helpful, so they're going to have to set a different limit. So the current state of affairs means that everyone has to set a value for max_epochs. Whereas if the default were -1, then people that are ok with stopping the experiment manually won't have to set any value for this parameter. So I feel that a default of -1 is less effort overall.

@ananthsub
Contributor

ananthsub commented Nov 4, 2021

This issue highlights some great points for improvement:

  1. How to make the documentation clearer to avoid these issues:

I was confused, and googling the issue didn't readily turn up this line in the documentation

We should address this in the trainer documentation on the site.

  2. The UX for hyperparameters when the trainer defaults to behavior not expressed in the default values of the constructor arguments

When I checked my logs of all the hyperparams, max_epochs was set to None (I guess this override is applied internally)

What's the right means of capturing hyperparameters that have custom post-processing inside of the trainer?

  3. The default stopping condition for training

I can't see a reason to put an arbitrary cutoff versus defaulting to infinite training.
cc @ananthsub as I remember you commented once about the default epoch number

My suggestion for the default was to make it 1 epoch, so that if execution terminates unexpectedly because of the default limit, it happens much sooner than after 1000 epochs. I personally feel that something that stops by default is safer than infinite training. Otherwise users need to figure out how to kill their jobs, which can be painful, especially if they haven't configured early stopping and/or are running multi-process jobs.

@carmocca
Contributor

carmocca commented Nov 9, 2021

I personally feel that something that stops by default is safer than infinite training. Otherwise users need to figure out how to kill the jobs, which can be painful especially if they haven't configured early stopping and/or are doing multi-process jobs

I agree with this.
If @awaelchli agrees, the takeaways of this are:

  1. Show a warning about max_epochs not being set.
  2. Improve docs about our default behavior.
  3. In a separate issue, discuss the 1000 epochs -> 1 epoch change.
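Takeaway 1 could look roughly like this (a sketch only; the actual wording and placement would live in the Trainer's setup code, and the helper name is made up):

```python
import warnings

def warn_on_default_max_epochs(max_epochs=None, max_steps=-1):
    # Sketch: warn when the user is silently getting the 1000-epoch default.
    if max_epochs is None and max_steps == -1:
        warnings.warn(
            "`max_epochs` was not set; defaulting to 1000 epochs. "
            "Set `max_epochs=-1` for infinite training.",
            UserWarning,
        )
        return 1000
    # Otherwise respect the explicit value, or leave epochs unbounded
    # when only a step limit was given.
    return max_epochs if max_epochs is not None else -1
```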

@kaushikb11 kaushikb11 changed the title Default to infinite epochs, not 1000 [RFC] Default to infinite epochs, not 1000 Nov 9, 2021
@kaushikb11 kaushikb11 added the design and discussion labels and removed the design label Nov 9, 2021
@awaelchli
Contributor

awaelchli commented Nov 9, 2021

Agreed. One could also consider 1 step instead of 1 epoch, since 1 epoch is not guaranteed to be "small" or even finite in size (it all depends on the dataloader).

@Rajathbharadwaj
Contributor

Hey, I wish to contribute to this. Any pointers on how I can get started?

@carmocca
Contributor

Hi @Rajathbharadwaj, you can work on 2 separate PRs. These would be

  1. Show an info message when max_epochs is not passed to the Trainer and we default to 1000.
  2. Look for improvements in our docs about the default behavior for max_epochs. @rohitgr7 and/or @kaushikb11 can help you find the best place for it.

@kaushikb11
Contributor

@Rajathbharadwaj I have assigned the issue to you, go for it! Feel free to ask questions

@Rajathbharadwaj
Contributor

Rajathbharadwaj commented Nov 23, 2021

Awesome thanks! @kaushikb11 @carmocca
Also, shall I also add functionality to pause until the user confirms? Maybe like an input from the user 'y' to confirm. Or just show an info message when max_epochs is not passed.

Also, should this be a UserWarning or a ResourceWarning?

@rohitgr7
Contributor

@Rajathbharadwaj

Also, shall I add functionality to pause until the user confirms, maybe with a 'y' input from the user? Or just show an info message when max_epochs is not passed?

Just an info message is enough, and it should be a UserWarning. You can also update the docs with a note about the default behavior.

@Rajathbharadwaj
Contributor

Rajathbharadwaj commented Nov 23, 2021

@rohitgr7 @kaushikb11 @carmocca could you please let me know if any changes are required? I'm guessing I should maybe remove the print statement? 😄

@ananthsub
Contributor

#10444 would've addressed any ambiguity around defaults for fitting. It could've made calling fit without specifying a stopping criterion an error, which the Trainer cannot do today.

@Rajathbharadwaj
Contributor

@ananthsub I'm sorry, but I'm not sure I understand what you mean.

@carmocca carmocca removed this from the 1.6 milestone Feb 1, 2022
@carmocca carmocca added this to the future milestone Feb 1, 2022
@carmocca carmocca added the help wanted, let's do it!, and trainer: argument labels and removed the good first issue, docs, and discussion labels Feb 1, 2022
Development

No branches or pull requests

8 participants