
new: add quaterion progress bar, add trainer defaults, move callbacks… #114

Merged
merged 11 commits into master from 113-progress-bar on Jun 8, 2022

Conversation

@joein (Member) commented May 31, 2022

#113

@netlify bot commented May 31, 2022

Deploy Preview for capable-unicorn-d5e336 ready!

🔨 Latest commit: 1276aeb
🔍 Latest deploy log: https://app.netlify.com/sites/capable-unicorn-d5e336/deploys/62a0de161e3a4b00093bc6bf
😎 Deploy Preview: https://deploy-preview-114--capable-unicorn-d5e336.netlify.app

@joein (Member Author) commented May 31, 2022

[WIP] Errors in netlify log

Lightning-AI/pytorch-lightning#13159

@joein requested a review from generall May 31, 2022 14:49
@joein self-assigned this May 31, 2022
@joein added the enhancement (New feature or request) label May 31, 2022
@joein linked an issue May 31, 2022 that may be closed by this pull request
train_dataloader: DataLoader instance to retrieve samples during training
stage
val_dataloader: Optional DataLoader instance to retrieve samples during
validation stage
trainer: `pytorch_lightning.Trainer` instance to handle fitting routine
internally. If not passed, it will be created with `Quaterion.trainer_defaults`
ckpt_path: Path/URL of the checkpoint from which training is resumed. If there is
Contributor

Is it some kind of requirement or something intended for UX?

I guess it will instead cause a headache most of the time. TBH, sounds more awkward than #29.

Member Author

A couple of reasons:

  1. We want to replace the default Lightning progress bar with our own, but we don't want to replace a user-defined one.
    We have no way to distinguish whether a progress bar was set by the user or by the trainer itself.
    Hence, we pass our progress bar via trainer_defaults. If a user wants to provide a custom progress bar, they can do so by overriding callbacks in trainer_defaults or by passing a ready-to-use pl.Trainer instance (see the sketch after this list).

  2. It relieves users of some boilerplate code.
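
To make this concrete, here is a minimal sketch of overriding the defaults and passing a ready-made trainer to Quaterion.fit. The exact shape of Quaterion.trainer_defaults and the fit signature are assumptions, not taken from this PR; model and the dataloaders are placeholders.

import pytorch_lightning as pl
from quaterion import Quaterion

# Assumption: trainer_defaults is (or yields) a plain dict of pl.Trainer kwargs.
defaults = dict(Quaterion.trainer_defaults)
# Swap only the callbacks, e.g. to bring back the stock Lightning progress bar.
defaults["callbacks"] = [pl.callbacks.TQDMProgressBar()]

trainer = pl.Trainer(**defaults)

# `model`, `train_dataloader` and `val_dataloader` are placeholders defined elsewhere.
Quaterion.fit(
    trainable_model=model,
    trainer=trainer,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
)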

Member

Actually, there are a lot of parameters we would like to pre-define in the trainer to serve as reasonable defaults, but it looks like the trainer is not designed to be manipulated after it is already created. Additionally, we can't wrap the trainer, because it has a huge list of parameters, and supporting backward compatibility for them would be very painful.

So the proposal is to handle 90% of cases with a default trainer configuration that looks nice, while keeping the ability to override the trainer if required. The default configuration should work well enough for all the tutorial examples; if it currently doesn't, we can adjust.

Contributor

If this idea comes from using a custom progress bar, then I would like to stick to plain old TQDM. No one can blame us for using a package that is downloaded ~10M times a week.

We set many defaults here, and even if a user wants to change one of them, they will lose all the good defaults. And they will need to continuously check what is set. Some of them are boilerplate, but some are hyperparameters. In #29, we concluded that we cannot safely predict the user's intention and that it's not our responsibility. However, this change assumes user intentions from a more radical perspective, and we take on more responsibility. It's just confusing.

The final decision is yours, but this PR should be reverted in my humble opinion.

Member

We set many defaults here, and even if a user wants to change one of them, they will lose all the good defaults.

Can you please elaborate on that?

Member

Oh, I was afraid of that, but maybe we can show an info/warning when the user is using a default trainer, with a link to an explanation? I would really want to have a default trainer, it looks much better in GIFs and demos.

Contributor

I would also prefer writing fewer lines of code and agree that it looks much better, but I strongly believe that hyperparameters should really, really be set by users themselves.

Alternatively, we can convert this one into a discussion and log a link to that discussion in Quaterion.fit() before actually merging it into master.

Member Author

Which hyperparameters are you referring to?

Contributor

The number of epochs and early stopping are hyperparameters.

Member Author

I think we can preserve the default value of max_epochs as it is in pytorch_lightning.Trainer; AFAIK it is 1000.

Imho, I don't see anything really wrong with using EarlyStopping by default.
As far as I can judge, it is quite common to set EarlyStopping and monitor validation_loss, isn't it?
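
For reference, a hedged sketch of what such a default could look like with the stock Lightning callback; "validation_loss" is an assumed metric key, the one Quaterion actually logs may differ.

from pytorch_lightning.callbacks import EarlyStopping

# Illustrative default only; the monitored key and patience are assumptions.
early_stopping = EarlyStopping(monitor="validation_loss", patience=3, mode="min")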

num_batches = len(train_dataloader)
if num_batches > 0:
defaults["log_every_n_steps"] = min(50, num_batches)
except Exception: # If the dataset has no length
Member Author

Maybe specify the exceptions? (TypeError, NotImplementedError)
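
A sketch of what the narrowed handler suggested here would look like (the surrounding try block is implied by the diff context above):

try:
    num_batches = len(train_dataloader)
    if num_batches > 0:
        defaults["log_every_n_steps"] = min(50, num_batches)
except (TypeError, NotImplementedError):  # dataset defines no __len__, e.g. an IterableDataset
    pass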

Member

Thought about it, but I don't trust the list of exceptions used in Lightning. The whole section is optional, so I'm fine if it fails for whatever reason.

not encoder.trainable
for encoder in trainable_model.model.encoders.values()
]
) and (trainable_model.configure_caches() is not None)
Member Author

Is this a correct check?
We have CacheType.NONE

Member

fixed

def render(self, task) -> RenderableType:
    # Format the current speed, falling back to "0.00" before any speed is available
    task_speed = f"{task.speed:>.2f}" if task.speed is not None else "0.00"
    # Track the widest speed string seen so far and left-pad to that width
    self.max_length = max(len(task_speed), self.max_length)
    task_speed = " " * (self.max_length - len(task_speed)) + task_speed
Member Author

We limit the minimal width (4, as len("0.00")), but do not limit the maximum, right?
We allow this field to be broader than 4 symbols ("42.42").

Member

Yes, the idea is to keep the maximal speed length. Otherwise the whole line would shift each time the speed changes from 9.99 to 10.01.
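
A tiny standalone illustration of the padding behaviour being discussed (the names here are local to the example, not part of the PR):

max_length = len("0.00")  # minimal width of the speed field
for speed in (9.99, 10.01, 9.5):
    text = f"{speed:>.2f}"
    max_length = max(len(text), max_length)  # never shrink once a wider value has been seen
    print(" " * (max_length - len(text)) + text)  # prints "9.99", "10.01", " 9.50"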

@joein (Member Author) commented Jun 1, 2022

One of my concerns regarding our custom progress bar is that once a user uses their own progress bar or the default Lightning one, our Caching description will switch back to Predicting.
It could also happen accidentally, when the user doesn't use trainer_defaults or when the callbacks field in trainer_defaults is overridden with a new list of callbacks.

Comment on lines 161 to 175
disable_checkpoints = all(
    [
        not encoder.trainable
        for encoder in trainable_model.model.encoders.values()
    ]
) and (
    cache_config is not None and cache_config.cache_type != CacheType.NONE
)
Member Author

all_encoders_frozen = all(
    not encoder.trainable for encoder in trainable_model.model.encoders.values()
)
cache_configured = cache_config is not None and cache_config.cache_type != CacheType.NONE
disable_checkpoints = all_encoders_frozen and cache_configured

@joein (Member Author) commented Jun 3, 2022

I accidentally merged a pull request into this branch; I will amend the last two commits and force-push.

@generall merged commit 83b901c into master Jun 8, 2022
@generall deleted the 113-progress-bar branch June 10, 2022 18:18
Labels
enhancement (New feature or request)
Development

Successfully merging this pull request may close these issues.

Make quaterion progress bar look nice
4 participants