
Fix galore lr display with schedulers #31710

Merged

Conversation

@vasqu (Contributor) commented Jun 29, 2024

What does this PR do?

See #31707 for a detailed rundown. Fixes #31707

Tl;dr: GaLore still displays an incorrect lr, this time caused by the lr scheduler.
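
(Context, as a minimal standalone sketch rather than the actual transformers code: the Trainer displays whatever lr it reads back from the lr scheduler, roughly lr_scheduler.get_last_lr()[0]. In GaLore's layer-wise mode the real optimizers live in per-parameter gradient hooks, so a dummy scheduler that reports a hardcoded default shows a value unrelated to the lr actually used for training. All names below are hypothetical.)

import torch
from torch.optim.lr_scheduler import LRScheduler

# One real optimizer per parameter, roughly how GaLore's layer-wise mode is wired up
# via post-accumulate gradient hooks (stand-in setup, not the transformers code).
model = torch.nn.Linear(4, 4)
optimizer_dict = {p: torch.optim.AdamW([p], lr=8e-6) for p in model.parameters()}


class HardcodedDummyScheduler(LRScheduler):
    """Illustrates the bug: the dummy scheduler never looks at the real optimizers."""

    def get_lr(self):
        return [1e-3]  # stale default, and this is what ends up in the progress bar / logs


dummy_optimizer = torch.optim.SGD(model.parameters(), lr=8e-6)  # stand-in dummy optimizer
scheduler = HardcodedDummyScheduler(dummy_optimizer)

# The displayed lr comes from the scheduler, not from the real per-parameter optimizers:
print(scheduler.get_last_lr()[0])  # 0.001 (displayed)
print([g["lr"] for opt in optimizer_dict.values() for g in opt.param_groups])  # [8e-06, 8e-06] (used)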

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr @SunMarc @amyeroberts @Minami-su

@vasqu (Contributor, Author) commented Jun 29, 2024

Failing tests seem unrelated to me: TF and hub issues.

@muellerzr (Contributor) left a comment

Thanks, overall this makes sense. Can you add a test in trainer_utils for this by chance? https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer_utils.py

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu (Contributor, Author) commented Jul 2, 2024

@muellerzr Wouldn't it make more sense over here?

def test_galore(self):

I would add two tests:

  • A basic learning rate check without a scheduler (roughly sketched below).
  • Cosine with warmup steps, checking that the logged lrs roughly follow the correct pattern.

Is that reasonable? I'm not sure when I'll have the time, though.
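
(A minimal sketch of what the first check boils down to, with stand-in values; in the real test the entries would come from trainer.state.log_history.)

# With no scheduler and a constant lr, every logged learning rate should equal
# the configured value.
learning_rate = 8e-6
logs = [  # stand-in for trainer.state.log_history entries
    {"loss": 0.51, "learning_rate": 8e-6},
    {"loss": 0.47, "learning_rate": 8e-6},
]

assert all(
    entry["learning_rate"] == learning_rate
    for entry in logs
    if "learning_rate" in entry
)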

Comment on lines +1712 to +1732
# reach given learning rate peak and end with 0 lr
self.assertTrue(logs[num_warmup_steps - 2]["learning_rate"] == learning_rate)
self.assertTrue(logs[-1]["learning_rate"] == 0)

# increasing and decreasing pattern of lrs
increasing_lrs = [
    logs[i]["learning_rate"] < logs[i + 1]["learning_rate"]
    for i in range(len(logs))
    if i < num_warmup_steps - 2
]
decreasing_lrs = [
    logs[i]["learning_rate"] > logs[i + 1]["learning_rate"]
    for i in range(len(logs) - 1)
    if i >= num_warmup_steps - 2
]

self.assertTrue(all(increasing_lrs))
self.assertTrue(all(decreasing_lrs))

# warm up steps << total steps
self.assertTrue(len(decreasing_lrs) > len(increasing_lrs))
@vasqu (Contributor, Author) commented Jul 2, 2024

Just checking for the general patterns of the cosine scheduler. We could just hardcode the values, but I don't think that's necessary.

Moved the tests into the general trainer tests, but they could also go elsewhere; thought they were more appropriate over here.
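
(For illustration only: a standalone sketch of the pattern those assertions encode, using transformers' get_cosine_schedule_with_warmup with arbitrary toy values rather than the ones from the test.)

import torch
from transformers import get_cosine_schedule_with_warmup

# Toy setup: linear warmup for 5 steps, cosine decay over 20 training steps in total.
optimizer = torch.optim.AdamW(torch.nn.Linear(4, 4).parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=5, num_training_steps=20)

lrs = []
for _ in range(20):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()

# The pattern the test asserts: lrs rise during warmup, peak at the configured lr,
# then decrease towards 0, with far more decreasing steps than increasing ones.
assert lrs[0] < lrs[4] and abs(max(lrs) - 1e-4) < 1e-12 and lrs[-1] < 1e-5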

@muellerzr (Contributor) left a comment

Good job with the test!

cc @amyeroberts for final review

@amyeroberts (Collaborator) left a comment

Thanks for fixing!

Just a comment on the default LR

@@ -519,7 +519,7 @@ def scheduler_hook(param):
             if param.requires_grad:
                 param.register_post_accumulate_grad_hook(scheduler_hook)

-        return LayerWiseDummyScheduler()
+        return LayerWiseDummyScheduler(optimizer_dict=optimizer_dict, lr=optimizer.defaults.get("lr", 1e-3))
Collaborator

Where does the 1e-3 come from here?

Contributor Author

It's like a double fallback. It shouldn't be necessary, since the dummy optimizer is guaranteed to have a value.

The 1e-3 itself comes from torch GaLore, as it's their specific default.

         last_epoch = -1
         verbose = False
         super().__init__(optimizer, last_epoch, verbose)

     def get_lr(self):
-        return [group["lr"] for group in self.optimizer.param_groups]
+        # default value
+        lrs = [1e-3]
Collaborator

I think we should move the 1e-3 value out to a constant that both get_lr and get_scheduler use, so that we only need to update it in one place.

Contributor Author

The default value is in the dummy optimizer; I'll just save it on the initial creation of the dummy scheduler. This way we won't have the hardcoded value.
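
(A rough sketch of that approach with hypothetical names, not necessarily the merged implementation: the default lr is captured once at construction and only used when there are no per-parameter optimizers to read from.)

import torch
from torch.optim.lr_scheduler import LRScheduler


class LayerWiseDummySchedulerSketch(LRScheduler):
    """Remember the default lr once; prefer the real per-parameter optimizers."""

    def __init__(self, optimizer, optimizer_dict=None, lr=1e-3):
        self.default_lr = lr  # captured once, no hardcoded value in get_lr
        self.optimizer_dict = optimizer_dict or {}
        super().__init__(optimizer, last_epoch=-1)

    def get_lr(self):
        if self.optimizer_dict:
            # the lrs actually used for training: one per param group per wrapped optimizer
            return [g["lr"] for opt in self.optimizer_dict.values() for g in opt.param_groups]
        return [self.default_lr]


# With no per-parameter optimizers, the lr saved at construction is reported instead:
dummy = torch.optim.SGD(torch.nn.Linear(2, 2).parameters(), lr=5e-5)
fallback = LayerWiseDummySchedulerSketch(dummy, lr=5e-5)
print(fallback.get_last_lr())  # [5e-05]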

@amyeroberts (Collaborator) left a comment

LGTM - thanks for iterating!

Rebasing on main should resolve any timeout issues on the CI runs

@vasqu force-pushed the fix-galore-lr-display-with-schedulers branch from 55f9d8f to 230adf6 on July 5, 2024 15:02
@vasqu (Contributor, Author) commented Jul 5, 2024

@amyeroberts One timeout didn't make it through. Is it just my luck? 😆

@amyeroberts (Collaborator) commented

@vasqu Just bad luck - although we'll need to look into why these flaky failures are happening on our side. Thankfully some re-runs worked. Thanks for your patience!

@amyeroberts merged commit a01b033 into huggingface:main on Jul 5, 2024
20 checks passed
@vasqu deleted the fix-galore-lr-display-with-schedulers branch on July 5, 2024 18:11
Development

Successfully merging this pull request may close these issues.

When I used galore, the learning rate was set to 8e-6, but the training rate was 0.001