Add a loss scale optimizer #851
Conversation
This is the big missing piece we need for feature parity when running mixed precision training compared to tf.keras.
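To make the intended workflow concrete, here is a minimal usage sketch mirroring tf.keras mixed precision training. The constructor arguments, the `SGD` inner optimizer, and the `dtype="mixed_float16"` handling are assumptions for illustration; only the `LossScaleOptimizer` export paths come from this PR.

```python
import numpy as np
import keras_core
from keras_core import optimizers

# Wrap an inner optimizer: the loss is scaled up before the backward pass so
# small float16 gradients do not underflow to zero, and the gradients are
# unscaled again before the inner optimizer applies them (assumed behavior,
# mirroring tf.keras).
optimizer = optimizers.LossScaleOptimizer(optimizers.SGD(learning_rate=0.1))

model = keras_core.Sequential(
    [keras_core.layers.Dense(1, dtype="mixed_float16")]
)
model.compile(optimizer=optimizer, loss="mse")
model.fit(np.ones((8, 4), "float32"), np.ones((8, 1), "float32"), verbose=0)
```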
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #851 +/- ##
==========================================
+ Coverage 75.99% 76.09% +0.09%
==========================================
Files 328 329 +1
Lines 31099 31269 +170
Branches 6051 6083 +32
==========================================
+ Hits 23635 23793 +158
- Misses 5866 5874 +8
- Partials 1598 1602 +4
Flags with carried forward coverage won't be shown.
A few points of awkwardness/discussion:
Looks like there is some sort of device placement issue for [Edit: now working on all backends]
Thanks for the PR!
Addressed the initial round, though I may play with a test that deliberately triggers the underflow in the trainer and asserts that variable updates still appear. I don't think that should be too hard, but we will see.
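Roughly the shape such a test could take; this is only a sketch with a hypothetical test name and data, not the actual trigger for the underflow case:

```python
import numpy as np
import keras_core
from keras_core import optimizers


def test_mixed_precision_updates_variables():
    model = keras_core.Sequential(
        [
            keras_core.Input(shape=(2,)),
            keras_core.layers.Dense(2, dtype="mixed_float16"),
        ]
    )
    model.compile(
        optimizer=optimizers.LossScaleOptimizer(
            optimizers.SGD(learning_rate=1.0)
        ),
        loss="mse",
    )
    x = np.ones((8, 2), "float32")
    y = np.full((8, 2), 1e-4, "float32")
    before = [np.copy(w) for w in model.get_weights()]
    # The real test would construct a case where unscaled float16 gradients
    # underflow; here we only check that training through the wrapped
    # optimizer still updates the weights.
    model.fit(x, y, epochs=1, verbose=0)
    after = model.get_weights()
    assert any(not np.allclose(b, a) for b, a in zip(before, after))
```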
            return loss * self.loss_scale_factor
        return loss

    def stateless_scale_loss(self, optimizer_variables, loss):
I'm not sure this should take `optimizer_variables` -- it only reads the value of one variable. I'm also not sure it should exist at all: you could just use a stateless scope when you call it. No strong opinion though. What are the trade-offs?
Good idea. I like the stateless scope. That keeps the weirdness isolated to jax.
Overall the jax trainer is definitely a little wonky (pushing state through `aux` for `jax.value_and_grad` is kinda bleh), but the real grossness is the jax implementation of the loss scale optimizer itself. Basically, jax control flow cannot be written the same way as in tf/torch because a stateless scope is insufficient; you still have to return all state from each control flow callback. Not something to fix in this PR, and I'm not sure we can really do anything about it, but it does break the idea of a backend-agnostic graph of ops.
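For reference, a rough illustration of the stateless-scope alternative discussed above. The helper name is made up for the sketch, and it assumes `backend.StatelessScope(state_mapping=...)` and `optimizer.scale_loss(...)` behave as they do elsewhere in the codebase:

```python
from keras_core import backend


def scale_loss_stateless(optimizer, optimizer_variable_values, loss):
    # Map each optimizer variable to the value the jax trainer carries around;
    # inside the scope, reads (e.g. of the current loss scale factor) resolve
    # to those values and no state is mutated.
    mapping = list(zip(optimizer.variables, optimizer_variable_values))
    with backend.StatelessScope(state_mapping=mapping):
        return optimizer.scale_loss(loss)
```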
@keras_core_export(
    [
        "keras_core.optimizers.LossScaleOptimizer",
        "keras_core.mixed_precision.LossScaleOptimizer",
Do we need this export path for backward compat? It's a bit awkward to have an optimizer in the mixed precision namespace. If it doesn't break too many people, I'd just drop it.
Yeah, IIUC, this was the only name it was ever exposed as, so if we want backward compat we need this alias.
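A quick illustration of what the alias buys us (assuming both generated namespace modules are importable):

```python
# Code written against the tf.keras-style name keeps working, since both
# export paths resolve to the same class.
from keras_core import mixed_precision, optimizers

assert mixed_precision.LossScaleOptimizer is optimizers.LossScaleOptimizer
```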
Added an end-to-end test only looking at variable updates across
Great work! LGTM
Fixes #571