
keras.models.load_model resets the optimizer's state #70

Open
SiLiKhon opened this issue Oct 14, 2021 · 15 comments

@SiLiKhon
(Moving an issue from the tf repo)

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes, mostly based on the example from https://www.tensorflow.org/guide/keras/save_and_serialize
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab (Linux 59a52e5448f6 5.4.104+ #1 SMP Sat Jun 5 09:50:34 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
  • TensorFlow installed from (source or binary): google colab version
  • TensorFlow version (use command below): v2.6.0-0-g919f693420e 2.6.0
  • Python version: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
  • Bazel version (if compiling from source): no
  • GCC/Compiler version (if compiling from source): no
  • CUDA/cuDNN version: 11.2
  • GPU model and memory: Tesla K80, 11441MiB

Describe the current behavior

When restoring a Keras model with keras.models.load_model, the returned model's optimizer is in a reset state (e.g. its weights attribute is empty).

Describe the expected behavior

The original call:

reconstructed_model = tf.keras.models.load_model("my_model")

should have restored and kept the optimizer's weights.

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np

def get_model():
    # Create a simple model.
    inputs = tf.keras.Input(shape=(32,))
    outputs = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


model = get_model()

# Train the model.
test_input = np.random.random((128, 32))
test_target = np.random.random((128, 1))
model.fit(test_input, test_target)

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
model.save("my_model")

# It can be used to reconstruct the model identically.
reconstructed_model = tf.keras.models.load_model("my_model")

print(reconstructed_model.optimizer.weights)

output:

4/4 [==============================] - 1s 4ms/step - loss: 0.1829
INFO:tensorflow:Assets written to: my_model/assets
[]
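
One way to make the reset explicit is to compare the number of optimizer variables before saving and after loading (a minimal sketch added for illustration, assuming the Adam optimizer from the snippet above):

print(len(model.optimizer.weights))                # non-empty: Adam's iteration counter and m/v slot variables
print(len(reconstructed_model.optimizer.weights))  # 0: the state was reset by load_model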

If we additionally provide a compile=False argument, the optimizer's weights are restored:

reconstructed_model = tf.keras.models.load_model("my_model", compile=False)
for w in reconstructed_model.optimizer.weights:
    print(w.shape)

output:

(32, 1)
(1,)
(32, 1)
(1,)

However, trying to use the restored optimizer fails with an exception:

reconstructed_model.compile(reconstructed_model.optimizer, loss="mean_squared_error")
reconstructed_model.fit(test_input, test_target)

output:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-22a4ff24818b> in <module>()
      1 reconstructed_model.compile(reconstructed_model.optimizer, loss="mean_squared_error")
----> 2 reconstructed_model.fit(test_input, test_target)

9 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    992           except Exception as e:  # pylint:disable=broad-except
    993             if hasattr(e, "ag_error_metadata"):
--> 994               raise e.ag_error_metadata.to_exception(e)
    995             else:
    996               raise

NotImplementedError: in user code:

    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:853 train_function  *
        return step_function(self, iterator)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:842 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:1286 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2849 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3632 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:835 run_step  **
        outputs = model.train_step(data)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:791 train_step
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:522 minimize
        return self.apply_gradients(grads_and_vars, name=name)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:660 apply_gradients
        apply_state)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:707 _distributed_apply
        var, apply_grad_to_update_var, args=(grad,), group=False)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2595 update
        var, fn, args=args, kwargs=kwargs, group=group)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2473 _replica_ctx_update
        return replica_context.merge_call(merge_fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3064 merge_call
        return self._merge_call(merge_fn, args, kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3071 _merge_call
        return merge_fn(self._strategy, *args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2471 merge_fn  **
        return self.update(var, fn, merged_args, merged_kwargs, group=group)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2592 update
        return self._update(var, fn, args, kwargs, group)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3646 _update
        return self._update_non_slot(var, fn, (var,) + tuple(args), kwargs, group)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3652 _update_non_slot
        result = fn(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:689 apply_grad_to_update_var  **
        update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:1241 _resource_apply_dense
        raise NotImplementedError("Must be implemented in subclasses.")

    NotImplementedError: Must be implemented in subclasses.
@jvishnuvardhan
Contributor

@SiLiKhon I think the error is expected, because when you print the restored optimizer:

reconstructed_model = tf.keras.models.load_model("my_model", compile=False)
for w in reconstructed_model.optimizer.weights:
    print(w.shape)

print(reconstructed_model.optimizer)  # outputs <keras.optimizer_v2.optimizer_v2.RestoredOptimizer at 0x7fdebd716950>

The optimizer above is a RestoredOptimizer placeholder that Keras cannot apply directly, so it throws NotImplementedError.

Alternatively, you can save the weights along with the model. When loading, load the model first and then load the weights, as shown below:

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
model.save("my_model")
model.save_weights('my_weights')

# It can be used to reconstruct the model identically.
reconstructed_model = tf.keras.models.load_model("my_model")
reconstructed_model.load_weights('my_weights')

reconstructed_model.compile(reconstructed_model.optimizer, loss="mean_squared_error")
# reconstructed_model.compile(optimizer="adam", loss="mean_squared_error")
reconstructed_model.fit(test_input, test_target)

Please check the gist here

I am not sure about your use case. If you want to resume training the model from where it left off, you can load the model and retrain (without recompiling). Thanks!
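
Putting that suggestion together with the snippet above, resuming training without recompiling might look roughly like this (a sketch, not from the original comment; test_input/test_target come from the reproduction script):

reconstructed_model = tf.keras.models.load_model("my_model")
reconstructed_model.load_weights('my_weights')

# No recompile: the loaded model is already compiled, and the TF-format
# checkpoint written by save_weights should also contain the optimizer variables.
reconstructed_model.fit(test_input, test_target)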

@SiLiKhon
Author

@jvishnuvardhan thanks for the reply!

To be honest, I don't quite understand: I thought that a call to tf.keras.models.load_model should be enough to restore both the model and the optimizer. That's what seems to be implied in the example from this section of the tutorial, e.g. there is the following comment in that code example:

# The reconstructed model is already compiled and has retained the optimizer
# state, so training can resume:

If, however, you run the example code, the optimizer's state is not restored. Adding separate save_weights and load_weights calls as in your snippet does fix the issue, but to me it's quite counterintuitive that both save/load_model and save_weights/load_weights are needed.

I am not sure about your use-case. If you want to retrain the model from where it was left, you can load model and retrain (without recompiling).

That's exactly my use case. And I'm saying that a pair of save/load_model calls alone doesn't do the job here :)

@bhack
Contributor

bhack commented Oct 19, 2021

For Adam and other optimizers with slots, I think we still have tensorflow/tensorflow#44670.

@jvishnuvardhan
Contributor

@k-w-w Please take a look at this issue. Thanks

@adriangb
Contributor

adriangb commented Nov 9, 2021

As per tensorflow/tensorflow#44670 (comment), this is solvable. @bhack, what is holding up the fix? Is the implementation really that hard?

@bhack
Contributor

bhack commented Nov 9, 2021

@adriangb If you want my opinion, as a first step I would try to write a PR with some expected-failure tests covering this missing feature.

E.g. like in https://github.com/tensorflow/tensorflow/pull/51538/files

Once the test PR is approved and the team agrees on the use-case coverage of this feature, I think we could wait for another user-contributed PR that makes these tests pass, so we can remove the expected-failure annotation and close this bug.

Sometimes features also get implemented through normal internal development, so IMHO the failing tests are still useful: if the issue is solved by internal work, they are a natural way to monitor open, confirmed tickets.

This is just my own view; others might not like the idea of handling a feature request (or bug) with only expected-failure tests.
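
For illustration, a rough sketch of what such an expected-failure test could look like (the test class, method name, and save path are hypothetical; a real test would follow the conventions of the Keras saving test suite):

import unittest

import numpy as np
import tensorflow as tf


class OptimizerStateSavedModelTest(unittest.TestCase):

    # Marked as an expected failure until the SavedModel path restores
    # optimizer slot variables.
    @unittest.expectedFailure
    def test_savedmodel_restores_optimizer_state(self):
        inputs = tf.keras.Input(shape=(32,))
        outputs = tf.keras.layers.Dense(1)(inputs)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="mean_squared_error")
        model.fit(np.random.random((8, 32)), np.random.random((8, 1)), verbose=0)
        model.save("/tmp/issue70_model")

        restored = tf.keras.models.load_model("/tmp/issue70_model")
        # Fails today: the restored optimizer has no slot variables.
        self.assertGreater(len(restored.optimizer.weights), 0)


if __name__ == "__main__":
    unittest.main()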

@janhartman

Looks like I accidentally created a dupe of this here: tensorflow/tensorflow#53064.
Since this might not get fixed soon, can you please put a disclaimer in the docs that this does not work? It was a big surprise to me, especially since it works flawlessly in TF1. Because it fails silently, it's very hard to catch and can have really bad consequences if it goes unnoticed.
I also couldn't find anything in the docs stating something like "you cannot restore an optimizer's state". I think this is core TF functionality and should not fail silently.

@bhack
Contributor

bhack commented Nov 17, 2021

@janhartman

@bhack Check out the notebook I linked in my issue: I don't see the warning in Colab or on my machine. Regardless of the warning, this should still be put into the docs.

@adriangb
Contributor

+1 for plastering this warning all over the docs. I would even go so far as to make this an error (only if fit is called). TensorFlow emits all sorts of warnings left and right, so even if this did emit a warning there would be a lot of noise to hide it. It's a relatively obscure and hard-to-detect bug, since you can only see it via the resulting weights/training results. I could easily see this causing great harm to research projects or real-world applications.
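
Until something along those lines lands, a defensive check before resuming training can at least turn the silent reset into a loud failure (a sketch, not part of any Keras API; test_input/test_target come from the reproduction script):

reconstructed_model = tf.keras.models.load_model("my_model")
if not reconstructed_model.optimizer.weights:
    raise RuntimeError(
        "Optimizer state was not restored from the SavedModel; "
        "resuming training would silently restart the optimizer."
    )
reconstructed_model.fit(test_input, test_target)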

@bhack
Contributor

bhack commented Nov 17, 2021

@adriangb For some historical context, you can read the discussion around this warning in tensorflow/tensorflow#42846.

I still believe that an expected-failure test PR like https://github.com/tensorflow/tensorflow/pull/51538/files could really help and also support a community fix. At the very least, once the failing tests are merged we would have a good overview of which tests need to pass to implement/resolve this missing feature.

Are you interested in contributing a PR that extends the tests with this (expected) failing case?

@adriangb
Contributor

keras-team/keras#15661

@BenjaminChoou

> Looks like I created a dupe of this by accident here: tensorflow/tensorflow#53064. Since this might not get fixed soon, can you please put a disclaimer into the docs that this does not work? It was a big surprise to me, especially since it works flawlessly in TF1. Due to it failing silently, it's very hard to catch and can have really bad consequences if it's not caught. I also couldn't find anything in the docs that states anything similar to ("you cannot restore an optimizer's state"). I think this is a core functionality of TF and should not fail silently.

Same situation here. But I find that if the model is saved in the H5 format, the optimizer state is restored. Is this a bug in the SavedModel format?
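
For comparison, a sketch of the H5 variant described above (the .h5 extension selects the HDF5 serialization path; whether the state survives can be checked the same way as before):

model.save("my_model.h5")  # HDF5 format instead of SavedModel

reconstructed_h5 = tf.keras.models.load_model("my_model.h5")
print(len(reconstructed_h5.optimizer.weights))  # reportedly non-empty, unlike the SavedModel case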

@adriangb
Contributor

Yes, it is primarily a bug in SavedModel. I believe that, as you say, H5 works fine.
