
keras.models.load_model resets the optimizer's state #70

Open
SiLiKhon opened this issue Oct 14, 2021 · 15 comments

@SiLiKhon
(Moving an issue from the tf repo)

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes, mostly based on the example from https://www.tensorflow.org/guide/keras/save_and_serialize
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab (Linux 59a52e5448f6 5.4.104+ #1 SMP Sat Jun 5 09:50:34 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
  • TensorFlow installed from (source or binary): google colab version
  • TensorFlow version (use command below): v2.6.0-0-g919f693420e 2.6.0
  • Python version: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
  • Bazel version (if compiling from source): no
  • GCC/Compiler version (if compiling from source): no
  • CUDA/cuDNN version: 11.2
  • GPU model and memory: Tesla K80, 11441MiB

Describe the current behavior

When restoring a Keras model with keras.models.load_model, the returned model's optimizer is in a reset state (e.g. its weights attribute is empty).

Describe the expected behavior

The original call:

reconstructed_model = tf.keras.models.load_model("my_model")

should have restored and kept the optimizer's weights.

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np

def get_model():
    # Create a simple model.
    inputs = tf.keras.Input(shape=(32,))
    outputs = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


model = get_model()

# Train the model.
test_input = np.random.random((128, 32))
test_target = np.random.random((128, 1))
model.fit(test_input, test_target)

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
model.save("my_model")

# It can be used to reconstruct the model identically.
reconstructed_model = tf.keras.models.load_model("my_model")

print(reconstructed_model.optimizer.weights)

output:

4/4 [==============================] - 1s 4ms/step - loss: 0.1829
INFO:tensorflow:Assets written to: my_model/assets
[]
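
One way to make the reset explicit is to compare the number of optimizer variables before saving and after loading (a minimal sketch added for illustration, assuming the Adam optimizer from the snippet above):

print(len(model.optimizer.weights))                # non-empty: Adam's iteration counter and m/v slot variables
print(len(reconstructed_model.optimizer.weights))  # 0: the state was reset by load_model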

If we additionally provide a compile=False argument, the optimizer's weights are restored:

reconstructed_model = tf.keras.models.load_model("my_model", compile=False)
for w in reconstructed_model.optimizer.weights:
    print(w.shape)

output:

(32, 1)
(1,)
(32, 1)
(1,)

However, trying to use the restored optimizer fails with an exception:

reconstructed_model.compile(reconstructed_model.optimizer, loss="mean_squared_error")
reconstructed_model.fit(test_input, test_target)

output:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-22a4ff24818b> in <module>()
      1 reconstructed_model.compile(reconstructed_model.optimizer, loss="mean_squared_error")
----> 2 reconstructed_model.fit(test_input, test_target)

9 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    992           except Exception as e:  # pylint:disable=broad-except
    993             if hasattr(e, "ag_error_metadata"):
--> 994               raise e.ag_error_metadata.to_exception(e)
    995             else:
    996               raise

NotImplementedError: in user code:

    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:853 train_function  *
        return step_function(self, iterator)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:842 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:1286 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2849 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3632 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:835 run_step  **
        outputs = model.train_step(data)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:791 train_step
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:522 minimize
        return self.apply_gradients(grads_and_vars, name=name)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:660 apply_gradients
        apply_state)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:707 _distributed_apply
        var, apply_grad_to_update_var, args=(grad,), group=False)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2595 update
        var, fn, args=args, kwargs=kwargs, group=group)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2473 _replica_ctx_update
        return replica_context.merge_call(merge_fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3064 merge_call
        return self._merge_call(merge_fn, args, kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3071 _merge_call
        return merge_fn(self._strategy, *args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2471 merge_fn  **
        return self.update(var, fn, merged_args, merged_kwargs, group=group)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2592 update
        return self._update(var, fn, args, kwargs, group)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3646 _update
        return self._update_non_slot(var, fn, (var,) + tuple(args), kwargs, group)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3652 _update_non_slot
        result = fn(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:689 apply_grad_to_update_var  **
        update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
    /usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:1241 _resource_apply_dense
        raise NotImplementedError("Must be implemented in subclasses.")

    NotImplementedError: Must be implemented in subclasses.
@jvishnuvardhan
Contributor

@SiLiKhon I think the error is expected, because when you print the restored optimizer:

reconstructed_model = tf.keras.models.load_model("my_model", compile=False)
for w in reconstructed_model.optimizer.weights:
    print(w.shape)

print(reconstructed_model.optimizer)  # outputs <keras.optimizer_v2.optimizer_v2.RestoredOptimizer at 0x7fdebd716950>

The optimizer above is a RestoredOptimizer placeholder that Keras cannot apply directly, so it throws NotImplementedError.

Alternatively, you can save the weights along with the model. When loading, load the model first and then load the weights, as shown below:

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
model.save("my_model")
model.save_weights('my_weights')

# It can be used to reconstruct the model identically.
reconstructed_model = tf.keras.models.load_model("my_model")
reconstructed_model.load_weights('my_weights')

reconstructed_model.compile(reconstructed_model.optimizer, loss="mean_squared_error")
# reconstructed_model.compile(optimizer="adam", loss="mean_squared_error")
reconstructed_model.fit(test_input, test_target)

Please check the gist here

I am not sure about your use case. If you want to resume training the model from where it left off, you can load the model and retrain (without recompiling). Thanks!
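
Putting that suggestion together with the snippet above, resuming training without recompiling might look roughly like this (a sketch, not from the original comment; test_input/test_target come from the reproduction script):

reconstructed_model = tf.keras.models.load_model("my_model")
reconstructed_model.load_weights('my_weights')

# No recompile: the loaded model is already compiled, and the TF-format
# checkpoint written by save_weights should also contain the optimizer variables.
reconstructed_model.fit(test_input, test_target)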

@SiLiKhon
Author

@jvishnuvardhan thanks for the reply!

To be honest, I don't quite understand: I thought that a call to tf.keras.models.load_model should be enough to restore both the model and the optimizer. That's what seems to be implied in the example from this section of the tutorial, e.g. there is the following comment in that code example:

# The reconstructed model is already compiled and has retained the optimizer
# state, so training can resume:

If, however, you run the example code, the optimizer's state is not restored. Adding separate save_weights and load_weights calls as in your snippet does fix the issue, but to me it's quite counterintuitive that both save/load_model and save_weights/load_weights are needed.

I am not sure about your use-case. If you want to retrain the model from where it was left, you can load model and retrain (without recompiling).

That's exactly my use case. And I'm saying that a pair of save/load_model calls alone doesn't do the job here :)

@bhack
Contributor

bhack commented Oct 19, 2021

For Adam and other optimizers with slots, I think we still have tensorflow/tensorflow#44670.

@jvishnuvardhan
Contributor

@k-w-w Please take a look at this issue. Thanks

@adriangb
Contributor

adriangb commented Nov 9, 2021

As per tensorflow/tensorflow#44670 (comment), this is solvable. @bhack, what is holding up the fix? Is the implementation really that hard?

@bhack
Contributor

bhack commented Nov 9, 2021

@adriangb If you want my opinion, as a first step I would try to write a PR with some expected-failure tests covering this missing feature.

E.g. like in https://github.com/tensorflow/tensorflow/pull/51538/files

Once the test PR is approved and the team agrees on the use-case coverage of this feature, I think we could wait for another user-contributed PR that makes these tests pass, so we can remove the expected-failure annotation and close this bug.

Sometimes features also get implemented through normal internal development, so IMHO the failing tests are still useful: if the issue is solved by internal work, they are a natural way to monitor open, confirmed tickets.

This is just my own view; others might not like the idea of handling a feature request (or bug) with only expected-failure tests.
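
For illustration, a rough sketch of what such an expected-failure test could look like (the test class, method name, and save path are hypothetical; a real test would follow the conventions of the Keras saving test suite):

import unittest

import numpy as np
import tensorflow as tf


class OptimizerStateSavedModelTest(unittest.TestCase):

    # Marked as an expected failure until the SavedModel path restores
    # optimizer slot variables.
    @unittest.expectedFailure
    def test_savedmodel_restores_optimizer_state(self):
        inputs = tf.keras.Input(shape=(32,))
        outputs = tf.keras.layers.Dense(1)(inputs)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="mean_squared_error")
        model.fit(np.random.random((8, 32)), np.random.random((8, 1)), verbose=0)
        model.save("/tmp/issue70_model")

        restored = tf.keras.models.load_model("/tmp/issue70_model")
        # Fails today: the restored optimizer has no slot variables.
        self.assertGreater(len(restored.optimizer.weights), 0)


if __name__ == "__main__":
    unittest.main()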

@janhartman

Looks like I accidentally created a dupe of this here: tensorflow/tensorflow#53064.
Since this might not get fixed soon, can you please put a disclaimer in the docs that this does not work? It was a big surprise to me, especially since it works flawlessly in TF1. Because it fails silently, it's very hard to catch and can have really bad consequences if it goes unnoticed.
I also couldn't find anything in the docs stating something like "you cannot restore an optimizer's state". I think this is core TF functionality and should not fail silently.

@bhack
Contributor

bhack commented Nov 17, 2021

@janhartman

@bhack Check out the notebook I linked in my issue: I don't see the warning in Colab or on my machine. Regardless of the warning, this should still be put into the docs.

@adriangb
Contributor

+1 for plastering this warning all over the docs. I would even go so far as to make this an error (only if fit is called). TensorFlow emits all sorts of warnings left and right, so even if this did emit a warning there would be a lot of noise to hide it. It's a relatively obscure and hard-to-detect bug, since you can only see it via the resulting weights/training results. I could easily see this causing great harm to research projects or real-world applications.
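
Until something along those lines lands, a defensive check before resuming training can at least turn the silent reset into a loud failure (a sketch, not part of any Keras API; test_input/test_target come from the reproduction script):

reconstructed_model = tf.keras.models.load_model("my_model")
if not reconstructed_model.optimizer.weights:
    raise RuntimeError(
        "Optimizer state was not restored from the SavedModel; "
        "resuming training would silently restart the optimizer."
    )
reconstructed_model.fit(test_input, test_target)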

@bhack
Contributor

bhack commented Nov 17, 2021

@adriangb For some historical context, you can read the discussion around this warning in tensorflow/tensorflow#42846.

I still believe that an expected-failure test PR like https://github.com/tensorflow/tensorflow/pull/51538/files could really help and also support a community fix. At the very least, once the failing tests are merged we would have a good overview of which tests need to pass to implement/resolve this missing feature.

Are you interested in contributing a PR that extends the tests with this (expected) failing case?

@adriangb
Contributor

keras-team/keras#15661

@BenjaminChoou

> Looks like I created a dupe of this by accident here: tensorflow/tensorflow#53064. Since this might not get fixed soon, can you please put a disclaimer into the docs that this does not work? It was a big surprise to me, especially since it works flawlessly in TF1. Due to it failing silently, it's very hard to catch and can have really bad consequences if it's not caught. I also couldn't find anything in the docs that states anything similar to ("you cannot restore an optimizer's state"). I think this is a core functionality of TF and should not fail silently.

Same situation here. But I find that if the model is saved in the H5 format, the optimizer state is restored. Is this a bug in the SavedModel format?
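
For comparison, a sketch of the H5 variant described above (the .h5 extension selects the HDF5 serialization path; whether the state survives can be checked the same way as before):

model.save("my_model.h5")  # HDF5 format instead of SavedModel

reconstructed_h5 = tf.keras.models.load_model("my_model.h5")
print(len(reconstructed_h5.optimizer.weights))  # reportedly non-empty, unlike the SavedModel case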

@adriangb
Contributor

Yes, it is primarily a bug in SavedModel. I believe that, as you say, H5 works fine.
