Trouble loading checkpoints #46

Open
xoffey opened this issue Jan 9, 2018 · 1 comment
Comments

@xoffey

xoffey commented Jan 9, 2018

I need to be able to resume training on Breakout-v0 after stopping it. I would also like to be able to move a checkpoint directory to another machine and resume training there.

When I train on my laptop, which runs Ubuntu 14.04, I can resume after stopping. But on the faster machine I really want to use, I cannot resume after stopping. That machine runs Ubuntu 16.04, FWIW.

Both machines use TensorFlow 1.3.0. The working laptop uses Python 3.6 and the non-working machine uses Python 3.5.2. OpenAI Gym is version 0.9.4 on both machines, as installed by pip. Neither machine uses a GPU, and both use NHWC.

On both machines, I cloned the devsisters/DQN-tensorflow repository and manually fixed the bugs that prevent it from working with Python 3.x.

```
~/DQN-tensorflow$ python main.py --env_name=Breakout-v0 --is_train=True --display=False

[*] GPU : 1.0000
{'_save_step': 500000,
'_test_step': 50000,
'action_repeat': 4,
'backend': 'tf',
'batch_size': 32,
'cnn_format': 'NHWC',
'discount': 0.99,
'display': False,
'double_q': False,
'dueling': False,
'env_name': 'Breakout-v0',
'env_type': 'detail',
'ep_end': 0.1,
'ep_end_t': 1000000,
'ep_start': 1.0,
'history_length': 4,
'learn_start': 50000.0,
'learning_rate': 0.00025,
'learning_rate_decay': 0.96,
'learning_rate_decay_step': 50000,
'learning_rate_minimum': 0.00025,
'max_delta': 1,
'max_reward': 1.0,
'max_step': 50000000,
'memory_size': 1000000,
'min_delta': -1,
'min_reward': -1.0,
'model': 'm1',
'random_start': 30,
'scale': 10000,
'screen_height': 84,
'screen_width': 84,
'target_q_update_step': 10000,
'train_frequency': 4}
WARNING:tensorflow:From /home/mjc/DQN-tensorflow/dqn/agent.py:224: calling argmax (from tensorflow.python.ops.math_ops) with dimension is deprecated and will be removed in a future version.
Instructions for updating:
Use the axis argument instead
WARNING:tensorflow:From /opt/anaconda/miniconda3/envs/tfbuild/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py:107: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.

[*] Loading checkpoints...
[!] Load FAILED: checkpoints/Breakout-v0/backend-tf/ep_end-0.1/model-m1/screen_width-84/env_type-detail/learning_rate-0.00025/learning_rate_minimum-0.00025/memory_size-1000000/env_name-Breakout-v0/dueling-False/learning_rate_decay-0.96/batch_size-32/min_delta--1/max_reward-1.0/learn_start-50000.0/double_q-False/max_delta-1/scale-10000/random_start-30/cnn_format-NHWC/discount-0.99/min_reward--1.0/action_repeat-4/learning_rate_decay_step-50000/ep_start-1.0/history_length-4/target_q_update_step-10000/ep_end_t-1000000/train_frequency-4/max_step-50000000/screen_height-84/
```

How can this problem be fixed?
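
One possible cause, offered as a guess and not a confirmed diagnosis: the settings in the failed path appear in a different order from the sorted config printout, which suggests the directory name may be built by iterating over a dict. On Python 3.5 the iteration order of a dict with string keys can change between runs because of hash randomization, while CPython 3.6 keeps insertion order, so a path built that way could differ on every run under 3.5 and never match the directory written earlier. A minimal sketch of an order-independent path builder follows. The names `stable_model_dir` and `same_settings` are hypothetical and are not part of DQN-tensorflow; skipping `display` and underscore-prefixed keys is an assumption based on those keys being absent from the failed path.

```python
def stable_model_dir(env_name, config):
    """Build a checkpoint path whose segment order does not depend on
    dict iteration order (hypothetical helper, not the repository's code)."""
    parts = ["checkpoints", env_name]
    for key in sorted(config):  # sorted keys give the same order on every run
        # Assumption: these keys are left out, as in the failed path above.
        if key.startswith("_") or key == "display":
            continue
        parts.append("%s-%s" % (key, config[key]))
    return "/".join(parts) + "/"


def same_settings(path_a, path_b):
    """Return True if two checkpoint paths hold the same segments,
    regardless of the order in which the segments appear."""
    def segments(path):
        return sorted(s for s in path.split("/") if s)
    return segments(path_a) == segments(path_b)
```

If this is the cause, `same_settings` could be used to compare the path in the `Load FAILED` message with the directories that actually exist under `checkpoints/Breakout-v0/`, to see whether one of them holds the same settings in a different order.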

@Martellacci

I have the same problem.
