
Saved model starts with initial loss and accuracy values after loading #12263

Closed
ParikhKadam opened this issue Feb 13, 2019 · 14 comments
Labels: To investigate (Looks like a bug. It needs someone to investigate.), type:bug/performance

Comments

ParikhKadam commented Feb 13, 2019

I am building a model for machine comprehension. It's a heavy model that needs to be trained on a lot of data, which takes a lot of time. I have used Keras callbacks to save the model after every epoch and also to save a history of loss and accuracy.

The problem is that when I load a trained model and try to continue its training using the initial_epoch argument, the loss and accuracy values are the same as for an untrained model.
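
To make the workflow concrete, here is a minimal, self-contained sketch of the save-per-epoch / resume pattern I am describing (toy model and file names, not the actual repo code):

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint, CSVLogger

x, y = np.random.rand(256, 10), np.random.randint(0, 2, (256, 1))

model = Sequential([Dense(16, activation='relu', input_shape=(10,)), Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Save the full model (architecture + weights + optimizer state) after every
# epoch, and append loss/accuracy to a CSV history file.
callbacks = [ModelCheckpoint('model_{epoch:02d}.h5'), CSVLogger('history.csv', append=True)]
model.fit(x, y, epochs=5, callbacks=callbacks)

# Later: reload the last checkpoint and resume from where training stopped.
model = load_model('model_05.h5')
model.fit(x, y, epochs=10, initial_epoch=5, callbacks=callbacks)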

Here is the code: https://github.com/ParikhKadam/bidaf-keras
The code used to save and load the model is in /models/bidaf.py.
The script I am using to load the model is:


from .models import BidirectionalAttentionFlow
from .scripts.data_generator import load_data_generators
import os
import numpy as np


def main():
    emdim = 400
    bidaf = BidirectionalAttentionFlow(emdim=emdim, num_highway_layers=2,
                                       num_decoders=1, encoder_dropout=0.4, decoder_dropout=0.6)
    bidaf.load_bidaf(os.path.join(os.path.dirname(__file__), 'saved_items', 'bidaf_29.h5')) 
    train_generator, validation_generator = load_data_generators(batch_size=16, emdim=emdim, shuffle=True)
    model = bidaf.train_model(train_generator, epochs=50, validation_generator=validation_generator, initial_epoch=29, 
                              save_history=False, save_model_per_epoch=False)


if __name__ == '__main__':
    main()

The training history is quite good:

epoch,accuracy,loss,val_accuracy,val_loss
0,0.5021367247352657,5.479433422293752,0.502228641179383,5.451400522458351
1,0.5028450897193741,5.234336488338403,0.5037527732234647,5.0748545675049
2,0.5036885394022954,5.042028017280698,0.5039489093881276,5.0298488218407975
3,0.503893446146289,4.996997425685413,0.5040753162241299,4.976164487656699
4,0.5040576918224873,4.955544574118662,0.5041905890181151,4.931354981493792
5,0.5042372655790888,4.909940965651957,0.5043896965802341,4.881359395178988
6,0.504458428129642,4.8542871887472465,0.5045972716586732,4.815464454729135
7,0.50471843351102,4.791098495962496,0.5048680457262408,4.747811231472629
8,0.5050776754196002,4.713560494026321,0.5054184527602898,4.64730478015052
9,0.5058853749443502,4.580552254050073,0.5071290369370443,4.446513280167718
10,0.5081544614246304,4.341471499420364,0.5132941329030303,4.145318906086552
11,0.5123970410575613,4.081624463197288,0.5178775145611896,4.027316586998608
12,0.5149879128865782,3.9577423109634613,0.5187159608315838,3.950151870168726
13,0.5161411008840144,3.8964761709052578,0.5191430166876064,3.906301355196609
14,0.5168211272672539,3.8585826589385697,0.5191263493850466,3.865382308412537
15,0.5173216891201444,3.830764191839807,0.519219763635108,3.8341492204942607
16,0.5177805591697787,3.805340048675155,0.5197178382215892,3.8204319018292585
17,0.5181171635676399,3.7877712072310343,0.5193657963810704,3.798006804522368
18,0.5184295824699279,3.77086071548255,0.5193122694008523,3.7820449101377243
19,0.5187343664397653,3.7555085003534194,0.5203585262348183,3.776260506494833
20,0.519005008308583,3.7430062334375065,0.5195983755362352,3.7605361109533995
21,0.5192872482429703,3.731001830462149,0.5202017035842986,3.7515058917231405
22,0.5195097722222706,3.7194103983513553,0.5207148585133065,3.7446572377159795
23,0.5197511249107636,3.7101052441559905,0.5207420740297026,3.740088335181619
24,0.5199862479678652,3.701593302911729,0.5200187951731082,3.7254406861185188
25,0.5200847805044403,3.6944093077914464,0.520112738649039,3.7203616696860786
26,0.5203289568582412,3.6844954882274092,0.5217114634669081,3.7214983577364547
27,0.5205629846610852,3.6781935968943595,0.520915311442328,3.705435317731209
28,0.5206827641463226,3.6718110897539193,0.5214088439286978,3.7003081666703377

Also, I have already taken care of loading custom objects such as layers, the loss function, and the accuracy metric.

I am kind of frustrated by now, as it took me days to train this model up to 29 epochs, and now I can't resume training. I have gone through various threads in the Keras issues and found that many people are facing such issues, but I can't find a solution.

Someone in a thread said that "Keras will not save RNN states" (I am not using stateful RNNs), and someone else said that "Keras reinitializes all the weights before saving, which we can handle using a flag." I mean, if such problems exist in Keras, what is the use of functions like save()?

I have also tried saving only the weights after every epoch, then building the model from scratch and loading those weights into it. But that didn't work either. You can find the old code I used to save weights only in the older branches of the GitHub repo listed above.
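
For reference, the weights-only variant boils down to something like this (toy sketch; note that load_weights() restores the parameters but not the optimizer state):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def build_model():
    m = Sequential([Dense(16, activation='relu', input_shape=(10,)), Dense(1, activation='sigmoid')])
    m.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return m

x, y = np.random.rand(128, 10), np.random.randint(0, 2, (128, 1))

model = build_model()
model.fit(x, y, epochs=3)
model.save_weights('weights_03.h5')

# Rebuild from scratch and load only the weights; the optimizer starts fresh.
resumed = build_model()
resumed.load_weights('weights_03.h5')
resumed.fit(x, y, epochs=6, initial_epoch=3)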

I have gone through this issue with no help: #4875

That issue has been open for the past two years. I can't understand what all the developers are doing! Is there anyone here who can help? Should I switch to TensorFlow, or will I face the same issues there too?

Please help...

@ParikhKadam changed the title from "Saved model starts with intital loss and accuracy values after loading" to "Saved model starts with initital loss and accuracy values after loading" on Feb 13, 2019
@gabrieldemarmiesse added the "To investigate (Looks like a bug. It needs someone to investigate.)" label on Feb 13, 2019
@ParikhKadam (Author)

@fchollet I haven't tried saving the model using model.save(), but I have seen people on other threads saying that the issue was solved with model.save() and models.save_model(). If it is actually solved, ModelCheckpoint should also save the optimizer state so that training can be resumed, but it doesn't (or can't), whatever the reason. I have verified the code of the ModelCheckpoint callback, which indirectly calls model.save(), which in turn calls models.save_model(). So theoretically, if the issue is solved in the base, i.e. models.save_model(), then it should also be solved in the other functions.
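
The chain I am referring to looks roughly like this (paraphrased from the Keras 2.x source, so exact file names and details may differ between versions):

# Paraphrased call chain (not verbatim Keras source; version-dependent):
#
# keras/callbacks.py -> ModelCheckpoint.on_epoch_end():
#     if self.save_weights_only:
#         self.model.save_weights(filepath, overwrite=True)
#     else:
#         self.model.save(filepath, overwrite=True)
#
# keras/engine/network.py -> Network.save():
#     save_model(self, filepath, overwrite, include_optimizer)
#
# keras/engine/saving.py -> save_model():
#     serializes the architecture, the weights and, when include_optimizer=True,
#     the optimizer configuration and its weights (the optimizer state).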

Sorry, but I don't have a powerful machine to check this in practice. If someone here has one, I have shared my code on GitHub and the link is provided in the issue. Please try resuming training on it and find the cause of this problem.

I am using a computer provided by a national institute, and students here need to share this single computer for their projects, so I can't use it for such tasks. Thank you.

@ParikhKadam (Author)

Recently, I tried to check whether the weights are saved correctly. For that, I evaluated the loaded model with my validation generator. I saw that the output loss and accuracy remained the same as those at the beginning of model training. Seeing this, I reached the conclusion that it is actually an issue with saving the weights of the model. I might be wrong here.
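
The check itself was essentially this (sketch; I am omitting the custom_objects dict that the real loading code in the repo passes):

from keras.models import load_model
from .scripts.data_generator import load_data_generators

_, validation_generator = load_data_generators(batch_size=16, emdim=400, shuffle=True)

# custom_objects omitted here; the real code registers the custom layers,
# loss and accuracy before loading.
loaded = load_model('saved_items/bidaf_29.h5')
print(loaded.evaluate_generator(validation_generator, steps=100))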

BTW, I have also used multi_gpu_model() in my model code. Can that cause this issue? I can't try training the model on a CPU, as the model is too heavy for that and would take a few days for just one epoch to complete. Can anyone help with debugging?

I see no response to such issues these days. Just list the current issues in the README.md of the Keras GitHub repo so that users can be aware of them before trying Keras out and wasting months on it.

@ParikhKadam (Author)

@msymp @gabrieldemarmiesse
It's been a day here with no update or replies, and this issue is still left unassigned. Now I am just asking a single question; please reply with a yes or no. Do you people really want to maintain Keras?

If you don't, just write out the issues and a disclaimer on the main Keras page. I have seen many issues simply left without a solution. Only issues where the author directly points out the problem are considered here. But if someone can directly point out the issue, he/she can usually also solve it; such people mostly open issues to request a feature or report a bug, while the others, who really need help, are left unconsidered.

@ParikhKadam (Author)

I thought the problem was with multi_gpu_model, but it isn't.

I tried this code here, which saves and loads the serial model whenever these functions are called on the parallel model, but that didn't work either.

from keras import Model
from keras.utils import multi_gpu_model


class ModelMGPU(Model):
    def __init__(self, ser_model, gpus):
        pmodel = multi_gpu_model(ser_model, gpus)
        self.__dict__.update(pmodel.__dict__)
        self._smodel = ser_model

    def __getattribute__(self, attrname):
        '''Override load and save methods to be used from the serial model.
           The serial model holds references to the weights in the multi-GPU
           model.'''
        if 'load' in attrname or 'save' in attrname:
            return getattr(self._smodel, attrname)
        else:
            # return Model.__getattribute__(self, attrname)
            return super(ModelMGPU, self).__getattribute__(attrname)
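
For completeness, this is roughly how I am wiring the wrapper up (toy model, illustrative file name; it assumes a machine with at least two GPUs):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

x, y = np.random.rand(256, 10), np.random.randint(0, 2, (256, 1))

ser_model = Sequential([Dense(16, activation='relu', input_shape=(10,)), Dense(1, activation='sigmoid')])

pmodel = ModelMGPU(ser_model, gpus=2)
pmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
pmodel.fit(x, y, epochs=1)

# save() is intercepted by __getattribute__ and delegated to the serial model.
pmodel.save('checkpoint.h5')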

Anything else that I can try?

@gabrieldemarmiesse Is anyone actually investigating, or is that tag just for show? This is my third issue here on Keras, and both of the previous two issues were solved by me without anyone helping. This time, I have to deal with the internals of Keras rather than my model, so I am counting on someone's help.

A heavy model, a lack of resources, and now this... what can I say!

@ParikhKadam (Author)

@trane293 I saw your comment here: #2378 (comment)

I tried contacting you earlier, asking you to share your piece of code so that I could verify it against mine. Can you help with this issue, please?

@ParikhKadam (Author)

UPDATE: I tried removing the ModelMGPU wrapper as well as multi_gpu_model entirely, i.e. I trained the model on a single GPU, and it still didn't work!

@jvishnuvardhan @gabrieldemarmiesse Any updates?

mthaha123 commented Mar 19, 2019

@ParikhKadam Hello, I have the same problem. I reload the weights but get a different, larger loss. I checked the weights. Can you tell me how to solve this problem?

@ParikhKadam (Author)

@mthaha123 Yes... can you share your code?

mthaha123 commented Mar 20, 2019

@ParikhKadam Here is my code: https://github.com/mthaha123/nice_glow
Let me explain my code and the problem. My code implements the generative model NICE. Before saving the model weights, the loss value was in the thousands, but after saving the weights, reloading the model, and loading the weights, the loss value is in the millions. That is to say, I can't use the model to get correct output.
Can you help me find the reason?


ParikhKadam commented Mar 20, 2019

@mthaha123 I looked at your code. It seems that you have implemented custom objects (layers) in your model. When you do that, you also need to tell the model which objects are custom and how to load each of them.

Have a look at the custom layer I implemented here. Try to understand the get_config method and implement it in your custom layers. That will solve your problem... maybe!
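
Roughly, the pattern looks like this (a minimal sketch with a made-up ScaleLayer; your layer, loss and metric names will differ):

from keras.layers import Layer
from keras.models import load_model

class ScaleLayer(Layer):
    '''Toy custom layer: multiplies its input by a configurable scale.'''
    def __init__(self, scale=2.0, **kwargs):
        super(ScaleLayer, self).__init__(**kwargs)
        self.scale = scale

    def call(self, inputs):
        return inputs * self.scale

    def get_config(self):
        # Without this, load_model() cannot rebuild the layer with scale=...
        config = super(ScaleLayer, self).get_config()
        config['scale'] = self.scale
        return config

# When loading, every custom object has to be registered explicitly, e.g.:
# model = load_model('model.h5', custom_objects={'ScaleLayer': ScaleLayer,
#                                                'my_loss': my_loss,
#                                                'my_accuracy': my_accuracy})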

If you still have problems, ask me.

mthaha123 commented Mar 20, 2019

I think I found the reason that caused the problem. Thank you for your help! @ParikhKadam

@alirezaghader

@ParikhKadam Hello,
I faced a similar problem. I use multiple weights for a single variable, and when re-evaluating the model, it gives me a bigger loss (about 5 percent). I think the weights are saved with lower precision than the values the fit function calculated. I would be grateful to hear your comment about this.

@ParikhKadam (Author)

@alirezaghader Can you share your code so that I can have a look?

@yoyoanais

2021 and the same problem still exists... has anyone found a solution? Or should I maybe use PyTorch rather than Keras?
