Got different accuracy between history and evaluate #10014

z888888861 · 2018-04-23T11:14:26Z

I fit a model as follow: history = model.fit(x_train, y_train, epochs=50, verbose=1, validation_data=(x_val,y_val))

Got the answer :
Epoch 48/50
49/49 [==============================] - 0s 3ms/step - loss: 0.0228 - acc: 0.9796 - val_loss: 3.3064 - val_acc: 0.6923
Epoch 49/50
49/49 [==============================] - 0s 3ms/step - loss: 0.0186 - acc: 1.0000 - val_loss: 3.3164 - val_acc: 0.6923
Epoch 50/50
49/49 [==============================] - 0s 2ms/step - loss: 0.0150 - acc: 1.0000 - val_loss: 3.3186 - val_acc: 0.6923

While, when I try to evaluate my model in train set with model.evaluate(x_train,y_train)

I got this [4.552013397216797, 0.44897958636283875]

I have no idea how this happen? Thank you.

The text was updated successfully, but these errors were encountered:

SpecKROELLchen · 2018-04-23T13:03:58Z

I have the same issue that is why i post it also here.

First i tested my model after training in a separate session.
I tested if i had read in the data and labels in a wrong order or
if the preprocessing is different, but everything seemed fine.,
Then i thought i had an issue similar to this one ##4875

But then i implemented testing directly after training and also on the training dataset,
which should give the exact same result as the training accuracy is?!

I compile using the fit_generator:
(i know the model is massively overfitting here, but this should not be an issue since it is just for tests)
Epoch 16/17
72/73 [============================>.] - ETA: 0s - loss: 0.1138 - acc: 0.9648 - weighted_acc: 0.9648Epoch 00016: val_acc did not improve
73/73 [==============================] - 20s 274ms/step - loss: 0.1139 - acc: 0.9646 - weighted_acc: 0.9646 - val_loss: 0.9137 - val_acc: 0.6249 - val_weighted_acc: 0.6249
Epoch 17/17
72/73 [============================>.] - ETA: 0s - loss: 0.1059 - acc: 0.9661 - weighted_acc: 0.9661
73/73 [==============================] - 20s 280ms/step - loss: 0.1053 - acc: 0.9664 - weighted_acc: 0.9664 - val_loss: 0.5450 - val_acc: 0.7273 - val_weighted_acc: 0.7273
879/879 [==============================] - 6s 6ms/step

If i then use model.evaluate with the same batch_size the result is:
Test_accuracy: 63.663%

My guess is that the model trains correctly but at the end of training storing/ updating the model or model weights something goes wrong.

update: Sorry i forgot to mention that i use python 3.6 and keras 1.2.4 and to save the model I use keras.callbacks.ModelCheckpoint with the following setup:
ModelCheckpoint(Create_callbackfolder, monitor='val_acc', verbose=1, save_best_only=True)

Any help appriciated

datumbox · 2018-04-23T22:15:22Z

@z888888861 Do you use Batch Normalization layers? Are you fine-tuning the network (trainable=False for some of the layers)?

If not, there is a very high chance you are overfitting the network.

SpecKROELLchen · 2018-04-24T05:30:46Z

@datumbox "If not, there is a very high chance you are overfitting the network."
How should overfitting be the problem when he is testing the training data? Maybe I miss something here.

btw. i upgrades tf to version 1.0.7 with cuda 9.0 and the problem remains.
Does anyone has an idea how to fix this problem?

datumbox · 2018-04-24T06:47:46Z

@SpecKROELLchen Did not notice he was testing on training data. Are you using BN layers and finetuning? If yes you might be affected by what is currently discussed here #9965

SpecKROELLchen · 2018-04-25T11:35:34Z

@datumbox Sorry for the late response, and thanks for your response. I don't use BN layers but fine tuning.
I also checked you link, but did not understand how my problem is related to this one.
Basically the code snipped looks like this:

READ IN AND STUFF...

pmodel = resnet50.ResNet50(include_top=False, input_tensor=custom_input, weights="imagenet",
classes=20)
pmodel = pmodel.output
pmodel = GlobalAveragePooling2D()(pmodel)
#pmodel = Dropout(drop_out_rate)(pmodel)
predictions = Dense(num_classes, activation='sigmoid')(pmodel)
pmodel = Model(inputs=custom_input, outputs=predictions)
pmodel.compile(optimizer="Adam", loss=binary_crossentropy, metrics=["accuracy"])
callback = ModelCheckpoint(Folder, monitor = "val_acc", verbose=1, save_best_only=True)

history = pmodel.fit_generator(TRAIN_DATA, validation_data = VAL_DATA, shufle= True,
callbacks=callback)

score = pmodel.evaluate(TRAIN_DATA,verbose=1, batch_size=12)
print(pmodel.metrics_names[1], score[1]*100)

And the test-accuracy ON TRAINING DATA is always something between 50-70%, while
the train accuracy is like 95%.

I still did not solve this problem. So any help is appriciated.
@z888888861 since you first posted and did not response later on, did you maybe solve your problem?

datumbox · 2018-04-25T19:19:50Z

@SpecKROELLchen You are using BN layers, ResNet50 is full of them. Unfortunately you are also affected by how Keras implements this type of layer. Check the discussion on the PR for potential workarounds.

SpecKROELLchen · 2018-04-25T20:51:10Z

@datumbox Ah, thank you. I thought you meant the layers behind the basemodel. Thx, i will keep an eye on that.

SpecKROELLchen · 2018-04-27T06:10:46Z

Sorry i have to repost here again. I think my problem might still be a little bit different or a mix of the BN problem and another one.
I tested VGG16 and the training AND validation accuracy did almost not change.
It stayed between 68-69%.
Then i tested it with Xception model (should also have BN) but then it massively overfitted
and train_acc = 98%, while val_acc=70%.
And no matter what i tried (increasing drop out, l2 kernel regularizer, reduce learning rate for Adams optimizer, i could not handle the overfitting).
And on ResNet50, i have high train_acc AND val_acc (like 95%), while in reality or test mode the results are still poor.
I tried to switch every parameter but do not get good results.
Any help is appreciated.

SpecKROELLchen · 2018-04-28T12:15:19Z

Okay,
i still did not solve my problem and will describe my problem a little bit more in detail.
First step: reduced my problem to a 1 class binary classification as i have done it multiple times.

1.) Read in the data, doing padding when resizing the image
to the desired and needed target_shape.

2.) doing a train-test-split 80-20.

READ IN DATA AND LABELS...

# PREPROCESS the read in data
im_dat_gen = ImageDataGenerator(rescale=1. / 255,
                                rotation_range=40,
                                width_shift_range=0.2,
                                height_shift_range=0.2,
                                shear_range=0.2,
                                zoom_range=0.2,
                                fill_mode='nearest')

train_gen = im_dat_gen.flow(X_train, Y_train, batch_size=12)  # img_data_generator
val_gen = im_dat_gen.flow(X_val, Y_val)

Custom_input = Input(255,255,3)
base_model = vgg19.VGG19(include_top = False, input_tensor = input, weights = "imagenet", classes = num_classes)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(drop_out_rate)(x)
x = Dense(fully_connected_size, activation='relu', kernel_regularizer=regularizers.l2(kernel_regu))(x)  # ,input_dim=7 * 7 * 512
x = Dropout(drop_out_rate)(x)
# 1 CLASS BINARY PROBLEM
predictions = Dense(1, activation='sigmoid')(x) 
model = Model(inputs=Custon_input, outputs=predictions)

model.compile(optimizer="adam", loss="binary_crossentropy, metrics=["accuracy"])
# TB CALLBACKS
ckpt_callback = ModelCheckpoint(callbackfolder, monitor='val_acc', verbose=1, save_best_only=True)
tb_callback = TensorBoard("MyModel.h5", write_graph=True,
                              write_images=True)     #
cb_list = [ckpt_callback, tb_callback]
# FIT MODEL
history = model.fit_generator(train_gen,validation_data = val_gen, steps_per_epoch=steps_per_epoch,shuffle = True, epochs=num_epoch,
                                  callbacks=cb_list, validation_steps=validation_steps)
# TEST MODEL on training data (just to see if the weights etc. are saved correctly)
score = model.evaluate(X_train, Y_train, verbose=1, batch_size=batch_size)
print("%s: %.3f%%" % (model.metrics_names[1], score[1] * 100))

Comments:
I tested everything i can imagine, changing activation, not using dropout, not using freeze,
train n layers.
But the problem from my post above this one still exists.
The model does not seem to train at all.
while if i use resnet50 for the exact same code (just doing a different preprocessing),
my model trains, validation is very nice, but test on training data fails totally
(probably due to the BN issue).
I can assure that the labels and images are read in correctly
and that the training data do not contain much noise or anything.
I really need help :D.

ucohen · 2018-10-21T11:06:36Z

have the same issue, any updates on this issue?

mblouin02 · 2018-11-11T22:04:05Z

Same issue here. When training using fit_generator, I get a training and validation accuracy that are both much higher than the ones I get when I evaluate the model manually, on training and testing data.

tbagnoli · 2018-11-28T14:09:55Z

Same problem here. I'm training with fit_generator, using separate generators for the training and validation sets:

history = classifier.fit_generator(train_generator,
    steps_per_epoch=train_batches,
    validation_data=val_generator,
    validation_steps=val_batches,
    epochs=60, verbose = 0)

The loss and accuracy that are stored in history for the training and validation sets have simply nothing to do with the values I get from e.g., using

scores = classifier.evaluate(X_val, Y_val, verbose=0)
print('validation loss:', scores[0])
print('validation accuracy:', scores[1])

adityapatadia · 2019-02-02T05:16:50Z

Try to downgrade to keras 2.1.6 and see if you still face such issues. I could solve it by downgrading.

shivam2298 · 2019-02-04T04:10:57Z

Even I am facing a problem similar to this one. I am using resnet as my base model.
After 150 epochs, my train accuracy is 60% but when I use model.evaluate() on the train set I get 10%. Is this being caused due fit_generator or resnet50?

hrosspet · 2019-02-11T22:03:01Z

Try to downgrade to keras 2.1.6 and see if you still face such issues. I could solve it by downgrading.

Thanks @adityapatadia! This solved my problem, too.

Issuenate · 2019-04-26T04:41:28Z

I suggest you guys can save model/weights and load it and test, that might avoid the problem.

midneet · 2019-05-13T02:52:23Z

I also got this problem, and when I save model/weights and load it to test, this problem just occurs. Also applying fine-tune with fit_generator; the accuracy of training data using fit_geterator and evaluate are greatly different.

amintavakol · 2019-07-17T03:43:32Z

Loading the model from the saved model and evaluating it on the test set gives me 16% while
the reported validation accuracy after the last epoch (during training the model) is 85%.
I'm using fit_generator and real-time data augmentation.
Keras version: 2.2.4, with tensorflow backend: 1.13.1
python: 2.7.16

Issuenate · 2019-07-17T03:57:15Z

Try the following might help:

build your network again in your test phrase with the same code for building your network when you were training.
load weights from the trained model/weights.

amintavakol · 2019-07-17T04:00:46Z

what is the difference between what you suggested and load_model method?
Do you mean to set_weights for each layer separately?

Issuenate · 2019-07-17T04:35:33Z

if there is your own custom layers/functions in your model.
if you didn't define well or what, you might not load the model correctly so it might come to the terrible test accuracy.
i used the "load_weights" methods and it's fine to avoid those problems.

it is just my suggestion, you can try it tho. Or check if you load your custom layers/functions correctly.

tinamautd · 2020-07-11T03:36:29Z

Hello, I think I got similar issue. Actually when I use keras.fit() to train the model and use the training data as validation data, I get different training accuracy and validation accuracy during the training process. Have you found the solution for the different training/test accuracy?

RaviBansal7717 · 2020-08-04T12:02:15Z

I have found a fix for this issue.
I encountered this issue while using ImageDataGenerators and obviously my model accuracies on both training and validation set were far more lower when using mode.evaluate() as compared to values returned from model history.
So the fix is to set shuffle=False while creating your validation generator and then your accuracy will match with the validation set.
For training set it may not match as we generally keep shuffle=True for training.
Below is an example of how to create validation DataGenerator for reproducible results (Set shuffle=False) :

validation_datagen=ImageDataGenerator(rescale=1./255)
validation_generator=validation_datagen.flow_from_directory(
    validation_directory,
    target_size=target_size,
    batch_size=validation_batch_size,
    class_mode=class_mode,
    shuffle=False
)

malraharsh · 2021-02-23T10:49:46Z

Thank you very much @RaviBansal7717.

lostdatum · 2021-02-23T11:50:43Z

@RaviBansal7717 Nothing made sense, world was turning gray, you are a lifesaver ! Why on earth would shuffle=True be the default though.

xiluo67 · 2021-03-26T19:16:25Z

I still have the problem that the training accuracy in history is 100% but when I use the model. evaluate(Train_image, Train_label), it gives me 86%, plus I already turned off the shuffle, regularization, dropout, and set batch size equal to the whole dataset number during training. Really have no idea what went wrong?

hollemantv · 2021-08-06T00:27:47Z

I have a slightly different issue. Whether my training accuracy is 70, 80, or 98%, I routinely run model.evaluate(X_test, y_test) after training and get an accuracy score around 0.05. I've seen it happen with and without BN and dropout layers. I'm using Keras 2.4.3, TF 2.5.0 and Python 3.9.5. Suggestions very much appreciated .

SpecKROELLchen mentioned this issue Apr 28, 2018

Fine-tuning InceptionV3: High accuracy during training, low during predictions #10045

Closed

datumbox mentioned this issue Mar 12, 2020

Align BN trainable behaviour with TF 2.0 #13892

Closed

fchollet closed this as completed Jun 24, 2021

arthurflor23 mentioned this issue Sep 4, 2024

Mismatch Between Training Progress and History/CSVLogger Callback Values #20212

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Got different accuracy between history and evaluate #10014

Got different accuracy between history and evaluate #10014

z888888861 commented Apr 23, 2018 •

edited

Loading

SpecKROELLchen commented Apr 23, 2018 •

edited

Loading

datumbox commented Apr 23, 2018

SpecKROELLchen commented Apr 24, 2018 •

edited

Loading

datumbox commented Apr 24, 2018

SpecKROELLchen commented Apr 25, 2018 •

edited

Loading

datumbox commented Apr 25, 2018

SpecKROELLchen commented Apr 25, 2018

SpecKROELLchen commented Apr 27, 2018 •

edited

Loading

SpecKROELLchen commented Apr 28, 2018 •

edited

Loading

ucohen commented Oct 21, 2018

mblouin02 commented Nov 11, 2018

tbagnoli commented Nov 28, 2018

adityapatadia commented Feb 2, 2019

shivam2298 commented Feb 4, 2019

hrosspet commented Feb 11, 2019

Issuenate commented Apr 26, 2019

midneet commented May 13, 2019

amintavakol commented Jul 17, 2019

Issuenate commented Jul 17, 2019

amintavakol commented Jul 17, 2019

Issuenate commented Jul 17, 2019

tinamautd commented Jul 11, 2020

RaviBansal7717 commented Aug 4, 2020 •

edited

Loading

malraharsh commented Feb 23, 2021

lostdatum commented Feb 23, 2021

xiluo67 commented Mar 26, 2021

hollemantv commented Aug 6, 2021

Got different accuracy between history and evaluate #10014

Got different accuracy between history and evaluate #10014

Comments

z888888861 commented Apr 23, 2018 • edited Loading

SpecKROELLchen commented Apr 23, 2018 • edited Loading

datumbox commented Apr 23, 2018

SpecKROELLchen commented Apr 24, 2018 • edited Loading

datumbox commented Apr 24, 2018

SpecKROELLchen commented Apr 25, 2018 • edited Loading

datumbox commented Apr 25, 2018

SpecKROELLchen commented Apr 25, 2018

SpecKROELLchen commented Apr 27, 2018 • edited Loading

SpecKROELLchen commented Apr 28, 2018 • edited Loading

ucohen commented Oct 21, 2018

mblouin02 commented Nov 11, 2018

tbagnoli commented Nov 28, 2018

adityapatadia commented Feb 2, 2019

shivam2298 commented Feb 4, 2019

hrosspet commented Feb 11, 2019

Issuenate commented Apr 26, 2019

midneet commented May 13, 2019

amintavakol commented Jul 17, 2019

Issuenate commented Jul 17, 2019

amintavakol commented Jul 17, 2019

Issuenate commented Jul 17, 2019

tinamautd commented Jul 11, 2020

RaviBansal7717 commented Aug 4, 2020 • edited Loading

malraharsh commented Feb 23, 2021

lostdatum commented Feb 23, 2021

xiluo67 commented Mar 26, 2021

hollemantv commented Aug 6, 2021

z888888861 commented Apr 23, 2018 •

edited

Loading

SpecKROELLchen commented Apr 23, 2018 •

edited

Loading

SpecKROELLchen commented Apr 24, 2018 •

edited

Loading

SpecKROELLchen commented Apr 25, 2018 •

edited

Loading

SpecKROELLchen commented Apr 27, 2018 •

edited

Loading

SpecKROELLchen commented Apr 28, 2018 •

edited

Loading

RaviBansal7717 commented Aug 4, 2020 •

edited

Loading