Change BN layer to use moving mean/var if frozen #9965
Conversation
Thanks for the effort. You are misunderstanding the meaning of the "trainable" property of layers. Historically, it has meant "this layer should not be trainable, i.e. the weights of this layer should not be updated during backprop". What you want is a BN layer in inference mode. There is an argument to control training/inference mode in BN (and other layers): it's the `training` argument in `call`. What you want is:

```python
x = BatchNormalization()(y, training=False)
```

For fine-tuning, you could do something like:

```python
# Set up inference-mode base
K.set_learning_phase(0)
inputs = Input(...)
x = layer1(...)(inputs)
x = layer2(...)(x)
...
x = layerN(...)(x)

# Add training-mode layers
K.set_learning_phase(1)
x = layerNp1(...)(x)
x = layerNp2(...)(x)
```
Hi @fchollet, first of all, thanks for taking the time to review and respond. I was aware that this is a significant change in the default behaviour and that there would be debate. :)

I understand that your main concern is around the semantic meaning of the trainable property and how it is being used in this PR. I agree that semantically the training parameter you proposed is closer to what I do; nevertheless, this parameter cannot be changed after the network is defined. For instance, when you use one of the pre-trained models of Keras, or when you load a persisted model, you have no control over this variable. Would you be open to discussing a solution that would make the training variable changeable after the network definition (or perhaps another property)? If you are, I could update my PR to reflect the agreed behaviour. Concerning your second recommendation of updating the learning_phase as the network is defined, I see two limitations.

I'm not sure if you had a look at the blog post (it is understandably a bit long), but you can see how significant a performance boost you get by making it possible to set the BN in inference mode. Without this, the trainable layers after the BNs adjust their weights based on input that has a different scale compared to inference. I hope we can re-open this PR; I'm happy to update it until it satisfies the semantic definitions. Cheers!
Again, there is an existing API that does exactly what you want: the `training` argument described above. Additionally, your proposed PR adds a computational overhead (which might amount to a ~5% slowdown for a BN-heavy model like InceptionV3) to every single convnet that uses BN, fine-tuning or not. This is a heavy price to pay for supporting an incrementally simpler UX (disputable) for a very specific use case.
Typically, if you want to heavily modify an existing model, rather than merely use it in inference mode, you should have access to the code for the model. But even if you don't, you can still do your style of fine-tuning in this case.
@fchollet My main point is that the training argument can't be changed after model definition, so the existing API does not cover this valid case. I don't dispute that there are workarounds, but they are hacky/inelegant, and the default behaviour leads to much confusion for users. Interesting what you mention about the 5% slowdown; I would love to see the benchmarks, perhaps it can be resolved. Finally, something you don't address here is whether this discrepancy in the scaling makes sense (theoretically or otherwise) and whether the accuracy decrease is worth it. In any case, let's agree to disagree. I do hope, though, that you will revise your decision in the future, as happened with the update of the mini-batch statistics in BN.
This is based on something I've observed in the past for InceptionV3 with a static learning phase vs. with a dynamic learning phase. The only difference between the two settings is the presence of the learning-phase conditional in the graph.
Thanks for clarifying that you are referring to a different benchmark and not to something you ran on this PR. I can't comment on the results without seeing them, but when I ran comparisons on CIFAR10 the time difference was negligible (current branch: 4216 secs vs patched: 4251 secs); both ran on GPUs on the same server. Note that the snippet that I used (and listed in my article) comes from Keras' documentation on how to fine-tune a network. Admittedly the above measurements are single-point estimates, but especially the 5-point accuracy increase I report is consistent with what I've been observing for almost a year while applying workarounds (the first time I reported this is on #7177).

I don't know if the speed is currently your main concern for reopening this, but I would say that this is unlikely to affect the majority of the users of Keras. This is because by default the learning phase is dynamic, so the conditional is already part of the graph.

In any case, I don't insist that it should be me who changes this, or that my current solution is the one we should use. I'm just raising a valid use case that is taken directly from Keras' documentation on how fine-tuning is performed. Currently there is no straightforward way to do what I describe (the current API doesn't cover it); nevertheless, if you provide specific guidelines on what tickboxes the update should check, that would be useful. Or perhaps some other longtime contributor to the BatchNormalization layer has an opinion or can offer a more elegant solution? @ozabluda @taehoonlee @farizrahman4u @Dref360
Sorry for the late reply, I'm still trying to understand the issues. For example, I am trying to understand whether this is related at all to #9214.
What sort of batch sizes were you using in your linked experiments? Some datasets are only viable with very small batch sizes of 1-4, like image segmentation on a GPU with 8GB of memory. After briefly skimming this diff, I think the documentation would need to be updated to clearly delineate the different modes and when/why each should typically be chosen. In my case the current frozen behaviour improved performance quite a lot over the previous behaviour in which mean/var could shift when trainable=False, so I'm a bit hesitant about this, though I'll reiterate I haven't reviewed what's happening in full detail. Here is a PR with some past discussion on BN: #8616
@ozabluda First of all, thank you for spending time on this. I wish I had provided in my PR the example that you posted on issue #9214; perhaps this would have built a stronger case for this patch. What you showed in your post is exactly what I've been observing on real-world, non-open-source datasets for the last year (close to 100% accuracy in training mode and 50% during inference on the same dataset and on similar validation sets). As @fchollet said, there are lots of hacks that can help you avoid it, but none of them should have been necessary. Based on the code you provided, I'm 100% certain you are being bitten by the behaviour of the BN layer that I'm trying to fix in this PR. In a nutshell, during training mode the frozen BN layers are scaled with different statistics than in inference mode. There is absolutely no theoretical foundation to support this behaviour. As a result, this can have devastating effects when you try to deploy the model or when you try to validate its accuracy. I am certain that the majority of people who face this believe they have overfitted the model, while in reality this is just a side-effect of how Keras implements the Batch Normalization layer. So let's test your example on my branch of Keras where the BN layer is patched.

Below I run your code for ResNet50. As you can see, the problem that you report is fixed once the BN behaviour is changed.

I would love to know if you can reproduce my results and whether you can observe any speed degradation of the kind @fchollet suspects.
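For anyone who wants to reproduce this kind of comparison, a rough sketch of the idea (not the original snippet; hypothetical toy data and head, written in the multi-backend Keras idioms of that era): train a model with a frozen pre-trained base, then feed the same data through the network in training mode and in inference mode and compare the outputs.

```python
import numpy as np
from keras import backend as K
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense
from keras.models import Model

# Frozen pre-trained base plus a small trainable head (toy data for illustration).
base = ResNet50(weights='imagenet', include_top=False, pooling='avg')
for layer in base.layers:
    layer.trainable = False
outputs = Dense(2, activation='softmax')(base.output)
model = Model(base.input, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

x = np.random.rand(64, 224, 224, 3)
y = np.random.randint(0, 2, size=(64,))
model.fit(x, y, epochs=2, batch_size=32)

# Run the very same inputs in training mode and in inference mode.
run_train_mode = K.function(model.inputs + [K.learning_phase()], model.outputs)
train_preds = run_train_mode([x, 1])[0]
infer_preds = model.predict(x, batch_size=32)

# On an unpatched Keras the two disagree, because the frozen BN layers switch from
# mini-batch statistics (training) to moving statistics (inference); with the patch
# applied they should match.
print(np.abs(train_preds - infer_preds).max())
```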
@ahundt Thanks for your comment! In this very specific experiment I used a fixed batch size of 32. Nevertheless, in this dummy example I try to reproduce a behaviour we've been facing for over a year now on real-world datasets and problems. In those cases a large number of different batch sizes were tested and the results were comparable. Please note that this PR DOES NOT undo the recent change where the mean/var no longer shifts when trainable=False; I 100% agree with you that that change is very beneficial. This PR actually takes it a step further and makes sure that the moving mean/var are used instead of the mini-batch statistics when trainable=False. This ensures that the non-frozen layers will be trained on data scaled the same way as in inference mode. BTW, thanks for sending me the discussion on #8616. Give me some time to read all the details and see how this is related.
@ahundt I've read the discussion on #8616. I understand it focuses on the previous change to BN that correctly stopped the update of the moving mean/var when trainable=False. I totally agree with that change. As I said in my previous comment, this PR takes it a step further to ensure that the data after a frozen BN are scaled the same way during training as during inference. What I find interesting is that during the original discussion on #8616, @fchollet raised similar concerns about the semantic meaning of trainable as in this PR. Nevertheless, in that discussion he proposed the introduction of another property to extend the API. I also see he tried to implement another property called "updatable", which was reverted due to the increased complexity (and in the end we settled on extending the semantics of trainable). I wonder if in this case it makes sense to extend the API to cover this valid case, OR update the semantics of trainable (my preferred solution), OR update the documentation/examples. I would love to have an opinion from @lukedeo on this, since he reviewed the code on the other PR.
@datumbox OK, I think I see what you are saying; I might try this out on my dataset. Do I need to change any settings like trainable in my training code, or can I just pull this in? In my example I use frozen VGG16 ImageNet pre-trained weights as a feature extractor with additional trainable layers afterwards. One thing that might help with getting this through is a few improvements to the PR, variable names, and description. If you better separate the concepts and clarify the conditions under which different data is fixed vs. changing, the reasons this improves performance may be more obvious.
OK, so based on the test_batchnorm_trainable() changes, this should be active by default in all cases except when both trainable=True and the layer is running in training mode. Correct?
@ahundt Thanks for looking into this. My PR only affects networks that use Batch Normalization layers, so VGG will not be affected. No additional configuration is required other than setting trainable=False on the BN layers; pulling this in should work fine.

Sure thing, send me your comments and I'll make the changes. :-)
Oh yeah, sorry, my first run definitely wasn't configured correctly, since VGG makes no sense for this case and the BN layers I had were trained from scratch. I did have other models, including ResNet and DenseNet, that didn't perform as well as VGG and that use pre-trained weights, and the fix in this PR might be why. I will try them out, but can you confirm that the following steps will make use of your changes?

Should I expect the above sequence to change the performance when running with this PR applied? (edit: fixed typo mentioned in the next post)
After trying things out, I think that a feasible yet simple solution is this (pseudo code, taking InceptionV3 as an example).
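The shape of that approach is roughly the following sketch (not the original pseudo code; hypothetical head sizes, using the same K.set_learning_phase pattern suggested earlier in the thread):

```python
from keras import backend as K
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D, Input
from keras.models import Model

# Build the pre-trained base with the learning phase fixed to inference,
# so its (frozen) BN layers are baked into inference behaviour.
K.set_learning_phase(0)
inputs = Input(shape=(299, 299, 3))
base = InceptionV3(weights='imagenet', include_top=False, input_tensor=inputs)
for layer in base.layers:
    layer.trainable = False

# Switch to training mode only for the new, trainable head.
K.set_learning_phase(1)
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation='relu')(x)           # hypothetical head size
outputs = Dense(10, activation='softmax')(x)   # hypothetical number of classes

finalModel = Model(inputs, outputs)
finalModel.compile(optimizer='adam', loss='categorical_crossentropy')
```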
@captainst Hello, this approach can work. But after I save the finalModel and want to continue training by loading the saved model, I cannot set part of the finalModel to be at learning_phase=0. (╥╯^╰╥)
@captainst your approach doesn't work when you want to fine-tune the top-k layers of your base model, which may contain BN layers. So here is yet another workaround :), using @faustomorales' suggestion as the base to solve the top-k layer fine-tuning.

Will this work?
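For context, a sketch of what such a top-k fine-tuning setup typically looks like in tf.keras (hypothetical k, head and input size): calling the frozen base with training=False keeps every BN layer on its moving statistics while the weights of the top-k layers are still updated.

```python
import tensorflow as tf

k = 20  # hypothetical: number of top layers of the base to fine-tune

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
for layer in base.layers[:-k]:
    layer.trainable = False

inputs = tf.keras.Input(shape=(224, 224, 3))
# training=False keeps all BN layers in the base on their moving statistics,
# even while the top-k layers' weights are being fine-tuned.
x = base(inputs, training=False)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)  # hypothetical classes

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="categorical_crossentropy")
```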
For those using TensorFlow 2.0 and trying to fine-tune ResNet, InceptionV3, etc., the problem seems to persist due to the injection of tensorflow.python.keras.layers, which reproduces the TF 1.0 batch normalisation behaviour in keras_applications when the model constructors are called. Similar to what @faustomorales suggested, I found that simply injecting the TF 2.0 Keras layers instead resolves the issue.
@datumbox, and for all others: what works in my case is to do transfer learning as a 2-step process. Extract the embeddings first (into TFRecord shards or not, it is up to you) and then train the classifier on top of them. @rpeloff, your workaround worked for me on TF 1.15.0 but not on TF 2.0.0, but thanks anyway.
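For what it's worth, a minimal sketch of that kind of two-step setup (tf.keras, hypothetical shapes and placeholder data): since the frozen base is only ever run with predict, its BN layers always use the moving statistics and the train/inference discrepancy never enters the picture.

```python
import numpy as np
import tensorflow as tf

# Step 1: run the frozen, inference-mode base once to extract embeddings.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False

images = np.random.rand(100, 224, 224, 3).astype("float32")  # placeholder data
labels = np.random.randint(0, 5, size=(100,))                 # placeholder labels
embeddings = base.predict(images, batch_size=32)              # shape (100, 2048)

# Step 2: train a small classifier on the stored embeddings.
clf = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(embeddings.shape[1],)),
    tf.keras.layers.Dense(5, activation="softmax"),
])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
clf.fit(embeddings, labels, epochs=5, batch_size=32)
```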
I can't believe this is not fixed yet.
@Tauranis would you mind elaborating on your 2-step process, please? Thanks.
Thank you so much!!! Looks like it solved it for me. That is certainly strange behaviour though; one would think that they're using TF 2.x components when using the official TF 2.x release. Anyway, thanks for your reply and the explanation.
To add to the suggestion made by @faustomorales, I found it useful to actually update the call behaviour of the BatchNormalization layer itself, so that it always runs in inference mode whenever it is frozen.

This means that when any frozen BN layer is called, it uses the moving statistics regardless of the learning phase. Alternately, you could recurse through only your frozen model and individually override the call functions of the layers that are instances of BatchNormalization. This is a problem that needs to be addressed IMO, since training with parts of the model frozen that include BatchNormalization layers is a very common use case.
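As a rough sketch of the class-level variant of that idea (illustrative only; whether a patch like this takes effect can depend on the Keras/TF version, and it should be applied before any model is built):

```python
import tensorflow as tf

BatchNormalization = tf.keras.layers.BatchNormalization
_original_bn_call = BatchNormalization.call

def _frozen_inference_call(self, inputs, training=None):
    # When the layer is frozen, ignore the requested mode and normalise with
    # the moving mean/variance, exactly as in inference.
    if not self.trainable:
        training = False
    return _original_bn_call(self, inputs, training=training)

# Patch the class before building any models so every BN layer picks it up.
BatchNormalization.call = _frozen_inference_call
```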
Can anyone give me a simple answer: if I am using a pre-trained DenseNet121 model (ImageNet weights), which has BN layers, to train on a facial-emotions dataset, and I am not freezing any layers, and I am using
I suggest one solution that addresses the fundamental problem. The problem: the Keras BN layer uses the mean and variance of the mini-batch while training even when it's frozen (trainable=False), but we want it to use the trained moving_average and moving_variance while training when frozen.

Summary of the solution: make a frozen BN layer normalise with the moving statistics regardless of the learning phase.

Validation code: check that a frozen BN layer produces the same output in training mode as in inference mode.
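A minimal sketch of that kind of check (hypothetical toy model, not the original validation code):

```python
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(4,))
bn = tf.keras.layers.BatchNormalization()
outputs = bn(inputs)
model = tf.keras.Model(inputs, outputs)

# Freeze the BN layer.
bn.trainable = False

x = np.random.randn(8, 4).astype("float32")
train_mode = model(x, training=True).numpy()
infer_mode = model(x, training=False).numpy()

# With the fix in place, both modes use the moving statistics and the outputs match;
# with the old behaviour, training mode uses mini-batch statistics and they differ.
print(np.allclose(train_mode, infer_mode))
```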
Contrary to most, I agree with @fchollet; the existing API can fulfil this PR's intent. The PR does ease the process, at the expense of increased computing time, since it adds an iteration-level conditional to the graph, but it's a valid patch that could have been merged with a printed warning. The solution is rather simple: use a model-building function with an argument that controls the BN training/inference mode, e.g. as sketched below.
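A sketch of what such a builder might look like (hypothetical architecture and argument name): the flag is applied at graph-construction time, so no per-iteration conditional is needed.

```python
import tensorflow as tf

def build_model(bn_training=None, num_classes=10):
    """Build the same architecture for either normal training or fine-tuning.

    Pass bn_training=False when fine-tuning, so every BatchNormalization layer
    is wired to use its moving statistics with no learning-phase switch.
    """
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = tf.keras.layers.Conv2D(32, 3, padding="same")(inputs)
    x = tf.keras.layers.BatchNormalization()(x, training=bn_training)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

# Normal training: BN follows the usual dynamic learning phase.
train_model = build_model(bn_training=None)

# Fine-tuning: BN is locked to inference behaviour at construction time.
finetune_model = build_model(bn_training=False)
finetune_model.set_weights(train_model.get_weights())
```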
@datumbox's patch solved part of my puzzle, but there is another way as well.
For some reason, @datumbox's patch did not solve my problem with the BN layer. I am using TF 2.0.0 and Keras 2.3.1 (in an Anaconda environment). I made sure to correctly alter normalization.py and tensorflow_backend.py. I have run the testing code from @datumbox's blog, but the different behaviour between training and testing still remains. I also tried @rpeloff's and @faustomorales' solutions without success on @datumbox's testing script.
@datumbox Thank you so much! As you mentioned, this has a severe impact when implementing transfer learning and fine-tuning part of the model. I was getting high accuracy during training while inference accuracy was dipping by around 30%. I spent a significant amount of time ensuring that the math and code were correct. I implemented your fix and inference started giving F1 and accuracy on par with training. This saved me a significant amount of time and headache. Thank you again!
This worked perfectly for me. Thanks, everyone, for your support.
If I'm still on TF 1.15 (the last 1.x release) and using

Or is it... #9965 (comment)
Everybody is mentioning that @faustomorales' solution works, but I cannot see that solution.
NOTE: it is particularly important that this command also sets the batch-normalization layers to non-trainable, which now seems to be the standard with TensorFlow 2 + Keras, but is not yet handled well by, e.g., the models from `segmentation_models`. Cf. `freeze_model` from `segmentation_models/models/_utils.py` and, e.g., https://keras.io/getting_started/faq/#whats-the-difference-between-the-training-argument-in-call-and-the-trainable-attribute and keras-team/keras#9965.
Hi @datumbox, thanks for your contribution. Can you tell me what changes you made on the code side? At least a top-level overview (apart from what you have mentioned on the blog, of course).
During fine-tuning, if a Batch Normalization layer is frozen it uses the mini-batch statistics. I believe this is incorrect and that it can lead to reduced accuracy, especially when we use transfer learning. A better approach in this case would be to use the values of the moving mean and variance.

Changes in this PR:

In this PR I update the Batch Normalization layer to use the learned statistics if it is frozen during training. This is achieved by making the trainable flag part of the computational graph and by making the behaviour of the BN depend not only on the learning_phase but also on the value of the trainable property.
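Conceptually, the intended statistics-selection rule can be sketched as follows (an illustration of the behaviour, not the actual diff):

```python
def uses_moving_statistics(trainable, training):
    """Return True when the BN layer should normalise with the moving mean/variance."""
    if not trainable:
        # Frozen layer: always use the learned moving statistics,
        # during both training and inference (this is the proposed change).
        return True
    # Trainable layer: unchanged behaviour, driven by the learning phase.
    return not training
```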
Brief explanation:
Assume we use one of the pre-trained CNNs of Keras and we want to fine-tune it. Unfortunately, we get no guarantees that the mean and variance of our new dataset inside the BN layers will be similar to the ones of the original dataset. As a result, if we fine-tune the top layers, their weights will be adjusted to the mean/variance of the new dataset. Nevertheless, during inference the top layers will receive data which are scaled using the mean/variance of the original dataset. This discrepancy can lead to reduced accuracy.
I understand that this is a significant change that requires thorough review. To facilitate the review, I've documented why making such a change is important and provided detailed comparisons before and after applying the patch on my blog.
EDIT: Since the fix was not merged into master, I maintain unofficial patches for Keras 2.1.6, Keras 2.2.2 and Keras 2.2.4.