@bonlime good to know they are working for you, you should try them with the latest PyTorch 1.12 nvfuser codegen btw, it can improve throughput quite a bit (they are also good on PyTorch XLA w/ TPU). You've probably picked it up, but in case others are looking at this and not familiar, the difference between the 'a' variants I threw in there and the paper ones is that 'a' always normalizes the input by the stats, even if the activation is not enabled. I tried this variant on the regnetz because there are a lot of unactivated instances at the end of each block and the paper version wasn't working as well there. The resnetv2 are also pre-act, but for some reason they worked fine with the paper version.

I have not tried it with WS, I don't recall seeing any standard practice to that effect. There are very few uses of EvoNorm out there. GN + WS sure, but that's still just one research group who likes to do that combo....
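For anyone comparing the two variants, here is a minimal sketch of that difference. This is a paraphrase, not the actual timm code: `EvoNormS0Sketch`, `always_norm`, and `group_std` are illustrative names.

```python
import torch
import torch.nn as nn

def group_std(x, groups=32, eps=1e-5):
    # std-dev over each (channel group, H, W), broadcast back to x's shape
    B, C, H, W = x.shape
    x_ = x.reshape(B, groups, C // groups, H, W)
    std = x_.var(dim=(2, 3, 4), keepdim=True).add(eps).sqrt()
    return std.expand(x_.shape).reshape(B, C, H, W)

class EvoNormS0Sketch(nn.Module):
    # always_norm=True mimics the 'a' behaviour; apply_act=False models the
    # unactivated instances at the end of a block
    def __init__(self, num_features, groups=32, apply_act=True, always_norm=False):
        super().__init__()
        self.groups, self.always_norm = groups, always_norm
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.v = nn.Parameter(torch.ones(num_features)) if apply_act else None

    def forward(self, x):
        if self.v is not None:
            # activated path (same in both variants): SiLU-like gate / group std
            x = x * (x * self.v.view(1, -1, 1, 1)).sigmoid() / group_std(x, self.groups)
        elif self.always_norm:
            # 'a' behaviour: still normalize by the stats without the activation
            x = x / group_std(x, self.groups)
        # paper behaviour with apply_act=False: affine transform only
        return x * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```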
---
By the way, just a random thought to share with you. I've noticed you've experimented with normalisations quite a lot, but I haven't seen you use the idea from "Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models" (from the author of the original BN; TLDR: additionally use the running stats in the forward pass to make normalisation more stable). In my experiments it makes training significantly better with medium BS (8-12). After combining it with calculating only RMS for …
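For context, a minimal sketch of that Batch Renorm forward pass could look like the following. The clipping limits `r_max`/`d_max` are shown at fixed final values rather than the paper's ramp-up schedule, and the class name and defaults here are ours:

```python
import torch
import torch.nn as nn

class BatchRenorm2dSketch(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1, r_max=3.0, d_max=5.0):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.r_max, self.d_max = r_max, d_max
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_std', torch.ones(num_features))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=(0, 2, 3))
            std = x.var(dim=(0, 2, 3), unbiased=False).add(self.eps).sqrt()
            # r and d pull the batch stats toward the running stats, which is
            # what reduces the dependence on the current minibatch
            r = (std / self.running_std).clamp(1 / self.r_max, self.r_max).detach()
            d = ((mean - self.running_mean) / self.running_std)
            d = d.clamp(-self.d_max, self.d_max).detach()
            x = (x - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)
            x = x * r.view(1, -1, 1, 1) + d.view(1, -1, 1, 1)
            with torch.no_grad():
                self.running_mean += self.momentum * (mean - self.running_mean)
                self.running_std += self.momentum * (std - self.running_std)
        else:
            # inference uses the running stats, same as plain BN
            x = (x - self.running_mean.view(1, -1, 1, 1)) / self.running_std.view(1, -1, 1, 1)
        return x * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```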
---
Hi,
First, thanks for the great pre-trained models without BN (`regnetz_c16_evos`, `regnetz_d8_evos`). I've used them for downstream tasks and the ability to train with any BS is impressive. I've looked closely into their implementation and have several questions about the design choices used.

Why `EvoNorm2dS0a` instead of `GN + HardSwish/SiLU`? I do understand that they have slightly different formulas, but `GN + HardSwish/SiLU` is much faster in my experiments (up to a 15% speed-up) and requires less memory. Also, in your ResNet50 experiments (`resnetv2_50d_gn`) GN works almost the same.
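For what it's worth, a rough micro-benchmark of that comparison could be set up like this. The `gn_silu` and `bench` helpers and the tensor sizes are our own choices, and the `EvoNorm2dS0a` import path is an assumption since it has moved between timm versions:

```python
import time
import torch
import torch.nn as nn

try:
    from timm.layers import EvoNorm2dS0a          # newer timm (assumed path)
except ImportError:
    from timm.models.layers import EvoNorm2dS0a   # older timm (assumed path)

def gn_silu(c, groups=32):
    # the proposed alternative: plain GroupNorm followed by SiLU
    return nn.Sequential(nn.GroupNorm(groups, c), nn.SiLU())

@torch.no_grad()
def bench(mod, x, iters=100):
    for _ in range(10):  # warm-up
        mod(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        mod(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(16, 256, 32, 32, device=device)
print('GN + SiLU    :', bench(gn_silu(256).to(device).eval(), x))
print('EvoNorm2dS0a :', bench(EvoNorm2dS0a(256).to(device).eval(), x))
```

Note this times the forward pass only, so it won't capture the training-time memory difference mentioned above; treat it as a rough sanity check rather than a definitive comparison.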