
BatchNorm can not converge with scale=False #18475

Closed
nttstar opened this issue Jun 3, 2020 · 7 comments

nttstar commented Jun 3, 2020

Description

The BatchNorm operator with scale=False cannot converge.

Error Message

No error message, but the loss value and training accuracy are abnormal compared with scale=True BatchNorm.

To Reproduce

Train ArcFace with https://github.com/nttstar/arcface.np and add one BatchNorm op with scale=False after the final embedding layer.
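
A minimal sketch of this setup, assuming the Gluon API; the embedding size and input shape below are illustrative and not taken from arcface.np:

```python
# Minimal sketch of the reported setup, assuming the Gluon API;
# the embedding size and input shape are illustrative, not from arcface.np.
import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(512))              # final embedding layer (hypothetical size)
net.add(nn.BatchNorm(scale=False))  # gamma fixed to 1; only beta is learned
net.initialize()

x = mx.nd.random.uniform(shape=(8, 1024))
emb = net(x)                        # embeddings normalized without a learned scale
```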

What have you tried to solve it?

  1. Setting scale=True works, but with slightly worse test accuracy.

Environment

----------Python Info----------
Version : 3.6.9
Compiler : GCC 7.3.0
Build : ('default', 'Jul 30 2019 19:07:31')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.3.1
Directory : /root/anaconda2/envs/py36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 2.0.0
Directory : /root/anaconda2/envs/py36/lib/python3.6/site-packages/mxnet
Num GPUs : 8
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-3.10.0-327.el7.x86_64-x86_64-with-centos-7.5.1804-Core
system : Linux
node : gpu06
release : 3.10.0-327.el7.x86_64
version : #1 SMP Thu Nov 19 22:10:57 UTC 2015

nttstar added the Bug label Jun 3, 2020
wkcn (Member) commented Jun 4, 2020

Hi @nttstar, there was a bug in BatchNorm in a previous version of MXNet. The bug was reported in #18373 and fixed in PR #18377. Could you please try the latest version of MXNet?

sxjscience (Member) commented

@wkcn I communicated with @nttstar offline and #18373 should not be the cause. Could you help test with scale=False and see if anything is wrong?

wkcn (Member) commented Jun 5, 2020

@sxjscience I'm sorry that I don't have any machine with GPU to check it recently.

I read the code of BatchNorm and its unit test.
There is a gradient check when fix_gamma=True:

https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1777,

but no output check when fix_gamma=True:

https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1882
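
For illustration, a hedged sketch of the kind of forward-output check that appears to be missing, comparing the legacy BatchNorm op with fix_gamma=True against a NumPy reference; the helper name, shapes, and tolerances are illustrative and not taken from the test suite:

```python
# Sketch of an output check for fix_gamma=True (gamma treated as 1), comparing
# the legacy BatchNorm op against a NumPy reference. Helper name, shapes, and
# tolerances are illustrative; this is not the actual unittest code.
import numpy as np
import mxnet as mx

def check_bn_output(shape=(24, 3, 4), axis=1, eps=1e-5, ctx=mx.cpu()):
    nch = shape[axis]
    x = mx.nd.random.uniform(shape=shape, ctx=ctx)
    gamma = mx.nd.ones(nch, ctx=ctx)   # ignored (forced to 1) when fix_gamma=True
    beta = mx.nd.random.uniform(shape=(nch,), ctx=ctx)
    mov_mean = mx.nd.zeros(nch, ctx=ctx)
    mov_var = mx.nd.ones(nch, ctx=ctx)
    with mx.autograd.record():  # training mode: normalize with batch statistics
        y = mx.nd.BatchNorm(x, gamma, beta, mov_mean, mov_var,
                            eps=eps, fix_gamma=True, axis=axis)
    # NumPy reference: per-channel normalization with gamma == 1.
    xn = x.asnumpy()
    red = tuple(i for i in range(len(shape)) if i != axis)
    mean = xn.mean(axis=red, keepdims=True)
    var = xn.var(axis=red, keepdims=True)
    bshape = [1] * len(shape)
    bshape[axis] = nch
    ref = (xn - mean) / np.sqrt(var + eps) + beta.asnumpy().reshape(bshape)
    np.testing.assert_allclose(y.asnumpy(), ref, rtol=1e-3, atol=1e-3)

check_bn_output()
```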

wkcn (Member) commented Jun 5, 2020

I tested BatchNorm with fix_gamma=True.
The result on CPU is correct, but the result on GPU is wrong when fix_gamma=True, axis=1, and cudnn_off=False. bn_beta.grad is the only incorrect value.

Here are the failure cases when fix_gamma=True.

| operator  | shape            | axis | cudnn_off | output_mean_var |
|-----------|------------------|------|-----------|-----------------|
| BatchNorm | (24, 2)          | 1    | False     | False           |
| BatchNorm | (24, 2)          | 1    | False     | True            |
| BatchNorm | (24, 3, 4)       | 1    | False     | False           |
| BatchNorm | (24, 3, 4)       | 1    | False     | True            |
| BatchNorm | (24, 4, 4, 4)    | 1    | False     | False           |
| BatchNorm | (24, 4, 4, 4)    | 1    | False     | True            |
| BatchNorm | (24, 8, 4, 4)    | 1    | False     | False           |
| BatchNorm | (24, 8, 4, 4)    | 1    | False     | True            |
| BatchNorm | (24, 5, 6, 4, 4) | 1    | False     | False           |
| BatchNorm | (24, 5, 6, 4, 4) | 1    | False     | True            |
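
A hedged sketch of the kind of comparison described above: backprop through the legacy BatchNorm op with fix_gamma=True on CPU and GPU with identical inputs and compare beta's gradient (the helper name and shapes are illustrative, not the actual test code):

```python
# Sketch of the CPU-vs-GPU comparison: run the legacy BatchNorm op with
# fix_gamma=True on both devices with identical inputs and compare beta's
# gradient. Helper name and shapes are illustrative; not the actual test code.
import numpy as np
import mxnet as mx

def beta_grad(x_np, og_np, ctx, axis=1, cudnn_off=False):
    nch = x_np.shape[axis]
    x = mx.nd.array(x_np, ctx=ctx)
    gamma = mx.nd.ones(nch, ctx=ctx)
    beta = mx.nd.zeros(nch, ctx=ctx)
    mov_mean = mx.nd.zeros(nch, ctx=ctx)
    mov_var = mx.nd.ones(nch, ctx=ctx)
    beta.attach_grad()
    with mx.autograd.record():
        y = mx.nd.BatchNorm(x, gamma, beta, mov_mean, mov_var,
                            fix_gamma=True, axis=axis, cudnn_off=cudnn_off)
    y.backward(mx.nd.array(og_np, ctx=ctx))  # use the same head gradient on both devices
    return beta.grad.asnumpy()

shape = (24, 3, 4)
x_np = np.random.uniform(size=shape).astype('float32')
og_np = np.random.uniform(size=shape).astype('float32')
cpu = beta_grad(x_np, og_np, mx.cpu())
gpu = beta_grad(x_np, og_np, mx.gpu(0))  # before the fix, this disagreed with CPU
np.testing.assert_allclose(cpu, gpu, rtol=1e-3, atol=1e-3)
```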

wkcn (Member) commented Jun 5, 2020

I'm fixing the bug and I will submit a PR later.

wkcn (Member) commented Jun 10, 2020

Hi @nttstar, the BatchNorm bug has been fixed in #18500.

Thank you for the report!

nttstar (Author) commented Jun 10, 2020

@wkcn Thanks! I will check it in the next pip package release.
