Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.7.x] backport Invoke mkldnn and cudnn BatchNorm when axis != 1 to v1.7.x #18676

Merged
merged 4 commits into from
Aug 10, 2020

Conversation

stu1130
Copy link
Contributor

@stu1130 stu1130 commented Jul 9, 2020

Backport [Improvement] Invoke mkldnn and cudnn BatchNorm when axis != 1 v1.7x
The PR could be part of 1.7.1 release.

…e#18504)

* fix batch norm when fix_gamma is True

* support gradient accumulation for batch norm

* mkldnn batchnorm support grad add

* unittest for bn

* fix bn arg

* fix lint

* fix mkldnn

* fix mkldnn bn

* fix grad when fixing gamma

* fix naive gpu bn

* fix lint

* invoke mkldnn and cudnn batchnorm when axis != 1

* backport 18500

* change condition

* fix

* fix

* add mkldnn_off for bn

* remove mkldnn_off

* recover save_000800.json

* cast
@mxnet-bot
Copy link

Hey @stu1130 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-gpu, clang, website, unix-cpu, windows-cpu, centos-cpu, edge, miscellaneous, sanity, windows-gpu, unix-gpu]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@wkcn
Copy link
Member

wkcn commented Jul 9, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-cpu]

@stu1130
Copy link
Contributor Author

stu1130 commented Jul 10, 2020

@wkcn could you take a look at the nan test failure?

@wkcn
Copy link
Member

wkcn commented Jul 10, 2020

HI @stu1130 , the flaky test is test_symbol.test_load_000800.

test_symbol.test_load_000800 initializes the weights of the model randomly, and the output's values are abnormally large, which leads to a abnormal output when invoking MKLDNN.

In the latest version of MXNet, the test test_symbol.test_load_000800 has been removed in other PR.

@stu1130
Copy link
Contributor Author

stu1130 commented Jul 10, 2020

@wkcn thanks for the explanation. Let me retrigger the ci.
@mxnet-bot run ci [unix-cpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-cpu]

1 similar comment
@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-cpu]

@wkcn
Copy link
Member

wkcn commented Jul 11, 2020

Hi @stu1130 , the unittest test_symbol.test_load_000800 should be removed.
Meanwhile the test case of test_numpy_op.test_npx_batch_norm should be updated.
#18688

I do not have the permission to update this PR.

@stu1130
Copy link
Contributor Author

stu1130 commented Jul 13, 2020

Hi @stu1130 , the unittest test_symbol.test_load_000800 should be removed.
Meanwhile the test case of test_numpy_op.test_npx_batch_norm should be updated.
#18688

I do not have the permission to update this PR.

Could you check again? I enable Allow edits and access to secrets by maintainers
If that doesn't work out, you can create another cherry-pick PR. I can drop this.

@wkcn

This comment has been minimized.

Copy link
Member

@wkcn wkcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

@stu1130
Copy link
Contributor Author

stu1130 commented Jul 15, 2020

Thanks @wkcn I will ping my colleague to merge it

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Jul 26, 2020
@pengzhao-intel
Copy link
Contributor

@bgawrych could you take a review for this change, especially for MKLDNN BN

@bgawrych
Copy link
Contributor

bgawrych commented Aug 10, 2020

@pengzhao-intel LGTM, all test cases are executed by MKLDNN primitive

@samskalicky samskalicky merged commit 651e24b into apache:v1.7.x Aug 10, 2020
@stu1130 stu1130 deleted the v1.7.x branch August 10, 2020 18:40
@stu1130
Copy link
Contributor Author

stu1130 commented Aug 10, 2020

@ciyongch if we need to rework on 1.7 source code for addressing license issue, could we include this commit? If it would block 1.7 release or affect the process, I am ok to make it release on 1.7.1 or 1.8

stu1130 added a commit to stu1130/incubator-mxnet that referenced this pull request Aug 10, 2020
…o v1.7.x (apache#18676)

* [Improvement] Invoke mkldnn and cudnn BatchNorm when axis != 1 (apache#18504)

* fix batch norm when fix_gamma is True

* support gradient accumulation for batch norm

* mkldnn batchnorm support grad add

* unittest for bn

* fix bn arg

* fix lint

* fix mkldnn

* fix mkldnn bn

* fix grad when fixing gamma

* fix naive gpu bn

* fix lint

* invoke mkldnn and cudnn batchnorm when axis != 1

* backport 18500

* change condition

* fix

* fix

* add mkldnn_off for bn

* remove mkldnn_off

* recover save_000800.json

* cast

* remove  and fix flaky test

Co-authored-by: JackieWu <wkcn@live.cn>
@ciyongch
Copy link
Contributor

Hi @stu1130 , currently, 1.7.0rc1 vote on dev@ was completed, and I didn't see any hard requirement to rework/updated the 1.7.0rc1 code base, the release process was hold on (vote on general@) to address some legal issues like (IP clearance of mshadow and some branding/trademark issue in external/3rd party usage). So I would suggest to include this in the next release, what do you think?

@stu1130
Copy link
Contributor Author

stu1130 commented Aug 11, 2020

@ciyongch alright, I think we can do it on next release.

stu1130 added a commit to stu1130/incubator-mxnet that referenced this pull request Aug 12, 2020
…o v1.7.x (apache#18676)

* [Improvement] Invoke mkldnn and cudnn BatchNorm when axis != 1 (apache#18504)

* fix batch norm when fix_gamma is True

* support gradient accumulation for batch norm

* mkldnn batchnorm support grad add

* unittest for bn

* fix bn arg

* fix lint

* fix mkldnn

* fix mkldnn bn

* fix grad when fixing gamma

* fix naive gpu bn

* fix lint

* invoke mkldnn and cudnn batchnorm when axis != 1

* backport 18500

* change condition

* fix

* fix

* add mkldnn_off for bn

* remove mkldnn_off

* recover save_000800.json

* cast

* remove  and fix flaky test

Co-authored-by: JackieWu <wkcn@live.cn>
stu1130 added a commit to stu1130/incubator-mxnet that referenced this pull request Aug 12, 2020
…o v1.7.x (apache#18676)

* [Improvement] Invoke mkldnn and cudnn BatchNorm when axis != 1 (apache#18504)

* fix batch norm when fix_gamma is True

* support gradient accumulation for batch norm

* mkldnn batchnorm support grad add

* unittest for bn

* fix bn arg

* fix lint

* fix mkldnn

* fix mkldnn bn

* fix grad when fixing gamma

* fix naive gpu bn

* fix lint

* invoke mkldnn and cudnn batchnorm when axis != 1

* backport 18500

* change condition

* fix

* fix

* add mkldnn_off for bn

* remove mkldnn_off

* recover save_000800.json

* cast

* remove  and fix flaky test

Co-authored-by: JackieWu <wkcn@live.cn>
szha pushed a commit that referenced this pull request Aug 14, 2020
…o v1.7.x (#18676) (#18890)

* [Improvement] Invoke mkldnn and cudnn BatchNorm when axis != 1 (#18504)

* fix batch norm when fix_gamma is True

* support gradient accumulation for batch norm

* mkldnn batchnorm support grad add

* unittest for bn

* fix bn arg

* fix lint

* fix mkldnn

* fix mkldnn bn

* fix grad when fixing gamma

* fix naive gpu bn

* fix lint

* invoke mkldnn and cudnn batchnorm when axis != 1

* backport 18500

* change condition

* fix

* fix

* add mkldnn_off for bn

* remove mkldnn_off

* recover save_000800.json

* cast

* remove  and fix flaky test

Co-authored-by: JackieWu <wkcn@live.cn>

Co-authored-by: JackieWu <wkcn@live.cn>
@szha szha added this to the v1.8.0 milestone Aug 23, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants