This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[BUGFIX][1.8.x] Temporary fix for RNN with oneDNN seg faults/core dumps #19308

Merged: 3 commits merged into apache:v1.8.x on Oct 27, 2020

Conversation

@bgawrych
Contributor

bgawrych commented Oct 7, 2020

Description

This fix is a workaround for a problem with oneDNN memory descriptor comparison (size calculations) that causes segmentation faults in the operator destructor, as reported in issue #19022.
The proper fix will be delivered with oneDNN 1.7; however, we would like to fix this issue in the MXNet 1.8 release as well.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Memory descriptor comparison in RNN operator

@mxnet-bot

Hey @bgawrych, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-gpu, centos-cpu, windows-cpu, sanity, website, windows-gpu, clang, unix-gpu, unix-cpu, miscellaneous, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@bgawrych
Contributor Author

bgawrych commented Oct 7, 2020

@mxnet-bot run ci [windows-cpu, unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-cpu, unix-cpu]

@bgawrych
Contributor Author

bgawrych commented Oct 7, 2020

@mxnet-bot run ci [windows-cpu, unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-cpu, unix-gpu]

@bgawrych
Contributor Author

bgawrych commented Oct 7, 2020

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@bgawrych
Contributor Author

bgawrych commented Oct 8, 2020

Hey @szha @samskalicky, do you know what's wrong with the unix-gpu CI on the 1.8 branch?
The test below is failing, and it is not related to my change.

[2020-10-07T22:09:54.644Z] 
[2020-10-07T22:09:54.644Z] ======================================================================
[2020-10-07T22:09:54.644Z] FAIL: test_operator_gpu.test_np_mixed_precision_binary_funcs
[2020-10-07T22:09:54.644Z] ----------------------------------------------------------------------
[2020-10-07T22:09:54.644Z] Traceback (most recent call last):
[2020-10-07T22:09:54.644Z]   File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
[2020-10-07T22:09:54.644Z]     self.test(*self.arg)
[2020-10-07T22:09:54.644Z]   File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
[2020-10-07T22:09:54.644Z]     return func(*arg, **kw)
[2020-10-07T22:09:54.644Z]   File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 218, in test_new
[2020-10-07T22:09:54.644Z]     orig_test(*args, **kwargs)
[2020-10-07T22:09:54.644Z]   File "/work/mxnet/python/mxnet/util.py", line 297, in _with_np_shape
[2020-10-07T22:09:54.644Z]     return func(*args, **kwargs)
[2020-10-07T22:09:54.644Z]   File "/work/mxnet/python/mxnet/util.py", line 481, in _with_np_array
[2020-10-07T22:09:54.644Z]     return func(*args, **kwargs)
[2020-10-07T22:09:54.644Z]   File "/work/mxnet/tests/python/gpu/../unittest/test_numpy_op.py", line 2607, in test_np_mixed_precision_binary_funcs
[2020-10-07T22:09:54.644Z]     check_mixed_precision_binary_func(func, low, high, lshape, rshape, lgrad, rgrad, type1, type2)
[2020-10-07T22:09:54.644Z]   File "/work/mxnet/tests/python/gpu/../unittest/test_numpy_op.py", line 2550, in check_mixed_precision_binary_func
[2020-10-07T22:09:54.644Z]     use_broadcast=False, equal_nan=True)
[2020-10-07T22:09:54.644Z]   File "/work/mxnet/python/mxnet/test_utils.py", line 749, in assert_almost_equal
[2020-10-07T22:09:54.644Z]     raise AssertionError(msg)
[2020-10-07T22:09:54.644Z] AssertionError: 
[2020-10-07T22:09:54.644Z] Items are not equal:
[2020-10-07T22:09:54.644Z] Error 1.732774 exceeds tolerance rtol=1.000000e-02, atol=1.000000e-03 (mismatch 16.666667%).
[2020-10-07T22:09:54.644Z] Location of maximum error: (1, 1), a=0.00978292, b=0.01171875
[2020-10-07T22:09:54.644Z]  ACTUAL: array([[0.77931417, 0.14845479, 0.22072042],
[2020-10-07T22:09:54.644Z]        [1.27150167, 0.00978292, 1.49220479]])
[2020-10-07T22:09:54.644Z]  DESIRED: array([[0.78125   , 0.15039062, 0.22265625],
[2020-10-07T22:09:54.644Z]        [1.2734375 , 0.01171875, 1.49414062]])
[2020-10-07T22:09:54.645Z] -------------------- >> begin captured stdout << ---------------------
[2020-10-07T22:09:54.645Z] 
[2020-10-07T22:09:54.645Z] *** Maximum errors for vector of size 6:  rtol=0.01, atol=0.001
[2020-10-07T22:09:54.645Z] 
[2020-10-07T22:09:54.645Z]   1: Error 1.732774  Location of error: (1, 1), a=0.00978292, b=0.01171875
[2020-10-07T22:09:54.645Z] 
[2020-10-07T22:09:54.645Z] --------------------- >> end captured stdout << ----------------------
[2020-10-07T22:09:54.645Z] -------------------- >> begin captured logging << --------------------
[2020-10-07T22:09:54.645Z] common: WARNING: Error seen with seeded test, use MXNET_TEST_SEED=694743356 to reproduce.
[2020-10-07T22:09:54.645Z] --------------------- >> end captured logging << ---------------------
[2020-10-07T22:09:54.645Z] 
[2020-10-07T22:09:54.645Z] ----------------------------------------------------------------------
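The numbers in the log above can be checked by hand. The sketch below assumes the reported "Error" is the relative violation |a - b| / (atol + rtol * |b|), where a value above 1.0 means the element is out of tolerance. That formula is an assumption that happens to reproduce the logged figures, not something the log itself states. The function names `Violation` and `MismatchFraction` are hypothetical, and the arrays are copied from the ACTUAL and DESIRED lines of the log.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Values copied from the ACTUAL / DESIRED arrays in the CI log.
const std::vector<double> kActual = {0.77931417, 0.14845479, 0.22072042,
                                     1.27150167, 0.00978292, 1.49220479};
const std::vector<double> kDesired = {0.78125, 0.15039062, 0.22265625,
                                      1.2734375, 0.01171875, 1.49414062};

// Assumed error metric: |a - b| / (atol + rtol * |b|).
// A result above 1.0 means the element exceeds the tolerance.
inline double Violation(double a, double b, double rtol, double atol) {
  return std::fabs(a - b) / (atol + rtol * std::fabs(b));
}

// Fraction of elements whose violation is above 1.0.
inline double MismatchFraction(const std::vector<double> &actual,
                               const std::vector<double> &desired,
                               double rtol, double atol) {
  std::size_t bad = 0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (Violation(actual[i], desired[i], rtol, atol) > 1.0) ++bad;
  }
  return actual.empty() ? 0.0 : static_cast<double>(bad) / actual.size();
}
```

With rtol=1e-2 and atol=1e-3, `Violation(0.00978292, 0.01171875, 1e-2, 1e-3)` comes out at roughly 1.73, and only the element at (1, 1) exceeds 1.0, which matches the reported mismatch of 1 in 6 elements (16.666667%). Every element differs by about the same absolute amount, so only the smallest value fails the relative check.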

@bgawrych
Contributor Author

bgawrych commented Oct 8, 2020

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@bgawrych
Contributor Author

bgawrych commented Oct 8, 2020

@anko-intel
Contributor

LGTM

@pengzhao-intel
Contributor

@ciyongch @zixuanweeei please help review.

@pengzhao-intel
Contributor

@anko-intel what's the timeline of 1.8? Is it possible to release a new patch version for 1.7?

@anko-intel
Contributor

> @anko-intel what's the timeline of 1.8? Is it possible to release a new patch version for 1.7?

According to https://cwiki.apache.org/confluence/display/MXNET/1.8.0+Release+Plan+and+Status, November 1st is the planned date of the official release. I don't think we can prepare a 1.7 patch version any earlier than that.

@@ -47,6 +47,14 @@ inline int GetRnnGatesNum(int mode) {
}
}

// Bug in oneDNN >= 1.6 in memory descriptor comparison operators.
Contributor

typo? oneDNN <= 1.6 here?

Contributor Author

right, done

// for specific dims and strides in descriptors == operator can return `true`
// but get_size() function will return different size
// TODO(bgawrych): Remove with oneDNN 1.7 upgrade
bool CheckMemDescEquality(const mkldnn::memory::desc &left, const mkldnn::memory::desc &right) {
Contributor

Better to make this function static inline.

Contributor Author

done :)
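
The hunk above shows only the signature of `CheckMemDescEquality`, so the following is a sketch of the idea stated in its comments, not the merged implementation. The comments say `==` can return `true` for two descriptors whose `get_size()` differs, so one plausible workaround is to require both checks to agree. `FakeMemDesc` and `CheckDescEqualitySketch` are hypothetical names introduced here; `FakeMemDesc` is a stand-in that imitates the symptom and does not model how `mkldnn::memory::desc` actually compares descriptors.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for mkldnn::memory::desc, for illustration only.
struct FakeMemDesc {
  std::vector<long> dims;
  std::vector<long> strides;
  std::size_t size_bytes;  // value reported by get_size()

  // Imitates the symptom described in the hunk: the comparison can
  // succeed even though the two descriptors report different sizes.
  bool operator==(const FakeMemDesc &other) const {
    return dims == other.dims;
  }

  std::size_t get_size() const { return size_bytes; }
};

// Assumed shape of the workaround: treat descriptors as equal only when
// operator== agrees AND the reported sizes match.
template <typename Desc>
static inline bool CheckDescEqualitySketch(const Desc &left,
                                           const Desc &right) {
  return left == right && left.get_size() == right.get_size();
}
```

For two stand-in descriptors with the same dims but different strides and sizes, `a == b` is `true` while `CheckDescEqualitySketch(a, b)` is `false`, which is the distinction the workaround relies on. Writing the helper as a template keeps the sketch independent of the oneDNN headers.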

@lanking520 added the pr-awaiting-testing and pr-awaiting-review labels and removed the pr-awaiting-testing label on Oct 22, 2020
@szha requested a review from samskalicky on October 22, 2020 15:51
@bgawrych
Contributor Author

@ciyongch @pengzhao-intel @samskalicky Can we merge this change?

Contributor

@pengzhao-intel left a comment

LGTM and merging now.

@pengzhao-intel merged commit 7c86f48 into apache:v1.8.x on Oct 27, 2020
@pengzhao-intel
Contributor

Please file an issue to track the oneDNN upgrade and the removal of this temporary workaround. @bgawrych

@samskalicky
Contributor

Thanks @bgawrych, can you also backport this PR to the v1.x branch so it stays in sync?

Labels: MKLDNN, pr-awaiting-review
7 participants