This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] [MKLDNN] 2 Conv Tests Failed with MKLDNN + ACL #20265

Open

Zha0q1 opened this issue May 12, 2021 · 24 comments

@Zha0q1 (Contributor) commented May 12, 2021

I was trying to build MXNet 1.x with MKLDNN and ACL (Arm Compute Library) integration on an Arm instance, using this cmake config file to integrate MKLDNN with ACL. The build was very performant and would hugely benefit MXNet users: I got a 3-4X speedup with input shape (16/64, 3, 512, 512) on ResNet compared to MKLDNN without ACL (a rough sketch of that benchmark is included at the end of this comment). However, two operator unit tests failed on this build:

test_operator.test_convolution_grouping 
test_operator.test_convolution_independent_gradients

I tried multiple MKLDNN versions (v1.x currently points to MKLDNN release 2.0 beta 10; I also tried releases 2.1 and 2.2) and ACL versions (20.08, from Aug 2020, and 21.2, the latest), and the failures persisted.

This makes me suspect that there is an integration issue between MXNet and MKLDNN (more likely) or between MKLDNN and ACL. Would someone from the Intel team share some insights on this?

Would you help triage? @szha @leezu

CC @josephevans @mseth10 @waytrue17 @sandeep-krishnamurthy
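
For reference, this is roughly the kind of measurement behind the 3-4X number above. It is a hypothetical sketch, not the exact benchmark script; it assumes the Gluon model zoo ResNet-50 and a (16, 3, 512, 512) input on CPU (the issue text mentions batch 16/64).

import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Hypothetical forward-pass benchmark: ResNet-50 on CPU with a
# (batch=16, channels=3, 512, 512) input.
net = vision.resnet50_v1(pretrained=False)
net.initialize(ctx=mx.cpu())
net.hybridize(static_alloc=True, static_shape=True)

x = mx.nd.random.uniform(shape=(16, 3, 512, 512), ctx=mx.cpu())
net(x).wait_to_read()  # warm-up run so lazy initialization is excluded

runs = 10
start = time.time()
for _ in range(runs):
    net(x).wait_to_read()
print("avg forward time: %.3f s" % ((time.time() - start) / runs))

Running the same script against a build with and without the ACL integration gives the relative speedup.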

@Zha0q1 (Contributor, Author) commented May 18, 2021

Here is the source code of the first test case, test_operator.test_convolution_grouping:
https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1674-L1703

@with_seed()
def test_convolution_grouping():
    for dim in [1, 2, 3]:
        num_filter = 4
        for num_group in [1, 2]:
            kernel = (3,) * dim
            shape = (1, 4) + (9,) * dim

The test will only fail when dim=2 and num_group=2, i.e. when kernel=(3, 3) and shape=(1, 4, 9, 9).

The test checks whether the Conv op produces the same output and gradients (for x, w, b) with num_group=2 as when we do the "grouping" manually. For each of the output, x gradient, w gradient, and b gradient, the first halves of the tensors match, but the second halves differ.

Failure output:

Mismatched elements: 98 / 196 (50%)
Max absolute difference: 16.347755
Max relative difference: 23.329699
 x: array([[[[ 1.597207e-01,  1.540166e+00,  6.755731e+00, -5.728347e+00,
           1.296894e+00, -1.326614e+00,  6.375989e+00],
         [ 1.170041e+00,  3.974693e-01,  5.603638e+00, -1.179677e+00,...
 y: array([[[[ 1.597207e-01,  1.540166e+00,  6.755731e+00, -5.728347e+00,
           1.296894e+00, -1.326614e+00,  6.375989e+00],
         [ 1.170040e+00,  3.974691e-01,  5.603638e+00, -1.179677e+00,...

command to run test:

nosetests-3.4 --with-coverage --cover-inclusive --cover-xml --cover-branches --cover-package=mxnet --with-timer --timer-ok 1 --timer-warning 15 --timer-filter warning,error --with-xunit --xunit-file nosetests_unittest.xml --verbose tests/python/unittest/test_operator.py:test_convolution_grouping
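
For context, here is a minimal standalone sketch of the comparison the test performs. This is hypothetical repro code (it uses the imperative NDArray API rather than the Symbol API the actual test uses), with the failing configuration kernel=(3, 3), shape=(1, 4, 9, 9), num_group=2:

import mxnet as mx
import numpy as np

shape = (1, 4, 9, 9)
kernel = (3, 3)
num_filter = 4
num_group = 2
channels_per_group = shape[1] // num_group
filters_per_group = num_filter // num_group

x = mx.nd.random.uniform(-1, 1, shape)
w = mx.nd.random.uniform(-1, 1, (num_filter, channels_per_group) + kernel)
b = mx.nd.random.uniform(-1, 1, (num_filter,))

# Grouped convolution in a single call.
out_grouped = mx.nd.Convolution(data=x, weight=w, bias=b, kernel=kernel,
                                num_filter=num_filter, num_group=num_group)

# Manual grouping: slice input channels, weights, and bias per group,
# convolve each slice separately, then concatenate along the channel axis.
parts = []
for g in range(num_group):
    xs = mx.nd.slice_axis(x, axis=1, begin=g * channels_per_group,
                          end=(g + 1) * channels_per_group)
    ws = mx.nd.slice_axis(w, axis=0, begin=g * filters_per_group,
                          end=(g + 1) * filters_per_group)
    bs = mx.nd.slice_axis(b, axis=0, begin=g * filters_per_group,
                          end=(g + 1) * filters_per_group)
    parts.append(mx.nd.Convolution(data=xs, weight=ws, bias=bs, kernel=kernel,
                                   num_filter=filters_per_group))
out_manual = mx.nd.concat(*parts, dim=1)

# On the MKLDNN + ACL build, only the second half of the output channels differs.
np.testing.assert_allclose(out_grouped.asnumpy(), out_manual.asnumpy(),
                           rtol=1e-3, atol=1e-3)

The actual test additionally binds the symbols with gradients and compares the x, w, and b gradients in the same way.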

@Zha0q1 (Contributor, Author) commented May 18, 2021

Here is the source code of the second test case, test_operator.test_convolution_independent_gradients: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1750-L1829

@with_seed()
def test_convolution_independent_gradients():
    ctx = mx.cpu()
    atol = 1.0e-3
    rtol = 1.0e-3
    reqs = ["null", "write", "add"]
    var_names = ["x", "w", "b"]
    dims = [2]
    num_bases = [1, 8]
    kernel_xs = [3, 5]
    stride_xs = [1, 2]
    pad_xs = [0, 1]
    in_sizes = [7, 32]
    no_biases = [True, False]

This test only fails when dim=2. However, if at the same time we set reqs = ["write", "add"] rather than reqs = ["null", "write", "add"], the test passes.
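
For reference, a minimal hypothetical sketch of the pattern this test exercises: the same Convolution symbol is bound into two executors with different grad_req settings ('null' skips a gradient, 'write' overwrites it, 'add' accumulates), and the forward outputs of the executors are expected to match. This is not the exact test body, just the shape of the check.

import mxnet as mx
import numpy as np

x = mx.sym.Variable('x')
w = mx.sym.Variable('w')
b = mx.sym.Variable('b')
conv = mx.sym.Convolution(data=x, weight=w, bias=b, kernel=(3, 3), num_filter=8)

shapes = {'x': (1, 4, 7, 7), 'w': (8, 4, 3, 3), 'b': (8,)}
args = {k: mx.nd.random.uniform(-1, 1, v) for k, v in shapes.items()}

def zero_grads():
    return {k: mx.nd.zeros(v) for k, v in shapes.items()}

# Executor 1: no gradients are requested at all.
exe1 = conv.bind(mx.cpu(), args=dict(args), args_grad=zero_grads(),
                 grad_req={'x': 'null', 'w': 'null', 'b': 'null'})
# Executor 2: gradients are written for every input.
exe2 = conv.bind(mx.cpu(), args=dict(args), args_grad=zero_grads(),
                 grad_req={'x': 'write', 'w': 'write', 'b': 'write'})

out1 = exe1.forward(is_train=True)[0]
out2 = exe2.forward(is_train=True)[0]

# The forward outputs should be identical regardless of grad_req; on the
# failing build they differ when the 'null' case is included.
np.testing.assert_allclose(out1.asnumpy(), out2.asnumpy(), rtol=1e-3, atol=1e-3)

The actual test then also runs backward and checks that each gradient is left untouched, overwritten, or accumulated independently for x, w, and b.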

failure:

======================================================================
FAIL: test_operator.test_convolution_independent_gradients
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/incubator-mxnet/tests/python/unittest/common.py", line 218, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/tests/python/unittest/test_operator.py", line 2242, in test_convolution_independent_gradients
    exe2.outputs[0].asnumpy(), rtol=rtol, atol=atol)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 1528, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.001, atol=0.001

Mismatched elements: 50 / 50 (100%)
Max absolute difference: 7.249808
Max relative difference: 37.313366
 x: array([[[[-3.723791,  4.639688, -0.084135,  1.286607,  5.300301],
         [ 2.492521, -4.99145 , -2.470633,  0.893196, -1.087579],
         [-0.640068, -1.888586, -1.97117 , -0.046645, -2.467756],...
 y: array([[[[ 2.861494, -1.560436,  2.371599,  6.006711,  1.072686],
         [-3.419539, -2.064956,  2.353943,  0.867886,  0.583497],
         [-3.662952,  0.055854, -2.707001, -3.451663,  2.091815],...
-------------------- >> begin captured logging << --------------------

command to run test:

nosetests-3.4 --with-coverage --cover-inclusive --cover-xml --cover-branches --cover-package=mxnet --with-timer --timer-ok 1 --timer-warning 15 --timer-filter warning,error --with-xunit --xunit-file nosetests_unittest.xml --verbose tests/python/unittest/test_operator.py:test_convolution_independent_gradients

@Zha0q1 (Contributor, Author) commented May 18, 2021

@szha @anko-intel @josephevans I have collected more info about the two test failures. I've also prepared a clean EC2 instance with this specific MXNet build so that others can quickly get access. Please let me know if there is anything to try, or if I can add your secret to the instance. Help is much appreciated!

@cfRod commented Jun 29, 2021

Hi @Zha0q1, could you provide this container and instructions on how to build it?

@cfRod commented Jun 30, 2021

Hi @Zha0q1, can you confirm that gemm:acl is called for the tests on grouped convolutions?
In my logs I see it fall back to reference kernels (gemm:ref) when it performs the following check: https://github.com/oneapi-src/oneDNN/blob/83ebc40d86bc54f0f23e947235e53570eeacf254/src/cpu/aarch64/acl_gemm_convolution_utils.cpp#L72

    // Compute Library unsupported shape scenarios
    if (one_of(true, is_3d, is_1d, with_groups)) {
        return status::unimplemented;
    }

@Zha0q1 (Contributor, Author) commented Jun 30, 2021

Hi @cfRod, thanks for taking a look at this! Yeah, I will prepare a docker image for you today. I am not sure which kernel it was using -- our observation was that MXNet + oneDNN on aarch64 had no issues whatsoever, while MXNet + oneDNN + ACL failed the two tests, so it seems related to the ACL integration.

@cfRod commented Jun 30, 2021

Hi @Zha0q1,
I can confirm that I've built MXNet with oneDNN and ACL, and the behaviour I observed is that this test falls back to reference kernels and passes. The fallback to the oneDNN reference kernel is because of the check in the ACL integration layer (as linked above).

ldd libmxnet.so
        linux-vdso.so.1 (0x0000ffffa7a60000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000ffffa1f30000)
        libdnnl.so.2 => /home/ubuntu/mxnet/mxnet/build/3rdparty/mkldnn/src/libdnnl.so.2 (0x0000ffffa1600000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000ffffa15c0000)
        libarm_compute.so => /home/ubuntu/mxnet/ComputeLibrary/build/libarm_compute.so (0x0000ffffa1230000)
        libarm_compute_graph.so => /home/ubuntu/mxnet/ComputeLibrary/build/libarm_compute_graph.so (0x0000ffffa1130000)
        libarm_compute_core.so => /home/ubuntu/mxnet/ComputeLibrary/build/libarm_compute_core.so (0x0000ffffa08b0000)
        libopenblas.so.0 => /home/ubuntu/mxnet/openblas/0.3.9/lib/libopenblas.so.0 (0x0000ffff9fe70000)

> Yeah I will prepare a docker image for you today.

Cheers, that would be great.

@Zha0q1 (Contributor, Author) commented Jun 30, 2021

Hi @cfRod, thanks for your reply! Hmm, that's weird. Anyway, I am creating a docker image now and will share it shortly.

@Zha0q1 (Contributor, Author) commented Jun 30, 2021

@cfRod I was able to reproduce the test failures within a docker container. I uploaded the image here: public.ecr.aws/y2l8i4q9/acl_debug:latest. Please let me know if you are able to pull and reproduce, thanks!

test_operator.test_convolution_grouping 
test_operator.test_convolution_independent_gradients

P.S. The second test causes a crash for some reason, but this was not an issue in my non-docker Ubuntu environment. I think for now we can start with the first test.

/usr/local/gcc-8.5.0/include/c++/8.5.0/bits/stl_vector.h:950: std::vector<_Tp, _Alloc>::const_reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) const [with _Tp = mxnet::OpReqType; _Alloc = std::allocator<mxnet::OpReqType>; std::vector<_Tp, _Alloc>::const_reference = const mxnet::OpReqType&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
Aborted (core dumped)

@cfRod commented Jul 1, 2021

@Zha0q1, thanks, I was able to pull the image and reproduce your issue.
Update on my build: I was also able to reproduce it on my own build. I will look into this.

@cfRod commented Jul 6, 2021

Hi @Zha0q1 ,

This is my WIP analysis so far (see the verbose logs below to support this):
The convolution with grouping calls into the gemm reference kernels, which is expected, as grouped convolutions are unsupported by Compute Library at the moment.
In the case of manually slicing the input and running the convolution operation on the two slices of input and weights, it calls gemm:acl; the first half of the output matches the output from the reference kernel, but the second half doesn't. My initial analysis is that the primitive is created once and reused twice with different inputs and weights (see logs below). However, Compute Library assumes weights are immutable: it pre-processes the weights internally once per primitive creation, so the second time around the new set of weights is not picked up by ACL.

/home/ubuntu/mxnet/mxnet/src/executor/graph_executor.cc:1992: Subgraph backend MKLDNN is activated.
[14:52:42] /home/ubuntu/mxnet/mxnet/src/executor/graph_executor.cc:1992: Subgraph backend MKLDNN is activated.
dnnl_verbose,info,oneDNN v2.0.0 (commit 83ebc40d86bc54f0f23e947235e53570eeacf254)
dnnl_verbose,info,cpu,runtime:OpenMP
dnnl_verbose,info,cpu,isa:Generic
dnnl_verbose,info,gpu,runtime:none

# THIS IS GEMM REFERENCE

dnnl_verbose,create:cache_miss,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_g2ic4oc4_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.130859
dnnl_verbose,exec,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_g2ic4oc4_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,19.759

# THIS IS GEMM ACL with sliced inputs
dnnl_verbose,create:cache_miss,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.0969238
dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.415039
dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.321045

# THIS IS THE CONCAT STAGE
dnnl_verbose,create:cache_miss,cpu,concat,simple:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,axis:1,1x2x7x7:1x2x7x7 1x4x7x7,0.0700684
dnnl_verbose,exec,cpu,concat,simple:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,axis:1,1x2x7x7:1x2x7x7 1x4x7x7,0.468994

We've seen a similar issue with primitive caching in TensorFlow and I am looking to confirm the hypothesis, possibly by modifying the test.
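
One minimal way to probe that hypothesis outside the unit test would be to run two same-shaped convolutions with different weights back to back. This is a hypothetical sketch only: it assumes the imperative NDArray API and that MXNet reuses the cached MKLDNN primitive for identically-shaped calls.

import mxnet as mx
import numpy as np

x = mx.nd.random.uniform(-1, 1, (1, 2, 9, 9))
w1 = mx.nd.random.uniform(-1, 1, (2, 2, 3, 3))
w2 = mx.nd.random.uniform(-1, 1, (2, 2, 3, 3))
b = mx.nd.zeros((2,))

# Two convolutions with identical shapes but different weights. If the
# gemm:acl primitive is cached and its internally pre-processed weights
# are not refreshed, the second call would silently reuse the first weights.
out1 = mx.nd.Convolution(data=x, weight=w1, bias=b, kernel=(3, 3), num_filter=2)
out2 = mx.nd.Convolution(data=x, weight=w2, bias=b, kernel=(3, 3), num_filter=2)
mx.nd.waitall()

# On a correct build these outputs differ; if the stale-weights hypothesis
# holds, out2 matches out1 instead.
print("outputs identical:", np.allclose(out1.asnumpy(), out2.asnumpy()))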

@Zha0q1 (Contributor, Author) commented Jul 6, 2021

@cfRod, thanks for the detailed analysis, this makes a lot of sense! Just to understand it better: is this a oneDNN & ACL integration issue (rather than an MXNet & oneDNN or TensorFlow & oneDNN integration issue)? Another quick question: do you work with the oneDNN team or with Arm? I am trying to figure out whether we need to loop in more people to fix this : )

@cfRod commented Jul 7, 2021

Hi @Zha0q1, in this case it is an issue with oneDNN and ACL rather than a framework-specific issue.

> do you work with the oneDNN team or with Arm?

Yes, I am part of the team that is working on this.

@Zha0q1 (Contributor, Author) commented Jul 7, 2021

> Hi @Zha0q1, in this case it is an issue with oneDNN and ACL rather than a framework-specific issue.
>
> Yes, I am part of the team that is working on this.

Thanks 👍 👍

@Zha0q1 (Contributor, Author) commented Jul 9, 2021

Hi @cfRod, I have been tracking your discussion on the other thread, thanks for working on this! Meanwhile, were you able to reproduce the other test failure, test_operator.test_convolution_independent_gradients? That test fails with an assertion error (rather than crashing) in my native environment but crashes in the docker container. I am not sure why, but I just wanted to know if you could reproduce it, thanks!

@Zha0q1 (Contributor, Author) commented Jul 12, 2021

Would appreciate your input @cfRod :)

@cfRod commented Jul 13, 2021

Yes, this seems to be an issue that shows up with -D_GLIBCXX_ASSERTIONS turned on.
In MXNet this is enabled by -DCMAKE_BUILD_TYPE=Debug.

Is your non-docker MXNet built with -DCMAKE_BUILD_TYPE=Release?

@Zha0q1 (Contributor, Author) commented Jul 13, 2021

> Yes, this seems to be an issue that shows up with -D_GLIBCXX_ASSERTIONS turned on.
> In MXNet this is enabled by -DCMAKE_BUILD_TYPE=Debug.
>
> Is your non-docker MXNet built with -DCMAKE_BUILD_TYPE=Release?

Right, I think the non-docker build is Release. So you were able to reproduce test_operator.test_convolution_independent_gradients?

@cfRod commented Jul 13, 2021

Yes, I was able to reproduce the issue outside docker with a -DCMAKE_BUILD_TYPE=Debug build.

@cfRod commented Jul 19, 2021

Hi @Zha0q1,
Could you point me to all the tests that I need to run for submitting a PR on AArch64? Is it nosetests --verbose tests/python/unittest?

@Zha0q1 (Contributor, Author) commented Jul 19, 2021

> Hi @Zha0q1,
> Could you point me to all the tests that I need to run for submitting a PR on AArch64? Is it nosetests --verbose tests/python/unittest?

Yeah, I think that's exactly right :)

@cfRod commented Jul 20, 2021

Related question raised here: #20457

cfRod added a commit to cfRod/incubator-mxnet that referenced this issue Aug 3, 2021
This fix is a workaround for the accuracy issue observed when MXNet is built with Compute Library (ACL).
This change includes:
* Updating MXNet's AddSign function to generate unique hashes for MKLDNN-ACL backend.
* Adding DNNL_AARCH64_USE_ACL to CMakeLists.txt
* Adding Crefeda Rodrigues to the contributors list

Signed-off-by: cfRod <crefeda.rodrigues@arm.com>
cfRod added a commit to cfRod/incubator-mxnet that referenced this issue Sep 6, 2021
This fix is a workaround for the accuracy issue observed when MXNet is built with Compute Library (ACL).
This change includes:
* Updating MXNet's AddSign function to generate unique hashes for MKLDNN-ACL backend.
* Adding DNNL_AARCH64_USE_ACL to CMakeLists.txt
* Adding Crefeda Rodrigues to the contributors list

Signed-off-by: cfRod <crefeda.rodrigues@arm.com>
cfRod added a commit to cfRod/incubator-mxnet that referenced this issue Sep 7, 2021
This fix is a workaround for the accuracy issue observed when MXNet is built with Compute Library (ACL).
This change includes:
* Updating MXNet's AddSign function to generate unique hashes for MKLDNN-ACL backend.
* Adding DNNL_AARCH64_USE_ACL to CMakeLists.txt
* Adding Crefeda Rodrigues to the contributors list

Signed-off-by: cfRod <crefeda.rodrigues@arm.com>
szha pushed a commit that referenced this issue Sep 8, 2021
…20482)

This fix is a workaround for the accuracy issue observed when MXNet is built with Compute Library (ACL).
This change includes:
* Updating MXNet's AddSign function to generate unique hashes for MKLDNN-ACL backend.
* Adding DNNL_AARCH64_USE_ACL to CMakeLists.txt
* Adding Crefeda Rodrigues to the contributors list

Signed-off-by: cfRod <crefeda.rodrigues@arm.com>
@mseth10 (Contributor) commented Sep 8, 2021

Thanks @cfRod for adding the workaround for test_convolution_grouping.
Can you also give an update on the status of test_convolution_independent_gradients? From your comments, I gather that you were able to reproduce it with the -DCMAKE_BUILD_TYPE=Debug build flag.

@cfRod commented Sep 15, 2021

Hi @mseth10, I was able to reproduce the crash caused by _GLIBCXX_ASSERTIONS being turned on, but I have not looked further into the second issue.

josephevans pushed a commit to josephevans/mxnet that referenced this issue Feb 28, 2022
…. (apache#20482)

This fix is a workaround for the accuracy issue observed when MXNet is built with Compute Library (ACL).
This change includes:
* Updating MXNet's AddSign function to generate unique hashes for MKLDNN-ACL backend.
* Adding DNNL_AARCH64_USE_ACL to CMakeLists.txt
* Adding Crefeda Rodrigues to the contributors list

Signed-off-by: cfRod <crefeda.rodrigues@arm.com>
josephevans added a commit that referenced this issue Mar 1, 2022
…20482) (#20921)

This fix is a workaround for the accuracy issue observed when MXNet is built with Compute Library (ACL).
This change includes:
* Updating MXNet's AddSign function to generate unique hashes for MKLDNN-ACL backend.
* Adding DNNL_AARCH64_USE_ACL to CMakeLists.txt
* Adding Crefeda Rodrigues to the contributors list

Signed-off-by: cfRod <crefeda.rodrigues@arm.com>

Co-authored-by: Crefeda Rodrigues <65665931+cfRod@users.noreply.github.com>