This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes #18632

Merged: 3 commits merged into apache:v1.6.x on Jul 2, 2020

Conversation

ciyongch
Contributor

Description

When calibrating with variable input shapes, a new executor is created whenever the current input has a different shape from the previous one. However, the callback function is bound only to the very first executor; it is not passed down to the succeeding executors, which share the same symbol.
This PR passes the callback function down to the newly created executors, which addresses the calibration skipping issue.
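
A minimal sketch of the failure mode described above, written in plain Python rather than against MXNet itself. `ToyExecutor`, its `reshape` method, and the `propagate` flag are hypothetical stand-ins invented for illustration; they are not MXNet's executor API, and this is not the code changed by the PR.

```python
from typing import Callable, List, Optional, Tuple

Shape = Tuple[int, ...]
Callback = Callable[[str, Shape], None]


class ToyExecutor:
    """Hypothetical stand-in for an executor bound to one input shape."""

    def __init__(self, shape: Shape, propagate: bool) -> None:
        self.shape = shape
        self.propagate = propagate  # True models the behaviour after the fix
        self.callback: Optional[Callback] = None

    def set_monitor_callback(self, callback: Callback) -> None:
        self.callback = callback

    def reshape(self, new_shape: Shape) -> "ToyExecutor":
        # A different input shape needs a fresh executor for the same symbol.
        new_exec = ToyExecutor(new_shape, self.propagate)
        if self.propagate and self.callback is not None:
            # The idea of the fix: hand the callback down to the new executor.
            new_exec.set_monitor_callback(self.callback)
        return new_exec

    def forward(self) -> None:
        # Only an executor that holds a callback reports anything.
        if self.callback is not None:
            self.callback("output", self.shape)


def run_calibration(propagate: bool) -> List[Shape]:
    """Feed three differently shaped inputs and record which ones were monitored."""
    seen: List[Shape] = []
    exe = ToyExecutor((1, 3, 10, 10), propagate)
    exe.set_monitor_callback(lambda name, shape: seen.append(shape))
    for shape in [(1, 3, 10, 10), (1, 3, 10, 14), (1, 3, 12, 14)]:
        if shape != exe.shape:
            exe = exe.reshape(shape)
        exe.forward()
    return seen


before_fix = run_calibration(propagate=False)
after_fix = run_calibration(propagate=True)
print(len(before_fix), len(after_fix))
```

In this toy model, only the first input is monitored when the callback is not propagated, while all three inputs are monitored once it is handed down on reshape.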

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
      • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change


@pengzhao-intel @TaoLv @ChaiBapchya @szha

@ciyongch ciyongch requested a review from szha as a code owner June 28, 2020 03:24
@mxnet-bot

Hey @ciyongch , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, website, centos-cpu, windows-cpu, unix-gpu, sanity, edge, miscellaneous, clang, centos-gpu, unix-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@pengzhao-intel
Contributor

Great job root-causing this bug :)

@ciyongch
Contributor Author

@ChaiBapchya @leezu it looks like there are CI issues on the current v1.6.x branch, which already existed as of the previous commit #18586. Do you know if anyone is working on this? Thanks!

edge

[2020-06-28T03:57:59.801Z] + python setup.py bdist_wheel --universal
[2020-06-28T03:57:59.801Z] Traceback (most recent call last):
[2020-06-28T03:57:59.801Z]   File "setup.py", line 23, in <module>
[2020-06-28T03:57:59.801Z]     from setuptools import find_packages # This must precede distutils
[2020-06-28T03:57:59.801Z] ImportError: No module named setuptools

unix-gpu

[2020-06-28T04:11:22.856Z] /work/runtime_functions.sh: line 2083: build_ubuntu_gpu_cuda101_cudnn7_mkldnn_cpp_test: command not found

@pengzhao-intel
Contributor

@sandeep-krishnamurthy @ChaiBapchya for help :)

@ChaiBapchya
Contributor

ChaiBapchya commented Jun 29, 2020

@ciyongch @PatricZhao
The latest commit in the 1.6 branch was erroneously merged before it passed CI. That commit tried to fix the edge/centos-cpu/gpu pipelines.

Going forward, I created another PR on 1.6.x branch: #18597
That PR will:

  • revert the erroneously merged commit
  • fix centos link issue
  • fix edge pipeline

However, it fails on setuptools as you pointed out. I'll try to get that fixed so that we can get the CI fixed for 1.6.x
Once merged we can rebase this PR.

@ChaiBapchya
Contributor

It passed all 11 jobs; why did we have to retrigger? Is codecov blocking the merge?
Also we should try to use mxnet-bot for re-triggering specific pipelines if any.

@ciyongch
Contributor Author

ciyongch commented Jul 1, 2020

Hi @ChaiBapchya, I saw that the codecov test cases failed, and mxnet-bot doesn't support re-triggering them. I wasn't sure whether they block the merge, so I just re-triggered the cases.

@ChaiBapchya
Contributor

I don't think that's the case.
@szha @leezu can confirm if code-cov is a blocker.
If it's not a blocker, let's get this PR merged.
Also, @ciyongch, I'm guessing since this fix is made to the executor,

  1. this is part of other branches as well?
  2. also this change doesn't have any test, is that already tested somewhere? how can we confirm?

@ciyongch
Contributor Author

ciyongch commented Jul 1, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@ciyongch
Contributor Author

ciyongch commented Jul 1, 2020

  • this is part of other branches as well?

Yes, this is a common issue across all the current branches. I will backport the fix to the other branches as well.

  • also this change doesn't have any test, is that already tested somewhere? how can we confirm?

Currently, we've only verified this via a customized case, which is fairly complicated. I will try to add some tests later to cover it.

@ciyongch
Contributor Author

ciyongch commented Jul 1, 2020

The codecov failures are still there, but I don't think they should be a blocker.

@sandeep-krishnamurthy
Contributor

Codecov is not a blocker.

@sandeep-krishnamurthy
Contributor

@pengzhao-intel @TaoLv this will be good to go after your review and approval

@ChaiBapchya
Contributor

"I will try to add some tests later to cover it."

Can we add a basic test to verify this? I guess reviewers would feel confident to approve this once they know there is a proper test to verify it and that it passes. @sandeep-krishnamurthy wdyt?

@leezu
Contributor

leezu commented Jul 1, 2020

To "fix" the codecov showing up on the 1.x branches, you can include the 3 lines from https://github.com/apache/incubator-mxnet/pull/18497/files
cc @sandeep-krishnamurthy @ciyongch @ChaiBapchya

@ciyongch
Contributor Author

ciyongch commented Jul 2, 2020

@ChaiBapchya We've verified the fix offline with customized cases. Still, it's quite reasonable to add a UT to cover this case, and I will try to add one today.
@leezu thanks for pointing it out.
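
A unit test of the kind discussed here could take roughly the following shape. This is a hedged, framework-free sketch: `MinMaxCollector`, `fake_forward`, and the tensor name `conv_output` are hypothetical and chosen only for illustration. A real test would presumably have to bind an actual MXNet module and feed it an input whose shape differs from the bound one.

```python
from typing import Callable, Dict, List, Tuple


class MinMaxCollector:
    """Hypothetical calibration collector: running (min, max) per tensor name."""

    def __init__(self) -> None:
        self.ranges: Dict[str, Tuple[float, float]] = {}

    def collect(self, name: str, values: List[float]) -> None:
        lo, hi = min(values), max(values)
        if name in self.ranges:
            old_lo, old_hi = self.ranges[name]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        self.ranges[name] = (lo, hi)


def fake_forward(batch: List[float],
                 callback: Callable[[str, List[float]], None]) -> None:
    """Stand-in for a forward pass that reports one output tensor to the monitor."""
    callback("conv_output", [2.0 * x for x in batch])


def test_every_batch_is_observed() -> None:
    collector = MinMaxCollector()
    # Batches of different lengths stand in for variable input shapes.
    batches = [[0.5, 1.0], [-3.0, 0.0, 2.0], [4.0]]
    for batch in batches:
        fake_forward(batch, collector.collect)
    # Had the later batches been skipped, the range would have stayed (1.0, 2.0).
    assert collector.ranges["conv_output"] == (-6.0, 8.0)


test_every_batch_is_observed()
```

The point of asserting on the collected range, rather than only on the callback being set, is that a silently skipped callback shows up as a calibration range that never widens.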

Contributor

@pengzhao-intel pengzhao-intel left a comment

LGTM

@ciyongch
Contributor Author

ciyongch commented Jul 2, 2020

Hi @ChaiBapchya @leezu @pengzhao-intel @TaoLv, all the CI jobs have now passed and the UT has been added as well. Please help to merge, thanks.

Contributor

@ChaiBapchya ChaiBapchya left a comment

Thanks for adding the UT. LGTM!

@TaoLv TaoLv merged commit e503704 into apache:v1.6.x Jul 2, 2020
ciyongch added a commit to ciyongch/incubator-mxnet that referenced this pull request Jul 14, 2020
… variable input shapes (apache#18632)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov
TaoLv added a commit that referenced this pull request Jul 15, 2020
… variable input shapes (#18632) (#18703)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov

Co-authored-by: Tao Lv <tao.a.lv@intel.com>
leezu pushed a commit to leezu/mxnet that referenced this pull request Oct 1, 2020
samskalicky pushed a commit that referenced this pull request Oct 2, 2020
* * Fix einsum gradient (#18482)

* [v1.7.x] Backport PRs of numpy features (#18653)

* add zero grad for npi_unique (#18080)

* fix np.clip scalar input case (#17788)

* fix true_divide (#18393)

Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
Co-authored-by: Xi Wang <xidulu@gmail.com>

* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)

* Fix Windows GPU CI (#17962)

Update Windows CI to use VS 2019 and enable the x64 toolchain. Previously we were using an older 32-bit toolchain, causing OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR, as cmake 3.17 is now preinstalled. Add build retries due to cuda thrust + VS2019 flakiness.

Co-authored-by: vexilligera <vexilligera@gmail.com>

* backport mixed type

Co-authored-by: Leonard Lausen <lausen@amazon.com>
Co-authored-by: vexilligera <vexilligera@gmail.com>

* revise activations (#18700)

* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov

Co-authored-by: Tao Lv <tao.a.lv@intel.com>

* Fail build_windows.py if all retries failed (#18177)

* Update to thrust 1.9.8 on Windows (#18218)

* Update to thrust 1.9.8 on Windows

* Remove debug logic

* Re-enable build retries on MSVC (#18230)

Updating thrust alone did not help. Similar issues (though less often) still
occur with updated thrust, and also with nvidia cub. Tracked upstream at
NVIDIA/thrust#1090

Co-authored-by: Ke Han <38852697+hanke580@users.noreply.github.com>
Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk>
Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
Co-authored-by: Xi Wang <xidulu@gmail.com>
Co-authored-by: Yijun Chen <chenyijun0902@gmail.com>
Co-authored-by: vexilligera <vexilligera@gmail.com>
Co-authored-by: ciyong <ciyong.chen@intel.com>
Co-authored-by: Tao Lv <tao.a.lv@intel.com>
samskalicky pushed a commit to samskalicky/incubator-mxnet that referenced this pull request Oct 2, 2020
samskalicky added a commit that referenced this pull request Oct 3, 2020