[CI][1.x] Cherrypick: Upgrade unix gpu toolchain (#18186) #18785
Conversation
Hey @ChaiBapchya, thanks for submitting the PR.
CI supported jobs: [centos-cpu, clang, edge, centos-gpu, unix-gpu, website, windows-cpu, miscellaneous, sanity, unix-cpu, windows-gpu]
@mxnet-bot run ci [unix-gpu]
I ran this locally to try to reproduce the CI error, but it passes and doesn't throw the nvidia-docker error.
@leezu @josephevans any idea? I can confirm it translates into the equivalent command.
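For reference, a plausible shape of that equivalent command under the two toolchains; the image tag, mount paths, and function name below are assumptions for illustration, not the exact command from the log:

```bash
# Legacy path (G3 AMI): nvidia-docker (equivalently --runtime=nvidia) injects the GPU driver.
nvidia-docker run --rm -v "$(pwd)":/work/mxnet mxnetci/build.ubuntu_gpu_cu101 \
    /work/runtime_functions.sh unittest_ubuntu_python3_gpu

# New path (G4 AMI, Docker >= 19.03 with nvidia-container-toolkit): the native
# --gpus flag replaces nvidia-docker entirely.
docker run --rm --gpus all -v "$(pwd)":/work/mxnet mxnetci/build.ubuntu_gpu_cu101 \
    /work/runtime_functions.sh unittest_ubuntu_python3_gpu
```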
That's related to the AMI. You could also update the build.py script to run
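The elided suggestion presumably concerns which GPU flag build.py hands to docker. A minimal sketch of that kind of check, assuming the new AMI ships Docker >= 19.03 with the nvidia-container-toolkit (flag-detection logic is illustrative, not the actual patch):

```bash
# Prefer the native --gpus flag when the local Docker supports it;
# fall back to the older nvidia-docker2 runtime otherwise.
if docker run --help 2>/dev/null | grep -q -- '--gpus'; then
    GPU_ARGS="--gpus all"
else
    GPU_ARGS="--runtime=nvidia"
fi
docker run --rm $GPU_ARGS mxnetci/build.ubuntu_gpu_cu101 nvidia-smi
```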
Never mind. @josephevans helped me identify that, before calling run_container, it was building that docker container first, and while building it was using
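In other words, the failure surfaced in the image-build phase, before run_container ever launched anything with GPU access. Roughly, under assumed Dockerfile paths and tags:

```bash
# Phase 1: ci/build.py builds the image first -- this is where the error appeared.
docker build -f ci/docker/Dockerfile.build.ubuntu_gpu_cu101 \
    -t mxnetci/build.ubuntu_gpu_cu101 ci/docker

# Phase 2: only afterwards does run_container start the image with GPU access.
docker run --rm --gpus all mxnetci/build.ubuntu_gpu_cu101 \
    /work/runtime_functions.sh sanity_check
```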
I think when enabling the branch protection, we accidentally turned on "Require branches to be up to date before merging". I'm requesting to disable it in https://issues.apache.org/jira/browse/INFRA-20616. Don't worry about updating the branch in this PR for now.
@mxnet-bot run ci [unix-gpu]
Jenkins CI successfully triggered: [unix-gpu]
@mxnet-bot run ci [unix-gpu]
Jenkins CI successfully triggered: [unix-gpu]
* update nvidia-docker command & remove cuda compat
* replace cu101 with cuda since compat is no longer to be used
* skip flaky tests
* get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat
* Revert "skip flaky tests" (this reverts commit 1c720fa)
* revert removal of ubuntu_build_cuda
* add linux gpu g4 node to all steps using g3 in unix-gpu pipeline
* Remove mention of nightly in pypi (apache#18635)
* update bert dev.tsv link

Co-authored-by: Sheng Zha <szha@users.noreply.github.com>
ubuntu_gpu_cu101 on the 1.x branch relies on libcuda compat. However, after upgrading from G3 to G4 instances, we no longer rely on libcuda compat: it gives a CUDA driver/display driver mismatch error if used. Upon removing the LD_LIBRARY_PATH kludge for libcuda compat, 4 builds in the unix-gpu pipeline failed because TVM=ON relies on libcuda compat. Note: I haven't cherry-picked that PR because master branch CI differs from v1.x [e.g., most unix-gpu builds on master use cmake instead of make].
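For context, the compat kludge usually amounts to prepending CUDA's forward-compatibility driver stubs to the library path; a sketch of what was removed, with the exact path being an assumption:

```bash
# CUDA images ship forward-compat driver libraries under /usr/local/cuda/compat.
# Prepending them lets a container built for a newer CUDA run on a host with an
# older driver:
export LD_LIBRARY_PATH=/usr/local/cuda/compat:${LD_LIBRARY_PATH}

# On the G4 AMI the host driver is already new enough, and loading the compat
# libcuda alongside it triggers the driver/display-driver mismatch error, so the
# fix is simply to stop prepending this directory.
```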
@mxnet-bot run ci [unix-gpu]
Re-triggering for a flaky issue.
Jenkins CI successfully triggered: [unix-gpu]
@jinboci I saw one of your PRs for fixing TVM Op errors. Any idea why this test fails when using TVM=ON? Common stack trace:
In CI, Jenkins_steps.groovy for the Python3 GPU stage unpacks the build artifacts, including mx_lib_cython, where mx_lib_cython is a subset of mx_lib_cpp_examples. Based on the stack trace, it's throwing a TVM runtime check failure for the allocated size.
@ChaiBapchya on master, |
@ChaiBapchya It seems the unix-gpu tests have passed. Most of my work on TVMOp was written up in issue #18716. However, I don't think we were encountering the same problem.
Yeah, I've dropped TVMOp support from the unix-gpu pipeline, and that caused the pipeline to pass.
@mxnet-label-bot add [pr-awaiting-review]
@mxnet-bot run ci [windows-gpu]
Retriggering as windows-gpu timed out.
Jenkins CI successfully triggered: [windows-gpu]
Leverage G4 instances for unix-gpu instead of G3:

* update nvidia-docker command & remove cuda compat
* replace cu101 with cuda since compat is no longer to be used
* skip flaky tests
* get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat
* Revert "skip flaky tests" (this reverts commit 1c720fa)
* revert removal of ubuntu_build_cuda
* add linux gpu g4 node to all steps using g3 in unix-gpu pipeline
Refer: #18186