setup CUDA CI job #3424

Merged (2 commits, merged on Oct 26, 2020)

Conversation

@StrikerRUS (Collaborator) commented Sep 30, 2020

Closed #3402.

@StrikerRUS force-pushed the gha_cuda branch 9 times, most recently from 039ae7d to c32f4f4 on September 30, 2020 23:15
Comment on lines +4 to +5
pull_request_review_comment:
types: [created]
@StrikerRUS (Collaborator Author):

Probably a simple issue_comment trigger would be easier, but it requires the workflow config to be in the master branch, so I cannot test it right now in this PR.
https://docs.github.com/en/free-pro-team@latest/actions/reference/events-that-trigger-workflows#issue_comment
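For reference, a minimal, untested sketch of what the issue_comment variant could look like; the job name and trigger phrase are taken from this PR, everything else is an assumption:

on:
  issue_comment:
    types: [created]

jobs:
  test:
    name: CUDA
    runs-on: [self-hosted, linux]
    # issue_comment also fires for plain issues, so restrict it to PR comments
    # containing the trigger phrase and posted by trusted roles
    if: >-
      github.event.issue.pull_request &&
      github.event.comment.body == '/gha run cuda-builds' &&
      contains('OWNER,MEMBER,COLLABORATOR', github.event.comment.author_association)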

Collaborator:

oh interesting. I think it's ok to leave it like this if it works!

Comment on lines +38 to +39
- name: Remove old folder with repository
run: sudo rm -rf $GITHUB_WORKSPACE
@StrikerRUS (Collaborator Author):

This step is needed because actions/checkout@v1 fails to remove old files (particularly CMake temporary build files) from previous runs, because they were created inside Docker by another user.

Warning: Unable to run "git clean -ffdx" and "git reset --hard HEAD" successfully, delete source folder instead.
Error: One or more errors occurred. (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeCCompiler.cmake' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/cmake.check_cache' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeDetermineCompilerABI_C.bin' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/CMakeError.log' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeDetermineCompilerABI_CXX.bin' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeCache.txt' is denied.)) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeCCompiler.cmake' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/cmake.check_cache' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeDetermineCompilerABI_C.bin' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/CMakeError.log' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeDetermineCompilerABI_CXX.bin' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeCache.txt' is denied.)
Error: Exit code 1 returned from process: file name '/home/guoke/actions-runner/bin/Runner.PluginHost', arguments 'action "GitHub.Runner.Plugins.Repository.v1_0.CheckoutTask, Runner.Plugins"'.

$ROOT_DOCKER_FOLDER/.ci/setup.sh || exit -1
$ROOT_DOCKER_FOLDER/.ci/test.sh
EOF
sudo docker run --env-file docker.env -v "$GITHUB_WORKSPACE":"$ROOT_DOCKER_FOLDER" --rm --gpus all nvidia/cuda:11.0-devel-ubuntu20.04 /bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh
@StrikerRUS (Collaborator Author):

sudo is used to work around the following error:

docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/create: dial unix /var/run/docker.sock: connect: permission denied.
See 'docker run --help'.

test:
name: CUDA
runs-on: [self-hosted, linux]
if: github.event.comment.body == '/gha run cuda-builds' && contains('OWNER,MEMBER,COLLABORATOR', github.event.comment.author_association)
@StrikerRUS (Collaborator Author):

/gha run cuda-builds

@StrikerRUS (Collaborator Author):

/gha run cuda-builds

Collaborator:

this is so cool!!

@StrikerRUS changed the title from [WIP] setup CUDA CI job to setup CUDA CI job on Sep 30, 2020
@StrikerRUS marked this pull request as ready for review on September 30, 2020 23:36
@StrikerRUS (Collaborator Author)

Further possible improvements:

  • shut down the machine after the build is finished (CI CUDA job #3402 (comment)) and turn it on again on an appropriate trigger;
  • allow only skipped and succeeded statuses of CUDA builds when merging PRs (right now I cannot even see it in the PR checks).

Unfortunately, I'm not sure I'll be able to work on the items listed above.

@guolinke (Collaborator) commented Oct 1, 2020

@StrikerRUS It seems that currently neither Azure Pipelines nor GitHub Actions can power machines off and on.
Even with a VM scale set in Azure Pipelines, at least one machine needs to be on standby.
I know there are some external solutions, but they seem to require my Azure (Microsoft) account permissions, which seems to violate our policy.

@StrikerRUS (Collaborator Author)

@guolinke

I know there are some external solutions ...

Yeah, and they are also a ton of pain! For instance:
dmlc/xgboost#4958
https://github.com/hcho3/xgboost-devops

... but they seem to require my Azure (Microsoft) account permissions, which seems to violate our policy.

Yep, that's true! 😞

https://github.com/hcho3/xgboost-devops/blob/df6c582ac65d237632ec81c0a739ecdb7e9d77e0/.github/workflows/deploy-lambda.yml#L27-L28

@StrikerRUS (Collaborator Author)

Unfortunately, switching to a P100 didn't help to get rid of the segfault. When I run LightGBM\examples\python-guide\simple_example.py, I get the following, more informative, logs:

Starting training...
Warning: ] [Warning] CUDA currently requires double precision calculations.
Warning: ] [Warning] Using sparse features with CUDA is currently not supported.
Warning: ] [Warning] CUDA currently requires double precision calculations.
Traceback (most recent call last):
  File "/LightGBM/examples/python-guide/simple_example.py", line 35, in <module>
    gbm = lgb.train(params,
  File "/root/.local/lib/python3.8/site-packages/lightgbm/engine.py", line 231, in train
    booster = Booster(params=params, train_set=train_set)
  File "/root/.local/lib/python3.8/site-packages/lightgbm/basic.py", line 1988, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/root/.local/lib/python3.8/site-packages/lightgbm/basic.py", line 55, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414

However, I don't think that it should block this PR from merging.

@guolinke (Collaborator) commented Oct 2, 2020

@StrikerRUS so is it a hardware problem, or something else?

@StrikerRUS (Collaborator Author)

@guolinke TBH, I have no idea...
Maybe we can ask @ChipKerchner, or first fix the following CMake warning (maybe it is important):

CMake Warning (dev) in CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "histo_16_64_256-allfeats_sp_const".
This warning is for project developers.  Use -Wno-dev to suppress it.

@ChipKerchner (Contributor)

CUDA_ARCHITECTURES

Sorry, this is a new feature in CMake 3.18 and I'm not familiar with it.

@StrikerRUS (Collaborator Author)

@ChipKerchner Thanks for your super fast response!

I'll try to re-run with older CMake.

Will it be possible to adapt the CMakeLists.txt code to the recent CMake changes in the future?

@StrikerRUS (Collaborator Author)

Just tested CMake 3.16.7: https://github.com/microsoft/LightGBM/runs/1199141320?check_suite_focus=true. Unfortunately, the same output.

@StrikerRUS (Collaborator Author)

Will it be possible to adapt the CMakeLists.txt code to the recent CMake changes in the future?

Looks like we could do the same:
https://github.com/dmlc/xgboost/blob/e0e4f15d0e314afcf44d690e0295fa6320fc7f64/cmake/Utils.cmake#L104-L111
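A rough, untested sketch of that kind of guard, just to make the idea concrete (the architecture number is only an example matching the CI machine's P100, not necessarily what xgboost pins):

# sketch only: silence CMP0104 by choosing CUDA architectures explicitly
# when the user has not provided them; set this before the CUDA language is enabled
if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.18")
  cmake_policy(SET CMP0104 NEW)
  if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    set(CMAKE_CUDA_ARCHITECTURES 60)  # compute capability 6.0 (P100)
  endif()
endif()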

@sh1ng (Contributor) commented Oct 5, 2020

I'm not sure that it's related to CUDA_ARCHITECTURES. It's already compiled for multiple architectures https://github.com/microsoft/LightGBM/blob/master/CMakeLists.txt#L159.

https://github.com/microsoft/LightGBM/blob/master/src/treelearner/cuda_tree_learner.cpp#L414
The host memory has to be page-locked. Is it allocated properly?
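To make that concrete, here is a standalone sketch (not LightGBM's actual code) of the pattern being described: a page-locked host buffer plus a checked cudaMemcpyAsync, so a bad pointer or size surfaces as "invalid argument" with file and line, much like the message coming from cuda_tree_learner.cpp:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// report any CUDA error together with file and line, then abort
#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
      std::fprintf(stderr, "[CUDA] %s %s %d\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
      std::exit(EXIT_FAILURE); \
    } \
  } while (0)

int main() {
  const size_t n = 1 << 20;
  float *h_pinned = nullptr, *d_data = nullptr;
  cudaStream_t stream;
  // page-locked host allocation: lets cudaMemcpyAsync actually run asynchronously;
  // a plain malloc'ed buffer silently falls back to a synchronous copy
  CUDA_CHECK(cudaMallocHost(&h_pinned, n * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
  CUDA_CHECK(cudaStreamCreate(&stream));
  for (size_t i = 0; i < n; ++i) h_pinned[i] = static_cast<float>(i);
  // the byte count must match both allocations; an oversized count or a stale
  // pointer is a typical way to get cudaErrorInvalidValue ("invalid argument") here
  CUDA_CHECK(cudaMemcpyAsync(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream));
  CUDA_CHECK(cudaStreamSynchronize(stream));
  CUDA_CHECK(cudaStreamDestroy(stream));
  CUDA_CHECK(cudaFree(d_data));
  CUDA_CHECK(cudaFreeHost(h_pinned));
  return 0;
}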

@sh1ng (Contributor) commented Oct 6, 2020

========= CUDA-MEMCHECK
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Info] LightGBM using CUDA trainer with DP float!!
[LightGBM] [Info] Total Bins 22008
[LightGBM] [Info] Number of data points in the train set: 1348045, number of used features: 150
[LightGBM] [Debug] device_bin_size_ = 256
[LightGBM] [Debug] Resized feature masks
[LightGBM] [Debug] Memset pinned_feature_masks_
[LightGBM] [Debug] Allocated device_features_ addr=0x7f3230000000 sz=202206750
[LightGBM] [Debug] Memset device_data_indices_
[LightGBM] [Debug] created device_subhistograms_: 0x7f323e600000
[LightGBM] [Debug] Started copying dense features from CPU to GPU
[LightGBM] [Debug] Started copying dense features from CPU to GPU - 1
[LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaMemcpyAsync.

Seems like an incorrect pointer or size.

@guolinke (Collaborator)

@StrikerRUS any updates of this PR?

@StrikerRUS (Collaborator Author)

@guolinke

any updates of this PR?

Which updates do you mean? I think this PR is ready.

@guolinke (Collaborator)

Great! I thought the CUDA job could not run.

@StrikerRUS (Collaborator Author)

@guolinke

I thought the CUDA job could not run.

It can, but unfortunately it fails with the following runtime error: #3424 (comment).
That error has also been reproduced by independent CUDA early adopters.

@guolinke (Collaborator)

Any insights regarding these errors? @ChipKerchner

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked the pull request as resolved and limited conversation to collaborators on Aug 24, 2023