setup CUDA CI job #3424

Merged (2 commits, merged on Oct 26, 2020)

Conversation

@StrikerRUS (Collaborator) commented Sep 30, 2020

Closed #3402.

@StrikerRUS force-pushed the gha_cuda branch 9 times, most recently from 039ae7d to c32f4f4 on September 30, 2020 23:15
Comment on lines +4 to +5
pull_request_review_comment:
types: [created]
@StrikerRUS (Collaborator Author):

Probably a simple issue_comment trigger would be easier, but it requires the workflow config to be in the master branch, so I cannot test it right now in this PR.
https://docs.github.com/en/free-pro-team@latest/actions/reference/events-that-trigger-workflows#issue_comment
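For reference, a minimal, untested sketch of what the issue_comment variant could look like; the job name and trigger phrase are taken from this PR, everything else is an assumption:

on:
  issue_comment:
    types: [created]

jobs:
  test:
    name: CUDA
    runs-on: [self-hosted, linux]
    # issue_comment also fires for plain issues, so restrict it to PR comments
    # containing the trigger phrase and posted by trusted roles
    if: >-
      github.event.issue.pull_request &&
      github.event.comment.body == '/gha run cuda-builds' &&
      contains('OWNER,MEMBER,COLLABORATOR', github.event.comment.author_association)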

Collaborator:

oh interesting. I think it's ok to leave it like this if it works!

Comment on lines +38 to +39
- name: Remove old folder with repository
run: sudo rm -rf $GITHUB_WORKSPACE
@StrikerRUS (Collaborator Author):

This step is needed because actions/checkout@v1 fails to remove old files (particularly CMake temporary build files) from previous runs, because they were created inside Docker by another user.

Warning: Unable to run "git clean -ffdx" and "git reset --hard HEAD" successfully, delete source folder instead.
Error: One or more errors occurred. (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeCCompiler.cmake' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/cmake.check_cache' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeDetermineCompilerABI_C.bin' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/CMakeError.log' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeDetermineCompilerABI_CXX.bin' is denied.)) (One or more errors occurred. (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeCache.txt' is denied.)) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeCCompiler.cmake' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/cmake.check_cache' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeDetermineCompilerABI_C.bin' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/CMakeError.log' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeFiles/3.18.1/CMakeDetermineCompilerABI_CXX.bin' is denied.) (Access to the path '/home/guoke/actions-runner/_work/LightGBM/LightGBM/build/CMakeCache.txt' is denied.)
Error: Exit code 1 returned from process: file name '/home/guoke/actions-runner/bin/Runner.PluginHost', arguments 'action "GitHub.Runner.Plugins.Repository.v1_0.CheckoutTask, Runner.Plugins"'.

$ROOT_DOCKER_FOLDER/.ci/setup.sh || exit -1
$ROOT_DOCKER_FOLDER/.ci/test.sh
EOF
sudo docker run --env-file docker.env -v "$GITHUB_WORKSPACE":"$ROOT_DOCKER_FOLDER" --rm --gpus all nvidia/cuda:11.0-devel-ubuntu20.04 /bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh
@StrikerRUS (Collaborator Author):

sudo is used to work around the following error:

docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/create: dial unix /var/run/docker.sock: connect: permission denied.
See 'docker run --help'.

test:
name: CUDA
runs-on: [self-hosted, linux]
if: github.event.comment.body == '/gha run cuda-builds' && contains('OWNER,MEMBER,COLLABORATOR', github.event.comment.author_association)
@StrikerRUS (Collaborator Author):

/gha run cuda-builds

@StrikerRUS (Collaborator Author):

/gha run cuda-builds

Collaborator:

this is so cool!!

@StrikerRUS changed the title from [WIP] setup CUDA CI job to setup CUDA CI job on Sep 30, 2020
@StrikerRUS marked this pull request as ready for review on September 30, 2020 23:36
@StrikerRUS (Collaborator Author)

Further possible improvements:

  • shut down the machine after the build is finished (CI CUDA job #3402 (comment)) and turn it on again on an appropriate trigger;
  • allow only skipped and succeeded statuses of CUDA builds when merging PRs (right now I cannot even see it in the PR checks).

Unfortunately, I'm not sure I'll be able to work on the items listed above.

@guolinke (Collaborator) commented Oct 1, 2020

@StrikerRUS It seems that currently neither Azure Pipelines nor GitHub Actions can power machines off and on.
Even with a VM scale set in Azure Pipelines, at least one machine needs to be on standby.
I know there are some external solutions, but they seem to require my Azure (Microsoft) account permissions, which seems to violate our policy.

@StrikerRUS (Collaborator Author)

@guolinke

I know there are some external solutions ...

Yeah, and they are also a ton of pain! For instance:
dmlc/xgboost#4958
https://github.com/hcho3/xgboost-devops

... but they seem to require my Azure (Microsoft) account permissions, which seems to violate our policy.

Yep, that's true! 😞

https://github.com/hcho3/xgboost-devops/blob/df6c582ac65d237632ec81c0a739ecdb7e9d77e0/.github/workflows/deploy-lambda.yml#L27-L28

@StrikerRUS (Collaborator Author)

Unfortunately, switching to a P100 didn't help to get rid of the segfault. When I run LightGBM\examples\python-guide\simple_example.py, I get the following, more informative, logs:

Starting training...
Warning: ] [Warning] CUDA currently requires double precision calculations.
Warning: ] [Warning] Using sparse features with CUDA is currently not supported.
Warning: ] [Warning] CUDA currently requires double precision calculations.
Traceback (most recent call last):
  File "/LightGBM/examples/python-guide/simple_example.py", line 35, in <module>
    gbm = lgb.train(params,
  File "/root/.local/lib/python3.8/site-packages/lightgbm/engine.py", line 231, in train
    booster = Booster(params=params, train_set=train_set)
  File "/root/.local/lib/python3.8/site-packages/lightgbm/basic.py", line 1988, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/root/.local/lib/python3.8/site-packages/lightgbm/basic.py", line 55, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414

However, I don't think that it should block this PR from merging.

@guolinke (Collaborator) commented Oct 2, 2020

@StrikerRUS so is it a hardware problem, or something else?

@StrikerRUS (Collaborator Author)

@guolinke TBH, I have no idea...
Maybe we can ask @ChipKerchner, or first fix the following CMake warning (maybe it is important):

CMake Warning (dev) in CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "histo_16_64_256-allfeats_sp_const".
This warning is for project developers.  Use -Wno-dev to suppress it.

@ChipKerchner (Contributor)

CUDA_ARCHITECTURES

Sorry, this is a new feature in CMake 3.18 and I'm not familiar with it.

@StrikerRUS (Collaborator Author)

@ChipKerchner Thanks for your super fast response!

I'll try to re-run with older CMake.

Will it be possible to adapt the CMakeLists.txt code to the recent CMake changes in the future?

@StrikerRUS (Collaborator Author)

Just tested CMake 3.16.7: https://github.com/microsoft/LightGBM/runs/1199141320?check_suite_focus=true. Unfortunately, the same output.

@StrikerRUS (Collaborator Author)

Will it be possible to adapt the CMakeLists.txt code to the recent CMake changes in the future?

Looks like we could do the same:
https://github.com/dmlc/xgboost/blob/e0e4f15d0e314afcf44d690e0295fa6320fc7f64/cmake/Utils.cmake#L104-L111
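A rough, untested sketch of that kind of guard, just to make the idea concrete (the architecture number is only an example matching the CI machine's P100, not necessarily what xgboost pins):

# sketch only: silence CMP0104 by choosing CUDA architectures explicitly
# when the user has not provided them; set this before the CUDA language is enabled
if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.18")
  cmake_policy(SET CMP0104 NEW)
  if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    set(CMAKE_CUDA_ARCHITECTURES 60)  # compute capability 6.0 (P100)
  endif()
endif()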

@sh1ng (Contributor) commented Oct 5, 2020

I'm not sure that it's related to CUDA_ARCHITECTURES. It's already compiled for multiple architectures https://github.com/microsoft/LightGBM/blob/master/CMakeLists.txt#L159.

https://github.com/microsoft/LightGBM/blob/master/src/treelearner/cuda_tree_learner.cpp#L414
The host memory has to be page-locked. Is it allocated properly?
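To make that concrete, here is a standalone sketch (not LightGBM's actual code) of the pattern being described: a page-locked host buffer plus a checked cudaMemcpyAsync, so a bad pointer or size surfaces as "invalid argument" with file and line, much like the message coming from cuda_tree_learner.cpp:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// report any CUDA error together with file and line, then abort
#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
      std::fprintf(stderr, "[CUDA] %s %s %d\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
      std::exit(EXIT_FAILURE); \
    } \
  } while (0)

int main() {
  const size_t n = 1 << 20;
  float *h_pinned = nullptr, *d_data = nullptr;
  cudaStream_t stream;
  // page-locked host allocation: lets cudaMemcpyAsync actually run asynchronously;
  // a plain malloc'ed buffer silently falls back to a synchronous copy
  CUDA_CHECK(cudaMallocHost(&h_pinned, n * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
  CUDA_CHECK(cudaStreamCreate(&stream));
  for (size_t i = 0; i < n; ++i) h_pinned[i] = static_cast<float>(i);
  // the byte count must match both allocations; an oversized count or a stale
  // pointer is a typical way to get cudaErrorInvalidValue ("invalid argument") here
  CUDA_CHECK(cudaMemcpyAsync(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream));
  CUDA_CHECK(cudaStreamSynchronize(stream));
  CUDA_CHECK(cudaStreamDestroy(stream));
  CUDA_CHECK(cudaFree(d_data));
  CUDA_CHECK(cudaFreeHost(h_pinned));
  return 0;
}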

@sh1ng (Contributor) commented Oct 6, 2020

========= CUDA-MEMCHECK
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Info] LightGBM using CUDA trainer with DP float!!
[LightGBM] [Info] Total Bins 22008
[LightGBM] [Info] Number of data points in the train set: 1348045, number of used features: 150
[LightGBM] [Debug] device_bin_size_ = 256
[LightGBM] [Debug] Resized feature masks
[LightGBM] [Debug] Memset pinned_feature_masks_
[LightGBM] [Debug] Allocated device_features_ addr=0x7f3230000000 sz=202206750
[LightGBM] [Debug] Memset device_data_indices_
[LightGBM] [Debug] created device_subhistograms_: 0x7f323e600000
[LightGBM] [Debug] Started copying dense features from CPU to GPU
[LightGBM] [Debug] Started copying dense features from CPU to GPU - 1
[LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaMemcpyAsync.

Seems like an incorrect pointer or size.

@guolinke (Collaborator)

@StrikerRUS any updates of this PR?

@StrikerRUS (Collaborator Author)

@guolinke

any updates of this PR?

Which updates do you mean? I think this PR is ready.

@guolinke (Collaborator)

Great! I thought the CUDA job could not run.

@StrikerRUS (Collaborator Author)

@guolinke

I thought the CUDA job could not run.

It can, but unfortunately it fails with the following runtime error: #3424 (comment).
That error has also been reproduced by independent CUDA early adopters.

@guolinke (Collaborator)

Any insights regarding these errors? @ChipKerchner

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked the pull request as resolved and limited conversation to collaborators on Aug 24, 2023