Documentation for GPU installation on Windows #409
Conversation
@Laurae2 I was able to follow the Windows GPU installation steps on Windows Server 2016 (from Azure) and successfully compiled LightGBM, and also installed the Python interface. All tests pass as well, despite some small glitches. The GPU does not achieve the same speedup as on Ubuntu (with the same GPU on Azure); I am not sure whether it is a GPU driver/compiler or GPU configuration problem, but the good news is that it works! Some small suggestions for the building procedure:
I have fixed the Boost.Compute bug on Windows in boostorg/compute#704. The problem is not the OpenMP parallel region, but an undefined macro. You can try to apply the patch and see if it fixes the Windows builds for you. If yes, we don't need the hack.
Do you know what the small glitches are? I also noticed it's slower on Windows, but I am not sure exactly why.
Added.
With your proposed change, CLI / Python:
It does not apply to the R package, which works without any hack (because R installs packages the way Unix variants do).
The small glitches are fixed in #411 and #412. Make sure you are testing the latest master. I think the crash is related to the GCC/MinGW version; there can be multiple bugs there. I am using the latest MinGW with GCC 6.3.0 on Windows Server 2016 (should be roughly equivalent to Windows 10). After fixing boostorg/compute#704 it will not crash. Can you compile with debug information and get a backtrace showing where the bug is?
@huanzhang12 I am using commit 3a8b5e5 (6h ago) + MinGW 5.3 for CLI/Python, and MinGW 4.9 for R, with your boostorg/compute#704 fix. It is possible it is related to the MinGW version; currently I have this error with 5.3. Full log if you need it; I don't know if it is useful for you:
Line 583 of gpu_tree_learner.cpp:
Based on the backtrace, it seems you are using Boost.Compute 1.63 located at c:\boost\boost-build, not the one in the compute submodule, so the fix for Boost.Compute was probably not applied. This issue should be fixed in 6be0874: we want the compute submodule to be the first one in the compiler search path.
Can you clean everything (especially cmake-generated files) and try again?
On Thu, Apr 13, 2017 at 10:19 AM, Laurae wrote:
> @huanzhang12 I am using commit 3a8b5e5 (6h ago) + MinGW 5.3 for CLI/Python, and MinGW 4.9 for R, with your boostorg/compute#704 fix. It is possible it is related to the MinGW version; currently I have this error with 5.3:
> [image](https://cloud.githubusercontent.com/assets/9083669/25015529/7326860a-207c-11e7-8fc3-320b2be619a6.png)
> Full log if you need, I don't know if it is useful for you:
C:\xgboost\LightGBM\examples\binary_classification>gdb --args "../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
GNU gdb (GDB) 7.10.1
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../../lightgbm.exe...done.
(gdb) run
Starting program: C:\xgboost\LightGBM\lightgbm.exe "config=train.conf" "data=binary.train" "valid=binary.test" "objective=binary" "device=gpu"
[New Thread 105220.0x199b8]
[New Thread 105220.0x783c]
[Thread 105220.0x783c exited with code 0]
[LightGBM] [Info] Finished loading parameters
[New Thread 105220.0x19490]
[New Thread 105220.0x1a71c]
[New Thread 105220.0x19a24]
[New Thread 105220.0x4fb0]
[Thread 105220.0x4fb0 exited with code 0]
[LightGBM] [Info] Loading weights...
[New Thread 105220.0x19988]
[Thread 105220.0x19988 exited with code 0]
[New Thread 105220.0x1a8fc]
[Thread 105220.0x1a8fc exited with code 0]
[LightGBM] [Info] Loading weights...
[New Thread 105220.0x1a90c]
[Thread 105220.0x1a90c exited with code 0]
[LightGBM] [Info] Finished loading data in 1.011408 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6143
[LightGBM] [Info] Number of data: 7000, number of used features: 28
[New Thread 105220.0x1a62c]
[LightGBM] [Info] Using GPU Device: Oland, Vendor: Advanced Micro Devices, Inc.
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
Program received signal SIGSEGV, Segmentation fault.
0x00007ffbb37c11f1 in strlen () from C:\Windows\system32\msvcrt.dll
(gdb) backtrace
#0 0x00007ffbb37c11f1 in strlen () from C:\Windows\system32\msvcrt.dll
#1 0x000000000048bbe5 in std::char_traits<char>::length (__s=0x0)
at C:/PROGRA~1/MINGW-~1/X86_64~1.0-P/mingw64/x86_64-w64-mingw32/include/c++/bits/char_traits.h:267
#2 std::operator+<char, std::char_traits<char>, std::allocator<char> > (__rhs="\\", __lhs=0x0)
at C:/PROGRA~1/MINGW-~1/X86_64~1.0-P/mingw64/x86_64-w64-mingw32/include/c++/bits/basic_string.tcc:1157
#3 boost::compute::detail::appdata_path[abi:cxx11]() () at C:/boost/boost-build/include/boost/compute/detail/path.hpp:38
#4 0x000000000048eec3 in boost::compute::detail::program_binary_path (hash="d27987d5bd61e2d28cd32b8d7a7916126354dc81", ***@***.***=false)
at C:/boost/boost-build/include/boost/compute/detail/path.hpp:46
#5 0x00000000004913de in boost::compute::program::load_program_binary (hash="d27987d5bd61e2d28cd32b8d7a7916126354dc81", ctx=...)
at C:/boost/boost-build/include/boost/compute/program.hpp:605
#6 0x0000000000490ece in boost::compute::program::build_with_source (
source="\n#ifndef _HISTOGRAM_256_KERNEL_\n#define _HISTOGRAM_256_KERNEL_\n\n#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable\n#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable\n\n//"..., context=...,
options=" -D POWER_FEATURE_WORKGROUPS=5 -D USE_CONSTANT_BUF=0 -D USE_DP_FLOAT=0 -D CONST_HESSIAN=0 -cl-strict-aliasing -cl-mad-enable -cl-no-signed-zeros -cl-fast-relaxed-math") at C:/boost/boost-build/include/boost/compute/program.hpp:549
#7 0x0000000000454339 in LightGBM::GPUTreeLearner::BuildGPUKernels () at C:\xgboost\LightGBM\src\treelearner\gpu_tree_learner.cpp:583
#8 0x00000000636044f2 in libgomp-1!GOMP_parallel () from C:\Program Files\mingw-w64\x86_64-5.3.0-posix-seh-rt_v4-rev0\mingw64\bin\libgomp-1.dll
#9 0x0000000000455e7e in LightGBM::GPUTreeLearner::BuildGPUKernels ***@***.***=0x3b9cac0)
at C:\xgboost\LightGBM\src\treelearner\gpu_tree_learner.cpp:569
#10 0x0000000000457b49 in LightGBM::GPUTreeLearner::InitGPU (this=0x3b9cac0, platform_id=<optimized out>, device_id=<optimized out>)
at C:\xgboost\LightGBM\src\treelearner\gpu_tree_learner.cpp:720
#11 0x0000000000410395 in LightGBM::GBDT::ResetTrainingData (this=0x1f26c90, config=<optimized out>, train_data=0x1f28180, objective_function=0x1f280e0,
training_metrics=std::vector of length 2, capacity 2 = {...}) at C:\xgboost\LightGBM\src\boosting\gbdt.cpp:98
#12 0x0000000000402e93 in LightGBM::Application::InitTrain ***@***.***=0x23f9d0) at C:\xgboost\LightGBM\src\application\application.cpp:213
---Type <return> to continue, or q <return> to quit---
#13 0x00000000004f0b55 in LightGBM::Application::Run (this=0x23f9d0) at C:/xgboost/LightGBM/include/LightGBM/application.h:84
@huanzhang12 I tested it on another Windows computer and it does not crash anymore. On the computer where LightGBM was crashing, I wiped and reinstalled my main LightGBM directory and it also works now (no more crash). It always crashed before, even though I was deleting the build directory and lib_lightgbm.dll / lib_lightgbm.dll.a / lightgbm.exe (is cmake caching stuff elsewhere?). So we no longer need the small hack, as long as we use the new compute submodule. Can you update the submodule in LightGBM to boostorg/compute@6de7f64? I'll modify my tutorial afterwards to reflect the changes. I will also add a tutorial for debugging using CLI/gdb in case we get users with crashes; this will help developers trace problems faster.
@Laurae2 Sure, I will update the submodule. I didn't update it earlier because I wanted to know your testing results. Now it is safe to update.
@Laurae2 Yes, it will be very useful to add a debugging tutorial to collect crash information from users. Thanks!
@huanzhang12 done. If everything is OK, we can ask @guolinke whether it's OK to merge. Users who have set up R "perfectly" can install the R package with GPU support at any time. As with any R package, copying & pasting the library folder from R must work (keeping the GPU identical), even if you get rid of the installation, as long as you are not using architecture-dependent compilation variables.
@Laurae2 I checked the steps you wrote and they look good! I don't know about R, but I can confirm that CLI and Python work by following the steps. For GDB debugging, do we need to set CMAKE_BUILD_TYPE=debug? I am not sure about Windows, but on Linux this flag is required to get a full backtrace with exact line numbers (otherwise only function names are shown).
Correct, I forgot this step; I added it!
For R, it should be OK, as I had friends reproduce the installation steps without problems or supervision. However, I don't have the "self-contained package" setup for it. For Windows it is not useful, as it would be reliant on your computer (and you only need to copy & paste the library folder from R to export/install it).
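For reference, a sketch of the debug-build commands implied by the exchange above (the generator name, paths, and dataset parameters are taken from elsewhere in this thread; adjust them per machine):

```shell
# Rebuild with debug symbols so gdb can show file:line in backtraces,
# not just function names.
cd LightGBM/build
cmake .. -G "MinGW Makefiles" -DUSE_GPU=1 -DCMAKE_BUILD_TYPE=Debug
mingw32-make clean
mingw32-make -j4

# Then reproduce the crash under gdb; after the segfault, type `backtrace`.
gdb --args ../lightgbm.exe config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
```

This is a sketch of the workflow, not an authoritative build recipe; the tutorial linked in this thread remains the reference.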
@Laurae2 @huanzhang12 Thanks for working on the Windows tutorial, I will try it with a GTX 1080 next week. I guess you used the latest cmake (3.8 x64), right? When I was compiling xgboost a couple of months ago with MinGW, it was critical to use the right version of MinGW (with OpenMP, posix, ...). Can you please share which version of MinGW you used, and from which source? (Until now I have used Visual Studio on Windows.)
@gugatr0n1c tested under the following conditions:
During MinGW installation, you only need to change the architecture to x86_64; all the other defaults are fine. Source: in the GPU tutorial file in the first post: https://github.com/Laurae2/LightGBM/blob/patch-2/docs/GPU-Windows.md Simple MinGW installation: http://iweb.dl.sourceforge.net/project/mingw-w64/Toolchains%20targetting%20Win32/Personal%20Builds/mingw-builds/installer/mingw-w64-install.exe
@Laurae2 Great, thanks. So you are using MinGW-w64 (maybe this can go in the docs as well? there are several MinGW versions/forks: MinGW, MinGW-w64, cygwin, neune, the TDM fork, ...).
The latest MinGW version is 6.3; check here for new versions: https://sourceforge.net/projects/mingw-w64/files/?source=navbar The "homepage(s)" you are describing are pages that have not been updated manually for many years; they have mostly moved to SourceForge if you want the new versions (5.x already works fine for 99% of cases anyway; even 4.9 is enough for most uses).
Avoid at any cost using MinGW forks such as TDM, because they patch things differently. The same rules apply to xgboost compilation. If you have the default MinGW, it works flawlessly out of the box no matter what your Windows installation is, unless your CPU/Windows flavor is not supported. Using forked MinGWs is asking for trouble unless, as said earlier, you have a very specific reason to use them.
It is already in the doc.
It is already in the doc.
Working installation steps; applies to Windows installation for:
Indirectly applies to any Unix-based installation for R (similar steps apply for installation, path, `Makeconf`, and `Makevars` modification). Step-by-step installation with pictures.
In addition, it allows R installation, but requires small changes to R's `Makeconf`. We are using the following variables, as they will not be found easily otherwise; these variables are identical to what cmake uses.
Users are free to change the paths, as they depend on each installation. The example uses the AMD OpenCL SDK.
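A sketch of passing these variables on the cmake command line (the variable names are the ones listed at the end of this issue; the AMD APP SDK paths below are illustrative assumptions and depend on each installation):

```shell
# Hypothetical invocation: point cmake at Boost and the OpenCL SDK explicitly,
# since they will not be found easily without these hints.
cmake .. -G "MinGW Makefiles" -DUSE_GPU=1 \
  -DBOOST_INCLUDE_DIR="C:/boost/boost-build/include" \
  -DBOOST_LIBRARY="C:/boost/boost-build/lib" \
  -DOpenCL_INCLUDE_DIR="C:/Program Files (x86)/AMD APP SDK/3.0/include" \
  -DOpenCL_LIBRARY="C:/Program Files (x86)/AMD APP SDK/3.0/lib/x86_64/libOpenCL.a"
```

The same paths would go into R's `Makeconf` for the R package, adjusted to that file's flag syntax.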
Check out the tutorial as a direct link on this GitHub blob: https://github.com/Laurae2/LightGBM/blob/patch-2/docs/GPU-Windows.md
Edit: see now https://github.com/Microsoft/LightGBM/blob/master/docs/GPU-Windows.md
Tested working on:
Untested using OpenCL on CPU, but it is already confirmed working on Unix-based OSes: #389 (comment)
Currently, the GPU version is known to work on:
* requires a small change
** based on @wxchan's work (#389 (comment)) and CLI -> Python installs
*** untested/unknown status
ping @gugatr0n1c for method
ping @huanzhang12 for correctness.
Please test and review the steps before merging. We might not have the same Windows environments, even though I tested this on 10 different machines today with the GPU version.
In addition, about the variables used, review whether we can get a better way of handling this (`BOOST_INCLUDE_DIR`, `BOOST_LIBRARY`, `OpenCL_INCLUDE_DIR`, and `OpenCL_LIBRARY`).
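One possible direction for that last point, sketched here as a config fragment (this is not the project's actual CMakeLists; the variable names come from CMake's stock `FindOpenCL` and `FindBoost` modules):

```cmake
# Sketch: rely on CMake's bundled find modules so users set at most a hint
# (e.g. BOOST_ROOT) instead of four explicit cache variables.
find_package(OpenCL REQUIRED)   # yields OpenCL_INCLUDE_DIRS / OpenCL_LIBRARIES
find_package(Boost 1.56 REQUIRED COMPONENTS filesystem system)

include_directories(${OpenCL_INCLUDE_DIRS} ${Boost_INCLUDE_DIRS})
target_link_libraries(lightgbm ${OpenCL_LIBRARIES} ${Boost_LIBRARIES})
```

Whether this finds the SDKs reliably on Windows with MinGW would need testing; the explicit variables remain the safe fallback.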