Add Some More GPU documentation #401

Merged (100 commits) on Apr 12, 2017

Commits
4810c79
add dummy gpu solver code
huanzhang12 Feb 10, 2017
e41ba15
initial GPU code
huanzhang12 Feb 12, 2017
6dde565
fix crash bug
huanzhang12 Feb 12, 2017
2dce7d1
first working version
huanzhang12 Feb 12, 2017
146b2dd
use asynchronous copy
huanzhang12 Feb 12, 2017
1f39a03
use a better kernel for root
huanzhang12 Feb 13, 2017
435674d
parallel read histogram
huanzhang12 Feb 13, 2017
22f478a
sparse features now works, but no acceleration, compute on CPU
huanzhang12 Feb 13, 2017
cfd77ae
compute sparse feature on CPU simultaneously
huanzhang12 Feb 13, 2017
40c3212
fix big bug; add gpu selection; add kernel selection
huanzhang12 Feb 14, 2017
c3398c9
better debugging
huanzhang12 Feb 14, 2017
76a13c7
clean up
huanzhang12 Feb 15, 2017
2dc4555
add feature scatter
huanzhang12 Feb 15, 2017
d4c1c01
Add sparse_threshold control
huanzhang12 Feb 15, 2017
97da274
fix a bug in feature scatter
huanzhang12 Feb 15, 2017
a96ca80
clean up debug
huanzhang12 Feb 15, 2017
9be6438
temporarily add OpenCL kernels for k=64,256
huanzhang12 Feb 27, 2017
cbef453
fix up CMakeList and definition USE_GPU
huanzhang12 Feb 27, 2017
4d08152
add OpenCL kernels as string literals
huanzhang12 Feb 28, 2017
624d405
Add boost.compute as a submodule
huanzhang12 Feb 28, 2017
11b241f
add boost dependency into CMakeList
huanzhang12 Feb 28, 2017
5142f19
fix opencl pragma
huanzhang12 Feb 28, 2017
508b48c
use pinned memory for histogram
huanzhang12 Feb 28, 2017
1a63b99
use pinned buffer for gradients and hessians
huanzhang12 Mar 1, 2017
e2166b1
better debugging message
huanzhang12 Mar 1, 2017
3b24e33
add double precision support on GPU
huanzhang12 Mar 9, 2017
e7336ee
fix boost version in CMakeList
huanzhang12 Mar 9, 2017
b29fec7
Add a README
huanzhang12 Mar 9, 2017
97fed3e
reconstruct GPU initialization code for ResetTrainingData
huanzhang12 Mar 12, 2017
164dbd1
move data to GPU in parallel
huanzhang12 Mar 12, 2017
c1c605e
fix a bug during feature copy
huanzhang12 Mar 13, 2017
c5ab1ae
update gpu kernels
huanzhang12 Mar 13, 2017
947629a
update gpu code
huanzhang12 Mar 15, 2017
105b0dd
initial port to LightGBM v2
huanzhang12 Mar 19, 2017
ba2c0a3
speedup GPU data loading process
huanzhang12 Mar 21, 2017
a6cb794
Add 4-bit bin support to GPU
huanzhang12 Mar 22, 2017
ed929cb
re-add sparse_threshold parameter
huanzhang12 Mar 23, 2017
2cd3d85
remove kMaxNumWorkgroups and allows an unlimited number of features
huanzhang12 Mar 23, 2017
4d2758f
add feature mask support for skipping unused features
huanzhang12 Mar 24, 2017
62bc04e
enable kernel cache
huanzhang12 Mar 24, 2017
e4dd344
use GPU kernels withoug feature masks when all features are used
huanzhang12 Mar 24, 2017
61b09a3
REAdme.
Mar 25, 2017
da20fc0
REAdme.
Mar 25, 2017
2d43e36
update README
huanzhang12 Mar 25, 2017
9602cd7
update to v2
huanzhang12 Mar 25, 2017
cd52bb0
fix typos (#349)
wxchan Mar 17, 2017
be91a98
change compile to gcc on Apple as default
chivee Mar 18, 2017
8f1d05e
clean vscode related file
chivee Mar 19, 2017
411383f
refine api of constructing from sampling data.
guolinke Mar 21, 2017
487660e
fix bug in the last commit.
guolinke Mar 21, 2017
882f420
more efficient algorithm to sample k from n.
guolinke Mar 22, 2017
7d0f338
fix bug in filter bin
guolinke Mar 22, 2017
0b44817
change to boost from average output.
guolinke Mar 22, 2017
85a3ba4
fix tests.
guolinke Mar 22, 2017
f615ba0
only stop training when all classes are finshed in multi-class.
guolinke Mar 23, 2017
fbed3ca
limit the max tree output. change hessian in multi-class objective.
guolinke Mar 24, 2017
8eb961b
robust tree model loading.
guolinke Mar 24, 2017
10cd85f
fix test.
guolinke Mar 24, 2017
e57ec49
convert the probabilities to raw score in boost_from_average of class…
guolinke Mar 24, 2017
39965a0
fix the average label for binary classification.
guolinke Mar 24, 2017
8ac77dc
Add boost_from_average to docs (#354)
Laurae2 Mar 24, 2017
25f6268
don't use "ConvertToRawScore" for self-defined objective function.
guolinke Mar 24, 2017
bf3dfb6
boost_from_average seems doesn't work well in binary classification. …
guolinke Mar 24, 2017
22df883
For a better jump link (#355)
JayveeHe Mar 25, 2017
9f4d2f0
add FitByExistingTree.
guolinke Mar 25, 2017
f54ac4d
adapt GPU tree learner for FitByExistingTree
huanzhang12 Mar 26, 2017
59c473b
avoid NaN output.
guolinke Mar 26, 2017
a0549d1
update boost.compute
huanzhang12 Mar 26, 2017
5e945d2
fix typos (#361)
zhangyafeikimi Mar 26, 2017
3891cdb
fix broken links (#359)
wxchan Mar 26, 2017
48b4d9d
update README
huanzhang12 Mar 27, 2017
7248e58
disable GPU acceleration by default
huanzhang12 Mar 27, 2017
56fe2cc
fix image url
huanzhang12 Mar 27, 2017
1c51775
cleanup debug macro
huanzhang12 Mar 27, 2017
78ae386
Initial GPU acceleration
huanzhang12 Mar 27, 2017
2690181
Merge remote-tracking branch 'gpudev/master'
huanzhang12 Mar 27, 2017
f3573d5
remove old README
huanzhang12 Mar 27, 2017
12e5b82
do not save sparse_threshold_ in FeatureGroup
huanzhang12 Mar 27, 2017
1159854
add details for new GPU settings
huanzhang12 Mar 27, 2017
c719ead
ignore submodule when doing pep8 check
huanzhang12 Mar 27, 2017
15c97b4
allocate workspace for at least one thread during builing Feature4
huanzhang12 Mar 27, 2017
cb35a02
move sparse_threshold to class Dataset
huanzhang12 Mar 28, 2017
a039a3a
remove duplicated code in GPUTreeLearner::Split
huanzhang12 Mar 29, 2017
35ab97f
Remove duplicated code in FindBestThresholds and BeforeFindBestSplit
huanzhang12 Mar 29, 2017
28c1715
do not rebuild ordered gradients and hessians for sparse features
huanzhang12 Mar 29, 2017
2af1860
support feature groups in GPUTreeLearner
huanzhang12 Apr 4, 2017
475cf8c
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 5, 2017
4d5d957
Initial parallel learners with GPU support
huanzhang12 Apr 5, 2017
4b44173
add option device, cleanup code
huanzhang12 Apr 5, 2017
b948c1f
clean up FindBestThresholds; add some omp parallel
huanzhang12 Apr 6, 2017
50f7da1
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 7, 2017
3a16753
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 7, 2017
2b0514e
constant hessian optimization for GPU
huanzhang12 Apr 8, 2017
e72d8cd
Fix GPUTreeLearner crash when there is zero feature
huanzhang12 Apr 9, 2017
a68ae52
use np.testing.assert_almost_equal() to compare lists of floats in tests
huanzhang12 Apr 9, 2017
2ac5103
travis for GPU
huanzhang12 Apr 9, 2017
edb30a6
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 9, 2017
0c5eb15
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 9, 2017
b121443
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 11, 2017
74bc952
add tutorial and more GPU docs
huanzhang12 Apr 12, 2017
5 changes: 3 additions & 2 deletions README.md
@@ -7,7 +7,7 @@ LightGBM is a gradient boosting framework that uses tree based learning algorith
- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Parallel learning supported
- Parallel and GPU learning supported
- Capable of handling large-scale data

For more details, please refer to [Features](https://github.com/Microsoft/LightGBM/wiki/Features).
@@ -17,7 +17,7 @@ For more details, please refer to [Features](https://github.com/Microsoft/LightG
News
----

04/10/2017 : Support use GPU to accelerate the tree learning.
04/10/2017 : LightGBM now supports GPU-accelerated tree learning. Please read our [GPU Tutorial](./docs/GPU-Tutorial.md) and [Performance Comparison](./docs/GPU-Performance.md).

02/20/2017 : Update to LightGBM v2.

@@ -45,6 +45,7 @@ To get started, please follow the [Installation Guide](https://github.com/Micros
* [**Examples**](https://github.com/Microsoft/LightGBM/tree/master/examples)
* [**Features**](https://github.com/Microsoft/LightGBM/wiki/Features)
* [**Parallel Learning Guide**](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide)
* [**GPU Learning Tutorial**](./docs/GPU-Tutorial.md)
* [**Configuration**](https://github.com/Microsoft/LightGBM/wiki/Configuration)
* [**Document Indexer**](https://github.com/Microsoft/LightGBM/blob/master/docs/Readme.md)

177 changes: 177 additions & 0 deletions docs/GPU-Performance.md
@@ -0,0 +1,177 @@
GPU Tuning Guide and Performance Comparison
============================================

How It Works
--------------------------

In LightGBM, the main computational cost during training is building the feature
histograms. We use an efficient algorithm on the GPU to accelerate this process.
The implementation is highly modular and works for all learning tasks
(classification, ranking, regression, etc.). GPU acceleration also works in
distributed learning settings. The GPU algorithm implementation is based on OpenCL
and can work with a wide range of GPUs.
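
As a simplified illustration, the following Python sketch shows what building a
histogram for a single feature means conceptually; it is not the actual OpenCL
kernel, which performs this accumulation for many features and data points in
parallel using atomic adds in local memory:

```
import numpy as np

def build_histogram(bin_indices, gradients, hessians, num_bins):
    """Accumulate the gradient and hessian of every data point into its feature bin."""
    grad_hist = np.zeros(num_bins)
    hess_hist = np.zeros(num_bins)
    for i, b in enumerate(bin_indices):  # one pass over the data
        grad_hist[b] += gradients[i]
        hess_hist[b] += hessians[i]
    return grad_hist, hess_hist

# Toy example: 8 data points binned into 4 bins for one feature
bins = np.array([0, 1, 1, 3, 2, 0, 3, 1])
g = np.random.randn(8)   # gradients
h = np.ones(8)           # hessians
print(build_histogram(bins, g, h, num_bins=4))
```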

Supported Hardware
--------------------------

We target the AMD Graphics Core Next (GCN) architecture and the NVIDIA
Maxwell and Pascal architectures. Most AMD GPUs released after 2012 and NVIDIA
GPUs released after 2014 should be supported. We have tested the GPU
implementation on the following GPUs:

- AMD RX 480 with AMDGPU-pro driver 16.60 on Ubuntu 16.10
- AMD R9 280X (aka Radeon HD 7970) with fglrx driver 15.302.2301 on Ubuntu 16.10
- NVIDIA GTX 1080 with driver 375.39 and CUDA 8.0 on Ubuntu 16.10
- NVIDIA Titan X (Pascal) with driver 367.48 and CUDA 8.0 on Ubuntu 16.04
- NVIDIA Tesla M40 with driver 375.39 and CUDA 7.5 on Ubuntu 16.04

Using the following hardware is discouraged:

- NVIDIA Kepler (K80, K40, K20, most GeForce GTX 700 series GPUs) or earlier
NVIDIA GPUs. They don't support hardware atomic operations in local memory space
and thus histogram construction will be slow.

- AMD VLIW4-based GPUs, including Radeon HD 6xxx series and earlier GPUs. These
GPUs have been discontinued for years and are rarely seen nowadays.


How to Achieve Good Speedup on GPU
----------------------------------

1. First, run a few of the datasets that we have verified to achieve a good speedup
(such as Higgs, epsilon, or Bosch) to ensure your
setup is correct. If you have multiple GPUs, make sure to set
`gpu_platform_id` and `gpu_device_id` to use the desired GPU.
Also make sure your system is idle (especially when using a
shared computer) to get accurate performance measurements.

2. The GPU works best on large-scale, dense datasets. If the dataset is too small,
training on the GPU is inefficient because the data transfer overhead can be
significant. For datasets with a mixture of sparse and dense features, you
can tune the `sparse_threshold` parameter to make sure there are enough
dense features to process on the GPU. If you have categorical features, use
the `categorical_column` option and feed them into LightGBM directly; do
not convert them into one-hot variables. Make sure to check the run log and
look at the reported number of sparse and dense features.


3. To get a good speedup on the GPU, it is suggested to use a smaller number of
bins. Setting `max_bin=63` is recommended, as it usually does not
noticeably affect training accuracy on large datasets, while GPU training can
be significantly faster than with the default bin size of 255. For some
datasets, even 15 bins are enough (`max_bin=15`); using 15 bins will
maximize GPU performance. Make sure to check the run log and verify that the
desired number of bins is used.

4. Try to use single-precision training (`gpu_use_dp=false`) when possible,
because most GPUs (especially NVIDIA consumer GPUs) have poor
double-precision performance. A combined example of these settings is shown
after this list.
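
Putting these tips together, the GPU-related part of a configuration file might
look like the following sketch (the values are just the suggestions from this
list, not universal defaults; adjust them for your dataset):

```
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
gpu_use_dp = false
max_bin = 63
sparse_threshold = 1.0
```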

Performance Comparison
--------------------------

We evaluate the training performance of GPU acceleration on the following datasets:

| Data | Task | Link | #Examples | #Features | Comments |
|----------|---------------|-------|-------|---------|---------|
| Higgs | Binary classification | [link](https://archive.ics.uci.edu/ml/datasets/HIGGS) |10,500,000|28| use last 500,000 samples as test set |
| Epsilon | Binary classification | [link](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) | 400,000 | 2,000 | use the provided test set |
| Bosch | Binary classification | [link](https://www.kaggle.com/c/bosch-production-line-performance/data) | 1,000,000 | 968 | use the provided test set |
| Yahoo LTR| Learning to rank | [link](https://webscope.sandbox.yahoo.com/catalog.php?datatype=c) |473,134|700| set1.train as train, set1.test as test |
| MS LTR | Learning to rank | [link](http://research.microsoft.com/en-us/projects/mslr/) |2,270,296|137| {S1,S2,S3} as train set, {S5} as test set |
| Expo | Binary classification (Categorical) | [link](http://stat-computing.org/dataexpo/2009/) |11,000,000|700| use last 1,000,000 as test set |

We used the following hardware to evaluate the performance of LightGBM GPU training.
Our CPU reference is **a high-end dual-socket Haswell-EP Xeon server with 28 cores**;
the GPUs are a budget GPU (RX 480) and a mainstream GPU (GTX 1080) installed on
the same server. It is worth mentioning that **the GPUs used are not the best GPUs on
the market**; if you are using a better GPU (like an AMD RX 580, NVIDIA GTX 1080 Ti,
Titan X Pascal, Titan Xp, Tesla P100, etc.), you are likely to get a better speedup.

| Hardware | Peak FLOPS | Peak Memory BW | Cost (MSRP) |
|------------------------------|--------------|----------------|-------------|
| AMD Radeon RX 480 | 5,161 GFLOPS | 256 GB/s | $199 |
| NVIDIA GTX 1080 | 8,228 GFLOPS | 320 GB/s | $499 |
| 2x Xeon E5-2683v3 (28 cores) | 1,792 GFLOPS | 133 GB/s | $3,692 |

During CPU benchmarking we used only the 28 physical cores of the CPU and did
not use hyper-threading, because we found that using too many threads
actually makes performance worse. The following shows the training configuration we used:

```
max_bin = 63
num_leaves = 255
num_iterations = 500
learning_rate = 0.1
tree_learner = serial
task = train
is_train_metric = false
min_data_in_leaf = 1
min_sum_hessian_in_leaf = 100
ndcg_eval_at = 1,3,5,10
sparse_threshold=1.0
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
num_thread = 28
```

We use the configuration shown above, except for the
Bosch dataset, where we use a smaller `learning_rate=0.015` and set
`min_sum_hessian_in_leaf=5`. For all GPU training we set
`sparse_threshold=1` and vary the maximum number of bins (255, 63 and 15). The
GPU implementation is from commit
[0bb4a82](https://github.com/Microsoft/LightGBM/commit/0bb4a82)
of LightGBM, when GPU support had just been merged in.

The following table lists the test-set accuracy that the CPU and GPU learners
achieve after 500 iterations. With the same number of bins, the GPU achieves
a similar level of accuracy to the CPU, despite using single-precision
arithmetic. For most datasets, using 63 bins is sufficient.

| | CPU 255 bins | CPU 63 bins | CPU 15 bins | GPU 255 bins | GPU 63 bins | GPU 15 bins |
|-------------------|--------------|-------------|-------------|--------------|-------------|-------------|
| Higgs AUC | 0.845612 | 0.845239 | 0.841066 | 0.845612 | 0.845209 | 0.840748 |
| Epsilon AUC | 0.950243 | 0.949952 | 0.948365 | 0.950057 | 0.949876 | 0.948365 |
| Yahoo-LTR NDCG@1 | 0.730824 | 0.730165 | 0.729647 | 0.730936 | 0.732257 | 0.73114 |
| Yahoo-LTR NDCG@3 | 0.738687 | 0.737243 | 0.736445 | 0.73698 | 0.739474 | 0.735868 |
| Yahoo-LTR NDCG@5 | 0.756609 | 0.755729 | 0.754607 | 0.756206 | 0.757007 | 0.754203 |
| Yahoo-LTR NDCG@10 | 0.79655 | 0.795827 | 0.795273 | 0.795894 | 0.797302 | 0.795584 |
| Expo AUC | 0.776217 | 0.771566 | 0.743329 | 0.776285 | 0.77098 | 0.744078 |
| MS-LTR NDCG@1 | 0.521265 | 0.521392 | 0.518653 | 0.521789 | 0.522163 | 0.516388 |
| MS-LTR NDCG@3 | 0.503153 | 0.505753 | 0.501697 | 0.503886 | 0.504089 | 0.501691 |
| MS-LTR NDCG@5 | 0.509236 | 0.510391 | 0.507193 | 0.509861 | 0.510095 | 0.50663 |
| MS-LTR NDCG@10 | 0.527835 | 0.527304 | 0.524603 | 0.528009 | 0.527059 | 0.524722 |
| Bosch AUC | 0.718115 | 0.721791 | 0.716677 | 0.717184 | 0.724761 | 0.717005 |


We record the wall clock time after 500 iterations, as shown in the figure below:

![Performance Comparison](http://www.huan-zhang.com/images/upload/lightgbm-gpu/compare_0bb4a825.png)

When using a GPU, it is advisable to use a bin size of 63 rather than 255,
because it can speed up training significantly without noticeably affecting
accuracy. On the CPU, using a smaller bin size only marginally improves
performance, and sometimes even slows down training, as on Higgs (we can
reproduce the same slowdown on two different machines with different GCC
versions). We found that the GPU achieves impressive acceleration on large,
dense datasets like Higgs and Epsilon. Even on smaller and sparser datasets,
a *budget* GPU can still compete with, and be faster than, a 28-core Haswell server.

Memory Usage
---------------

The next table shows the GPU memory usage reported by `nvidia-smi` during training
with 63 bins. Even the largest dataset uses only about 1 GB of
GPU memory, indicating that our GPU implementation can scale to huge
datasets over 10x larger than Bosch or Epsilon. We also observe that
larger datasets (which use more GPU memory, like Epsilon or Bosch) generally
achieve a better speedup, because the overhead of invoking GPU functions becomes
significant when the dataset is small.

| Datasets | Higgs | Epsilon | Bosch | MS-LTR | Expo |Yahoo-LTR |
|-----------------------|-------|---------|--------|---------|-------|----------|
| GPU Memory Usage (MB) | 611 | 901 | 1067 | 413 | 405 | 291 |



185 changes: 185 additions & 0 deletions docs/GPU-Tutorial.md
@@ -0,0 +1,185 @@
LightGBM GPU Tutorial
==================================

The purpose of this document is to give you a quick step-by-step tutorial on GPU training.
We will use a GPU instance on the
[Microsoft Azure cloud computing platform](https://azure.microsoft.com/)
for demonstration, but you can use any machine with a modern AMD or NVIDIA GPU.


GPU Setup
-------------------------

You need to launch an `NV` type instance on Azure (available in the East US, North
Central US, South Central US, West Europe and Southeast Asia zones)
and select Ubuntu 16.04 LTS as the operating system.
For testing, the smallest `NV6` type virtual machine is sufficient; it includes
half of an M60 GPU, with 8 GB of memory, 180 GB/s memory bandwidth and 4,825 GFLOPS of peak
compute power. Don't use the `NC` type instances, as their GPUs (K80) are
based on an older architecture (Kepler).

First we need to install a minimal set of NVIDIA drivers and the OpenCL development environment:

```
sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-375
sudo apt-get install --no-install-recommends nvidia-opencl-icd-375 nvidia-opencl-dev opencl-headers
```

After installing the drivers, you need to restart the server.

```
sudo init 6
```

After about 30 seconds, the server should be up again.

If you are using an AMD GPU, you should download and install the
[AMDGPU-Pro](http://support.amd.com/en-us/download/linux) driver, and
also install the packages `ocl-icd-libopencl1` and `ocl-icd-opencl-dev`.

Build LightGBM
----------------------------

Now install the necessary build tools and dependencies:
```
sudo apt-get install --no-install-recommends git cmake build-essential libboost-dev libboost-system-dev libboost-filesystem-dev
```

The NV6 GPU instance has a 320 GB ultra-fast SSD mounted at `/mnt`. Let's use it
as our workspace (skip this if you are using your own machine):

```
sudo mkdir -p /mnt/workspace
sudo chown $(whoami):$(whoami) /mnt/workspace
cd /mnt/workspace
```

Now we are ready to check out LightGBM and compile it with GPU support:

```
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
mkdir build ; cd build
cmake -DUSE_GPU=1 ..
make -j$(nproc)
cd ..
```

Two binaries will be generated: `lightgbm` and `lib_lightgbm.so`.

If you are building on OSX, you probably need to remove the macro
`BOOST_COMPUTE_USE_OFFLINE_CACHE` in `src/treelearner/gpu_tree_learner.h` to
avoid a known crash bug in Boost.Compute.

Install Python Interface (optional)
-----------------------------------

If you want to use the Python interface of LightGBM, you can install it now
(along with some necessary Python package dependencies):

```
sudo apt-get -y install python-pip
sudo -H pip install setuptools numpy scipy scikit-learn -U
cd python-package/
sudo python setup.py install
cd ..
```

To use the GPU in Python, set the additional parameter `"device" : "gpu"` (along with your
other options like `learning_rate`, `num_leaves`, etc.).
You can read our [Python Guide](https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide)
for more information on how to use the Python interface.
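
For example, a minimal Python training script could look like the following
sketch (it assumes the `higgs.train` and `higgs.test` files prepared in the
next section, and the parameter values mirror the configuration file used
later in this tutorial):

```
import lightgbm as lgb

# Load the LibSVM-format files prepared in the "Dataset Preparation" section.
train_data = lgb.Dataset("higgs.train")
valid_data = lgb.Dataset("higgs.test", reference=train_data)

params = {
    "objective": "binary",
    "metric": "auc",
    "device": "gpu",        # enable GPU training
    "gpu_platform_id": 0,
    "gpu_device_id": 0,
    "max_bin": 63,
    "num_leaves": 255,
    "learning_rate": 0.1,
}

# Train for 50 iterations and report the AUC on the validation set.
bst = lgb.train(params, train_data, num_boost_round=50, valid_sets=[valid_data])
```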

Dataset Preparation
----------------------------

Use the following commands to prepare the Higgs dataset:

```
git clone https://github.com/guolinke/boosting_tree_benchmarks.git
cd boosting_tree_benchmarks/data
wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
gunzip HIGGS.csv.gz
python higgs2libsvm.py
cd ../..
ln -s boosting_tree_benchmarks/data/higgs.train
ln -s boosting_tree_benchmarks/data/higgs.test
```

Now we create a configuration file for LightGBM by running the following commands
(please copy the entire block and run it as a whole):

```
cat > lightgbm_gpu.conf <<EOF
max_bin = 63
num_leaves = 255
num_iterations = 50
learning_rate = 0.1
tree_learner = serial
task = train
is_train_metric = false
min_data_in_leaf = 1
min_sum_hessian_in_leaf = 100
ndcg_eval_at = 1,3,5,10
sparse_threshold = 1.0
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
EOF
echo "num_threads=$(nproc)" >> lightgbm_gpu.conf
```

GPU training is enabled in the configuration file we just created by setting `device=gpu`.
By default it will use the first GPU installed on the system (`gpu_platform_id=0` and
`gpu_device_id=0`).

Run Your First Learning Task on GPU
-----------------------------------

Now we are ready to start GPU training! First we want to verify that the GPU works
correctly. Run the following command to train on the GPU, and take note of the
AUC after 50 iterations:

```
./lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc
```

Now train the same dataset on the CPU using the following command. You should observe a similar AUC:

```
./lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc device=cpu
```

Now we can run a speed test on the GPU without computing the AUC after each iteration:

```
./lightgbm config=lightgbm_gpu.conf data=higgs.train objective=binary metric=auc
```

Speed test on CPU:

```
./lightgbm config=lightgbm_gpu.conf data=higgs.train objective=binary metric=auc device=cpu
```

You should observe a speedup of more than three times on this GPU.

GPU acceleration can be used for other tasks/metrics (regression, multi-class classification, ranking, etc.)
as well. For example, we can train the Higgs dataset on the GPU as a regression task:

```
./lightgbm config=lightgbm_gpu.conf data=higgs.train objective=regression_l2 metric=l2
```

You can also compare the training speed with the CPU:

```
./lightgbm config=lightgbm_gpu.conf data=higgs.train objective=regression_l2 metric=l2 device=cpu
```

Further Reading
---------------

[GPU Tuning Guide and Performance Comparison](./GPU-Performance.md)