GPU performance for Quadro P1000 (and 4 GPUs) #12

Open
Laurae2 opened this issue Apr 30, 2019 · 11 comments

Laurae2 commented Apr 30, 2019

CPU: Dual Xeon Gold 6154 (36 cores / 72 threads, 3.7 GHz)
OS: Pop!_OS 18.10
GPU versions: dmlc/xgboost@4fac987 and microsoft/LightGBM@5ece53b
Compilers / Drivers: CUDA 10.0.154 + NCCL 2.3.7 + OpenCL 1.2 + gcc 8.1 + Intel MKL 2019

CPU only with 18 physical threads (numactl for 1st socket + OpenMP environment variables lock in):

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 4.181 | 0.7324224 |
| xgb | 1m | 15.978 | 0.7494959 |
| xgb | 10m | 104.598 | 0.7551197 |
| lgb | 0.1m | 1.763 | 0.7298355 |
| lgb | 1m | 4.253 | 0.7636987 |
| lgb | 10m | 38.197 | 0.7742033 |
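
The thread lock-in described above (numactl for the first socket plus OpenMP environment variables) can be sketched roughly as follows. This is a hypothetical invocation, not the exact commands used: the script name `bench.R` and the assumption that socket 0 holds the 18 physical cores are mine.

```shell
# Pin the benchmark to NUMA node 0 and cap OpenMP at 18 threads
# (one per physical core; node id and script name are illustrative).
export OMP_NUM_THREADS=18   # one thread per physical core of the first socket
export OMP_PROC_BIND=TRUE   # keep threads from migrating across cores
numactl --cpunodebind=0 --membind=0 Rscript bench.R
```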

1x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 17.529 | 0.7328954 |
| xgb | 1m | 38.528 | 0.7499591 |
| xgb | 10m | 103.154 | 0.7564821 |
| lgb | 0.1m | 18.345 | 0.7298129 |
| lgb | 1m | 22.179 | 0.7640155 |
| lgb | 10m | 62.929 | 0.774168 |

4x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 18.838 | 0.7324756 |
| xgb | 1m | 36.877 | 0.749169 |
| xgb | 10m | 64.994 | 0.7564492 |
@RAMitchell

Try updating the xgboost commit to latest. I would expect to see considerable improvement.


Laurae2 commented Apr 30, 2019

@RAMitchell It actually seems to be slower. I am using only dmlc/xgboost@84d992b (16 days before this post) because dmlc/xgboost#4323 broke all my installation scripts/packages.

1x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 21.676 | 0.7325956 |
| xgb | 1m | 44.178 | 0.7494882 |
| xgb | 10m | 110.799 | 0.7564208 |

4x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 22.842 | 0.7324483 |
| xgb | 1m | 43.264 | 0.749597 |
| xgb | 10m | 71.226 | 0.7564267 |

xgboost script (copy & paste):

```r
library(data.table)
library(xgboost)
library(Matrix)
library(ROCR)

set.seed(123)

d_train <- fread("train-10m.csv", showProgress=FALSE)
d_test <- fread("test.csv", showProgress=FALSE)

# One-hot encode train and test together so both share the same columns
X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]

dxgb_train <- xgb.DMatrix(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))

cat(system.time({
  md <- xgb.train(data = dxgb_train, 
                  objective = "binary:logistic", 
                  nrounds = 100, max_depth = 10, eta = 0.1, 
                  tree_method = "gpu_hist", n_gpus = 4, nthread = 4)
})[[3]]," ",sep="")

phat <- predict(md, newdata = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
```

@RAMitchell

Hmmm, strange. My benchmarks show xgboost GPU outperforming LightGBM by quite a bit on similarly sized datasets.
Some factors that could be influencing this:

  • nrounds = 100 vs. 500: xgboost's start-up time could be a significantly larger share of the total
  • Quadro P1000: maybe there is something unique about this chip? I think this is unlikely.
  • Some unknown overhead from using R. I used Python for all my benchmarks.


Laurae2 commented May 1, 2019

@RAMitchell The slowdown is more likely caused by max_depth=10 (depth > 6 is slow on GPU).

R and Python have near identical runtimes.

Note that the data's features are themselves very unbalanced. Computing the gradients for splitting on a one-hot-encoded (OHE) column should take less than 1 millisecond per feature on a GPU, while other features with different data types take somewhat longer.


Laurae2 commented May 11, 2019

@RAMitchell GPU hist xgboost seems to have difficulty dealing with very sparse data. The 0.1M dataset (15.1 MB as sparse, 100K observations × 695 features) gobbles 659 MB (maybe a bit more at initialization; I observed 957 MB).
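
As a rough way to compare the sparse in-memory size against what the library allocates, one can inspect the matrix object directly. This is a sketch with a synthetic stand-in for the benchmark's design matrix (the 100K × 695 shape comes from the thread; the density is an assumption):

```r
library(Matrix)

# Toy stand-in for the benchmark's sparse design matrix:
# 100K rows x 695 one-hot columns, roughly 7 nonzeros per row.
X <- rsparsematrix(100000, 695, density = 7 / 695)

# In-memory size of the CSC representation
# (nonzero values + row indices + column pointers).
print(object.size(X), units = "MB")
```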

See the screenshot below:

(screenshot: GPU memory usage during training)

@RAMitchell

@Laurae2 good to know. I do have plans to improve this in the future. Does a similar problem occur with the CPU algorithm?


Laurae2 commented May 12, 2019

@RAMitchell It occurs with hist on CPU also.


Laurae2 commented May 12, 2019

Newer LightGBM GPU results, using not all threads (restricted to 1 NUMA node and physical cores) and excluding histogram-building time (negligible: 0.04x s for 0.1m, 0.1xx s for 1m, 1.180 s for 10m):

1x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 17.529 | 0.7328954 |
| xgb | 1m | 38.528 | 0.7499591 |
| xgb | 10m | 103.154 | 0.7564821 |
| lgb | 0.1m | 5.776 | 0.7298912 |
| lgb | 1m | 8.661 | 0.7661723 |
| lgb | 10m | 39.535 | 0.7742480 |

Script:

```r
suppressMessages({
    library(data.table)
    library(ROCR)
    library(lightgbm)
    library(Matrix)
})

set.seed(123)

d_train <- fread("train-10m.csv", showProgress=FALSE)
d_test <- fread("test.csv", showProgress=FALSE)

X_train_test <- sparse.model.matrix(dep_delayed_15min ~ . -1, data = rbindlist(list(d_train, d_test)))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1 + 1):(n1 + n2),]
labels <- as.numeric(d_train$dep_delayed_15min == "Y")

dlgb_train <- lgb.Dataset(data = X_train, label = labels, nthread = 18, device = "gpu")
cat(system.time({lgb.Dataset.construct(dlgb_train)})[[3]], " ", sep = "")

cat(system.time({
    md <- lgb.train(data = dlgb_train, 
                    objective = "binary", 
                    nrounds = 100, num_leaves = 512, learning_rate = 0.1, 
                    device = "gpu", 
                    nthread = 18,
                    verbose = 0)
})[[3]], " ", sep = "")

phat <- predict(md, data = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]], "\n")
invisible(gc(verbose = FALSE))

rm(md, dlgb_train, phat, rocr_pred)
gc(verbose = FALSE)
```

No idea why it does not crash when I add device = "gpu" to lgb.Dataset (otherwise, it crashes on the 10m dataset).

@trivialfis

@RAMitchell @Laurae2 It might be the AUC implementation. I will see if it's possible to revise it a little in the next release.


Laurae2 commented May 18, 2019

@trivialfis We are not passing any watchlist; only the objective's gradient/hessian is computed.

@szilard szilard changed the title GPU performance for Quadro P1000 GPU performance for Quadro P1000 (and 4 GPUs) May 20, 2019

Laurae2 commented Sep 25, 2020

CUDA LightGBM: some results from myself here: https://gist.github.com/Laurae2/7195cebe65887907a06e9118a3ec7f96 (VERY experimental)

Using commit microsoft/LightGBM@df37bce (25 Sept 2020).

Per-GPU usage increases as fewer GPUs are used (e.g. 80% for 1 GPU vs. 50% each for 4 GPUs).

Note: CUDA uses double precision. OpenCL can use single precision or double precision (gpu_use_dp = TRUE).
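For reference, the precision choice maps to LightGBM parameters roughly as follows. This is a sketch: `device = "cuda"` and `num_gpu` follow the experimental CUDA build's naming and should be checked against the documentation of the commit being used.

```r
# OpenCL backend: single precision by default, double precision opt-in.
params_ocl_sp <- list(objective = "binary", device = "gpu", gpu_use_dp = FALSE)
params_ocl_dp <- list(objective = "binary", device = "gpu", gpu_use_dp = TRUE)

# Experimental CUDA backend: always double precision; multi-GPU via num_gpu.
params_cuda <- list(objective = "binary", device = "cuda", num_gpu = 4)
```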

Airline OHE (see previous link for code):

| Compute | Timing | AUC | GPU RAM |
|---------|--------|-----|---------|
| 18T Dual Xeon 6154 CPU | 15.872s | 0.7745457 | — |
| 1x Quadro P1000 GPU CUDA | 43.767s | 0.7736450 | 285 MB |
| 2x Quadro P1000 GPU CUDA | 32.291s | 0.7736450 | 215 MB |
| 3x Quadro P1000 GPU CUDA | 29.732s | 0.7736450 | 197-207 MB |
| 4x Quadro P1000 GPU CUDA | 29.515s | 0.7736450 | 187-197 MB |
| OpenCL Quadro P1000 GPU sp | 25.810s | 0.7760418 | 329 MB |
| OpenCL Quadro P1000 GPU dp | 40.080s | 0.7747921 | 337 MB |

Airline Categoricals (see previous link for code):

| Compute | Timing | AUC | GPU RAM |
|---------|--------|-----|---------|
| 18T Dual Xeon 6154 CPU | 18.281s | 0.7922730 | — |
| 1x Quadro P1000 GPU CUDA | 53.890s | 0.7922730 | 245 MB |
| 2x Quadro P1000 GPU CUDA | 39.789s | 0.7922730 | 207 MB |
| 3x Quadro P1000 GPU CUDA | 38.705s | 0.7922730 | 197 MB |
| 4x Quadro P1000 GPU CUDA | 36.903s | 0.7924575 | 187 MB |
| OpenCL Quadro P1000 GPU sp | 23.896s | 0.7924575 | 329 MB |
| OpenCL Quadro P1000 GPU dp | 35.693s | 0.7920217 | 337 MB |

Using categoricals as example benchmark for GPU usage:

nvidia-smi of 1 GPU on LightGBM CUDA: (screenshot)

nvtop of 1 GPU on LightGBM CUDA: (screenshot)

nvidia-smi of 4 GPUs on LightGBM CUDA: (screenshot)

nvtop of 4 GPUs on LightGBM CUDA: (screenshot)
