GPU performance for Quadro P1000 (and 4 GPUs) #12

Open
Laurae2 opened this issue Apr 30, 2019 · 11 comments

Laurae2 commented Apr 30, 2019

CPU: Dual Xeon Gold 6154 (36 cores / 72 threads, 3.7 GHz)
OS: Pop!_OS 18.10
GPU versions: dmlc/xgboost@4fac987 and microsoft/LightGBM@5ece53b
Compilers / Drivers: CUDA 10.0.154 + NCCL 2.3.7 + OpenCL 1.2 + gcc 8.1 + Intel MKL 2019

CPU only with 18 physical threads (numactl for 1st socket + OpenMP environment variables lock in):

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 4.181 | 0.7324224 |
| xgb | 1m | 15.978 | 0.7494959 |
| xgb | 10m | 104.598 | 0.7551197 |
| lgb | 0.1m | 1.763 | 0.7298355 |
| lgb | 1m | 4.253 | 0.7636987 |
| lgb | 10m | 38.197 | 0.7742033 |
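
The thread lock-in described above (numactl for the first socket plus OpenMP environment variables) can be sketched roughly as follows. This is a hypothetical invocation, not the exact commands used: the script name `bench.R` and the assumption that socket 0 holds the 18 physical cores are mine.

```shell
# Pin the benchmark to NUMA node 0 and cap OpenMP at 18 threads
# (one per physical core; node id and script name are illustrative).
export OMP_NUM_THREADS=18   # one thread per physical core of the first socket
export OMP_PROC_BIND=TRUE   # keep threads from migrating across cores
numactl --cpunodebind=0 --membind=0 Rscript bench.R
```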

1x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 17.529 | 0.7328954 |
| xgb | 1m | 38.528 | 0.7499591 |
| xgb | 10m | 103.154 | 0.7564821 |
| lgb | 0.1m | 18.345 | 0.7298129 |
| lgb | 1m | 22.179 | 0.7640155 |
| lgb | 10m | 62.929 | 0.774168 |

4x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 18.838 | 0.7324756 |
| xgb | 1m | 36.877 | 0.749169 |
| xgb | 10m | 64.994 | 0.7564492 |
@RAMitchell

Try updating the xgboost commit to latest. I would expect to see considerable improvement.


Laurae2 commented Apr 30, 2019

@RAMitchell It actually seems to be slower. I am using only dmlc/xgboost@84d992b (16 days before this post) because dmlc/xgboost#4323 broke all my installation scripts/packages.

1x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 21.676 | 0.7325956 |
| xgb | 1m | 44.178 | 0.7494882 |
| xgb | 10m | 110.799 | 0.7564208 |

4x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 22.842 | 0.7324483 |
| xgb | 1m | 43.264 | 0.749597 |
| xgb | 10m | 71.226 | 0.7564267 |

xgboost script (copy & paste):

```r
library(data.table)
library(xgboost)
library(Matrix)
library(ROCR)

set.seed(123)

d_train <- fread("train-10m.csv", showProgress=FALSE)
d_test <- fread("test.csv", showProgress=FALSE)

# One-hot encode train and test together so both share the same columns
X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]

dxgb_train <- xgb.DMatrix(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))

cat(system.time({
  md <- xgb.train(data = dxgb_train, 
                  objective = "binary:logistic", 
                  nrounds = 100, max_depth = 10, eta = 0.1, 
                  tree_method = "gpu_hist", n_gpus = 4, nthread = 4)
})[[3]]," ",sep="")

phat <- predict(md, newdata = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
```

@RAMitchell

Hmmm, strange. My benchmarks show xgboost GPU outperforming LightGBM by quite a bit on similarly sized datasets.
Some factors that could be influencing this:

  • nrounds = 100 vs. 500: xgboost's start-up time could be a significantly larger share of the total
  • Quadro P1000: maybe there is something unique about this chip? I think this is unlikely.
  • Some unknown overhead from using R. I used Python for all my benchmarks.


Laurae2 commented May 1, 2019

@RAMitchell The slowdown is more likely caused by max_depth=10 (depth > 6 is slow on GPU).

R and Python have near identical runtimes.

Note that the data's features are themselves very unbalanced. Computing the gradients for splitting on a one-hot-encoded (OHE) column should take less than 1 millisecond per feature on a GPU, while other features with different data types take somewhat longer.


Laurae2 commented May 11, 2019

@RAMitchell GPU hist xgboost seems to have difficulty dealing with very sparse data. The 0.1M dataset (15.1 MB as sparse, 100K observations × 695 features) gobbles 659 MB (maybe a bit more at initialization; I observed 957 MB).
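
As a rough way to compare the sparse in-memory size against what the library allocates, one can inspect the matrix object directly. This is a sketch with a synthetic stand-in for the benchmark's design matrix (the 100K × 695 shape comes from the thread; the density is an assumption):

```r
library(Matrix)

# Toy stand-in for the benchmark's sparse design matrix:
# 100K rows x 695 one-hot columns, roughly 7 nonzeros per row.
X <- rsparsematrix(100000, 695, density = 7 / 695)

# In-memory size of the CSC representation
# (nonzero values + row indices + column pointers).
print(object.size(X), units = "MB")
```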

See the screenshot below:

(screenshot: GPU memory usage during training)

@RAMitchell

@Laurae2 good to know. I do have plans to improve this in the future. Does a similar problem occur with the CPU algorithm?


Laurae2 commented May 12, 2019

@RAMitchell It occurs with hist on CPU also.


Laurae2 commented May 12, 2019

Newer LightGBM GPU results, using not all threads (restricted to 1 NUMA node and physical cores) and excluding histogram-building time (negligible: 0.04x s for 0.1m, 0.1xx s for 1m, 1.180 s for 10m):

1x Quadro P1000:

| Library | Size | Speed (s) | AUC |
|---------|------|-----------|-----------|
| xgb | 0.1m | 17.529 | 0.7328954 |
| xgb | 1m | 38.528 | 0.7499591 |
| xgb | 10m | 103.154 | 0.7564821 |
| lgb | 0.1m | 5.776 | 0.7298912 |
| lgb | 1m | 8.661 | 0.7661723 |
| lgb | 10m | 39.535 | 0.7742480 |

Script:

```r
suppressMessages({
    library(data.table)
    library(ROCR)
    library(lightgbm)
    library(Matrix)
})

set.seed(123)

d_train <- fread("train-10m.csv", showProgress=FALSE)
d_test <- fread("test.csv", showProgress=FALSE)

X_train_test <- sparse.model.matrix(dep_delayed_15min ~ . -1, data = rbindlist(list(d_train, d_test)))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1 + 1):(n1 + n2),]
labels <- as.numeric(d_train$dep_delayed_15min == "Y")

dlgb_train <- lgb.Dataset(data = X_train, label = labels, nthread = 18, device = "gpu")
cat(system.time({lgb.Dataset.construct(dlgb_train)})[[3]], " ", sep = "")

cat(system.time({
    md <- lgb.train(data = dlgb_train, 
                    objective = "binary", 
                    nrounds = 100, num_leaves = 512, learning_rate = 0.1, 
                    device = "gpu", 
                    nthread = 18,
                    verbose = 0)
})[[3]], " ", sep = "")

phat <- predict(md, data = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]], "\n")
invisible(gc(verbose = FALSE))

rm(md, dlgb_train, phat, rocr_pred)
gc(verbose = FALSE)
```

No idea why it does not crash when I add device = "gpu" to lgb.Dataset (otherwise, it crashes on the 10m dataset).

@trivialfis

@RAMitchell @Laurae2 It might be the AUC implementation. I will see if it's possible to revise it a little in the next release.


Laurae2 commented May 18, 2019

@trivialfis We are not passing any watchlist; only the objective's gradient/hessian is computed.

@szilard szilard changed the title GPU performance for Quadro P1000 GPU performance for Quadro P1000 (and 4 GPUs) May 20, 2019

Laurae2 commented Sep 25, 2020

CUDA LightGBM: some results from myself here: https://gist.github.com/Laurae2/7195cebe65887907a06e9118a3ec7f96 (VERY experimental)

Using commit microsoft/LightGBM@df37bce (25 Sept 2020).

Per-GPU usage increases as fewer GPUs are used (e.g. 80% for 1 GPU vs. 50% each for 4 GPUs).

Note: CUDA uses double precision. OpenCL can use single precision or double precision (gpu_use_dp = TRUE).
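For reference, the precision choice maps to LightGBM parameters roughly as follows. This is a sketch: `device = "cuda"` and `num_gpu` follow the experimental CUDA build's naming and should be checked against the documentation of the commit being used.

```r
# OpenCL backend: single precision by default, double precision opt-in.
params_ocl_sp <- list(objective = "binary", device = "gpu", gpu_use_dp = FALSE)
params_ocl_dp <- list(objective = "binary", device = "gpu", gpu_use_dp = TRUE)

# Experimental CUDA backend: always double precision; multi-GPU via num_gpu.
params_cuda <- list(objective = "binary", device = "cuda", num_gpu = 4)
```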

Airline OHE (see previous link for code):

| Compute | Timing | AUC | GPU RAM |
|---------|--------|-----|---------|
| 18T Dual Xeon 6154 CPU | 15.872s | 0.7745457 | — |
| 1x Quadro P1000 GPU CUDA | 43.767s | 0.7736450 | 285 MB |
| 2x Quadro P1000 GPU CUDA | 32.291s | 0.7736450 | 215 MB |
| 3x Quadro P1000 GPU CUDA | 29.732s | 0.7736450 | 197-207 MB |
| 4x Quadro P1000 GPU CUDA | 29.515s | 0.7736450 | 187-197 MB |
| OpenCL Quadro P1000 GPU sp | 25.810s | 0.7760418 | 329 MB |
| OpenCL Quadro P1000 GPU dp | 40.080s | 0.7747921 | 337 MB |

Airline Categoricals (see previous link for code):

| Compute | Timing | AUC | GPU RAM |
|---------|--------|-----|---------|
| 18T Dual Xeon 6154 CPU | 18.281s | 0.7922730 | — |
| 1x Quadro P1000 GPU CUDA | 53.890s | 0.7922730 | 245 MB |
| 2x Quadro P1000 GPU CUDA | 39.789s | 0.7922730 | 207 MB |
| 3x Quadro P1000 GPU CUDA | 38.705s | 0.7922730 | 197 MB |
| 4x Quadro P1000 GPU CUDA | 36.903s | 0.7924575 | 187 MB |
| OpenCL Quadro P1000 GPU sp | 23.896s | 0.7924575 | 329 MB |
| OpenCL Quadro P1000 GPU dp | 35.693s | 0.7920217 | 337 MB |

Using categoricals as example benchmark for GPU usage:

nvidia-smi of 1 GPU on LightGBM CUDA: (screenshot)

nvtop of 1 GPU on LightGBM CUDA: (screenshot)

nvidia-smi of 4 GPUs on LightGBM CUDA: (screenshot)

nvtop of 4 GPUs on LightGBM CUDA: (screenshot)
