GPU performance for Quadro P1000 (and 4 GPUs) #12
Try updating the xgboost commit to the latest. I would expect to see considerable improvement.
@RAMitchell It actually seems to be slower. I am using only dmlc/xgboost@84d992b (16 days old as of this post) because dmlc/xgboost#4323 broke all my installation scripts/packages.
1x Quadro P1000:
4x Quadro P1000:
xgboost script (copy & paste):
library(data.table)
library(xgboost)
library(Matrix)
library(ROCR)
set.seed(123)
d_train <- fread("train-10m.csv", showProgress=FALSE)
d_test <- fread("test.csv", showProgress=FALSE)
X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]
dxgb_train <- xgb.DMatrix(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))
cat(system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.1,
tree_method = "gpu_hist", n_gpus = 4, nthread = 4)
})[[3]]," ",sep="")
phat <- predict(md, newdata = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
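For the 4-GPU run above, which devices xgboost can see can also be narrowed down before any GPU work starts. A minimal sketch, assuming the standard CUDA_VISIBLE_DEVICES mechanism; the device IDs are placeholders for the four Quadro P1000 cards, and this is not necessarily how the numbers above were produced:
Sys.setenv(CUDA_VISIBLE_DEVICES = "0,1,2,3")  # "0" would force a single-GPU run
library(xgboost)  # set the variable before any GPU training is triggered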
Hmmm strange. My benchmarks show xgboost GPU outperforming LightGBM by quite a bit on similarly sized datasets.
@RAMitchell It is more likely the depth=10 (depth > 6 for GPU) that causes the slowdown; R and Python have near-identical runtimes. Note that the data features themselves are very unbalanced: computing the gradients for splitting on an OHE column should take less than 1 millisecond per feature on a GPU, while other features take a bit more time if the data type is different.
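To make the depth effect concrete, here is a minimal sketch that times gpu_hist at max_depth 6 versus 10 on the same DMatrix; it assumes dxgb_train from the xgboost script above is still in memory, and the single-GPU setting is an assumption:
library(xgboost)
for (d in c(6, 10)) {
  t <- system.time({
    xgb.train(data = dxgb_train,
              objective = "binary:logistic",
              nrounds = 100, max_depth = d, eta = 0.1,
              tree_method = "gpu_hist", nthread = 4)
  })[[3]]
  cat("max_depth = ", d, ": ", t, " seconds\n", sep = "")
}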
@RAMitchell GPU hist xgboost seems to have difficulties dealing with very sparse data. The 0.1M dataset (15.1 MB as sparse, 100K observations x 695 features) gobbles up 659 MB of GPU memory (maybe a bit more at initialization, I noticed 957 MB). See the picture below:
@Laurae2 good to know. I do have plans to improve this in the future. Does a similar problem occur with the CPU algorithm?
@RAMitchell It occurs with
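As a rough host-side reference for those memory numbers, here is a minimal sketch that reports the size of the sparse design matrix and its non-zero density; it assumes X_train, the dgCMatrix built in the scripts in this thread, and does not reproduce the 659 MB / 957 MB figures, which were GPU readings from nvidia-smi:
library(Matrix)
print(object.size(X_train), units = "MB")   # host-side size of the sparse matrix
cat("non-zero density:", length(X_train@x) / prod(dim(X_train)), "\n")  # fraction of stored entries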
Newer LightGBM GPU results, not using all threads, restricted to 1 NUMA node and physical cores, and excluding histogram building time (negligible: 0.04x seconds for 0.1m, 0.1xx seconds for 1m, 1.180 seconds for 10m):
1x Quadro P1000:
Script:
suppressMessages({
library(data.table)
library(ROCR)
library(lightgbm)
library(Matrix)
})
set.seed(123)
d_train <- fread("train-10m.csv", showProgress=FALSE)
d_test <- fread("test.csv", showProgress=FALSE)
X_train_test <- sparse.model.matrix(dep_delayed_15min ~ . -1, data = rbindlist(list(d_train, d_test)))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1 + 1):(n1 + n2),]
labels <- as.numeric(d_train$dep_delayed_15min == "Y")
dlgb_train <- lgb.Dataset(data = X_train, label = labels, nthread = 18, device = "gpu")
cat(system.time({lgb.Dataset.construct(dlgb_train)})[[3]], " ", sep = "")
cat(system.time({
md <- lgb.train(data = dlgb_train,
objective = "binary",
nrounds = 100, num_leaves = 512, learning_rate = 0.1,
device = "gpu",
nthread = 18,
verbose = 0)
})[[3]], " ", sep = "")
phat <- predict(md, data = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]], "\n")
invisible(gc(verbose = FALSE))
rm(md, dlgb_train, phat, rocr_pred)
gc(verbose = FALSE)
No idea why it does not crash when I add
@RAMitchell @Laurae2 It might be the AUC implementation. I will see if it's possible to revise it a little in the next release.
@trivialfis We are not passing any watchlist; only the objective for the gradient / hessian is computed.
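Since evaluation happens outside the boosters here anyway, the ROCR AUC can be cross-checked with a dependency-free rank-sum (Mann-Whitney) computation. A minimal sketch, assuming phat and d_test from the scripts above; this is not the internal xgboost or LightGBM AUC implementation:
auc_rank <- function(scores, labels) {
  y <- as.numeric(labels == "Y")
  r <- rank(scores)                    # average ranks handle tied scores
  n_pos <- sum(y)
  n_neg <- length(y) - n_pos
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
cat(auc_rank(phat, d_test$dep_delayed_15min), "\n")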
CUDA LightGBM: some results from myself here: https://gist.github.com/Laurae2/7195cebe65887907a06e9118a3ec7f96 (VERY experimental). Using commit microsoft/LightGBM@df37bce (25 Sept 2020). GPU usage increases as fewer GPUs are used (e.g. 80% for 1 GPU, 50% for 4 GPUs). Note: CUDA uses double precision; OpenCL can use single precision or double precision.
Airline OHE (see previous link for code):
Airline Categoricals (see previous link for code):
Using categoricals as the example benchmark for GPU usage:
nvidia-smi of 1 GPU on LightGBM CUDA:
nvtop of 1 GPU on LightGBM CUDA:
nvidia-smi of 4 GPUs on LightGBM CUDA:
nvtop of 4 GPUs on LightGBM CUDA:
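For reference, a hedged sketch of how the two GPU backends would be selected from the R package; exact parameter names and multi-GPU support depend on the build (num_gpu in particular is an assumption for the experimental CUDA backend), so treat this as an illustration rather than the exact setup used above:
library(lightgbm)
params_opencl <- list(objective = "binary", num_leaves = 512, learning_rate = 0.1,
                      device = "gpu",      # OpenCL backend
                      gpu_use_dp = TRUE)   # double precision; single precision if FALSE
params_cuda <- list(objective = "binary", num_leaves = 512, learning_rate = 0.1,
                    device = "cuda",       # experimental CUDA backend, double precision only
                    num_gpu = 4)           # assumed multi-GPU knob for the CUDA backend
# md <- lgb.train(params = params_cuda, data = dlgb_train, nrounds = 100, verbose = 0)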
CPU: Dual Xeon Gold 6154 (36 cores / 72 threads, 3.7 GHz)
OS: Pop!_OS 18.10
GPU versions: dmlc/xgboost@4fac987 and microsoft/LightGBM@5ece53b
Compilers / Drivers: CUDA 10.0.154 + NCCL 2.3.7 + OpenCL 1.2 + gcc 8.1 + Intel MKL 2019
CPU only with 18 physical threads (numactl for the 1st socket + OpenMP environment variable lock-in; sketched after the results below):
1x Quadro P1000:
4x Quadro P1000:
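A minimal sketch of what the OpenMP environment lock-in could look like from inside R, set before xgboost/lightgbm are loaded; the exact variables and values used for these runs are assumptions, and the numactl pinning to the first socket happens when R itself is launched (e.g. numactl --cpunodebind=0 Rscript ...):
Sys.setenv(OMP_NUM_THREADS = "18",     # 18 physical cores on one Xeon Gold 6154
           OMP_PROC_BIND   = "TRUE",   # keep OpenMP threads pinned in place
           OMP_PLACES      = "cores")  # one place per physical core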