LightGBM pre-performance #5
Also, LightGBM GPU RAM usage is very predictable once you have run it once: there is no need to guess the peak RAM used, because the peak seems to be reached early and then stays flat for the rest of training. xgboost GPU usage is less predictable, as its peak is (usually) very short and can't always be caught. Here is an example of extrapolating from an estimated 67.25 MB peak RAM per process to max out my 4GB GPU, without crashing at 68 processes (with still a small headroom for 1 more):
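As a rough sketch of that extrapolation (assuming a flat per-process peak, using the 67.25 MB figure above; the 4096 MB total and 64 MB headroom are assumed values, not measured ones):

```python
import math

def max_gpu_processes(total_mb: float, per_process_mb: float,
                      headroom_mb: float = 0.0) -> int:
    """Estimate how many concurrent training processes fit in GPU RAM,
    assuming each process holds a flat peak of per_process_mb."""
    return math.floor((total_mb - headroom_mb) / per_process_mb)

# Assumed numbers: 4096 MB card, 67.25 MB per process,
# 64 MB reserved for the driver/display.
print(max_gpu_processes(4096, 67.25, 64))  # -> 59
```

Note that this simple division is conservative: 68 processes fit in practice, which suggests part of the measured per-process peak is memory shared across processes (e.g. the OpenCL context), so the true marginal cost per extra process is lower.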
Obviously, we cannot run 4x GPU with 68 R threads each (which would max out each GPU at nearly 4GB RAM), because R doesn't have enough available connections (limited to 125 at the moment), see HenrikBengtsson/Wishlist-for-R#28 - this could otherwise maximize the performance of LightGBM for this data. A workaround would be to nest the parallelization, but that might hit another limit once 1000 connections are reached.

The 125-connection limit also constrains how far this idea can go: I already hit it on large servers (8x Intel Platinum 8180/8280, 224 cores / 448 threads!). Perhaps R will do something about it in the future. (I do use a nested-parallelization workaround to scale R to 10,000+ threads across many machines using SSH keys, but the initialization performance for that many servers is poor.)

Running this overnight with LightGBM, and re-running xgboost overnight also:

CPU:
GPU:
Script:

```shell
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=1 --model_threads=1 --parallel_gpus=1 --gpus_threads=1 --number_of_models=50 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=2 --model_threads=1 --parallel_gpus=2 --gpus_threads=1 --number_of_models=100 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=3 --model_threads=1 --parallel_gpus=3 --gpus_threads=1 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=4 --model_threads=1 --parallel_gpus=4 --gpus_threads=1 --number_of_models=500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=4 --model_threads=1 --parallel_gpus=1 --gpus_threads=4 --number_of_models=100 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=8 --model_threads=1 --parallel_gpus=2 --gpus_threads=4 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=12 --model_threads=1 --parallel_gpus=3 --gpus_threads=4 --number_of_models=500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=16 --model_threads=1 --parallel_gpus=4 --gpus_threads=4 --number_of_models=1000 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=9 --model_threads=1 --parallel_gpus=1 --gpus_threads=9 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=18 --model_threads=1 --parallel_gpus=2 --gpus_threads=9 --number_of_models=500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=27 --model_threads=1 --parallel_gpus=3 --gpus_threads=9 --number_of_models=1000 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=36 --model_threads=1 --parallel_gpus=4 --gpus_threads=9 --number_of_models=2500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=18 --model_threads=1 --parallel_gpus=1 --gpus_threads=18 --number_of_models=500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=36 --model_threads=1 --parallel_gpus=2 --gpus_threads=18 --number_of_models=1000 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=54 --model_threads=1 --parallel_gpus=3 --gpus_threads=18 --number_of_models=2500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=72 --model_threads=1 --parallel_gpus=4 --gpus_threads=18 --number_of_models=5000 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=35 --model_threads=1 --parallel_gpus=1 --gpus_threads=35 --number_of_models=1000 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=70 --model_threads=1 --parallel_gpus=2 --gpus_threads=35 --number_of_models=2500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=58 --model_threads=1 --parallel_gpus=1 --gpus_threads=58 --number_of_models=2500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=1 --model_threads=1 --parallel_gpus=0 --gpus_threads=0 --number_of_models=100 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=9 --model_threads=1 --parallel_gpus=0 --gpus_threads=0 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=18 --model_threads=1 --parallel_gpus=0 --gpus_threads=0 --number_of_models=500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=35 --model_threads=1 --parallel_gpus=0 --gpus_threads=0 --number_of_models=1000 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=70 --model_threads=1 --parallel_gpus=0 --gpus_threads=0 --number_of_models=2500 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=1 --model_threads=1 --parallel_gpus=0 --gpus_threads=0 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=1 --model_threads=9 --parallel_gpus=0 --gpus_threads=0 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=1 --model_threads=18 --parallel_gpus=0 --gpus_threads=0 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=1 --model_threads=35 --parallel_gpus=0 --gpus_threads=0 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
env -u GOMP_CPU_AFFINITY Rscript bench_lgb_test.R --parallel_threads=1 --model_threads=70 --parallel_gpus=0 --gpus_threads=0 --number_of_models=250 --wkdir=${DIR} --train_file=train-0.1m.csv --test_file=test.csv --output_dir=./output --output_csv=TRUE --output_chart=jpeg
```
I am currently testing a LightGBM script which can run 40 LightGBM models concurrently on one 4GB GPU (only 2690 MB used).
Throughput: 1.330s per model, insane speed! Imagine this on multiple GPUs...
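A quick back-of-the-envelope on that throughput (assuming the 1.330 s/model figure holds and that scaling across GPUs is perfectly linear, which is a best case, not a measurement):

```python
SECONDS_PER_MODEL = 1.330  # measured on one 4GB GPU with 40 concurrent models

def models_per_hour(n_gpus: int) -> int:
    # best case: assumes perfect linear scaling across GPUs
    return int(n_gpus * 3600 / SECONDS_PER_MODEL)

for n in (1, 2, 4):
    print(f"{n} GPU(s): ~{models_per_hour(n)} models/hour")
```

Even the single-GPU figure works out to roughly 2,700 models per hour; the multi-GPU numbers are upper bounds and should be checked against the benchmark script above.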
You need 67.25 MB per thread for LightGBM on the airline 0.1m data. That's about 10x lower than xgboost. Maybe @RAMitchell can get some ideas for optimizing xgboost for very sparse data on GPU, as it could be a common case (xgboost does not support categoricals). LightGBM handles 16 bins, 64 bins, and 256 bins, as shown here: https://github.com/microsoft/LightGBM/tree/master/src/treelearner/ocl
For the script, you may have to toy with `gpu_platform_id` and `gpu_device_id`. Unfortunately, this is due to the way OpenCL addresses devices (it will crash on a wrong platform/device combo).

Test script: