
Training with 96 CPU cores is slower than with 48 CPU cores #4631

Closed
JivanRoquet opened this issue Sep 28, 2021 · 8 comments

Comments


JivanRoquet commented Sep 28, 2021

  • LightGBM version: 3.2.1
  • Using LightGBM's Python Scikit-Learn API

Hello, I've noticed that training is consistently about 50% slower on a c5.24xlarge AWS instance (96 CPU cores, 192 GB RAM) than on a c5.12xlarge (48 CPU cores, 96 GB RAM).

Model is created with the following settings:

import os

from lightgbm import LGBMClassifier

lgb_c = LGBMClassifier(
    max_depth=7,
    n_estimators=1000,
    reg_alpha=0.15,
    reg_lambda=0.15,
    num_leaves=100,
    learning_rate=0.006,
    colsample_bytree=0.8,
    min_child_samples=20,
    objective='multiclass',
    class_weight='balanced',
    importance_type='gain',
    n_jobs=os.cpu_count(),
)

Training is done with these parameters:

lgb_c.fit(
    X_train,
    y_train,
    eval_set=(X_eval, y_eval),
    eval_names=['eval'],
    eval_metric='multiclass',
    early_stopping_rounds=100,
    verbose=50
)

The eval set has about 15k rows.

The training dataset has about 270k rows, with about 18 categorical features (high cardinality, between 200 and 2,000 unique values each) and 2 numeric (integer) features. The target is a categorical feature with about 300 unique classes.

  • Typical training time with 48 cores: 20 minutes
  • Typical training time with 96 cores: 30 minutes

In each case, all cores are 100% busy for the entire duration of training.

I would have expected training time to go down as the number of cores increases. Is this a bug, or is this behaviour normal? Is there any way to fix it with different hyperparameters or model settings?


Laurae2 commented Sep 28, 2021

Hello,

As you increase the number of CPU threads, the following costs increase:

  • multithreading overhead
  • RAM bandwidth usage
  • cores competing for L1/L2/L3 caches
  • difficulty of sustaining turbo boost clocks
  • NUMA node memory coherency traffic (for multi-CPU setups)
  • CPU interconnect bandwidth usage (for multi-CPU setups)
  • and possibly more, depending on the hardware/software setup (e.g. kernel, heat, power limits)

In most cases you are very likely to hit a large multithreading overhead, especially with "only" 270k rows and few features (and potentially lower clock rates from using more threads). In this scenario, dispatching work to threads costs significantly more CPU time, and that cost becomes large enough to outweigh the gains from parallelism. This is normal and expected behavior.

Note that 100% CPU usage reported in task managers, top, etc. does not mean the CPU is actually being used effectively.

For examples of multithreading scaling, you may want to check szilard/GBM-perf#29 (comment), as well as some of my older detailed benchmarks: https://sites.google.com/view/lauraepp/benchmarks/xgb-vs-lgb-oct-2018 (or, for a simple example on xgboost, Laurae2/ml-perf#6 (comment)).
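As a rough illustration of how such scaling can be measured for a given dataset, the sketch below times the same classifier configuration at several thread counts. It is only a sketch: it assumes X_train and y_train from this issue are available, and it uses a reduced n_estimators purely to keep the timing runs short.

import time

from lightgbm import LGBMClassifier

# Illustrative sketch: time the same model at different thread counts.
# X_train / y_train are assumed to be the training data from this issue.
for n_threads in (12, 24, 48, 96):
    model = LGBMClassifier(
        max_depth=7,
        num_leaves=100,
        n_estimators=100,       # fewer boosting rounds than the real run, only for timing
        objective='multiclass',
        n_jobs=n_threads,       # number of threads LightGBM will use
    )
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"{n_threads:>3} threads: {elapsed:.1f} s")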


JivanRoquet commented Sep 28, 2021

Update with more data on training speed vs. CPU count:

CPU cores    Training time (minutes)
36           22
48           20
72           27
96           27

Interestingly, I've just noticed that:

  • 72-core machine with 48 cores used: 25 minutes
  • 48-core machine with 48 cores used: 20 minutes
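For reference, a minimal sketch (not from this thread) of the two usual ways to cap LightGBM's thread count below the machine's core count instead of using os.cpu_count(): per model via n_jobs, or globally via the standard OMP_NUM_THREADS OpenMP environment variable. The value 48 simply mirrors the comparison above.

import os

# Option 1: cap the OpenMP thread pool globally, before LightGBM initialises it.
os.environ['OMP_NUM_THREADS'] = '48'

from lightgbm import LGBMClassifier

# Option 2: cap the thread count per model; an explicit n_jobs/num_threads
# setting takes precedence over the environment variable.
lgb_c = LGBMClassifier(
    n_jobs=48,
    # ... other hyperparameters as in the original configuration
)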

@JivanRoquet

Thanks @Laurae2 for the clear and useful explanation.


JivanRoquet commented Sep 28, 2021

So does that mean that in this case (same dataset, same hyperparameters), 20 minutes is likely the lower bound on training time? Is there really no way to get significantly below this without changing the hyperparameters?

I've also tried using the GPU, and it makes everything much, much slower on this "small" dataset, especially because of the high-cardinality categorical variables.

@shiyu1994

@JivanRoquet Thank you for using LightGBM. Could you please try the force_row_wise and force_col_wise options to see if the same conclusion holds for both choices?
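For anyone trying this through the scikit-learn API, a minimal sketch: both flags are regular Booster parameters, and the wrapper passes extra keyword arguments through to the booster, so they can be set directly in the constructor. Only one of the two should be set for a given run.

from lightgbm import LGBMClassifier

# Run 1: force row-wise histogram building (multi-threading over rows).
lgb_row = LGBMClassifier(
    objective='multiclass',
    n_jobs=48,
    force_row_wise=True,
    # ... other hyperparameters as in the original configuration
)

# Run 2: force column-wise histogram building (multi-threading over features).
lgb_col = LGBMClassifier(
    objective='multiclass',
    n_jobs=48,
    force_col_wise=True,
    # ... other hyperparameters as in the original configuration
)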

@JivanRoquet

Hi @shiyu1994, I'm going to try this; thanks for the suggestion.


no-response bot commented Dec 10, 2021

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

no-response bot closed this as completed Dec 10, 2021
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023