[dask] Parallel tree learner with Dask cannot overfit a small dataset #4471

Closed
xingyuansun opened this issue Jul 13, 2021 · 4 comments
@xingyuansun

Hi, thanks for providing such a fantastic gradient boosting library! I was doing a sanity check of the parallel tree learner with Dask, asking the model to overfit a small dataset. However, the model seems to fail to do so whenever there are at least two workers. The following code is modified from here. With n_workers=1, the model successfully overfits the training data with a very small MSE (0.05 on my machine), but with n_workers=2 it fails, producing an MSE of a few thousand (2890 on my machine). Could someone explain what actually happens during distributed training in the library? I am using LightGBM version 3.2.1. Thanks!

import dask.array as da
from distributed import Client, LocalCluster
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

if __name__ == "__main__":
    for n_workers in [1, 2]:
        # a small, fixed dataset that the model should easily overfit
        X, y = make_regression(n_samples=1000, n_features=50, random_state=0)
        cluster = LocalCluster(n_workers=n_workers)
        client = Client(cluster)
        # split the data into 10 chunks so it gets distributed across workers
        dX = da.from_array(X, chunks=(100, 50))
        dy = da.from_array(y, chunks=(100,))
        dask_model = lgb.DaskLGBMRegressor(n_estimators=1000, random_state=0)
        dask_model.fit(dX, dy)
        assert dask_model.fitted_
        # predict on the training data itself and compute the training MSE
        preds = dask_model.predict(dX)
        preds_local = preds.compute()
        actuals_local = dy.compute()
        mse = mean_squared_error(actuals_local, preds_local)
        print(f"MSE: {mse}")
        # shut down this cluster before starting the next iteration
        client.close()
        cluster.close()
@jmoralez
Collaborator

Hi. I'm not able to reproduce this with the latest version on master. I believe it could be related to #4026, where a split that produced an empty child on one of the workers would make the predictions very large (which could be the cause of the high MSE you observe). That issue was fixed, but the fix hasn't been released yet.

Can you try running this with the version in master? I ran it and got:

MSE: 0.05205989998796295
MSE: 0.049111817689211024
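
(As a side note, to distinguish the runaway-prediction symptom from #4026 from ordinary underfitting, a quick magnitude check along these lines should work; preds_local and actuals_local refer to the arrays computed in the reproduction script above.)

import numpy as np

# With the empty-child bug, predictions blow up far beyond the range of
# the targets, so a magnitude comparison makes the symptom obvious
# before even looking at the MSE.
print(f"max |target|:     {np.abs(actuals_local).max():.2f}")
print(f"max |prediction|: {np.abs(preds_local).max():.2f}")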

@xingyuansun
Author

Hi José, thank you for your reply! It turned out to be an installation issue: after re-installing LightGBM 3.2.1 with pip/conda, I get results like the ones you posted. Thanks for the help!
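
(For anyone who hits something similar: a quick way to confirm which LightGBM build your environment actually resolves to, an easy thing to get wrong when mixing pip and conda installs, is to print the version and import path.)

import lightgbm as lgb

# Print the version string and the on-disk location of the imported
# package; a stale or shadowed install shows up immediately here.
print(lgb.__version__)
print(lgb.__file__)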

@jameslamb
Collaborator

thanks for the help @jmoralez

@jameslamb jameslamb changed the title Parallel tree learner with Dask cannot overfit a small dataset [dask] Parallel tree learner with Dask cannot overfit a small dataset Jul 15, 2021
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023