-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dask] [python] client.rebalance on dask ranker test #3892
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! This is one of those fun changes that looks super simple, but I know a ton of good research went into it in #3817 (comment).
Thanks very much 🚀
@@ -409,7 +418,7 @@ def test_ranker(output, client, listen_port, group): | |||
# have high rank correlation with scores from serial ranker. | |||
dcor = spearmanr(rnkvec_dask, y).correlation | |||
assert dcor > 0.6 | |||
assert spearmanr(rnkvec_dask, rnkvec_local).correlation > 0.75 | |||
assert spearmanr(rnkvec_dask, rnkvec_local).correlation > 0.8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! I'm comfortable with this, given the distributed you saw in #3817 (comment)
from distributed.utils_test import client, cluster_fixture, gen_cluster, loop | ||
from scipy.sparse import csr_matrix | ||
from sklearn.datasets import make_blobs, make_regression | ||
from sklearn.utils import check_random_state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh thanks for removing this. Didn't realize it was unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all credit goes to pycharm
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Further addresses #3817 by making
test_dask.py
a little bit more predictable. Previously thetest_ranker
test while usingoutput='array'
provided very unevengroup
allocations among its two test workers, unlike in the case of dask.dataframe input. Because#data
is so small (e.g. 100 rows), when one worker gets a small amount of data relative to the other(s), this can cause rather significant discrepancies between the Dask ranker and the standardLGBMRanker
. See comment for more background.Applying
client.rebalance
in this case (small data and uneven worker data distributions) helps make the tests a bit more predictable, tightening the distribution of observed spearman correlations withLGBMRanker
. Thanks for the patience to the maintainers!