[pyspark] check use_qdm across all the workers #8496
Conversation
@WeichenXu123 @trivialfis please help to review it
python-package/xgboost/spark/core.py (Outdated):

    messages = context.allGather(message=json.dumps(worker_message))
    use_qdm = all(json.loads(x)["use_qdm"] for x in messages)
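For context, the gather-and-check pattern in this diff, extended with the fail-fast behavior the reviewers discuss below, can be sketched as follows. The helper name `check_use_qdm_consistency` is hypothetical, and the message list is simulated here; in the actual PR the JSON strings come from `BarrierTaskContext.allGather` inside a Spark barrier task.

```python
import json

def check_use_qdm_consistency(messages):
    """Decide use_qdm from JSON messages gathered from all barrier workers.

    `messages` stands in for the return value of allGather(). Instead of
    silently taking all(...), this sketch raises when workers disagree,
    matching the suggestion to terminate the job on an inconsistent env.
    """
    flags = [json.loads(m)["use_qdm"] for m in messages]
    if len(set(flags)) != 1:
        raise RuntimeError(f"Inconsistent use_qdm across workers: {flags}")
    return flags[0]
```

Because the flag rides along on a message that is already being exchanged, the consistency check adds no extra communication round.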
All workers should be consistent. Let's use a check instead of a workaround.
We can't assume the user's environment, especially in standalone mode, where users need to install the xgboost dependency on every worker manually. And this check is really cheap, since the message is attached to the existing allGather call that retrieves the rabit message, and it actually fixes the "inconsistent" issue.
This PR also removes the first barrier(), which is no longer needed since allGather achieves the same goal.
We would like the users to have a consistent environment.
users need to install the xgboost dep in every worker manually
I'm sure a user can automate the process.
My suggestion is not to remove the allGather, but instead to assert that all workers have the same environment. If they don't, we throw an exception and terminate the job.
what do you mean "not to remove the allgather"?
@WeichenXu123 what's your opinion about "but instead to assert that all workers have the same environment. If it doesn't, we will throw an exception and terminate the job."
I think we can assume all workers have the same Python environment.
We don't need to complicate the code for such a corner case.
@WeichenXu123, how about throwing an exception when an inconsistent env is detected?
The code is simple and won't introduce any side effects, including extra overhead.
@wbo4958 Sounds good!
@hcho3 @trivialfis please help to trigger the CI. Thx
@WeichenXu123 @trivialfis please help to review it again. Thx
This PR is a follow-up to #8471, which couldn't resolve the issue that some workers have cudf while others don't.
This PR gathers the use_qdm info from all workers and performs the cudf-availability check.