Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pyspark] check use_qdm across all the workers #8496

Merged
merged 6 commits into from
Dec 8, 2022

Conversation

wbo4958
Copy link
Contributor

@wbo4958 wbo4958 commented Nov 29, 2022

This PR is a followup of #8471 which can't resolve the issue that some workers have cudf while some others don't.

This PR gathers use_qdm infos from all workers and does the cudf-available check.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Nov 29, 2022

@WeichenXu123 @trivialfis please help to review it

@wbo4958 wbo4958 mentioned this pull request Nov 30, 2022
9 tasks

messages = context.allGather(message=json.dumps(worker_message))

use_qdm = all(json.loads(x)["use_qdm"] for x in messages)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All workers should be consistent. Let's use a check instead of a workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't assume the user's env, especially for standalone mode. users need to install the xgboost dep in every worker manually. And this check is really cheap since the message is attached to the original allGather API getting rabit message. but it can really fix the "inconsistent" issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this PR also removes the first barrier() which is not needed, since allGather achieves the same goal.

Copy link
Member

@trivialfis trivialfis Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would like the users to have a consistent environment.

users need to install the xgboost dep in every worker manually

I'm sure a user can automate the process.

My suggestion is not to remove the allgather, but instead to assert that all workers have the same environment. If it doesn't, we will throw an exception and terminate the job.

Copy link
Contributor Author

@wbo4958 wbo4958 Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean "not to remove the allgather"?

@WeichenXu123 what's your opinion about "but instead to assert that all workers have the same environment. If it doesn't, we will throw an exception and terminate the job."

Copy link
Contributor

@WeichenXu123 WeichenXu123 Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can assume all worker has the same python environment.

We don't need to complicate the code for some very corner case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WeichenXu123, how about throwing an exception when detecting in-consistent env?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is simple and won't introduce any side-effects including extra overhead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wbo4958 Sounds good!

@wbo4958
Copy link
Contributor Author

wbo4958 commented Dec 7, 2022

@hcho3 @trivialfis please help to trigger the CI. Thx

@wbo4958
Copy link
Contributor Author

wbo4958 commented Dec 7, 2022

@WeichenXu123 @trivialfis please help to review it again. Thx

@wbo4958 wbo4958 requested review from trivialfis and WeichenXu123 and removed request for trivialfis and WeichenXu123 December 8, 2022 02:16
@trivialfis trivialfis merged commit 40a1a2f into dmlc:master Dec 8, 2022
@wbo4958 wbo4958 deleted the use-qdm branch April 23, 2024 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants