Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-threading debugging and test #31

Merged
merged 6 commits into from
Aug 24, 2024
Merged

Conversation

ZidongLiu
Copy link
Owner

@ZidongLiu ZidongLiu commented Aug 18, 2024

Summary

  • use rng directly in torch.randn instead of relying on the global rng state
  • refactor server train one step to make it more readable
  • add option in server to switch between 1-by-1 and multi-thread
  • test 2 options under 1/5/10 perturb, 1/3/5 local updates, 1/3/5 clients. They should give same result

Result

Before PR:
image

After PR:
image

Test

image

@ZidongLiu ZidongLiu added the bug Something isn't working label Aug 18, 2024
@ZidongLiu ZidongLiu linked an issue Aug 18, 2024 that may be closed by this pull request
Copy link
Collaborator

@BichengYing BichengYing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions about option to switch between serial and parallel and test, you should split that server-client local update out of the main step

cezo_fl/client.py Show resolved Hide resolved
cezo_fl/client.py Show resolved Hide resolved
@ZidongLiu
Copy link
Owner Author

ZidongLiu commented Aug 18, 2024

Some suggestions about option to switch between serial and parallel and test, you should split that server-client local update out of the main step

@BichengYing Which step? Any code reference?

@BichengYing
Copy link
Collaborator

I mean this def train_one_step(self, iteration: int) -> tuple[float, float]: function. It becomes longer and longer (consider we may add some attack/defend aggregation in the future). It is a good time to split them more. For this PR specifically, this with ThreadPoolExecutor() as executor: part can be extracted out as another sub-function. It will also result in a more reasonable architecture for the switching the parallel and serial I think

@ZidongLiu ZidongLiu requested a review from BichengYing August 21, 2024 23:21
@ZidongLiu
Copy link
Owner Author

result running test
image

@ZidongLiu ZidongLiu marked this pull request as ready for review August 22, 2024 23:34
ZidongLiu added a commit that referenced this pull request Aug 23, 2024
broken test will be fixed in #31
cezo_fl/client.py Outdated Show resolved Hide resolved
cezo_fl/run_client_jobs.py Outdated Show resolved Hide resolved
@ZidongLiu
Copy link
Owner Author

ZidongLiu commented Aug 24, 2024

main train speed:
image

this branch parallel=True
image

this branch parallel=False
image

@ZidongLiu ZidongLiu merged commit d346c83 into main Aug 24, 2024
1 check passed
@ZidongLiu ZidongLiu deleted the fix-multi-thread-problem branch August 24, 2024 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential Corruption of the random seeds between the parallel jobs.
2 participants