Fix training related failures #616

Merged 11 commits into develop from 610-training-timeout-julien on Feb 8, 2024

Conversation

@JulienVig (Collaborator) commented on Feb 1, 2024

Fixes multiple relatively small bugs that occur when training with DISCO.

  • Fixes #612 (Fix local test failure)
    One local test was failing with a timeout on my laptop. It turns out the test timeout was too short for my humble machine to finish the test, so I simply increased the timeout value (see the test-timeout sketch after this list).

  • Fixes #615 (Training on Titanic has a NaN loss)
    The training loss was NaN when training on the Titanic task. There were two issues:

    1. The tabular preprocessing was effectively empty (cf. the issue for more details). I added a placeholder preprocessing step to handle missing values and ensure there are no NaNs in the inputs (see the preprocessing sketch after this list).
    2. The default learning rate was too high for the small amount of data and caused the weights to diverge. I decreased it from 0.01 to 0.001.
      Before, the training accuracy was constant at 0.61 with a NaN loss; now the training accuracy (as well as the validation accuracy) increases throughout the epochs while the loss decreases monotonically, both from a Node.js script and from the web client.
  • Fixes #610 (Train Collaboratively raises error during training)
    See the issue for a detailed description of the bug and the fix. While the client used to send its weights and then request the new weights in a separate message, the server now automatically sends back the new weights after receiving the local contribution (see the message-flow sketch after this list). This change saves one RTT and, importantly, ensures that events happen in the right order, which previously caused timeouts.
    Training collaboratively no longer times out. The training accuracy now increases monotonically to 70%, whereas it previously stayed constant at 61% with a NaN loss.
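Test-timeout sketch: a minimal illustration of raising a per-test timeout, assuming a Mocha/Chai test suite; the suite name, test body, and 60-second value are placeholders, not the actual values changed in this PR.

```ts
// Hedged sketch: a Mocha test with an increased per-test timeout. The suite
// name, test body and 60s value are placeholders, not the PR's actual values.
import { expect } from 'chai'

describe('local training', function () {
  // Give slower machines enough time to finish the end-to-end training test.
  this.timeout(60_000)

  it('reaches a reasonable accuracy', async function () {
    const accuracy = 0.7 // stand-in for the result of an actual training run
    expect(accuracy).to.be.greaterThan(0.6)
  })
})
```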
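Preprocessing sketch: a minimal illustration of a placeholder preprocessing step that imputes missing tabular values so no NaN reaches the model; the names (`TabularRow`, `fillMissingValues`) and the fill value are assumptions for illustration, not DISCO's actual API.

```ts
// Hedged sketch of a placeholder tabular preprocessing step: replace missing or
// non-numeric entries so that no NaN reaches the model. The names and the fill
// value 0 are illustrative, not DISCO's API.

type TabularRow = Record<string, number | string | null | undefined>

function fillMissingValues (row: TabularRow, fillValue = 0): Record<string, number> {
  const cleaned: Record<string, number> = {}
  for (const [feature, value] of Object.entries(row)) {
    const parsed = typeof value === 'number' ? value : Number(value)
    // Missing or non-numeric fields fall back to the placeholder value,
    // so the tensors built from the rows never contain NaN.
    cleaned[feature] = Number.isFinite(parsed) ? parsed : fillValue
  }
  return cleaned
}

// Example: the Titanic 'Age' column is often empty in the raw CSV.
console.log(fillMissingValues({ Age: undefined, Fare: '7.25', Pclass: 3 }))
// -> { Age: 0, Fare: 7.25, Pclass: 3 }
```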
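Message-flow sketch: an illustration of the changed round-trip, written against a bare WebSocket server (the `ws` package) rather than DISCO's actual server code; the message shapes, the toy averaging aggregation, and the port are assumptions for illustration.

```ts
// Hedged sketch of the new message flow on a bare WebSocket server (the 'ws'
// package). The message shapes, toy averaging and port are assumptions; DISCO's
// real server and aggregation logic are more involved.
import { WebSocketServer } from 'ws'

interface Contribution { type: 'contribution', round: number, weights: number[] }

// Toy aggregation: running average of all contributions received for a round.
const rounds = new Map<number, { sum: number[], count: number }>()
function aggregate (round: number, weights: number[]): number[] {
  const entry = rounds.get(round) ?? { sum: new Array<number>(weights.length).fill(0), count: 0 }
  weights.forEach((w, i) => { entry.sum[i] += w })
  entry.count += 1
  rounds.set(round, entry)
  return entry.sum.map((s) => s / entry.count)
}

new WebSocketServer({ port: 8080 }).on('connection', (socket) => {
  socket.on('message', (data) => {
    const msg = JSON.parse(data.toString()) as Contribution
    if (msg.type !== 'contribution') return
    // Previously the client pushed its weights and then pulled the new global
    // weights with a separate request. Answering the contribution directly with
    // the aggregated weights saves one RTT and keeps the events ordered.
    const weights = aggregate(msg.round, msg.weights)
    socket.send(JSON.stringify({ type: 'newWeights', round: msg.round, weights }))
  })
})
```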

@JulienVig added the bug (Something isn't working), discojs (Related to Disco.js) and server (Related to the server) labels on Feb 1, 2024
@JulienVig self-assigned this on Feb 1, 2024
@JulienVig linked an issue on Feb 1, 2024 that may be closed by this pull request
@JulienVig marked this pull request as ready for review on February 6, 2024 at 13:57
@tharvik (Collaborator) left a comment:

small non-blocking comments, merge at will 🚀

JulienVig and others added 3 commits on February 8, 2024 at 12:30
Co-authored-by: Valérian Rousset <tharvik@users.noreply.github.com>
@JulienVig merged commit 27fb55e into develop on Feb 8, 2024
17 checks passed
@JulienVig deleted the 610-training-timeout-julien branch on February 8, 2024 at 12:07