Unet3d rcp fix v3.1 #327
Conversation
In mlcommons/training#523 the distributed sampler was enabled, which required switching to the `drop_last_batch` setting from the `pad_last_batch` setting used previously. This changed the `samples_per_epoch` for each global batch size and required a different set of RCPs for GBS values that do not divide the dataset evenly. Furthermore, in mlcommons/training#625 a bug was found in the implementation of the distributed sampler, which might have affected convergence. I regenerated all the RCPs (except for BS=2, which does not use the distributed sampler and divides the dataset evenly) to accommodate the changes.

The change for GBS=32 and GBS=80 seems substantial, but please keep in mind that the number of samples is more or less the same, so the total time to train should remain constant:

| GBS | Old samples/epoch | Old RCP mean | New samples/epoch | New RCP mean | Ratio | Expected ratio |
|----:|------------------:|-------------:|------------------:|-------------:|------:|:---------------|
| 32  | 192 | 1974    | 160 | 2409    | 1.22 | 1.2 (192/160) |
| 56  | 168 | 2213    | 168 | 2300    | 1.04 | 1.00 |
| 64  | 192 | ---     | 128 | 3270.77 | ---  | 1.5 (expected old RCP mean: 2180.51) |
| 80  | 240 | 1618.75 | 160 | 2462.51 | 1.52 | 1.5 |
| 84  | --- | 2233.25 | --- | 2308.70 | 1.03 | 1.0 |
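For context, here is a minimal sketch (not the reference implementation's code) of how the two sampler settings change the effective samples per epoch, and therefore the expected ratio between the old and new RCP means. It assumes a training split of 168 cases, which is consistent with the numbers in the table above.

```python
import math

DATASET_SIZE = 168  # KiTS19 training cases (assumption, consistent with the table above)

def samples_per_epoch(global_batch_size: int, drop_last: bool) -> int:
    """Effective number of samples seen per epoch for a given global batch size."""
    if drop_last:
        # drop_last_batch: the incomplete final batch is discarded
        return (DATASET_SIZE // global_batch_size) * global_batch_size
    # pad_last_batch: the final batch is padded up to a full batch
    return math.ceil(DATASET_SIZE / global_batch_size) * global_batch_size

for gbs in (32, 56, 64, 80, 84):
    old = samples_per_epoch(gbs, drop_last=False)  # pad_last_batch (old RCPs)
    new = samples_per_epoch(gbs, drop_last=True)   # drop_last_batch (new RCPs)
    # If samples-to-converge stays roughly constant, the RCP mean in epochs
    # is expected to grow by old/new relative to the old value.
    print(f"GBS={gbs}: old={old}, new={new}, expected ratio={old / new:.2f}")
```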
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Training WG 08/17: Generally accepted as ok in the meeting.
@mmarcinkiewicz Is this RCP update meant for training v3.1? In that case, could you update your branch and move the changes into the training-3.1.0 folder?
Merge commit: …to unet3d-rcp-fix-v3.1 (conflicts in `mlperf_logging/rcp_checker/training_3.0.0/rcps_unet3d.json`)
@pgmpablo157321 changes have been moved to the 3.1.0 folder. Please take a look.
LGTM
UNET3D RCPs don't exhibit the typical convex curve on a plot of epochs to converge vs. GBS. Instead, convergence is a bit chaotic - a result of the dataset being very small and of training running at global batch sizes that are a large fraction of the dataset (1/3 or even 1/2). Because of that, RCP pruning doesn't work as expected and can potentially pass questionable results. Here's the current plot of epochs to converge vs. GBS:
My proposal is to replace epochs to converge with samples to converge (as BERT and LLM currently do). To do that, I simply multiply the epochs to converge by the samples per epoch (sketched below). The new plot would look like:
The behaviour is much more stable, with no visible dips for the RCP pruning to malfunction on.
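If helpful, here is a rough sketch of the proposed conversion. The field names (`bs`, `epochs_to_converge`, `samples_to_converge`) are illustrative placeholders, not the actual `rcps_unet3d.json` schema, and the dataset size is an assumption.

```python
DATASET_SIZE = 168  # KiTS19 training cases (assumption)

def to_samples_to_converge(rcp: dict) -> dict:
    """Convert an epochs-based RCP record to a samples-based one (illustrative schema)."""
    gbs = rcp["bs"]
    # With drop_last_batch the incomplete final batch is discarded.
    samples_per_epoch = (DATASET_SIZE // gbs) * gbs
    rcp["samples_to_converge"] = [e * samples_per_epoch for e in rcp["epochs_to_converge"]]
    return rcp

# Example: a GBS=32 run that converged in 15 epochs -> 15 * 160 = 2400 samples.
print(to_samples_to_converge({"bs": 32, "epochs_to_converge": [15]}))
```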
If the PR is approved, I'll prepare the appropriate changes to the rules and the reference.