Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) #1453
Commits on Oct 12, 2021
- fe26423
- 8864f91: ZeRO stage3 optimizations, with some bug fixes (Justin Chiu committed Oct 12, 2021)
  optimizations for stage3:
  - prefetching improvements
  - batching allgather calls to amortize fixed overhead and improve bandwidth utilization
  - batching reduce_scatter calls to amortize fixed overhead and improve bandwidth utilization
  - using *_base variants of allgather and reduce scatter to reduce memory allocations and data movement
  - more fine grained synchronization for communication that allows blocking on less work
  - precomputation of fetching code
  - using a fetch queue rather than deciding what to (pre)fetch at each iteration
  - limiting queued coalesced communication ops to reduce memory pressure on pytorch cuda caching allocator (not elegant solution)
  optimizations for stage3-offload:
  - made some host-device tensor copies async to improve performance
  bug fixes and qol improvements:
  - fix init context method when parent modules modify child weights
  - speed up model initialization by moving model to GPU before weight initialization
  - fixed unit test imports so that unit tests can be run from any directory
  - change performance logging to include memory consumption
  - add logging w/ model size when done partitioning model
  new features:
  - bfloat16 support for ZeRO 3
- e66aedc (Justin Chiu committed Oct 12, 2021)
- 350a7a0 (Justin Chiu committed Oct 12, 2021)
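The 8864f91 message mentions "using a fetch queue rather than deciding what to (pre)fetch at each iteration". A minimal sketch of that idea follows, assuming the module execution order has already been recorded as a trace. The `FetchQueue` class, its method names, and the module names are hypothetical and are not taken from the PR; a real implementation would issue allgather calls where this sketch only records names.

```python
from collections import deque
from typing import Deque, List, Set


class FetchQueue:
    """Hypothetical prefetcher driven by a module order recorded ahead of time."""

    def __init__(self, trace: List[str], prefetch_depth: int) -> None:
        self._pending: Deque[str] = deque(trace)  # not yet fetched, in use order
        self._ready: Set[str] = set()             # fetched ahead of use
        self._depth = prefetch_depth

    def step(self, current: str) -> List[str]:
        """Return the fetches issued when `current` is about to run."""
        issued: List[str] = []
        if current in self._ready:
            self._ready.remove(current)    # already prefetched; nothing to issue
        else:
            self._pending.remove(current)  # not prefetched; fetch it now
            issued.append(current)
        # Top up the lookahead window from the front of the queue.
        while self._pending and len(self._ready) < self._depth:
            upcoming = self._pending.popleft()
            self._ready.add(upcoming)
            issued.append(upcoming)
        return issued


queue = FetchQueue(["embed", "layer0", "layer1", "head"], prefetch_depth=2)
first = queue.step("embed")    # fetches "embed" plus two modules of lookahead
second = queue.step("layer0")  # "layer0" is ready; only the window is topped up
```

Because the order is fixed up front, each step only pops from the queue instead of re-deciding what to fetch.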
Commits on Oct 13, 2021
- b37a4f0
Commits on Oct 14, 2021
- f383947: improvements to cache flush warn log (Justin Chiu committed Oct 14, 2021)
- b2a1c95: backwards compatibility with older versions of pytorch (Justin Chiu committed Oct 14, 2021)
- d8678fa: handle edge case where reduced tensor smaller than world size (Justin Chiu committed Oct 14, 2021)
- a0faca0: moved event synchronization to allgather handle wait() call (Justin Chiu committed Oct 14, 2021)
- bf20c90: removed unnecessary barrier call (Justin Chiu committed Oct 14, 2021)
- a353017
- c51ba46: formatting fix after resolving merge conflict (Justin Chiu committed Oct 14, 2021)
- ff01f5c: skip nvme prefetch when trace not complete (Justin Chiu committed Oct 14, 2021)
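Commit d8678fa refers to a reduced tensor that is smaller than the world size. The commit list does not show how the PR handles this, so the following is only a sketch of one common approach: pad the flat buffer with zeros so that it splits into equal per-rank chunks. The function name `partition_with_padding` is hypothetical, and plain Python lists stand in for tensors.

```python
import math
from typing import List


def partition_with_padding(values: List[float], world_size: int) -> List[List[float]]:
    """Split `values` into `world_size` equal chunks, zero-padding the tail."""
    # Each rank must receive the same number of elements, even if that
    # means some ranks receive only padding.
    chunk = math.ceil(len(values) / world_size)
    padded = values + [0.0] * (chunk * world_size - len(values))
    return [padded[rank * chunk:(rank + 1) * chunk] for rank in range(world_size)]


# Two elements across four ranks: ranks 2 and 3 hold only padding.
small = partition_with_padding([1.0, 2.0], world_size=4)
# Five elements across four ranks: chunk size 2, three padding zeros.
uneven = partition_with_padding([1.0, 2.0, 3.0, 4.0, 5.0], world_size=4)
```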
Commits on Oct 15, 2021
- 13093eb: opportunistically avoid memory allocation in allgather coalesced where possible (Justin Chiu committed Oct 15, 2021)
Commits on Oct 20, 2021
- 3cdcbdf
Commits on Oct 21, 2021
- 64d74d1
Commits on Oct 22, 2021
- e30e6cc
- f19593d (Justin Chiu committed Oct 22, 2021)
- f72bc78: fixes to account for parameter offload (Justin Chiu committed Oct 22, 2021)
- 660df05: accounting for torch.cuda.memory_stats not being available (Justin Chiu committed Oct 22, 2021)
- 4f9477f: moved partition_all_params to optimizer step (Justin Chiu committed Oct 22, 2021)
Commits on Oct 26, 2021
- 818651c
- f681201
- bb34f90: allgathering on params before item gets called (Justin Chiu committed Oct 26, 2021)
- 9f3b504: needed after moving partition_all_parameters call to optimizer step (Justin Chiu committed Oct 26, 2021)
- 1772d41: fix grad accumulation with optimizer offload (Justin Chiu committed Oct 26, 2021)
- 5f213d8: grad norm computation fix for optimizer offload (Justin Chiu committed Oct 26, 2021)
- 3198805: change post divide in reduce-scatter to pre divide (Justin Chiu committed Oct 26, 2021)
- 2225659: fix gradient race condition w/ optimizer offload (Justin Chiu committed Oct 26, 2021)
- 5aa9bd5: improve inf/nan gradient tracking (Justin Chiu committed Oct 26, 2021)
- a1a60ed: don't prefetch when not in training mode (Justin Chiu committed Oct 26, 2021)
- df41659 (Justin Chiu committed Oct 26, 2021)
Commits on Oct 27, 2021
- ab3a82a: fix prefetching issue when using NVME offload (Justin Chiu committed Oct 27, 2021)
Commits on Oct 29, 2021
- 025a41e
Commits on Nov 1, 2021
- 6f9415b
Commits on Nov 2, 2021
- 8d12281
- a26d1fb: improved defragmentation for fp16 parameters (Justin Chiu committed Nov 2, 2021)
- 937f04e: relative imports for bf16 tests (Justin Chiu committed Nov 2, 2021)
- e74f509: changes for bwd compatibility with pytorch 1.2 (Justin Chiu committed Nov 2, 2021)
- 6ee558d: remove buffered_reduce_fallback (Justin Chiu committed Nov 2, 2021)
Commits on Nov 3, 2021
- 14e22a2: removed unused parameter offset bookkeeping (Justin Chiu committed Nov 3, 2021)
- 16281df: fixed tracking for multiple param groups (Justin Chiu committed Nov 3, 2021)
- 38af6b1
- cc7011e: unbroke bfloat16 config after merge conflict (Justin Chiu committed Nov 3, 2021)
- 806b072: using base allgather params when only 1 param (Justin Chiu committed Nov 3, 2021)
- bf0dd66: cleanup/fixes for fp16 partition defragmentation (Justin Chiu committed Nov 3, 2021)
Commits on Nov 5, 2021
- 73207ae
- d3ecb1f
Commits on Nov 11, 2021
- 812fe67
Commits on Nov 18, 2021
- 6dc21a6
- 2a38302
- 16f1d21
Commits on Nov 23, 2021
- 11d590a
- 2b5f6ea
Commits on Nov 24, 2021
- 80b53d3
- 6dfe693
Commits on Nov 29, 2021
- 912e6f0
Commits on Nov 30, 2021
- 4b0133b
- b998206
- d6deecb: remove extra bfloat16_enabled definition (Justin Chiu committed Nov 30, 2021)
- 2a4ef29: asserting params inflight for AllGatherHandle (Justin Chiu committed Nov 30, 2021)
- 90182b6: remove get_cuda_mem_allocated_str (Justin Chiu committed Nov 30, 2021)
Commits on Dec 8, 2021
- ad847ed
- f590ba4
- 9db815f: fix bfloat16 zero stage check (broken after merge commit) (Justin Chiu committed Dec 8, 2021)
- 259ec15
Commits on Dec 9, 2021
- 96d2247
- 2630b75
Commits on Dec 11, 2021
- 79fd42c
Commits on Dec 14, 2021
- 8565e04
Commits on Dec 30, 2021
- 06eab1a
- 0f8affe
- 3436422
- 601d1f1
- 5dcee36
Commits on Jan 3, 2022
- 580d25e
Commits on Jan 7, 2022
- 872f451
Commits on Jan 10, 2022
- e236293
Commits on Jan 11, 2022
- 43b3b83
Commits on Jan 12, 2022
- 83905ac
- 31aecfc: iterate over params_to_fetch rather than make another iterator (Justin Chiu committed Jan 12, 2022)
Commits on Jan 14, 2022
- 8736700 (Justin Chiu committed Jan 14, 2022)
- 516379d
Commits on Jan 19, 2022
- 0bf7bcd: remove unnecessary division by micro_step_id (Justin Chiu committed Jan 19, 2022)
- 43c00ff: rename config keys "bfloat16" -> "bf16" (Justin Chiu committed Jan 19, 2022)
- 4574bc7: rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save (Justin Chiu committed Jan 19, 2022)
- e04dc6a: add unit test to check backwards compatibility for gather_16bit_weights (Justin Chiu committed Jan 19, 2022)
- 391cecf: added test to confirm bf16 key bwd compatibility (Justin Chiu committed Jan 19, 2022)
- 3d26469
- 536d171
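The Jan 19 commits rename the `bfloat16` config key to `bf16` and `stage3_gather_fp16_weights_on_model_save` to `stage3_gather_16bit_weights_on_model_save`, and add tests for backwards compatibility. As a sketch of what a config using the new names might look like, the dictionary below enables bf16 with ZeRO stage 3. The `upgrade_keys` helper is hypothetical and only illustrates one way old key names could be mapped to new ones; it is not the PR's compatibility code, and the batch-size value is a placeholder.

```python
import json

# Config using the renamed keys (batch size is a placeholder value).
zero3_bf16_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

# Old name -> new name, taken from the commit messages above.
RENAMED_TOP_LEVEL = {"bfloat16": "bf16"}
RENAMED_ZERO = {
    "stage3_gather_fp16_weights_on_model_save":
        "stage3_gather_16bit_weights_on_model_save",
}


def upgrade_keys(config: dict) -> dict:
    """Hypothetical helper: return a copy of `config` with old key names replaced."""
    upgraded = {RENAMED_TOP_LEVEL.get(key, key): value for key, value in config.items()}
    zero = upgraded.get("zero_optimization")
    if isinstance(zero, dict):
        upgraded["zero_optimization"] = {
            RENAMED_ZERO.get(key, key): value for key, value in zero.items()
        }
    return upgraded


old_style = {
    "train_micro_batch_size_per_gpu": 1,
    "bfloat16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": True,
    },
}
print(json.dumps(upgrade_keys(old_style), indent=2))
```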
Commits on Jan 20, 2022
- 19f3538