Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) #1453

Merged: 91 commits, Jan 21, 2022

Commits on Oct 12, 2021

  1. Changes for bfloat16 Zero2

    raamjad authored and Justin Chiu committed Oct 12, 2021
    fe26423
  2. ZeRO stage3 optimizations, with some bug fixes

    optimizations for stage3:
    - prefetching improvements
    - batching allgather calls to amortize fixed overhead and improve
      bandwidth utilization
    - batching reduce_scatter calls to amortize fixed overhead and
      improve bandwidth utilization
    - using *_base variants of allgather and reduce scatter to reduce memory
      allocations and data movement
    - more fine grained synchronization for communication that allows
      blocking on less work
    - precomputation of fetching code - using a fetch queue rather than
      deciding what to (pre)fetch at each iteration
    - limiting queued coalesced communication ops to reduce memory pressure
      on pytorch cuda caching allocator (not elegant solution)
    
    optimizations for stage3-offload:
    - made some host-device tensor copies async to improve performance
    
    bug fixes and qol improvements:
    - fix init context method when parent modules modify child weights
    - speed up model initialization by moving model to GPU before weight
      initialization
    - fixed unit test imports so that unit tests can be run from any
      directory
    - change performance logging to include memory consumption
    - add logging w/ model size when done partitioning model
    
    new features
    - bfloat16 support for ZeRO 3
    Justin Chiu committed Oct 12, 2021
    8864f91
  3. fix import in ut

    Justin Chiu committed Oct 12, 2021
    e66aedc
  4. ran yapf

    Justin Chiu committed Oct 12, 2021
    350a7a0
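
The second commit above lists "batching allgather calls to amortize fixed overhead". A minimal single-process sketch of that idea follows. It simulates ranks with plain Python lists instead of calling a real collective, and every name in it (`FakeComm`, `allgather_per_param`, `allgather_coalesced`) is hypothetical and not taken from the PR. It also assumes each parameter is split into equal-sized shards across ranks.

```python
from typing import Dict, List

Shards = Dict[str, List[List[float]]]  # param name -> one shard per rank


class FakeComm:
    """Stand-in for a communicator; counts simulated collective launches."""

    def __init__(self) -> None:
        self.calls = 0

    def allgather(self, per_rank: List[List[float]]) -> List[List[float]]:
        # Every rank ends up with a copy of every rank's buffer.
        self.calls += 1
        return [list(buf) for buf in per_rank]


def allgather_per_param(comm: FakeComm, shards: Shards) -> Dict[str, List[float]]:
    """Baseline: one collective per parameter."""
    full = {}
    for name, per_rank in shards.items():
        gathered = comm.allgather(per_rank)
        full[name] = [x for buf in gathered for x in buf]
    return full


def allgather_coalesced(comm: FakeComm, shards: Shards) -> Dict[str, List[float]]:
    """Batched: pack all shards into one flat buffer per rank, gather once."""
    names = list(shards)
    world_size = len(shards[names[0]])
    flat = [[x for n in names for x in shards[n][r]] for r in range(world_size)]
    gathered = comm.allgather(flat)  # single launch for all parameters
    full: Dict[str, List[float]] = {n: [] for n in names}
    for buf in gathered:  # unpack each rank's flat buffer in packing order
        offset = 0
        for n in names:
            size = len(shards[n][0])  # assumes equal shard sizes across ranks
            full[n].extend(buf[offset:offset + size])
            offset += size
    return full
```

Both functions return the same full parameters; the coalesced version pays the per-call launch cost once instead of once per parameter.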

Commits on Oct 13, 2021

  1. b37a4f0

Commits on Oct 14, 2021

  1. improvements to cache flush warn log

    Justin Chiu committed Oct 14, 2021
    f383947
  2. backwards compatibility with older versions of pytorch

    Justin Chiu committed Oct 14, 2021
    b2a1c95
  3. d8678fa
  4. a0faca0
  5. removed unnecessary barrier call

    Justin Chiu committed Oct 14, 2021
    bf20c90
  6. a353017
  7. formatting fix after resolving merge conflict

    Justin Chiu committed Oct 14, 2021
    c51ba46
  8. skip nvme prefetch when trace not complete

    Justin Chiu committed Oct 14, 2021
    ff01f5c

Commits on Oct 15, 2021

  1. opportunistically avoid memory allocation in allgather coalesced where possible

    Justin Chiu committed Oct 15, 2021
    13093eb

Commits on Oct 20, 2021

  1. 3cdcbdf

Commits on Oct 21, 2021

  1. 64d74d1

Commits on Oct 22, 2021

  1. e30e6cc
  2. fix indentation after merge

    Justin Chiu committed Oct 22, 2021
    f19593d
  3. fixes to account for parameter offload

    Justin Chiu committed Oct 22, 2021
    f72bc78
  4. 660df05
  5. moved partition_all_params to optimizer step

    Justin Chiu committed Oct 22, 2021
    4f9477f

Commits on Oct 26, 2021

  1. 818651c
  2. f681201
  3. allgathering on params before item gets called

    Justin Chiu committed Oct 26, 2021
    bb34f90
  4. fix param status checks

    needed after moving partition_all_parameters call to optimizer step
    Justin Chiu committed Oct 26, 2021
    9f3b504
  5. fix grad accumulation with optimizer offload

    Justin Chiu committed Oct 26, 2021
    1772d41
  6. grad norm computation fix for optimizer offload

    Justin Chiu committed Oct 26, 2021
    5f213d8
  7. change post divide in reduce-scatter to pre divide

    Justin Chiu committed Oct 26, 2021
    3198805
  8. fix gradient race condition w/ optimizer offload

    Justin Chiu committed Oct 26, 2021
    2225659
  9. improve inf/nan gradient tracking

    Justin Chiu committed Oct 26, 2021
    5aa9bd5
  10. don't prefetch when not in training mode

    Justin Chiu committed Oct 26, 2021
    a1a60ed
  11. format fix after merging

    Justin Chiu committed Oct 26, 2021
    df41659
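
The commit "change post divide in reduce-scatter to pre divide" above changes when the averaging division is applied. The commit does not state its motivation. One plausible reason, which is an assumption here, is that dividing before the sum keeps 16-bit partial sums further from overflow. The sketch below illustrates that effect by rounding every intermediate value through IEEE half precision with Python's `struct` module. `to_fp16`, `reduce_post_divide`, and `reduce_pre_divide` are hypothetical helpers, not DeepSpeed code, and a real collective may accumulate in a different order or precision.

```python
import math
import struct
from typing import List


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; values beyond its range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def reduce_post_divide(grads: List[float]) -> float:
    """Sum the per-rank gradients in fp16, then divide by the rank count."""
    acc = 0.0
    for g in grads:
        acc = to_fp16(acc + g)
    return to_fp16(acc / len(grads))


def reduce_pre_divide(grads: List[float]) -> float:
    """Divide each per-rank gradient by the rank count, then sum in fp16."""
    n = len(grads)
    acc = 0.0
    for g in grads:
        acc = to_fp16(acc + to_fp16(g / n))
    return acc


# Four simulated ranks, each holding a large (but representable) fp16 gradient.
per_rank = [32768.0] * 4
print(reduce_post_divide(per_rank))  # partial sum exceeds the fp16 range
print(reduce_pre_divide(per_rank))   # partial sums stay in range
```

The trade-off is that pre-division shrinks small gradients further toward underflow, so which order is preferable depends on the gradient magnitudes involved.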

Commits on Oct 27, 2021

  1. fix prefetching issue when using NVME offload

    Justin Chiu committed Oct 27, 2021
    ab3a82a

Commits on Oct 29, 2021

  1. 025a41e

Commits on Nov 1, 2021

  1. 6f9415b

Commits on Nov 2, 2021

  1. 8d12281
  2. improved defragmentation for fp16 parameters

    Justin Chiu committed Nov 2, 2021
    a26d1fb
  3. relative imports for bf16 tests

    Justin Chiu committed Nov 2, 2021
    937f04e
  4. changes for bwd compatibility with pytorch 1.2

    Justin Chiu committed Nov 2, 2021
    e74f509
  5. remove buffered_reduce_fallback

    Justin Chiu committed Nov 2, 2021
    6ee558d

Commits on Nov 3, 2021

  1. removed unused parameter offset bookkeeping

    Justin Chiu committed Nov 3, 2021
    14e22a2
  2. fixed tracking for multiple param groups

    Justin Chiu committed Nov 3, 2021
    16281df
  3. 38af6b1
  4. unbroke bfloat16 config after merge conflict

    Justin Chiu committed Nov 3, 2021
    cc7011e
  5. using base allgather params when only 1 param

    Justin Chiu committed Nov 3, 2021
    806b072
  6. cleanup/fixes for fp16 partition defragmentation

    Justin Chiu committed Nov 3, 2021
    bf0dd66

Commits on Nov 5, 2021

  1. 73207ae
  2. d3ecb1f

Commits on Nov 11, 2021

  1. 812fe67

Commits on Nov 18, 2021

  1. switch to CRLF

    jeffra committed Nov 18, 2021
    6dc21a6
  2. 2a38302
  3. align new line with master

    jeffra committed Nov 18, 2021
    16f1d21

Commits on Nov 23, 2021

  1. 11d590a
  2. Fix merge issues

    tjruwase committed Nov 23, 2021
    2b5f6ea

Commits on Nov 24, 2021

  1. 80b53d3
  2. 6dfe693

Commits on Nov 29, 2021

  1. switch to CRLF

    jeffra committed Nov 29, 2021
    912e6f0

Commits on Nov 30, 2021

  1. fix to LF line endings

    jeffra committed Nov 30, 2021
    4b0133b
  2. minor merge fixes

    jeffra committed Nov 30, 2021
    b998206
  3. remove extra bfloat16_enabled definition

    Justin Chiu committed Nov 30, 2021
    d6deecb
  4. asserting params inflight for AllGatherHandle

    Justin Chiu committed Nov 30, 2021
    2a4ef29
  5. remove get_cuda_mem_allocated_str

    Justin Chiu committed Nov 30, 2021
    90182b6

Commits on Dec 8, 2021

  1. ad847ed
  2. Format fixes

    tjruwase committed Dec 8, 2021
    f590ba4
  3. 9db815f
  4. 259ec15

Commits on Dec 9, 2021

  1. Add self.reduce_scatter

    tjruwase committed Dec 9, 2021
    96d2247
  2. 2630b75

Commits on Dec 11, 2021

  1. 79fd42c

Commits on Dec 14, 2021

  1. 8565e04

Commits on Dec 30, 2021

  1. 06eab1a
  2. Format fix

    tjruwase committed Dec 30, 2021
    0f8affe
  3. 3436422
  4. Fix merge issues

    tjruwase committed Dec 30, 2021
    601d1f1
  5. 5dcee36

Commits on Jan 3, 2022

  1. 580d25e

Commits on Jan 7, 2022

  1. 872f451

Commits on Jan 10, 2022

  1. e236293

Commits on Jan 11, 2022

  1. 43b3b83

Commits on Jan 12, 2022

  1. 83905ac
  2. 31aecfc

Commits on Jan 14, 2022

  1. add some TODOs

    Justin Chiu committed Jan 14, 2022
    8736700
  2. 516379d

Commits on Jan 19, 2022

  1. remove unnecessary division by micro_step_id

    Justin Chiu committed Jan 19, 2022
    0bf7bcd
  2. rename config keys "bfloat16" -> "bf16"

    Justin Chiu committed Jan 19, 2022
    43c00ff
  3. rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save

    Justin Chiu committed Jan 19, 2022
    4574bc7
  4. e04dc6a
  5. added test to confirm bf16 key bwd compatibility

    Justin Chiu committed Jan 19, 2022
    391cecf
  6. 3d26469
  7. Format fixes

    tjruwase committed Jan 19, 2022
    536d171
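
The Jan 19 commits above rename the `bfloat16` config key to `bf16` and `stage3_gather_fp16_weights_on_model_save` to `stage3_gather_16bit_weights_on_model_save`, and add a test for backward compatibility of the bf16 key. A minimal sketch of what a fallback lookup for renamed keys could look like follows. The helpers `get_with_alias`, `read_bf16_enabled`, and `read_gather_16bit` are hypothetical and are not DeepSpeed's implementation, and treating the old stage 3 key as a still-accepted alias is an assumption that the commit titles do not confirm.

```python
from typing import Any, Mapping


def get_with_alias(section: Mapping[str, Any], new_key: str, old_key: str,
                   default: Any = None) -> Any:
    """Prefer the new key; fall back to the old key, then to the default."""
    if new_key in section:
        return section[new_key]
    return section.get(old_key, default)


def read_bf16_enabled(config: Mapping[str, Any]) -> bool:
    """Read bf16.enabled, accepting the older 'bfloat16' section name."""
    section = get_with_alias(config, "bf16", "bfloat16", default={})
    return bool(section.get("enabled", False))


def read_gather_16bit(config: Mapping[str, Any]) -> bool:
    """Read the stage 3 gather-on-save flag under either key (assumed alias)."""
    zero = config.get("zero_optimization", {})
    return bool(get_with_alias(
        zero,
        "stage3_gather_16bit_weights_on_model_save",
        "stage3_gather_fp16_weights_on_model_save",
        default=False,
    ))


# Config using the renamed keys.
new_style = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

# Config using the keys as they were named before the renames.
old_style = {
    "bfloat16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": True,
    },
}
```

Checking the new key first means a config that sets both spellings resolves to the new one.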

Commits on Jan 20, 2022

  1. 19f3538