feature(wgt): add barrier middleware #570

Merged
merged 4 commits into from
Apr 16, 2023

Conversation

SolenoidWGT
Collaborator

@SolenoidWGT SolenoidWGT commented Jan 11, 2023

Description

Barrier is a middleware for debugging and profiling. It synchronizes the task step of every process within the scope of all visible processes. When using Barrier, pay attention to the following points:

  1. All processes must call Barrier the same number of times; otherwise a deadlock occurs.

  2. 'attch_from_nums' is an important parameter. It gives the number of times the current process will be attached to by other processes (that is, the number of incoming connections established). For example:

        Node0: address: 127.0.0.1:12345, attach_to = []
        Node1: address: 127.0.0.1:12346, attach_to = ["tcp://127.0.0.1:12345"]

        For Node0, the 'attch_from_nums' value is 1. (It will be attached to by Node1)
        For Node1, the 'attch_from_nums' value is 0. (No one will attach to Node1)

Please note that this value must be set correctly. Otherwise, a node whose 'attach_to' list is empty cannot know how many processes will establish connections with it, and no form of synchronization can be performed.

  3. Barrier is thread-safe, but using it from multiple threads is not recommended. You need to carefully count the number of times each thread calls Barrier to avoid deadlock.

  4. Do not use Barrier in normal training tasks. It forces step synchronization across all processes, which greatly reduces training efficiency. In addition, if your training task has dynamic processes, do not use Barrier, in order to avoid deadlock.
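
As a rough illustration of point 2, the 'attch_from_nums' value for each node can be derived from the full 'attach_to' topology by counting incoming connections. This is only a sketch and not part of the PR: the helper `count_attach_from` and the dictionary layout are hypothetical, and it assumes every 'attach_to' entry has the form "tcp://<address>" as in the example above.

```python
from collections import Counter
from typing import Dict, List


def count_attach_from(addresses: Dict[str, str], attach_to: Dict[str, List[str]]) -> Dict[str, int]:
    """Hypothetical helper: count, for each node, how many nodes attach to it.

    addresses maps a node name to its "host:port" address.
    attach_to maps a node name to the list of "tcp://host:port" URLs it connects to.
    """
    # Every URL that appears in some attach_to list is one incoming connection
    # for the node that owns that address.
    incoming = Counter(url for targets in attach_to.values() for url in targets)
    return {node: incoming.get("tcp://" + addr, 0) for node, addr in addresses.items()}


# Topology from the example in point 2.
addresses = {"Node0": "127.0.0.1:12345", "Node1": "127.0.0.1:12346"}
attach_to = {"Node0": [], "Node1": ["tcp://127.0.0.1:12345"]}

attch_from_nums = count_attach_from(addresses, attach_to)
print(attch_from_nums)
```

Each node would then pass its own entry as 'attch_from_nums' when it creates its Barrier; a node with an empty 'attach_to' list has no other way to learn this count.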

Check List

  • merge the latest version source branch/repo, and resolve all the conflicts
  • pass style check
  • pass all the tests

@codecov

codecov bot commented Jan 12, 2023

Codecov Report

Merging #570 (04eafbb) into main (3a9f213) will decrease coverage by 0.89%.
The diff coverage is 92.24%.

@@            Coverage Diff             @@
##             main     #570      +/-   ##
==========================================
- Coverage   83.57%   82.69%   -0.89%     
==========================================
  Files         562      581      +19     
  Lines       46009    47464    +1455     
==========================================
+ Hits        38454    39249     +795     
- Misses       7555     8215     +660     
Flag Coverage Δ
unittests 82.69% <92.24%> (-0.89%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
ding/framework/middleware/collector.py 54.45% <0.00%> (-2.55%) ⬇️
ding/framework/middleware/functional/evaluator.py 42.35% <0.00%> (-0.31%) ⬇️
ding/policy/ppof.py 15.02% <0.00%> (-1.94%) ⬇️
ding/framework/middleware/functional/trainer.py 85.71% <50.00%> (ø)
ding/framework/task.py 93.64% <75.00%> (+0.08%) ⬆️
ding/framework/parallel.py 89.04% <90.90%> (+3.47%) ⬆️
ding/framework/middleware/tests/test_barrier.py 94.54% <94.54%> (ø)
ding/framework/middleware/barrier.py 96.74% <96.74%> (ø)
ding/framework/middleware/__init__.py 100.00% <100.00%> (ø)

... and 176 files with indirect coverage changes


@SolenoidWGT SolenoidWGT force-pushed the feature_add_nng_mq_barrier branch 3 times, most recently from c6a2d56 to 906e0c0, on January 13, 2023 07:28
@PaParaZz1 PaParaZz1 added the enhancement New feature or request label Jan 16, 2023
@PaParaZz1 PaParaZz1 changed the title from "feature(wgt): Add barrier middleware" to "feature(wgt): add barrier middleware" Feb 2, 2023
@PaParaZz1 PaParaZz1 merged commit dd00ebf into opendilab:main Apr 16, 2023