[Feature] Add multi machine dist_train #114

Merged
pppppM merged 4 commits into open-mmlab:dev_v0.3.0 on Mar 17, 2022

pppppM (Collaborator) commented Mar 16, 2022

Motivation

Add training startup documentation
Support multi-node (multi-machine) training
ref: open-mmlab/mmselfsup#232

Modification

Add training startup documentation
Update tools/xxx/dist_train.sh and tools/xxx/dist_test.sh
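
The PR text does not show the updated scripts, so the following is only a minimal sketch of how a multi-node launch command is commonly assembled for `torch.distributed.launch`. The environment variable names (`NNODES`, `NODE_RANK`, `MASTER_ADDR`, `PORT`), their defaults, the `tools/train.py` path, and the `--launcher pytorch` argument are assumptions based on common OpenMMLab conventions, not something this PR confirms; `build_launch_command` is a hypothetical helper.

```python
import os
import sys
from typing import List, Mapping, Optional, Sequence


def build_launch_command(
    config: str,
    gpus_per_node: int,
    extra_args: Sequence[str] = (),
    env: Optional[Mapping[str, str]] = None,
    train_script: str = "tools/train.py",  # assumed path, not taken from the PR
) -> List[str]:
    """Build the argv list for one node of a multi-node training job."""
    env = os.environ if env is None else env

    # Assumed variable names and defaults; a single-node run needs none of them.
    nnodes = int(env.get("NNODES", "1"))
    node_rank = int(env.get("NODE_RANK", "0"))
    master_addr = env.get("MASTER_ADDR", "127.0.0.1")
    port = int(env.get("PORT", "29500"))

    if not 0 <= node_rank < nnodes:
        raise ValueError(f"NODE_RANK={node_rank} must be in [0, {nnodes})")

    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",                  # total number of machines
        f"--node_rank={node_rank}",            # index of this machine
        f"--master_addr={master_addr}",        # address of the rank-0 machine
        f"--nproc_per_node={gpus_per_node}",   # one process per GPU
        f"--master_port={port}",
        train_script,
        config,
        "--launcher", "pytorch",               # assumed train.py argument
        *extra_args,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch needs no GPUs.
    print(" ".join(build_launch_command("path/to/config.py", 8)))
```

Under these assumptions, every machine would run the same command with the same `MASTER_ADDR` and `PORT` but a different `NODE_RANK`, starting from 0 on the master node.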

codecov bot commented Mar 16, 2022

Codecov Report

Merging #114 (88d244c) into dev_v0.3.0 (20d1e0b) will decrease coverage by 1.58%.
The diff coverage is n/a.

@@              Coverage Diff               @@
##           dev_v0.3.0     #114      +/-   ##
==============================================
- Coverage       64.81%   63.23%   -1.59%     
==============================================
  Files              91       91              
  Lines            3223     3272      +49     
  Branches          597      600       +3     
==============================================
- Hits             2089     2069      -20     
- Misses           1035     1095      +60     
- Partials           99      108       +9     
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 63.23% <ø> (-1.59%) | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| mmrazor/models/pruners/utils/switchable_bn.py | 30.76% <0.00%> (-69.24%) | ⬇️ |
| mmrazor/apis/utils.py | 56.00% <0.00%> (-44.00%) | ⬇️ |
| mmrazor/models/pruners/ratio_pruning.py | 64.51% <0.00%> (-35.49%) | ⬇️ |
| mmrazor/utils/setup_env.py | 72.72% <0.00%> (-22.73%) | ⬇️ |
| mmrazor/models/algorithms/autoslim.py | 56.58% <0.00%> (-13.96%) | ⬇️ |
| mmrazor/utils/misc.py | 95.23% <0.00%> (-4.77%) | ⬇️ |
| mmrazor/models/pruners/structure_pruning.py | 85.75% <0.00%> (-0.39%) | ⬇️ |
| mmrazor/apis/__init__.py | 100.00% <0.00%> (ø) | |
| mmrazor/models/mutators/differentiable_mutator.py | 97.50% <0.00%> (+0.06%) | ⬆️ |
| mmrazor/models/mutators/one_shot_mutator.py | 96.42% <0.00%> (+0.13%) | ⬆️ |
| ... and 8 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20d1e0b...88d244c. Read the comment docs.

pppppM changed the title from "Multinode" to "[Feature] Add multi machine dist_train" on Mar 17, 2022
pppppM merged commit f5ee768 into open-mmlab:dev_v0.3.0 on Mar 17, 2022
pppppM added a commit to humu789/mmrazor that referenced this pull request Mar 27, 2022
* support multi nodes

* update training doc

* fix lints

* remove fixed seed
pppppM added a commit that referenced this pull request Apr 2, 2022
* [Feature] Add function to meet mmdeploy support (#102)

* add init_model function for mmdeploy

* fix lint

* add unittest for init_xxx_model

* fix lint

* mv test_inference.py to test_apis directory

* [Refactor] Delete redundant `set_random_seed` function (#104)

* refactor set_random_seed

* add unittests

* fix unittests error

* fix lint

* avoid bc breaking

* [Feature] Add diff seeds to diff ranks and set torch seed in worker_init_fn (#113)

* add init_random_seed

* Set diff seed to diff workers

* [Feature] Add multi machine dist_train (#114)

* support multi nodes

* update training doc

* fix lints

* remove fixed seed

* fix ddp wrapper registry (#128)

* [Docs] Add brief installation steps in README(_zh-CN).md (#121)

* Add brief installation

* add brief installation ref to mmediting pr#816

Co-authored-by: caoweihan <caoweihan@sensetime.com>

* [BUG]Fix bugs in pruner (#126)

* fix bugs in pruner when pruning models with shared modules

* pruner can trace models with dilation conv2d

* fix deploy_subnet

* fix add_pruning_attrs

* fix bugs in modify_forward

* fix lint

* fix StructurePruner

* test tracing models with shared modules

Co-authored-by: caoweihan <caoweihan@sensetime.com>

* [Docs]Add some more details to docs (#133)

* add docs for dataset

* add cfg-options for distillation

* fix docs

Co-authored-by: caoweihan <caoweihan@sensetime.com>

* reset norm running status after prepare_from_supernet (#81)

* [Improvement]Sync train api (#115)

Co-authored-by: caoweihan <caoweihan@sensetime.com>

* [Feature]Support Relational Knowledge Distillation (#127)

* add rkd

* add rkd pytest

* add rkd configs

* fix readme

* fix rkd

* split rkd loss to distance-wise and angle-wise losses

* rename rkd losses

* add rkd metafile

* add rkd related links

* rename rkd metafile and add to model index

* delete cifar100

Co-authored-by: caoweihan <caoweihan@sensetime.com>
Co-authored-by: pppppM <gjf_mail@126.com>

Co-authored-by: qiufeng <44188071+wutongshenqiu@users.noreply.github.com>
Co-authored-by: wutongshenqiu <690364065@qq.com>
Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>
Co-authored-by: caoweihan <caoweihan@sensetime.com>
pppppM added a commit to pppppM/mmrazor that referenced this pull request Jul 15, 2022
* [Feature] Add function to meet mmdeploy support (open-mmlab#102)

* add init_model function for mmdeploy

* fix lint

* add unittest for init_xxx_model

* fix lint

* mv test_inference.py to test_apis directory

* [Refactor] Delete redundant `set_random_seed` function (open-mmlab#104)

* refactor set_random_seed

* add unittests

* fix unittests error

* fix lint

* avoid bc breaking

* [Feature] Add diff seeds to diff ranks and set torch seed in worker_init_fn (open-mmlab#113)

* add init_random_seed

* Set diff seed to diff workers

* [Feature] Add multi machine dist_train (open-mmlab#114)

* support multi nodes

* update training doc

* fix lints

* remove fixed seed

* fix ddp wrapper registry (open-mmlab#128)

* [Docs] Add brief installation steps in README(_zh-CN).md (open-mmlab#121)

* Add brief installation

* add brief installation ref to mmediting pr#816

Co-authored-by: caoweihan <caoweihan@sensetime.com>

* [BUG]Fix bugs in pruner (open-mmlab#126)

* fix bugs in pruner when pruning models with shared modules

* pruner can trace models with dilation conv2d

* fix deploy_subnet

* fix add_pruning_attrs

* fix bugs in modify_forward

* fix lint

* fix StructurePruner

* test tracing models with shared modules

Co-authored-by: caoweihan <caoweihan@sensetime.com>

* [Docs]Add some more details to docs (open-mmlab#133)

* add docs for dataset

* add cfg-options for distillation

* fix docs

Co-authored-by: caoweihan <caoweihan@sensetime.com>

* reset norm running status after prepare_from_supernet (open-mmlab#81)

* [Improvement]Sync train api (open-mmlab#115)

Co-authored-by: caoweihan <caoweihan@sensetime.com>

* [Feature]Support Relational Knowledge Distillation (open-mmlab#127)

* add rkd

* add rkd pytest

* add rkd configs

* fix readme

* fix rkd

* split rkd loss to distance-wise and angle-wise losses

* rename rkd losses

* add rkd metafile

* add rkd related links

* rename rkd metafile and add to model index

* delete cifar100

Co-authored-by: caoweihan <caoweihan@sensetime.com>
Co-authored-by: pppppM <gjf_mail@126.com>

Co-authored-by: qiufeng <44188071+wutongshenqiu@users.noreply.github.com>
Co-authored-by: wutongshenqiu <690364065@qq.com>
Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>
Co-authored-by: caoweihan <caoweihan@sensetime.com>
humu789 pushed a commit to humu789/mmrazor that referenced this pull request Feb 13, 2023
* add shape constantofshape unittest for ncnn

* fix lint

* standardize import

* fix lint

* reply for code review

* reply for code review

* fix lint

* remove some hardcode

* fix lint

* reply for code review

* test gather and fix gather cpp code

* fix yapf

* fix clang-format

* reply for code review

* reply for code review

* fix lint