
[Bug] Default process group has not been initialized with autoslim search #74

Closed
twmht opened this issue Feb 10, 2022 · 8 comments
Labels
bug Something isn't working


@twmht
Contributor

twmht commented Feb 10, 2022

I tried to search for subnets from a supernet with AutoSlim:

python ./tools/mmcls/search_mmcls.py \
  configs/pruning/autoslim/autoslim_mbv2_search_8xb1024_in1k.py \
  your_pre-training_checkpoint_path \
  --work-dir your_work_dir


Do I need to use distributed mode when searching?

@twmht added the bug label (Something isn't working) on Feb 10, 2022
@twmht changed the title from "[Bug] Default process group has not been initialized" to "[Bug] Default process group has not been initialized with autoslim search" on Feb 10, 2022
@HIT-cwh
Collaborator

HIT-cwh commented Feb 10, 2022

Thank you for the issue.
At present, distributed mode is required for searching, even when only one GPU is used. This is hacky, and we are refactoring the search part; the new version will not have this problem.

@twmht closed this as completed on Feb 10, 2022
@tanghy2016

Has this problem been fixed in the current version? I'm running into the same issue.

@HIT-cwh
Collaborator

HIT-cwh commented Apr 15, 2022

You can work around this by running in distributed mode.

Also, please use English where possible so that the community around the world can follow the discussion.

@tanghy2016

Where do I set up what you described?

@HIT-cwh
Collaborator

HIT-cwh commented Apr 15, 2022

To use distributed mode, set the job launcher to one of pytorch, slurm, or mpi (see here).
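With the pytorch launcher, the usual approach is to start the script through PyTorch's own launcher module, which exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for every process it spawns. The sketch below only assembles such a command for a single GPU. It assumes that torch.distributed.launch is available in the installed PyTorch (newer releases also ship torchrun) and that search_mmcls.py accepts the --local_rank argument this launcher appends; build_launch_cmd is a hypothetical helper, not part of MMRazor.

```python
import shlex
import sys


def build_launch_cmd(config, checkpoint, work_dir, nproc=1, master_port=29500,
                     script="./tools/mmcls/search_mmcls.py"):
    """Return an argv list that runs the search script under torch.distributed.launch.

    Hypothetical helper: the launcher module (assumed to be installed with PyTorch)
    sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each child process, so
    the script's --launcher pytorch path can read them.
    """
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}",   # one process per GPU; 1 for a single GPU
        f"--master_port={master_port}",
        script, config, checkpoint,
        "--work-dir", work_dir,
        "--launcher", "pytorch",
    ]


if __name__ == "__main__":
    cmd = build_launch_cmd(
        "configs/pruning/autoslim/autoslim_mbv2_search_8xb1024_in1k.py",
        "your_pre-training_checkpoint_path",
        "your_work_dir",
    )
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the returned list to subprocess.run (or pasting the printed line into a shell) starts one worker; raising nproc would let the same launcher start one process per GPU.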

@tanghy2016

$ python ./tools/mmcls/search_mmcls.py \
>   configs/pruning/autoslim/autoslim_mbv2_search_8xb1024_ci10.py \
>   output/epoch_50.pth \
>   --work-dir output \
>   --launcher pytorch
/home/tanghuayang/venv_torch/lib/python3.6/site-packages/mmrazor/utils/setup_env.py:33: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting OMP_NUM_THREADS environment variable for each process '
/home/tanghuayang/venv_torch/lib/python3.6/site-packages/mmrazor/utils/setup_env.py:43: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting MKL_NUM_THREADS environment variable for each process '
Traceback (most recent call last):
  File "./tools/mmcls/search_mmcls.py", line 181, in <module>
    main()
  File "./tools/mmcls/search_mmcls.py", line 99, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "/home/tanghuayang/venv_torch/lib64/python3.6/site-packages/mmcv/runner/dist_utils.py", line 18, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/tanghuayang/venv_torch/lib64/python3.6/site-packages/mmcv/runner/dist_utils.py", line 29, in _init_dist_pytorch
    rank = int(os.environ['RANK'])
  File "/usr/lib64/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'

Is it necessary to configure cfg.dist_params? If so, how do I configure it?
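The KeyError comes from the last frames of the traceback: with --launcher pytorch, mmcv reads os.environ['RANK'], a variable that PyTorch's launcher normally exports. Process-group initialization with the default env:// method also reads WORLD_SIZE, MASTER_ADDR and MASTER_PORT. A small pre-flight check can report which of these are missing before the search starts; missing_dist_vars is a hypothetical helper name, and the check is only a sketch.

```python
import os

# Variables used by env:// process-group initialization; RANK is also read
# directly by the pytorch launcher path shown in the traceback above.
REQUIRED_DIST_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_dist_vars(environ=None):
    """Return the distributed-launch variables absent from `environ` (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_DIST_VARS if name not in environ]


if __name__ == "__main__":
    missing = missing_dist_vars()
    # Report rather than exit, so the check can be dropped into any wrapper script.
    print("missing:", ", ".join(missing) if missing else "none")
```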

@tanghy2016

It's running now, using the following command:

$ RANK=0 WORLD_SIZE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=1692 python ./tools/mmcls/search_mmcls.py \
  configs/pruning/autoslim/autoslim_mbv2_search_8xb1024_ci10.py \
  output/epoch_50.pth \
  --work-dir output \
  --launcher pytorch
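The four variables in that command describe a single-process group: rank 0 in a world of size 1, meeting at a local address and port. A minimal sketch of doing the same from a small Python wrapper follows; single_process_env and run_search are hypothetical names, and port 1692 is simply copied from the command above. Values that are already set in the environment are kept.

```python
import os
import subprocess
import sys


def single_process_env(base=None, port="1692"):
    """Copy `base` (default: os.environ) and fill in single-process defaults."""
    env = dict(os.environ if base is None else base)  # copy; never mutate the caller's mapping
    env.setdefault("RANK", "0")           # this process is rank 0 ...
    env.setdefault("WORLD_SIZE", "1")     # ... in a group with a single member
    env.setdefault("MASTER_ADDR", "127.0.0.1")
    env.setdefault("MASTER_PORT", port)
    return env


def run_search(config, checkpoint, work_dir, script="./tools/mmcls/search_mmcls.py"):
    """Run the search script with --launcher pytorch in a single-process environment.

    Raises subprocess.CalledProcessError if the script exits with a non-zero status.
    """
    cmd = [sys.executable, script, config, checkpoint,
           "--work-dir", work_dir, "--launcher", "pytorch"]
    return subprocess.run(cmd, env=single_process_env(), check=True)
```

For example, run_search("configs/pruning/autoslim/autoslim_mbv2_search_8xb1024_ci10.py", "output/epoch_50.pth", "output") should behave like the shell command above.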

@tanghy2016

But how do I write these configuration parameters into cfg.dist_params?
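Going by the traceback, search_mmcls.py calls init_dist(args.launcher, **cfg.dist_params), and the pytorch branch reads os.environ['RANK'] before the keyword arguments are used. In the mmcv version shown there, cfg.dist_params therefore only adds keyword arguments (such as the backend) and cannot supply RANK, which still has to come from the environment (the command line or a launcher). The toy function below is not mmcv's code; it only mimics that ordering to show why putting a rank into dist_params does not help.

```python
def toy_init_dist_pytorch(backend, environ, **kwargs):
    """Toy stand-in for the launcher path in the traceback (NOT mmcv's code).

    Like that path, it reads RANK from the environment before it ever looks
    at the keyword arguments coming from cfg.dist_params.
    """
    rank = int(environ["RANK"])  # KeyError: 'RANK' when the variable is unset
    # The real code would then initialise the process group using backend and kwargs.
    return {"backend": backend, "rank": rank, **kwargs}


# Hypothetical dist_params that tries to carry the rank inside the config.
dist_params = dict(backend="nccl", rank=0, world_size=1)

try:
    toy_init_dist_pytorch(environ={}, **dist_params)
except KeyError as err:
    print("fails even with rank in dist_params:", err)

# Once RANK is present in the environment, the same call goes through.
print(toy_init_dist_pytorch(environ={"RANK": "0"}, **dist_params))
```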

humu789 pushed a commit to humu789/mmrazor that referenced this issue Feb 13, 2023
3 participants