SPPF will generate nodes with duplicate names #234

Closed
SarBH opened this issue Nov 23, 2021 · 6 comments · Fixed by #240
Labels
bug / fix Something isn't working

Comments

@SarBH

SarBH commented Nov 23, 2021

🐛 Describe the bug

Somewhere between these two commits the model backbone changed: 06022fd...e3e18f2. The three MaxPool2d layers at backbone.body.8.m.0, backbone.body.8.m.1, and backbone.body.8.m.2 went from being applied in parallel to being serialized, in the later commit, into a single backbone.body.8.m with three outputs. (I'm using nni==2.4 to prune the model, and a node with three outputs is a problem for that.)

For example, the SMALL model.
On the left: commit hash 06022fd, default upstream_version = r4.0
On the right: commit hash e3e18f2. The default value of upstream_version changed with the addition of r6.0, so I set upstream_version=r4.0 explicitly.
image
I see the same behavior for 'yolov5s', 'yolov5m', 'yolov5l'

All else in the models remained the same, therefore I wondered if this was accidental.

Versions

Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.14.4
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun  2 2021, 10:49:15)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.14.252-195.483.amzn2.x86_64-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.142
GPU models and configuration: GPU 0: Tesla K80
Nvidia driver version: 450.142.00
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] numpydoc==1.1.0
[pip3] pytorch-lightning==1.5.2
[pip3] pytorchcv==0.0.58
[pip3] segmentation-models-pytorch==0.2.1
[pip3] torch==1.7.1
[pip3] torchinfo==1.5.3
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.8.2
[conda] Could not collect
@zhiqwang
Owner

zhiqwang commented Nov 24, 2021

Hi @SarBH, thanks for reporting and providing this information!

All else in the models remained the same, therefore I wondered if this was accidental.

This change originates in upstream YOLOv5 (ultralytics/yolov5#4420): the SPPF we adopted here is a faster version of SPP. I thought it would also benefit the earlier versions, so I replaced this part in 451f3e4, together with a verification of numerical equality.

EDITED: A detailed test of SPPF vs. SPP: ultralytics/yolov5#4420 (comment).

https://github.com/zhiqwang/yolov5-rt-stack/blob/4cba0437389e2cccbcc299ab196c922945c93d45/yolort/v5/models/common.py#L191-L208
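
The numerical-equality claim can be checked with a short PyTorch sketch. This is not the yolort code linked above: the helper `pool`, the input shape, and the seed are illustrative assumptions, and the convolution layers of the real SPP/SPPF modules are left out so that only the pooling stage is compared.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 8, 20, 20)  # arbitrary feature map, chosen for illustration


def pool(t, k):
    # Stride-1 max pooling with "same" padding, as used in SPP-style blocks.
    return F.max_pool2d(t, kernel_size=k, stride=1, padding=k // 2)


# SPP-style: three pools with different kernel sizes, all applied to x.
spp_outputs = [pool(x, k) for k in (5, 9, 13)]

# SPPF-style: one 5x5 pool applied three times in sequence.
y1 = pool(x, 5)
y2 = pool(y1, 5)
y3 = pool(y2, 5)
sppf_outputs = [y1, y2, y3]

# Chaining 5x5 pools widens the window to 9x9 and then 13x13, so each pair
# of tensors is expected to match element for element.
all_equal = all(torch.equal(a, b) for a, b in zip(spp_outputs, sppf_outputs))
print(all_equal)
```

Because max pooling only selects existing values, the comparison uses exact equality rather than a tolerance.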

I'm using nni==2.4 to prune the model, and a node with three outputs is a problem for that

Actually, I don't quite understand what is happening here. Could you provide more information or a reproducible example? (I am also interested in nni and am following its progress.)

If the above change affects downstream applications, we can revert it, or we can work together to find a better way to handle this scenario.

@SarBH
Author

SarBH commented Nov 30, 2021

Thanks for the follow up @zhiqwang, and for this awesome project!

I investigated where exactly pruning is failing:

The compressor.py file contains the assertion assert len(node.outputs) == 1, 'The number of the output should be one after the Tuple unpacked manually'.
I see that all other (non failing) nodes in this model indeed have a single output, but this one MaxPool2d node has 3 outputs after the update (below is a printout of the node; see the last line):

name: backbone.body.8.m, type: module, op_type: MaxPool2d, sub_nodes: 
['__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_aten::max_pool2d', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_aten::max_pool2d', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_aten::max_pool2d', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct', 
'__module.backbone/__module.backbone.body/__module.backbone.body.8/__module.backbone.body.8.m_prim::ListConstruct'], 
inputs: ['input.85'], outputs: ['input.86', 'input.87', '4815'], aux: None 

The model at the old commit, which does not fail pruning, actually splits that backbone.body.8.m layer into backbone.body.8.m.0, backbone.body.8.m.1, and backbone.body.8.m.2, but it is otherwise exactly the same model.
image
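
The difference between the two layouts can be reproduced with a small PyTorch sketch. The classes `ParallelPools` and `SerialPool` and the helper `pool_call_counts` are hypothetical stand-ins written for this illustration (they are not the yolort or nni code); the sketch only counts how often each named MaxPool2d module runs in one forward pass.

```python
from collections import Counter

import torch
from torch import nn


class ParallelPools(nn.Module):
    """SPP-style stand-in: three separate pools, registered as m.0, m.1, m.2."""

    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.m = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels]
        )

    def forward(self, x):
        return torch.cat([x] + [m(x) for m in self.m], dim=1)


class SerialPool(nn.Module):
    """SPPF-style stand-in: a single pool, registered as m, called three times."""

    def __init__(self, k=5):
        super().__init__()
        self.m = nn.MaxPool2d(k, stride=1, padding=k // 2)

    def forward(self, x):
        y1 = self.m(x)
        y2 = self.m(y1)
        return torch.cat([x, y1, y2, self.m(y2)], dim=1)


def pool_call_counts(model, x):
    """Count forward calls per named MaxPool2d module during one pass."""
    counts = Counter()
    hooks = []
    for name, mod in model.named_modules():
        if isinstance(mod, nn.MaxPool2d):
            hooks.append(
                mod.register_forward_hook(
                    lambda _m, _inp, _out, name=name: counts.update([name])
                )
            )
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return dict(counts)


x = torch.randn(1, 4, 16, 16)
parallel_counts = pool_call_counts(ParallelPools(), x)
serial_counts = pool_call_counts(SerialPool(), x)
print(parallel_counts)
print(serial_counts)
```

In the serial layout a single module name accounts for all three pooling calls, which is consistent with one node carrying three outputs in the printout above.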

** Note: We are using an older version of the nni package (nni==2.4), so I'm not sure whether this has already been resolved. Since compressor.py still has the assert, I'm inclined to believe the new version will also fail.
I'm not an expert on pruning, but I hope this helps :)

@zhiqwang
Owner

zhiqwang commented Dec 1, 2021

I see that all other (non failing) nodes in this model indeed have a single output, but this one MaxPool2d node has 3 outputs after the update.

Got it, thanks for the detailed information, it is very useful. I agree with you; we will revert this substitution within the next two days.

@zhiqwang zhiqwang added the bug / fix Something isn't working label Dec 1, 2021
@zhiqwang
Owner

zhiqwang commented Dec 1, 2021

Hi @SarBH ,

I reverted SPPF to SPP in #240 for both "r4.0" and "r6.0", so I'm closing this issue.

Please reinstall yolort from source (we'll release 0.6.0 at the end of the month):

pip install -U 'git+https://github.com/zhiqwang/yolov5-rt-stack.git'

Thanks again for the detailed information; feel free to reopen this or create another ticket if you have more questions.

@syswyl

syswyl commented Jan 25, 2022

Hello @SarBH, I encountered the same problem as you when using nni 2.6. Since SPPF is the cause, I decided to use YOLOv5 version 5, because that older version still uses the basic SPP module. However, the pruning process then raised a new error. Have you succeeded in pruning YOLOv5?

[2022-01-25 11:21:04] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for model.12
model.12
node inputs:['4932', 'input.79']
node outputs:['input.107']
file:nni/compression/pytorch/speedup/compressor.py
Traceback (most recent call last):
  File "v12_old_yolo.py", line 85, in <module>
    ModelSpeedup(model, dummy_input=dummy_input.to(device), masks_file=masks).speedup_model()
  File "/nni-master25/nni/compression/pytorch/speedup/compressor.py", line 545, in speedup_model
    self.infer_modules_masks()
  File "/nni-master25/nni/compression/pytorch/speedup/compressor.py", line 390, in infer_modules_masks
    self.update_direct_sparsity(curnode)
  File "/nni-master25/nni/compression/pytorch/speedup/compressor.py", line 234, in update_direct_sparsity
    state_dict=copy.deepcopy(module.state_dict()), batch_dim=self.batch_dim)
  File "/nni-master25/nni/compression/pytorch/speedup/infer_mask.py", line 80, in __init__
    self.output = self.module(*dummy_input)
  File "/Users/anaconda3/envs/py37torch17/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
TypeError: forward() takes 2 positional arguments but 3 were given
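
One plausible reading of this traceback, offered as an assumption rather than something confirmed in this thread: the node has two inputs, the call self.module(*dummy_input) unpacks them into two positional arguments, and the module's forward accepts a single argument such as a list of tensors. The plain-Python sketch below reproduces that kind of mismatch; `ListInputLayer` is a hypothetical stand-in and does not use torch or nni.

```python
class ListInputLayer:
    """Hypothetical layer whose forward expects ONE argument: a list of inputs."""

    def forward(self, xs):
        # Concatenate the inner lists, loosely mimicking a concat layer.
        return [value for x in xs for value in x]

    def __call__(self, *args):
        # Forward positional arguments unchanged, as a module call would.
        return self.forward(*args)


layer = ListInputLayer()
inputs = [[1, 2], [3]]

# Passing the inputs as one list matches the forward signature.
ok = layer(inputs)

# Unpacking the same inputs yields two positional arguments plus self,
# which the single-argument forward rejects.
try:
    layer(*inputs)
    message = None
except TypeError as exc:
    message = str(exc)

print(ok)
print(message)
```

If this reading applies, the failure comes from how the inputs are passed to the module rather than from the pooling layers discussed earlier.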

@zhiqwang zhiqwang changed the title from "After code refactor, model backbone for r4.0 has changed" to "SPPF will generate nodes with duplicate names" Mar 2, 2022
@Hap-Zhang

Hi @zhiqwang @syswyl,
I am hitting the same error now. Have you solved this problem? Could you give me some guidance? Thanks very much!
