This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

cannot quantization example #17231

Open
zhhoper opened this issue Jan 7, 2020 · 17 comments
Labels
Bug Quantization Issues/Feature Requests related to Quantization

Comments

@zhhoper

zhhoper commented Jan 7, 2020

Description

I tried to run the quantization example with python imagenet_gen_qsym_mkldnn.py and hit a segmentation fault. The full output is as follows:

Error Message

(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=10 before running your script.)
INFO:logger:Namespace(batch_size=32, calib_dataset='data/val_256_q90.rec', calib_mode='entropy', data_nthreads=60, enable_calib_quantize=True, epoch=0, exclude_first_conv=False, image_shape='3,224,224', label_name='softmax_label', model='resnet50_v1', no_pretrained=False, num_calib_batches=10, quantized_dtype='auto', quiet=False, shuffle_chunk_seed=3982304, shuffle_dataset=True, shuffle_seed=48564309)
INFO:logger:shuffle_dataset=True
INFO:logger:calibration mode set to entropy
INFO:logger:Get pre-trained model from MXNet or Gluoncv modelzoo.
INFO:logger:If you want to use custom model, please set --no-pretrained.
INFO:logger:model resnet50_v1 is converted from GluonCV
INFO:logger:Converting model from Gluon-CV ModelZoo resnet50_v1... into path /home/ubuntu/software/incubator-mxnet/example/quantization/model
Model file is not found. Downloading.
Downloading /home/ubuntu/.mxnet/models/resnet50_v1-cc729d95.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet50_v1-cc729d95.zip...
100%|██████████| 57421/57421 [00:00<00:00, 57938.39KB/s]
/home/ubuntu/anaconda3/envs/mxnet_0.15/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/module/base_module.py:67: UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['softmax_label'])
warnings.warn(msg)
[00:03:02] ../src/executor/graph_executor.cc:1982: Subgraph backend MKLDNN is activated.
INFO:logger:batch size = 32 for calibration
INFO:logger:number of batches = 10 for calibration
INFO:logger:These layers have been excluded []
INFO:logger:label_name = softmax_label
INFO:logger:Input data shape = (3, 224, 224)
INFO:logger:rgb_mean = 123.68,116.779,103.939
INFO:logger:rgb_std = 58.393, 57.12, 57.375
INFO:logger:Creating ImageRecordIter for reading calibration dataset
[00:03:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/val_256_q90.rec, use 16 threads for decoding..

Segmentation fault: 11
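
The log above stops at the segfault with no stack trace. A minimal sketch of how the script could be relaunched with more diagnostics, assuming the DMLC_LOG_STACK_TRACE_DEPTH variable named in the issue template and CPython's built-in `-X faulthandler` option; the helper name `build_debug_run` is hypothetical and not part of MXNet:

```python
import os
import sys


def build_debug_run(script, script_args=(), trace_depth=10):
    """Return (argv, env) for rerunning `script` with extra crash diagnostics.

    Hypothetical helper for illustration; it only prepares the command.
    """
    env = dict(os.environ)
    # Environment variable mentioned in the issue template for native stack traces.
    env["DMLC_LOG_STACK_TRACE_DEPTH"] = str(trace_depth)
    # -X faulthandler asks CPython to dump the Python traceback on fatal signals.
    argv = [sys.executable, "-X", "faulthandler", script, *script_args]
    return argv, env


argv, env = build_debug_run("imagenet_gen_qsym_mkldnn.py", ["--calib-mode=naive"])
print(" ".join(argv))
# To actually launch it: subprocess.run(argv, env=env)
```

This only shows which Python frame was active when the process crashed; it does not by itself explain the fault inside the native library.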

@zhhoper zhhoper added the Bug label Jan 7, 2020
@eric-haibin-lin
Member

@ZhennanQin
Contributor

@eric-haibin-lin Thanks for reporting this. Could you check whether calib-mode=naive also crashes?

@zhhoper
Author

zhhoper commented Jan 7, 2020

@ZhennanQin I set calib-mode to 'naive' and hit the same error. The output is as follows:

INFO:logger:Namespace(batch_size=32, calib_dataset='data/val_256_q90.rec', calib_mode='naive', data_nthreads=60, enable_calib_quantize=True, epoch=0, exclude_first_conv=False, image_shape='3,224,224', label_name='softmax_label', model='resnet50_v1', no_pretrained=False, num_calib_batches=10, quantized_dtype='auto', quiet=False, shuffle_chunk_seed=3982304, shuffle_dataset=True, shuffle_seed=48564309)
INFO:logger:shuffle_dataset=True
INFO:logger:calibration mode set to naive
INFO:logger:Get pre-trained model from MXNet or Gluoncv modelzoo.
INFO:logger:If you want to use custom model, please set --no-pretrained.
INFO:logger:model resnet50_v1 is converted from GluonCV
INFO:logger:Converting model from Gluon-CV ModelZoo resnet50_v1... into path /home/ubuntu/software/incubator-mxnet/example/quantization/model
/home/ubuntu/anaconda3/envs/mxnet_0.15/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/module/base_module.py:67: UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['softmax_label'])
warnings.warn(msg)
[00:17:25] ../src/executor/graph_executor.cc:1982: Subgraph backend MKLDNN is activated.
INFO:logger:batch size = 32 for calibration
INFO:logger:number of batches = 10 for calibration
INFO:logger:These layers have been excluded []
INFO:logger:label_name = softmax_label
INFO:logger:Input data shape = (3, 224, 224)
INFO:logger:rgb_mean = 123.68,116.779,103.939
INFO:logger:rgb_std = 58.393, 57.12, 57.375
INFO:logger:Creating ImageRecordIter for reading calibration dataset
[00:17:26] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/val_256_q90.rec, use 16 threads for decoding..

Segmentation fault: 11

@ZhennanQin
Contributor

@zhhoper Thanks for the information. Will investigate this soon.

@wuxun-zhang
Contributor

@zhhoper I cannot reproduce this issue with the latest master on my local machine. Which mxnet version (commit id) are you using? We recently submitted a PR to fix a _copyto issue when using calib_mode=entropy. Could you try again with commit e65fc4b or later on master? Please let us know if you have any questions. Thanks.

@pengzhao-intel pengzhao-intel added the Quantization Issues/Feature Requests related to Quantization label Jan 7, 2020
@wuxun-zhang
Contributor

@zhhoper Any update on this issue?

@zhhoper
Author

zhhoper commented Jan 14, 2020

@wuxun-zhang Sorry, I haven't been able to get back to this since reporting the bug. I will take a look and let you know whether the bug is still there.

@zhhoper
Author

zhhoper commented Jan 15, 2020

@wuxun-zhang @ZhennanQin I ran the example with mxnet 1.6.0 and it seems to work. However, the quantized model runs much slower (more than 10 times) than the original one. Is there anything I need to set up to speed up the quantized model?
I tested resnet152.
For float32:
command:
python imagenet_inference.py --symbol-file=./model/imagenet1k-resnet-152-symbol.json --param-file=./model/imagenet1k-resnet-152-0000.params --num-skipped-batches=50 --batch-size=64 --num-inference-batches=500 --dataset=./data/val_256_q90.rec --ctx=cpu
Output:
INFO:logger:batch size = 64 for inference
INFO:logger:rgb_mean = 0,0,0
INFO:logger:rgb_std = 1,1,1
INFO:logger:label_name = softmax_label
INFO:logger:Input data shape = (3, 224, 224)
INFO:logger:Dataset for inference: ./data/val_256_q90.rec
[07:03:16] ../src/io/iter_image_recordio_2.cc:831: Create ImageRecordIter2 optimized for CPU backend.Use omp threads instead of preprocess_threads.
[07:03:16] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./data/val_256_q90.rec, use 16 threads for decoding..
[07:03:16] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7401, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
INFO:logger:Loading symbol from file /home/ubuntu/software/incubator-mxnet/example/quantization/./model/imagenet1k-resnet-152-symbol.json
[07:03:18] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[07:03:18] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
INFO:logger:Loading params from file /home/ubuntu/software/incubator-mxnet/example/quantization/./model/imagenet1k-resnet-152-0000.params
INFO:logger:Skipping the first 50 batches
INFO:logger:Running model ./model/imagenet1k-resnet-152-symbol.json for inference
[07:03:19] ../src/executor/graph_executor.cc:1982: Subgraph backend MKLDNN is activated.
INFO:logger:Finished inference with 32000 images
INFO:logger:Finished with 22.124158 images per second
WARNING:logger:Note: GPU performance is expected to be slower than CPU. Please refer quantization/README.md for details
INFO:logger:('accuracy', 0.7676875)
INFO:logger:('top_k_accuracy_5', 0.93034375)

For quantized model
command:
python imagenet_inference.py --symbol-file=./model/imagenet1k-resnet-152-quantized-5batches-naive-symbol.json --param-file=./model/imagenet1k-resnet-152-quantized-0000.params --num-skipped-batches=50 --batch-size=64 --num-inference-batches=500 --dataset=./data/val_256_q90.rec --ctx=cpu
output:
INFO:logger:batch size = 64 for inference
INFO:logger:rgb_mean = 0,0,0
INFO:logger:rgb_std = 1,1,1
INFO:logger:label_name = softmax_label
INFO:logger:Input data shape = (3, 224, 224)
INFO:logger:Dataset for inference: ./data/val_256_q90.rec
[00:37:40] ../src/io/iter_image_recordio_2.cc:831: Create ImageRecordIter2 optimized for CPU backend.Use omp threads instead of preprocess_threads.
[00:37:40] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./data/val_256_q90.rec, use 16 threads for decoding..
[00:37:40] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7401, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
INFO:logger:Loading symbol from file /home/ubuntu/software/incubator-mxnet/example/quantization/./model/imagenet1k-resnet-152-quantized-5batches-naive-symbol.json
INFO:logger:Loading params from file /home/ubuntu/software/incubator-mxnet/example/quantization/./model/imagenet1k-resnet-152-quantized-0000.params
INFO:logger:Skipping the first 50 batches
INFO:logger:Running model ./model/imagenet1k-resnet-152-quantized-5batches-naive-symbol.json for inference
[00:37:43] ../src/executor/graph_executor.cc:1982: Subgraph backend MKLDNN is activated.
INFO:logger:Finished inference with 32000 images
INFO:logger:Finished with 1.495486 images per second
WARNING:logger:Note: GPU performance is expected to be slower than CPU. Please refer quantization/README.md for details
INFO:logger:('accuracy', 0.76328125)
INFO:logger:('top_k_accuracy_5', 0.92859375)
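
To put a number on the gap between the two runs, the throughput lines can be pulled from the logs and compared. This is a small sketch using only the figures reported above; the helper name `images_per_second` is illustrative, not part of the example scripts:

```python
import re

# Matches the throughput line printed by imagenet_inference.py in the logs above.
RATE = re.compile(r"Finished with ([0-9.]+) images per second")


def images_per_second(log_text):
    """Extract the reported throughput from a chunk of log output."""
    match = RATE.search(log_text)
    if match is None:
        raise ValueError("no throughput line found")
    return float(match.group(1))


fp32_log = "INFO:logger:Finished with 22.124158 images per second"
int8_log = "INFO:logger:Finished with 1.495486 images per second"

slowdown = images_per_second(fp32_log) / images_per_second(int8_log)
accuracy_drop = 0.7676875 - 0.76328125  # top-1 accuracy, float32 minus quantized

print(f"quantized model is {slowdown:.1f}x slower")
print(f"top-1 accuracy drop: {accuracy_drop:.4f}")
```

On these logs the quantized run is roughly 15 times slower, while top-1 accuracy changes by less than half a percentage point.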

@wuxun-zhang
Contributor

@zhhoper May I know the exact command you used to build MXNet from source, and your complete benchmark commands? Thanks.

@zhhoper
Author

zhhoper commented Jan 17, 2020

Hi, building mxnet from source does not seem to work. I installed mxnet with pip; it can compress the network, but the run time is very slow. The mxnet version is 1.6.0.

@venkat-kittu

I am facing a similar issue: I get a segmentation fault when quantizing the network with calib-mode="entropy", but calib-mode="naive" works fine.

My mxnet version is 1.6.0, which I installed with pip as follows:

pip3 install mxnet-cu101mkl

Below are the command I ran and the error I got:

python imagenet_gen_qsym_mkldnn.py --model=vgg19 --num-calib-batches=782 --calib-mode=entropy

INFO:logger:Collecting layer sg_mkldnn_conv_act_12_output histogram of shape (32, 512, 14, 14)
Segmentation fault: 11
Stack trace:
[bt] (0) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8280) [0x7fec58aab280]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7fecb3e05f20]
[bt] (2) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3addaef) [0x7fec583e0aef]
[bt] (3) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Reorder2Default() const+0x4fe) [0x7fec583e402e]
[bt] (4) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x482) [0x7fec582795b2]
[bt] (5) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)+0x463) [0x7fec58279c13]
[bt] (6) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Imperative::InvokeOp(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode, mxnet::OpStatePtr)+0x481) [0x7fec5827b711]
[bt] (7) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x25b) [0x7fec5827be4b]
[bt] (8) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3839f1f) [0x7fec5813cf1f]
Segmentation fault: 11
Stack trace:
[bt] (0) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8280) [0x7fec58aab280]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7fecb3e05f20]
[bt] (2) /opt/conda/lib/python3.6/site-packages/mxnet/libmkldnn.so.1(+0x232392) [0x7fecaf3b5392]
[bt] (3) /opt/conda/lib/python3.6/site-packages/mxnet/libmkldnn.so.1(mkldnn_memory_create+0xc0) [0x7fecaf3b70e0]
[bt] (4) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x70b13b) [0x7fec5500e13b]
[bt] (5) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ac5770) [0x7fec583c8770]
[bt] (6) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::SetMKLMem(mxnet::TShape const&, int)+0x2b4) [0x7fec583ccb84]
[bt] (7) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::GetMKLDNNData() const+0x70) [0x7fec583d1ae0]
[bt] (8) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::op::SgMKLDNNConvOperator::Forward(mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x1cb7) [0x7fec5523d147]

@venkat-kittu

I have tried the mxnet docker image, and now I get a new error when running the same command:
INFO:logger:Collected statistics from 200 batches with batch_size=32
INFO:logger:Collected layer outputs from FP32 model using 6400 examples
INFO:logger:Calculating optimal thresholds for quantization
INFO:logger:Calculating optimal thresholds for quantization using KL divergence with num_quantized_bins=255
terminate called after throwing an instance of 'dmlc::Error'
what(): [05:01:48] src/operator/quantization/calibrate.cc:81: Check failed: p[i] > 0 && q[i] > 0:

Aborted (core dumped)

What is happening?
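
For context on the failed check: KL divergence sums p[i] * log(p[i] / q[i]) over histogram bins, so it is undefined whenever a bin in either distribution is zero. The sketch below is plain Python for illustration only, not the code in calibrate.cc. It assumes the failure comes from empty bins, which the thread does not confirm, and the `smooth` helper is a hypothetical workaround rather than MXNet's own smoothing:

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total = 0.0
    for i, (pi, qi) in enumerate(zip(p, q)):
        # Same precondition as the failed check: both bins must be positive.
        if not (pi > 0 and qi > 0):
            raise ValueError(f"bin {i}: need p[i] > 0 and q[i] > 0, got p={pi}, q={qi}")
        total += pi * math.log(pi / qi)
    return total


def smooth(hist, eps=1e-4):
    """Hypothetical smoothing: add eps to every bin, then renormalize to sum to 1."""
    shifted = [h + eps for h in hist]
    norm = sum(shifted)
    return [s / norm for s in shifted]


p = [1.0, 0.0]  # one empty bin
q = [0.5, 0.5]
try:
    kl_divergence(p, q)
except ValueError as err:
    print("check failed:", err)
print("after smoothing:", kl_divergence(smooth(p), smooth(q)))
```

Whether an empty bin is what trips the check here would need to be confirmed against the calibration code itself.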

@wuxun-zhang
Contributor

wuxun-zhang commented Mar 11, 2020

@venkat-kittu Can you try again with the latest nightly build, installed via pip install --pre mxnet -f https://dist.mxnet.io/python/cpu ? A fix was previously merged into the master branch.

@venkat-kittu

@wuxun-zhang No, it is still not working. It only works when I set --num-calib-batches to 33 or lower; it fails for any higher number.

@wuxun-zhang
Contributor

> I have tried mxnet docker image and now I am getting a new error while running same command as below
> INFO:logger:Collected statistics from 200 batches with batch_size=32
> INFO:logger:Collected layer outputs from FP32 model using 6400 examples
> INFO:logger:Calculating optimal thresholds for quantization
> INFO:logger:Calculating optimal thresholds for quantization using KL divergence with num_quantized_bins=255
> terminate called after throwing an instance of 'dmlc::Error'
> what(): [05:01:48] src/operator/quantization/calibrate.cc:81: Check failed: p[i] > 0 && q[i] > 0:
>
> Aborted (core dumped)
>
> What is happening?

I can exactly reproduce this issue (when num-calib-batches is higher than 33) with the latest master. I will look into this.

@wuxun-zhang
Contributor

wuxun-zhang commented Mar 13, 2020

@venkat-kittu I have just provided a patch here: wuxun-zhang@c06a715. Could you please try it out and verify whether it resolves your issue? Thanks.

@venkat-kittu

Sorry for the late reply. I have set this aside for now, but I will let you know when I pick it up again.
Thanks for the help.
