Cannot run quantization example #17231

zhhoper · 2020-01-07T00:07:35Z

Description

I tried to run the quantization example:
python imagenet_gen_qsym_mkldnn.py
and hit a segmentation fault. The full output is as follows:

Error Message

(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=10 before running your script.)
INFO:logger:Namespace(batch_size=32, calib_dataset='data/val_256_q90.rec', calib_mode='entropy', data_nthreads=60, enable_calib_quantize=True, epoch=0, exclude_first_conv=False, image_shape='3,224,224', label_name='softmax_label', model='resnet50_v1', no_pretrained=False, num_calib_batches=10, quantized_dtype='auto', quiet=False, shuffle_chunk_seed=3982304, shuffle_dataset=True, shuffle_seed=48564309)
INFO:logger:shuffle_dataset=True
INFO:logger:calibration mode set to entropy
INFO:logger:Get pre-trained model from MXNet or Gluoncv modelzoo.
INFO:logger:If you want to use custom model, please set --no-pretrained.
INFO:logger:model resnet50_v1 is converted from GluonCV
INFO:logger:Converting model from Gluon-CV ModelZoo resnet50_v1... into path /home/ubuntu/software/incubator-mxnet/example/quantization/model
Model file is not found. Downloading.
Downloading /home/ubuntu/.mxnet/models/resnet50_v1-cc729d95.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet50_v1-cc729d95.zip...
100%|██████████████████████████████| 57421/57421 [00:00<00:00, 57938.39KB/s]
/home/ubuntu/anaconda3/envs/mxnet_0.15/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/module/base_module.py:67: UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['softmax_label'])
warnings.warn(msg)
[00:03:02] ../src/executor/graph_executor.cc:1982: Subgraph backend MKLDNN is activated.
INFO:logger:batch size = 32 for calibration
INFO:logger:number of batches = 10 for calibration
INFO:logger:These layers have been excluded []
INFO:logger:label_name = softmax_label
INFO:logger:Input data shape = (3, 224, 224)
INFO:logger:rgb_mean = 123.68,116.779,103.939
INFO:logger:rgb_std = 58.393, 57.12, 57.375
INFO:logger:Creating ImageRecordIter for reading calibration dataset
[00:03:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/val_256_q90.rec, use 16 threads for decoding..

Segmentation fault: 11
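The issue template asks for a stack trace captured with DMLC_LOG_STACK_TRACE_DEPTH=10. A minimal sketch for relaunching the script with that variable set is shown below. It also passes Python's `-X faulthandler` option so a Python-level traceback may be dumped on a crash. Whether faulthandler still fires once libmxnet installs its own crash handler is not guaranteed. The helper name `run_with_traces` is hypothetical, and the script name is simply the one used in this thread.

```python
import os
import subprocess
import sys


def run_with_traces(script, args=(), depth=10):
    """Run a Python script with a deeper native stack trace requested.

    Hypothetical helper: sets DMLC_LOG_STACK_TRACE_DEPTH (the variable the
    MXNet issue template mentions) and enables Python's faulthandler via
    `-X faulthandler`. Returns the child process's exit code.
    """
    env = dict(os.environ)
    env["DMLC_LOG_STACK_TRACE_DEPTH"] = str(depth)
    cmd = [sys.executable, "-X", "faulthandler", script, *args]
    return subprocess.run(cmd, env=env).returncode
```

For example, `run_with_traces("imagenet_gen_qsym_mkldnn.py", ["--calib-mode=entropy"])` reruns the failing calibration with the extra tracing enabled. A nonzero exit code means the run did not finish cleanly.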


eric-haibin-lin · 2020-01-07T00:12:37Z

@PatricZhao @TaoLv @ZhennanQin

ZhennanQin · 2020-01-07T00:15:55Z

@eric-haibin-lin Thanks for reporting this. Could you tell me whether calib-mode=naive also crashes?

zhhoper · 2020-01-07T00:19:00Z

@ZhennanQin I tried setting calib-mode to 'naive' and got the same error. The error message is as follows:

INFO:logger:Namespace(batch_size=32, calib_dataset='data/val_256_q90.rec', calib_mode='naive', data_nthreads=60, enable_calib_quantize=True, epoch=0, exclude_first_conv=False, image_shape='3,224,224', label_name='softmax_label', model='resnet50_v1', no_pretrained=False, num_calib_batches=10, quantized_dtype='auto', quiet=False, shuffle_chunk_seed=3982304, shuffle_dataset=True, shuffle_seed=48564309)
INFO:logger:shuffle_dataset=True
INFO:logger:calibration mode set to naive
INFO:logger:Get pre-trained model from MXNet or Gluoncv modelzoo.
INFO:logger:If you want to use custom model, please set --no-pretrained.
INFO:logger:model resnet50_v1 is converted from GluonCV
INFO:logger:Converting model from Gluon-CV ModelZoo resnet50_v1... into path /home/ubuntu/software/incubator-mxnet/example/quantization/model
/home/ubuntu/anaconda3/envs/mxnet_0.15/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/module/base_module.py:67: UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['softmax_label'])
warnings.warn(msg)
[00:17:25] ../src/executor/graph_executor.cc:1982: Subgraph backend MKLDNN is activated.
INFO:logger:batch size = 32 for calibration
INFO:logger:number of batches = 10 for calibration
INFO:logger:These layers have been excluded []
INFO:logger:label_name = softmax_label
INFO:logger:Input data shape = (3, 224, 224)
INFO:logger:rgb_mean = 123.68,116.779,103.939
INFO:logger:rgb_std = 58.393, 57.12, 57.375
INFO:logger:Creating ImageRecordIter for reading calibration dataset
[00:17:26] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/val_256_q90.rec, use 16 threads for decoding..

Segmentation fault: 11

ZhennanQin · 2020-01-07T00:21:07Z

@zhhoper Thanks for the information. Will investigate this soon.

wuxun-zhang · 2020-01-07T01:49:41Z

@zhhoper On my side, I cannot reproduce this issue with the latest master on my local machine. Which MXNet version (commit id) are you using? We recently submitted a PR that fixes a _copyto issue when using calib_mode=entropy. Could you try again with commit e65fc4b or a later master? Please let us know if you have any questions. Thanks.

wuxun-zhang · 2020-01-14T02:47:04Z

@zhhoper Any update for this issue?

zhhoper · 2020-01-14T22:13:50Z

@wuxun-zhang Sorry, I haven't had a chance to look at this since reporting the bug. I will take a look and let you know whether the bug is still there.

zhhoper · 2020-01-15T08:07:49Z

@wuxun-zhang @ZhennanQin I ran the example using MXNet 1.6.0 and it seems to work. However, the quantized model runs much slower (more than 10 times slower) than the original one. Is there anything I need to set up to speed up the quantized model?
I tested ResNet-152.
For float32:
command:
python imagenet_inference.py --symbol-file=./model/imagenet1k-resnet-152-symbol.json --param-file=./model/imagenet1k-resnet-152-0000.params --num-skipped-batches=50 --batch-size=64 --num-inference-batches=500 --dataset=./data/val_256_q90.rec --ctx=cpu
Output:
INFO:logger:batch size = 64 for inference
INFO:logger:rgb_mean = 0,0,0
INFO:logger:rgb_std = 1,1,1
INFO:logger:label_name = softmax_label
INFO:logger:Input data shape = (3, 224, 224)
INFO:logger:Dataset for inference: ./data/val_256_q90.rec
[07:03:16] ../src/io/iter_image_recordio_2.cc:831: Create ImageRecordIter2 optimized for CPU backend.Use omp threads instead of preprocess_threads.
[07:03:16] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./data/val_256_q90.rec, use 16 threads for decoding..
[07:03:16] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7401, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
INFO:logger:Loading symbol from file /home/ubuntu/software/incubator-mxnet/example/quantization/./model/imagenet1k-resnet-152-symbol.json
[07:03:18] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[07:03:18] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
INFO:logger:Loading params from file /home/ubuntu/software/incubator-mxnet/example/quantization/./model/imagenet1k-resnet-152-0000.params
INFO:logger:Skipping the first 50 batches
INFO:logger:Running model ./model/imagenet1k-resnet-152-symbol.json for inference
[07:03:19] ../src/executor/graph_executor.cc:1982: Subgraph backend MKLDNN is activated.
INFO:logger:Finished inference with 32000 images
INFO:logger:Finished with 22.124158 images per second
WARNING:logger:Note: GPU performance is expected to be slower than CPU. Please refer quantization/README.md for details
INFO:logger:('accuracy', 0.7676875)
INFO:logger:('top_k_accuracy_5', 0.93034375)

For the quantized model:
command:
python imagenet_inference.py --symbol-file=./model/imagenet1k-resnet-152-quantized-5batches-naive-symbol.json --param-file=./model/imagenet1k-resnet-152-quantized-0000.params --num-skipped-batches=50 --batch-size=64 --num-inference-batches=500 --dataset=./data/val_256_q90.rec --ctx=cpu
Output:
INFO:logger:batch size = 64 for inference
INFO:logger:rgb_mean = 0,0,0
INFO:logger:rgb_std = 1,1,1
INFO:logger:label_name = softmax_label
INFO:logger:Input data shape = (3, 224, 224)
INFO:logger:Dataset for inference: ./data/val_256_q90.rec
[00:37:40] ../src/io/iter_image_recordio_2.cc:831: Create ImageRecordIter2 optimized for CPU backend.Use omp threads instead of preprocess_threads.
[00:37:40] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./data/val_256_q90.rec, use 16 threads for decoding..
[00:37:40] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7401, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
INFO:logger:Loading symbol from file /home/ubuntu/software/incubator-mxnet/example/quantization/./model/imagenet1k-resnet-152-quantized-5batches-naive-symbol.json
INFO:logger:Loading params from file /home/ubuntu/software/incubator-mxnet/example/quantization/./model/imagenet1k-resnet-152-quantized-0000.params
INFO:logger:Skipping the first 50 batches
INFO:logger:Running model ./model/imagenet1k-resnet-152-quantized-5batches-naive-symbol.json for inference
[00:37:43] ../src/executor/graph_executor.cc:1982: Subgraph backend MKLDNN is activated.
INFO:logger:Finished inference with 32000 images
INFO:logger:Finished with 1.495486 images per second
WARNING:logger:Note: GPU performance is expected to be slower than CPU. Please refer quantization/README.md for details
INFO:logger:('accuracy', 0.76328125)
INFO:logger:('top_k_accuracy_5', 0.92859375)
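The two runs above can be compared directly from the numbers in their logs. A small worked calculation, using only values printed there (throughput, top-1 accuracy, and the batch settings from the commands):

```python
# Values copied from the two imagenet_inference.py logs above (ResNet-152, batch size 64).
fp32_ips = 22.124158    # FP32 model, images per second
int8_ips = 1.495486     # quantized model, images per second
fp32_top1 = 0.7676875
int8_top1 = 0.76328125
num_images = 500 * 64   # --num-inference-batches=500 times --batch-size=64

slowdown = fp32_ips / int8_ips                   # how many times slower the quantized model is
top1_drop_pts = (fp32_top1 - int8_top1) * 100    # top-1 accuracy drop in percentage points
fp32_seconds = num_images / fp32_ips             # run time implied by the reported throughput
int8_seconds = num_images / int8_ips

print(f"slowdown: {slowdown:.1f}x")
print(f"top-1 drop: {top1_drop_pts:.2f} points")
print(f"implied run time: {fp32_seconds:.0f} s (FP32) vs {int8_seconds:.0f} s (quantized)")
```

The product 500 × 64 matches the "Finished inference with 32000 images" line in both logs. The slowdown works out to roughly 15x, while top-1 accuracy drops by about 0.44 points.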

wuxun-zhang · 2020-01-16T02:15:48Z

@zhhoper Could you share the exact command you used to build MXNet from source, and your complete benchmark commands? Thanks.

zhhoper · 2020-01-17T22:44:09Z

Hi, building MXNet from source does not seem to work for me. I installed MXNet with pip instead. With it I can quantize the network, but the quantized model runs very slowly. The MXNet version is 1.6.0.
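When comparing a pip wheel with a source build, it helps to record exactly which MXNet build is being imported. The sketch below does this, assuming MXNet 1.5 or later, where `mxnet.runtime.Features` is available. Feature names such as "MKLDNN" are the MXNet 1.x spellings and are looked up only if present. The helper name `describe_mxnet_build` is hypothetical.

```python
def describe_mxnet_build(flags=("MKLDNN", "CUDA", "CUDNN")):
    """Return the imported MXNet version, location and selected build flags.

    Hypothetical helper; relies on mxnet.runtime.Features (MXNet >= 1.5).
    Returns None when MXNet is not installed.
    """
    try:
        import mxnet as mx
        from mxnet.runtime import Features
    except ImportError:
        return None
    features = Features()  # dict-like mapping of feature name -> Feature
    info = {
        "version": mx.__version__,
        "path": mx.__file__,  # shows whether a pip wheel or a local build is imported
    }
    for name in flags:
        # Only query names this build knows about; unknown names are reported as None.
        info[name] = features.is_enabled(name) if name in features else None
    return info


info = describe_mxnet_build()
print(info if info is not None else "mxnet is not installed")
```

Running this in both environments gives a concrete answer to the "which build" question before comparing benchmark numbers.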

venkat-kittu · 2020-03-09T06:20:11Z

I am also facing this issue. I get the segmentation fault when quantizing the network with calib-mode="entropy", while calib-mode="naive" works fine.

My MXNet version is 1.6.0, installed with pip as follows:

pip3 install mxnet-cu101mkl

Below is the command I executed and the error I got:

python imagenet_gen_qsym_mkldnn.py --model=vgg19 --num-calib-batches=782 --calib-mode=entropy

INFO:logger:Collecting layer sg_mkldnn_conv_act_12_output histogram of shape (32, 512, 14, 14)
Segmentation fault: 11
Stack trace:
[bt] (0) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8280) [0x7fec58aab280]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7fecb3e05f20]
[bt] (2) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3addaef) [0x7fec583e0aef]
[bt] (3) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Reorder2Default() const+0x4fe) [0x7fec583e402e]
[bt] (4) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x482) [0x7fec582795b2]
[bt] (5) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)+0x463) [0x7fec58279c13]
[bt] (6) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Imperative::InvokeOp(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode, mxnet::OpStatePtr)+0x481) [0x7fec5827b711]
[bt] (7) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x25b) [0x7fec5827be4b]
[bt] (8) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3839f1f) [0x7fec5813cf1f]

Segmentation fault: 11
Stack trace:
[bt] (0) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8280) [0x7fec58aab280]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7fecb3e05f20]
[bt] (2) /opt/conda/lib/python3.6/site-packages/mxnet/libmkldnn.so.1(+0x232392) [0x7fecaf3b5392]
[bt] (3) /opt/conda/lib/python3.6/site-packages/mxnet/libmkldnn.so.1(mkldnn_memory_create+0xc0) [0x7fecaf3b70e0]
[bt] (4) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x70b13b) [0x7fec5500e13b]
[bt] (5) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ac5770) [0x7fec583c8770]
[bt] (6) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::SetMKLMem(mxnet::TShape const&, int)+0x2b4) [0x7fec583ccb84]
[bt] (7) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::GetMKLDNNData() const+0x70) [0x7fec583d1ae0]
[bt] (8) /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::op::SgMKLDNNConvOperator::Forward(mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x1cb7) [0x7fec5523d147]

venkat-kittu · 2020-03-11T05:18:10Z

I tried the MXNet Docker image, and now I get a new error when running the same command:
INFO:logger:Collected statistics from 200 batches with batch_size=32
INFO:logger:Collected layer outputs from FP32 model using 6400 examples
INFO:logger:Calculating optimal thresholds for quantization
INFO:logger:Calculating optimal thresholds for quantization using KL divergence with num_quantized_bins=255
terminate called after throwing an instance of 'dmlc::Error'
what(): [05:01:48] src/operator/quantization/calibrate.cc:81: Check failed: p[i] > 0 && q[i] > 0:

Aborted (core dumped)

What is happening?
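The failing check, `p[i] > 0 && q[i] > 0`, appears to sit in the entropy (KL-divergence) calibration path. KL(P || Q) = sum_i p_i * log(p_i / q_i) is undefined when a bin of Q is zero, and any such term is invalid. A common remedy is to smooth both histograms before comparing them. The pure-Python sketch below illustrates that idea with a simple epsilon-smoothing scheme. It is not MXNet's implementation, and the names `smooth` and `kl_divergence` are hypothetical. It also shows one way such a positivity check can still trip: a non-empty bin smaller than the mass taken from it. This thread does not establish whether that is what happens here.

```python
import math


def smooth(hist, eps=1e-4):
    """Give empty bins a little mass so every bin is > 0 (illustrative sketch).

    Empty bins get +eps; the same total is taken evenly from the non-empty
    bins, so the histogram's sum is preserved.
    """
    n_zeros = sum(1 for h in hist if h == 0)
    n_nonzeros = len(hist) - n_zeros
    if n_nonzeros == 0:
        raise ValueError("histogram is all zeros; KL divergence is undefined")
    eps1 = eps * n_zeros / n_nonzeros
    out = [h + eps if h == 0 else h - eps1 for h in hist]
    if any(v <= 0 for v in out):
        # A non-empty bin was smaller than eps1, so smoothing pushed it to <= 0.
        raise ValueError("smoothing left a non-positive bin")
    return out


def kl_divergence(p, q):
    """KL(P || Q) for two histograms over the same bins, normalized first."""
    sp, sq = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / sp, qi / sq
        if not (pi > 0 and qi > 0):
            raise ValueError("p[i] > 0 && q[i] > 0 violated")
        total += pi * math.log(pi / qi)
    return total


p = [0, 5, 10, 5, 0]   # reference histogram with empty edge bins
q = [1, 4, 10, 4, 1]   # candidate (quantized) histogram
print(kl_divergence(smooth(p), smooth(q)))
```

Calling `kl_divergence(p, q)` on the raw histograms raises, because `p` has empty bins. After smoothing, both inputs are strictly positive and the divergence is finite.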

wuxun-zhang · 2020-03-11T06:07:19Z

@venkat-kittu Can you try again with the latest nightly build via pip install --pre mxnet -f https://dist.mxnet.io/python/cpu ? We previously merged a fix into the master branch.

venkat-kittu · 2020-03-11T11:31:26Z

@wuxun-zhang No, it's still not working. It only works when I set --num-calib-batches to 33 or fewer; any higher number fails.
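Because the failure depends on --num-calib-batches (33 works, higher values fail), the boundary can be located automatically with a binary search over that flag. The sketch below assumes failures are monotonic in the batch count. The helpers `find_max_passing` and `calibration_passes` are hypothetical, and the command line reuses the flags from this thread.

```python
import subprocess
import sys


def calibration_passes(num_batches):
    """Hypothetical check: run the calibration script, report whether it exits cleanly."""
    cmd = [sys.executable, "imagenet_gen_qsym_mkldnn.py", "--model=vgg19",
           "--calib-mode=entropy", f"--num-calib-batches={num_batches}"]
    return subprocess.run(cmd).returncode == 0


def find_max_passing(lo, hi, passes):
    """Largest n in [lo, hi] with passes(n) True.

    Assumes passes(n) is True up to some threshold and False above it.
    Returns lo - 1 if even lo fails.
    """
    best = lo - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if passes(mid):
            best = mid      # mid works, so search higher
            lo = mid + 1
        else:
            hi = mid - 1    # mid fails, so search lower
    return best
```

For example, `find_max_passing(1, 782, calibration_passes)` needs about 10 full calibration runs instead of trying every value. Each probe is still a complete calibration, so this is only worth it when runs are slow.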

wuxun-zhang · 2020-03-12T01:16:24Z

I can reproduce this issue exactly (when num-calib-batches is higher than 33) with the latest master. I will look into it.

wuxun-zhang · 2020-03-13T02:57:15Z

@venkat-kittu I have just pushed a patch, wuxun-zhang@c06a715. Could you please try it and verify whether it resolves your issue? Thanks.

venkat-kittu · 2020-03-17T04:55:04Z

Sorry for the late reply. I have set this aside for now, but I will let you know when I pick it up again.
Thanks for the help.

zhhoper added the Bug label Jan 7, 2020

pengzhao-intel added the Quantization Issues/Feature Requests related to Quantization label Jan 7, 2020
