IndexError in gluon-cv Mask-RCNN validation on master #17485

Closed
Kh4L opened this issue Jan 30, 2020 · 6 comments · Fixed by dmlc/gluon-cv#1594

@Kh4L (Contributor)

Kh4L commented Jan 30, 2020

Description

An IndexError occurs during the first validation step when training the gluon-cv Mask-RCNN with Horovod.

Error Message

[1,2]<stderr>:IndexError: index 999 is out of bounds for axis 1 with size 500
[1,5]<stderr>:Traceback (most recent call last):
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 615, in <module>
[1,5]<stderr>:    train(net, train_data, val_data, eval_metric, batch_size, ctx, logger, args)
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 540, in train
[1,5]<stderr>:    args)
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 275, in validate
[1,5]<stderr>:    det_bbox = det_bbox[i].asnumpy()
[1,5]<stderr>:  File "/workspace/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2554, in asnumpy
[1,5]<stderr>:    ctypes.c_size_t(data.size)))
[1,5]<stderr>:  File "/workspace/incubator-mxnet/python/mxnet/base.py", line 273, in check_call
[1,5]<stderr>:    raise get_last_ffi_error()
[1,5]<stderr>:IndexError: Traceback (most recent call last):
[1,5]<stderr>:  File "src/operator/tensor/indexing_op.cu", line 461
[1,5]<stderr>:IndexError: index 999 is out of bounds for axis 1 with size 500
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[39172,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Steps to reproduce

  1. Compile and install MXNet from master
  2. Clone gluon-cv master
  3. horovodrun -np 8 -H localhost:8 python ./gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py --dataset coco -j 4 --log-interval 1000 --use-fpn --horovod --amp --batch-size 16 --lr 0.02 --lr-warmup 500 --epochs 1

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
Kh4L added the Bug label on Jan 30, 2020
Kh4L mentioned this issue on Feb 13, 2020
@ChaiBapchya (Contributor)

@Kh4L, was this issue fixed? If yes, what was the workaround?

@karan6181 (Contributor)

The issue occurs when the batch size per GPU is greater than 1. With a batch size of 1 per GPU, validation works fine; with a larger batch size per GPU, it fails with IndexError: index 999 is out of bounds for axis 1 with size 500. Looks like a bug. FYI @zhreshold

@karan6181 (Contributor)

I did some analysis on different batch_size hyperparameter configurations:

  1. If batch_size=1 per GPU, then training plus validation (after every epoch) works without any issue.
  2. If batch_size=2 per GPU, then training works (if we don't run validation at all).
  3. If batch_size=2 per GPU, then training works, but validation fails, irrespective of whether validation runs after every epoch or only at the end of training.
  4. If we save the model params after training with batch_size=1 per GPU and then run validation separately by loading the same params with batch_size=1 per GPU, it works; however, with batch_size=2 per GPU, validation fails with the same loaded params.

Note: Validation doesn't support multi-batch, meaning it always runs with 1 image per GPU irrespective of the batch_size provided by the user.
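
A minimal sketch of where such an out-of-bounds index can come from (illustrative shapes only, not taken from the gluon-cv code): reshaping per-image outputs for a presumed batch of 2 halves the size of the ROI axis, while indices for a single image still run up to 999.

# Hypothetical illustration, not the actual gluon-cv validation code.
import mxnet as mx

rois = mx.nd.arange(1000 * 4).reshape((1000, 4))   # 1000 ROIs from one image

# A model built with batch_size=2 splits the ROI axis into (batch, rois_per_image):
split = rois.reshape((-4, 2, -1, 0))               # shape becomes (2, 500, 4)
print(split.shape)

# Indexing position 999 along axis 1 then fails, e.g. in numpy terms:
split.asnumpy()[:, 999]   # IndexError: index 999 is out of bounds for axis 1 with size 500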

@zhreshold (Member)

@karan6181 After investigating Mask-RCNN, I think a simple fix should do the trick to allow validation to work with models initialized with batch_size > 1:

here: https://github.com/dmlc/gluon-cv/blob/ecf491685018e951bec12f9e45bc482749de4f85/gluoncv/model_zoo/rcnn/mask_rcnn/mask_rcnn.py#L85

we can modify this line to

if autograd.is_training():
    x = x.reshape((-4, self._batch_images, -1, 0, 0, 0))
else:
    # always use batch_size = 1 for inference
    x = x.reshape((-4, 1, -1, 0, 0, 0))
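
For context, a standalone sketch of what the two reshape branches do (illustrative shapes, not the actual ROI feature tensor); in MXNet's reshape, -4 splits a dimension in two, -1 infers a size, and 0 copies the corresponding input dimension:

# Illustration of the reshape special codes used in the proposed fix.
import mxnet as mx

pooled = mx.nd.zeros((1000, 256, 14, 14))   # hypothetical pooled ROI features

# Training path with self._batch_images == 2: the ROI axis is split per image.
print(pooled.reshape((-4, 2, -1, 0, 0, 0)).shape)   # (2, 500, 256, 14, 14)

# Inference path forced to batch_size = 1: all ROIs stay on one axis,
# so indices up to 999 remain in bounds.
print(pooled.reshape((-4, 1, -1, 0, 0, 0)).shape)   # (1, 1000, 256, 14, 14)

At validation time autograd.is_training() is False, so the else branch runs regardless of the batch size the model was built with.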

@ChaiBapchya (Contributor)

ChaiBapchya commented Jan 22, 2021

@karan6181 tested it on EC2 instances and verified that it works, while I tested it in a Docker environment. In both places we were able to run training with per_device_batch_size > 1 while inference was set to per_device_batch_size = 1.
Thanks @zhreshold for pointing it out and @karan6181 for helping with the investigation.
