IndexError in gluon-cv Mask-RCNN validation on master #17485

Closed
Kh4L opened this issue Jan 30, 2020 · 6 comments · Fixed by dmlc/gluon-cv#1594

@Kh4L (Contributor)

Kh4L commented Jan 30, 2020

Description

An IndexError occurs during the first validation step when training the gluon-cv Mask-RCNN with Horovod.

Error Message

[1,2]<stderr>:IndexError: index 999 is out of bounds for axis 1 with size 500
[1,5]<stderr>:Traceback (most recent call last):
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 615, in <module>
[1,5]<stderr>:    train(net, train_data, val_data, eval_metric, batch_size, ctx, logger, args)
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 540, in train
[1,5]<stderr>:    args)
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 275, in validate
[1,5]<stderr>:    det_bbox = det_bbox[i].asnumpy()
[1,5]<stderr>:  File "/workspace/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2554, in asnumpy
[1,5]<stderr>:    ctypes.c_size_t(data.size)))
[1,5]<stderr>:  File "/workspace/incubator-mxnet/python/mxnet/base.py", line 273, in check_call
[1,5]<stderr>:    raise get_last_ffi_error()
[1,5]<stderr>:IndexError: Traceback (most recent call last):
[1,5]<stderr>:  File "src/operator/tensor/indexing_op.cu", line 461
[1,5]<stderr>:IndexError: index 999 is out of bounds for axis 1 with size 500
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[39172,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Steps to reproduce

  1. Compile and install MXNet from master
  2. Clone gluon-cv master
  3. horovodrun -np 8 -H localhost:8 python ./gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py --dataset coco -j 4 --log-interval 1000 --use-fpn --horovod --amp --batch-size 16 --lr 0.02 --lr-warmup 500 --epochs 1

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
Kh4L added the Bug label on Jan 30, 2020
Kh4L mentioned this issue on Feb 13, 2020
@ChaiBapchya (Contributor)

@Kh4L, was this issue fixed? If yes, what was the workaround?

@karan6181 (Contributor)

The issue occurs when the batch size per GPU is greater than 1. With a batch size of 1 per GPU, validation works fine; with a larger batch size per GPU, it fails with IndexError: index 999 is out of bounds for axis 1 with size 500. Looks like a bug. FYI @zhreshold

@karan6181 (Contributor)

I did some analysis on different batch_size hyperparameter configurations:

  1. If batch_size=1 per GPU, then training plus validation (after every epoch) works without any issue.
  2. If batch_size=2 per GPU, then training works (if we don't run validation at all).
  3. If batch_size=2 per GPU, then training works, but validation fails, irrespective of whether validation runs after every epoch or only at the end of training.
  4. If we save the model params after training with batch_size=1 per GPU and then run validation separately by loading the same params with batch_size=1 per GPU, it works; however, with batch_size=2 per GPU, validation fails with the same loaded params.

Note: Validation doesn't support multi-batch, meaning it always runs with 1 image per GPU irrespective of the batch_size provided by the user.
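
A minimal sketch of where such an out-of-bounds index can come from (illustrative shapes only, not taken from the gluon-cv code): reshaping per-image outputs for a presumed batch of 2 halves the size of the ROI axis, while indices for a single image still run up to 999.

# Hypothetical illustration, not the actual gluon-cv validation code.
import mxnet as mx

rois = mx.nd.arange(1000 * 4).reshape((1000, 4))   # 1000 ROIs from one image

# A model built with batch_size=2 splits the ROI axis into (batch, rois_per_image):
split = rois.reshape((-4, 2, -1, 0))               # shape becomes (2, 500, 4)
print(split.shape)

# Indexing position 999 along axis 1 then fails, e.g. in numpy terms:
split.asnumpy()[:, 999]   # IndexError: index 999 is out of bounds for axis 1 with size 500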

@zhreshold (Member)

@karan6181 After investigating Mask-RCNN, I think a simple fix should do the trick to allow validation to work with models initialized with batch_size > 1:

here: https://github.com/dmlc/gluon-cv/blob/ecf491685018e951bec12f9e45bc482749de4f85/gluoncv/model_zoo/rcnn/mask_rcnn/mask_rcnn.py#L85

we can modify this line to

if autograd.is_training():
    x = x.reshape((-4, self._batch_images, -1, 0, 0, 0))
else:
    # always use batch_size = 1 for inference
    x = x.reshape((-4, 1, -1, 0, 0, 0))
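
For context, a standalone sketch of what the two reshape branches do (illustrative shapes, not the actual ROI feature tensor); in MXNet's reshape, -4 splits a dimension in two, -1 infers a size, and 0 copies the corresponding input dimension:

# Illustration of the reshape special codes used in the proposed fix.
import mxnet as mx

pooled = mx.nd.zeros((1000, 256, 14, 14))   # hypothetical pooled ROI features

# Training path with self._batch_images == 2: the ROI axis is split per image.
print(pooled.reshape((-4, 2, -1, 0, 0, 0)).shape)   # (2, 500, 256, 14, 14)

# Inference path forced to batch_size = 1: all ROIs stay on one axis,
# so indices up to 999 remain in bounds.
print(pooled.reshape((-4, 1, -1, 0, 0, 0)).shape)   # (1, 1000, 256, 14, 14)

At validation time autograd.is_training() is False, so the else branch runs regardless of the batch size the model was built with.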

@ChaiBapchya (Contributor)

ChaiBapchya commented Jan 22, 2021

@karan6181 tested it on EC2 instances and verified that it works, while I tested it in a Docker environment. In both places we were able to run training with per_device_batch_size > 1 while inference was set to per_device_batch_size = 1.
Thanks @zhreshold for pointing it out and @karan6181 for helping with the investigation.
