This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Failed to find any forward convolution algorithm. #11176

Closed
ThomasDelteil opened this issue Jun 6, 2018 · 9 comments

Comments

@ThomasDelteil
Contributor

See this test failing: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/915/pipeline/

with

 src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.

I have also encountered this in the wild, though very rarely.

@marcoabreu
Contributor

@DickJC123

@lanking520
Member

Hi @nswamy, can you add the 'CI' label to this one?

@marcoabreu
Contributor

This is not CI related

@eric-haibin-lin
Member

Is it possible that memory is exhausted on CI?

@aluo-x

aluo-x commented Jun 19, 2018

Also encountered this error on Windows with CUDA 9.2 and cuDNN 7.1.4.

Ran the following command:

    python train_imagenet.py --benchmark 1 --gpus 0 --network inception-v3 --batch-size 64 --image-shape 3,299,299 --num-epochs 1 --kv-store device

Reducing the batch size to 16 resolved the issue.
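
For a rough sense of why the smaller batch helps (a back-of-the-envelope sketch of my own, not a measurement from this run): activation tensors and the cuDNN workspace both grow with batch size, so going from 64 to 16 cuts the per-layer activation memory by 4x.

    # Illustrative sketch only: memory for a single NCHW activation tensor
    # scales linearly with batch size (float32 assumed).
    def activation_bytes(batch, channels, height, width, dtype_bytes=4):
        return batch * channels * height * width * dtype_bytes

    # Inception-v3 input shape 3,299,299 from the command above:
    for bs in (64, 16):
        mib = activation_bytes(bs, 3, 299, 299) / 2**20
        print("batch %d: input tensor alone ~%.1f MiB" % (bs, mib))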

@ghost

ghost commented Jul 7, 2018

Currently facing this issue. Reducing batch size does not seem to fix the issue.

I'm trying to train a fast neural style transfer model.

The issue seems to arise when calling mod.save_params(), which throws the following error:

mxnet.base.MXNetError: [23:59:34] src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.

@ghost

ghost commented Jul 9, 2018

Update: I've managed to find a rather bizarre workaround to this issue.

I was facing this issue when calling mod.save_checkpoint(). However, if I caught the exception and saved again in the except block, it seemed to work flawlessly:

    # If the first save fails (e.g. with the cuDNN error above), print it and retry once.
    try:
        mod.save_checkpoint(model_save_path, epoch)
    except Exception as excep:
        print("Exception caught: ", excep)
        mod.save_checkpoint(model_save_path, epoch)

@ghost

ghost commented Jul 10, 2018

Update: Sleeping for 0.5 seconds before saving the checkpoint also seems to help.

    import time

    time.sleep(0.5)
    mod.save_checkpoint(model_save_path, epoch)
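
For what it's worth, the two workarounds above could be folded into a single bounded retry. A sketch under that assumption (the helper name and the retry/delay values are mine, not from MXNet):

    import time

    # Retry mod.save_checkpoint() a few times with a short pause between attempts.
    def save_checkpoint_with_retry(mod, prefix, epoch, retries=3, delay=0.5):
        for attempt in range(retries):
            try:
                mod.save_checkpoint(prefix, epoch)
                return
            except Exception as excep:
                print("save_checkpoint attempt %d failed: %s" % (attempt + 1, excep))
                time.sleep(delay)
        # Final attempt; let any remaining exception propagate.
        mod.save_checkpoint(prefix, epoch)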

@haojin2
Contributor

haojin2 commented Jul 20, 2018

@ThomasDelteil Is this still occurring on CI? If it's no longer appearing, would you mind closing this?

@aluo-x @Codewithsk Usually this is caused by running out of GPU memory; reducing GPU memory consumption, for example by lowering the batch size or using a smaller model, should help. If you run into more issues, please create a separate issue with a title like "GPU memory overflow on xxx model with yyy batch size and zzz dataset". Meanwhile I'll look for ways to improve this error message so that it points to the actual root cause.
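
If you want to confirm that memory pressure is the cause before opening a new issue, something like this can be run alongside training (a sketch that assumes nvidia-smi is on PATH; it is not part of MXNet):

    import subprocess

    # Print per-GPU used/total memory as reported by the driver.
    def gpu_memory_report():
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,memory.used,memory.total",
             "--format=csv,noheader"])
        print(out.decode().strip())

    gpu_memory_report()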
