This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Failed to find any forward convolution algorithm. #11176

Closed
ThomasDelteil opened this issue Jun 6, 2018 · 9 comments

Comments

@ThomasDelteil
Contributor

See this test failing: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/915/pipeline/

with

 src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.

I have also encountered this in the wild, though very rarely.

@marcoabreu
Contributor

@DickJC123

@lanking520
Member

Hi @nswamy, can you add the 'CI' label to this one?

@marcoabreu
Contributor

This is not CI related

@eric-haibin-lin
Member

Is it possible that memory is exhausted on CI?

@aluo-x

aluo-x commented Jun 19, 2018

Also encountered this error on Windows with CUDA 9.2 and cuDNN 7.1.4.

Ran the following command:

    python train_imagenet.py --benchmark 1 --gpus 0 --network inception-v3 --batch-size 64 --image-shape 3,299,299 --num-epochs 1 --kv-store device

Reducing the batch size to 16 resolved the issue.
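
For a rough sense of why the smaller batch helps (a back-of-the-envelope sketch of my own, not a measurement from this run): activation tensors and the cuDNN workspace both grow with batch size, so going from 64 to 16 cuts the per-layer activation memory by 4x.

    # Illustrative sketch only: memory for a single NCHW activation tensor
    # scales linearly with batch size (float32 assumed).
    def activation_bytes(batch, channels, height, width, dtype_bytes=4):
        return batch * channels * height * width * dtype_bytes

    # Inception-v3 input shape 3,299,299 from the command above:
    for bs in (64, 16):
        mib = activation_bytes(bs, 3, 299, 299) / 2**20
        print("batch %d: input tensor alone ~%.1f MiB" % (bs, mib))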

@ghost

ghost commented Jul 7, 2018

Currently facing this issue. Reducing batch size does not seem to fix the issue.

I'm trying to train a fast neural style transfer model.

The issue seems to arise when calling mod.save_params(), which throws the following error:

mxnet.base.MXNetError: [23:59:34] src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.

@ghost

ghost commented Jul 9, 2018

Update: I've managed to find a rather bizarre workaround to this issue.

I was facing this issue when calling mod.save_checkpoint(). However, if I caught the exception and saved again in the except block, it seemed to work flawlessly:

    # If the first save fails (e.g. with the cuDNN error above), print it and retry once.
    try:
        mod.save_checkpoint(model_save_path, epoch)
    except Exception as excep:
        print("Exception caught: ", excep)
        mod.save_checkpoint(model_save_path, epoch)

@ghost

ghost commented Jul 10, 2018

Update: Sleeping for 0.5 seconds before saving the checkpoint also seems to help.

    import time

    time.sleep(0.5)
    mod.save_checkpoint(model_save_path, epoch)
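
For what it's worth, the two workarounds above could be folded into a single bounded retry. A sketch under that assumption (the helper name and the retry/delay values are mine, not from MXNet):

    import time

    # Retry mod.save_checkpoint() a few times with a short pause between attempts.
    def save_checkpoint_with_retry(mod, prefix, epoch, retries=3, delay=0.5):
        for attempt in range(retries):
            try:
                mod.save_checkpoint(prefix, epoch)
                return
            except Exception as excep:
                print("save_checkpoint attempt %d failed: %s" % (attempt + 1, excep))
                time.sleep(delay)
        # Final attempt; let any remaining exception propagate.
        mod.save_checkpoint(prefix, epoch)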

@haojin2
Contributor

haojin2 commented Jul 20, 2018

@ThomasDelteil Is this still occurring on CI? If it's no longer appearing, would you mind closing this?

@aluo-x @Codewithsk Usually this is caused by running out of GPU memory; reducing GPU memory consumption, for example by lowering the batch size or using a smaller model, should help. If you run into more issues, please create a separate issue with a title like "GPU memory overflow on xxx model with yyy batch size and zzz dataset". Meanwhile I'll look for ways to improve this error message so that it points to the actual root cause.
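
If you want to confirm that memory pressure is the cause before opening a new issue, something like this can be run alongside training (a sketch that assumes nvidia-smi is on PATH; it is not part of MXNet):

    import subprocess

    # Print per-GPU used/total memory as reported by the driver.
    def gpu_memory_report():
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,memory.used,memory.total",
             "--format=csv,noheader"])
        print(out.decode().strip())

    gpu_memory_report()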
