Odd behavior when training #657

Closed
opfromthestart opened this issue Apr 3, 2023 · 8 comments · Fixed by #658

Comments

@opfromthestart
Contributor

opfromthestart commented Apr 3, 2023

I am making a CNN-based autoencoder, and it will not train properly with large batches. If I feed it a large batch (16), the error does not decrease, or even increases over time. However, with batches of 1, the error does decrease. I am using some of the new/experimental features (Conv2D, ConvTrans2d, Upscale, PReLU), so I suspect that one of them does not calculate gradients correctly for batched inputs.
The repo of my code is at https://github.com/opfromthestart/touhou-ai, but it probably only runs on my computer at the moment. The structure of the model is at https://github.com/opfromthestart/touhou-ai/blob/master/src/net/dfdx.rs#L155 (lines 13-153 define the encoder, decoder, and actor models).

@coreylowman
Owner

Is there a way you can replace some of the newer features with ones that are more proven? Using ReLU instead of PReLU should be easy enough, and if you see the same behavior, that will rule PReLU out.

I can take a look at upscale

@coreylowman
Owner

coreylowman commented Apr 3, 2023

I think batched upscale has the issue - I have a failing test. Edit: nvm

@coreylowman
Owner

There may be some issues with Bilinear interpolation - torch is giving me different results.

@opfromthestart
Contributor Author

opfromthestart commented Apr 3, 2023

You have to use torch's upsample with align_corners set to true to get matching results. I went with that behavior because I thought it better represented what upscaling should do.
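To illustrate why the two settings give different numbers, here is a minimal 1D sketch of linear upscaling under both coordinate conventions. It is plain Rust, not dfdx's implementation; the function names `source_coord` and `upscale_linear` are made up for this example, and the `align_corners = false` branch assumes the usual half-pixel-center mapping.

```rust
/// Map an output index to a fractional input coordinate.
/// Hypothetical helper for illustration; not part of dfdx or torch.
fn source_coord(out_idx: usize, in_len: usize, out_len: usize, align_corners: bool) -> f32 {
    if align_corners {
        // First and last output samples land exactly on the first and last inputs.
        if out_len > 1 {
            out_idx as f32 * (in_len - 1) as f32 / (out_len - 1) as f32
        } else {
            0.0
        }
    } else {
        // Half-pixel centers: samples sit in the middle of each output cell.
        // Negative coordinates are clamped to the first input.
        let scale = in_len as f32 / out_len as f32;
        ((out_idx as f32 + 0.5) * scale - 0.5).max(0.0)
    }
}

/// Linearly interpolate `input` to `out_len` samples.
fn upscale_linear(input: &[f32], out_len: usize, align_corners: bool) -> Vec<f32> {
    assert!(!input.is_empty(), "input must contain at least one value");
    let in_len = input.len();
    (0..out_len)
        .map(|i| {
            let x = source_coord(i, in_len, out_len, align_corners);
            let lo = (x.floor() as usize).min(in_len - 1);
            let hi = (lo + 1).min(in_len - 1);
            let t = x - lo as f32; // interpolation weight toward `hi`
            input[lo] * (1.0 - t) + input[hi] * t
        })
        .collect()
}
```

Upscaling `[0.0, 1.0]` to four samples gives evenly spaced values from 0 to 1 with `align_corners = true`, while the half-pixel mapping puts the interior samples at 0.25 and 0.75, so a reference comparison has to use the same convention on both sides.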

@coreylowman
Owner

Ahhhhh okay I'll make sure to document that, that's an important piece of info!

@opfromthestart
Contributor Author

Yeah, it's currently only documented in a test
https://github.com/coreylowman/dfdx/blob/main/src/tensor_ops/upscale2d/mod.rs#L264

@coreylowman
Owner

Okay, I think the issue is actually batched convtrans2d. I have a test that's passing for CPU and failing for CUDA. The corresponding test for conv2d passes on both.
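For readers wondering what such a test can look like, one common consistency check is that a batched op must give the same result as running each sample on its own. The sketch below shows this for a single-channel 1D transposed convolution in plain Rust. It is not the dfdx test or kernel, the thread does not say what the actual bug was, and all function names here are hypothetical.

```rust
/// Single-sample 1D transposed convolution (one channel, no padding).
/// Output length is (in_len - 1) * stride + kernel_len.
fn conv_trans1d(input: &[f32], kernel: &[f32], stride: usize) -> Vec<f32> {
    assert!(!input.is_empty() && !kernel.is_empty());
    let out_len = (input.len() - 1) * stride + kernel.len();
    let mut out = vec![0.0; out_len];
    for (i, &x) in input.iter().enumerate() {
        for (k, &w) in kernel.iter().enumerate() {
            // Each input value scatters a scaled copy of the kernel.
            out[i * stride + k] += x * w;
        }
    }
    out
}

/// Batched version over a flat, row-major buffer of shape [batch_size, in_len].
fn conv_trans1d_batched(
    batch: &[f32],
    batch_size: usize,
    in_len: usize,
    kernel: &[f32],
    stride: usize,
) -> Vec<f32> {
    assert_eq!(batch.len(), batch_size * in_len);
    assert!(in_len > 0 && !kernel.is_empty());
    let out_len = (in_len - 1) * stride + kernel.len();
    let mut out = vec![0.0; batch_size * out_len];
    for b in 0..batch_size {
        for i in 0..in_len {
            for (k, &w) in kernel.iter().enumerate() {
                // Both the read and the write are offset by the batch index.
                out[b * out_len + i * stride + k] += batch[b * in_len + i] * w;
            }
        }
    }
    out
}

/// Returns true when the batched result equals the per-sample results
/// concatenated in batch order (within a small tolerance).
fn batched_matches_per_sample(
    batch: &[f32],
    batch_size: usize,
    in_len: usize,
    kernel: &[f32],
    stride: usize,
) -> bool {
    let batched = conv_trans1d_batched(batch, batch_size, in_len, kernel, stride);
    let per_sample: Vec<f32> = batch
        .chunks(in_len)
        .flat_map(|sample| conv_trans1d(sample, kernel, stride))
        .collect();
    batched.len() == per_sample.len()
        && batched.iter().zip(&per_sample).all(|(a, b)| (a - b).abs() < 1e-6)
}
```

The same comparison can be run against a second backend's output, which is how a bug that only shows up with a batch size above 1 on one device would surface.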

@coreylowman
Owner

@opfromthestart let me know if you still encounter issues - upscale and prelu seemed fine to me, so now that convtrans2d is fixed you should be good 👍


2 participants