Odd behavior when training #657

opfromthestart · 2023-04-03T01:26:55Z

I am making a CNN-based autoencoder, and it will not train properly with large batches. If I feed it a large batch (16), the error does not decrease, or even increases over time. With batches of 1, however, the error does decrease. I am using some of the new/experimental features (Conv2D, ConvTrans2d, Upscale, PReLU), so I suspect that one of them does not calculate gradients correctly for batched inputs.
The repo of my code is at https://github.com/opfromthestart/touhou-ai, but it probably only runs on my computer at the moment. The structure of the model is at https://github.com/opfromthestart/touhou-ai/blob/master/src/net/dfdx.rs#L155 (lines 13-153 define the encoder, decoder, and actor models).

coreylowman · 2023-04-03T13:59:27Z

Is there a way you can replace some of the newer features with ones that are more proven? Using ReLU instead of PReLU should be easy enough, and if you see the same behavior then that will cross that off.

I can take a look at upscale

coreylowman · 2023-04-03T14:46:54Z

~~I think batched upscale has the issue - have a failing test.~~ Edit: nvm

coreylowman · 2023-04-03T15:08:23Z

There may be some issues with Bilinear interpolation - torch is giving me different results.

opfromthestart · 2023-04-03T15:51:13Z

You have to use torch's upsample with align_corners being true, as I thought it better represented what upscaling should do.
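To make the `align_corners` difference concrete, here is a minimal 1D sketch of the two coordinate conventions. It is not dfdx's or torch's actual implementation; the names `src_coord` and `upscale_linear_1d` are hypothetical and exist only for this illustration. With `align_corners = true` the first and last output samples land exactly on the first and last input samples; with `false` the samples are treated as pixel centers, so the interior values differ.

```rust
/// Source coordinate that output index `o` samples from when resizing
/// `in_len` -> `out_len`. Hypothetical helper, for illustration only.
fn src_coord(o: usize, in_len: usize, out_len: usize, align_corners: bool) -> f32 {
    if align_corners {
        // Endpoints of input and output coincide.
        if out_len > 1 {
            o as f32 * (in_len - 1) as f32 / (out_len - 1) as f32
        } else {
            0.0
        }
    } else {
        // Pixel-center convention, clamped at the left edge.
        let scale = in_len as f32 / out_len as f32;
        ((o as f32 + 0.5) * scale - 0.5).max(0.0)
    }
}

/// Linear interpolation along one axis (the 1D analogue of bilinear).
/// Assumes `input` is non-empty.
fn upscale_linear_1d(input: &[f32], out_len: usize, align_corners: bool) -> Vec<f32> {
    assert!(!input.is_empty());
    let n = input.len();
    (0..out_len)
        .map(|o| {
            let x = src_coord(o, n, out_len, align_corners);
            let x0 = (x.floor() as usize).min(n - 1);
            let x1 = (x0 + 1).min(n - 1);
            let t = x - x0 as f32;
            input[x0] * (1.0 - t) + input[x1] * t
        })
        .collect()
}
```

Upscaling `[0.0, 1.0]` to length 4 gives roughly `[0, 1/3, 2/3, 1]` with `align_corners = true` and `[0, 0.25, 0.75, 1]` with `false`, which is why a comparison against torch only matches when the same convention is selected on both sides.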

coreylowman · 2023-04-03T16:34:14Z

Ahhhhh okay I'll make sure to document that, that's an important piece of info!

opfromthestart · 2023-04-03T17:50:05Z

Yeah, it's currently only documented in a test:
https://github.com/coreylowman/dfdx/blob/main/src/tensor_ops/upscale2d/mod.rs#L264

coreylowman · 2023-04-03T18:00:56Z

Okay, I think the issue is actually batched convtrans2d: I have a test that passes on CPU and fails on CUDA. The corresponding test for conv2d passes on both.
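The kind of check that exposes a batching bug like this can be sketched as follows: run the op on a whole batch, run it on each sample separately, and compare. This is only an illustration of the idea using a naive 1D transposed convolution (single channel, no padding); it is not the dfdx test or kernel, and every function name here is hypothetical.

```rust
/// Naive 1D transposed convolution for a single sample.
/// Assumes `input` is non-empty and `stride >= 1`.
fn conv_trans1d(input: &[f32], kernel: &[f32], stride: usize) -> Vec<f32> {
    let out_len = (input.len() - 1) * stride + kernel.len();
    let mut out = vec![0.0; out_len];
    for (i, &x) in input.iter().enumerate() {
        for (k, &w) in kernel.iter().enumerate() {
            // Each input element scatters a scaled copy of the kernel.
            out[i * stride + k] += x * w;
        }
    }
    out
}

/// Batched version over a flat `[batch, len]` buffer.
/// The per-sample offsets (`b * len`, `b * out_len`) are exactly the kind
/// of indexing a batched kernel has to get right.
fn conv_trans1d_batched(
    input: &[f32],
    batch: usize,
    len: usize,
    kernel: &[f32],
    stride: usize,
) -> Vec<f32> {
    let out_len = (len - 1) * stride + kernel.len();
    let mut out = vec![0.0; batch * out_len];
    for b in 0..batch {
        for i in 0..len {
            for (k, &w) in kernel.iter().enumerate() {
                out[b * out_len + i * stride + k] += input[b * len + i] * w;
            }
        }
    }
    out
}

/// Returns true if the batched result equals the per-sample results
/// concatenated in batch order.
fn batched_matches_per_sample(
    input: &[f32],
    batch: usize,
    len: usize,
    kernel: &[f32],
    stride: usize,
) -> bool {
    let batched = conv_trans1d_batched(input, batch, len, kernel, stride);
    let per_sample: Vec<f32> = input
        .chunks(len)
        .flat_map(|sample| conv_trans1d(sample, kernel, stride))
        .collect();
    batched.len() == per_sample.len()
        && batched
            .iter()
            .zip(&per_sample)
            .all(|(a, b)| (a - b).abs() < 1e-6)
}
```

A check of this shape, run with a stride greater than 1 and a batch size greater than 1 on each backend, would be consistent with the symptom in the original report: training works with batches of 1 but not with batches of 16.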

coreylowman · 2023-04-04T13:23:15Z

@opfromthestart let me know if you still encounter issues - upscale and prelu seemed fine to me, so now that convtrans2d is fixed you should be good 👍

coreylowman mentioned this issue Apr 3, 2023

Fixes conv transpose stride bug, adds more docs to upscale2d #658

Merged

coreylowman closed this as completed in #658 Apr 4, 2023
