Buildkite CI failures with grad test of ConvTranspose + selu #1804
Comments
Despite pulling the exact inputs and weights from failing Buildkite runs, I was not able to replicate this locally. If someone has a znver2 machine they can test on, everything needed should be in https://gist.github.com/ToucheSir/32fd6688d3932c9f498c78a42a0ea017. @DhairyaLGandhi and possibly @maleadt, is there any way to replicate the CI environment and run tests against that?
I could get the same summed result by exporting the pre-selu output from CI and loading that locally, but I still have not been able to generate a similar output from a local run, and I wasn't able to coax a container image into downloading the same dependency versions. What's confusing is that the tests for ConvTranspose + identity activation do pass on CI, yet the differing output for the selu version appears before the activation function is applied. Either way, #1822 is now up as a stopgap measure until some brave soul decides to revisit this in the future.
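For anyone retracing this, the narrowing-down described above can be sketched roughly as follows. This is an illustration in plain Python, not the Flux test itself: `selu` is written out by hand with the standard SELU constants, the helper names are hypothetical, and `ci_pre_activation` / `local_pre_activation` are made-up stand-in values rather than data from the failing runs.

```python
import math

# Standard SELU constants (Klambauer et al., 2017).
SELU_SCALE = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def selu(x):
    """Elementwise SELU: scale * (x if x > 0 else alpha * (exp(x) - 1))."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def summed_activation(pre_activation):
    """Apply SELU to every element and sum the results into a scalar."""
    return math.fsum(selu(v) for v in pre_activation)


def first_mismatch(a, b, tol=1e-6):
    """Return the index of the first pair of elements that differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if not math.isclose(x, y, rel_tol=tol, abs_tol=tol):
            return i
    return None


# Hypothetical stand-ins: the pre-activation output exported from CI and
# the one produced by a local run. These are NOT values from the real builds.
ci_pre_activation = [0.5, -1.25, 2.0, -0.75]
local_pre_activation = [0.5, -1.25, 2.0, -0.70]

# Step 1: push the exported CI values through the activation locally.
# If this sum matches what CI reported, the activation is not the culprit.
ci_sum_recomputed_locally = summed_activation(ci_pre_activation)

# Step 2: compare the pre-activation outputs directly to see where the
# two environments already disagree before the activation is applied.
mismatch_index = first_mismatch(ci_pre_activation, local_pre_activation)

print("sum from CI pre-activation:", ci_sum_recomputed_locally)
print("sum from local pre-activation:", summed_activation(local_pre_activation))
print("first differing element:", mismatch_index)
```

If step 1 reproduces the CI sum while step 2 still reports a mismatch, the divergence sits upstream of the activation, which is the situation described here.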
Fixed by #1836, ref. #1836 (comment).
One example: https://buildkite.com/julialang/flux-dot-jl/builds/1914#c62d9761-ab7f-415a-b995-51552eb2b1e5
I could not repro this locally on 2 separate GPU machines, running master with 2 threads (the same number Buildkite uses). Oddly, it has succeeded twice (once on trying and once on staging), but whenever it fails, it fails with the same result, which makes me suspect some environmental discrepancy between workers.