Issue #96 #118
Conversation
Thanks for the contribution!
src/nn/residual.rs (outdated)
/// ```
#[derive(Debug, Clone, Default)]
-pub struct Residual<F>(F);
+pub struct Residual<F, R>(F, R);
I like the idea of having both Residual & ResidualAdd. I imagine most people will probably use ResidualAdd, so let's rename them:
- Residual does f(x) + x
- GeneralizedResidual does f(x) + r(x)
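Roughly, the split being discussed could look like the sketch below. This is only an illustration: the Module trait and the plain f32 inputs are simplifying assumptions for this comment, not dfdx's actual trait or tensor types.

```rust
// Simplified sketch of the proposed naming split; the Module trait and the
// use of plain f32 values are assumptions for illustration, not dfdx's API.
trait Module<Input> {
    type Output;
    fn forward(&self, input: Input) -> Self::Output;
}

/// Residual: computes f(x) + x.
struct Residual<F>(F);

/// GeneralizedResidual: computes f(x) + r(x).
struct GeneralizedResidual<F, R>(F, R);

impl<F: Module<f32, Output = f32>> Module<f32> for Residual<F> {
    type Output = f32;
    fn forward(&self, x: f32) -> f32 {
        self.0.forward(x) + x
    }
}

impl<F: Module<f32, Output = f32>, R: Module<f32, Output = f32>> Module<f32>
    for GeneralizedResidual<F, R>
{
    type Output = f32;
    fn forward(&self, x: f32) -> f32 {
        self.0.forward(x) + self.1.forward(x)
    }
}

/// Stand-in layer that doubles its input.
struct Double;
impl Module<f32> for Double {
    type Output = f32;
    fn forward(&self, x: f32) -> f32 {
        2.0 * x
    }
}

fn main() {
    assert_eq!(Residual(Double).forward(3.0), 9.0); // 2*3 + 3
    assert_eq!(GeneralizedResidual(Double, Double).forward(3.0), 12.0); // 2*3 + 2*3
}
```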
What about SkipConnection instead of GeneralizedResidual?
ok, after re-reading some wiki pages i think that i understand it better now - but shouldn't the gradients of F and R have to be summed AFTER the backprop of F and R (so the addition would have to be pushed to the tape first?) or am I not understanding binary_map?
All the operations on the tape are executed in reverse, so for this function the add/binary_map backward op will be executed first, then the f(x) backward op is executed using the gradient from add(), and then the r(x) backward op using the gradient from add(). Does that help?

Let's keep both Residual and SkipConnection, and then GeneralizedResidual. SkipConnection can just be a type alias (e.g. type SkipConnection<F, R> = Residual<F, R>;).
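To illustrate that execution order, here is a toy tape sketch (this is not dfdx's real tape type; it just records named backward ops during the forward pass and runs them in reverse):

```rust
// Toy tape: backward ops are recorded in forward order and then executed in
// reverse, so the add's backward op runs before f's and r's backward ops.
// Which of f/r runs first among themselves depends on recording order; the
// key point is that the add runs before both.
fn main() {
    type BackwardOp = Box<dyn FnMut(&mut Vec<String>)>;
    let mut tape: Vec<BackwardOp> = Vec::new();

    // Forward pass records backward ops in forward order: f(x), r(x), add.
    for name in ["f(x)", "r(x)", "add"] {
        tape.push(Box::new(move |log: &mut Vec<String>| {
            log.push(format!("backward of {name}"))
        }));
    }

    // Backward pass: execute everything in reverse.
    let mut log = Vec::new();
    for op in tape.iter_mut().rev() {
        op(&mut log);
    }
    assert_eq!(log, ["backward of add", "backward of r(x)", "backward of f(x)"]);
}
```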
do you have an idea why the numpy test fails?
add_grad + f_grad + r_grad

but if the add(y, &x) is executed before 2.a in the backprop, how can f_grad be fetched from the gradients to calculate x's gradients for 1?
The three contributions are summed independently. After add's backprop, x's gradient is only 1/3 complete (add_grad). Once f's backward op is executed, x is 2/3 complete (add_grad + f_grad), and finally after r's backward op x is complete (add_grad + f_grad + r_grad).
It works because x is input for add, f, and r, not the result. Gradients only need to be fully "complete" for backward ops in which they are the result.
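As a scalar sketch of that accumulation (the contribution values below are made up; only the += pattern matters):

```rust
// Scalar sketch of how x's gradient accumulates across the three backward
// ops; the individual contribution values are made up for illustration.
fn main() {
    let (add_grad, f_grad, r_grad) = (1.0, 0.5, 0.25);

    let mut x_grad = 0.0;
    x_grad += add_grad; // after add's backward op: partially accumulated
    x_grad += f_grad;   // after f's backward op: still partial
    x_grad += r_grad;   // after r's backward op: complete
    assert_eq!(x_grad, add_grad + f_grad + r_grad);
}
```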
but for the last backward op the gradient of x = r_grad + f_grad. in which line is this done? i can only find the backprops for r_grad and f_grad, but not their addition
it depends on what f & r are. for binary map the call to addmul does rhs_grad += rhs_deriv * result_grad, so if rhs_grad is x's grad that's where part of the accumulation happens.

If F and R were both using map for instance then both accumulations would happen in map (where x's gradient is g).
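A sketch of that accumulation pattern, assuming slice-shaped gradients (this is not dfdx's actual addmul signature, just the shape of the idea):

```rust
// Sketch of the "grad += deriv * result_grad" accumulation pattern described
// above; each backward op that has x as an input adds its own contribution
// into x's gradient rather than overwriting it. Derivative values are made up.
fn addmul(grad: &mut [f32], deriv: &[f32], result_grad: &[f32]) {
    for ((g, d), rg) in grad.iter_mut().zip(deriv).zip(result_grad) {
        *g += d * rg;
    }
}

fn main() {
    let mut x_grad = vec![0.0_f32; 3];
    let result_grad = [1.0_f32; 3];

    addmul(&mut x_grad, &[1.0; 3], &result_grad);  // skip/add path
    addmul(&mut x_grad, &[0.5; 3], &result_grad);  // f's backward op
    addmul(&mut x_grad, &[0.25; 3], &result_grad); // r's backward op

    assert_eq!(x_grad, [1.75_f32; 3]);
}
```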
Linear<10, 5>

i mean that linear layer, what gradients will it use? it will be lhs_grad, won't it? but when is rhs_grad added to lhs_grad?
hmmm, file deleted and new file oof
should be ready for merge now
finally the tests all passed 😅
with the new
@M1ngXU this looks good to me for merging - can you update or rebase your branch to current master (e.g. the dropout change was already merged into main)
how exactly do i do that? 😅
This article should walk you through it https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
ok, should work now. i thought that i already tried that, but i apparently synced main and not BetterResidual.
@coreylowman how about this?
I like the idea about moving Residual to use GeneralizedResidual, but I think we'd need something like
the name Identity makes more sense
See Issue #96