
Issue #96 #118

Merged: 14 commits merged into main on Aug 3, 2022

Conversation

@M1ngXU (Contributor) commented Jul 23, 2022

See Issue #96

@M1ngXU (Contributor, Author) commented Jul 23, 2022

Module::forward is quite inefficient with 3 allocations, so the old residual_add connection is now residual_add. Also, I have no idea about the numpy stuff, so I'm not sure how to fix the last numpy test (see the comments in the code).

@coreylowman (Owner) left a comment

Thanks for the contribution!

src/nn/residual.rs (outdated review thread)

    /// ```
    #[derive(Debug, Clone, Default)]
    -pub struct Residual<F>(F);
    +pub struct Residual<F, R>(F, R);
@coreylowman (Owner) replied:

I like the idea of having both Residual & ResidualAdd. I imagine most people will probably use ResidualAdd, so let's rename them

  • Residual does f(x) + x
  • GeneralizedResidual does f(x) + r(x)
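
For reference, a minimal sketch of that split, using a simplified Module trait and plain f32 values in place of dfdx's tensors and tape (the real trait bounds and forward signatures differ):

```rust
// Simplified stand-in for the real module trait; not the dfdx API.
trait Module<T> {
    fn forward(&self, x: T) -> T;
}

/// Residual: f(x) + x
#[derive(Debug, Clone, Default)]
struct Residual<F>(F);

impl<F: Module<f32>> Module<f32> for Residual<F> {
    fn forward(&self, x: f32) -> f32 {
        self.0.forward(x) + x
    }
}

/// GeneralizedResidual: f(x) + r(x)
#[derive(Debug, Clone, Default)]
struct GeneralizedResidual<F, R>(F, R);

impl<F: Module<f32>, R: Module<f32>> Module<f32> for GeneralizedResidual<F, R> {
    fn forward(&self, x: f32) -> f32 {
        self.0.forward(x) + self.1.forward(x)
    }
}
```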

@M1ngXU (Contributor, Author) replied:

What about SkipConnection instead of GeneralizedResidual?

@M1ngXU (Contributor, Author) replied:

OK, after re-reading some wiki pages I think I understand it better now. But shouldn't the gradients of F and R be summed AFTER the backprop of F and R (so the addition would have to be pushed to the tape first)? Or am I misunderstanding binary_map?

@coreylowman (Owner) replied:

All the operations on the tape are executed in reverse, so for this function the add/binary_map backward op is executed first, then the f(x) backward op is executed using the gradient from add(), and then the r(x) backward op uses the gradient from add(). Does that help?

Let's keep both Residual and SkipConnection, and then GeneralizedResidual. SkipConnection can just be a type alias (type SkipConnection<F, R> = Residual<F, R>;).
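
A toy illustration of that reverse-order execution (not the actual dfdx tape, just the idea): backward ops are pushed during the forward pass and popped last-in-first-out during backward, so add's backward op runs before the ops for f and r.

```rust
// Toy gradient tape: closures are pushed in forward order and executed in
// reverse. Gradients live in a Vec<f32> indexed by a made-up tensor id.
struct Tape {
    ops: Vec<Box<dyn FnOnce(&mut Vec<f32>)>>,
}

impl Tape {
    fn new() -> Self {
        Tape { ops: Vec::new() }
    }

    // Called during the forward pass: the ops for f and r, then add's op.
    fn push(&mut self, op: impl FnOnce(&mut Vec<f32>) + 'static) {
        self.ops.push(Box::new(op));
    }

    // Called once at the end: pops add's op first, then the ops for f and r.
    fn backward(mut self, grads: &mut Vec<f32>) {
        while let Some(op) = self.ops.pop() {
            op(grads);
        }
    }
}
```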

@M1ngXU (Contributor, Author) replied:

do you have an idea why the numpy test fails?

@M1ngXU (Contributor, Author) replied:

> add_grad + f_grad + r_grad

But if the add(y, &x) is executed before 2.a in the backprop, how can f_grad be fetched from the gradients to calculate x's gradient for 1?

@coreylowman (Owner) replied:

The three are summed together independently. After add's backprop x's gradient is only 1/3 complete (add_grad). Once f's backward op is executed then x is 2/3 complete (add_grad + f_grad), then finally after r's backward op x is complete (add_grad + f_grad + r_grad).

It works because x is input for add, f, and r, not the result. Gradients only need to be fully "complete" for backward ops in which they are the result.
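
A small numeric sketch of that accumulation (the derivative values are made up; the point is just that each backward op adds its own contribution with +=):

```rust
fn main() {
    // x's gradient starts at zero and is built up by every backward op
    // that used x as an input.
    let mut x_grad = 0.0_f32;

    let add_grad = 1.0; // from add's backward op: 1/3 complete
    x_grad += add_grad;

    let f_grad = 0.5; // from f's backward op: 2/3 complete
    x_grad += f_grad;

    let r_grad = 0.25; // from r's backward op: now complete
    x_grad += r_grad;

    assert_eq!(x_grad, add_grad + f_grad + r_grad);
}
```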

@M1ngXU (Contributor, Author) replied:

But for the last backward op, the gradient of x = r_grad + f_grad. In which line is this done? I can only find the backprops for r_grad and f_grad, but not their addition.

@coreylowman (Owner) replied:

It depends on what f and r are. For binary_map, the call to addmul does rhs_grad += rhs_deriv * result_grad, so if rhs_grad is x's grad, that's where part of the accumulation happens.

If F and R were both using map, for instance, then both accumulations would happen in map (where x's gradient is g).
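
A scalar sketch of that rhs_grad += rhs_deriv * result_grad pattern (dfdx's binary_map/addmul work on tensors; this only shows where the += accumulation lives):

```rust
// Backward step for a binary op z = op(lhs, rhs). Both input gradients are
// accumulated with +=, so if the same value (e.g. x) also feeds other ops,
// its gradient keeps summing as those ops' backward steps run.
fn binary_backward(
    lhs_grad: &mut f32,
    rhs_grad: &mut f32,
    lhs_deriv: f32,
    rhs_deriv: f32,
    result_grad: f32,
) {
    *lhs_grad += lhs_deriv * result_grad;
    *rhs_grad += rhs_deriv * result_grad;
}
```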

@M1ngXU (Contributor, Author) replied:

> Linear<10, 5>

I mean that linear layer: what gradients will it use? It will be lhs_grad, won't it? But when is rhs_grad added to lhs_grad?

@M1ngXU (Contributor, Author) commented Jul 26, 2022

hmmm, file deleted and new file oof

M1ngXU marked this pull request as ready for review July 30, 2022 05:40
@M1ngXU (Contributor, Author) commented Jul 30, 2022

should be ready for merge now

@M1ngXU (Contributor, Author) commented Jul 30, 2022

finally the tests all passed 😅

@M1ngXU (Contributor, Author) commented Jul 30, 2022

With the new GeneralizedResidual, the Residual struct can be removed if we add a Linear activation where f(x) = x, and then type Residual<F> = GeneralizedResidual<F, Linear>.

@coreylowman (Owner) commented Aug 2, 2022

@M1ngXU this looks good to me for merging - can you update or rebase your branch to current master (e.g. the dropout change was already merged into main)

@M1ngXU (Contributor, Author) commented Aug 2, 2022

> @M1ngXU this looks good to me for merging - can you update or rebase your branch to current master (e.g. the dropout change was already merged into main)

How exactly do I do that? 😅

@coreylowman (Owner) commented:

> @M1ngXU this looks good to me for merging - can you update or rebase your branch to current master (e.g. the dropout change was already merged into main)
>
> How exactly do I do that? 😅

This article should walk you through it https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@M1ngXU (Contributor, Author) commented Aug 3, 2022

> @M1ngXU this looks good to me for merging - can you update or rebase your branch to current master (e.g. the dropout change was already merged into main)
>
> How exactly do I do that? 😅
>
> This article should walk you through it https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

OK, should work now. I thought that I had already tried that, but apparently I synced main and not BetterResidual.

@M1ngXU (Contributor, Author) commented Aug 3, 2022

> With the new GeneralizedResidual, the Residual struct can be removed if we add a Linear activation where f(x) = x, and then type Residual<F> = GeneralizedResidual<F, Linear>.

@coreylowman how about this?

coreylowman merged commit 7f22b5d into coreylowman:main on Aug 3, 2022
@coreylowman (Owner) commented:

> With the new GeneralizedResidual, the Residual struct can be removed if we add a Linear activation where f(x) = x, and then type Residual<F> = GeneralizedResidual<F, Linear>.

I like the idea about moving Residual to use GeneralizedResidual, but I think we'd need something like Identity (which just returns x in the forward pass) instead of Linear for R.
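
A minimal sketch of what such an Identity module could look like, again with a simplified Module trait standing in for the real dfdx traits; the type alias at the end is just the idea discussed in this thread, not merged code:

```rust
trait Module<T> {
    fn forward(&self, x: T) -> T;
}

/// Identity: returns its input unchanged in the forward pass and has no
/// trainable parameters.
#[derive(Debug, Clone, Copy, Default)]
struct Identity;

impl<T> Module<T> for Identity {
    fn forward(&self, x: T) -> T {
        x
    }
}

// Residual could then become an alias of GeneralizedResidual, e.g.:
// type Residual<F> = GeneralizedResidual<F, Identity>;
```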

@M1ngXU (Contributor, Author) commented Aug 3, 2022

> With the new GeneralizedResidual, the Residual struct can be removed if we add a Linear activation where f(x) = x, and then type Residual<F> = GeneralizedResidual<F, Linear>.
>
> I like the idea about moving Residual to use GeneralizedResidual, but I think we'd need something like Identity (which just returns x in the forward pass) instead of Linear for R.

The name Identity makes more sense.
