Gradient Accumulation #400

Closed
Tracked by #278
jafioti opened this issue Jan 25, 2023 · 5 comments · Fixed by #519

Comments

@jafioti (Contributor) commented Jan 25, 2023

It is often desirable to train with a larger batch size than fits on the GPU / in memory at once. Accumulating gradients across mini-batches is a solution: it effectively simulates the larger batch size, albeit without the parallelism advantage.

A straightforward way to do this would be to impl Add<Gradients<D>> for Gradients<D>, so that gradients on the same device can be added together.

I'm not sure how this would be handled, since the grads seem to be stored in a Box, and I don't see how you could add to that.
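
For illustration, a standalone toy of what the accumulation itself means numerically (plain Rust, not dfdx code; the numbers and the SGD step are made up):

```rust
// Toy illustration of gradient accumulation (not dfdx code): sum the
// gradients of several mini-batches, then take a single optimizer step,
// as if one larger batch had been used.
fn main() {
    let mini_batch_grads: Vec<Vec<f32>> =
        vec![vec![0.1, -0.2], vec![0.3, 0.0], vec![-0.1, 0.4]];

    // accumulate instead of updating after every mini-batch
    let mut accumulated = vec![0.0f32; 2];
    for grads in &mini_batch_grads {
        for (acc, g) in accumulated.iter_mut().zip(grads) {
            *acc += g;
        }
    }

    // one SGD step with the averaged gradient
    let lr = 0.01;
    let n = mini_batch_grads.len() as f32;
    let mut params = vec![1.0f32, 2.0];
    for (p, acc) in params.iter_mut().zip(&accumulated) {
        *p -= lr * *acc / n;
    }
    println!("updated params: {params:?}");
}
```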

@coreylowman (Owner) commented:

I think if we change the underlying storage to Box<dyn std::ops::AddAssign>, we could do this. I'd probably prefer AddAssign over Add so we aren't cloning things.

We would need to add implementations of AddAssign for the raw device storage types.
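
For concreteness, a hypothetical shape of one of those impls on a CPU-side storage type (CpuStorage is an illustrative stand-in, not an actual dfdx type):

```rust
use std::ops::AddAssign;

// Illustrative stand-in for a device's raw storage buffer; not a real dfdx type.
struct CpuStorage(Vec<f32>);

impl AddAssign<&CpuStorage> for CpuStorage {
    // elementwise accumulation of another gradient buffer into this one
    fn add_assign(&mut self, rhs: &CpuStorage) {
        assert_eq!(self.0.len(), rhs.0.len(), "gradient buffers must match in length");
        for (a, b) in self.0.iter_mut().zip(rhs.0.iter()) {
            *a += *b;
        }
    }
}
```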

@coreylowman (Owner) commented:

So I was working on this a bit because I thought I had a pretty clever solution. It turns out Box<dyn Any> makes this pretty hard due to object safety rules.

Notably, a trait whose methods take or return Self by value is not object safe, so we can't use AddAssign or something like:

```rust
trait AddSelf {
    fn add_self(self, rhs: Self) -> Self;
}
```

If we can't do this with the Gradients object, we may need to add a separate gradient accumulator object that does lazy addition:

```rust
struct GradientAccumulator {
    gradients: Vec<Gradients>,
}
```

and then add some abstraction layer for the optimizers to use:

```rust
trait HasGradients {
    fn remove<T>(&mut self, t: &T) -> Option<T::Gradient>
    where
        T: HasUniqueId + AllocGrad;
}
```

We'd impl this for both Gradients and GradientAccumulator, and then change the optimizer's update to accept this trait:

```rust
fn update<G: HasGradients>(
    &mut self,
    module: &mut M,
    gradients: G,
) -> Result<(), OptimizerUpdateError<D>>;
```
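
For what it's worth, a simplified, self-contained sketch of that lazy-addition design, with toy stand-ins (a usize id and Vec<f32> gradient instead of the real HasUniqueId / AllocGrad machinery) just to show the shape of the abstraction:

```rust
use std::collections::HashMap;

// Toy stand-ins for dfdx's tensor ids and gradient storage.
type TensorId = usize;
type Grad = Vec<f32>;

struct Gradients {
    grads: HashMap<TensorId, Grad>,
}

struct GradientAccumulator {
    gradients: Vec<Gradients>,
}

// Simplified, non-generic version of the HasGradients abstraction above.
trait HasGradients {
    fn remove(&mut self, id: TensorId) -> Option<Grad>;
}

impl HasGradients for Gradients {
    fn remove(&mut self, id: TensorId) -> Option<Grad> {
        self.grads.remove(&id)
    }
}

impl HasGradients for GradientAccumulator {
    // Lazy addition: pull the gradient for `id` out of every stored Gradients
    // and sum them elementwise only when the optimizer asks for it.
    fn remove(&mut self, id: TensorId) -> Option<Grad> {
        let mut sum: Option<Grad> = None;
        for grads in self.gradients.iter_mut() {
            if let Some(g) = grads.remove(id) {
                match sum.as_mut() {
                    None => sum = Some(g),
                    Some(s) => s.iter_mut().zip(&g).for_each(|(a, b)| *a += b),
                }
            }
        }
        sum
    }
}
```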

@jafioti (Contributor, Author) commented Feb 7, 2023

@coreylowman Why can't there be a trait that accepts / returns Self? That first trait you had looks acceptable to me

@coreylowman (Owner) commented:

You can't use that kind of trait as a trait object, i.e. you can't have a Box<dyn AddSelf>, because the AddSelf trait is not object safe. See https://doc.rust-lang.org/reference/items/traits.html#object-safety
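
For reference, a minimal illustration (the last line is commented out because it is exactly what the compiler rejects):

```rust
trait AddSelf {
    fn add_self(self, rhs: Self) -> Self;
}

struct Grad(Vec<f32>);

impl AddSelf for Grad {
    // fine for a concrete type...
    fn add_self(mut self, rhs: Self) -> Self {
        for (a, b) in self.0.iter_mut().zip(&rhs.0) {
            *a += b;
        }
        self
    }
}

// ...but this does not compile: error[E0038], `AddSelf` cannot be made into an
// object, because `add_self` takes and returns `Self` without a `Self: Sized` bound.
// fn accumulate(gradients: Vec<Box<dyn AddSelf>>) {}
```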

@coreylowman (Owner) commented Feb 26, 2023

Another option for this: allow passing an existing Gradients object into .trace() (a rough usage sketch follows the pros below).

Pros:

  • this is possible with minimal modifications
  • this could help reduce allocations
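
A rough usage sketch of how that could look from the training loop's side (this is pseudocode: alloc_grads and the trace(grads)/update signatures are assumptions for illustration, not the actual API):

```rust
// Pseudocode sketch: reuse one Gradients object across mini-batches so that
// backward() keeps adding into it, and only step the optimizer every
// ACCUM_STEPS mini-batches. Helper names here are assumptions, not real API.
let mut grads = model.alloc_grads(); // hypothetical helper
for (i, mini_batch) in batches.iter().enumerate() {
    let loss = loss_fn(model.forward(mini_batch.clone().trace(grads)));
    grads = loss.backward(); // gradients accumulate across iterations
    if (i + 1) % ACCUM_STEPS == 0 {
        opt.update(&mut model, &grads)?;
        grads = model.alloc_grads(); // start a fresh accumulation window
    }
}
```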
