
Add BatchNorm1D #476

Closed
kstavro opened this issue Feb 22, 2023 · 10 comments · Fixed by #513

Comments

@kstavro
Contributor

kstavro commented Feb 22, 2023

An implementation of BatchNorm2D exists, but BatchNorm1D does not.

@kstavro
Contributor Author

kstavro commented Feb 24, 2023

Started working on it here:
https://github.com/kstavro/dfdx/tree/batchnorm1d

@nkoppel
Contributor

nkoppel commented Feb 24, 2023

FYI, after merging #482, you may need to change your implementation a little.

@kstavro
Contributor Author

kstavro commented Feb 25, 2023

@nkoppel Thanks for letting me know, I will sync my fork.

I think one unclear point is where and how to place the batchnorm layer in dfdx's nn syntax. The traditional way would be Linear -> BatchNorm -> Activation, but I have seen some research from Chollet arguing that BatchNorm -> Linear -> Activation might actually be better. The latter might also be much simpler to implement in dfdx, e.g. with something like:

type Mlp = (
    (BatchNorm1D<128, 128, Dev>, NoActivation), // Don't even know if NoActivation even exists, but this could work.
    (Linear<128, 32, Dev>, ReLU),
)

The other way to do it would be to already add a Batchnorm option in the tuple and be able to directly write

type Mlp = (
    (Linear<128, 128, Dev>, BatchNorm1D, ReLU),
    (BatchNorm1D<128, 128, Dev>, Linear, ReLU), // just pointing out that we could choose where to put the batchnorm layer; the ordering doesn't have to be implemented like this
)

I find this much nicer, but it might need more tinkering with the code. That is why I wanted to have a deeper look into the 2D version and follow what was done there (haven't done it yet). Let me know what you both think (@coreylowman too, curious what your view on this is).

@kstavro
Contributor Author

kstavro commented Feb 26, 2023

Ok, I checked the code of BatchNorm2D and everything works out of the box for 1D by adding inference and training implementations to BatchNorm2D as follows:

impl<B: Dim, const C: usize, E: Dtype, D: Device<E>> Module<Tensor<(B, Const<C>), E, D, NoneTape>>
    for BatchNorm2D<C, E, D>
{
...
}
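To make concrete what such an impl computes, here is a dependency-free sketch (plain Rust over nested `Vec`s, not the dfdx API; the function name and layout are hypothetical) of the inference path for a (batch, channels) input: each channel is normalized by the stored running statistics and then affine-transformed, i.e. y = (x - running_mean) / sqrt(running_var + epsilon) * scale + bias.

```rust
// Hypothetical sketch of the per-channel inference-mode normalization.
fn batchnorm1d_infer(
    x: &[Vec<f32>],       // input, shape (B, C)
    running_mean: &[f32], // running statistics, shape (C,)
    running_var: &[f32],
    scale: &[f32],        // learnable affine parameters, shape (C,)
    bias: &[f32],
    epsilon: f32,
) -> Vec<Vec<f32>> {
    x.iter()
        .map(|row| {
            row.iter()
                .enumerate()
                .map(|(c, &v)| {
                    // Normalize with the running stats, then scale and shift.
                    let std = (running_var[c] + epsilon).sqrt();
                    (v - running_mean[c]) / std * scale[c] + bias[c]
                })
                .collect()
        })
        .collect()
}
```

The 2D case differs only in that the same per-channel statistics are broadcast over the extra spatial dimensions, which is why the 2D impl covers 1D almost for free.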

I also realized, after understanding the code a bit more in depth, that what I was talking about above was practically nonsense, since Modules can already be stacked together any way we like. I confirmed it with the MNIST example:

type Mlp = (
    (Linear<784, 512>, BatchNorm2D<512>, ReLU),
    (Linear<512, 128>, BatchNorm2D<128>, ReLU),
    (Linear<128, 32>, BatchNorm2D<32>, ReLU),
    Linear<32, 10>,
);

Each epoch now takes 1.5x longer on CPU (I guess to be expected with the batchnorm layers), but the nn also converges faster.

I would rename the current BatchNorm2D to BatchNorm (or _BatchNorm) and have BatchNorm1D and BatchNorm2D "inherit" from it if that is ok. If you think there is a better way, suggestions are very welcome.

@coreylowman
Owner

coreylowman commented Feb 26, 2023

Ok, I checked the code of BatchNorm2D and everything works out of the box for 1D by adding inference and training implementations to BatchNorm2D as follows:

Nice! Do we need to also add a 3d version of it to match the pytorch docs? I wasn't really sure when BatchNorm1d would be receiving 3d inputs though... transformers maybe?
It might complain about conflicting implementations with the existing 3d impl, but I could also see the different shape being enough. If it does, that would be enough justification to have separate structs for 1d/2d.

Each epoch now takes 1.5x longer on CPU (I guess to be expected with the batchnorm layers), but the nn also converges faster.

Yeah, I've been working on trying to optimize batchnorm on CPU quite a bit, since it's a bit slow atm. So that's definitely a separate issue.

and have BatchNorm1D and BatchNorm2D "inherit" from it if that is ok.
What do you mean by inherit I guess? Just have their impls be calling the underlying _BatchNorm impl?

@kstavro
Contributor Author

kstavro commented Feb 26, 2023

Nice! Do we need to also add a 3d version of it to match the pytorch docs? I wasn't really sure when BatchNorm1d would be receiving 3d inputs though... transformers maybe? It might complain about conflicting implementations with the existing 3d impl, but I could also see the different shape being enough. If it does, that would be enough justification to have separate structs for 1d/2d.

Yeah, I also think separate BatchNorm1D and BatchNorm2D structs are the way to go, and I will add the 3D input support to BatchNorm1D as well. Yes, transformers would be the use case for 3D (N, C, L) inputs, but I think it isn't that relevant in practice, because variable-length sequences complicate batch normalization: you need to pad all the sequences to the same length, and the padded tokens seem to make the normalization unstable (because they are empty tokens, I guess). I don't have actual experience with this, though. In any case, I think LayerNorm is the default normalization for NLP transformers. Vision transformers (which deal with fixed-resolution images, so BN is more relevant there) can just use BatchNorm2D.

Yeah I've been working on trying to optimize batchnorm on CPU quite a bit, since its a bit slow atm. So that's definitely a separate issue.

If you have a hunch what might be the bottleneck, I could also have a look at that, since I am working on the topic (in a separate issue as you suggested).

and have BatchNorm1D and BatchNorm2D "inherit" from it if that is ok.
What do you mean by inherit I guess? Just have their impls be calling the underlying _BatchNorm impl?

Inherit was the wrong word to use in Rust. I just wanted to capture the BN logic once and share it between the two batchnorms to avoid duplication. I was thinking of making BatchNorm a trait that implements train_fwd and inf_fwd and then having the two batchnorms implement that base trait.

Just have their impls be calling the underlying _BatchNorm impl?

Could you elaborate a bit on what you mean (not talking to the best Rust programmer here :) )? Do you mean keeping BatchNorm as a struct instead of making it a trait? I know for sure that if I keep BatchNorm as a struct and do something like:

pub struct BatchNorm1D<const C: usize, E: Dtype, D: DeviceStorage> {
    pub batchnorm: BatchNorm<C, E, D>,
}

pub struct BatchNorm2D<const C: usize, E: Dtype, D: DeviceStorage> {
    pub batchnorm: BatchNorm<C, E, D>,
}

it will definitely work, but I find it ugly, because everything hides behind self.batchnorm. And you most probably meant something else?
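For context, the newtype-wrapper idea being discussed could look roughly like this. This is a dependency-free sketch with made-up types, not the actual dfdx structs: the shared logic lives once in `BatchNorm`, and the 1D struct forwards to it, so the `self.batchnorm` indirection stays an implementation detail that callers never see.

```rust
// Hypothetical, simplified sketch (no dfdx types) of the wrapper pattern.
struct BatchNorm {
    epsilon: f32,
}

impl BatchNorm {
    // The shared normalization logic, written once.
    fn normalize(&self, x: f32, mean: f32, var: f32) -> f32 {
        (x - mean) / (var + self.epsilon).sqrt()
    }
}

struct BatchNorm1D {
    inner: BatchNorm,
}

impl BatchNorm1D {
    // Forwarding method: the indirection is hidden from callers.
    fn forward(&self, x: f32, mean: f32, var: f32) -> f32 {
        self.inner.normalize(x, mean, var)
    }
}
```

The same forwarding impl would be repeated for a `BatchNorm2D` wrapper, differing only in which input shapes its `Module` impls accept.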

@coreylowman
Owner

If you have a hunch what might be the bottleneck, I could also have a look at that, since I am working on the topic (in a separate issue as you suggested).

I'm fairly certain the bottleneck is the binary operations' backward pass when dealing with broadcasted arrays. I think sum()'s backward is a big slowdown as well. #491 is probably the most relevant issue for this.

Inherit was the wrong word to use in Rust. I just wanted to capture the BN logic once and pass it to the the two batchnorms to avoid duplication. I was just thinking making BatchNorm a trait that implements train_fwd and inf_fwd and then having the two Batchnorms implement the base trait.

Gotcha. I think we might actually be able to get away with just moving train_fwd and infer_fwd to being free functions that accept all the tensors instead of structs, and then the different structs can call them directly:

fn train_fwd<...>(
    x: Tensor<S, E, D, T>,
    running_mean: &mut Tensor<Rank1<C>, E, D>,
    running_var: &mut Tensor,
    scale: Tensor,
    bias: Tensor,
    epsilon: E,
)

@kstavro
Contributor Author

kstavro commented Feb 26, 2023

I'm fairly certain the bottleneck is the binary operations backward when dealing with broadcasted arrays. I think sum() backwards is also a big slowdown as well. I think #491 is probably the most relevant issue for this.

Ok, this sounds interesting, I'll try to have a look at it once I'm done with the current issue. Since there is already a more specific issue, there is no point in opening a new one for the moment, I guess.

Gotcha. I think we might actually be able to get away with just moving train_fwd and infer_fwd to being functions that accept all the tensors instead of structs, and then the different structs can just call them directly:

fn train_fwd<...>(
    x: Tensor<S, E, D, T>,
    running_mean: &mut Tensor<Rank1<C>, E, D>,
    running_var: &mut Tensor,
    scale: Tensor,
    bias: Tensor,
    epsilon: E,
)

Great, sounds good, let's do it like that, it keeps things simple. On it!

@kstavro
Contributor Author

kstavro commented Mar 1, 2023

Just checking in, because it has been a couple of days. I have been struggling a bit to make it work with both ways of trying to deduplicate the code:

  1. With the functions as suggested above, I also need to return the running terms for the next step (so a tuple), but then I can't seem to get the type of the expected output right. With the combinations of tuples and Results that look right as the output type, I get compiler complaints about the "?" inside the functions, which is implemented for every try_ function.
  2. I have also tried to make the BatchNorm trait (which might be the more correct way to do it), where I define fn scale(&self) -> &mut Tensor<Rank1<C>, E, D>; etc. so they get updated as in the original struct. But then I get a missing implementation of TryAdd at the step let std = (self.running_var().clone() + self.epsilon()).try_sqrt()?;. It reads as follows:
tensor_impls.rs(35, 1): an implementation of `tensor_ops::add::TryAdd<_>` might be missing for `tensor_impls::Tensor<(shape::Const<C>,), E, D>`
mod.rs(50, 1): the following trait must be implemented
batchnorm2d.rs(157, 59): consider further restricting this bound: ` + tensor_ops::utilities::ops::UnaryKernel<tensor_ops::add::ScalarAddKernelOp<E>, E>`

Restricting the bound as suggested has created other problems elsewhere. I have also tried to add the implementations myself, but not successfully (it also feels like I shouldn't have to).

Will be checking it again in the next days, and if I don't sort it out by Friday, I might just copy BatchNorm2D, convert it to BatchNorm1D, and leave deduplicating the code to a better Rust programmer or someone more knowledgeable about all the hidden details of dfdx. Maybe it's a dumb thing I am doing and I'll sort it out in the next days (I still want to fight through it a bit).

@coreylowman
Owner

Okay, sounds good. This is useful to know though, as I really want to make it as easy as possible for people to contribute at all levels of Rust experience. So if you have any other things that you got stuck on, sharing them would be really useful!
