
Max norm regularisation #541

Closed
KristofferC opened this issue Jan 7, 2019 · 5 comments


@KristofferC
Contributor

Perhaps each layer could take a constraint, so that something like max norm regularization, ||w|| < c, would be simple to add to a layer (or to all layers in a model?).

How it is done in Keras:

https://github.com/keras-team/keras/blob/1d81a20292ca6926e595d06a6cd725dbb104a146/keras/constraints.py#L51-L55
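
For concreteness, a minimal Julia sketch of the Keras-style constraint linked above: rescale each column of a weight matrix so its 2-norm is at most c. The maxnorm name and the ϵ guard are illustrative assumptions, not an existing Flux API.

# Rescale each column of W so that its 2-norm is at most c.
function maxnorm(W::AbstractMatrix, c::Real; ϵ = 1e-7)
    norms = sqrt.(sum(abs2, W; dims = 1))   # per-column 2-norms
    desired = clamp.(norms, 0, c)           # cap the norms at c
    return W .* (desired ./ (norms .+ ϵ))   # rescale offending columns
end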

@pshashk
Contributor

pshashk commented Jan 8, 2019

Why not just add a norm-clipping function to a vector of optimizers?
Something like train!(loss, ps, data, [opt, clip_norm]), where clip_norm() applies the clipping w.data ./= max(1, norm(w.data) / c) to every parameter w. A sketch follows.
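
A minimal sketch of that proposal, assuming the Tracker-era parameters with a .data field that the comment refers to; make_clip_norm is a hypothetical helper name:

using LinearAlgebra: norm

# Build a zero-argument callback that projects every parameter in ps
# back into the norm ball of radius c.
function make_clip_norm(ps, c)
    return function clip_norm()
        for w in ps
            w.data ./= max(1, norm(w.data) / c)   # no-op when ‖w‖ ≤ c
        end
    end
end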

@MikeInnes
Member

This does seem like it should be part of the optimiser in some way, but it's not obvious how to express it.

One thing worth considering vs Keras is that this should be something that composes with existing layers (as opposed to a similar-looking but independent API added to each layer).

@pshashk
Contributor

pshashk commented Jan 10, 2019

Parameter regularization (the part of the loss that depends only on the model parameters, not on the data) is already quite easy to write. I can't think of a concise regularizer API that wouldn't sacrifice flexibility or readability.

# L2 regularization of a dense layer (colnorm: a user-defined column-wise norm)
loss = ... + λ₁ * colnorm(dense_layer.W) + λ₂ * norm(dense_layer.b)
# soft parameter sharing between two layers
loss = ... + λ * norm(layer1.W - layer2.W)

Max norm and other parameter constraints are different because they mutate the parameters after update!(opt, ps), without affecting the loss or gradients. Maybe we can define constraints as callbacks?

Something like train!(loss, ps, data, opt, cb = every_nth(clip_norm, 5)), where every_nth(clip_norm, 5) evaluates clip_norm every 5 training iterations, to avoid unnecessary computation in a reproducible manner. A sketch of such a combinator is below.
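
A minimal sketch of that combinator, assuming a plain zero-argument callback as used by train!'s cb keyword; every_nth is the hypothetical helper named above:

# Wrap a callback f so that it only fires on every n-th invocation.
function every_nth(f, n)
    i = 0
    return () -> begin
        i += 1
        i % n == 0 && f()
        return nothing
    end
end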

@MikeInnes
Member

MikeInnes commented Jan 11, 2019

Yes, that's a nice separation of the issues. I could imagine having some kind of sugar for very common forms of regularisation, but we don't need to worry about that here.

One way to express constraints would be something like Optimiser(Descent(0.1), Clip(c)). The implementation of Clip would just be a bit weird (it would have to work by emitting a gradient that subtracts the right amount). But we want to expose per-layer optimisers anyway, so that dovetails nicely. However, that doesn't get you every_nth automatically, if that's something important.
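
A minimal sketch of that Clip idea, assuming the apply!(opt, x, Δ) interface that Flux's composable optimisers chain through (Δ is the update that will be subtracted from x); the Clip type itself is hypothetical:

using LinearAlgebra: norm

struct Clip
    c::Float64
end

function apply!(o::Clip, x, Δ)
    x′ = x .- Δ                        # parameter value after this step
    n = norm(x′)
    if n > o.c
        Δ .= x .- x′ .* (o.c / n)      # subtract exactly enough to land on ‖x‖ = c
    end
    return Δ
end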

Things are weird partly because callbacks are a strict generalisation of optimisers, yet we have both. I'd like to eventually expose that relationship explicitly but I'm not sure how to do it neatly yet.

@darsnack
Member

Since #1133, this should no longer be an issue.
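
For reference, assuming #1133 refers to the composable clipping optimisers (ClipValue/ClipNorm) added to Flux.Optimise, a clipping step can now be chained with a descent step:

using Flux
opt = Optimiser(ClipNorm(1.0), Descent(0.1))   # clip the gradient norm, then descend
Flux.train!(loss, ps, data, opt)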
