Max norm regularisation #541

KristofferC · 2019-01-07T14:16:10Z

Perhaps each layer could take a constraint, so that something like max norm regularization ||w|| < c would be simple to add to a layer (or to all layers in a model)?

How it is done in Keras:

https://github.com/keras-team/keras/blob/1d81a20292ca6926e595d06a6cd725dbb104a146/keras/constraints.py#L51-L55

pshashk · 2019-01-08T15:56:01Z

Why not just add a norm-clipping function to a vector of optimizers?
Something like: train!(loss, ps, data, [opt, clip_norm]), where clip_norm() applies the clipping w.data ./= max(1, norm(w.data) / c) to every parameter w.
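
For illustration, a minimal sketch of that clipping step in plain Julia. The name clip_norm! and its signature are hypothetical (not an existing Flux function), and it operates on a plain array rather than on w.data:

```julia
using LinearAlgebra

# Hypothetical helper: rescale `w` in place so that norm(w) <= c.
# Arrays whose norm is already at most c are left unchanged.
function clip_norm!(w::AbstractArray, c::Real)
    w ./= max(1, norm(w) / c)
    return w
end

w = [3.0, 4.0]        # norm(w) == 5
clip_norm!(w, 1.0)    # w is rescaled to norm 1
```

Dividing by max(1, norm(w) / c) rather than branching keeps the operation a single broadcast, which is presumably why it was written that way above.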

MikeInnes · 2019-01-10T10:37:19Z

This does seem like it should be part of the optimiser in some way, but it's not obvious how to express it.

One thing worth considering vs Keras is that this should be something that composes with existing layers (as opposed to a similar-looking but independent API added to each layer).

pshashk · 2019-01-10T13:44:57Z

Parameter regularization (a term in the loss) that depends only on the model parameters and not on the data is already quite easy to write. I can't think of any concise regularizer API that wouldn't sacrifice flexibility or readability.

# l2 regularization of dense layer 
loss = ... + λ₁ * colnorm(dense_layer.W) + λ₂ * norm(dense_layer.b)
# soft parameter sharing
loss = ... + λ * norm(layer1.W - layer2.W)

Max norm and other parameter constraints are different because they mutate the parameters after update!(opt, ps), without affecting the loss or the gradients. Maybe we can define constraints as callbacks?

Something like: train!(loss, ps, data, opt, cb = every_nth(clip_norm, 5)), where every_nth(clip_norm, 5) evaluates clip_norm every 5 training iterations, avoiding unnecessary computation in a reproducible manner.
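
A rough sketch of what every_nth could look like, assuming the callback is invoked once per training iteration. The helper is hypothetical and counts calls rather than elapsed time, which is what makes it reproducible:

```julia
# Hypothetical helper: wrap `f` so that it only runs on every n-th call.
function every_nth(f, n::Integer)
    seen = 0                      # number of calls so far, captured by the closure
    return function (args...)
        seen += 1
        seen % n == 0 && f(args...)
        return nothing
    end
end

# Usage: count how often the wrapped function actually fires.
fired = Ref(0)
cb = every_nth(() -> fired[] += 1, 5)
for _ in 1:12
    cb()
end
# `fired[]` is now 2: the wrapped function ran on the 5th and 10th calls.
```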

MikeInnes · 2019-01-11T10:25:50Z

Yes, that's a nice separation of the issues. I could imagine having some kind of sugar for very common forms of regularisation, but we don't need to worry about that here.

One way to express constraints would be something like Optimiser(Descent(0.1), Clip(c)). The implementation of Clip would just be a bit weird (it would have to work by emitting a gradient that subtracts the right amount). But we want to expose per-layer optimisers anyway so that dovetails nicely. However, that doesn't get you every_nth automatically, if that's something important.
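
To make "emitting a gradient that subtracts the right amount" concrete, here is a sketch in plain Julia. It assumes the optimiser ends by computing w .- Δ, and that Clip receives the update Δ proposed by the preceding rule. The function clip_delta is hypothetical and is not the Flux implementation:

```julia
using LinearAlgebra

# Hypothetical helper: given parameters `w` and a proposed update `Δ`,
# return an adjusted update Δ′ such that norm(w .- Δ′) <= c.
function clip_delta(w, Δ, c)
    target = w .- Δ                                # where the plain update would land
    clipped = target ./ max(1, norm(target) / c)   # project back to norm <= c
    return w .- clipped                            # update that lands on the clipped point
end

w = [3.0, 4.0]
Δ = [1.0, 1.0]
Δ′ = clip_delta(w, Δ, 1.0)
w_new = w .- Δ′    # has norm 1, pointing in the direction of w .- Δ
```

When the proposed update already lands within the norm bound, clipped equals target and Δ′ equals Δ up to floating-point rounding.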

Things are weird partly because callbacks are a strict generalisation of optimisers, yet we have both. I'd like to eventually expose that relationship explicitly but I'm not sure how to do it neatly yet.

darsnack · 2021-02-12T23:31:08Z

Since #1133 this should no longer be an issue.

darsnack closed this as completed Feb 12, 2021
