Small upgrades to training docs #2331

Merged
merged 4 commits on Mar 5, 2024
12 changes: 7 additions & 5 deletions docs/src/training/reference.md
@@ -10,10 +10,6 @@ Because of this:
* Flux defines its own version of `setup` which checks this assumption.
(Using `Optimisers.setup` instead will also work; they return the same thing, as the sketch below shows.)
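For instance, a minimal sketch of that equivalence (the `Dense` layer and learning rate here are throwaway stand-ins, not part of the docs):

```julia
using Flux
import Optimisers

model = Dense(2 => 1)          # stand-in for a real model
rule  = Optimisers.Adam(0.01)

state_flux = Flux.setup(rule, model)        # Flux's setup, with its extra check
state_opt  = Optimisers.setup(rule, model)  # returns the same tree of optimiser states
```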

The new implementation of rules such as Adam in Optimisers.jl is quite different from the old one in `Flux.Optimise`. In Flux 0.14, `Flux.Adam()` returns the old one, with supertype `Flux.Optimise.AbstractOptimiser`, but `setup` will silently translate it to its new counterpart.
The available rules are listed on the [optimisation rules](@ref man-optimisers) page;
see the [Optimisers documentation](https://fluxml.ai/Optimisers.jl/dev/) for details on how the new rules work.

```@docs
Flux.Train.setup
Flux.Train.train!(loss, model, data, state; cb)
@@ -47,10 +43,16 @@ Flux 0.13 and 0.14 are the transitional versions which support both; Flux 0.15 w
The blue-green boxes in the [training section](@ref man-training) describe
the changes needed to upgrade old code.

The available rules are listed on the [optimisation rules](@ref man-optimisers) page.

!!! compat "Old & new rules"
The new implementation of rules such as Adam in Optimisers.jl is quite different from the old one in `Flux.Optimise`. In Flux 0.14, `Flux.Adam()` still returns the old one, with supertype `Flux.Optimise.AbstractOptimiser`, but `setup` will silently translate it to its new counterpart.

For full details on the interface for implicit-style optimisers, see the [Flux 0.13.6 manual](https://fluxml.ai/Flux.jl/v0.13.6/training/training/).
See the [Optimisers documentation](https://fluxml.ai/Optimisers.jl/dev/) for details on how the new rules work.
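As a small sketch of this old-to-new handover (the `Dense` model is just a placeholder):

```julia
using Flux

model = Dense(2 => 1)

opt = Flux.Adam(0.1)                       # in Flux 0.14, still the old-style rule
@assert opt isa Flux.Optimise.AbstractOptimiser

state = Flux.setup(opt, model)  # setup accepts it, silently swapping in the new Optimisers.jl Adam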

!!! compat "Flux ≤ 0.12"
Earlier versions of Flux exported `params`, thus allowing unqualified `params(model)`
Much earlier versions of Flux exported `params`, thus allowing unqualified `params(model)`
after `using Flux`. This conflicted with too many other packages, and was removed in Flux 0.13.
If you get an error `UndefVarError: params not defined`, this probably means that you are
following code for Flux 0.12 or earlier on a more recent version.
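A hedged sketch of the upgrade path (the model and loss below are made up): either qualify the call as `Flux.params`, or better, move to the explicit style described in the training section:

```julia
using Flux

model = Chain(Dense(2 => 3, tanh), Dense(3 => 1))

ps = Flux.params(model)   # the implicit-style call now needs the `Flux.` prefix

# Explicit style: take the gradient with respect to the model itself
grads = Flux.gradient(m -> sum(abs2, m([1f0, 2f0])), model)
```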
8 changes: 7 additions & 1 deletion docs/src/training/training.md
@@ -225,6 +225,9 @@ callback API. Here is an example, in which it may be helpful to note:
returns the value of the function, for logging or diagnostic use.
* Logging or printing is best done outside of the `gradient` call,
as there is no need to differentiate these commands.
* To use `result` for logging purposes, you could change the `do` block to end with
`return my_loss(result, label), result`, i.e. make the function passed to `withgradient`
return a tuple, as sketched after this list. The first element is always the loss.
* Julia's `break` and `continue` keywords let you exit from parts of the loop.
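Putting these points together, a minimal sketch of such a loop; the model, loss, data, and `opt_state` below are throwaway placeholders, not part of the original example:

```julia
using Flux

model     = Dense(4 => 2)
my_loss   = Flux.mse
data      = [(randn(Float32, 4), randn(Float32, 2)) for _ in 1:3]
opt_state = Flux.setup(Adam(0.01), model)

for (input, label) in data
    # The do-block returns a tuple: only its first element (the loss) is differentiated,
    # but the whole tuple comes back as `val`, so `result` is available for logging.
    val, grads = Flux.withgradient(model) do m
        result = m(input)
        return my_loss(result, label), result
    end
    loss, result = val
    @info "step" loss
    Flux.update!(opt_state, model, grads[1])
end
```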

@@ -319,9 +322,12 @@ The first, [`WeightDecay`](@ref Flux.WeightDecay) adds `0.42` times original par
matching the gradient of the penalty above (with the same, unrealistically large, constant).
After that, in either case, [`Adam`](@ref Flux.Adam) computes the final update.

The same trick works for *L₁ regularisation* (also called Lasso), where the penalty is
`pen_l1(x::AbstractArray) = sum(abs, x)` instead. This is implemented by `SignDecay(0.42)`.
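Following the pattern of the `WeightDecay` chain described above (and keeping the same deliberately oversized constant), a sketch might look like:

```julia
using Flux

model = Dense(3 => 1)

# L1 penalty applied through the gradient, after which Adam computes the final update
opt_state = Flux.setup(OptimiserChain(SignDecay(0.42), Adam(0.1)), model)
```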

The same `OptimiserChain` mechanism can be used for other purposes, such as gradient clipping with [`ClipGrad`](@ref Flux.Optimise.ClipValue) or [`ClipNorm`](@ref Flux.Optimise.ClipNorm).
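For example, a sketch of clipping each gradient's norm before Adam sees it (the threshold and model are arbitrary):

```julia
using Flux

model = Dense(3 => 1)

# Rescale any gradient whose norm exceeds 1, then apply Adam as usual
opt_state = Flux.setup(OptimiserChain(ClipNorm(1.0), Adam(0.001)), model)
```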

Besides L2 / weight decay, another common and quite different kind of regularisation is
Besides L1 / L2 / weight decay, another common and quite different kind of regularisation is
provided by the [`Dropout`](@ref Flux.Dropout) layer. This turns off some outputs of the
previous layer during training.
It should switch automatically, but see [`trainmode!`](@ref Flux.trainmode!) / [`testmode!`](@ref Flux.testmode!) to manually enable or disable this layer.
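A sketch of this (the layer sizes and dropout probability are arbitrary):

```julia
using Flux

# Dropout zeroes a random fraction of the previous layer's outputs, but only during training
model = Chain(Dense(10 => 32, relu), Dropout(0.5), Dense(32 => 1))

Flux.testmode!(model)   # force evaluation mode: Dropout does nothing
Flux.trainmode!(model)  # force training mode: Dropout is active again
```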