From 56180591bb0ed5f9d68a6926e8946b24df582bc8 Mon Sep 17 00:00:00 2001
From: Michael Abbott <32575566+mcabbott@users.noreply.github.com>
Date: Tue, 5 Mar 2024 14:39:03 -0600
Subject: [PATCH] Small upgrades to training docs (#2331)

---
 docs/src/training/reference.md | 12 +++++++-----
 docs/src/training/training.md  |  8 +++++++-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/docs/src/training/reference.md b/docs/src/training/reference.md
index 77dc0f81d0..1bf0cfd1bf 100644
--- a/docs/src/training/reference.md
+++ b/docs/src/training/reference.md
@@ -10,10 +10,6 @@ Because of this:
 * Flux defines its own version of `setup` which checks this assumption.
   (Using instead `Optimisers.setup` will also work, they return the same thing.)
 
-The new implementation of rules such as Adam in the Optimisers is quite different from the old one in `Flux.Optimise`. In Flux 0.14, `Flux.Adam()` returns the old one, with supertype `Flux.Optimise.AbstractOptimiser`, but `setup` will silently translate it to its new counterpart.
-The available rules are listed the [optimisation rules](@ref man-optimisers) page here;
-see the [Optimisers documentation](https://fluxml.ai/Optimisers.jl/dev/) for details on how the new rules work.
-
 ```@docs
 Flux.Train.setup
 Flux.Train.train!(loss, model, data, state; cb)
@@ -47,10 +43,16 @@ Flux 0.13 and 0.14 are the transitional versions which support both; Flux 0.15 w
 The blue-green boxes in the [training section](@ref man-training) describe
 the changes needed to upgrade old code.
 
+The available rules are listed on the [optimisation rules](@ref man-optimisers) page.
+
+!!! compat "Old & new rules"
+    The new implementation of rules such as Adam in Optimisers.jl is quite different from the old one in `Flux.Optimise`. In Flux 0.14, `Flux.Adam()` still returns the old one, with supertype `Flux.Optimise.AbstractOptimiser`, but `setup` will silently translate it to its new counterpart.
+
 For full details on the interface for implicit-style optimisers, see the [Flux 0.13.6 manual](https://fluxml.ai/Flux.jl/v0.13.6/training/training/).
+See the [Optimisers documentation](https://fluxml.ai/Optimisers.jl/dev/) for details on how the new rules work.
 
 !!! compat "Flux ≤ 0.12"
-    Earlier versions of Flux exported `params`, thus allowing unqualified `params(model)`
+    Much earlier versions of Flux exported `params`, thus allowing unqualified `params(model)`
     after `using Flux`. This conflicted with too many other packages, and was removed in Flux 0.13.
     If you get an error `UndefVarError: params not defined`, this probably means that you are
     following code for Flux 0.12 or earlier on a more recent version.
diff --git a/docs/src/training/training.md b/docs/src/training/training.md
index 623b4788fc..6dd80897b5 100644
--- a/docs/src/training/training.md
+++ b/docs/src/training/training.md
@@ -225,6 +225,9 @@ callback API. Here is an example, in which it may be helpful to note:
   returns the value of the function, for logging or diagnostic use.
 * Logging or printing is best done outside of the `gradient` call,
   as there is no need to differentiate these commands.
+* To use `result` for logging purposes, you could change the `do` block to end with
+  `return my_loss(result, label), result`, i.e. make the function passed to `withgradient`
+  return a tuple. The first element is always the loss.
 * Julia's `break` and `continue` keywords let you exit from parts of the loop.
 
 ```julia
@@ -319,9 +322,12 @@ The first, [`WeightDecay`](@ref Flux.WeightDecay) adds `0.42` times original par
 matching the gradient of the penalty above (with the same, unrealistically large, constant).
 After that, in either case, [`Adam`](@ref Flux.Adam) computes the final update.
 
+The same trick works for *L₁ regularisation* (also called Lasso), where the penalty is
+`pen_l1(x::AbstractArray) = sum(abs, x)` instead. This is implemented by `SignDecay(0.42)`.
+
 The same `OptimiserChain` mechanism can be used for other purposes, such as gradient clipping with [`ClipGrad`](@ref Flux.Optimise.ClipValue) or [`ClipNorm`](@ref Flux.Optimise.ClipNorm).
 
-Besides L2 / weight decay, another common and quite different kind of regularisation is
+Besides L1 / L2 / weight decay, another common and quite different kind of regularisation is
 provided by the [`Dropout`](@ref Flux.Dropout) layer. This turns off some outputs of the
 previous layer during training.
 It should switch automatically, but see [`trainmode!`](@ref Flux.trainmode!) / [`testmode!`](@ref Flux.testmode!) to manually enable or disable this layer.
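As a rough illustration of the `withgradient` tuple trick described in the first `training.md` hunk above, here is a minimal sketch; the names `model`, `input`, `label`, `my_loss` and `opt_state` are toy stand-ins for the ones used by the surrounding example in the manual.

```julia
using Flux

# Toy stand-ins, just so the snippet runs on its own:
model = Dense(3 => 2)
input, label = rand(Float32, 3), rand(Float32, 2)
my_loss(ŷ, y) = Flux.mse(ŷ, y)
opt_state = Flux.setup(Adam(), model)

# Returning a tuple from the `do` block: only the first element (the loss)
# is differentiated; the rest is passed back unchanged, for logging.
val, grads = Flux.withgradient(model) do m
    result = m(input)
    return my_loss(result, label), result
end

loss, result = val          # `val` holds the whole tuple; `loss` is its first element
Flux.update!(opt_state, model, grads[1])
```

Here `val` is everything the `do` block returned, while `grads` contains the gradient of its first element only, so `result` can be logged without being differentiated.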
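Similarly, a minimal sketch of the decay rules discussed in the regularisation hunk, assuming a Flux version recent enough that `SignDecay` is available after `using Flux` (it is defined in Optimisers.jl); the `0.42` constant is the same deliberately oversized one used in the text, and the model here is purely hypothetical.

```julia
using Flux

model = Chain(Dense(4 => 8, relu), Dense(8 => 1))   # hypothetical model

# Explicit penalty functions: L₁ as quoted in the text, plus the usual L₂ counterpart.
pen_l1(x::AbstractArray) = sum(abs, x)        # gradient of 0.42 * pen_l1 is 0.42 .* sign.(x)
pen_l2(x::AbstractArray) = sum(abs2, x) / 2   # gradient of 0.42 * pen_l2 is 0.42 .* x

# The same gradient contributions produced by optimiser rules, composed with `OptimiserChain`:
opt_l1 = Flux.setup(OptimiserChain(SignDecay(0.42), Adam(0.1)), model)    # L₁ / Lasso
opt_l2 = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)  # L₂ / weight decay

# Gradient clipping composes the same way:
opt_clip = Flux.setup(OptimiserChain(ClipNorm(1.0), Adam(0.1)), model)
```

Any of these states can then be passed to `Flux.update!` or `train!` in place of a plain `Adam` state.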