From 56180591bb0ed5f9d68a6926e8946b24df582bc8 Mon Sep 17 00:00:00 2001
From: Michael Abbott <32575566+mcabbott@users.noreply.github.com>
Date: Tue, 5 Mar 2024 14:39:03 -0600
Subject: [PATCH] Small upgrades to training docs (#2331)

---
 docs/src/training/reference.md | 12 +++++++-----
 docs/src/training/training.md  |  8 +++++++-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/docs/src/training/reference.md b/docs/src/training/reference.md
index 77dc0f81d0..1bf0cfd1bf 100644
--- a/docs/src/training/reference.md
+++ b/docs/src/training/reference.md
@@ -10,10 +10,6 @@ Because of this:
 * Flux defines its own version of `setup` which checks this assumption.
   (Using instead `Optimisers.setup` will also work, they return the same thing.)
 
-The new implementation of rules such as Adam in the Optimisers is quite different from the old one in `Flux.Optimise`. In Flux 0.14, `Flux.Adam()` returns the old one, with supertype `Flux.Optimise.AbstractOptimiser`, but `setup` will silently translate it to its new counterpart.
-The available rules are listed the [optimisation rules](@ref man-optimisers) page here;
-see the [Optimisers documentation](https://fluxml.ai/Optimisers.jl/dev/) for details on how the new rules work.
-
 ```@docs
 Flux.Train.setup
 Flux.Train.train!(loss, model, data, state; cb)
@@ -47,10 +43,16 @@ Flux 0.13 and 0.14 are the transitional versions which support both; Flux 0.15 w
 The blue-green boxes in the [training section](@ref man-training) describe
 the changes needed to upgrade old code.
 
+The available rules are listed on the [optimisation rules](@ref man-optimisers) page.
+
+!!! compat "Old & new rules"
+    The new implementation of rules such as Adam in Optimisers.jl is quite different from the old one in `Flux.Optimise`. In Flux 0.14, `Flux.Adam()` still returns the old one, with supertype `Flux.Optimise.AbstractOptimiser`, but `setup` will silently translate it to its new counterpart.
+
 For full details on the interface for implicit-style optimisers, see the [Flux 0.13.6 manual](https://fluxml.ai/Flux.jl/v0.13.6/training/training/).
+See the [Optimisers documentation](https://fluxml.ai/Optimisers.jl/dev/) for details on how the new rules work.
 
 !!! compat "Flux ≤ 0.12"
-    Earlier versions of Flux exported `params`, thus allowing unqualified `params(model)`
+    Much earlier versions of Flux exported `params`, thus allowing unqualified `params(model)`
     after `using Flux`. This conflicted with too many other packages, and was removed in Flux 0.13.
     If you get an error `UndefVarError: params not defined`, this probably means that you are
     following code for Flux 0.12 or earlier on a more recent version.
diff --git a/docs/src/training/training.md b/docs/src/training/training.md
index 623b4788fc..6dd80897b5 100644
--- a/docs/src/training/training.md
+++ b/docs/src/training/training.md
@@ -225,6 +225,9 @@ callback API. Here is an example, in which it may be helpful to note:
   returns the value of the function, for logging or diagnostic use.
 * Logging or printing is best done outside of the `gradient` call,
   as there is no need to differentiate these commands.
+* To use `result` for logging purposes, you could change the `do` block to end with
+  `return my_loss(result, label), result`, i.e. make the function passed to `withgradient`
+  return a tuple. The first element is always the loss.
 * Julia's `break` and `continue` keywords let you exit from parts of the loop.
 
 ```julia
@@ -319,9 +322,12 @@ The first, [`WeightDecay`](@ref Flux.WeightDecay) adds `0.42` times original par
 matching the gradient of the penalty above (with the same, unrealistically large, constant).
 After that, in either case, [`Adam`](@ref Flux.Adam) computes the final update.
 
+The same trick works for *L₁ regularisation* (also called Lasso), where the penalty is
+`pen_l1(x::AbstractArray) = sum(abs, x)` instead. This is implemented by `SignDecay(0.42)`.
+
 The same `OptimiserChain` mechanism can be used for other purposes, such as gradient clipping with [`ClipGrad`](@ref Flux.Optimise.ClipValue) or [`ClipNorm`](@ref Flux.Optimise.ClipNorm).
 
-Besides L2 / weight decay, another common and quite different kind of regularisation is
+Besides L1 / L2 / weight decay, another common and quite different kind of regularisation is
 provided by the [`Dropout`](@ref Flux.Dropout) layer. This turns off some outputs of the
 previous layer during training.
 It should switch automatically, but see [`trainmode!`](@ref Flux.trainmode!) / [`testmode!`](@ref Flux.testmode!) to manually enable or disable this layer.
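As a rough illustration of the `withgradient` tuple trick described in the first `training.md` hunk above, here is a minimal sketch; the names `model`, `input`, `label`, `my_loss` and `opt_state` are toy stand-ins for the ones used by the surrounding example in the manual.

```julia
using Flux

# Toy stand-ins, just so the snippet runs on its own:
model = Dense(3 => 2)
input, label = rand(Float32, 3), rand(Float32, 2)
my_loss(ŷ, y) = Flux.mse(ŷ, y)
opt_state = Flux.setup(Adam(), model)

# Returning a tuple from the `do` block: only the first element (the loss)
# is differentiated; the rest is passed back unchanged, for logging.
val, grads = Flux.withgradient(model) do m
    result = m(input)
    return my_loss(result, label), result
end

loss, result = val          # `val` holds the whole tuple; `loss` is its first element
Flux.update!(opt_state, model, grads[1])
```

Here `val` is everything the `do` block returned, while `grads` contains the gradient of its first element only, so `result` can be logged without being differentiated.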
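Similarly, a minimal sketch of the decay rules discussed in the regularisation hunk, assuming a Flux version recent enough that `SignDecay` is available after `using Flux` (it is defined in Optimisers.jl); the `0.42` constant is the same deliberately oversized one used in the text, and the model here is purely hypothetical.

```julia
using Flux

model = Chain(Dense(4 => 8, relu), Dense(8 => 1))   # hypothetical model

# Explicit penalty functions: L₁ as quoted in the text, plus the usual L₂ counterpart.
pen_l1(x::AbstractArray) = sum(abs, x)        # gradient of 0.42 * pen_l1 is 0.42 .* sign.(x)
pen_l2(x::AbstractArray) = sum(abs2, x) / 2   # gradient of 0.42 * pen_l2 is 0.42 .* x

# The same gradient contributions produced by optimiser rules, composed with `OptimiserChain`:
opt_l1 = Flux.setup(OptimiserChain(SignDecay(0.42), Adam(0.1)), model)    # L₁ / Lasso
opt_l2 = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)  # L₂ / weight decay

# Gradient clipping composes the same way:
opt_clip = Flux.setup(OptimiserChain(ClipNorm(1.0), Adam(0.1)), model)
```

Any of these states can then be passed to `Flux.update!` or `train!` in place of a plain `Adam` state.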