From b7c4ae9136ced32345d14fc32fc33607764c5fb7 Mon Sep 17 00:00:00 2001 From: Saransh Date: Tue, 5 Jul 2022 16:10:22 +0530 Subject: [PATCH 01/23] Create a getting started section and add a new linear regression example --- docs/Project.toml | 3 + docs/make.jl | 4 +- .../src/{models => getting_started}/basics.md | 0 docs/src/getting_started/linear_regression.md | 496 ++++++++++++++++++ .../{models => getting_started}/overview.md | 0 docs/src/gpu.md | 2 +- docs/src/models/advanced.md | 2 +- docs/src/training/optimisers.md | 2 +- docs/src/training/training.md | 8 +- xy.jld2 | Bin 0 -> 1337 bytes 10 files changed, 508 insertions(+), 9 deletions(-) rename docs/src/{models => getting_started}/basics.md (100%) create mode 100644 docs/src/getting_started/linear_regression.md rename docs/src/{models => getting_started}/overview.md (100%) create mode 100644 xy.jld2 diff --git a/docs/Project.toml b/docs/Project.toml index 0879636f3c..222e63405b 100644 --- a/docs/Project.toml +++ b/docs/Project.toml @@ -3,10 +3,13 @@ BSON = "fbb218c0-5317-5bc6-957e-2ee96dd4b1f0" ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" Functors = "d9f16b24-f501-4c13-a1f2-28368ffc5196" +MLDatasets = "eb30cadb-4394-5ae3-aed4-317e484a6458" MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54" NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd" OneHotArrays = "0b1bfda6-eb8a-41d2-88d8-f5af5cad476f" Optimisers = "3bd65402-5787-11e9-1adc-39752487f4e2" +Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80" +Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f" [compat] diff --git a/docs/make.jl b/docs/make.jl index 40d6033637..31950e30f2 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -1,10 +1,10 @@ -using Documenter, Flux, NNlib, Functors, MLUtils, BSON, Optimisers, OneHotArrays, Zygote, ChainRulesCore +using Documenter, Flux, NNlib, Functors, MLUtils, BSON, Optimisers, OneHotArrays, Zygote, ChainRulesCore, Plots, MLDatasets, Statistics DocMeta.setdocmeta!(Flux, :DocTestSetup, :(using Flux); recursive = true) makedocs( - modules = [Flux, NNlib, Functors, MLUtils, BSON, Optimisers, OneHotArrays, Zygote, ChainRulesCore, Base], + modules = [Flux, NNlib, Functors, MLUtils, BSON, Optimisers, OneHotArrays, Zygote, ChainRulesCore, Base, Plots, MLDatasets, Statistics], doctest = false, sitename = "Flux", # strict = [:cross_references,], diff --git a/docs/src/models/basics.md b/docs/src/getting_started/basics.md similarity index 100% rename from docs/src/models/basics.md rename to docs/src/getting_started/basics.md diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md new file mode 100644 index 0000000000..139738552f --- /dev/null +++ b/docs/src/getting_started/linear_regression.md @@ -0,0 +1,496 @@ +# Linear Regression + +The following page contains a step-by-step walkthrough of the linear regression algorithm in `Julia` using `Flux`! We will start by creating a simple linear regression model for dummy data and then move on to a real dataset. The first part would involve writing some parts of the model on our own, which will later be replaced by `Flux`. + +## A simple linear regression model +Let us start by building a simple linear regression model. This model would be trained on the data points of the form `(x₁, y₁), (x₂, y₂), ... , (xₙ, yₙ)`. 
In the real world, these `x`s denote a feature, and the `y`s denote a label; hence, our data would have `n` data points, each point mapping a single feature to a single label. + +Importing the required `Julia` packages - + +```jldoctest linear_regression_simple +julia> using Flux + +julia> using Plots +``` +### Generating a dataset +The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, each `x` is a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. + +```jldoctest linear_regression_simple +julia> x = hcat(collect(Float32, -3:0.1:3)...); + +julia> x |> size +(1, 61) + +julia> typeof(x) +Matrix{Float32} (alias for Array{Float32, 2}) +``` + +The `hcat` call generates a `Matrix` with numbers ranging from `-3.0` to `3.0` with a gap of `0.1` between them. Each column of this matrix holds a single `x`, a total of 61 `x`s. The next step would be to generate the corresponding labels or the `y`s. + +```jldoctest linear_regression_simple +julia> f(x) = @. 3x + 2; + +julia> y = f(x); + +julia> y |> size +(1, 61) + +julia> typeof(y) +Matrix{Float32} (alias for Array{Float32, 2}) +``` + +The function `f` maps each `x` to a `y`, and as `x` is a `Matrix`, the expression broadcasts the scalar values using `@.` macro. Our data points are ready, but they are too perfect. In a real-world scenario, we will not have an `f` function to generate `y` values, but instead, the labels would be manually added. + + +```jldoctest linear_regression_simple +julia> x = x .* reshape(rand(Float32, 61), (1, 61)); +``` + +Visualizing the final data - + +```jldoctest linear_regression_simple +julia> plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Generated data", xlabel = "x", ylabel= "y"); +``` + + +![linear-regression-data](https://user-images.githubusercontent.com/74055102/177034397-d433a313-21a5-4394-97d9-5467f5cf6b72.png) + + +The data looks random enough now! The `x` and `y` values are still somewhat correlated; hence, the linear regression algorithm should work fine on our dataset. + +We can now proceed ahead and build a model for our dataset! + +### Building a model + +A linear regression model is mathematically defined as - + +```math +model(x) = Wx + b +``` + +where `W` is the weight matrix and `b` is the bias. For our case, the weight matrix (`W`) would constitute only a single element, as we have only a single feature. We can define our model in `Julia` using the exact same notation! + +```jldoctest linear_regression_simple +julia> model(x) = @. W*x + b +model (generic function with 1 method) +``` + +The `@.` macro allows you to perform the calculations by broadcasting the scalar quantities (for example - the bias). + +The next step would be to initialize the model parameters, which are the weight and the bias. There are a lot of initialization techniques available for different machine learning models, but for the sake of this example, let's pull out the weight from a uniform distribution and initialize the bias as `0`. + +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +julia> W = rand(Float32, 1, 1) +1×1 Matrix{Float32}: + 0.33832288 + +julia> b = [0.0f0] +1-element Vector{Float32}: + 0.0 +``` + +Time to test if our model works! 
```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> model(x) |> size
(1, 61)

julia> model(x)[1], y[1]
(-0.5491928f0, -7.0f0)
```

It does! But the predictions are way off. We need to train the model to improve the predictions, but before training the model, we need to define the loss function. The loss function would ideally output a quantity that we will try to minimize during the entire training process. Here we will use the mean squared error loss function.

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> function loss(x, y)
           ŷ = model(x)
           sum((y .- ŷ).^2) / length(x)
       end;

julia> loss(x, y)
28.880724f0
```

Calling the loss function on our `x`s and `y`s shows how far our predictions (`ŷ`) are from the real labels. More precisely, it calculates the sum of the squares of the residuals and divides it by the total number of data points.

We have successfully defined our model and the loss function, but surprisingly, we haven't used `Flux` anywhere till now. Let's see how we can write the same code using `Flux`.

```jldoctest linear_regression_simple
julia> flux_model = Dense(1 => 1)
Dense(1 => 1)       # 2 parameters
```

A [`Dense(1 => 1)`](@ref Dense) layer denotes a layer of one neuron with one output and one input. This layer is exactly the same as the mathematical model defined by us above! Under the hood, `Flux` also calculates the output using the same expression! But we don't have to initialize the parameters ourselves this time; instead, `Flux` does it for us.

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> flux_model.weight, flux_model.bias
(Float32[1.0764818], Float32[0.0])
```

Now we can check if our model is acting right -

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> flux_model(x) |> size
(1, 61)

julia> flux_model(x)[1], y[1]
(-1.7474315f0, -7.0f0)
```

It is! The next step would be defining the loss function using `Flux`'s functions -

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> function flux_loss(x, y)
           ŷ = flux_model(x)
           Flux.mse(ŷ, y)
       end;

julia> flux_loss(x, y)
23.189152f0
```

Everything works as before! It almost feels like `Flux` provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the `flux_model` is from our custom `model`. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our custom `model` to match those of the `flux_model` -

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> W = Float32[1.0764818]
1-element Vector{Float32}:
 1.0764818
```

To check how both models are performing on the data, let's find out the losses using the `loss` and `flux_loss` functions -

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> loss(x, y), flux_loss(x, y)
(23.189152f0, 23.189152f0)
```

The losses match! This means that our `model` and the `flux_model` are identical on some level, and the two loss functions are completely equivalent! The difference between the models is that `Flux`'s [`Dense`](@ref) layer supports many other arguments that can be used to customize the layer further. But, for this tutorial, let us stick to our simple custom `model`.
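For the curious, here is a small, purely illustrative sketch of that extra flexibility. The activation function and keyword arguments below are arbitrary choices and are not used anywhere else in this tutorial; they only show the kind of customisation [`Dense`](@ref) allows -

```julia
# A Dense layer can also take an activation function and keyword arguments;
# the values below are illustrative only.
layer = Dense(1 => 1, relu; bias = true, init = Flux.glorot_uniform)

layer(x) |> size    # (1, 61) - it can be applied to our data just like flux_model
```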
### Training the model

Before we begin the training procedure with `Flux`, let's initialize an optimiser, finalize our data, and pass our parameters through [`Flux.params`](@ref) to specify that we want all derivatives of `W` and `b`. We will be using the classic [`Gradient Descent`](@ref Descent) algorithm. `Flux` comes loaded with a lot of different optimisers; refer to [Optimisers](@ref) for more information.

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> opt = Descent(0.01);

julia> data = [(x, y)];

julia> params = Flux.params(W, b)
Params([Float32[0.71305436], Float32[0.0]])
```

Now, we can move to the actual training! The training consists of obtaining the gradient of the loss and updating the current parameters with those derivatives using backpropagation. This is achieved using the `Flux.gradient` (see [Taking Gradients](@ref)) and [`Flux.Optimise.update!`](@ref) functions respectively.

```jldoctest linear_regression_simple
julia> gs = Flux.gradient(params) do
           loss(x, y)
       end;

julia> Flux.Optimise.update!(opt, params, gs)
```

We can now check the values of our parameters and the value of the loss function -

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> params, loss(x, y)
(Params([Float32[1.145264], Float32[0.041250423]]), 22.5526f0)
```

The parameters changed, and the loss went down! This means that we successfully trained our model for one epoch. We can plug the training code written above into a loop and train the model for a higher number of epochs. The loop can either run for a fixed number of epochs or stop when certain conditions are met, for example, `change in loss < 0.1`; such conditions can be written in plain `Julia` to suit a user's needs (a short sketch of a condition-based loop appears at the end of this section).

`Flux` also provides a convenience function to train a model. The [`Flux.train!`](@ref) function performs the same task described above and does not require calculating the gradient manually.

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> Flux.train!(loss, params, data, opt)

julia> params, loss(x, y)
(Params([Float32[1.2125431], Float32[0.08175573]]), 21.94231f0)
```

The parameters changed again, and the loss went down again! This was the second epoch of our training procedure. Let's plug this into a for loop and train the model for 60 epochs.

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> for i = 1:60
           Flux.train!(loss, params, data, opt)
       end

julia> params, loss(x, y)
(Params([Float32[3.426797], Float32[1.5412952]]), 8.848401f0)
```

The loss went down significantly!

`Flux` provides yet another convenience functionality, the [`Flux.@epochs`](@ref) macro, which can be used to train a model for a specific number of epochs.

```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> Flux.@epochs 10 Flux.train!(loss, params, data, opt)
[ Info: Epoch 1
[ Info: Epoch 2
[ Info: Epoch 3
[ Info: Epoch 4
[ Info: Epoch 5
[ Info: Epoch 6
[ Info: Epoch 7
[ Info: Epoch 8
[ Info: Epoch 9
[ Info: Epoch 10

julia> params, loss(x, y)
(Params([Float32[3.58633], Float32[1.6624337]]), 8.44982f0)
```

We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 72 epochs, and the loss went down from `23.189152` to `8.44982`. Time for some visualization!
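Before moving on to the results, here is a minimal sketch of the condition-based loop promised above. It is only an illustration: the function name and the `tol`/`max_epochs` defaults are made up, and it simply reuses the `loss`, `params`, `data`, and `opt` defined earlier -

```julia
# Illustrative sketch: keep training until the improvement in the loss is small.
function train_until_converged!(; max_epochs = 1000, tol = 0.1)
    prev = loss(x, y)
    for epoch in 1:max_epochs                        # upper bound so the loop always ends
        Flux.train!(loss, params, data, opt)
        current = loss(x, y)
        abs(prev - current) < tol && return epoch    # loss has stabilised - stop early
        prev = current
    end
    return max_epochs
end
```

Calling `train_until_converged!()` would then take the place of the fixed-length loops shown above.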
### Results
The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, `Wx + b` is nothing more than a line's equation, with `slope = W[1]` and `y-intercept = b[1]` (indexing at `1` as `W` and `b` are iterable).

Plotting the line and the data points using `Plots.jl` -
```jldoctest linear_regression_simple
julia> plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Simple Linear Regression", xlabel = "x", ylabel= "y");

julia> plot!((x) -> b[1] + W[1] * x, -3, 3, label="Custom model", lw=2);
```

![linear-regression-line](https://user-images.githubusercontent.com/74055102/177034985-d53adf40-5527-4a83-b9f6-7a62e5cc678f.png)

The line fits well! There is room for improvement, but we leave that up to you! You can play with the optimisers, the number of epochs, the learning rate, etc. to improve the fit and reduce the loss!

## Linear regression model on a real dataset
We now move on to a relatively complex linear regression model. Here we will use a real dataset from [`MLDatasets.jl`](https://github.com/JuliaML/MLDatasets.jl), which will not confine our data points to have only one feature. Let's start by importing the required packages -

```jldoctest linear_regression_complex
julia> using Flux

julia> using Statistics

julia> using MLDatasets: BostonHousing
```

### Data
Let's start by initializing our dataset. We will be using the [`BostonHousing`](https://juliaml.github.io/MLDatasets.jl/stable/datasets/misc/#MLDatasets.BostonHousing) dataset consisting of `506` data points. Each of these data points has `13` features and a corresponding label, the house's price.

```julia linear_regression_complex
julia> dataset = BostonHousing()
dataset BostonHousing:
  metadata   =>    Dict{String, Any} with 5 entries
  features   =>    506×13 DataFrame
  targets    =>    506×1 DataFrame
  dataframe  =>    506×14 DataFrame

julia> x, y = BostonHousing(as_df=false)[:]
```

We can now split the obtained data into training and testing data -

```julia linear_regression_complex
julia> x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end];

julia> x_train |> size, x_test |> size, y_train |> size, y_test |> size
((13, 400), (13, 106), (1, 400), (1, 106))
```

This data contains a diverse set of features, which means that the features have different scales. A wise option here would be to `normalise` the data, making the training process faster and more efficient. Let's check the standard deviation of the training data before normalising it.

```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> std(x_train)
134.06784844377117
```

The data is indeed not normalised. We can use the [`Flux.normalise`](@ref) function to normalise the training data.

```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+"
julia> x_train_n = Flux.normalise(x_train);

julia> std(x_train_n)
1.0000843694328236
```

The standard deviation is now close to one! The last step for this section would be to wrap the `x`s and `y`s together to create the training data.

```julia linear_regression_complex
julia> train_data = [(x_train_n, y_train)];
```

Our data is ready!

### Model
We can now directly use `Flux` and let it do all the work internally!
Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and `Flux` will handle everything for us! Remember, we could have declared a model in plain `Julia` as well. The model will have 14 parameters, 13 weights, and one bias. + +```julia linear_regression_complex +julia> model = Dense(13 => 1) +Dense(13 => 1) # 14 parameters +``` + +Same as before, our next step would be to define a loss function to quantify our accuracy somehow. The lower the loss, the better the model! + +```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +julia> function loss(x, y) + ŷ = model(x) + Flux.mse(ŷ, y) + end; + +julia> loss(x_train_n, y_train) +685.4700669900504 +``` + +We can now proceed to the training phase! + +### Training +Before training the model, let's initialize the optimiser and let `Flux` know that we want all the derivatives of all the parameters of our `model`. + +```julia linear_regression_complex +julia> opt = Descent(0.05); + +julia> params = Flux.params(model); +``` + +Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when `change in loss < δ`. The quantity `δ` can be altered according to a user's need, but let's fix it to `10⁻³` for this tutorial. + +We can write such custom training loops effortlessly using Flux and plain Julia! +```julia linear_regression_complex +julia> loss_init = Inf; + +julia> while true + Flux.train!(loss, params, data, opt) + if loss_init == Inf + loss_init = loss(x_train_n, y_train) + continue + end + + if abs(loss_init - loss(x_train_n, y_train)) < 1e-3 + break + else + loss_init = loss(x_train_n, y_train) + end + end; +``` + +The code starts by initializing an initial value for the loss, `infinity`. Next, it runs an infinite loop that breaks if `change in loss < 10⁻³`, or the code changes the value of `loss_init` to the current loss and moves on to the next iteration. + +This custom loop works! This shows how easily a user can write down any custom training routine using Flux and Julia! + +Let's have a look at the loss - + +```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +julia> loss(x_train_n, y_train) +27.127200028562164 +``` + +The loss went down significantly! It can be minimized further by choosing an even smaller `δ`. + +### Testing +The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss. + +```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +julia> x_test_n = Flux.normalise(x_test); + +julia> loss(x_test_n, y_test) +66.91014769713368 +``` + +The loss is not as small as the loss of the training data, but it looks good! This also shows that our model is not overfitting! + +--- + +Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical. + +Next, we trained the model by manually calling the gradient function and optimising the loss. We also saw how Flux provided various wrapper functionalities like the train! function to make the API simpler for users. 
+ +After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial. + +## Copy-pastable code +### Dummy dataset +```julia +using Flux +using Plots + + +# data +x = hcat(collect(Float32, -3:0.1:3)...) +f(x) = @. 3x + 2 +y = f(x) +x = x .* reshape(rand(Float32, 61), (1, 61)) + +# plot the data +plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Generated data", xlabel = "x", ylabel= "y") + +# custom model and parameters +model(x) = @. W*x + b +W = rand(Float32, 1, 1) +b = [0.0f0] + +# loss function +function loss(x, y) + ŷ = model(x) + sum((y .- ŷ).^2) / length(x) +end; + +print("Initial loss", loss(x, y), "\n") + +# optimiser, data, and parameters +opt = Descent(0.01); +data = [(x, y)]; +params = Flux.params(W, b) + +# train +for i = 1:72 + Flux.train!(loss, params, data, opt) +end + +print("Final loss", loss(x, y), "\n") + +# plot data and results +plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Simple Linear Regression", xlabel = "x", ylabel= "y") +plot!((x) -> b[1] + W[1] * x, -3, 3, label="Custom model", lw=2) +``` +### Real dataset +```julia +using Flux +using Statistics +using MLDatasets: BostonHousing + + +# data +x, y = BostonHousing(as_df=false)[:] +x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end] +x_train_n = Flux.normalise(x_train) +train_data = [(x_train_n, y_train)] + +# model +model = Dense(13 => 1) + +# loss function +function loss(x, y) + ŷ = model(x) + Flux.mse(ŷ, y) +end; + +print("Initial loss", loss(x_train_n, y_train), "\n") + +# optimiser and parameters +opt = Descent(0.05); +params = Flux.params(model); + +# train +loss_init = Inf; +while true + Flux.train!(loss, params, data, opt) + if loss_init == Inf + loss_init = loss(x_train_n, y_train) + continue + end + + if abs(loss_init - loss(x_train_n, y_train)) < 1e-3 + break + else + loss_init = loss(x_train_n, y_train) + end +end + +print("Final loss", loss(x_train_n, y_train), "\n") + +# testing +x_test_n = Flux.normalise(x_test); +print("Test loss", loss(x_test_n, y_test), "\n") +``` \ No newline at end of file diff --git a/docs/src/models/overview.md b/docs/src/getting_started/overview.md similarity index 100% rename from docs/src/models/overview.md rename to docs/src/getting_started/overview.md diff --git a/docs/src/gpu.md b/docs/src/gpu.md index 207e1b9faa..27956baa57 100644 --- a/docs/src/gpu.md +++ b/docs/src/gpu.md @@ -17,7 +17,7 @@ true Support for array operations on other hardware backends, like GPUs, is provided by external packages like [CUDA](https://github.com/JuliaGPU/CUDA.jl). Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it. -For example, we can use `CUDA.CuArray` (with the `cu` converter) to run our [basic example](models/basics.md) on an NVIDIA GPU. +For example, we can use `CUDA.CuArray` (with the `cu` converter) to run our [basic example](getting_started/basics.md) on an NVIDIA GPU. (Note that you need to have CUDA available to use CUDA.CuArray – please see the [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) instructions for more details.) 
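A minimal sketch of what this can look like, assuming an NVIDIA GPU and a working CUDA.jl installation (the array sizes here are arbitrary) -

```julia
using CUDA    # assumes CUDA.jl is installed and a compatible GPU is available

W = cu(rand(2, 5))               # the weights now live on the GPU as a CuArray
b = cu(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = cu(rand(5)), cu(rand(2))  # dummy data, also moved to the GPU
loss(x, y)                       # the whole computation now runs on the GPU
```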
diff --git a/docs/src/models/advanced.md b/docs/src/models/advanced.md index dcb4edfa25..047053946f 100644 --- a/docs/src/models/advanced.md +++ b/docs/src/models/advanced.md @@ -34,7 +34,7 @@ For an intro to Flux and automatic differentiation, see this [tutorial](https:// ## Customising Parameter Collection for a Model -Taking reference from our example `Affine` layer from the [basics](basics.md#Building-Layers-1). +Taking reference from our example `Affine` layer from the [basics](../getting_started/basics.md#Building-Layers-1). By default all the fields in the `Affine` type are collected as its parameters, however, in some cases it may be desired to hold other metadata in our "layers" that may not be needed for training, and are hence supposed to be ignored while the parameters are collected. With Flux, it is possible to mark the fields of our layers that are trainable in two ways. diff --git a/docs/src/training/optimisers.md b/docs/src/training/optimisers.md index afbcac0c4d..9d619f8d10 100644 --- a/docs/src/training/optimisers.md +++ b/docs/src/training/optimisers.md @@ -4,7 +4,7 @@ CurrentModule = Flux # Optimisers -Consider a [simple linear regression](../models/basics.md). We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters `W` and `b`. +Consider a [simple linear regression](../getting_started/linear_regression.md). We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters `W` and `b`. ```julia using Flux diff --git a/docs/src/training/training.md b/docs/src/training/training.md index 03f580c069..70fa39a510 100644 --- a/docs/src/training/training.md +++ b/docs/src/training/training.md @@ -40,8 +40,8 @@ more information can be found on [Custom Training Loops](../models/advanced.md). ## Loss Functions -The objective function must return a number representing how far the model is from its target – the *loss* of the model. The `loss` function that we defined in [basics](../models/basics.md) will work as an objective. -In addition to custom losses, a model can be trained in conjunction with +The objective function must return a number representing how far the model is from its target – the *loss* of the model. The `loss` function that we defined in [basics](../getting_started/basics.md) will work as an objective. +In addition to custom losses, model can be trained in conjuction with the commonly used losses that are grouped under the `Flux.Losses` module. We can also define an objective in terms of some model: @@ -64,11 +64,11 @@ At first glance, it may seem strange that the model that we want to train is not ## Model parameters -The model to be trained must have a set of tracked parameters that are used to calculate the gradients of the objective function. In the [basics](../models/basics.md) section it is explained how to create models with such parameters. The second argument of the function `Flux.train!` must be an object containing those parameters, which can be obtained from a model `m` as `Flux.params(m)`. +The model to be trained must have a set of tracked parameters that are used to calculate the gradients of the objective function. In the [basics](../getting_started/basics.md) section it is explained how to create models with such parameters. The second argument of the function `Flux.train!` must be an object containing those parameters, which can be obtained from a model `m` as `Flux.params(m)`. 
Such an object contains a reference to the model's parameters, not a copy, such that after their training, the model behaves according to their updated values. -Handling all the parameters on a layer-by-layer basis is explained in the [Layer Helpers](../models/basics.md) section. For freezing model parameters, see the [Advanced Usage Guide](../models/advanced.md). +Handling all the parameters on a layer by layer basis is explained in the [Layer Helpers](../getting_started/basics.md) section. Also, for freezing model parameters, see the [Advanced Usage Guide](../models/advanced.md). ```@docs Flux.params diff --git a/xy.jld2 b/xy.jld2 new file mode 100644 index 0000000000000000000000000000000000000000..0a0c5351231c645de8c9f94d84ae2d7830a711ee GIT binary patch literal 1337 zcmeIwUr19?7y$5d=bxseNz)XGy!J;eb!@v!U9@}dY_3_OM2ON#n&g_xG~CMhkY(Y6 zC}>2{h7TjcEQJJL%C>thl__y5B@3mOfj;;U^v941_UF@WPrdZeOV9`Rp8Ng2@4!7D z$80q1F08dW>rS%=E;Q6zS)dZNtawa)yK z%;g<=L?`rTJKOl1PZs!K+jf4@EAdkurP!UN#~nr`F03)&P#=efXUp-2@gDqql;TSp zTzqNeICkvXi#MI0!awHB{Ais_cv};|(gLRT3+bYuq4w>b4D_U8(!*xvcS zGd+ZYlHK-RqLfNn?wREAJeCj|md@Dv!TP{%fOUdhfxJPw?Wux(6F3LPf~}V{+#-2R zDY4u73G@*a$B{`&MJNU)sD+0?VM;CVbYqSi!CucS1`{|=A=qp=g4d%HUQvsH&?!O7 zIl-?%-VL=|U_Pkd2OkkA^@F_tn*^H$n*;j@wgiR@l*Ssoo&wRUzf*qi Jtd93J`~prsBMbll literal 0 HcmV?d00001 From 2f74f37c9c57d992fb605a0c02cbcf645718bdee Mon Sep 17 00:00:00 2001 From: Saransh Date: Wed, 6 Jul 2022 15:44:17 +0530 Subject: [PATCH 02/23] Minor improvements --- docs/src/getting_started/linear_regression.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 139738552f..b9180553e9 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -3,7 +3,7 @@ The following page contains a step-by-step walkthrough of the linear regression algorithm in `Julia` using `Flux`! We will start by creating a simple linear regression model for dummy data and then move on to a real dataset. The first part would involve writing some parts of the model on our own, which will later be replaced by `Flux`. ## A simple linear regression model -Let us start by building a simple linear regression model. This model would be trained on the data points of the form `(x₁, y₁), (x₂, y₂), ... , (xₙ, yₙ)`. In the real world, these `x`s denote a feature, and the `y`s denote a label; hence, our data would have `n` data points, each point mapping a single feature to a single label. +Let us start by building a simple linear regression model. This model would be trained on the data points of the form `(x₁, y₁), (x₂, y₂), ... , (xₙ, yₙ)`. In the real world, these `x`s can have multiple features, and the `y`s denote a label. In our example, each `x` has a single feature; hence, our data would have `n` data points, each point mapping a single feature to a single label. Importing the required `Julia` packages - @@ -13,7 +13,7 @@ julia> using Flux julia> using Plots ``` ### Generating a dataset -The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, each `x` is a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. 
+The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, here each `x` is equivalent to a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. ```jldoctest linear_regression_simple julia> x = hcat(collect(Float32, -3:0.1:3)...); @@ -120,14 +120,14 @@ julia> flux_model = Dense(1 => 1) Dense(1 => 1) # 2 parameters ``` -A [`Dense(1 => 1)`](@ref Dense) layer denotes a layer of one neuron with one output and one input. This layer is exactly same as the mathematical model defined by us above! Under the hood, `Flux` too calculates the output using the same expression! But, we don't have to initialize the parameters ourselves this time, instead `Flux` does it for us. +A [`Dense(1 => 1)`](@ref Dense) layer denotes a layer of one neuron with one input (one feature) and one output. This layer is exactly same as the mathematical model defined by us above! Under the hood, `Flux` too calculates the output using the same expression! But, we don't have to initialize the parameters ourselves this time, instead `Flux` does it for us. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> flux_model.weight, flux_model.bias (Float32[1.0764818], Float32[0.0]) ``` -Now we can check if our model is acting right - +Now we can check if our model is acting right. We can pass the complete data in one go, with each `x` having exactly one feature (one input) - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> flux_model(x) |> size @@ -268,7 +268,7 @@ julia> using MLDatasets: BostonHousing ``` ### Data -Let's start by initializing our dataset. We will be using the [`BostonHousing`](https://juliaml.github.io/MLDatasets.jl/stable/datasets/misc/#MLDatasets.BostonHousing) dataset consisting of `506` data points. Each of these data points has `13` features and a corresponding label, the house's price. +Let's start by initializing our dataset. We will be using the [`BostonHousing`](https://juliaml.github.io/MLDatasets.jl/stable/datasets/misc/#MLDatasets.BostonHousing) dataset consisting of `506` data points. Each of these data points has `13` features and a corresponding label, the house's price. The `x`s are still mapped to a single `y`, but now, a single `x` data point has 13 features. 
```julia linear_regression_complex julia> dataset = BostonHousing() From d3526e994983585e1a9996c1aa391a46c7c92aea Mon Sep 17 00:00:00 2001 From: Saransh Date: Wed, 6 Jul 2022 16:27:38 +0530 Subject: [PATCH 03/23] Enable doctests --- docs/Project.toml | 1 + docs/make.jl | 4 +-- docs/src/getting_started/linear_regression.md | 29 ++++++++++--------- 3 files changed, 18 insertions(+), 16 deletions(-) diff --git a/docs/Project.toml b/docs/Project.toml index 222e63405b..c1812ee385 100644 --- a/docs/Project.toml +++ b/docs/Project.toml @@ -1,6 +1,7 @@ [deps] BSON = "fbb218c0-5317-5bc6-957e-2ee96dd4b1f0" ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" +DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" Functors = "d9f16b24-f501-4c13-a1f2-28368ffc5196" MLDatasets = "eb30cadb-4394-5ae3-aed4-317e484a6458" diff --git a/docs/make.jl b/docs/make.jl index 31950e30f2..678a5827c3 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -1,10 +1,10 @@ -using Documenter, Flux, NNlib, Functors, MLUtils, BSON, Optimisers, OneHotArrays, Zygote, ChainRulesCore, Plots, MLDatasets, Statistics +using Documenter, Flux, NNlib, Functors, MLUtils, BSON, Optimisers, OneHotArrays, Zygote, ChainRulesCore, Plots, MLDatasets, Statistics, DataFrames DocMeta.setdocmeta!(Flux, :DocTestSetup, :(using Flux); recursive = true) makedocs( - modules = [Flux, NNlib, Functors, MLUtils, BSON, Optimisers, OneHotArrays, Zygote, ChainRulesCore, Base, Plots, MLDatasets, Statistics], + modules = [Flux, NNlib, Functors, MLUtils, BSON, Optimisers, OneHotArrays, Zygote, ChainRulesCore, Base, Plots, MLDatasets, Statistics, DataFrames], doctest = false, sitename = "Flux", # strict = [:cross_references,], diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index b9180553e9..0f3fc08fad 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -270,7 +270,9 @@ julia> using MLDatasets: BostonHousing ### Data Let's start by initializing our dataset. We will be using the [`BostonHousing`](https://juliaml.github.io/MLDatasets.jl/stable/datasets/misc/#MLDatasets.BostonHousing) dataset consisting of `506` data points. Each of these data points has `13` features and a corresponding label, the house's price. The `x`s are still mapped to a single `y`, but now, a single `x` data point has 13 features. -```julia linear_regression_complex +```jldoctest linear_regression_complex +julia> using DataFrames + julia> dataset = BostonHousing() dataset BostonHousing: metadata => Dict{String, Any} with 5 entries @@ -278,12 +280,12 @@ dataset BostonHousing: targets => 506×1 DataFrame dataframe => 506×14 DataFrame -julia> x, y = BostonHousing(as_df=false)[:] +julia> x, y = BostonHousing(as_df=false)[:]; ``` We can now split the obtained data into training and testing data - -```julia linear_regression_complex +```jldoctest linear_regression_complex julia> x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end]; julia> x_train |> size, x_test |> size, y_train |> size, y_test |> size @@ -292,14 +294,14 @@ julia> x_train |> size, x_test |> size, y_train |> size, y_test |> size This data contains a diverse number of features, which means that the features have different scales. A wise option here would be to `normalise` the data, making the training process more efficient and fast. Let's check the standard deviation of the training data before normalising it. 
-```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> std(x_train) 134.06784844377117 ``` The data is indeed not normalised. We can use the [`Flux.normalise`](@ref) function to normalise the training data. -```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> x_train_n = Flux.normalise(x_train); julia> std(x_train_n) @@ -308,7 +310,7 @@ julia> std(x_train_n) The standard deviation is now close to one! The last step for this section would be to wrap the `x`s and `y`s together to create the training data. -```julia linear_regression_complex +```jldoctest linear_regression_complex julia> train_data = [(x_train_n, y_train)]; ``` @@ -317,14 +319,14 @@ Our data is ready! ### Model We can now directly use `Flux` and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and `Flux` will handle everything for us! Remember, we could have declared a model in plain `Julia` as well. The model will have 14 parameters, 13 weights, and one bias. -```julia linear_regression_complex +```jldoctest linear_regression_complex julia> model = Dense(13 => 1) Dense(13 => 1) # 14 parameters ``` Same as before, our next step would be to define a loss function to quantify our accuracy somehow. The lower the loss, the better the model! -```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> function loss(x, y) ŷ = model(x) Flux.mse(ŷ, y) @@ -339,7 +341,7 @@ We can now proceed to the training phase! ### Training Before training the model, let's initialize the optimiser and let `Flux` know that we want all the derivatives of all the parameters of our `model`. -```julia linear_regression_complex +```jldoctest linear_regression_complex julia> opt = Descent(0.05); julia> params = Flux.params(model); @@ -348,16 +350,15 @@ julia> params = Flux.params(model); Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when `change in loss < δ`. The quantity `δ` can be altered according to a user's need, but let's fix it to `10⁻³` for this tutorial. We can write such custom training loops effortlessly using Flux and plain Julia! -```julia linear_regression_complex +```jldoctest linear_regression_complex julia> loss_init = Inf; julia> while true - Flux.train!(loss, params, data, opt) + Flux.train!(loss, params, train_data, opt) if loss_init == Inf loss_init = loss(x_train_n, y_train) continue end - if abs(loss_init - loss(x_train_n, y_train)) < 1e-3 break else @@ -372,7 +373,7 @@ This custom loop works! This shows how easily a user can write down any custom t Let's have a look at the loss - -```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> loss(x_train_n, y_train) 27.127200028562164 ``` @@ -382,7 +383,7 @@ The loss went down significantly! It can be minimized further by choosing an eve ### Testing The last step of this tutorial would be to test our model using the testing data. 
We will first normalise the testing data and then calculate the corresponding loss. -```julia linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> x_test_n = Flux.normalise(x_test); julia> loss(x_test_n, y_test) From a1e49ada4edf98f7ebd4b8ede6fe2b3d2797466d Mon Sep 17 00:00:00 2001 From: Saransh Date: Fri, 15 Jul 2022 02:09:42 +0530 Subject: [PATCH 04/23] Update code blocks to get rid of `Flux.params` --- docs/src/getting_started/linear_regression.md | 180 +++++++++--------- 1 file changed, 89 insertions(+), 91 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 0f3fc08fad..9349bd876f 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -65,13 +65,13 @@ We can now proceed ahead and build a model for our dataset! A linear regression model is mathematically defined as - ```math -model(x) = Wx + b +model(W, b, x) = Wx + b ``` where `W` is the weight matrix and `b` is the bias. For our case, the weight matrix (`W`) would constitute only a single element, as we have only a single feature. We can define our model in `Julia` using the exact same notation! ```jldoctest linear_regression_simple -julia> model(x) = @. W*x + b +julia> model(W, b, x) = @. W*x + b model (generic function with 1 method) ``` @@ -82,7 +82,7 @@ The next step would be to initialize the model parameters, which are the weight ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> W = rand(Float32, 1, 1) 1×1 Matrix{Float32}: - 0.33832288 + 0.99285793 julia> b = [0.0f0] 1-element Vector{Float32}: @@ -92,23 +92,23 @@ julia> b = [0.0f0] Time to test if our model works! ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> model(x) |> size +julia> model(W, b, x) |> size (1, 61) -julia> model(x)[1], y[1] -(-0.5491928f0, -7.0f0) +julia> model(W, b, x)[1], y[1] +(-1.6116865f0, -7.0f0) ``` It does! But the predictions are way off. We need to train the model to improve the predictions, but before training the model we need to define the loss function. The loss function would ideally output a quantity that we will try to minimize during the entire training process. Here we will use the mean sum squared error loss function. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> function loss(x, y) - ŷ = model(x) +julia> function loss(W, b, x, y) + ŷ = model(W, b, x) sum((y .- ŷ).^2) / length(x) end; -julia> loss(x, y) -28.880724f0 +julia> loss(W, b, x, y) +23.772217f0 ``` Calling the loss function on our `x`s and `y`s shows how far our predictions (`ŷ`) are from the real labels. More precisely, it calculates the sum of the squares of residuals and divides it by the total number of data points. @@ -124,7 +124,7 @@ A [`Dense(1 => 1)`](@ref Dense) layer denotes a layer of one neuron with one inp ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> flux_model.weight, flux_model.bias -(Float32[1.0764818], Float32[0.0]) +(Float32[1.1412252], Float32[0.0]) ``` Now we can check if our model is acting right. We can pass the complete data in one go, with each `x` having exactly one feature (one input) - @@ -134,7 +134,7 @@ julia> flux_model(x) |> size (1, 61) julia> flux_model(x)[1], y[1] -(-1.7474315f0, -7.0f0) +(-1.8525281f0, -7.0f0) ``` It is! 
The next step would be defining the loss function using `Flux`'s functions - @@ -146,55 +146,51 @@ julia> function flux_loss(x, y) end; julia> flux_loss(x, y) -23.189152f0 +22.74856f0 ``` Everything works as before! It almost feels like `Flux` provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the `flux_model` is from our custom `model`. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our custom `model` to match that of the `flux_model` - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> W = Float32[1.0764818] +julia> W = Float32[1.1412252] 1-element Vector{Float32}: - 1.0764818 + 1.1412252 ``` To check how both the models are performing on the data, let's find out the losses using the `loss` and `flux_loss` functions - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> loss(x, y), flux_loss(x, y) -(23.189152f0, 23.189152f0) +julia> loss(W, b, x, y), flux_loss(x, y) +(22.74856f0, 22.74856f0) ``` The losses are identical! This means that our `model` and the `flux_model` are identical on some level, and the loss functions are completely identical! The difference in models would be that `Flux`'s [`Dense`](@ref) layer supports many other arguments that can be used to customize the layer further. But, for this tutorial, let us stick to our simple custom `model`. ### Training the model -Before we begin the training procedure with `Flux`, let's initialize an optimiser, finalize our data, and pass our parameters through [`Flux.params`](@ref) to specify that we want all derivatives of `W` and `b`. We will be using the classic [`Gradient Descent`](@ref Descent) algorithm. `Flux` comes loaded with a lot of different optimisers; refer to [Optimisers](@ref) for more information on the same. +Before we begin the training procedure with `Flux`, let's initialize an optimiser and finalize our data. We will be using the classic [`Gradient Descent`](@ref Descent) algorithm. `Flux` comes loaded with a lot of different optimisers; refer to [Optimisers](@ref) for more information on the same. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> opt = Descent(0.01); +julia> dLdW, dLdb, _, _ = gradient(loss, W, b, x, y) +(Float32[-6.7322206], Float32[-4.132563], Float32[0.1926041 0.14162663 … -0.39782608 -0.29997927], Float32[-0.16876957 -0.12410051 … 0.3485956 0.2628572]) -julia> data = [(x, y)]; +julia> W .= W .- 0.1 .* dLdW +1-element Vector{Float32}: + 1.8144473 -julia> params = Flux.params(W, b) -Params([Float32[0.71305436], Float32[0.0]]) +julia> b .= b .- 0.1 .* dLdb +1-element Vector{Float32}: + 0.41325632 ``` Now, we can move to the actual training! The training consists of obtaining the gradient and updating the current parameters with the obtained derivatives using backpropagation. This is achieved using `Flux.gradient` (see see [Taking Gradients](@ref)) and [`Flux.Optimise.update!`](@ref) functions respectively. 
-```jldoctest linear_regression_simple -julia> gs = Flux.gradient(params) do - loss(x, y) - end; - -julia> Flux.Optimise.update!(opt, params, gs) -``` - We can now check the values of our parameters and the value of the loss function - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> params, loss(x, y) -(Params([Float32[1.145264], Float32[0.041250423]]), 22.5526f0) +julia> loss(W, b, x, y) +17.157953f0 ``` The parameters changed, and the loss went down! This means that we successfully trained our model for one epoch. We can plug the training code written above into a loop and train the model for a higher number of epochs. It can be customized either to have a fixed number of epochs or to stop when certain conditions are met, for example, `change in loss < 0.1`. This loop can be customized to suit a user's needs, and the conditions can be specified in plain `Julia`! @@ -202,21 +198,27 @@ The parameters changed, and the loss went down! This means that we successfully `Flux` also provides a convenience function to train a model. The [`Flux.train!`](@ref) function performs the same task described above and does not require calculating the gradient manually. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> Flux.train!(loss, params, data, opt) +julia> function train_model() + dLdW, dLdb, _, _ = gradient(loss, W, b, x, y) + @. W = W - 0.1 * dLdW + @. b = b - 0.1 * dLdb + end; + +julia> train_model(); -julia> params, loss(x, y) -(Params([Float32[1.2125431], Float32[0.08175573]]), 21.94231f0) +julia> W, b, loss(W, b, x, y) +(Float32[2.340657], Float32[0.7516814], 13.64972f0) ``` The parameters changed again, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for 60 epochs. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> for i = 1:60 - Flux.train!(loss, params, data, opt) +julia> for i = 1:30 + train_model() end -julia> params, loss(x, y) -(Params([Float32[3.426797], Float32[1.5412952]]), 8.848401f0) +julia> W, b, loss(W, b, x, y) +(Float32[4.2408285], Float32[2.243728], 7.668049f0) ``` The loss went down significantly! @@ -224,7 +226,7 @@ The loss went down significantly! `Flux` provides yet another convenience functionality, the [`Flux.@epochs`](@ref) macro, which can be used to train a model for a specific number of epochs. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> Flux.@epochs 10 Flux.train!(loss, params, data, opt) +julia> Flux.@epochs 10 train_model() [ Info: Epoch 1 [ Info: Epoch 2 [ Info: Epoch 3 @@ -236,11 +238,11 @@ julia> Flux.@epochs 10 Flux.train!(loss, params, data, opt) [ Info: Epoch 9 [ Info: Epoch 10 -julia> params, loss(x, y) -(Params([Float32[3.58633], Float32[1.6624337]]), 8.44982f0) +julia> W, b, loss(W, b, x, y) +(Float32[4.2422233], Float32[2.2460847], 7.6680417f0) ``` -We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 72 epochs, and loss went down from `23.189152` to `8.44982`. Time for some visualization! +We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 72 epochs, and loss went down from `22.74856` to `7.6680417f`. Time for some visualization! ### Results The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. 
The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, `Wx + b` is nothing more than a line's equation, with `slope = W[1]` and `y-intercept = b[1]` (indexing at `1` as `W` and `b` are iterable). @@ -252,7 +254,8 @@ julia> plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scat julia> plot!((x) -> b[1] + W[1] * x, -3, 3, label="Custom model", lw=2); ``` -![linear-regression-line](https://user-images.githubusercontent.com/74055102/177034985-d53adf40-5527-4a83-b9f6-7a62e5cc678f.png) +![linear-regression-line](https://user-images.githubusercontent.com/74055102/179050736-366bedcc-6990-40ee-83be-e11d07492e05.png) + The line fits well! There is room for improvement, but we leave that up to you! You can play with the optimisers, the number of epochs, learning rate, etc. to improve the fitting and reduce the loss! @@ -308,13 +311,7 @@ julia> std(x_train_n) 1.0000843694328236 ``` -The standard deviation is now close to one! The last step for this section would be to wrap the `x`s and `y`s together to create the training data. - -```jldoctest linear_regression_complex -julia> train_data = [(x_train_n, y_train)]; -``` - -Our data is ready! +The standard deviation is now close to one! Our data is ready! ### Model We can now directly use `Flux` and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and `Flux` will handle everything for us! Remember, we could have declared a model in plain `Julia` as well. The model will have 14 parameters, 13 weights, and one bias. @@ -327,13 +324,13 @@ Dense(13 => 1) # 14 parameters Same as before, our next step would be to define a loss function to quantify our accuracy somehow. The lower the loss, the better the model! ```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> function loss(x, y) +julia> function loss(model, x, y) ŷ = model(x) Flux.mse(ŷ, y) end; -julia> loss(x_train_n, y_train) -685.4700669900504 +julia> loss(model, x_train_n, y_train) +676.165591625047 ``` We can now proceed to the training phase! @@ -342,9 +339,11 @@ We can now proceed to the training phase! Before training the model, let's initialize the optimiser and let `Flux` know that we want all the derivatives of all the parameters of our `model`. ```jldoctest linear_regression_complex -julia> opt = Descent(0.05); - -julia> params = Flux.params(model); +julia> function train_model() + dLdm, _, _ = gradient(loss, model, x, y) + @. model.weight = model.weight - 0.000001 * dLdm.weight + @. model.bias = model.bias - 0.000001 * dLdm.bias + end; ``` Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when `change in loss < δ`. The quantity `δ` can be altered according to a user's need, but let's fix it to `10⁻³` for this tutorial. @@ -354,15 +353,15 @@ We can write such custom training loops effortlessly using Flux and plain Julia! 
julia> loss_init = Inf; julia> while true - Flux.train!(loss, params, train_data, opt) + train_model() if loss_init == Inf - loss_init = loss(x_train_n, y_train) + loss_init = loss(model, x_train_n, y_train) continue end - if abs(loss_init - loss(x_train_n, y_train)) < 1e-3 + if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-3 break else - loss_init = loss(x_train_n, y_train) + loss_init = loss(model, x_train_n, y_train) end end; ``` @@ -374,7 +373,7 @@ This custom loop works! This shows how easily a user can write down any custom t Let's have a look at the loss - ```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> loss(x_train_n, y_train) +julia> loss(model, x_train_n, y_train) 27.127200028562164 ``` @@ -386,7 +385,7 @@ The last step of this tutorial would be to test our model using the testing data ```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> x_test_n = Flux.normalise(x_test); -julia> loss(x_test_n, y_test) +julia> loss(model, x_test_n, y_test) 66.91014769713368 ``` @@ -406,7 +405,6 @@ After getting familiar with the basics of Flux and Julia, we moved ahead to buil using Flux using Plots - # data x = hcat(collect(Float32, -3:0.1:3)...) f(x) = @. 3x + 2 @@ -417,29 +415,30 @@ x = x .* reshape(rand(Float32, 61), (1, 61)) plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Generated data", xlabel = "x", ylabel= "y") # custom model and parameters -model(x) = @. W*x + b +model(W, b, x) = @. W*x + b W = rand(Float32, 1, 1) b = [0.0f0] # loss function -function loss(x, y) +function loss(model, x, y) ŷ = model(x) sum((y .- ŷ).^2) / length(x) end; -print("Initial loss", loss(x, y), "\n") - -# optimiser, data, and parameters -opt = Descent(0.01); -data = [(x, y)]; -params = Flux.params(W, b) +print("Initial loss", loss(model, x, y), "\n") # train -for i = 1:72 - Flux.train!(loss, params, data, opt) +function train_model() + dLdW, dLdb, _, _ = gradient(loss, W, b, x, y) + @. W = W - 0.1 * dLdW + @. b = b - 0.1 * dLdb +end + +for i = 1:40 + train_model() end -print("Final loss", loss(x, y), "\n") +print("Final loss", loss(model, x, y), "\n") # plot data and results plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Simple Linear Regression", xlabel = "x", ylabel= "y") @@ -451,47 +450,46 @@ using Flux using Statistics using MLDatasets: BostonHousing - # data x, y = BostonHousing(as_df=false)[:] x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end] x_train_n = Flux.normalise(x_train) -train_data = [(x_train_n, y_train)] # model model = Dense(13 => 1) # loss function -function loss(x, y) +function loss(model, x, y) ŷ = model(x) Flux.mse(ŷ, y) end; -print("Initial loss", loss(x_train_n, y_train), "\n") - -# optimiser and parameters -opt = Descent(0.05); -params = Flux.params(model); +print("Initial loss", loss(model, x_train_n, y_train), "\n") # train +function train_model() + dLdm, _, _ = gradient(loss, model, x, y) + @. model.weight = model.weight - 0.000001 * dLdm.weight + @. 
model.bias = model.bias - 0.000001 * dLdm.bias +end + loss_init = Inf; while true - Flux.train!(loss, params, data, opt) + train_model() if loss_init == Inf - loss_init = loss(x_train_n, y_train) + loss_init = loss(model, x_train_n, y_train) continue end - - if abs(loss_init - loss(x_train_n, y_train)) < 1e-3 + if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-3 break else - loss_init = loss(x_train_n, y_train) + loss_init = loss(model, x_train_n, y_train) end end -print("Final loss", loss(x_train_n, y_train), "\n") +print("Final loss", loss(model, x_train_n, y_train), "\n") -# testing +# test x_test_n = Flux.normalise(x_test); -print("Test loss", loss(x_test_n, y_test), "\n") +print("Test loss", loss(model, x_test_n, y_test), "\n") ``` \ No newline at end of file From 2605f92c83f0c7f8d44d43883f4d0febb6edbc1b Mon Sep 17 00:00:00 2001 From: Saransh Date: Fri, 15 Jul 2022 02:52:44 +0530 Subject: [PATCH 05/23] Update the text to manually run gradient descent --- docs/src/getting_started/linear_regression.md | 45 ++++++++++++------- 1 file changed, 30 insertions(+), 15 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 9349bd876f..2d914bbce2 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -169,12 +169,29 @@ The losses are identical! This means that our `model` and the `flux_model` are i ### Training the model -Before we begin the training procedure with `Flux`, let's initialize an optimiser and finalize our data. We will be using the classic [`Gradient Descent`](@ref Descent) algorithm. `Flux` comes loaded with a lot of different optimisers; refer to [Optimisers](@ref) for more information on the same. +Let's train our model using the classic Gradient Descent algorithm. According to the gradient descent algorithm, the weights and biases are iteratively updated using the following mathematical equations - + +```math +\begin{aligned} +W &= W - \eta * \frac{dL}{dW} \\ +b &= b - \eta * \frac{dL}{db} +\end{aligned} +``` + +Here, `W` is the weight matrix, `b` is the bias vector, ``\eta`` is the learning rate, ``\frac{dL}{dW}`` is the derivative of the loss function with respect to the weight, and ``\frac{dL}{db}`` is the derivative of the loss function with respect to the bias. + +The derivatives are usually calculated using an Automatic Differentiation tool, and `Flux` uses `Zygote.jl` for the same. Since `Zygote.jl` is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of `Zygote.jl` for more information on the same. + +Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. `Flux` re-exports `Zygote`'s `gradient` function; hence, we don't need to import `Zygote` explicitly to use the functionality. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> dLdW, dLdb, _, _ = gradient(loss, W, b, x, y) (Float32[-6.7322206], Float32[-4.132563], Float32[0.1926041 0.14162663 … -0.39782608 -0.29997927], Float32[-0.16876957 -0.12410051 … 0.3485956 0.2628572]) +``` +We can now update the parameters, following the gradient descent algorithm - + +```jldoctest linear_regression; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> W .= W .- 0.1 .* dLdW 1-element Vector{Float32}: 1.8144473 @@ -184,18 +201,16 @@ julia> b .= b .- 0.1 .* dLdb 0.41325632 ``` -Now, we can move to the actual training! 
The training consists of obtaining the gradient and updating the current parameters with the obtained derivatives using backpropagation. This is achieved using `Flux.gradient` (see see [Taking Gradients](@ref)) and [`Flux.Optimise.update!`](@ref) functions respectively. - -We can now check the values of our parameters and the value of the loss function - +The parameters have been updated! We can now check the value of the loss function - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> loss(W, b, x, y) 17.157953f0 ``` -The parameters changed, and the loss went down! This means that we successfully trained our model for one epoch. We can plug the training code written above into a loop and train the model for a higher number of epochs. It can be customized either to have a fixed number of epochs or to stop when certain conditions are met, for example, `change in loss < 0.1`. This loop can be customized to suit a user's needs, and the conditions can be specified in plain `Julia`! +The loss went down! This means that we successfully trained our model for one epoch. We can plug the training code written above into a loop and train the model for a higher number of epochs. It can be customized either to have a fixed number of epochs or to stop when certain conditions are met, for example, `change in loss < 0.1`. The loop can be tailored to suit the user's needs, and the conditions can be specified in plain `Julia`! -`Flux` also provides a convenience function to train a model. The [`Flux.train!`](@ref) function performs the same task described above and does not require calculating the gradient manually. +Let's plug our super training logic inside a function and test it again - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> function train_model() @@ -210,7 +225,7 @@ julia> W, b, loss(W, b, x, y) (Float32[2.340657], Float32[0.7516814], 13.64972f0) ``` -The parameters changed again, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for 60 epochs. +It works, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for 30 epochs. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> for i = 1:30 @@ -221,7 +236,7 @@ julia> W, b, loss(W, b, x, y) (Float32[4.2408285], Float32[2.243728], 7.668049f0) ``` -The loss went down significantly! +There was a significant reduction in loss, and the parameters were updated! `Flux` provides yet another convenience functionality, the [`Flux.@epochs`](@ref) macro, which can be used to train a model for a specific number of epochs. @@ -242,7 +257,7 @@ julia> W, b, loss(W, b, x, y) (Float32[4.2422233], Float32[2.2460847], 7.6680417f0) ``` -We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 72 epochs, and loss went down from `22.74856` to `7.6680417f`. Time for some visualization! +We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 42 epochs, and loss went down from `22.74856` to `7.6680417f`. Time for some visualization! ### Results The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! 
Let's see what the fitted line looks like. Remember, `Wx + b` is nothing more than a line's equation, with `slope = W[1]` and `y-intercept = b[1]` (indexing at `1` as `W` and `b` are iterable). @@ -260,7 +275,7 @@ julia> plot!((x) -> b[1] + W[1] * x, -3, 3, label="Custom model", lw=2); The line fits well! There is room for improvement, but we leave that up to you! You can play with the optimisers, the number of epochs, learning rate, etc. to improve the fitting and reduce the loss! ## Linear regression model on a real dataset -We now move on to a relative;y complex linear regression model. Here we will use a real dataset from [`MLDatasets.jl`](https://github.com/JuliaML/MLDatasets.jl), which will not confine our data points to have only one feature. Let's start by importing the required packages - +We now move on to a relatively complex linear regression model. Here we will use a real dataset from [`MLDatasets.jl`](https://github.com/JuliaML/MLDatasets.jl), which will not confine our data points to have only one feature. Let's start by importing the required packages - ```jldoctest linear_regression_complex julia> using Flux @@ -336,7 +351,7 @@ julia> loss(model, x_train_n, y_train) We can now proceed to the training phase! ### Training -Before training the model, let's initialize the optimiser and let `Flux` know that we want all the derivatives of all the parameters of our `model`. +The training procedure would make use of the same mathematics, but now we can pass in the model inside the `gradient` call and let `Flux` and `Zygote` handle the derivatives! ```jldoctest linear_regression_complex julia> function train_model() @@ -348,7 +363,7 @@ julia> function train_model() Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when `change in loss < δ`. The quantity `δ` can be altered according to a user's need, but let's fix it to `10⁻³` for this tutorial. -We can write such custom training loops effortlessly using Flux and plain Julia! +We can write such custom training loops effortlessly using `Flux` and plain `Julia`! ```jldoctest linear_regression_complex julia> loss_init = Inf; @@ -393,11 +408,11 @@ The loss is not as small as the loss of the training data, but it looks good! Th --- -Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical. +Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without `Flux`, and how they were almost identical. -Next, we trained the model by manually calling the gradient function and optimising the loss. We also saw how Flux provided various wrapper functionalities like the train! function to make the API simpler for users. +Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how `Flux` provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users. -After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. 
In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial. +After getting familiar with the basics of `Flux` and `Julia`, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing `Flux`'s full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial. ## Copy-pastable code ### Dummy dataset From bca37be66c8b126763e80efb577ea6b56a2321c6 Mon Sep 17 00:00:00 2001 From: Saransh Date: Fri, 15 Jul 2022 13:34:32 +0530 Subject: [PATCH 06/23] Fix doctests --- docs/src/getting_started/linear_regression.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 2d914bbce2..18299921ed 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -191,7 +191,7 @@ julia> dLdW, dLdb, _, _ = gradient(loss, W, b, x, y) We can now update the parameters, following the gradient descent algorithm - -```jldoctest linear_regression; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> W .= W .- 0.1 .* dLdW 1-element Vector{Float32}: 1.8144473 From 767014506cdc513af36104909635c7c4b7f344a1 Mon Sep 17 00:00:00 2001 From: Saransh Date: Sat, 16 Jul 2022 15:22:28 +0530 Subject: [PATCH 07/23] Minor language fixes --- docs/src/getting_started/linear_regression.md | 22 +++++++++---------- 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 18299921ed..af2ade8c3c 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -13,7 +13,7 @@ julia> using Flux julia> using Plots ``` ### Generating a dataset -The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, here each `x` is equivalent to a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. +The data usually comes from the real world, which we will be exploring in the last part of this guide, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, here each `x` is equivalent to a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. ```jldoctest linear_regression_simple julia> x = hcat(collect(Float32, -3:0.1:3)...); @@ -62,7 +62,7 @@ We can now proceed ahead and build a model for our dataset! ### Building a model -A linear regression model is mathematically defined as - +A linear regression model is defined mathematically as - ```math model(W, b, x) = Wx + b @@ -165,11 +165,9 @@ julia> loss(W, b, x, y), flux_loss(x, y) (22.74856f0, 22.74856f0) ``` -The losses are identical! 
This means that our `model` and the `flux_model` are identical on some level, and the loss functions are completely identical! The difference in models would be that `Flux`'s [`Dense`](@ref) layer supports many other arguments that can be used to customize the layer further. But, for this tutorial, let us stick to our simple custom `model`. +The losses are identical! This means that our `model` and the `flux_model` are identical on some level, and the loss functions are completely identical! The difference in models would be that `Flux`'s [`Dense`](@ref) layer supports many other arguments that can be used to customize the layer further. But, for this guide, let us stick to our simple custom `model`. -### Training the model - -Let's train our model using the classic Gradient Descent algorithm. According to the gradient descent algorithm, the weights and biases are iteratively updated using the following mathematical equations - +Let's train our model using the classic Gradient Descent algorithm. According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations - ```math \begin{aligned} @@ -180,7 +178,7 @@ b &= b - \eta * \frac{dL}{db} Here, `W` is the weight matrix, `b` is the bias vector, ``\eta`` is the learning rate, ``\frac{dL}{dW}`` is the derivative of the loss function with respect to the weight, and ``\frac{dL}{db}`` is the derivative of the loss function with respect to the bias. -The derivatives are usually calculated using an Automatic Differentiation tool, and `Flux` uses `Zygote.jl` for the same. Since `Zygote.jl` is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of `Zygote.jl` for more information on the same. +The derivatives are calculated using an Automatic Differentiation tool, and `Flux` uses [`Zygote.jl`](https://github.com/FluxML/Zygote.jl) for the same. Since `Zygote.jl` is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of `Zygote.jl` for more information on the same. Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. `Flux` re-exports `Zygote`'s `gradient` function; hence, we don't need to import `Zygote` explicitly to use the functionality. @@ -260,7 +258,7 @@ julia> W, b, loss(W, b, x, y) We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 42 epochs, and loss went down from `22.74856` to `7.6680417f`. Time for some visualization! ### Results -The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, `Wx + b` is nothing more than a line's equation, with `slope = W[1]` and `y-intercept = b[1]` (indexing at `1` as `W` and `b` are iterable). +The main objective of this guide was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, `Wx + b` is nothing more than a line's equation, with `slope = W[1]` and `y-intercept = b[1]` (indexing at `1` as `W` and `b` are iterable). 
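As a quick aside, here is a minimal sketch (not part of the diff above, and assuming the `W` and `b` trained in the previous steps are still in scope) that reads the fitted slope and intercept out of their containers and compares a single prediction against the generating rule `f(x) = 3x + 2`:

```julia
# Read the scalar slope and intercept out of their 1-element containers.
slope, intercept = W[1], b[1]

# Predict at an arbitrary input and compare it with the rule that produced
# the labels; a reasonable fit puts the two values in the same ballpark.
x_new = 1.5f0
ŷ = slope * x_new + intercept
y_true = 3 * x_new + 2
println("prediction = ", ŷ, ", true value = ", y_true)
```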
Plotting the line and the data points using `Plot.jl` - ```jldoctest linear_regression_simple @@ -361,7 +359,7 @@ julia> function train_model() end; ``` -Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when `change in loss < δ`. The quantity `δ` can be altered according to a user's need, but let's fix it to `10⁻³` for this tutorial. +Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when `change in loss < δ`. The quantity `δ` can be altered according to a user's need, but let's fix it to `10⁻³` for this guide. We can write such custom training loops effortlessly using `Flux` and plain `Julia`! ```jldoctest linear_regression_complex @@ -395,7 +393,7 @@ julia> loss(model, x_train_n, y_train) The loss went down significantly! It can be minimized further by choosing an even smaller `δ`. ### Testing -The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss. +The last step of this guide would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss. ```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> x_test_n = Flux.normalise(x_test); @@ -408,7 +406,7 @@ The loss is not as small as the loss of the training data, but it looks good! Th --- -Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without `Flux`, and how they were almost identical. +Summarising this guide, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without `Flux`, and how they were almost identical. Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how `Flux` provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users. @@ -507,4 +505,4 @@ print("Final loss", loss(model, x_train_n, y_train), "\n") # test x_test_n = Flux.normalise(x_test); print("Test loss", loss(model, x_test_n, y_test), "\n") -``` \ No newline at end of file +``` From 288f4ad21b5752ed52f8f31c4d9c043647e529ba Mon Sep 17 00:00:00 2001 From: Saransh Date: Sat, 16 Jul 2022 16:06:41 +0530 Subject: [PATCH 08/23] Better variable names and cleaner print statements --- docs/src/getting_started/linear_regression.md | 69 +++++++++---------- 1 file changed, 34 insertions(+), 35 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index af2ade8c3c..60db547c57 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -71,8 +71,8 @@ model(W, b, x) = Wx + b where `W` is the weight matrix and `b` is the bias. For our case, the weight matrix (`W`) would constitute only a single element, as we have only a single feature. We can define our model in `Julia` using the exact same notation! ```jldoctest linear_regression_simple -julia> model(W, b, x) = @. W*x + b -model (generic function with 1 method) +julia> custom_model(W, b, x) = @. 
W*x + b +custom_model (generic function with 1 method) ``` The `@.` macro allows you to perform the calculations by broadcasting the scalar quantities (for example - the bias). @@ -92,22 +92,22 @@ julia> b = [0.0f0] Time to test if our model works! ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> model(W, b, x) |> size +julia> custom_model(W, b, x) |> size (1, 61) -julia> model(W, b, x)[1], y[1] +julia> custom_model(W, b, x)[1], y[1] (-1.6116865f0, -7.0f0) ``` It does! But the predictions are way off. We need to train the model to improve the predictions, but before training the model we need to define the loss function. The loss function would ideally output a quantity that we will try to minimize during the entire training process. Here we will use the mean sum squared error loss function. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> function loss(W, b, x, y) - ŷ = model(W, b, x) +julia> function custom_loss(W, b, x, y) + ŷ = custom_model(W, b, x) sum((y .- ŷ).^2) / length(x) end; -julia> loss(W, b, x, y) +julia> custom_loss(W, b, x, y) 23.772217f0 ``` @@ -140,12 +140,12 @@ julia> flux_model(x)[1], y[1] It is! The next step would be defining the loss function using `Flux`'s functions - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> function flux_loss(x, y) +julia> function flux_loss(flux_model, x, y) ŷ = flux_model(x) Flux.mse(ŷ, y) end; -julia> flux_loss(x, y) +julia> flux_loss(flux_model, x, y) 22.74856f0 ``` @@ -161,7 +161,7 @@ julia> W = Float32[1.1412252] To check how both the models are performing on the data, let's find out the losses using the `loss` and `flux_loss` functions - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> loss(W, b, x, y), flux_loss(x, y) +julia> custom_loss(W, b, x, y), flux_loss(flux_model, x, y) (22.74856f0, 22.74856f0) ``` @@ -182,9 +182,8 @@ The derivatives are calculated using an Automatic Differentiation tool, and `Flu Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. `Flux` re-exports `Zygote`'s `gradient` function; hence, we don't need to import `Zygote` explicitly to use the functionality. -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> dLdW, dLdb, _, _ = gradient(loss, W, b, x, y) -(Float32[-6.7322206], Float32[-4.132563], Float32[0.1926041 0.14162663 … -0.39782608 -0.29997927], Float32[-0.16876957 -0.12410051 … 0.3485956 0.2628572]) +```jldoctest linear_regression_simple +julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y); ``` We can now update the parameters, following the gradient descent algorithm - @@ -202,7 +201,7 @@ julia> b .= b .- 0.1 .* dLdb The parameters have been updated! We can now check the value of the loss function - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> loss(W, b, x, y) +julia> custom_loss(W, b, x, y) 17.157953f0 ``` @@ -211,15 +210,15 @@ The loss went down! This means that we successfully trained our model for one ep Let's plug our super training logic inside a function and test it again - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> function train_model() - dLdW, dLdb, _, _ = gradient(loss, W, b, x, y) +julia> function train_custom_model() + dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y) @. W = W - 0.1 * dLdW @. 
b = b - 0.1 * dLdb end; -julia> train_model(); +julia> train_custom_model(); -julia> W, b, loss(W, b, x, y) +julia> W, b, custom_loss(W, b, x, y) (Float32[2.340657], Float32[0.7516814], 13.64972f0) ``` @@ -227,10 +226,10 @@ It works, and the loss went down again! This was the second epoch of our trainin ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" julia> for i = 1:30 - train_model() + train_custom_model() end -julia> W, b, loss(W, b, x, y) +julia> W, b, custom_loss(W, b, x, y) (Float32[4.2408285], Float32[2.243728], 7.668049f0) ``` @@ -239,7 +238,7 @@ There was a significant reduction in loss, and the parameters were updated! `Flux` provides yet another convenience functionality, the [`Flux.@epochs`](@ref) macro, which can be used to train a model for a specific number of epochs. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> Flux.@epochs 10 train_model() +julia> Flux.@epochs 10 train_custom_model() [ Info: Epoch 1 [ Info: Epoch 2 [ Info: Epoch 3 @@ -251,7 +250,7 @@ julia> Flux.@epochs 10 train_model() [ Info: Epoch 9 [ Info: Epoch 10 -julia> W, b, loss(W, b, x, y) +julia> W, b, custom_loss(W, b, x, y) (Float32[4.2422233], Float32[2.2460847], 7.6680417f0) ``` @@ -428,30 +427,30 @@ x = x .* reshape(rand(Float32, 61), (1, 61)) plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Generated data", xlabel = "x", ylabel= "y") # custom model and parameters -model(W, b, x) = @. W*x + b +custom_model(W, b, x) = @. W*x + b W = rand(Float32, 1, 1) b = [0.0f0] # loss function -function loss(model, x, y) - ŷ = model(x) +function custom_loss(W, b, x, y) + ŷ = custom_model(W, b, x) sum((y .- ŷ).^2) / length(x) end; -print("Initial loss", loss(model, x, y), "\n") +print("Initial loss: ", custom_loss(W, b, x, y), "\n") # train -function train_model() - dLdW, dLdb, _, _ = gradient(loss, W, b, x, y) +function train_custom_model() + dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y) @. W = W - 0.1 * dLdW @. b = b - 0.1 * dLdb end for i = 1:40 - train_model() + train_custom_model() end -print("Final loss", loss(model, x, y), "\n") +print("Final loss: ", custom_loss(W, b, x, y), "\n") # plot data and results plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Simple Linear Regression", xlabel = "x", ylabel= "y") @@ -477,10 +476,10 @@ function loss(model, x, y) Flux.mse(ŷ, y) end; -print("Initial loss", loss(model, x_train_n, y_train), "\n") +print("Initial loss: ", loss(model, x_train_n, y_train), "\n") # train -function train_model() +function train_custom_model() dLdm, _, _ = gradient(loss, model, x, y) @. model.weight = model.weight - 0.000001 * dLdm.weight @. 
model.bias = model.bias - 0.000001 * dLdm.bias @@ -488,7 +487,7 @@ end loss_init = Inf; while true - train_model() + train_custom_model() if loss_init == Inf loss_init = loss(model, x_train_n, y_train) continue @@ -500,9 +499,9 @@ while true end end -print("Final loss", loss(model, x_train_n, y_train), "\n") +print("Final loss: ", loss(model, x_train_n, y_train), "\n") # test x_test_n = Flux.normalise(x_test); -print("Test loss", loss(model, x_test_n, y_test), "\n") +print("Test loss: ", loss(model, x_test_n, y_test), "\n") ``` From 8cab77b4c1694c0ac677852a101e39df6db12937 Mon Sep 17 00:00:00 2001 From: Saransh Date: Thu, 28 Jul 2022 19:16:11 +0530 Subject: [PATCH 09/23] `@epcohs` is deprecated --- docs/src/getting_started/linear_regression.md | 23 ++----------------- 1 file changed, 2 insertions(+), 21 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 60db547c57..6cb412a816 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -225,35 +225,16 @@ julia> W, b, custom_loss(W, b, x, y) It works, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for 30 epochs. ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> for i = 1:30 +julia> for i = 1:40 train_custom_model() end julia> W, b, custom_loss(W, b, x, y) -(Float32[4.2408285], Float32[2.243728], 7.668049f0) +(Float32[4.2422233], Float32[2.2460847], 7.6680417f0) ``` There was a significant reduction in loss, and the parameters were updated! -`Flux` provides yet another convenience functionality, the [`Flux.@epochs`](@ref) macro, which can be used to train a model for a specific number of epochs. - -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" -julia> Flux.@epochs 10 train_custom_model() -[ Info: Epoch 1 -[ Info: Epoch 2 -[ Info: Epoch 3 -[ Info: Epoch 4 -[ Info: Epoch 5 -[ Info: Epoch 6 -[ Info: Epoch 7 -[ Info: Epoch 8 -[ Info: Epoch 9 -[ Info: Epoch 10 - -julia> W, b, custom_loss(W, b, x, y) -(Float32[4.2422233], Float32[2.2460847], 7.6680417f0) -``` - We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 42 epochs, and loss went down from `22.74856` to `7.6680417f`. Time for some visualization! ### Results From 0a03ab5c3e7fc9521937a20d95a87e65ac32c7f6 Mon Sep 17 00:00:00 2001 From: Saransh Date: Tue, 16 Aug 2022 00:52:46 +0530 Subject: [PATCH 10/23] Update docs/src/getting_started/linear_regression.md Co-authored-by: Michael Abbott <32575566+mcabbott@users.noreply.github.com> --- docs/src/getting_started/linear_regression.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 6cb412a816..24b019c6fc 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -307,7 +307,7 @@ julia> std(x_train_n) The standard deviation is now close to one! Our data is ready! ### Model -We can now directly use `Flux` and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and `Flux` will handle everything for us! Remember, we could have declared a model in plain `Julia` as well. 
The model will have 14 parameters, 13 weights, and one bias. +We can now directly use `Flux` and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and `Flux` will handle everything for us! Remember, we could have declared a model in plain `Julia` as well. The model will have 14 parameters: 13 weights and 1 bias. ```jldoctest linear_regression_complex julia> model = Dense(13 => 1) From f55603fc0c2cff0054e6e0b2550790e437d3ea6c Mon Sep 17 00:00:00 2001 From: Saransh Date: Tue, 16 Aug 2022 00:52:56 +0530 Subject: [PATCH 11/23] Update docs/src/getting_started/linear_regression.md Co-authored-by: Michael Abbott <32575566+mcabbott@users.noreply.github.com> --- docs/src/getting_started/linear_regression.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 24b019c6fc..5146ab6d34 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -49,7 +49,7 @@ julia> x = x .* reshape(rand(Float32, 61), (1, 61)); Visualizing the final data - ```jldoctest linear_regression_simple -julia> plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Generated data", xlabel = "x", ylabel= "y"); +julia> plot(vec(x), vec(y), lw = 3, seriestype = :scatter, label = "", title = "Generated data", xlabel = "x", ylabel= "y"); ``` From 91b12609720c73c5cd7e5b6fe94ee5e2577ae680 Mon Sep 17 00:00:00 2001 From: Saransh Date: Tue, 16 Aug 2022 00:53:04 +0530 Subject: [PATCH 12/23] Update docs/src/getting_started/linear_regression.md Co-authored-by: Michael Abbott <32575566+mcabbott@users.noreply.github.com> --- docs/src/getting_started/linear_regression.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 5146ab6d34..e780313e2a 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -8,9 +8,7 @@ Let us start by building a simple linear regression model. This model would be t Importing the required `Julia` packages - ```jldoctest linear_regression_simple -julia> using Flux - -julia> using Plots +julia> using Flux, Plots ``` ### Generating a dataset The data usually comes from the real world, which we will be exploring in the last part of this guide, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, here each `x` is equivalent to a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. 
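As an aside, a minimal sketch (plain base Julia, not part of the patches above) showing the same `(features, samples)` layout that the guide's `hcat` call produces — one row of 61 observations — built with `reshape` instead of splatting:

```julia
# Collect the range into a Vector{Float32}, then reshape it into a single
# row so that each column is one observation (features × samples).
x = reshape(collect(Float32, -3:0.1:3), 1, :)

# Apply the same generating rule used in the guide to obtain the labels.
y = 3 .* x .+ 2

size(x), size(y)    # both are (1, 61)
```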
From 055f6a40a7a3cdcfb77e10f45c1d75a9e4cab908 Mon Sep 17 00:00:00 2001 From: Saransh Date: Tue, 16 Aug 2022 01:19:38 +0530 Subject: [PATCH 13/23] Show data --- docs/src/getting_started/linear_regression.md | 21 +++++++------------ 1 file changed, 7 insertions(+), 14 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index e780313e2a..51933c64c8 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -10,17 +10,14 @@ Importing the required `Julia` packages - ```jldoctest linear_regression_simple julia> using Flux, Plots ``` + ### Generating a dataset The data usually comes from the real world, which we will be exploring in the last part of this guide, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, here each `x` is equivalent to a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. ```jldoctest linear_regression_simple -julia> x = hcat(collect(Float32, -3:0.1:3)...); - -julia> x |> size -(1, 61) - -julia> typeof(x) -Matrix{Float32} (alias for Array{Float32, 2}) +julia> x = hcat(collect(Float32, -3:0.1:3)...) +1×61 Matrix{Float32}: + -3.0 -2.9 -2.8 -2.7 -2.6 -2.5 … 2.4 2.5 2.6 2.7 2.8 2.9 3.0 ``` The `hcat` call generates a `Matrix` with numbers ranging from `-3.0` to `3.0` with a gap of `0.1` between them. Each column of this matrix holds a single `x`, a total of 61 `x`s. The next step would be to generate the corresponding labels or the `y`s. @@ -28,13 +25,9 @@ The `hcat` call generates a `Matrix` with numbers ranging from `-3.0` to `3.0` w ```jldoctest linear_regression_simple julia> f(x) = @. 3x + 2; -julia> y = f(x); - -julia> y |> size -(1, 61) - -julia> typeof(y) -Matrix{Float32} (alias for Array{Float32, 2}) +julia> y = f(x) +1×61 Matrix{Float32}: + -7.0 -6.7 -6.4 -6.1 -5.8 -5.5 … 9.5 9.8 10.1 10.4 10.7 11.0 ``` The function `f` maps each `x` to a `y`, and as `x` is a `Matrix`, the expression broadcasts the scalar values using `@.` macro. Our data points are ready, but they are too perfect. In a real-world scenario, we will not have an `f` function to generate `y` values, but instead, the labels would be manually added. From 8f89bd780004af9217383b7461fc5176eb0a956f Mon Sep 17 00:00:00 2001 From: Saransh Date: Tue, 16 Aug 2022 01:20:59 +0530 Subject: [PATCH 14/23] More general regex --- docs/src/getting_started/linear_regression.md | 34 +++++++++--------- xy.jld2 | Bin 1337 -> 0 bytes 2 files changed, 17 insertions(+), 17 deletions(-) delete mode 100644 xy.jld2 diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 51933c64c8..ae53056663 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -70,7 +70,7 @@ The `@.` macro allows you to perform the calculations by broadcasting the scalar The next step would be to initialize the model parameters, which are the weight and the bias. There are a lot of initialization techniques available for different machine learning models, but for the sake of this example, let's pull out the weight from a uniform distribution and initialize the bias as `0`. 
-```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> W = rand(Float32, 1, 1) 1×1 Matrix{Float32}: 0.99285793 @@ -82,7 +82,7 @@ julia> b = [0.0f0] Time to test if our model works! -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> custom_model(W, b, x) |> size (1, 61) @@ -92,7 +92,7 @@ julia> custom_model(W, b, x)[1], y[1] It does! But the predictions are way off. We need to train the model to improve the predictions, but before training the model we need to define the loss function. The loss function would ideally output a quantity that we will try to minimize during the entire training process. Here we will use the mean sum squared error loss function. -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> function custom_loss(W, b, x, y) ŷ = custom_model(W, b, x) sum((y .- ŷ).^2) / length(x) @@ -113,14 +113,14 @@ Dense(1 => 1) # 2 parameters A [`Dense(1 => 1)`](@ref Dense) layer denotes a layer of one neuron with one input (one feature) and one output. This layer is exactly same as the mathematical model defined by us above! Under the hood, `Flux` too calculates the output using the same expression! But, we don't have to initialize the parameters ourselves this time, instead `Flux` does it for us. -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> flux_model.weight, flux_model.bias (Float32[1.1412252], Float32[0.0]) ``` Now we can check if our model is acting right. We can pass the complete data in one go, with each `x` having exactly one feature (one input) - -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> flux_model(x) |> size (1, 61) @@ -130,7 +130,7 @@ julia> flux_model(x)[1], y[1] It is! The next step would be defining the loss function using `Flux`'s functions - -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> function flux_loss(flux_model, x, y) ŷ = flux_model(x) Flux.mse(ŷ, y) @@ -143,7 +143,7 @@ julia> flux_loss(flux_model, x, y) Everything works as before! It almost feels like `Flux` provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the `flux_model` is from our custom `model`. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our custom `model` to match that of the `flux_model` - -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" 
julia> W = Float32[1.1412252] 1-element Vector{Float32}: 1.1412252 @@ -151,7 +151,7 @@ julia> W = Float32[1.1412252] To check how both the models are performing on the data, let's find out the losses using the `loss` and `flux_loss` functions - -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> custom_loss(W, b, x, y), flux_loss(flux_model, x, y) (22.74856f0, 22.74856f0) ``` @@ -179,7 +179,7 @@ julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y); We can now update the parameters, following the gradient descent algorithm - -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> W .= W .- 0.1 .* dLdW 1-element Vector{Float32}: 1.8144473 @@ -191,7 +191,7 @@ julia> b .= b .- 0.1 .* dLdb The parameters have been updated! We can now check the value of the loss function - -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> custom_loss(W, b, x, y) 17.157953f0 ``` @@ -200,7 +200,7 @@ The loss went down! This means that we successfully trained our model for one ep Let's plug our super training logic inside a function and test it again - -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> function train_custom_model() dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y) @. W = W - 0.1 * dLdW @@ -215,7 +215,7 @@ julia> W, b, custom_loss(W, b, x, y) It works, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for 30 epochs. -```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> for i = 1:40 train_custom_model() end @@ -281,14 +281,14 @@ julia> x_train |> size, x_test |> size, y_train |> size, y_test |> size This data contains a diverse number of features, which means that the features have different scales. A wise option here would be to `normalise` the data, making the training process more efficient and fast. Let's check the standard deviation of the training data before normalising it. -```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> std(x_train) 134.06784844377117 ``` The data is indeed not normalised. We can use the [`Flux.normalise`](@ref) function to normalise the training data. -```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> x_train_n = Flux.normalise(x_train); julia> std(x_train_n) @@ -307,7 +307,7 @@ Dense(13 => 1) # 14 parameters Same as before, our next step would be to define a loss function to quantify our accuracy somehow. The lower the loss, the better the model! -```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> function loss(model, x, y) ŷ = model(x) Flux.mse(ŷ, y) @@ -356,7 +356,7 @@ This custom loop works! 
This shows how easily a user can write down any custom t Let's have a look at the loss - -```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> loss(model, x_train_n, y_train) 27.127200028562164 ``` @@ -366,7 +366,7 @@ The loss went down significantly! It can be minimized further by choosing an eve ### Testing The last step of this guide would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss. -```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+" +```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> x_test_n = Flux.normalise(x_test); julia> loss(model, x_test_n, y_test) diff --git a/xy.jld2 b/xy.jld2 deleted file mode 100644 index 0a0c5351231c645de8c9f94d84ae2d7830a711ee..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 1337 zcmeIwUr19?7y$5d=bxseNz)XGy!J;eb!@v!U9@}dY_3_OM2ON#n&g_xG~CMhkY(Y6 zC}>2{h7TjcEQJJL%C>thl__y5B@3mOfj;;U^v941_UF@WPrdZeOV9`Rp8Ng2@4!7D z$80q1F08dW>rS%=E;Q6zS)dZNtawa)yK z%;g<=L?`rTJKOl1PZs!K+jf4@EAdkurP!UN#~nr`F03)&P#=efXUp-2@gDqql;TSp zTzqNeICkvXi#MI0!awHB{Ais_cv};|(gLRT3+bYuq4w>b4D_U8(!*xvcS zGd+ZYlHK-RqLfNn?wREAJeCj|md@Dv!TP{%fOUdhfxJPw?Wux(6F3LPf~}V{+#-2R zDY4u73G@*a$B{`&MJNU)sD+0?VM;CVbYqSi!CucS1`{|=A=qp=g4d%HUQvsH&?!O7 zIl-?%-VL=|U_Pkd2OkkA^@F_tn*^H$n*;j@wgiR@l*Ssoo&wRUzf*qi Jtd93J`~prsBMbll From 51f8a384b723345b6a553666348ab0eb22f93619 Mon Sep 17 00:00:00 2001 From: Saransh Date: Mon, 22 Aug 2022 20:38:16 +0530 Subject: [PATCH 15/23] Minor bug in the guide --- docs/src/getting_started/linear_regression.md | 23 ++++++------------- 1 file changed, 7 insertions(+), 16 deletions(-) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index ae53056663..58f90914c0 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -247,19 +247,13 @@ The line fits well! There is room for improvement, but we leave that up to you! We now move on to a relatively complex linear regression model. Here we will use a real dataset from [`MLDatasets.jl`](https://github.com/JuliaML/MLDatasets.jl), which will not confine our data points to have only one feature. Let's start by importing the required packages - ```jldoctest linear_regression_complex -julia> using Flux - -julia> using Statistics - -julia> using MLDatasets: BostonHousing +julia> using Flux, Statistics, MLDatasets, DataFrames ``` ### Data Let's start by initializing our dataset. We will be using the [`BostonHousing`](https://juliaml.github.io/MLDatasets.jl/stable/datasets/misc/#MLDatasets.BostonHousing) dataset consisting of `506` data points. Each of these data points has `13` features and a corresponding label, the house's price. The `x`s are still mapped to a single `y`, but now, a single `x` data point has 13 features. ```jldoctest linear_regression_complex -julia> using DataFrames - julia> dataset = BostonHousing() dataset BostonHousing: metadata => Dict{String, Any} with 5 entries @@ -324,7 +318,7 @@ The training procedure would make use of the same mathematics, but now we can pa ```jldoctest linear_regression_complex julia> function train_model() - dLdm, _, _ = gradient(loss, model, x, y) + dLdm, _, _ = gradient(loss, model, x_train_n, y_train) @. model.weight = model.weight - 0.000001 * dLdm.weight @. 
model.bias = model.bias - 0.000001 * dLdm.bias end; @@ -342,7 +336,7 @@ julia> while true loss_init = loss(model, x_train_n, y_train) continue end - if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-3 + if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-4 break else loss_init = loss(model, x_train_n, y_train) @@ -386,8 +380,7 @@ After getting familiar with the basics of `Flux` and `Julia`, we moved ahead to ## Copy-pastable code ### Dummy dataset ```julia -using Flux -using Plots +using Flux, Plots # data x = hcat(collect(Float32, -3:0.1:3)...) @@ -430,9 +423,7 @@ plot!((x) -> b[1] + W[1] * x, -3, 3, label="Custom model", lw=2) ``` ### Real dataset ```julia -using Flux -using Statistics -using MLDatasets: BostonHousing +using Flux, Statistics, MLDatasets # data x, y = BostonHousing(as_df=false)[:] @@ -452,7 +443,7 @@ print("Initial loss: ", loss(model, x_train_n, y_train), "\n") # train function train_custom_model() - dLdm, _, _ = gradient(loss, model, x, y) + dLdm, _, _ = gradient(loss, model, x_train_n, y_train) @. model.weight = model.weight - 0.000001 * dLdm.weight @. model.bias = model.bias - 0.000001 * dLdm.bias end @@ -464,7 +455,7 @@ while true loss_init = loss(model, x_train_n, y_train) continue end - if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-3 + if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-4 break else loss_init = loss(model, x_train_n, y_train) From 36d7578cefd35ae5dd5d81608ea8105f1d541263 Mon Sep 17 00:00:00 2001 From: Saransh Date: Tue, 23 Aug 2022 20:31:43 +0530 Subject: [PATCH 16/23] Better introduction to a ML pipeline --- docs/src/getting_started/linear_regression.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 58f90914c0..6859c9f136 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -1,5 +1,14 @@ # Linear Regression +Flux is a pure Julia ML stack that allows you to build predictive models. Here are the steps for a typical Flux program: + +- Provide training and test data +- Build a model with configurable parameters to make predictions +- Iteratively train the model by tweaking the parameters to improve predictions +- Verify your model + +Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements. + The following page contains a step-by-step walkthrough of the linear regression algorithm in `Julia` using `Flux`! We will start by creating a simple linear regression model for dummy data and then move on to a real dataset. The first part would involve writing some parts of the model on our own, which will later be replaced by `Flux`. ## A simple linear regression model From df06a6de8a728d2727e62cb8b389982b934b2080 Mon Sep 17 00:00:00 2001 From: Saransh Chopra Date: Tue, 18 Oct 2022 20:41:22 +0530 Subject: [PATCH 17/23] Move to the new Getting Started section? 
--- docs/make.jl | 7 ++++--- docs/src/getting_started/linear_regression.md | 2 +- docs/src/{models => getting_started}/quickstart.md | 0 3 files changed, 5 insertions(+), 4 deletions(-) rename docs/src/{models => getting_started}/quickstart.md (100%) diff --git a/docs/make.jl b/docs/make.jl index 678a5827c3..a93c75f1b6 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -11,9 +11,10 @@ makedocs( pages = [ "Getting Started" => [ "Welcome" => "index.md", - "Quick Start" => "models/quickstart.md", - "Fitting a Line" => "models/overview.md", - "Gradients and Layers" => "models/basics.md", + "Quick Start" => "getting_started/quickstart.md", + "Fitting a Line" => "getting_started/overview.md", + "Gradients and Layers" => "getting_started/basics.md", + "Linear Regression" => "getting_started/linear_regression.md" ], "Building Models" => [ "Built-in Layers 📚" => "models/layers.md", diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/getting_started/linear_regression.md index 6859c9f136..8f111a153f 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/getting_started/linear_regression.md @@ -1,4 +1,4 @@ -# Linear Regression +# [Linear Regression](@id man-linear-regression) Flux is a pure Julia ML stack that allows you to build predictive models. Here are the steps for a typical Flux program: diff --git a/docs/src/models/quickstart.md b/docs/src/getting_started/quickstart.md similarity index 100% rename from docs/src/models/quickstart.md rename to docs/src/getting_started/quickstart.md From b67a3a975714da11b65094858b525f6465685643 Mon Sep 17 00:00:00 2001 From: Saransh Chopra Date: Tue, 25 Oct 2022 20:49:18 +0530 Subject: [PATCH 18/23] Create a new 'tutorials' section --- docs/make.jl | 4 +- .../linear_regression.md | 132 ++++-------------- 2 files changed, 27 insertions(+), 109 deletions(-) rename docs/src/{getting_started => tutorials}/linear_regression.md (82%) diff --git a/docs/make.jl b/docs/make.jl index a93c75f1b6..c65e51723b 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -14,7 +14,9 @@ makedocs( "Quick Start" => "getting_started/quickstart.md", "Fitting a Line" => "getting_started/overview.md", "Gradients and Layers" => "getting_started/basics.md", - "Linear Regression" => "getting_started/linear_regression.md" + ], + "Tutorials" => [ + "Linear Regression" => "tutorials/linear_regression.md", ], "Building Models" => [ "Built-in Layers 📚" => "models/layers.md", diff --git a/docs/src/getting_started/linear_regression.md b/docs/src/tutorials/linear_regression.md similarity index 82% rename from docs/src/getting_started/linear_regression.md rename to docs/src/tutorials/linear_regression.md index 8f111a153f..7fff4531fb 100644 --- a/docs/src/getting_started/linear_regression.md +++ b/docs/src/tutorials/linear_regression.md @@ -11,7 +11,8 @@ Under the hood, Flux uses a technique called automatic differentiation to take g The following page contains a step-by-step walkthrough of the linear regression algorithm in `Julia` using `Flux`! We will start by creating a simple linear regression model for dummy data and then move on to a real dataset. The first part would involve writing some parts of the model on our own, which will later be replaced by `Flux`. -## A simple linear regression model +--- + Let us start by building a simple linear regression model. This model would be trained on the data points of the form `(x₁, y₁), (x₂, y₂), ... , (xₙ, yₙ)`. In the real world, these `x`s can have multiple features, and the `y`s denote a label. 
In our example, each `x` has a single feature; hence, our data would have `n` data points, each point mapping a single feature to a single label. Importing the required `Julia` packages - @@ -20,8 +21,9 @@ Importing the required `Julia` packages - julia> using Flux, Plots ``` -### Generating a dataset -The data usually comes from the real world, which we will be exploring in the last part of this guide, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, here each `x` is equivalent to a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. +## Generating a dataset + +The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the `x`s of our data points and map them to the respective `y`s using a simple function. Remember, here each `x` is equivalent to a feature, and each `y` is the corresponding label. Combining all the `x`s and `y`s would create the complete dataset. ```jldoctest linear_regression_simple julia> x = hcat(collect(Float32, -3:0.1:3)...) @@ -60,7 +62,7 @@ The data looks random enough now! The `x` and `y` values are still somewhat corr We can now proceed ahead and build a model for our dataset! -### Building a model +## Building a model A linear regression model is defined mathematically as - @@ -149,7 +151,7 @@ julia> flux_loss(flux_model, x, y) 22.74856f0 ``` -Everything works as before! It almost feels like `Flux` provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the `flux_model` is from our custom `model`. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our custom `model` to match that of the `flux_model` - +Everything works as before! It almost feels like `Flux` provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the `flux_model` is from our `custom_model`. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our `custom_model` to match that of the `flux_model` - ```jldoctest linear_regression_simple; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" @@ -165,7 +167,9 @@ julia> custom_loss(W, b, x, y), flux_loss(flux_model, x, y) (22.74856f0, 22.74856f0) ``` -The losses are identical! This means that our `model` and the `flux_model` are identical on some level, and the loss functions are completely identical! The difference in models would be that `Flux`'s [`Dense`](@ref) layer supports many other arguments that can be used to customize the layer further. But, for this guide, let us stick to our simple custom `model`. +The losses are identical! This means that our `model` and the `flux_model` are identical on some level, and the loss functions are completely identical! The difference in models would be that `Flux`'s [`Dense`](@ref) layer supports many other arguments that can be used to customize the layer further. But, for this tutorial, let us stick to our simple `custom_model`. + +## Training the model Let's train our model using the classic Gradient Descent algorithm. 
According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations - @@ -237,8 +241,9 @@ There was a significant reduction in loss, and the parameters were updated! We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 42 epochs, and loss went down from `22.74856` to `7.6680417f`. Time for some visualization! -### Results -The main objective of this guide was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, `Wx + b` is nothing more than a line's equation, with `slope = W[1]` and `y-intercept = b[1]` (indexing at `1` as `W` and `b` are iterable). +## Results + +The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, `Wx + b` is nothing more than a line's equation, with `slope = W[1]` and `y-intercept = b[1]` (indexing at `1` as `W` and `b` are iterable). Plotting the line and the data points using `Plot.jl` - ```jldoctest linear_regression_simple @@ -252,14 +257,15 @@ julia> plot!((x) -> b[1] + W[1] * x, -3, 3, label="Custom model", lw=2); The line fits well! There is room for improvement, but we leave that up to you! You can play with the optimisers, the number of epochs, learning rate, etc. to improve the fitting and reduce the loss! -## Linear regression model on a real dataset +### Linear regression model on a real dataset + We now move on to a relatively complex linear regression model. Here we will use a real dataset from [`MLDatasets.jl`](https://github.com/JuliaML/MLDatasets.jl), which will not confine our data points to have only one feature. Let's start by importing the required packages - ```jldoctest linear_regression_complex julia> using Flux, Statistics, MLDatasets, DataFrames ``` -### Data +## Gathering real data Let's start by initializing our dataset. We will be using the [`BostonHousing`](https://juliaml.github.io/MLDatasets.jl/stable/datasets/misc/#MLDatasets.BostonHousing) dataset consisting of `506` data points. Each of these data points has `13` features and a corresponding label, the house's price. The `x`s are still mapped to a single `y`, but now, a single `x` data point has 13 features. ```jldoctest linear_regression_complex @@ -300,7 +306,7 @@ julia> std(x_train_n) The standard deviation is now close to one! Our data is ready! -### Model +## Building a Flux model We can now directly use `Flux` and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and `Flux` will handle everything for us! Remember, we could have declared a model in plain `Julia` as well. The model will have 14 parameters: 13 weights and 1 bias. ```jldoctest linear_regression_complex @@ -322,7 +328,8 @@ julia> loss(model, x_train_n, y_train) We can now proceed to the training phase! -### Training +## Training the Flux model + The training procedure would make use of the same mathematics, but now we can pass in the model inside the `gradient` call and let `Flux` and `Zygote` handle the derivatives! 
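Before wrapping this step in a training function, it can help to see what `Zygote` actually hands back. The following is a rough, illustrative sketch (the shapes assume the `Dense(13 => 1)` model and the normalised training data from above; it is not a doctest, and the gradient values themselves depend on the random initialisation of the model):

```julia
# Illustrative only – the numerical values change with every initialisation of `model`.
dLdm, _, _ = gradient(loss, model, x_train_n, y_train)

keys(dLdm)          # (:weight, :bias, :σ) – a NamedTuple mirroring the fields of Dense
size(dLdm.weight)   # (1, 13), the same shape as model.weight
size(dLdm.bias)     # (1,), the same shape as model.bias
```

The gradient with respect to the activation function is simply `nothing`, so only the weight and bias entries are used in the update step below.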
```jldoctest linear_regression_complex @@ -333,7 +340,7 @@ julia> function train_model() end; ``` -Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when `change in loss < δ`. The quantity `δ` can be altered according to a user's need, but let's fix it to `10⁻³` for this guide. +Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when `change in loss < δ`. The quantity `δ` can be altered according to a user's need, but let's fix it to `10⁻³` for this tutorial. We can write such custom training loops effortlessly using `Flux` and plain `Julia`! ```jldoctest linear_regression_complex @@ -366,8 +373,9 @@ julia> loss(model, x_train_n, y_train) The loss went down significantly! It can be minimized further by choosing an even smaller `δ`. -### Testing -The last step of this guide would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss. +## Testing the Flux model + +The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss. ```jldoctest linear_regression_complex; filter = r"[+-]?([0-9]*[.])?[0-9]+(f[+-]*[0-9])?" julia> x_test_n = Flux.normalise(x_test); @@ -380,100 +388,8 @@ The loss is not as small as the loss of the training data, but it looks good! Th --- -Summarising this guide, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without `Flux`, and how they were almost identical. +Summarising this tutorial, we started by generating a random yet correlated dataset for our `custom model`. We then saw how a simple linear regression model could be built with and without `Flux`, and how they were almost identical. Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how `Flux` provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users. After getting familiar with the basics of `Flux` and `Julia`, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing `Flux`'s full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial. - -## Copy-pastable code -### Dummy dataset -```julia -using Flux, Plots - -# data -x = hcat(collect(Float32, -3:0.1:3)...) -f(x) = @. 3x + 2 -y = f(x) -x = x .* reshape(rand(Float32, 61), (1, 61)) - -# plot the data -plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Generated data", xlabel = "x", ylabel= "y") - -# custom model and parameters -custom_model(W, b, x) = @. W*x + b -W = rand(Float32, 1, 1) -b = [0.0f0] - -# loss function -function custom_loss(W, b, x, y) - ŷ = custom_model(W, b, x) - sum((y .- ŷ).^2) / length(x) -end; - -print("Initial loss: ", custom_loss(W, b, x, y), "\n") - -# train -function train_custom_model() - dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y) - @. W = W - 0.1 * dLdW - @. 
b = b - 0.1 * dLdb -end - -for i = 1:40 - train_custom_model() -end - -print("Final loss: ", custom_loss(W, b, x, y), "\n") - -# plot data and results -plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Simple Linear Regression", xlabel = "x", ylabel= "y") -plot!((x) -> b[1] + W[1] * x, -3, 3, label="Custom model", lw=2) -``` -### Real dataset -```julia -using Flux, Statistics, MLDatasets - -# data -x, y = BostonHousing(as_df=false)[:] -x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end] -x_train_n = Flux.normalise(x_train) - -# model -model = Dense(13 => 1) - -# loss function -function loss(model, x, y) - ŷ = model(x) - Flux.mse(ŷ, y) -end; - -print("Initial loss: ", loss(model, x_train_n, y_train), "\n") - -# train -function train_custom_model() - dLdm, _, _ = gradient(loss, model, x_train_n, y_train) - @. model.weight = model.weight - 0.000001 * dLdm.weight - @. model.bias = model.bias - 0.000001 * dLdm.bias -end - -loss_init = Inf; -while true - train_custom_model() - if loss_init == Inf - loss_init = loss(model, x_train_n, y_train) - continue - end - if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-4 - break - else - loss_init = loss(model, x_train_n, y_train) - end -end - -print("Final loss: ", loss(model, x_train_n, y_train), "\n") - -# test -x_test_n = Flux.normalise(x_test); -print("Test loss: ", loss(model, x_test_n, y_test), "\n") -``` From 768543cbe5b4db46e4948d036b7f71e62406c377 Mon Sep 17 00:00:00 2001 From: Saransh Chopra Date: Tue, 25 Oct 2022 21:01:36 +0530 Subject: [PATCH 19/23] Fix doctests --- docs/src/tutorials/linear_regression.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/src/tutorials/linear_regression.md b/docs/src/tutorials/linear_regression.md index 7fff4531fb..5c34174cdc 100644 --- a/docs/src/tutorials/linear_regression.md +++ b/docs/src/tutorials/linear_regression.md @@ -271,10 +271,10 @@ Let's start by initializing our dataset. We will be using the [`BostonHousing`]( ```jldoctest linear_regression_complex julia> dataset = BostonHousing() dataset BostonHousing: - metadata => Dict{String, Any} with 5 entries - features => 506×13 DataFrame - targets => 506×1 DataFrame - dataframe => 506×14 DataFrame + metadata => Dict{String, Any} with 5 entries + features => 506×13 DataFrame + targets => 506×1 DataFrame + dataframe => 506×14 DataFrame julia> x, y = BostonHousing(as_df=false)[:]; ``` From 13cb623d63c91f85bc9d0f93c2f8a2a6d715428b Mon Sep 17 00:00:00 2001 From: Saransh Chopra Date: Tue, 25 Oct 2022 22:44:14 +0530 Subject: [PATCH 20/23] Try fixing spaces --- docs/src/tutorials/linear_regression.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/src/tutorials/linear_regression.md b/docs/src/tutorials/linear_regression.md index 5c34174cdc..91cea9da1b 100644 --- a/docs/src/tutorials/linear_regression.md +++ b/docs/src/tutorials/linear_regression.md @@ -271,10 +271,10 @@ Let's start by initializing our dataset. 
We will be using the [`BostonHousing`]( ```jldoctest linear_regression_complex julia> dataset = BostonHousing() dataset BostonHousing: - metadata => Dict{String, Any} with 5 entries - features => 506×13 DataFrame - targets => 506×1 DataFrame - dataframe => 506×14 DataFrame + metadata => Dict{String, Any} with 5 entries + features => 506×13 DataFrame + targets => 506×1 DataFrame + dataframe => 506×14 DataFrame julia> x, y = BostonHousing(as_df=false)[:]; ``` From 17d167e20f2c401170bc66449c5093e6c17549f2 Mon Sep 17 00:00:00 2001 From: Saransh Chopra Date: Tue, 25 Oct 2022 23:06:24 +0530 Subject: [PATCH 21/23] More doctest fixing --- docs/src/tutorials/linear_regression.md | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/docs/src/tutorials/linear_regression.md b/docs/src/tutorials/linear_regression.md index 91cea9da1b..d73c20cc38 100644 --- a/docs/src/tutorials/linear_regression.md +++ b/docs/src/tutorials/linear_regression.md @@ -269,12 +269,7 @@ julia> using Flux, Statistics, MLDatasets, DataFrames Let's start by initializing our dataset. We will be using the [`BostonHousing`](https://juliaml.github.io/MLDatasets.jl/stable/datasets/misc/#MLDatasets.BostonHousing) dataset consisting of `506` data points. Each of these data points has `13` features and a corresponding label, the house's price. The `x`s are still mapped to a single `y`, but now, a single `x` data point has 13 features. ```jldoctest linear_regression_complex -julia> dataset = BostonHousing() -dataset BostonHousing: - metadata => Dict{String, Any} with 5 entries - features => 506×13 DataFrame - targets => 506×1 DataFrame - dataframe => 506×14 DataFrame +julia> dataset = BostonHousing(); julia> x, y = BostonHousing(as_df=false)[:]; ``` From 0350e0392cb16a1b6292f9f9cd1d9c5f66e2e15d Mon Sep 17 00:00:00 2001 From: Saransh Chopra Date: Thu, 27 Oct 2022 16:00:49 +0530 Subject: [PATCH 22/23] Move to the existing tutorials section --- docs/make.jl | 10 ++++------ docs/src/index.md | 4 ++-- docs/src/models/activation.md | 3 +-- docs/src/models/functors.md | 2 +- docs/src/training/optimisers.md | 2 +- docs/src/training/training.md | 4 ++-- docs/src/{models => tutorials}/advanced.md | 0 src/losses/functions.jl | 2 +- 8 files changed, 12 insertions(+), 15 deletions(-) rename docs/src/{models => tutorials}/advanced.md (100%) diff --git a/docs/make.jl b/docs/make.jl index c65e51723b..7c2ddfa9e6 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -15,9 +15,6 @@ makedocs( "Fitting a Line" => "getting_started/overview.md", "Gradients and Layers" => "getting_started/basics.md", ], - "Tutorials" => [ - "Linear Regression" => "tutorials/linear_regression.md", - ], "Building Models" => [ "Built-in Layers 📚" => "models/layers.md", "Recurrence" => "models/recurrence.md", @@ -44,11 +41,12 @@ makedocs( "Flat vs. 
Nested 📚" => "destructure.md", "Functors.jl 📚 (`fmap`, ...)" => "models/functors.md", ], + "Tutorials" => [ + "Linear Regression" => "tutorials/linear_regression.md", + "Custom Layers" => "tutorials/advanced.md", # TODO move freezing to Training + ], "Performance Tips" => "performance.md", "Flux's Ecosystem" => "ecosystem.md", - "Tutorials" => [ # TODO, maybe - "Custom Layers" => "models/advanced.md", # TODO move freezing to Training - ], ], format = Documenter.HTML( sidebar_sitename = false, diff --git a/docs/src/index.md b/docs/src/index.md index 60a300e0e4..f394bbe8b0 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -16,9 +16,9 @@ Other closely associated packages, also installed automatically, include [Zygote ## Learning Flux -The [quick start](models/quickstart.md) page trains a simple neural network. +The [quick start](getting_started/quickstart.md) page trains a simple neural network. -This rest of this documentation provides a from-scratch introduction to Flux's take on models and how they work, starting with [fitting a line](models/overview.md). Once you understand these docs, congratulations, you also understand [Flux's source code](https://github.com/FluxML/Flux.jl), which is intended to be concise, legible and a good reference for more advanced concepts. +This rest of this documentation provides a from-scratch introduction to Flux's take on models and how they work, starting with [fitting a line](getting_started/overview.md). Once you understand these docs, congratulations, you also understand [Flux's source code](https://github.com/FluxML/Flux.jl), which is intended to be concise, legible and a good reference for more advanced concepts. Sections with 📚 contain API listings. The same text is avalable at the Julia prompt, by typing for example `?gpu`. diff --git a/docs/src/models/activation.md b/docs/src/models/activation.md index 5e6e718098..ae14750aeb 100644 --- a/docs/src/models/activation.md +++ b/docs/src/models/activation.md @@ -1,5 +1,4 @@ - -# Activation Functions from NNlib.jl +# [Activation Functions from NNlib.jl](@id man-activations) These non-linearities used between layers of your model are exported by the [NNlib](https://github.com/FluxML/NNlib.jl) package. diff --git a/docs/src/models/functors.md b/docs/src/models/functors.md index 72b8db8318..7ad152cfa8 100644 --- a/docs/src/models/functors.md +++ b/docs/src/models/functors.md @@ -4,7 +4,7 @@ Flux models are deeply nested structures, and [Functors.jl](https://github.com/F New layers should be annotated using the `Functors.@functor` macro. This will enable [`params`](@ref Flux.params) to see the parameters inside, and [`gpu`](@ref) to move them to the GPU. -`Functors.jl` has its own [notes on basic usage](https://fluxml.ai/Functors.jl/stable/#Basic-Usage-and-Implementation) for more details. Additionally, the [Advanced Model Building and Customisation](../models/advanced.md) page covers the use cases of `Functors` in greater details. +`Functors.jl` has its own [notes on basic usage](https://fluxml.ai/Functors.jl/stable/#Basic-Usage-and-Implementation) for more details. Additionally, the [Advanced Model Building and Customisation](../tutorials/advanced.md) page covers the use cases of `Functors` in greater details. 
```@docs Functors.@functor diff --git a/docs/src/training/optimisers.md b/docs/src/training/optimisers.md index 9d619f8d10..066196b4ba 100644 --- a/docs/src/training/optimisers.md +++ b/docs/src/training/optimisers.md @@ -4,7 +4,7 @@ CurrentModule = Flux # Optimisers -Consider a [simple linear regression](../getting_started/linear_regression.md). We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters `W` and `b`. +Consider a [simple linear regression](../tutorials/linear_regression.md). We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters `W` and `b`. ```julia using Flux diff --git a/docs/src/training/training.md b/docs/src/training/training.md index 70fa39a510..e119b8914e 100644 --- a/docs/src/training/training.md +++ b/docs/src/training/training.md @@ -36,7 +36,7 @@ Flux.Optimise.train! ``` There are plenty of examples in the [model zoo](https://github.com/FluxML/model-zoo), and -more information can be found on [Custom Training Loops](../models/advanced.md). +more information can be found on [Custom Training Loops](../tutorials/advanced.md). ## Loss Functions @@ -68,7 +68,7 @@ The model to be trained must have a set of tracked parameters that are used to c Such an object contains a reference to the model's parameters, not a copy, such that after their training, the model behaves according to their updated values. -Handling all the parameters on a layer by layer basis is explained in the [Layer Helpers](../getting_started/basics.md) section. Also, for freezing model parameters, see the [Advanced Usage Guide](../models/advanced.md). +Handling all the parameters on a layer by layer basis is explained in the [Layer Helpers](../getting_started/basics.md) section. Also, for freezing model parameters, see the [Advanced Usage Guide](../tutorials/advanced.md). ```@docs Flux.params diff --git a/docs/src/models/advanced.md b/docs/src/tutorials/advanced.md similarity index 100% rename from docs/src/models/advanced.md rename to docs/src/tutorials/advanced.md diff --git a/src/losses/functions.jl b/src/losses/functions.jl index ea7b4a6c65..ffda2ff99a 100644 --- a/src/losses/functions.jl +++ b/src/losses/functions.jl @@ -273,7 +273,7 @@ Return the binary cross-entropy loss, computed as agg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ))) -Where typically, the prediction `ŷ` is given by the output of a [sigmoid](@ref Activation-Functions-from-NNlib.jl) activation. +Where typically, the prediction `ŷ` is given by the output of a [sigmoid](@ref man-activations) activation. The `ϵ` term is included to avoid infinity. Using [`logitbinarycrossentropy`](@ref) is recomended over `binarycrossentropy` for numerical stability. 
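To make the formula above concrete, here is a small, self-contained sketch comparing the two losses (the labels and logits are invented purely for illustration and are not taken from any dataset):

```julia
using Flux

y = [0, 1, 1]                 # made-up ground-truth labels
z = [-1.2f0, 0.8f0, 0.3f0]    # made-up raw model outputs (logits)
ŷ = Flux.sigmoid.(z)          # probabilities after the sigmoid

# The two calls agree closely, but the logit version skips the explicit sigmoid
# and is numerically more stable:
Flux.binarycrossentropy(ŷ, y)
Flux.logitbinarycrossentropy(z, y)
```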
From 6b64b584aef4332f421650578a92fd48e96620bd Mon Sep 17 00:00:00 2001 From: Saransh Chopra Date: Thu, 27 Oct 2022 22:47:13 +0530 Subject: [PATCH 23/23] Revert structure + use ids --- docs/make.jl | 8 ++++---- docs/src/gpu.md | 2 +- docs/src/index.md | 4 ++-- docs/src/{tutorials => models}/advanced.md | 4 ++-- docs/src/{getting_started => models}/basics.md | 0 docs/src/models/functors.md | 2 +- docs/src/{getting_started => models}/overview.md | 0 docs/src/{getting_started => models}/quickstart.md | 0 docs/src/training/optimisers.md | 2 +- docs/src/training/training.md | 8 ++++---- 10 files changed, 15 insertions(+), 15 deletions(-) rename docs/src/{tutorials => models}/advanced.md (99%) rename docs/src/{getting_started => models}/basics.md (100%) rename docs/src/{getting_started => models}/overview.md (100%) rename docs/src/{getting_started => models}/quickstart.md (100%) diff --git a/docs/make.jl b/docs/make.jl index 7c2ddfa9e6..5409a117c7 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -11,9 +11,9 @@ makedocs( pages = [ "Getting Started" => [ "Welcome" => "index.md", - "Quick Start" => "getting_started/quickstart.md", - "Fitting a Line" => "getting_started/overview.md", - "Gradients and Layers" => "getting_started/basics.md", + "Quick Start" => "models/quickstart.md", + "Fitting a Line" => "models/overview.md", + "Gradients and Layers" => "models/basics.md", ], "Building Models" => [ "Built-in Layers 📚" => "models/layers.md", @@ -43,7 +43,7 @@ makedocs( ], "Tutorials" => [ "Linear Regression" => "tutorials/linear_regression.md", - "Custom Layers" => "tutorials/advanced.md", # TODO move freezing to Training + "Custom Layers" => "models/advanced.md", # TODO move freezing to Training ], "Performance Tips" => "performance.md", "Flux's Ecosystem" => "ecosystem.md", diff --git a/docs/src/gpu.md b/docs/src/gpu.md index 27956baa57..e8e98774b6 100644 --- a/docs/src/gpu.md +++ b/docs/src/gpu.md @@ -17,7 +17,7 @@ true Support for array operations on other hardware backends, like GPUs, is provided by external packages like [CUDA](https://github.com/JuliaGPU/CUDA.jl). Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it. -For example, we can use `CUDA.CuArray` (with the `cu` converter) to run our [basic example](getting_started/basics.md) on an NVIDIA GPU. +For example, we can use `CUDA.CuArray` (with the `cu` converter) to run our [basic example](@ref man-basics) on an NVIDIA GPU. (Note that you need to have CUDA available to use CUDA.CuArray – please see the [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) instructions for more details.) diff --git a/docs/src/index.md b/docs/src/index.md index f394bbe8b0..98fffc4a5c 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -16,9 +16,9 @@ Other closely associated packages, also installed automatically, include [Zygote ## Learning Flux -The [quick start](getting_started/quickstart.md) page trains a simple neural network. +The [quick start](@ref man-quickstart) page trains a simple neural network. -This rest of this documentation provides a from-scratch introduction to Flux's take on models and how they work, starting with [fitting a line](getting_started/overview.md). Once you understand these docs, congratulations, you also understand [Flux's source code](https://github.com/FluxML/Flux.jl), which is intended to be concise, legible and a good reference for more advanced concepts. 
+This rest of this documentation provides a from-scratch introduction to Flux's take on models and how they work, starting with [fitting a line](@ref man-overview). Once you understand these docs, congratulations, you also understand [Flux's source code](https://github.com/FluxML/Flux.jl), which is intended to be concise, legible and a good reference for more advanced concepts. Sections with 📚 contain API listings. The same text is avalable at the Julia prompt, by typing for example `?gpu`. diff --git a/docs/src/tutorials/advanced.md b/docs/src/models/advanced.md similarity index 99% rename from docs/src/tutorials/advanced.md rename to docs/src/models/advanced.md index 047053946f..2c8ce33f7a 100644 --- a/docs/src/tutorials/advanced.md +++ b/docs/src/models/advanced.md @@ -1,4 +1,4 @@ -# Defining Customised Layers +# [Defining Customised Layers](@id man-advanced) Here we will try and describe usage of some more advanced features that Flux provides to give more control over model building. @@ -34,7 +34,7 @@ For an intro to Flux and automatic differentiation, see this [tutorial](https:// ## Customising Parameter Collection for a Model -Taking reference from our example `Affine` layer from the [basics](../getting_started/basics.md#Building-Layers-1). +Taking reference from our example `Affine` layer from the [basics](@ref man-basics). By default all the fields in the `Affine` type are collected as its parameters, however, in some cases it may be desired to hold other metadata in our "layers" that may not be needed for training, and are hence supposed to be ignored while the parameters are collected. With Flux, it is possible to mark the fields of our layers that are trainable in two ways. diff --git a/docs/src/getting_started/basics.md b/docs/src/models/basics.md similarity index 100% rename from docs/src/getting_started/basics.md rename to docs/src/models/basics.md diff --git a/docs/src/models/functors.md b/docs/src/models/functors.md index 7ad152cfa8..252841c0c2 100644 --- a/docs/src/models/functors.md +++ b/docs/src/models/functors.md @@ -4,7 +4,7 @@ Flux models are deeply nested structures, and [Functors.jl](https://github.com/F New layers should be annotated using the `Functors.@functor` macro. This will enable [`params`](@ref Flux.params) to see the parameters inside, and [`gpu`](@ref) to move them to the GPU. -`Functors.jl` has its own [notes on basic usage](https://fluxml.ai/Functors.jl/stable/#Basic-Usage-and-Implementation) for more details. Additionally, the [Advanced Model Building and Customisation](../tutorials/advanced.md) page covers the use cases of `Functors` in greater details. +`Functors.jl` has its own [notes on basic usage](https://fluxml.ai/Functors.jl/stable/#Basic-Usage-and-Implementation) for more details. Additionally, the [Advanced Model Building and Customisation](@ref man-advanced) page covers the use cases of `Functors` in greater details. 
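As a minimal sketch of the annotation step described above, consider the toy `Affine` layer referenced in the basics and advanced sections, reduced here to its bare fields. Once annotated, the layer becomes visible to `params`, `gpu` and `fmap`:

```julia
using Flux

# A minimal custom layer: just a weight matrix and a bias vector.
struct Affine
    W
    b
end

Affine(in::Integer, out::Integer) = Affine(randn(Float32, out, in), zeros(Float32, out))

(a::Affine)(x) = a.W * x .+ a.b

# The annotation is what lets params, gpu and fmap recurse into the fields.
Flux.@functor Affine

a = Affine(3, 2)
Flux.params(a)   # now collects a.W and a.b
```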
```@docs Functors.@functor diff --git a/docs/src/getting_started/overview.md b/docs/src/models/overview.md similarity index 100% rename from docs/src/getting_started/overview.md rename to docs/src/models/overview.md diff --git a/docs/src/getting_started/quickstart.md b/docs/src/models/quickstart.md similarity index 100% rename from docs/src/getting_started/quickstart.md rename to docs/src/models/quickstart.md diff --git a/docs/src/training/optimisers.md b/docs/src/training/optimisers.md index 066196b4ba..8b3a86d975 100644 --- a/docs/src/training/optimisers.md +++ b/docs/src/training/optimisers.md @@ -4,7 +4,7 @@ CurrentModule = Flux # Optimisers -Consider a [simple linear regression](../tutorials/linear_regression.md). We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters `W` and `b`. +Consider a [simple linear regression](@ref man-linear-regression). We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters `W` and `b`. ```julia using Flux diff --git a/docs/src/training/training.md b/docs/src/training/training.md index e119b8914e..76aa40f5b8 100644 --- a/docs/src/training/training.md +++ b/docs/src/training/training.md @@ -36,11 +36,11 @@ Flux.Optimise.train! ``` There are plenty of examples in the [model zoo](https://github.com/FluxML/model-zoo), and -more information can be found on [Custom Training Loops](../tutorials/advanced.md). +more information can be found on [Custom Training Loops](@ref man-advanced). ## Loss Functions -The objective function must return a number representing how far the model is from its target – the *loss* of the model. The `loss` function that we defined in [basics](../getting_started/basics.md) will work as an objective. +The objective function must return a number representing how far the model is from its target – the *loss* of the model. The `loss` function that we defined in [basics](@ref man-basics) will work as an objective. In addition to custom losses, model can be trained in conjuction with the commonly used losses that are grouped under the `Flux.Losses` module. We can also define an objective in terms of some model: @@ -64,11 +64,11 @@ At first glance, it may seem strange that the model that we want to train is not ## Model parameters -The model to be trained must have a set of tracked parameters that are used to calculate the gradients of the objective function. In the [basics](../getting_started/basics.md) section it is explained how to create models with such parameters. The second argument of the function `Flux.train!` must be an object containing those parameters, which can be obtained from a model `m` as `Flux.params(m)`. +The model to be trained must have a set of tracked parameters that are used to calculate the gradients of the objective function. In the [basics](@ref man-basics) section it is explained how to create models with such parameters. The second argument of the function `Flux.train!` must be an object containing those parameters, which can be obtained from a model `m` as `Flux.params(m)`. Such an object contains a reference to the model's parameters, not a copy, such that after their training, the model behaves according to their updated values. -Handling all the parameters on a layer by layer basis is explained in the [Layer Helpers](../getting_started/basics.md) section. Also, for freezing model parameters, see the [Advanced Usage Guide](../tutorials/advanced.md). 
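For a concrete picture of what such a parameter collection looks like, here is a short sketch (the two-layer model is invented for illustration):

```julia
using Flux

m  = Chain(Dense(10 => 5, relu), Dense(5 => 2))
ps = Flux.params(m)       # references every trainable array inside m

length(ps)                # 4: two weight matrices and two bias vectors
[size(p) for p in ps]     # [(5, 10), (5,), (2, 5), (2,)]
```

Because the collection holds references rather than copies, updating its entries during training updates the model itself.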
+Handling all the parameters on a layer by layer basis is explained in the [Layer Helpers](@ref man-basics) section. Also, for freezing model parameters, see the [Advanced Usage Guide](@ref man-advanced). ```@docs Flux.params