diff --git a/docs/src/gpu.md b/docs/src/gpu.md
index 6ca5e3271d..207e1b9faa 100644
--- a/docs/src/gpu.md
+++ b/docs/src/gpu.md
@@ -97,9 +97,9 @@ Some of the common workflows involving the use of GPUs are presented below.
 
 ### Transferring Training Data
 
-In order to train the model using the GPU both model and the training data have to be transferred to GPU memory. This process can be done with the `gpu` function in two different ways:
+In order to train the model using the GPU, both the model and the training data have to be transferred to GPU memory. This can be done with the `gpu` function in two different ways:
 
-1. Iterating over the batches in a [DataLoader](@ref) object transfering each one of the training batches at a time to the GPU.
+1. Iterating over the batches in a [DataLoader](@ref) object, transferring one training batch at a time to the GPU.
    ```julia
    train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
    # ... model, optimizer and loss definitions
@@ -112,14 +112,14 @@ In order to train the model using the GPU both model and the training data have
    end
    ```
 
-2. Transferring all training data to the GPU at once before creating the [DataLoader](@ref) object. This is usually performed for smaller datasets which are sure to fit in the available GPU memory. Some possitilities are:
+2. Transferring all training data to the GPU at once before creating the [DataLoader](@ref) object. This is usually performed for smaller datasets which are sure to fit in the available GPU memory. Some possibilities are:
    ```julia
    gpu_train_loader = Flux.DataLoader((xtrain |> gpu, ytrain |> gpu), batchsize = 32)
    ```
    ```julia
    gpu_train_loader = Flux.DataLoader((xtrain, ytrain) |> gpu, batchsize = 32)
    ```
-   Note that both `gpu` and `cpu` are smart enough to recurse through tuples and namedtuples. Other possibility is to use [`MLUtils.mapsobs`](https://juliaml.github.io/MLUtils.jl/dev/api/#MLUtils.mapobs) to push the data movement invocation into the background thread:
+   Note that both `gpu` and `cpu` are smart enough to recurse through tuples and namedtuples. Another possibility is to use [`MLUtils.mapobs`](https://juliaml.github.io/MLUtils.jl/dev/api/#MLUtils.mapobs) to push the data movement invocation into the background thread:
    ```julia
    using MLUtils: mapobs
    # ...
@@ -159,7 +159,7 @@ let model = cpu(model)
   BSON.@save "./path/to/trained_model.bson" model
 end
-# is equivalente to the above, but uses `key=value` storing directve from BSON.jl
+# is equivalent to the above, but uses the `key=value` storing directive from BSON.jl
 BSON.@save "./path/to/trained_model.bson" model = cpu(model)
 ```
 
 The reason behind this is that models trained in the GPU but not transferred to the CPU memory scope will expect `CuArray`s as input. In other words, Flux models expect input data coming from the same kind device in which they were trained on.
@@ -181,4 +181,15 @@
 $ export CUDA_VISIBLE_DEVICES='0,1'
 ```
 
-More information for conditional use of GPUs in CUDA.jl can be found in its [documentation](https://cuda.juliagpu.org/stable/installation/conditional/#Conditional-use), and information about the specific use of the variable is described in the [Nvidia CUDA blogpost](https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/).
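+For example, a script can pick its device at run time and fall back to the CPU when no working GPU is present (a minimal sketch; adapt the model to your own code):
+
+```julia
+using Flux, CUDA
+
+# `gpu` already falls back to doing nothing when CUDA is unusable,
+# but an explicit check makes the intent clear and lets you branch on it elsewhere.
+device = CUDA.functional() ? gpu : cpu
+model = Dense(10, 5) |> device
+```
+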
+More information on conditional use of GPUs in CUDA.jl can be found in its [documentation](https://cuda.juliagpu.org/stable/installation/conditional/#Conditional-use), and information about the specific use of the variable is described in the [Nvidia CUDA blog post](https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/).
diff --git a/docs/src/index.md b/docs/src/index.md
index 39a9f9594f..28f99a80de 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -3,7 +3,7 @@
 Flux is a library for machine learning geared towards high-performance production pipelines. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:
 
 * **Doing the obvious thing**. Flux has relatively few explicit APIs for features like regularisation or embeddings. Instead, writing down the mathematical form will work – and be fast.
-* **Extensible by default**. Flux is written to be highly extensible and flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all [high level Julia code](https://github.com/FluxML/Flux.jl/blob/ec16a2c77dbf6ab8b92b0eecd11661be7a62feef/src/layers/recurrent.jl#L131). When in doubt, it’s well worth looking at [the source](https://github.com/FluxML/Flux.jl/). If you need something different, you can easily roll your own.
+* **Extensible by default**. Flux is written to be highly extensible and flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all [high-level Julia code](https://github.com/FluxML/Flux.jl/blob/ec16a2c77dbf6ab8b92b0eecd11661be7a62feef/src/layers/recurrent.jl#L131). When in doubt, it’s well worth looking at [the source](https://github.com/FluxML/Flux.jl/). If you need something different, you can easily roll your own.
 * **Performance is key**. Flux integrates with high-performance AD tools such as [Zygote.jl](https://github.com/FluxML/Zygote.jl) for generating fast code. Flux optimizes both CPU and GPU performance. Scaling workloads easily to multiple GPUs can be done with the help of Julia's [GPU tooling](https://github.com/JuliaGPU/CUDA.jl) and projects like [DaggerFlux.jl](https://github.com/DhairyaLGandhi/DaggerFlux.jl).
 * **Play nicely with others**. Flux works well with Julia libraries from [data frames](https://github.com/JuliaComputing/JuliaDB.jl) and [images](https://github.com/JuliaImages/Images.jl) to [differential equation solvers](https://github.com/JuliaDiffEq/DifferentialEquations.jl), so you can easily build complex data processing pipelines that integrate Flux models.
 
diff --git a/docs/src/training/optimisers.md b/docs/src/training/optimisers.md
index 44bbdadf6d..84cf018fb8 100644
--- a/docs/src/training/optimisers.md
+++ b/docs/src/training/optimisers.md
@@ -71,7 +71,7 @@ AdaBelief
 
 Flux's optimisers are built around a `struct` that holds all the optimiser parameters along with a definition of how to apply the update rule associated with it. We do this via the `apply!` function which takes the optimiser as the first argument followed by the parameter and its corresponding gradient.
 
-In this manner Flux also allows one to create custom optimisers to be used seamlessly. Let's work this with a simple example.
+In this manner Flux also allows one to create custom optimisers to be used seamlessly. Let's work through a simple example.
 
 ```julia
 mutable struct Momentum
@@ -135,7 +135,7 @@ end
 loss(rand(10)) # around 0.9
 ```
 
-In this manner it is possible to compose optimisers for some added flexibility.
+It is possible to compose optimisers for some added flexibility.
 
 ```@docs
 Flux.Optimise.Optimiser
@@ -145,7 +145,7 @@ Flux.Optimise.Optimiser
 
 In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](https://darsnack.github.io/ParameterSchedulers.jl/dev/README.html). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimizers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.
 
-First, we import ParameterSchedulers.jl and initalize a cosine annealing schedule to varying the learning rate between `1e-4` and `1e-2` every 10 steps. We also create a new [`Momentum`](@ref) optimiser.
+First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between `1e-4` and `1e-2` every 10 steps. We also create a new [`Momentum`](@ref) optimiser.
 
 ```julia
 using ParameterSchedulers
diff --git a/docs/src/training/training.md b/docs/src/training/training.md
index bb545b22fb..79c906f86e 100644
--- a/docs/src/training/training.md
+++ b/docs/src/training/training.md
@@ -8,7 +8,7 @@ To actually train a model we need four things:
 * An [optimiser](optimisers.md) that will update the model parameters appropriately.
 
 Training a model is typically an iterative process, where we go over the data set,
-calculate the objective function over the datapoints, and optimise that.
+calculate the objective function over the data points, and optimise that.
 This can be visualised in the form of a simple loop.
 
 ```julia
@@ -41,7 +41,7 @@ more information can be found on [Custom Training Loops](../models/advanced.md).
 
 ## Loss Functions
 
 The objective function must return a number representing how far the model is from its target – the *loss* of the model. The `loss` function that we defined in [basics](../models/basics.md) will work as an objective.
-In addition to custom losses, model can be trained in conjuction with
+In addition to custom losses, a model can be trained in conjunction with
 the commonly used losses that are grouped under the `Flux.Losses` module.
 We can also define an objective in terms of some model:
 
@@ -57,10 +57,19 @@ ps = Flux.params(m)
 Flux.train!(loss, ps, data, opt)
 ```
 
-The objective will almost always be defined in terms of some *cost function* that measures the distance of the prediction `m(x)` from the target `y`. Flux has several of these built in, like `mse` for mean squared error or `crossentropy` for cross entropy loss, but you can calculate it however you want.
+The objective will almost always be defined in terms of some *cost function* that measures the distance of the prediction `m(x)` from the target `y`. Flux has several of these built-in, like `mse` for mean squared error or `crossentropy` for cross-entropy loss, but you can calculate it however you want.
 For a list of all built-in loss functions, check out the [losses reference](../models/losses.md).
 
-At first glance it may seem strange that the model that we want to train is not part of the input arguments of `Flux.train!` too. However the target of the optimizer is not the model itself, but the objective function that represents the departure between modelled and observed data. In other words, the model is implicitly defined in the objective function, and there is no need to give it explicitly. Passing the objective function instead of the model and a cost function separately provides more flexibility, and the possibility of optimizing the calculations.
+At first glance, it may seem strange that the model we want to train is not one of the input arguments of `Flux.train!`. However, the target of the optimizer is not the model itself, but the objective function that represents the departure between modelled and observed data. In other words, the model is implicitly defined in the objective function, and there is no need to give it explicitly. Passing the objective function instead of the model and a cost function separately provides more flexibility and the possibility of optimizing the calculations.
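+
+For instance, since the objective is an ordinary Julia function, it can compute more than the bare cost, such as adding a penalty term (a sketch reusing the `m`, `data` and `opt` from above; the `0.01f0` weight is arbitrary):
+
+```julia
+sqnorm(p) = sum(abs2, p)   # squared L2 norm of one parameter array
+penalised_loss(x, y) = Flux.Losses.mse(m(x), y) + 0.01f0 * sum(sqnorm, Flux.params(m))
+
+Flux.train!(penalised_loss, Flux.params(m), data, opt)
+```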
 
 ## Model parameters
 
@@ -68,7 +68,14 @@ The model to be trained must have a set of tracked parameters that are used to c
 
 Such an object contains a reference to the model's parameters, not a copy, such that after their training, the model behaves according to their updated values.
 
-Handling all the parameters on a layer by layer basis is explained in the [Layer Helpers](../models/basics.md) section. Also, for freezing model parameters, see the [Advanced Usage Guide](../models/advanced.md).
+Handling all the parameters on a layer-by-layer basis is explained in the [Layer Helpers](../models/basics.md) section. For freezing model parameters, see the [Advanced Usage Guide](../models/advanced.md).
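+
+As a quick illustration, you can collect the parameters of only part of a model, so that `train!` updates just those layers (a sketch with a hypothetical two-layer model):
+
+```julia
+model = Chain(Dense(10, 5, relu), Dense(5, 2))
+ps = Flux.params(model[2])   # only the second layer's weight and bias will be trained
+```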
 
 ```@docs
 Flux.params
@@ -93,7 +93,7 @@ using IterTools: ncycle
 data = ncycle([(x, y)], 3)
 ```
 
-It's common to load the `x`s and `y`s separately. In this case you can use `zip`:
+It's common to load the `x`s and `y`s separately. Here you can use `zip`:
 
 ```julia
 xs = [rand(784), rand(784), rand(784)]
@@ -159,8 +159,7 @@ end
 ## Custom Training loops
 
 The `Flux.train!` function can be very convenient, especially for simple problems.
-Its also very flexible with the use of callbacks.
-But for some problems its much cleaner to write your own custom training loop.
+For some problems, however, it's much cleaner to write your own custom training loop.
 An example follows that works similar to the default `Flux.train` but with no callbacks.
 You don't need callbacks if you just code the calls to your functions directly into the loop.
 E.g. in the places marked with comments.
@@ -179,8 +178,8 @@ function my_custom_train!(loss, ps, data, opt)
     end
     # Insert whatever code you want here that needs training_loss, e.g. logging.
     # logging_callback(training_loss)
-    # Insert what ever code you want here that needs gradient.
-    # E.g. logging with TensorBoardLogger.jl as histogram so you can see if it is becoming huge.
+    # Insert whatever code you want here that needs gradients.
+    # E.g. logging histograms with TensorBoardLogger.jl to check for exploding gradients.
     update!(opt, ps, gs)
     # Here you might like to check validation set accuracy, and break out to do early stopping.
   end
@@ -202,7 +201,7 @@ function my_custom_train!(loss, ps, data, opt)
     # logging_callback(training_loss)
     # Apply back() to the correct type of 1.0 to get the gradient of loss.
     gs = back(one(train_loss))
-    # Insert what ever code you want here that needs gradient.
+    # Insert whatever code you want here that needs gradients.
     # E.g. logging with TensorBoardLogger.jl as histogram so you can see if it is becoming huge.
     update!(opt, ps, gs)
     # Here you might like to check validation set accuracy, and break out to do early stopping.
diff --git a/docs/src/utilities.md b/docs/src/utilities.md
index c6e6cffef2..6e6226a45f 100644
--- a/docs/src/utilities.md
+++ b/docs/src/utilities.md
@@ -122,7 +122,7 @@ Flux.skip
 Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum `patience`. For example, you can use `early_stopping` to stop training when the model is converging or deteriorating, or you can use `plateau` to check if the model is stagnating.
 
-For example, below we create a pseudo-loss function that decreases, bottoms out, then increases. The early stopping trigger will break the loop before the loss increases too much.
+For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.
 
 ```julia
 # create a pseudo-loss that decreases for 4 calls, then starts increasing
 # we call this like loss()
@@ -143,7 +143,7 @@ es = early_stopping(loss, 2; init_score = 9)
 end
 ```
 
-The keyword argument `distance` of `early_stopping` is a function of the form `distance(best_score, score)`. By default `distance` is `-`, which implies that the monitored metric `f` is expected to be decreasing and mimimized. If you use some increasing metric (e.g. accuracy), you can customize the `distance` function: `(best_score, score) -> score - best_score`.
+The keyword argument `distance` of `early_stopping` is a function of the form `distance(best_score, score)`. By default `distance` is `-`, which implies that the monitored metric `f` is expected to be decreasing and minimized. If you use some increasing metric (e.g. accuracy), you can customize the `distance` function: `(best_score, score) -> score - best_score`.
 ```julia
 # create a pseudo-accuracy that increases by 0.01 each time from 0 to 1
 # we call this like acc()
diff --git a/src/optimise/train.jl b/src/optimise/train.jl
index b6dac7951b..dac294cb19 100644
--- a/src/optimise/train.jl
+++ b/src/optimise/train.jl
@@ -87,7 +87,7 @@ Here `pars` is produced by calling [`Flux.params`](@ref) on your model.
 (Or just on the layers you want to train, like `train!(loss, params(model[1:end-2]), data, opt)`.)
 This is the "implicit" style of parameter handling.
 
-Then, this gradient is used by optimizer `opt` to update the paramters:
+This gradient is then used by the optimizer `opt` to update the parameters:
 ```
 update!(opt, pars, grads)
 ```