$$ W^{[layer]}_{unit} \text{, of shape } (n^{[layer]} \times n^{[layer - 1]}) $$

$$ X^{(example)}_{feature} \text{, of shape } (n_x \times m) $$

$$ Y^{(example)}_{targets} \text{, of shape } (n_y \times m) $$

$$ X = A^{[0]} = \left[ \begin{array}{cccc} | & | & & | \\ | & | & & | \\ X^{(1)} & X^{(2)} & \dots & X^{(m)} \\ | & | & & | \\ | & | & & | \\ \end{array} \right] , X \in \mathbb{R}^{n_x \times m} $$


$$ Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]} $$

$$ A^{[l]} = \sigma^{[l]} (Z^{[l]}) $$

Keeping NumPy broadcasting in mind, here are the dimensions of the matrices and vectors involved (the $n^{[1]} \times 1$ bias column is broadcast across the $m$ columns):

$$ \underset{n^{[1]} \times m}{Z^{[1]}} = \underset{n^{[1]} \times n^{[0]}}{W^{[1]}} \cdot \underset{n^{[0]}\times m}{A^{[0]}} + \underset{n^{[1]}\times 1}{b^{[1]}} $$
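The shapes above can be checked with a few lines of NumPy. This is only a minimal sketch: the sizes `n0`, `n1`, `m`, the random values, and the choice of a sigmoid for $\sigma^{[1]}$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n^{[0]} input features, n^{[1]} units, m examples.
n0, n1, m = 3, 4, 5

A0 = rng.standard_normal((n0, m))   # A^{[0]} = X, one example per column
W1 = rng.standard_normal((n1, n0))  # W^{[1]}, one row of weights per unit
b1 = rng.standard_normal((n1, 1))   # b^{[1]}, a single column


def sigmoid(z):
    """Element-wise logistic function (assumed activation for this sketch)."""
    return 1.0 / (1.0 + np.exp(-z))


# b1 has shape (n1, 1); broadcasting repeats it across the m columns.
Z1 = W1 @ A0 + b1
A1 = sigmoid(Z1)

print(Z1.shape, A1.shape)  # (4, 5) (4, 5)
```

Each column of `Z1` is the pre-activation of one example, which is why the bias only needs a single column.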

$$ \def\horzbar{\text{--------}} Z^{[l]} = \left[ \begin{array}{ccc} \horzbar & w^{[l]T}_{1} & \horzbar \\ \horzbar & w^{[l]T}_{2} & \horzbar \\ \horzbar & w^{[l]T}_{3} & \horzbar \\ \horzbar & w^{[l]T}_{4} & \horzbar \\
\end{array} \right] \cdot \begin{bmatrix} x_{1}^{(1)} & | & & | \\ x_{2}^{(1)} & x^{(2)} & \dots & x^{(m)} \\ x_{3}^{(1)} & | & & | \\ x_{4}^{(1)} & | & & | \\ \end{bmatrix} + \left[ \begin{array}{c} b^{[l]}_{1} \\ b^{[l]}_{2} \\ b^{[l]}_{3} \\ b^{[l]}_{4} \\ \end{array} \right] = \left[ \begin{array}{cccc} | & | & & | \\ | & | & & | \\ Z^{[l](1)} & Z^{[l](2)} & \dots & Z^{[l](m)} \\ | & | & & | \\ | & | & & | \\ \end{array} \right] $$



$\delta^l \equiv \delta A^l$

| Definition | Equation |
| --- | --- |
| Output layer error | ${\color{green}\delta^L} = \nabla_a C \odot \sigma'(z^L)$ |
| Layer error | $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ |
| Cost partial derivative for biases | $\frac{\partial C}{\partial b^l_j} = \delta^l_j$ |
| Cost partial derivative for weights | $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ |

1. The error in the output layer $\delta^L$

$$ \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) $$

Matrix based version: $$ \delta^L = \nabla_a C \odot \sigma'(z^L) $$

| Name | $C$ | $\nabla_a C$ |
| --- | --- | --- |
| Mean Squared Error | $\frac{1}{2}\sum_j(y_j-a^L_j)^2$ | $a^L-y$ |
| Binary Cross Entropy | $-\left(y\log{a^L} + (1-y)\log(1-a^L)\right)$ | $\frac{-y}{a^L}+\frac{1-y}{1-a^L}$ |
| Cross Entropy | $-\sum_{k}p(k)\log(q(k))$ | |
| SoftMax | $\sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{n^L} e^{z_{j}}}$ | $\sigma(\vec{z})_{i}(I_{ij} - \sigma(\vec{z})_{j})$, where $I$ is the identity matrix |

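Here $\delta_{ij}$ is the Kronecker delta (not a layer error):

$$ \frac{\partial \sigma(\vec{z})_{i}}{\partial z_{j}} = \sigma(\vec{z})_{i}(\delta_{ij} - \sigma(\vec{z})_{j}), \qquad \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} $$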
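The softmax derivative above can be written as a full Jacobian matrix. The sketch below is one possible way to build it in NumPy; the function names `softmax` and `softmax_jacobian` and the sample vector `z` are assumptions made for illustration.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D vector; the shift by max(z) only improves stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def softmax_jacobian(z):
    """J[i, j] = s_i * (delta_ij - s_j), with s = softmax(z)."""
    s = softmax(z)
    # np.diag(s) carries the delta_ij * s_i term, np.outer(s, s) the s_i * s_j term.
    return np.diag(s) - np.outer(s, s)


z = np.array([0.5, 1.5, -1.0])  # hypothetical pre-activations
J = softmax_jacobian(z)
print(J.shape)  # (3, 3)
```

Because the softmax outputs always sum to 1, every column of `J` sums to 0, which is a quick sanity check on an implementation.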

SoftMax derivation

$$ \sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{n^L} e^{z_{j}}} $$

$$ \begin{aligned} \frac{\partial L}{\partial o_i} &= -\sum_k y_k\frac{\partial \log p_k}{\partial o_i} = -\sum_k y_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i} \\ &= -y_i(1-p_i)-\sum_{k\neq i}y_k\frac{1}{p_k}({\color{red}{-p_kp_i}}) \\ &= -y_i(1-p_i)+\sum_{k\neq i}y_k({\color{red}{p_i}}) \\ &= -y_i+{\color{blue}{y_ip_i+\sum_{k\neq i}y_k({p_i})}} \\ &= {\color{blue}{p_i\left(\sum_ky_k\right)}}-y_i = p_i-y_i \end{aligned} $$
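The result $p_i - y_i$ can be compared against a finite-difference estimate of the gradient. This is a rough numerical check rather than a proof; the logits `o`, the one-hot target `y`, and the step size `eps` are assumed values.

```python
import numpy as np


def softmax(o):
    """Softmax of a 1-D vector of logits (shifted for numerical stability)."""
    e = np.exp(o - np.max(o))
    return e / e.sum()


def cross_entropy(o, y):
    """L = -sum_k y_k * log(p_k), with p = softmax(o)."""
    return -np.sum(y * np.log(softmax(o)))


o = np.array([1.0, -0.5, 2.0])  # hypothetical logits
y = np.array([0.0, 0.0, 1.0])   # one-hot target, so sum_k y_k = 1

# Closed form from the derivation: dL/do_i = p_i - y_i
analytic = softmax(o) - y

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.array([
    (cross_entropy(o + eps * np.eye(3)[i], y)
     - cross_entropy(o - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(np.max(np.abs(analytic - numeric)))  # expected to be very small
```

The last step of the derivation relies on $\sum_k y_k = 1$, so the check only holds for targets that sum to 1.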

For example, if we use the quadratic cost function $$ C=\frac{1}{2}\sum_j(y_j-a^L_j)^2 $$

Its derivative with respect to $a^L_j$ is: $$ \frac{\partial C}{\partial a^L_j}=(a^L_j-y_j) $$

And its vectorized form: $$ \nabla_a C = (a^L-y) $$

2. The error $\delta^l$ in terms of the error in the next layer

$$ \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) $$

3. Rate of change of the cost with respect to any bias

$$ \frac{\partial C}{\partial b^l_j} = \delta^l_j $$

4. Rate of change of the cost with respect to any weight

$$ \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j $$

This can be written in the simplified form:

$$ \frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out} $$
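The four equations above can be sketched as a small backpropagation routine. This is an illustrative sketch under stated assumptions, not a reference implementation: it handles one example at a time, uses a sigmoid in every layer and the quadratic cost, and the names `forward`, `backprop`, and the layer sizes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def forward(weights, biases, x):
    """Return the lists of z^l and a^l (with a^0 = x) for one column vector x."""
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    return zs, activations


def backprop(weights, biases, x, y):
    """Gradients of C = 1/2 * sum((y - a^L)^2) for a single example."""
    zs, activations = forward(weights, biases, x)
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)

    # Equation 1: delta^L = grad_a C * sigma'(z^L) (element-wise), grad_a C = a^L - y
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    for l in range(len(weights) - 1, -1, -1):
        grads_b[l] = delta                     # Equation 3: dC/db = delta
        grads_W[l] = delta @ activations[l].T  # Equation 4: dC/dw = a_in * delta_out
        if l > 0:
            # Equation 2: move the error one layer back.
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grads_W, grads_b


# Hypothetical 3-4-2 network with random parameters.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
x = rng.standard_normal((3, 1))
y = rng.standard_normal((2, 1))

grads_W, grads_b = backprop(weights, biases, x, y)
print([g.shape for g in grads_W])  # [(4, 3), (2, 4)]
```

Each gradient has the same shape as the parameter it belongs to, which follows from the outer product $a_{\rm in} \delta_{\rm out}$.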


Let's write the gradient of the cost with respect to the weights as: $$ \delta W \equiv \frac{\partial C}{\partial W} $$ The optimizer equations that follow work the same way for $\frac{\partial C}{\partial b}$.


For Momentum, $V\delta W$ should be initialized at $0$: $$ V\delta W = (\beta) V\delta W + (1 - \beta) \delta W
\newline W = W - \alpha \times V\delta W $$
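A minimal sketch of this update in NumPy follows. The function name `momentum_step`, the default values of `alpha` and `beta`, and the toy cost $C = \sum W^2$ are assumptions made for illustration.

```python
import numpy as np


def momentum_step(W, dW, V, alpha=0.01, beta=0.9):
    """One update: V is the exponentially weighted average of the gradients."""
    V = beta * V + (1.0 - beta) * dW
    W = W - alpha * V
    return W, V


W = np.array([1.0, -2.0])
V = np.zeros_like(W)  # initialized at 0

for _ in range(3):
    dW = 2.0 * W      # gradient of the hypothetical cost C = sum(W**2)
    W, V = momentum_step(W, dW, V)

print(W, V)
```

The function returns the new `V` so that the caller can carry the running average from one step to the next.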


For RMSprop, $S\delta W$ should be initialized at $0$ (the square $\delta W^2$ is taken element-wise): $$ S\delta W = (\beta) S\delta W + (1 - \beta) \delta W^2 \newline W = W - \alpha \times \frac{\delta W}{\sqrt{S\delta W} + \epsilon} $$
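The same update can be sketched in NumPy. The name `rmsprop_step` and the default values of `alpha`, `beta`, and `eps` are assumptions for this example, as is the toy cost $C = \sum W^2$.

```python
import numpy as np


def rmsprop_step(W, dW, S, alpha=0.01, beta=0.999, eps=1e-8):
    """One update: S is the exponentially weighted average of squared gradients."""
    S = beta * S + (1.0 - beta) * dW ** 2  # element-wise square
    W = W - alpha * dW / (np.sqrt(S) + eps)
    return W, S


W = np.array([1.0, -2.0])
S = np.zeros_like(W)  # initialized at 0

for _ in range(3):
    dW = 2.0 * W      # gradient of the hypothetical cost C = sum(W**2)
    W, S = rmsprop_step(W, dW, S)

print(W, S)
```

Dividing by $\sqrt{S\delta W}$ rescales each coordinate separately, so coordinates with large gradients take relatively smaller steps; $\epsilon$ only guards against division by zero.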


1. Computing the Momentum and RMSprop terms (Adam combines both)

$V\delta W$ and $S\delta W$ should be initialized at $0$

$$ V\delta W = (\beta_1) V\delta W + (1 - \beta_1) \delta W \newline S\delta W = (\beta_2) S\delta W + (1 - \beta_2) \delta W^2 $$

2. Correcting exponentially weighted averages

$t$ is the number of updates performed so far, starting at $1$: $$ V\delta W^{corrected} = \frac{V \delta W}{1 - \beta^t_1} \newline S\delta W^{corrected} = \frac{S \delta W}{1 - \beta^t_2} $$

3. Weight update

$$ W = W - \alpha \times \frac{V\delta W^{corrected}}{\sqrt{S\delta W^{corrected}} + \epsilon} $$

| Hyperparameter | Advised value |
| --- | --- |
| $\alpha$ | needs to be tuned |
| $\beta_1$ | $0.9$ |
| $\beta_2$ | $0.999$ |
| $\epsilon$ | $10^{-8}$ |
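The three Adam steps can be put together in a short NumPy sketch using the advised values for $\beta_1$, $\beta_2$, and $\epsilon$. The function name `adam_step`, the placeholder `alpha=0.001` (which still needs tuning), and the toy cost $C = \sum W^2$ are assumptions made for illustration.

```python
import numpy as np


def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based count of updates performed so far."""
    # 1. Momentum and RMSprop terms
    V = beta1 * V + (1.0 - beta1) * dW
    S = beta2 * S + (1.0 - beta2) * dW ** 2
    # 2. Bias correction of the exponentially weighted averages
    V_corrected = V / (1.0 - beta1 ** t)
    S_corrected = S / (1.0 - beta2 ** t)
    # 3. Weight update
    W = W - alpha * V_corrected / (np.sqrt(S_corrected) + eps)
    return W, V, S


W = np.array([1.0, -2.0])
V = np.zeros_like(W)  # both averages are initialized at 0
S = np.zeros_like(W)

for t in range(1, 4):
    dW = 2.0 * W      # gradient of the hypothetical cost C = sum(W**2)
    W, V, S = adam_step(W, dW, V, S, t)

print(W)
```

On the very first update the correction cancels the $(1 - \beta)$ factors, so each weight moves by roughly $\alpha$ in the direction opposite to its gradient, whatever the gradient's magnitude.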