Personal project with a simple objective:
learn deep learning by creating a neural network capable of image classification,
without using high-level libraries like PyTorch or TensorFlow!
Dataset https://huggingface.co/datasets/ylecun/mnist
Model trained on 100 batches, 1 epoch, learning rate set to 0.03
Python 3.11
Miniconda
pip install -r requirements.txt
Thoughts:
Linear algebra is Really Cool!
Requirement for the dot product of m1 x m2
is that m1.shape[1] == m2.shape[0] (columns of m1 == rows of m2)
import numpy as np
m1 = np.zeros((10, 784))  # e.g. weight matrix: 10 neurons, 784 inputs
m2 = np.zeros((784, 1))   # e.g. one flattened 28x28 MNIST image as a column vector
print(np.dot(m1, m2).shape)
>>> (10, 1)
"A.shape will return a tuple (m, n), where m is the number of rows, and n is the number of columns."
When your neural network ends with ...->normalization->loss function, the gradient calculation starts from the loss function, NOT from the normalization.
Backpropagation might be hard to understand at first, but once you do understand it, it's quite a simple concept.
Softmax + Cross Entropy Loss = Good (easy derivative calculation)
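A quick sketch of why that pairing is convenient: when softmax and cross-entropy are combined, the gradient of the loss with respect to the logits collapses to (probabilities - one-hot label). The function names below are just illustrative, not taken from this repo:
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(probs, label):
    return -np.log(probs[label])        # label = index of the correct class

logits = np.array([2.0, 1.0, 0.1])
label = 0
probs = softmax(logits)
print(cross_entropy(probs, label))      # the loss for this single example

one_hot = np.zeros_like(probs)
one_hot[label] = 1.0
print(probs - one_hot)                  # dL/dlogits = probs - one_hot, that's the whole derivative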
If a matrix product unexpectedly comes out square, (N, N), the operands were probably in the wrong order or orientation; try transposing one of them!
Useful sources:
https://www.parasdahal.com/softmax-crossentropy
https://youtu.be/VMj-3S1tku0?si=WmW3sketuuTzM0qv
https://github.com/karpathy/micrograd
https://cs231n.github.io/optimization-2/#mat
We represent each weight as a tuple/class holding the data/value and a "global" gradient. The "global" gradient tells us in which direction shifting the data by a small amount will increase the final output.
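A minimal sketch of that idea in the spirit of micrograd (the class and attribute names are my guesses, not necessarily the ones used in this repo): every value stores its data, its "global" gradient, and enough bookkeeping to push gradients to the values it was computed from.
class Value:
    # data - the value this node holds
    # grad - "global" gradient dLoss/dself, filled in during backpropagation
    def __init__(self, data, _inputs=(), _local_grads=()):
        self.data = data
        self.grad = 0.0
        self._inputs = _inputs            # values this one was computed from
        self._local_grads = _local_grads  # d(self)/d(input) for each input

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self, upstream=1.0):
        # upstream is the global gradient coming from the operation above
        self.grad += upstream
        for inp, local in zip(self._inputs, self._local_grads):
            inp.backward(upstream * local)

x, w = Value(2.0), Value(0.4)
z1 = x * w
z1.backward()      # seeds dz1/dz1 = 1
print(w.grad)      # dz1/dw = x.data = 2.0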
In backpropagation.py the example has:
x - input
w - weight
b - bias
y - label (we want to teach the network that f(x)=y)
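For reference, the forward pass that the walkthrough below steps through backwards is, as far as I can reconstruct it from the steps (the function name is just for illustration):
def forward(x, w, b, y):
    z1 = x * w      # input times weight
    z2 = z1 + b     # plus bias -> the network's prediction
    z3 = y - z2     # difference between label and prediction
    return z3 ** 2  # squared loss L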
Notation dx/dy means the derivative of x with respect to y:
e.g. x = 2y^3 + z, dx/dx = ?, dx/dy = ?, dx/dz = ?
dx/dy = 2*3y^2 + 0
dx/dy = 6y^2
dx/dz = 0 + 1
dx/dz = 1
dx/dx = 1 always
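A handy way to double-check derivatives like these (and later whole backward passes) is a numerical finite-difference check; a small sketch, where the values of y, z and the step size h are arbitrary:
def f(y, z):
    return 2 * y**3 + z

y, z, h = 1.5, 0.7, 1e-6
print((f(y + h, z) - f(y - h, z)) / (2 * h), 6 * y**2)  # both ~13.5, so dx/dy = 6y^2 checks out
print((f(y, z + h) - f(y, z - h)) / (2 * h))            # ~1.0, so dx/dz = 1 checks out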
Backpropagation starts from the end of the network:
- Squared loss: gradient = dL/dL = 1 (the loss node itself always has gradient = 1)
L = z3**2
gradient = dL/dz3 = 2*z3
(z3.data is the value that z3 currently holds.) Let's say z3.data = 1.27,
then z3.grad = 2 * z3.data = 2 * 1.27 = 2.54
So if we increase z3, the output increases; since that output is the loss and we want to minimize it, the update negates the gradient, e.g. w.data += -w.grad * learning_rate
z3 = y - z2
gradient = dL/dy = dL/dz3 * dz3/dy
(we apply the chain rule here) To get the global gradient for y we multiply the global gradient coming from the operation above with the local gradient.
We take the global gradient from one step above, here from the loss step: dL/dz3 = z3.grad = 2.54
The local gradients are dz3/dy = 1 + 0 = 1
and dz3/dz2 = 0 - 1 = -1
global_gradient = global_gradient_from_parent_operation * local_gradient
dL/dy  = z3.grad * dz3/dy  = 2.54 * 1  -> y.grad  = 2.54
dL/dz2 = z3.grad * dz3/dz2 = 2.54 * -1 -> z2.grad = -2.54
(y doesn't need gradient anyway since it's a label, we can't control it)
z2 = z1 + b
dL/dz1 = z2.grad * dz2/dz1 = -2.54 * 1 -> z1.grad = -2.54
dL/db  = z2.grad * dz2/db  = -2.54 * 1 -> b.grad  = -2.54
z1 = x * w
(let's assume x = 2)
dL/dw = z1.grad * dz1/dw = -2.54 * x = -2.54 * 2 -> w.grad = -5.08
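The whole walkthrough, condensed into a few lines of plain Python. x = 2 and z3 = 1.27 come from the text; w = 0.4, b = 0.3 and y = 2.37 are values I picked so that z3 comes out to 1.27 (floating point makes the printed numbers only approximately equal, hence the rounding):
x, w, b, y = 2.0, 0.4, 0.3, 2.37

# forward pass
z1 = x * w       # 0.8
z2 = z1 + b      # 1.1
z3 = y - z2      # 1.27
L = z3 ** 2      # squared loss

# backward pass: global gradient = gradient from the parent operation * local gradient
L_grad = 1.0                 # dL/dL
z3_grad = L_grad * 2 * z3    # dL/dz3 = 2*z3          ->  2.54
y_grad = z3_grad * 1         # local dz3/dy  =  1     ->  2.54
z2_grad = z3_grad * -1       # local dz3/dz2 = -1     -> -2.54
z1_grad = z2_grad * 1        # local dz2/dz1 =  1     -> -2.54
b_grad = z2_grad * 1         # local dz2/db  =  1     -> -2.54
w_grad = z1_grad * x         # local dz1/dw  =  x = 2 -> -5.08
print(round(z3_grad, 2), round(b_grad, 2), round(w_grad, 2))  # 2.54 -2.54 -5.08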
Now we know in which direction we should shift the weight and bias to minimize the loss.
Remember, the gradient shows in which direction the final output (aka the loss) increases.
We negate the gradient and shift the weights accordingly; learning_rate is a hyperparameter:
w.data += -w.grad * learning_rate
b.data += -b.grad * learning_rate
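Repeating forward -> backward -> update is the whole training loop. A tiny sketch with plain floats, reusing the assumed numbers from the example above and the learning rate of 0.03 mentioned at the top; the loss shrinks every step:
x, y = 2.0, 2.37           # input and label from the example above
w, b = 0.4, 0.3            # assumed starting weight and bias
learning_rate = 0.03

for step in range(5):
    z3 = y - (x * w + b)        # forward pass
    loss = z3 ** 2
    w_grad = 2 * z3 * -1 * x    # dL/dw, as derived in the walkthrough
    b_grad = 2 * z3 * -1        # dL/db
    w += -w_grad * learning_rate
    b += -b_grad * learning_rate
    print(step, loss)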
Neural networks are built so that the weights belong to neurons, and neurons are grouped into layers. A few layers connected to each other form the full neural network.
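A rough sketch of that structure with NumPy: a layer is a weight matrix of shape (number of neurons, number of inputs) plus a bias vector, and layers are chained by feeding one layer's output into the next. The shapes mirror MNIST (784 inputs, 10 classes); the hidden size of 32 and the function names are my own choices, not necessarily what this repo uses, and activations/softmax are left out to keep it short:
import numpy as np

rng = np.random.default_rng(0)

def layer(n_inputs, n_neurons):
    W = rng.standard_normal((n_neurons, n_inputs)) * 0.01  # one row of weights per neuron
    b = np.zeros((n_neurons, 1))
    return W, b

def forward(layers, x):
    for W, b in layers:
        x = W @ x + b   # each layer feeds the next (activations omitted here)
    return x

network = [layer(784, 32), layer(32, 10)]   # 784 -> 32 -> 10
image = rng.random((784, 1))                # one flattened 28x28 image as a column vector
print(forward(network, image).shape)        # (10, 1)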