The aim of this project is to consolidate my understanding of neural networks, and to refine my internal representation of a neural network as a computation graph.
I also wanted to gain intuition about how and why different optimizers converge and behave. Therefore, I implemented a number of optimizers from scratch, based on the papers in which they were published.
In this IPython notebook, I wrote a neural network using an object-oriented approach and tested it on the MNIST dataset. The optimizers are contained in this script.
For the tests, the network architecture was two linear layers with ReLU activations, followed by an output layer feeding into a softmax function. The Layer and Model objects can handle an arbitrary number of layers, each with a different number of units.
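To make the architecture concrete, here is a minimal NumPy sketch of the forward pass it describes. It is not the notebook's Layer/Model code: the function names, the hidden sizes (128 and 64), and the He-style initialisation are all assumptions chosen for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # Subtract the row-wise max before exponentiating for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(sizes, seed=0):
    # One (W, b) pair per layer; He-style scaling is an assumption here.
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(params, x):
    # Hidden layers: linear + ReLU. Output layer: linear + softmax.
    a = x
    for W, b in params[:-1]:
        a = relu(a @ W + b)
    W, b = params[-1]
    return softmax(a @ W + b)

# 784 inputs (flattened 28x28 MNIST image), two hidden layers, 10 classes.
params = init_params([784, 128, 64, 10])
probs = forward(params, np.zeros((5, 784)))  # batch of 5 dummy images
```

Because `init_params` takes a list of sizes, the same loop covers any number of layers, which mirrors the "arbitrary number of layers" idea above.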
The optimizers I have implemented in this notebook so far include:
- Minibatch Gradient Descent (Vanilla)
- SGD with Momentum
- Nesterov Momentum (or Nesterov Accelerated Gradient)
- Adagrad
- RMSprop
- Adam
- Nadam
- Adadelta
- Adamax
- QHAdam
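As a rough illustration of what two of the update rules in this list look like, here is a sketch of single steps for SGD with momentum and for Adam, written as plain functions. The function names, signatures, and default hyperparameters are hypothetical, and the code in this repository may be organised differently.

```python
import numpy as np

def sgdm_step(w, grad, v, lr=0.01, beta=0.9):
    # Accumulate a velocity from past gradients, then step along it.
    v = beta * v + grad
    w = w - lr * v
    return w, v

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages start at zero (t counts from 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimise f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w_sgdm = np.array([1.0, -2.0])
vel = np.zeros_like(w_sgdm)
for _ in range(200):
    w_sgdm, vel = sgdm_step(w_sgdm, w_sgdm, vel, lr=0.1, beta=0.9)

w0 = np.array([1.0, -2.0])
w_adam, m, v = adam_step(w0, w0, np.zeros(2), np.zeros(2), t=1)
```

On its first step Adam moves each parameter by roughly `lr` in the direction of the gradient's sign, regardless of the gradient's magnitude, which is one way to see why its step size is less tied to the scale of the gradients.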
Decaying Momentum (Demon) can be applied to any optimizer that inherits from the Adam class or the SGDM class, and decoupled weight decay can be applied to any optimizer that inherits from the Adam class. This can result in optimizers such as DemonQHAdamW or DemonNesterov.
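The two add-ons can be sketched independently of any class hierarchy. Below, `demon_beta` follows the decay schedule from the Demon paper, and `adamw_step` applies weight decay directly to the weights instead of adding it to the gradient. The function names and defaults are hypothetical, and scaling the decay term by the learning rate is an assumption (it matches common implementations, but not necessarily this repository).

```python
import numpy as np

def demon_beta(step, total_steps, beta_init=0.9):
    # Momentum decays from beta_init at step 0 to 0 at total_steps.
    frac = 1.0 - step / total_steps
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # Standard Adam moment estimates; the decay term is NOT mixed into grad.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weights directly, outside the
    # adaptive rescaling (scaling by lr is an assumption, see above).
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

# Momentum schedule over a hypothetical 100-step run.
betas = [demon_beta(s, 100) for s in range(101)]

# With a zero gradient, only the decay term changes the weights.
w0 = np.array([1.0, -2.0])
w1, m1, v1 = adamw_step(w0, np.zeros(2), np.zeros(2), np.zeros(2), t=1)
```

In a combined optimizer such as DemonQHAdamW, the scheduled value would presumably replace the fixed momentum coefficient at each step. That wiring is left out here because it depends on how the base classes are written.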
The graph below shows training loss over epochs for a few select optimizers:
This one shows validation accuracy over epochs:
QHAdamW performed best on training loss, while Nesterov performed best on validation accuracy in this task.
SGD with momentum and Nesterov momentum may be 'simpler' gradient descent algorithms, but they converge quite well over epochs.
From my previous tests, however, these momentum optimizers are quite sensitive to the learning rate, unlike algorithms from the "Adam's family".
Possible next steps:
- Perhaps convert optimizers to separate objects for easier handling of arguments / optional parameters
- Convolutional layer and pooling from scratch, to test with the CIFAR10 dataset