Implementation of a two-layer perceptron (from scratch) with four back-propagation methods in Python
For generating the Mackey-Glass time series, I used the formula below:
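A minimal sketch of how such a series can be generated, assuming the standard Mackey-Glass delay differential equation dx/dt = beta * x(t - tau) / (1 + x(t - tau)^n) - gamma * x(t) with the commonly used values beta = 0.2, gamma = 0.1, n = 10, tau = 17 (the function name, initial condition, and sample count below are illustrative assumptions, not taken from the original code):

```python
import numpy as np

def mackey_glass(n_samples=1500, beta=0.2, gamma=0.1, n=10, tau=17, x0=1.2):
    """Generate a Mackey-Glass series with a simple Euler step (dt = 1)."""
    x = np.zeros(n_samples + tau)
    x[:tau + 1] = x0                       # constant history for the first tau steps
    for t in range(tau, n_samples + tau - 1):
        x_tau = x[t - tau]                 # delayed value x(t - tau)
        x[t + 1] = x[t] + beta * x_tau / (1.0 + x_tau ** n) - gamma * x[t]
    return x[tau:]                         # drop the artificial history

series = mackey_glass()
```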
You can also see the plot of data below:
The goal is to predict x(t + 1) from x(t - 2), x(t - 1) and x(t). Thus, our neural network has three-dimensional inputs:
input: [x(t - 2), x(t - 1), x(t)]
output: x(t + 1)
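Continuing the sketch above, the input/target pairs can be built with a sliding window over the series (the array names are illustrative):

```python
# Build (x(t-2), x(t-1), x(t)) -> x(t+1) pairs with a sliding window.
X = np.array([series[t - 2:t + 1] for t in range(2, len(series) - 1)])
y = np.array([series[t + 1] for t in range(2, len(series) - 1)])
```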
I've used 70 percent of the data as training data, 25 percent as validation data, and 5 percent as test data. For more stable training, I normalized the data with the min-max normalization method.
As a result of normalizing the data, the data range changes to [0, 1], allowing us to use the unipolar sigmoid function as our activation function.
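A minimal sketch of the chronological split and min-max scaling; fitting the scaling range on the training portion only is my assumption, the original may scale the whole series at once:

```python
n = len(X)
n_train, n_val = int(0.70 * n), int(0.25 * n)
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]

x_min, x_max = X_train.min(), X_train.max()   # range estimated from the training split

def scale(a):
    return (a - x_min) / (x_max - x_min)      # min-max scaling into [0, 1]

X_train, X_val, X_test = scale(X_train), scale(X_val), scale(X_test)
y_train, y_val, y_test = scale(y_train), scale(y_val), scale(y_test)
```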
As I mentioned above, the input dimension of the network is three. I've used five neurons in the hidden layer and one neuron in the output layer. This is the network architecture:
At first, a uniform distribution is used to randomly initialize the network's weights (W1, W2); then I've used the different methods described below to train the network:
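As a reference point for the methods below, here is a minimal sketch of the initialization and forward pass, assuming a unipolar sigmoid in both layers, no bias terms, and a uniform range of [-0.5, 0.5] (the range and seed are illustrative assumptions):

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # unipolar sigmoid, output in (0, 1)

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, size=(5, 3))      # hidden layer: 5 neurons, 3 inputs
W2 = rng.uniform(-0.5, 0.5, size=(1, 5))      # output layer: 1 neuron, 5 hidden units

def forward(x, W1, W2):
    """x is a length-3 input vector; returns hidden activations and the prediction."""
    h = sigmoid(W1 @ x)                       # hidden activations, shape (5,)
    y_hat = sigmoid(W2 @ h)[0]                # scalar prediction in (0, 1)
    return h, y_hat
```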
1: stochastic gradient descent
This method updates the network parameters (weights) after every training sample, which makes it sensitive to noisy data.
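A minimal sketch of one such online update for this 3-5-1 network, assuming a squared-error loss and the forward pass above (the learning rate value is an assumption):

```python
def sgd_step(x, y, W1, W2, lr=0.1):
    """One per-sample gradient step (stochastic gradient descent)."""
    h, y_hat = forward(x, W1, W2)
    e = y_hat - y                                   # prediction error
    delta2 = e * y_hat * (1.0 - y_hat)              # output-layer local gradient
    delta1 = (W2.ravel() * delta2) * h * (1.0 - h)  # hidden-layer local gradients
    W2 -= lr * delta2 * h[np.newaxis, :]            # dE/dW2, shape (1, 5)
    W1 -= lr * np.outer(delta1, x)                  # dE/dW1, shape (5, 3)
    return W1, W2, e
```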
You can also see the results below:
2: emotional learning
This method is very similar to SGD, but it also uses the error from the previous step, which helps the network learn faster and more accurately.
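One plausible reading of this, sketched below, is to mix the current error with the error from the previous sample before back-propagating; the weights k1, k2 and their values are illustrative assumptions, not the original formulation:

```python
def emotional_step(x, y, W1, W2, prev_e, lr=0.1, k1=1.0, k2=0.5):
    """SGD-like step whose error signal also includes the previous sample's error."""
    h, y_hat = forward(x, W1, W2)
    e = y_hat - y
    e_emo = k1 * e + k2 * prev_e                    # "emotional" error signal
    delta2 = e_emo * y_hat * (1.0 - y_hat)
    delta1 = (W2.ravel() * delta2) * h * (1.0 - h)
    W2 -= lr * delta2 * h[np.newaxis, :]
    W1 -= lr * np.outer(delta1, x)
    return W1, W2, e                                # pass e back in as prev_e next call
```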
3: adaptive learning rate
In this method, we assign a separate learning rate to each trainable parameter: every element of the weight matrices gets its own learning rate, and these rates are themselves adapted during the learning process. (This method therefore has more learning parameters, roughly twice as many as before, so the MSE may fluctuate and training may take longer.)
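A minimal sketch of per-parameter learning rates: each element of W1 and W2 gets its own rate, which grows while the gradient keeps its sign and shrinks when the sign flips (this delta-bar-delta style adaptation rule and its constants are assumptions; the original may adapt the rates differently):

```python
def adaptive_lr_step(x, y, W1, W2, LR1, LR2, prev_g1, prev_g2, up=1.05, down=0.7):
    """One per-sample step with an individually adapted learning rate per weight."""
    h, y_hat = forward(x, W1, W2)
    e = y_hat - y
    delta2 = e * y_hat * (1.0 - y_hat)
    delta1 = (W2.ravel() * delta2) * h * (1.0 - h)
    g2 = delta2 * h[np.newaxis, :]                  # gradient w.r.t. W2
    g1 = np.outer(delta1, x)                        # gradient w.r.t. W1
    LR2 *= np.where(g2 * prev_g2 > 0, up, down)     # element-wise rate adaptation
    LR1 *= np.where(g1 * prev_g1 > 0, up, down)
    W2 -= LR2 * g2
    W1 -= LR1 * g1
    return W1, W2, LR1, LR2, g1, g2

# Possible initialization: LR1 = np.full_like(W1, 0.1), LR2 = np.full_like(W2, 0.1),
# prev_g1 = np.zeros_like(W1), prev_g2 = np.zeros_like(W2).
```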
4: Levenberg-Marquardt
This algorithm is a generalization of the Gauss-Newton algorithm, designed to increase the convergence speed of second-order optimization. If μ(t) equals zero, the algorithm reduces to Gauss-Newton; if μ(t) is large, it behaves much like gradient descent (SGD). This is a batch method: unlike the methods above, the parameters are not updated as each new training sample arrives, but only after all the training samples have been presented. After each epoch, the parameters are updated using the Jacobian matrix, which collects the information from the whole epoch. (This method is suitable for small datasets; as the dataset grows, the amount of computation increases sharply.)
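A minimal sketch of one Levenberg-Marquardt epoch for this network: all weights are flattened into one parameter vector, the Jacobian of the residuals is built row by row over the whole training set, and the damped Gauss-Newton step (J^T J + mu*I)^-1 J^T r is applied once per epoch (here mu is kept fixed for simplicity and its value is an assumption; in practice it is usually increased when the error grows and decreased when it shrinks):

```python
def lm_epoch(X, y, W1, W2, mu=0.01):
    """One Levenberg-Marquardt update over the whole training set."""
    n_p = W1.size + W2.size
    J = np.zeros((len(X), n_p))                     # Jacobian of the residuals
    r = np.zeros(len(X))                            # residuals y_hat - y
    for i, (x, t) in enumerate(zip(X, y)):
        h, y_hat = forward(x, W1, W2)
        r[i] = y_hat - t
        d_out = y_hat * (1.0 - y_hat)               # d y_hat / d (W2 @ h)
        dW2 = d_out * h                             # d y_hat / d W2, shape (5,)
        dW1 = np.outer(W2.ravel() * d_out * h * (1.0 - h), x)   # d y_hat / d W1
        J[i] = np.concatenate([dW1.ravel(), dW2.ravel()])
    # Damped Gauss-Newton step: (J^T J + mu I) dw = J^T r
    dw = np.linalg.solve(J.T @ J + mu * np.eye(n_p), J.T @ r)
    W1 -= dw[:W1.size].reshape(W1.shape)
    W2 -= dw[W1.size:].reshape(W2.shape)
    return W1, W2, r
```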
The image below shows that the best result for the network is achieved with the Levenberg-Marquardt algorithm: