The goal of this project is to develop a regression model based on a Long Short-Term Memory network that forecasts the price of Microsoft's stock looking solely at the last 21 days of data.
Effectively predicting stock prices has obvious useful applications, but it is important not to forget the stochastic nature of the market.
- Data Extraction
- Feature Engineering
- Exploratory Data Analysis
- Statistical Checks
- Feature Selection
- Modelling and Hyperparameter Optimization
As the very first part of this project, this section extracts the data from different sources and prepares it for analysis.
The Yahoo Finance API was the tool I used to extract the historical data of all the stocks. It is easy to use: I only had to specify the tickers and the start/end dates to pull all the data.
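A minimal sketch of this step, assuming the `yfinance` package (the common Python wrapper around the Yahoo Finance API); the tickers and date range here are illustrative, not necessarily the ones used in the project:

```python
import yfinance as yf

# Illustrative tickers and date range -- the actual lists used in the
# project may differ.
tickers = ["MSFT", "AAPL", "GOOG", "^GSPC"]

# One call pulls daily OHLCV history for every ticker at once;
# auto_adjust=False keeps the separate "Adj Close" column.
data = yf.download(tickers, start="2010-01-01", end="2020-12-31",
                   auto_adjust=False)

# Keep the adjusted close prices, one column per ticker.
prices = data["Adj Close"]
print(prices.tail())
```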
I wanted to use natural language processing, text analysis and computational linguistics to extract and identify subjective information in worldwide news. This way, my model could account for the sentiment that drives the behavior and trend of the stock market.
To do so, I used the finBERT project, which was made specifically for sentiment analysis of financial texts: it is trained on a large financial corpus and has therefore been fine-tuned for financial sentiment classification.
Dataset with news headlines used
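A minimal sketch of scoring headlines with finBERT, using its public Hugging Face checkpoint (`ProsusAI/finbert`); the headlines below are made up for illustration:

```python
from transformers import pipeline

# finBERT fine-tuned for financial sentiment (positive/negative/neutral);
# ProsusAI/finbert is the public Hugging Face checkpoint of the project.
classifier = pipeline("sentiment-analysis", model="ProsusAI/finbert")

headlines = [
    "Microsoft beats quarterly earnings expectations",  # illustrative
    "Tech stocks slide amid recession fears",           # illustrative
]

for headline, result in zip(headlines, classifier(headlines)):
    print(f"{result['label']:>8}  {result['score']:.3f}  {headline}")
```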
For trend analysis purposes, I computed the indicators below (a pandas sketch follows the list):
- Moving Averages (7 and 21 days)
- These help reduce the noise from random short-term price fluctuations.
- We use them to determine the trend direction and resistance levels.
- Moving Average Convergence Divergence (MACD)
- Shows the relationship between two moving averages.
- Helps identify bullish/bearish movements, and their intensity.
- Bollinger Bands
- Formed by the upper, the middle and the lower band, these help generate oversold/overbought signals.
- Exponential Moving Average (EMA)
- Moving average that gives more weight and importance to the most recent data points.
- Returns
- Simple indicator that computes the percentage change of the price relative to the prior day.
- Helps identify trends.
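The sketch referenced above, assuming `close` is a `pd.Series` of daily closing prices; window sizes follow the list (12/26-day EMAs for MACD and a two-standard-deviation band are common choices):

```python
import pandas as pd

def add_indicators(close: pd.Series) -> pd.DataFrame:
    df = pd.DataFrame({"close": close})

    # Moving averages (7 and 21 days) smooth short-term fluctuations.
    df["ma7"] = close.rolling(7).mean()
    df["ma21"] = close.rolling(21).mean()

    # Exponential moving averages weight recent prices more heavily;
    # MACD is the difference between the 12- and 26-day EMAs.
    ema12 = close.ewm(span=12, adjust=False).mean()
    ema26 = close.ewm(span=26, adjust=False).mean()
    df["ema12"], df["ema26"] = ema12, ema26
    df["macd"] = ema12 - ema26

    # Bollinger Bands: middle band +/- two rolling standard deviations.
    std21 = close.rolling(21).std()
    df["bb_upper"] = df["ma21"] + 2 * std21
    df["bb_lower"] = df["ma21"] - 2 * std21

    # Daily returns: percentage change relative to the prior day.
    df["returns"] = close.pct_change()
    return df
```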
Note: due to their high dependency on the target variable, many of these indicators were dropped before modelling and were used merely for visualization purposes.
I performed a complete EDA by visualizing the distributions, applying transformations and comparing the available features with correlation matrices and scatter plots.
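For instance, the correlation part of the EDA can be reproduced with a short sketch like the following, assuming `df` is the assembled feature DataFrame from the previous steps:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations across all features; `df` is assumed to hold
# the prices, indicators and sentiment scores built earlier.
corr = df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="RdBu_r", center=0)
plt.title("Feature correlation matrix")
plt.tight_layout()
plt.show()
```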
- Heteroscedasticity:
Heteroscedasticity is present in your data when the conditional variance is not constant, where conditional variance means the variability you see in y (the dependent variable) for each value of t (the time period).
Given the nature of most of the variables in the dataset (stock prices, indexes and currency pairs), and after seeing the plots in the EDA section, we can confirm that most of the features are nonlinear and heteroscedastic.
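A quick visual check for this, as a sketch assuming the `prices` frame from the extraction step: if the rolling standard deviation drifts over time, the conditional variance is not constant.

```python
import matplotlib.pyplot as plt

# A drifting 30-day rolling standard deviation indicates that the
# conditional variance changes over time, i.e. heteroscedasticity.
prices["MSFT"].rolling(window=30).std().plot(
    figsize=(10, 4), title="MSFT 30-day rolling standard deviation"
)
plt.show()
```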
- Multicollinearity:
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another.
The Variance Inflation Factor (VIF) lets us easily quantify the multicollinearity of X. VIF starts at 1 and has no upper limit.
- VIF = 1: no correlation between the independent variable and the other variables
- VIF > 20: high multicollinearity between this independent variable and the others
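A sketch of this check with statsmodels, assuming `X` is the DataFrame of independent variables:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(X: pd.DataFrame) -> pd.Series:
    # statsmodels computes VIF one column at a time; the constant term
    # is added so the intercept does not inflate the other factors.
    Xc = add_constant(X)
    vifs = [variance_inflation_factor(Xc.values, i)
            for i in range(Xc.shape[1])]
    return pd.Series(vifs, index=Xc.columns).drop("const")
```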
- Autocorrelation:
Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals.
[Source]
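A sketch of how to visualize this with statsmodels, again assuming the `prices` frame from earlier; the slow decay of the ACF is typical of stock price series:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Correlation between the series and lagged copies of itself,
# for lags 0..40.
plot_acf(prices["MSFT"].dropna(), lags=40)
plt.show()
```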
By using a tree-based algorithm, we can see how important each feature is when a decision is made. I therefore chose a popular algorithm, XGBoost, to see which features the model relied on most.
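A sketch of this step, assuming `X` and `y` are the feature matrix and the target price series; the hyperparameters are illustrative:

```python
import xgboost as xgb

# Fit a gradient-boosted tree regressor and read off how much each
# feature contributes to the splits.
model = xgb.XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X, y)

# Rank features by importance, highest first.
importance = sorted(zip(X.columns, model.feature_importances_),
                    key=lambda t: t[1], reverse=True)
for name, score in importance[:10]:
    print(f"{name:<20} {score:.3f}")
```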
I knew from the start that I wanted to use Long Short-Term Memory networks, due to their sequential nature and consequent good performance on time series.
The architectures I chose are:
- Univariate Vanilla-LSTM (baseline)
- Multivariate Vanilla-LSTM
- Multivariate Stacked-LSTM
- Multivariate CNN-LSTM
- Bidirectional-LSTM
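As an illustration, a minimal Keras sketch of the multivariate stacked variant, assuming 21-day input windows; the layer sizes and feature count are illustrative, not the exact ones used in the project:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 21, 10  # 21-day lookback; feature count illustrative

# Stacked LSTM: the first layer returns the full sequence so the second
# LSTM layer can consume it; a Dense head outputs the predicted price.
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(n_steps, n_features)),
    LSTM(32),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```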
For hyperparameter tuning and optimization, I used hyperopt with EarlyStopping callbacks, in order to iterate over different architectures and test them for a few epochs each.
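A sketch of the tuning loop, where `build_model` is a hypothetical helper that instantiates one of the architectures above from a set of hyperparameters, and the train/validation splits are assumed to exist:

```python
from hyperopt import fmin, tpe, hp
from tensorflow.keras.callbacks import EarlyStopping

# Illustrative search space; the real one covered the architectures above.
space = {
    "units": hp.choice("units", [32, 64, 128]),
    "lr": hp.loguniform("lr", -8, -3),
    "batch_size": hp.choice("batch_size", [16, 32, 64]),
}

def objective(params):
    # build_model is a hypothetical helper returning a compiled Keras
    # model for the given hyperparameters.
    model = build_model(params)
    history = model.fit(
        X_train, y_train,                      # assumed splits
        validation_data=(X_val, y_val),
        epochs=30,
        batch_size=params["batch_size"],
        callbacks=[EarlyStopping(patience=3, restore_best_weights=True)],
        verbose=0,
    )
    # hyperopt minimizes the returned value.
    return min(history.history["val_loss"])

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
```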
| Model | MSE (test) |
| --- | --- |
| Univariate Vanilla-LSTM (baseline) | 0.4053 |
| Multivariate Vanilla-LSTM | 0.3570 |
| Multivariate Stacked-LSTM | 1.2658 |
| Multivariate CNN-LSTM | 0.9602 |
| Bidirectional-LSTM | 0.3985 |
| Autoregressive Integrated Moving Average (ARIMA) | 0.5170 |
This project reported the results of experiments through which the performance, accuracy and training behavior of Vanilla/Stacked LSTM, CNN-LSTM and Bidirectional-LSTM models were analyzed, hyper-tuned and compared. I also added a forecast made with an ARIMA algorithm.
The main objective of this project was to test whether adding correlated assets and sentiment analysis helps improve the precision of time series forecasting applied to the stock market, and to compare the results of univariate and multivariate models.
The results, even though not 100% conclusive, suggest that adding more data, such as that mentioned above, for stock price forecasting can easily lead to model confusion and poor generalization on the test set. In fact, just by looking at the Real vs Predicted graphs, we can clearly see weak trend-following in the later models (Stacked and CNN-LSTM most of all), which we can interpret as overfitting on the training set.
Nevertheless, simpler models such as the multivariate Vanilla-LSTM and the Bi-LSTM performed extremely well, the latter showing exceptional results at a higher computational cost.
Regarding the Bi-LSTM, I noticed that training is much slower, since each sequence is processed in both directions. This could explain its better loss and error, which in my opinion outperform the rest of the architectures, and could indicate that bidirectionality captures additional features that are not taken into account by unidirectional models.
Last but not least, even though the error improves for the multivariate models compared to the univariate one, the plots above show that the predictions look displaced to the right, which may mean that the extra variables add noise that the univariate model does not suffer from.