An Application of Random Forest!
- Objective: Project for my internship at Research Center VERA, Ca' Foscari University of Venice.
- Abstract: Use sentiment-based features to predict cryptocurrency returns. Models used: Random Forest Classifier, Random Forest Regressor, and VAR time-series model. Analysis timeframe: 28/11/2014 - 25/07/2020.
- Status: Completed.
- Random Forests (Regressor & Classifier)
- Principal Component Analysis
- Vector Autoregression (VAR) model
- Sentiment Indicators (retrieved from my graduation thesis)
- Python 3
- numpy==1.18.5
- pandas==1.0.5
- scikit-learn==0.23.2
- statsmodels==0.12.0
- plotly==4.9.0
Backtesting strategies based on 3 models:
- Generate trading signals: go long when the predicted return is > 0, go short when it is < 0, and stay out of the market otherwise.
- Test period (25% of the dataset): 05/03/2019 - 25/07/2020
- The RF Classifier strategy significantly outperforms the other two model-based strategies as well as the simple buy-and-hold strategy.
- Download the interactive version.
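
The signal rule above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes simple one-period returns, a position held for exactly one period, and no transaction costs. The function names `trading_signals` and `backtest` and the toy numbers are hypothetical.

```python
import numpy as np

def trading_signals(predicted_returns):
    """Map predicted returns to positions: +1 long, -1 short, 0 wait."""
    return np.sign(np.asarray(predicted_returns, dtype=float)).astype(int)

def backtest(predicted_returns, actual_returns):
    """Return per-period strategy returns and their mean in basis points."""
    signals = trading_signals(predicted_returns)
    # Assumption: the position earns (or loses) the realized return of the period
    strategy_returns = signals * np.asarray(actual_returns, dtype=float)
    return strategy_returns, strategy_returns.mean() * 1e4

# Toy numbers, invented for illustration only
predicted = [0.012, -0.004, 0.0, 0.007]
actual = [0.020, 0.010, -0.030, -0.005]
strategy_returns, mean_bps = backtest(predicted, actual)
print(trading_signals(predicted))   # positions: long, short, wait, long
print(round(mean_bps, 2), "bps")
```

A zero prediction produces a zero position, so that period contributes nothing to the strategy return.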
- Clone this repo:
git clone https://github.com/dang-trung/crypto-return-predictor
- Create your environment (virtualenv):
virtualenv -p python3 venv
source venv/bin/activate (bash) or venv\Scripts\activate (windows)
(venv) cd crypto-return-predictor
(venv) pip install -e .
Or (conda):
conda env create -f environment.yml
conda activate crypto-return-predictor
- Run in terminal:
python -m crypto_return_predictor
Cryptocurrency market returns, computed using the market index CRIX (retrieved here; for how the index is constructed, see Trimborn & Härdle (2018) or the authors' website).
- Sentiment scores of StockTwits messages, Reddit submissions, and Reddit comments.
  - Computed using dictionary-based sentiment analysis with the crypto-specific lexicon by Chen et al. (2019), retrieved from the main author's personal page.
  - StockTwits messages are retrieved through the StockTwits Public API; Reddit data are retrieved using the PushShift.io Reddit API.
- Message volume on StockTwits, Reddit submissions, and Reddit comments.
- Market volatility index VCRIX (see how the index is created: Kolesnikova (2018), retrieved here.)
- Market trading volume (retrieved using Nomics Public API)
Read more on how I retrieved these sentiment measures in my graduation thesis or its GitHub repo.
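
As a rough illustration of dictionary-based sentiment scoring, the sketch below averages lexicon weights over the tokens of each message and then over the messages of a day. Everything here is an assumption made for illustration: the toy lexicon and its weights are invented (they are not entries from the Chen et al. (2019) lexicon), and the tokenizer and averaging scheme may differ from what the thesis actually does.

```python
import re

# Hypothetical toy lexicon; NOT the Chen et al. (2019) lexicon
TOY_LEXICON = {"moon": 1.0, "bullish": 0.8, "hodl": 0.5, "dump": -0.9, "scam": -1.0}

def sentiment_score(message, lexicon=TOY_LEXICON):
    """Average lexicon weight over matched tokens; 0.0 if nothing matches."""
    tokens = re.findall(r"[a-z']+", message.lower())
    weights = [lexicon[t] for t in tokens if t in lexicon]
    return sum(weights) / len(weights) if weights else 0.0

def daily_sentiment(messages, lexicon=TOY_LEXICON):
    """Return (mean score over the day's messages, message volume)."""
    scores = [sentiment_score(m, lexicon) for m in messages]
    mean_score = sum(scores) / len(scores) if scores else 0.0
    return mean_score, len(messages)

# Invented example messages
messages = [
    "BTC to the moon, so bullish!",
    "Total scam, dump it",
    "Just watching the chart",
]
day_score, day_volume = daily_sentiment(messages)
print(day_score, day_volume)
```

The same pass yields both a sentiment score and a message volume, which mirrors the two kinds of measures listed above.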
- For the VAR model: lagged values of the first principal component of all 9 sentiment measures (up to 5 lags).
- For the Random Forests: lagged values of the sentiment measures (up to 5 lags).
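
The two feature sets above could be built roughly as follows. This is a sketch under stated assumptions, not the project's code: the helper names are hypothetical, the input is random placeholder data standing in for the 9 sentiment measures, the columns are standardized before the decomposition, and the first principal component is obtained with a plain NumPy SVD rather than a dedicated PCA class.

```python
import numpy as np
import pandas as pd

def make_lagged_features(df, n_lags=5):
    """Return a frame with columns '<name>_lag<k>' for k = 1..n_lags."""
    lagged = {
        f"{col}_lag{k}": df[col].shift(k)
        for col in df.columns
        for k in range(1, n_lags + 1)
    }
    # The first n_lags rows have no complete history and are dropped
    return pd.DataFrame(lagged, index=df.index).dropna()

def first_principal_component(df):
    """Project standardized columns onto their first principal axis."""
    z = (df - df.mean()) / df.std(ddof=0)
    # Rows of vt are the principal axes of the standardized data
    _, _, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)
    return pd.Series(z.to_numpy() @ vt[0], index=df.index, name="pc1")

rng = np.random.default_rng(0)
# Placeholder data: 9 random columns standing in for the sentiment measures
measures = pd.DataFrame(rng.normal(size=(50, 9)),
                        columns=[f"m{i}" for i in range(9)])

rf_features = make_lagged_features(measures, n_lags=5)  # 9 measures x 5 lags
pc1 = first_principal_component(measures)               # one series for the VAR
print(rf_features.shape, pc1.shape)
```

The sign of a principal component is arbitrary, so `pc1` may come out flipped relative to another implementation without changing the information it carries.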
Ordered by backtested daily return (highest to lowest):
- Random Forest Classifier:
  - Accuracy: 61.86%
  - Confusion matrix:

    | Predicted (rows) / Actual (columns) | Negative | Unchanged | Positive |
    |---|---|---|---|
    | Negative | 145 | 0 | 97 |
    | Unchanged | 1 | 0 | 0 |
    | Positive | 96 | 0 | 170 |

  - Backtesting daily returns: ~91 bps
- VAR(5):
  - Accuracy: 54.62%
  - Confusion matrix:

    | Predicted (rows) / Actual (columns) | Negative | Unchanged | Positive |
    |---|---|---|---|
    | Negative | 57 | 0 | 185 |
    | Unchanged | 0 | 0 | 1 |
    | Positive | 45 | 0 | 221 |

  - Backtesting daily returns: ~48 bps
- Random Forest Regressor:
  - Accuracy: 56.19%
  - Confusion matrix:

    | Predicted (rows) / Actual (columns) | Negative | Unchanged | Positive |
    |---|---|---|---|
    | Negative | 222 | 0 | 20 |
    | Unchanged | 1 | 0 | 0 |
    | Positive | 202 | 0 | 64 |

  - Backtesting daily returns: ~19 bps (just slightly better than holding the CRIX index)
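
Accuracy can be read off a confusion matrix as the share of observations on the diagonal. The short sketch below shows that calculation for the VAR(5) and Random Forest Regressor matrices listed above; the helper name `accuracy_from_confusion` is hypothetical, and the result does not depend on which axis holds the predicted classes.

```python
import numpy as np

def accuracy_from_confusion(matrix):
    """Correct predictions (diagonal) divided by all observations."""
    m = np.asarray(matrix)
    return np.trace(m) / m.sum()

# Class order in both axes: Negative, Unchanged, Positive
var_matrix = [[57, 0, 185], [0, 0, 1], [45, 0, 221]]
rf_regressor_matrix = [[222, 0, 20], [1, 0, 0], [202, 0, 64]]

var_accuracy = accuracy_from_confusion(var_matrix)                    # 278 / 509
rf_regressor_accuracy = accuracy_from_confusion(rf_regressor_matrix)  # 286 / 509
print(f"{var_accuracy:.2%}", f"{rf_regressor_accuracy:.2%}")
```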
For a fuller description of the project, see the report.