This repository contains the code used for the paper "Machine Learning and the Implementable Efficient Frontier" by Jensen, Kelly, Malamud, and Pedersen (2024). Please cite this paper if you use the code:
@article{JensenKellyMalamudPedersen2024,
author = {Jensen, Theis Ingerslev and Kelly, Bryan and Malamud, Semyon and Pedersen, Lasse Heje},
title = {Machine Learning and the Implementable Efficient Frontier},
year = {2024}
}
Please send questions about the code to Theis I. Jensen at theis.jensen@yale.edu.
To run the code, clone this repository to your local computing environment and follow the steps explained below. Note that replicating our analysis requires substantial computational resources: the code is set up to be executed on a high-performance computing cluster with a Slurm scheduler.
You need eight data sets to run the code.
usa.csv
- Firm characteristics at a monthly frequency from the paper Is There a Replication Crisis in Finance? by Jensen, Kelly, and Pedersen (2023)
- Download from WRDS. To get US data, require that the column excntry is equal to "USA" (see the sketch after this list)
usa_dsf.csv
- Stock returns at a daily frequency
- The data can be generated by following the instructions in the GitHub repository for "Is There a Replication Crisis in Finance?". Alternatively, you can request the data from us by sending an email to theis.jensen@yale.edu
world_ret_monthly.csv
- Stock returns at a monthly frequency
- The data can be generated by following the instructions in the GitHub repository for "Is There a Replication Crisis in Finance?". Alternatively, you can request the data from us by sending an email to theis.jensen@yale.edu
Factor Details.xlsx
- Information about factor characteristics from "Is There a Replication Crisis in Finance?"
- Download from the bkelly-lab/ReplicationCrisis GitHub repository (GlobalFactors/Factor Details.xlsx)
Cluster Labels.csv
- Information about factor characteristics from "Is There a Replication Crisis in Finance?"
- Download from the bkelly-lab/ReplicationCrisis GitHub repository (GlobalFactors/Cluster Labels.csv)
market_returns.csv
- Market returns from "Is There a Replication Crisis in Finance?"
- Download from Dropbox
ff3_m.csv
- The Fama-French 3-factor model data (used to get the risk-free rate)
- Download from Kenneth French's data library
short_fees
- Short-selling fees based on the Markit Securities Finance Analytics - American Equities database. You can run the vast majority of the code without this data set (the exception being 6 - Short selling fees.R)
- Download from WRDS
These data sets should be saved in the Data folder with the exact names used above.
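If you download the full global characteristics file from WRDS instead of filtering in the query, the US restriction can also be applied locally. The sketch below is only an illustration, assuming the data.table package and the excntry column from the WRDS download; the file name usa_global.csv is a placeholder for the unfiltered download and is not part of the replication code.

```r
# Minimal sketch: restrict the characteristics file to US stocks
# (excntry == "USA"). "Data/usa_global.csv" is a placeholder name for
# an unfiltered WRDS download.
library(data.table)

chars <- fread("Data/usa_global.csv")
usa   <- chars[excntry == "USA"]      # keep US observations only
fwrite(usa, "Data/usa.csv")           # save with the exact name expected by the code
```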
In this section, we go through the steps needed to implement the portfolio choice methods used in the paper. This step is by far the most computationally intensive. We used the dSQ module to submit multiple jobs at the same time to a Slurm scheduler. Below, we include our dSQ calls to give you a sense of the computational resources required to run each step.
- What: Estimate the 12 models used to predict realized returns at time t+1, t+2, ..., t+12
- dSQ call:
dsq --job-file Joblists/joblist_models.txt --cpus-per-task=32 --mem=100G --partition=day -t 06:00:00 --mail-type ALL --output slurm_output/dsq-joblist_models-%A_%1a-%N.out
This call will start 12 independent jobs, which for us took a maximum of 5 hours and required approximately 75GB RAM for each job.
- Main R script:
slurm_fit_models.R
- Output folder:
Data/Generated/Models
- What: Implement portfolio choice methods with the base-case parameters used for Tables 2-4 and Figures 2-4 and D.4
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_base.txt --cpus-per-task=48 --mem=60G --partition=week -t 1-10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-base-%A_%1a-%N.out
This call will start 1 job, which for us took a maximum of 6 hours and required approximately 40GB RAM.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement the portfolio choice methods for all stocks used for the top-left panel in Figure 8
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_all.txt --cpus-per-task=32 --mem=100G --partition=week -t 5-00:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-all-%A_%1a-%N.out
This call will start 1 job, which for us took a maximum of 2 days and 16 hours and required approximately 70GB RAM.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement the portfolio choice methods for stocks in different size groups used for the remaining panels in Figure 8
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_size.txt --cpus-per-task=16 --mem=50G --partition=day -t 8:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-size-%A_%1a-%N.out
This call will start 5 jobs, which for us took a maximum of 5 hours and required approximately 30GB RAM for each job.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement portfolio choice methods for different combinations of wealth and risk aversion to generate the implementable efficient frontier from Figure 1
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_ief.txt --cpus-per-task=16 --mem=50G --partition=day -t 10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-ief-%A_%1a-%N.out
This call will start 20 independent jobs, which for us took a maximum of 7 hours and required approximately 40GB RAM for each job.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement the permutation-based feature importance analysis used for Figures 5, 6, and D.3
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_fi.txt --cpus-per-task=48 --mem=70G --partition=day -t 23:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-fi-%A_%1a-%N.out
This call will start 3 independent jobs, which for us took a maximum of 3 hours and required approximately 35GB RAM for each job.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement simulations from Appendix Section E
- dSQ call:
dsq --job-file Joblists/joblist_simulations.txt --cpus-per-task=32 --mem=50G --partition=day -t 10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_simulations-%A_%1a-%N.out
This call will start 15 independent jobs, which for us took a maximum of 9 hours and required approximately 25GB RAM for each job.
- Main R script:
simulations/simulations.R
- Output folder:
simulations/results
After generating the data from the previous section, you can analyze it on your local PC. Specifically, you can generate all figures and tables from the paper by running the scripts below. Importantly, you need to go through each script and ensure that it points to the correct files (the names of the files generated in the previous section depend on when the jobs were submitted).
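For example, a quick way to see the exact generated file names is to list the output folders from the previous section in R; this is just a convenience sketch, not part of the replication scripts:

```r
# List the generated files so the analysis scripts can be pointed at the
# exact file names produced by the Slurm jobs (which depend on submission time).
list.files("Data/Generated/Models", full.names = TRUE)
list.files("Data/Generated/Portfolios", full.names = TRUE)
list.files("simulations/results", full.names = TRUE)
```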
Start by running the main.R script to load the relevant packages and settings.
Next, run the scripts below to create the figures and tables:
6 - Implementable efficient frontier.R
6 - Base analysis.R
6 - Performance across size distribution.R
6 - Feature Importance.R
6 - Economic intuition.R
6 - Short selling fees.R
6 - RF Example.R
Finally, run the scripts below to save the figures and tables, as well as generate various numbers mentioned in the paper (the sketch after this list shows the full sequence):
7 - Figures.R
7 - Tables.R
7 - Numbers.R
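Assuming all scripts are run from the repository root and sourced into the same R session (our assumption; adjust paths if your setup differs), the overall sequence looks roughly like this:

```r
# Rough end-to-end sequence for the local analysis
# (paths assume the repository root as the working directory).
source("main.R")  # packages and settings

# Build the analysis objects
source("6 - Implementable efficient frontier.R")
source("6 - Base analysis.R")
source("6 - Performance across size distribution.R")
source("6 - Feature Importance.R")
source("6 - Economic intuition.R")
source("6 - Short selling fees.R")
source("6 - RF Example.R")

# Save figures/tables and print numbers mentioned in the paper
source("7 - Figures.R")
source("7 - Tables.R")
source("7 - Numbers.R")
```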
After running these scripts, you should have the figures from the paper in the Figures folder and be able to copy-paste the tables (in LaTeX format) and the numbers from the console.