This repository contains the code used for the paper "Machine Learning and the Implementable Efficient Frontier" by Jensen, Kelly, Malamud, and Pedersen (2024). Please cite this paper if you use the code:
@article{JensenKellyMalamudPedersen2024,
author = {Jensen, Theis Ingerslev and Kelly, Bryan and Malamud, Semyon and Pedersen, Lasse Heje},
title = {Machine Learning and the Implementable Efficient Frontier},
year = {2024}
}
Please send questions about the code to Theis I. Jensen at theis.jensen@yale.edu.
To run the code, clone this repository to your local computing environment and follow the steps explained below. Note that replicating our analysis requires substantial computational resources: the code is set up to be executed on a high-performance computing cluster with a Slurm scheduler.
You need eight data sets to run the code.
usa.csv
- Firm characteristics at a monthly frequency from the paper Is There a Replication Crisis in Finance? by Jensen, Kelly, and Pedersen (2023)
- Download from WRDS. To get US data, require that the column excntry is equal to "USA" (see the sketch after this list)
usa_dsf.csv
- Stock returns at a daily frequency
- The data can be generated by following the instructions in the GitHub repository for "Is There a Replication Crisis in Finance?". Alternatively, you can request the data from us by sending an email to theis.jensen@yale.edu
world_ret_monthly.csv
- Stock returns at a monthly frequency
- The data can be generated by following the instructions in the GitHub repository for "Is There a Replication Crisis in Finance?". Alternatively, you can request the data from us by sending an email to theis.jensen@yale.edu
Factor Details.xlsx
- Information about factor characteristics from "Is There a Replication Crisis in Finance?"
- Download from the bkelly-lab/ReplicationCrisis GitHub repository (GlobalFactors/Factor Details.xlsx)
Cluster Labels.csv
- Information about factor characteristics from "Is There a Replication Crisis in Finance?"
- Download from the bkelly-lab/ReplicationCrisis GitHub repository (GlobalFactors/Cluster Labels.csv)
market_returns.csv
- Market returns from "Is There a Replication Crisis in Finance?"
- Download from Dropbox
ff3_m.csv
- The Fama-French 3-factor model data (used to get the risk-free rate)
- Download from Kenneth French's data library
short_fees
- Short-selling fees based on the Markit Securities Finance Analytics - American Equities database. You can run the vast majority of the code without this data set (the exception being 6 - Short selling fees.R)
- Download from WRDS
These data sets should be saved in the Data folder with the exact names used above.
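If you download the full global characteristics file from WRDS instead of filtering in the query, the US restriction can also be applied locally. The sketch below is only an illustration, assuming the data.table package and the excntry column from the WRDS download; the file name usa_global.csv is a placeholder for the unfiltered download and is not part of the replication code.

```r
# Minimal sketch: restrict the characteristics file to US stocks
# (excntry == "USA"). "Data/usa_global.csv" is a placeholder name for
# an unfiltered WRDS download.
library(data.table)

chars <- fread("Data/usa_global.csv")
usa   <- chars[excntry == "USA"]      # keep US observations only
fwrite(usa, "Data/usa.csv")           # save with the exact name expected by the code
```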
In this section, we go through the steps needed to implement the portfolio choice methods used in the paper. This step is by far the most computationally intensive. We used the dSQ module to submit multiple jobs at the same time to a Slurm scheduler. Below, we include our dSQ calls to give you a sense of the computational resources required to run each step.
- What: Estimate the 12 models used to predict realized returns at time t+1, t+2, ..., t+12
- dSQ call:
dsq --job-file Joblists/joblist_models.txt --cpus-per-task=32 --mem=100G --partition=day -t 06:00:00 --mail-type ALL --output slurm_output/dsq-joblist_models-%A_%1a-%N.out
This call will start 12 independent jobs, which for us took a maximum of 5 hours and required approximately 75GB RAM for each job.
- Main R script:
slurm_fit_models.R
- Output folder:
Data/Generated/Models
- What: Implement portfolio choice methods with the base-case parameters used for Tables 2-4 and Figures 2-4 and D.4
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_base.txt --cpus-per-task=48 --mem=60G --partition=week -t 1-10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-base-%A_%1a-%N.out
This call will start 1 job, which for us took a maximum of 6 hours and required approximately 40GB RAM.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement the portfolio choice methods for all stocks used for the top-left panel in Figure 8
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_all.txt --cpus-per-task=32 --mem=100G --partition=week -t 5-00:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-all-%A_%1a-%N.out
This call will start 1 job, which for us took a maximum of 2 days and 16 hours and required approximately 70GB RAM.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement the portfolio choice methods for stocks in different size groups used for the remaining panels in Figure 8
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_size.txt --cpus-per-task=16 --mem=50G --partition=day -t 8:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-size-%A_%1a-%N.out
This call will start 5 jobs, which for us took a maximum of 5 hours and required approximately 30GB RAM for each job.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement portfolio choice methods for different combinations of wealth and risk aversion to generate the implementable efficient frontier from Figure 1
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_ief.txt --cpus-per-task=16 --mem=50G --partition=day -t 10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-ief-%A_%1a-%N.out
This call will start 20 independent jobs, which for us took a maximum of 7 hours and required approximately 40GB RAM for each job.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement the permutation-based feature importance analysis used for Figures 5, 6, and D.3
- dSQ call:
dsq --job-file Joblists/joblist_pfchoice_fi.txt --cpus-per-task=48 --mem=70G --partition=day -t 23:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-fi-%A_%1a-%N.out
This call will start 3 independent jobs, which for us took a maximum of 3 hours and required approximately 35GB RAM for each job.
- Main R script:
slurm_build_portfolios.R
- Output folder:
Data/Generated/Portfolios
- What: Implement simulations from Appendix Section E
- dSQ call:
dsq --job-file Joblists/joblist_simulations.txt --cpus-per-task=32 --mem=50G --partition=day -t 10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_simulations-%A_%1a-%N.out
This call will start 15 independent jobs, which for us took a maximum of 9 hours and required approximately 25GB RAM for each job.
- Main R script:
simulations/simulations.R
- Output folder:
simulations/results
After generating the data from the previous section, you can analyze it on your local PC. Specifically, you can generate all figures and tables from the paper by running the scripts below. Importantly, you need to go through each script and ensure that it points to the correct files (the names of the files generated in the previous section depend on when the jobs were submitted).
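For example, a quick way to see the exact generated file names is to list the output folders from the previous section in R; this is just a convenience sketch, not part of the replication scripts:

```r
# List the generated files so the analysis scripts can be pointed at the
# exact file names produced by the Slurm jobs (which depend on submission time).
list.files("Data/Generated/Models", full.names = TRUE)
list.files("Data/Generated/Portfolios", full.names = TRUE)
list.files("simulations/results", full.names = TRUE)
```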
Start by running the main.R script to load the relevant packages and settings.
Next, run the scripts below to create the figures and tables:
6 - Implementable efficient frontier.R
6 - Base analysis.R
6 - Performance across size distribution.R
6 - Feature Importance.R
6 - Economic intuition.R
6 - Short selling fees.R
6 - RF Example.R
Finally, run the scripts below to save the figures and tables, as well as generate various numbers mentioned in the paper (the sketch after this list shows the full sequence):
7 - Figures.R
7 - Tables.R
7 - Numbers.R
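Assuming all scripts are run from the repository root and sourced into the same R session (our assumption; adjust paths if your setup differs), the overall sequence looks roughly like this:

```r
# Rough end-to-end sequence for the local analysis
# (paths assume the repository root as the working directory).
source("main.R")  # packages and settings

# Build the analysis objects
source("6 - Implementable efficient frontier.R")
source("6 - Base analysis.R")
source("6 - Performance across size distribution.R")
source("6 - Feature Importance.R")
source("6 - Economic intuition.R")
source("6 - Short selling fees.R")
source("6 - RF Example.R")

# Save figures/tables and print numbers mentioned in the paper
source("7 - Figures.R")
source("7 - Tables.R")
source("7 - Numbers.R")
```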
After running these scripts, you should have the figures from the paper in the Figures folder and be able to copy-paste the tables (in LaTeX format) and the numbers from the console.