Diego Alvarez
diego.alvarez@colorado.edu
For ease of use the notebook only uses pandas
, numpy
, and matplotlib
with no other packages. A majority of the code, if not all, is written in fully vectorized pandas thus using minimal amount of for loops and relying on pd.groupby
, pd.agg
, pd.pivot
, pd.query
and pd.melt
. The two main data generation files DateGenerator.py
and PriceGenerator.py
are fully OOP and upon instantiation of object they generate the data. Data is saved within .parquet
to conserve space and is preserved in longer format, although sample data is saved as .csv
.
Futures
└───notebook
│ analysis.ipynb
│ MakeReadmePlots.ipynb
└───src
│ DateGenerator.py
│ PriceGenerator.py
│ MarketStats.py
│ makeData.py
└───data
│ dates.parquet
│ prices.parquet
│ prices_samples.parquet
│ prices_samples.csv
│ prices1m_samples.parquet
│ prices1m_samples.csv
data
directory is not present if repo is created from clone via git to preserve repo space. If repo is cloned, data
directory is created and then filled. If repo is sent via .zip
then data
directory is present and filled with files.
-
DateGenerator.py
: Creates data frame mask for specific contracts. When object is instantiated it defaults to required futures contract (see project requirements) but can take an arbitrary number of contracts. Upon initialization the object makes a dataframe with the correct open market days & hours. It also accounts for timezones and daylight savings as well. There are also functions within code to ensure that there are right number of days per year and hours per day (_check_days_count()
and_check_hours_count()
respectively). File outputsdates.parquet
. -
PriceGenerator.py
: Creates synthetic price time series data built on top of output fromDateGenerator.py
usingdates.parquet
. Synthetic time series includes price roll which is assumed to be the 15th of the last month of the quarter (if weekend or holiday then the following trading day). Upon instantiation of object the code creates the time series. There is also a helper function to ensure that OHLC relationship is preserved (_check_ohlc
). File outputprices.parquet
-
makeData.py
: Creates each object and uses methodsave_data()
withinDateGenerator.py
andPriceGenerator.py
. Then runsmake_sample()
function which gets the last 1 year and 1 month sample of theprices.parquet
dataset and saves to file as parquet and csv respectively. -
MarketStats.py
: Object for creating chartpack to output the calculations. All methods are typevoid
unless they are plotting. If verbose is set True then void functions will state how object attributes are saved to the object. This object is solely used inAnalysis.ipynb
-
dates.parquet
: DataFrame mask for price series containing all contracts, all 5 min bars, with correct market open days and hours. Output from__init__()
function ofDateGenerator.py
. -
prices.parquet
: Synthetic price time series OHLC containing all contracts, accounting for roll. Output from__init__()
function ofPriceGenerator.py
-
prices_sample.parquet
&prices_sample.csv
: Sample 1 year dataset. Later gets used inMakeReadmePlots.ipynb
. The.csv
may be too big and thus a 1 month sample has been made as well. -
prices_1msample.parquet
&prices_1msample.csv
: 1 month sample
notebook files
-
analysis.ipynb
: Jupyter Notebook to run the calculations required. -
MakeReadmePlots.ipynb
: Makes plots for ReadMe file
- 5 min OHLC of 10y worth of data
I. Data must preserve OHLC relationship - 5 Futures contracts in their respective timezones
I. 1 Chicago
II. 1 NYC
III. 1 London
IV. 1 Frankfurt
V. 1 Tokyo - Contracts get rolled on the 15th of the last month of the quarter or the following trading day if weekend (roll will be ~2% of contract value)
- Work on a 250 day calendar and account for time zone changes
- Market hours are 9pm to 5pm the following days
- Once data is simulated calculate the following
I. Find roll-adjusted close
II. Average Daily Volume
III. Average Daily True Range (roll adjusted and roll unadjusted)
IV. Average intraday returns based on NYC time zone (concurrently across markets)
The last two are written in 2 different ways. The Average true range has a version that finds the range per each bar, and one that finds it for the OHLC of the day. The average intraday returns based on NYC time zone finds returns at the 5 min bar, and one that find the total return for the period.
Upon initialization of the DataGenerator object most of the parameters can be modified although they are defaulted. Arguments are defaulted as
- country_contract: type
dictionary
for modifying the number of contracts per each country - end_date: type
datetime
preset for the first day of the current year - year_lookback: type
int
preset for 10y which creates the lookback window
Once the DateGenerator.py
object has been fully initialized it is only prepped with dates and has the following format: contracts get generic names based on the country that is passed through following the form "Country Code" + num. Later in PriceGenerator.py
and prices.parquet
contracts get rolled and thus have new names. For example NYC1 and Chicago1 are akin to NYMEX Crude and CME Crude. By analog NYC1_1 and Chicago1_1 are akin to first NYMEX Crude contract and first CME Crude contract. For example:
Start Date (NYC Localized) | contract |
---|---|
2020-01-01 00:00:00 | NYC1_29 |
2020-01-01 01:00:00 | Chicago1_29 |
2020-01-15 00:00:00 | NYC1_30 |
2020-01-15 01:00:00 | Chicago1_30 |
2020-04-15 00:00:00 | NYC1_31 |
2020-04-15 01:00:00 | Chicago1_31 |
Therefore NYC1_29 is the 29th rolled contract and Chicago_31 is the 31st rolled contract.
As per market hours the data takes this format
column name | meaning |
---|---|
market_day | If local market is open for that day |
market_hour | If local market is open for that hour |
market_day can be open, closed, or holiday, while market_hour can only be open closed.
Rather than accounting for specific holidays across market hours and the chance that market holidays may occur on weekends, the repo will use an alternative method. Since the specific project requires 250 trading days per year, the following method will be used: respective for the market's local time, there are (260 to 261) weekdays that are eligible candidates as trading days. The weekdays will be randomized and the first 250 will be considered trading days the remaining days (not including weekends) will be considered holidays. Unfortunately since there is no gaurantee that the holiday will land on a weekday in the following years, every year the holidays change in the local market. This is to fit in accordance with the 250 day rule.
Example of market trading days. Green: open, Blue: closed (weekend), red: closed (holiday) localized to local time.
Prices are created by sampling normal distribution to act as return. Rather than recursively summing values and trying to account for roll, return prices are simulated and then cumulative multiplied to back out a time series. Starting price values are sampled normally with mean $1,000 +- $30. Using cumulative returns and multiplying by a starting price makes the time series look more akin to financial time series and allows roll cost to be directly added in before cumulative returns. Although random seed is set, returns get zeroed out if market is closed therefore when calculations are done, returns can't be "backed-out" by using the same random seed.
Roll is done quarterly by the 15th of the last month of each quarter. If the 15th is closed (weekend or holiday) it moves to the following open day. To account for roll cost an extra +-2% is added to the curve. The 2% gets added to the synthetic return data and is assumed to be rolled on the first bar of the trade open. Backwardation and contango are assumed to appear in equal proportions implying that on roll day there is a 50-50 chance that cost may be +-2%. This is done by sampling binomial distribution replacing 0s with -1s multiplying by 2 and scaling for percentage.
The following is an example of roll cost present in the data
This is done by sampling from normal distribution with average 1,000,000
with +- 300,000
. It gets applied to the dataframe and zeros out for when markets are closed.
Once time series have been backed out through cumulative returns its considered to be Open Price. To simulate low, high, and close price, sample from normal distribution to get synthetic dollar price moves. To ensure OHLC relationship is preserved take the absolute values. For example high prices are equal to backed out open prices + absolute value of random normals. The same for low but just "..." - absolute value of random normals. Close price is sampled from normal distribution with extremely small standard deviation and then added to close. There is a chance (extremely unlikely) that the dollar price change for close price is higher or lower than the high price and low price. Also the method _check_ohlc()
ensures that the relationship is preserved and has yet to find any problems with the 10y worth of data generated.
Sample OHLC bars for a random day
There are 3 ways to replicate the repo. If the repo was already received with the data prices.parquet
added then just run the jupyter notebook analysis.ipynb
.
To generate prices.parquet
$ python ./src/makeData.py
This runs DateGenerator.py
and PriceGenerator.py
.
An alternative with no sample data is
$ python ./src/DateGenerator.py
$ python ./src/PriceGenerator.py
Since random seed is set time series should be exactly the same in jupyter notebook.
Roll adjusted close is calculated by zeroing out the first bar on the day of the switching between contracts. Programmatically this is done by calculating bar return grouping by each contract and zeroing out the return associated with the first timestamp. Once zerod out returns are calculated cumulatively and are multiplied by original price to back out roll adjusted return.
Plotting the difference between roll and adjusted roll
To ensure that contract roll is eliminated just sample from a specific year.
The obvious gaps in the roll unadjusted close that are not present in the roll adjusted close show how it gets accounted for.
This is the histogram of the average daily volume per local time. Local market time was picked since using NYC market time may result in specific days (usually mondays and fridays) to have less market hours than the other days of the week. Take for example Tokyo if using NYC hours, Monday will have significantly less market hours then the other days of the week. This will lead to outliers on the plot. The following graph is a distribution with the mean average daily volume.
This was described as what is the average return if a position is held from 9:00AM NYC time to 12:00PM NYC time. The code is written to take in a start hour and end hour of NYC-localized 24-hour clock and then calculate the returns. The code presets the hours as 9:00AM and 12:00PM although any two valid hours can be used. The code also calculates returns concurrently with other markets if they are open. This calculation (and later true range) uses the 5-minute bar. After programming it seems that 5-minute average return between two dates is somewhat meaningless. Therefore, a total close-close total range method has been written.
Here is a distribution of returns with mean.
The total range close-close method gets the first and last close of the bar for the respective hour then calculates returns. Here is a distribution of returns with mean
Same idea as above but instead use high - low / close. In this case the function was generated to find the true range for the 5 minute bars but later extended to the whole day (local time). Here is a distribution of the true range.
Same idea but price is resampled daily. Unfortunately it is not as easy as pd.DataFrame.resample()
since the first bar open and last bar closed are required. The code accounts for this, gets the first bar open and last bar close and the high and low. Here is a distribution with averages as well.
An interesting result from these two functions is that when comparing the true range between roll-adjusted and roll-unadjusted its apparent which way the curve shapes with each contract. Although the price generation was written with 50%-50% likelihood of contango (negative roll) and backwardation (positive roll) the 5 minute bars shows the difference. On the other hand the daily resampled does not, this because the roll although ~2% of the price can still be muted out by the rest of the trading session's activity.