This repository contains the scripts and datasets used in my Master's thesis, "Clustering Financial Time Series". The numerical experiments were conducted in Python and R. The commands needed to reproduce the results in the thesis are listed below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import python_scripts.utils as tu
from python_scripts.hier_clust import *
Here I illustrate the analysis on Synthetic Dataset 1; the same steps were performed on Synthetic Dataset 2.
df, true_cluster = tu.get_synthetic_data('datasets/synthetic_data1.csv')
res_df, true_cluster = tu.get_synthetic_data('results/synthetic_data1_res.csv')
plt.rcParams['figure.figsize'] = (8.0, 4.0)
tu.plot_series(df, nr_series_per_class=5)
tu.plot_one_per_class(df, nr_series_per_class=5)
measures = ['euclidean', 'dtw', 'corr1', 'corr2',
'cross_corr1', 'cross_corr2', 'cross_corr3',
'acf', 'pacf', 'rccf1', 'rccf2', 'rccf3']
get_sim_index(measures, df, res_df, true_cluster)
measure | single | complete | average |
---|---|---|---|
euclidean | 0.363636 | 0.380952 | 0.380952 |
dtw | 0.380952 | 0.439935 | 0.380952 |
corr1 | 0.334127 | 0.243590 | 0.294444 |
corr2 | 0.334127 | 0.243590 | 0.243590 |
cross_corr1 | 1.000000 | 1.000000 | 1.000000 |
cross_corr2 | 1.000000 | 1.000000 | 1.000000 |
cross_corr3 | 1.000000 | 1.000000 | 1.000000 |
acf | 0.321429 | 0.321429 | 0.321429 |
pacf | 0.363636 | 0.363636 | 0.363636 |
rccf1 | 1.000000 | 1.000000 | 1.000000 |
rccf2 | 1.000000 | 1.000000 | 1.000000 |
rccf3 | 0.722222 | 0.949495 | 0.949495 |
chosen_measures = ['cross_corr2', 'cross_corr3',
'rccf2', 'rccf3']
cluster_numbers = list(range(2,11))
get_sil_index(chosen_measures, cluster_numbers, df, res_df)
clusters | cross_corr2 | cross_corr3 | rccf2 | rccf3 |
---|---|---|---|---|
2 | 0.530485 | 0.739282 | 0.530762 | 0.692445 |
3 | 0.329060 | 0.426181 | 0.401696 | 0.477057 |
4 | 0.323237 | 0.420283 | 0.288671 | 0.288671 |
5 | 0.316615 | 0.413631 | 0.270898 | 0.270898 |
6 | 0.298892 | 0.404116 | 0.228582 | 0.228582 |
7 | 0.279041 | 0.384262 | 0.229174 | 0.229174 |
8 | 0.279625 | 0.384827 | 0.225471 | 0.225471 |
9 | 0.282080 | 0.387268 | 0.184505 | 0.184505 |
10 | 0.095913 | 0.095913 | 0.186782 | 0.186782 |
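For readers unfamiliar with the silhouette index, here is a minimal sketch of how it can be computed from a precomputed dissimilarity matrix and a vector of cluster labels. It assumes the standard silhouette definition; `silhouette_index` is a hypothetical helper written for illustration and is not the repository's `get_sil_index`, which may differ in detail.

```python
def silhouette_index(dist, labels):
    """Mean silhouette width from a square dissimilarity matrix (list of lists).

    Illustrative helper (assumption): not the repository's implementation.
    """
    n = len(labels)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    widths = []
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            widths.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean dissimilarity to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / n


# Four points on a line at 0, 1, 10, 11, split into two obvious clusters.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
score = silhouette_index(dist, [0, 0, 1, 1])
```

Values close to 1 indicate compact, well-separated clusters, which is why the number of clusters with the highest index is usually preferred.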
clustering = HClust(df, true_cluster, 'cross_corr3')
clustering.plot_dendrogram()
clustering1 = HClust(df, true_cluster, 'cross_corr2')
clustering2 = HClust(df, true_cluster, 'cross_corr3')
fig = plt.figure(figsize=(20, 8))
fig.subplots_adjust(hspace=0.4, wspace=0.02)
fig.add_subplot(1, 2, 1)
clustering1.plot_heatmap(xlab='$d_{CCF_2}$')
fig.add_subplot(1, 2, 2)
clustering2.plot_heatmap(xlab='$d_{CCF_3}$')
stock_data = pd.read_csv('datasets/nyse_data.csv', index_col=0)
clustering = HClust(data=stock_data, ground_truth=None,
dist_func='cross_corr3', verbose=True)
clustering.dist_mat.to_csv('results/diss_mat_ccf3.csv', index=None, header=None)
Dissimilarity computation: 100% [-------------------------------] Time: 0:32:53
diss_mat_ccf2 = pd.read_csv('results/diss_mat_ccf2.csv', header=None, index_col=None)
diss_mat_ccf3 = pd.read_csv('results/diss_mat_ccf3.csv', header=None, index_col=None)
diss_ccf2 = np.triu(diss_mat_ccf2.values,1).flatten()
diss_ccf3 = np.triu(diss_mat_ccf3.values,1).flatten()
plt.hist(diss_ccf2[diss_ccf2!=0], bins=50, alpha=0.5, label='${CCF_2}$')
plt.hist(diss_ccf3[diss_ccf3!=0], bins=50, alpha=0.5, label='${CCF_3}$')
plt.xlabel('Dissimilarity')
plt.ylabel('Frequency')
plt.legend(fontsize=14);
sectors = pd.read_csv('datasets/sectors.csv', header=None)
Converting sector categories into numeric values.
from sklearn.preprocessing import LabelEncoder
sectors_numeric = LabelEncoder().fit_transform(sectors.values.ravel())
get_similarities(diss_mat_ccf2, ground_truth=sectors_numeric)
single | complete | average |
---|---|---|
0.087304 | 0.325193 | 0.145864 |
permutations = pd.read_csv('results/null_distribution1.csv')
permutations.plot(kind='hist', bins=50, legend=False)
s0 = 309
plt.plot(s0, 0.5, 'ro')
plt.arrow(s0, 20, 0, -16, length_includes_head=True,
head_width=5, head_length=4)
plt.xlabel('Number of pure edges');
source('R_scripts/clustering.R')
source('R_scripts/mst.R')
df <- read.csv('datasets/synthetic_data1.csv')
nr.series <- 5
true_cluster <- c(rep(1, nr.series), rep(2,nr.series),
rep(1, nr.series), rep(2, nr.series))
This snippet returns the similarity index of the clusterings obtained with the Piccolo and Maharaj distances under single linkage. Optionally, the resulting dendrograms can be plotted by setting `plot=TRUE`. Here I have used the `TSclust` package.
cluster_eval(df, dist.method='AR.PIC',
linkage.method = 'single',
true_cluster, plot=F)
cluster_eval(df, dist.method='AR.MAH',
linkage.method = 'single',
true_cluster, plot=F)
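As a rough illustration of the idea behind the Piccolo distance, the sketch below computes the Euclidean distance between two vectors of fitted AR coefficients, padding the shorter one with zeros. This is a simplified sketch under the assumption that the AR coefficients are already available; `piccolo_distance` is a hypothetical helper, and the actual computation (including model fitting and order selection) is done by `TSclust` in the R scripts.

```python
import math


def piccolo_distance(ar_coefs_x, ar_coefs_y):
    """Euclidean distance between two AR coefficient vectors.

    Illustrative helper (assumption): the shorter vector is padded with
    zeros so that models of different order can be compared.
    """
    k = max(len(ar_coefs_x), len(ar_coefs_y))
    x = list(ar_coefs_x) + [0.0] * (k - len(ar_coefs_x))
    y = list(ar_coefs_y) + [0.0] * (k - len(ar_coefs_y))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# An AR(2) model compared with an AR(1) model (coefficients are made up).
d = piccolo_distance([0.3, 0.4], [0.3])
```

Two series are considered close under this measure when their fitted autoregressive structures are similar, regardless of how the raw values compare.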
This snippet fits an AR or ARIMA(p,1,0) model to each time series in the dataset and returns the model residuals, which are used by the RCCF dissimilarity measures.
res.df <- get.residuals(df)
write.csv(res.df, file="results/synthetic_data1_res.csv",
row.names = F)
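To make the residual extraction step concrete, here is a minimal sketch in Python that fits an AR(1) model with intercept by ordinary least squares and returns its residuals. It is a simplification under stated assumptions: the order is fixed at 1 and no differencing is applied, whereas `get.residuals` in the R scripts handles AR and ARIMA(p,1,0) models. The name `ar1_residuals` is hypothetical.

```python
def ar1_residuals(series):
    """Fit x_t = c + phi * x_{t-1} + e_t by least squares.

    Illustrative helper (assumption): fixed order 1, no differencing.
    Returns (c, phi, residuals), with one residual per fitted time step.
    """
    x_prev = series[:-1]
    x_next = series[1:]
    n = len(x_prev)
    if n < 2:
        raise ValueError("need at least three observations")

    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    sxx = sum((p - mean_prev) ** 2 for p in x_prev)
    if sxx == 0:
        raise ValueError("lagged series is constant")
    sxy = sum((p - mean_prev) * (q - mean_next) for p, q in zip(x_prev, x_next))

    phi = sxy / sxx
    c = mean_next - phi * mean_prev
    residuals = [q - (c + phi * p) for p, q in zip(x_prev, x_next)]
    return c, phi, residuals


# A noise-free series generated by x_t = 1 + 0.5 * x_{t-1}, starting at 0.
series = [0.0, 1.0, 1.5, 1.75, 1.875]
c, phi, residuals = ar1_residuals(series)
```

Working on residuals rather than raw series removes each series' own autocorrelation, so that the cross-correlation between residuals reflects dependence between the series.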
Plotting the minimum spanning tree with threshold 1 on the edge weights.
diss_data <- read.csv('results/diss_mat_ccf3.csv', header=F)
sectors <- read.csv('datasets/sectors.csv', header=F)
sectors <- sectors$V1
sectors.numeric <- as.numeric(sectors)
gr <- graph.adjacency(as.matrix(diss_data),
mode='undirected',
weighted = T)
mstree <- igraph::mst(gr)
final_mstree <- plot.mst(mstree, names=sectors.numeric,
pallete='Paired',
threshold=1,
save.fig = F,
fig.size=c(6,4))
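The minimum spanning tree step can be sketched in Python as follows. This is a minimal illustration using Prim's algorithm on a dense, symmetric dissimilarity matrix; the R scripts rely on `igraph::mst` instead, and `minimum_spanning_tree` is a hypothetical name. The `threshold` handling of `plot.mst` is not reproduced here, since its exact meaning is defined in the R script.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric dissimilarity matrix.

    Illustrative helper (assumption): every pair of nodes is treated as
    connected. Returns a list of (i, j, weight) tree edges.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best_weight = [float("inf")] * n  # cheapest known link into the tree
    best_parent = [-1] * n
    best_weight[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((k for k in range(n) if not in_tree[k]),
                key=lambda k: best_weight[k])
        in_tree[u] = True
        if best_parent[u] >= 0:
            edges.append((best_parent[u], u, dist[best_parent[u]][u]))
        # Update the cheapest links of the remaining nodes.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best_weight[v]:
                best_weight[v] = dist[u][v]
                best_parent[v] = u
    return edges


# Four nodes located at 0, 1, 3 and 6 on a line.
points = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]
tree = minimum_spanning_tree(dist)
```

The tree keeps only n - 1 of the n(n - 1)/2 pairwise dissimilarities, which makes the strongest links between series easy to inspect visually.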
Running the permutation test returns the p-value of the test and saves the resulting permutations to a CSV file.
permutation.test(final_mstree, 10^4)
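The logic of such a permutation test can be sketched in Python as follows. The sketch assumes that a "pure edge" is a tree edge whose two endpoints belong to the same sector, and that the null distribution is obtained by shuffling the sector labels over the nodes; both are my reading of the procedure, and the R function `permutation.test` may differ in detail. The names `count_pure_edges` and `permutation_p_value` are hypothetical.

```python
import random


def count_pure_edges(edges, labels):
    """Number of edges whose endpoints share the same label (assumed definition)."""
    return sum(1 for i, j in edges if labels[i] == labels[j])


def permutation_p_value(edges, labels, n_perm=1000, seed=0):
    """One-sided permutation test on the number of pure edges.

    Illustrative helper (assumption): labels are shuffled over the nodes
    while the tree is kept fixed. Returns (observed, p_value, null_counts).
    """
    rng = random.Random(seed)
    observed = count_pure_edges(edges, labels)
    shuffled = list(labels)
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_counts.append(count_pure_edges(edges, shuffled))
    # Add-one correction so the estimated p-value is never exactly zero.
    exceed = sum(1 for c in null_counts if c >= observed)
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value, null_counts


# Toy tree with two label groups; every edge joins nodes of the same group.
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
labels = [0, 0, 0, 1, 1, 1]
observed, p_value, null_counts = permutation_p_value(edges, labels, n_perm=200)
```

A small p-value indicates that the observed number of pure edges is unlikely under random assignment of sectors, that is, the tree structure aligns with the sector labels.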
Plotting the histogram of clusters in order to see the alignment between the obtained clusters and the provided categories (sectors).
df <- read.csv('results/clusters_complete_3.csv')
df$clusters <- df$clusters + 1
df$gt_num <- as.character(as.numeric(df$gt))
plot.hist(df, save.fig=T, fig.size=c(6,4))