This repository provides the MATLAB code for the following paper:
- Xinyu Chen, Zhaocheng He, Yixian Chen, Yuhuan Lu, Jiawei Wang (2019). Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model. Transportation Research Part C: Emerging Technologies, 104: 66-77. [preprint] [doi] [slide] [data] [Matlab code] [Python code]
- `BATF_VB.m`: The main code file. It implements variational inference for BATF.
- `Test_BATF.m`: The testing file. You can find a testing example here.
- `ms_scenario.m`: Generates an incomplete tensor under different missing scenarios.
- `cp_combination.m`: A function for the CP combination over factor matrices.
- `khatrirao_fast.m`: A fast Khatri-Rao product function.
- `kr.m`: A Kronecker product function.
- `mat2ten.m`: A tensor folding function.
- `ten2mat.m`: A tensor unfolding function.
- `vec_combination.m`: A function for summing the bias vectors up to a third-order tensor. (Please consider using a newer version of MATLAB.)
- `safelog.m`: A safe logarithm function inspired by the work of BCPF.
- `tensor.mat`: The original speed data, organized as a third-order tensor (road segment × day × time interval) of size 214 × 61 × 144. About 1.29% of the values in the raw data set are missing.
- `random_tensor.mat`: A synthetic random tensor, used for generating tensors with randomly missing entries.
- `random_matrix.mat`: A synthetic random matrix, used for generating tensors with missing fibers.
BATF can be run as follows:

```matlab
[model] = BATF_VB(dense_tensor, sparse_tensor, 'CP_rank', low_rank, 'maxiter', max_iteration);
```

Required arguments:

- `dense_tensor`: The original data tensor.
- `sparse_tensor`: The incomplete tensor generated by the `ms_scenario.m` function.

Optional arguments:

- `'CP_rank'`: The low rank of the CP factorization.
- `'maxiter'`: The maximum number of iterations of the variational inference.
Please check out the Python implementation of BATF in the transdim project for details. The Python implementation solves the fully Bayesian augmented tensor factorization model with Gibbs sampling; it is not difficult to reproduce as long as numpy is available in your Python environment. There are a few steps to follow:
- Import some necessary packages:
```python
import numpy as np
from numpy.random import multivariate_normal as mvnrnd
from scipy.stats import wishart
from numpy.random import normal as normrnd
from scipy.linalg import khatri_rao as kr_prod
from numpy.linalg import inv
from numpy.linalg import solve
from numpy.linalg import cholesky as cholesky_lower
from scipy.linalg import cholesky as cholesky_upper
from scipy.linalg import solve_triangular as solve_ut
```
- Define the function for generating multivariate Gaussian distributed random numbers (very fast here!):
```python
def mvnrnd_pre(mu, Lambda):
    src = normrnd(size = (mu.shape[0],))
    return solve_ut(cholesky_upper(Lambda, overwrite_a = True, check_finite = False),
                    src, lower = False, check_finite = False, overwrite_b = True) + mu
```
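As a quick sanity check (a hypothetical snippet, not part of the original notebook), samples drawn with `mvnrnd_pre` should have covariance close to the inverse of the precision matrix:

```python
# Hypothetical sanity check: mvnrnd_pre(mu, Lambda) draws from N(mu, inv(Lambda)).
Lambda = np.array([[2.0, 0.5], [0.5, 1.0]])   # a positive-definite precision matrix
mu = np.zeros(2)
# cholesky_upper is called with overwrite_a = True, so pass a copy of Lambda.
samples = np.array([mvnrnd_pre(mu, Lambda.copy()) for _ in range(10000)])
print(np.cov(samples.T))   # should be close to inv(Lambda)
print(inv(Lambda))
```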
- Define CP combination for factor matrices and vector combination for biases:
```python
def cp_combine(var):
    return np.einsum('is, js, ts -> ijt', var[0], var[1], var[2])

def vec_combine(vector):
    return (vector[0][:, None, None] + vector[1][None, :, None]
            + vector[2][None, None, :])
```
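For intuition (a hypothetical shape check): with factor matrices of sizes 4 × r, 3 × r, and 2 × r, `cp_combine` returns a 4 × 3 × 2 tensor, and `vec_combine` broadcasts three bias vectors into the same shape:

```python
# Hypothetical shape check for cp_combine and vec_combine.
r = 5
factor = [np.random.randn(4, r), np.random.randn(3, r), np.random.randn(2, r)]
vector = [np.random.randn(4), np.random.randn(3), np.random.randn(2)]
print(cp_combine(factor).shape)   # (4, 3, 2)
print(vec_combine(vector).shape)  # (4, 3, 2)
```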
- Define the tensor unfolding function:
```python
def ten2mat(tensor, mode):
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order = 'F')
```
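For example (again a hypothetical check), unfolding a 4 × 3 × 2 tensor along each mode yields a matrix whose number of rows equals the size of that mode:

```python
# Hypothetical check: mode-k unfolding of a (4, 3, 2) tensor.
X = np.random.randn(4, 3, 2)
print(ten2mat(X, 0).shape)   # (4, 6)
print(ten2mat(X, 1).shape)   # (3, 8)
print(ten2mat(X, 2).shape)   # (2, 12)
```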
- Define other necessary functions here:
```python
def cov_mat(mat, mat_bar):
    mat = mat - mat_bar
    return mat.T @ mat

def compute_mape(var, var_hat):
    return np.sum(np.abs(var - var_hat) / var) / var.shape[0]

def compute_rmse(var, var_hat):
    return np.sqrt(np.sum((var - var_hat) ** 2) / var.shape[0])
```
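Both metrics expect flattened vectors of ground-truth and estimated values, e.g. (a toy example):

```python
# Toy example: the two error metrics on flattened vectors.
truth = np.array([10.0, 20.0, 40.0])
estimate = np.array([11.0, 19.0, 42.0])
print(compute_mape(truth, estimate))   # mean absolute percentage error, here about 0.0667
print(compute_rmse(truth, estimate))   # root mean squared error, here about 1.414
```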
- Define the function for sampling global parameter:
```python
def sample_global_mu(mu_sparse, pos_obs, tau_eps, tau0 = 1):
    tau_tilde = 1 / (tau_eps * len(pos_obs[0]) + tau0)
    mu_tilde = tau_eps * np.sum(mu_sparse) * tau_tilde
    return np.random.normal(mu_tilde, np.sqrt(tau_tilde))
```
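This is the standard conjugate update for a Gaussian mean under a $\mathcal{N}(0, \tau_0^{-1})$ prior; in our notation, `mu_sparse` carries the residuals $r_n$ (data minus bias and CP terms) at the observed entries $\Omega$:

$$\mu \mid - \sim \mathcal{N}\big(\tilde{\mu}, \tilde{\sigma}^{2}\big), \qquad \tilde{\sigma}^{2} = \frac{1}{\tau_{\epsilon} |\Omega| + \tau_{0}}, \qquad \tilde{\mu} = \tau_{\epsilon}\, \tilde{\sigma}^{2} \sum_{n \in \Omega} r_{n}.$$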
- Define the function for sampling bias vectors:
```python
def sample_bias_vector(bias_sparse, vector, bias, ind, dim, tau_eps, tau0 = 1):
    for k in range(len(dim)):
        idx = tuple(filter(lambda x: x != k, range(len(dim))))
        temp = vector.copy()
        temp[k] = np.zeros((dim[k]))
        tau_tilde = 1 / (tau_eps * bias[k] + tau0)
        mu_tilde = tau_eps * np.sum(ind * (bias_sparse - vec_combine(temp)), axis = idx) * tau_tilde
        vector[k] = np.random.normal(mu_tilde, np.sqrt(tau_tilde))
    return vector
```
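A toy call (hypothetical shapes, everything observed) illustrates the expected inputs: `bias[k]` holds the observation counts along mode k, and the residual tensor stands in for the data with the global mean and CP term removed:

```python
# Hypothetical toy call of sample_bias_vector on a fully observed 4 x 3 x 2 tensor.
dim = np.array([4, 3, 2])
ind = np.ones(dim)   # indicator of observed entries (all observed here)
bias = [np.sum(ind, axis = tuple(i for i in range(3) if i != k)) for k in range(3)]
vector = [0.1 * np.random.randn(d) for d in dim]
residual = np.random.randn(4, 3, 2)   # stands in for sparse_tensor - mu_glb - cp_combine(factor)
vector = sample_bias_vector(residual, vector, bias, ind, dim, tau_eps = 1.0)
```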
- Define the function for sampling factor matrices:
```python
def sample_factor(tau_sparse, factor, ind, dim, k, tau_eps, beta0 = 1):
    dim, rank = factor[k].shape   # here dim is the size of mode k
    factor_bar = np.mean(factor[k], axis = 0)
    temp = dim / (dim + beta0)
    var_mu_hyper = temp * factor_bar
    var_W_hyper = inv(np.eye(rank) + cov_mat(factor[k], factor_bar)
                      + temp * beta0 * np.outer(factor_bar, factor_bar))
    var_Lambda_hyper = wishart.rvs(df = dim + rank, scale = var_W_hyper)
    var_mu_hyper = mvnrnd_pre(var_mu_hyper, (dim + beta0) * var_Lambda_hyper)
    idx = list(filter(lambda x: x != k, range(len(factor))))
    var1 = kr_prod(factor[idx[1]], factor[idx[0]]).T
    var2 = kr_prod(var1, var1)
    var3 = (var2 @ ten2mat(tau_eps * ind, k).T).reshape([rank, rank, dim]) + var_Lambda_hyper[:, :, np.newaxis]
    var4 = var1 @ ten2mat(tau_sparse, k).T + (var_Lambda_hyper @ var_mu_hyper)[:, np.newaxis]
    for i in range(dim):
        factor[k][i, :] = mvnrnd_pre(solve(var3[:, :, i], var4[:, i]), var3[:, :, i])
    return factor[k]
```
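For each row i of the factor matrix, `var3[:, :, i]` is the posterior precision and `solve(var3[:, :, i], var4[:, i])` the posterior mean of the usual Gaussian conditional (written here in our own notation):

$$\mathbf{u}^{(k)}_{i} \mid - \sim \mathcal{N}\big(\Sigma_{i}\, \mathbf{b}_{i},\ \Sigma_{i}\big), \qquad \Sigma_{i}^{-1} = \tau_{\epsilon} \sum_{j \in \Omega_{i}} \mathbf{w}_{j} \mathbf{w}_{j}^{\top} + \Lambda_{k}, \qquad \mathbf{b}_{i} = \tau_{\epsilon} \sum_{j \in \Omega_{i}} y_{j} \mathbf{w}_{j} + \Lambda_{k} \boldsymbol{\mu}_{k},$$

where the $\mathbf{w}_{j}$ are rows of the Khatri-Rao product of the other two factor matrices, the $y_{j}$ are the observed entries with the global mean and biases removed, and $(\boldsymbol{\mu}_{k}, \Lambda_{k})$ are the Gaussian-Wishart hyperparameters sampled above.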
- Define the function for sampling precision parameter:
```python
def sample_precision_tau(error_tensor, pos_obs):
    var_alpha = 1e-6 + 0.5 * len(pos_obs[0])
    var_beta = 1e-6 + 0.5 * np.linalg.norm(error_tensor, 2) ** 2
    return np.random.gamma(var_alpha, 1 / var_beta)
```
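This is the conjugate Gamma update for the noise precision (note that `np.random.gamma` is parameterized by shape and scale, hence the `1 / var_beta`):

$$\tau_{\epsilon} \mid - \sim \operatorname{Gamma}\Big(\alpha_{0} + \tfrac{1}{2}|\Omega|,\ \beta_{0} + \tfrac{1}{2} \sum_{n \in \Omega} e_{n}^{2}\Big), \qquad \alpha_{0} = \beta_{0} = 10^{-6},$$

where the $e_{n}$ are the residuals at the observed entries.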
- Define the function for BATF by using Gibbs sampling:
```python
def BATF_Gibbs(dense_tensor, sparse_tensor, vector, factor, burn_iter, gibbs_iter):
    """Bayesian Augmented Tensor Factorization (BATF) with Gibbs sampling."""
    dim = np.array(sparse_tensor.shape)
    rank = factor[0].shape[1]
    if not np.isnan(sparse_tensor).any():
        # Missing entries are encoded as zeros.
        ind = sparse_tensor != 0
        pos_obs = np.where(ind)
        pos_test = np.where((dense_tensor != 0) & (sparse_tensor == 0))
    else:
        # Missing entries are encoded as NaN.
        pos_test = np.where((dense_tensor != 0) & (np.isnan(sparse_tensor)))
        ind = ~np.isnan(sparse_tensor)
        pos_obs = np.where(ind)
        sparse_tensor[np.isnan(sparse_tensor)] = 0
    num_obs = len(pos_obs[0])
    dense_test = dense_tensor[pos_test]
    del dense_tensor
    show_iter = 200
    tau_eps = 1
    bias = []
    for k in range(len(dim)):
        idx = tuple(filter(lambda x: x != k, range(len(dim))))
        bias.append(np.sum(ind, axis = idx))
    temp = cp_combine(factor)
    temp_hat = np.zeros(len(pos_test[0]))
    tensor_hat_plus = np.zeros(dim)
    for it in range(burn_iter + gibbs_iter):
        temp = sparse_tensor - temp
        mu_glb = sample_global_mu(temp[pos_obs] - vec_combine(vector)[pos_obs], pos_obs, tau_eps)
        vector = sample_bias_vector(temp - mu_glb, vector, bias, ind, dim, tau_eps)
        del temp
        tau_sparse = tau_eps * ind * (sparse_tensor - mu_glb - vec_combine(vector))
        for k in range(len(dim)):
            factor[k] = sample_factor(tau_sparse, factor, ind, dim, k, tau_eps)
        temp = cp_combine(factor)
        tensor_hat = mu_glb + vec_combine(vector) + temp
        temp_hat += tensor_hat[pos_test]
        tau_eps = sample_precision_tau(sparse_tensor[pos_obs] - tensor_hat[pos_obs], pos_obs)
        if it + 1 > burn_iter:
            tensor_hat_plus += tensor_hat
        if (it + 1) % show_iter == 0 and it < burn_iter:
            temp_hat = temp_hat / show_iter
            print('Iter: {}'.format(it + 1))
            print('MAPE: {:.6}'.format(compute_mape(dense_test, temp_hat)))
            print('RMSE: {:.6}'.format(compute_rmse(dense_test, temp_hat)))
            temp_hat = np.zeros(len(pos_test[0]))
            print()
    tensor_hat = tensor_hat_plus / gibbs_iter
    print('Imputation MAPE: {:.6}'.format(compute_mape(dense_test, tensor_hat[pos_test])))
    print('Imputation RMSE: {:.6}'.format(compute_rmse(dense_test, tensor_hat[pos_test])))
    print()
    return tensor_hat, mu_glb, vector, factor
```
- Import traffic data:
```python
import scipy.io

dense_tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')['tensor']
random_tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/random_tensor.mat')['random_tensor']
missing_rate = 0.4

## Random missing (RM)
binary_tensor = np.round(random_tensor + 0.5 - missing_rate)
sparse_tensor = np.multiply(dense_tensor, binary_tensor)
```
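If you would rather test the non-random (fiber) missing scenario, `random_matrix.mat` can be thresholded in the same way. The sketch below is an assumption on our part: it supposes the .mat file exposes a `random_matrix` key of size road segment × day and drops each (road segment, day) fiber as a whole:

```python
## Non-random missing (NM): a sketch, mirroring the RM thresholding above.
## Assumes random_matrix has shape (road segment, day).
random_matrix = scipy.io.loadmat('../datasets/Guangzhou-data-set/random_matrix.mat')['random_matrix']
binary_matrix = np.round(random_matrix + 0.5 - missing_rate)
binary_tensor = np.broadcast_to(binary_matrix[:, :, None], dense_tensor.shape)
sparse_tensor = np.multiply(dense_tensor, binary_tensor)
```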
- Perform missing data imputation with BATF:
```python
import time

start = time.time()
dim = np.array(sparse_tensor.shape)
rank = 80
vector = []
factor = []
for k in range(len(dim)):
    vector.append(0.1 * np.random.randn(dim[k]))
    factor.append(0.1 * np.random.randn(dim[k], rank))
burn_iter = 1000
gibbs_iter = 200
BATF_Gibbs(dense_tensor, sparse_tensor, vector, factor, burn_iter, gibbs_iter)
end = time.time()
print('Running time: %d seconds'%(end - start))
```
- Check out the expected results:
```
Iter: 200
MAPE: 0.08303
RMSE: 3.58407

Iter: 400
MAPE: 0.083102
RMSE: 3.59169

Iter: 600
MAPE: 0.0830566
RMSE: 3.59025

Iter: 800
MAPE: 0.0829992
RMSE: 3.58897

Iter: 1000
MAPE: 0.0829348
RMSE: 3.58643

Imputation MAPE: 0.0828811
Imputation RMSE: 3.5845

Running time: 3592 seconds
```
If you find the project useful for your publication, please consider citing our publicly available dataset and paper:
- Urban Traffic Speed Dataset of Guangzhou, China (available at https://doi.org/10.5281/zenodo.1205229)

```bibtex
@Misc{XinyuChen2018Urban,
  author    = {Xinyu Chen and Yixian Chen and Zhaocheng He},
  title     = {Urban Traffic Speed Dataset of Guangzhou, China},
  year      = {2018},
  doi       = {10.5281/zenodo.1205228},
  keywords  = {intelligent transportation systems, urban traffic speed data, urban traffic data analytics, missing data imputation, short-term traffic prediction, traffic pattern discovery},
  publisher = {Zenodo},
}
```
- Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model (available at https://doi.org/10.1016/j.trc.2019.03.003)

```bibtex
@Article{Chen2019Missing,
  author    = {Xinyu Chen and Zhaocheng He and Yixian Chen and Yuhuan Lu and Jiawei Wang},
  title     = {Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model},
  journal   = {Transportation Research Part C: Emerging Technologies},
  year      = {2019},
  volume    = {104},
  pages     = {66--77},
  month     = jul,
  doi       = {10.1016/j.trc.2019.03.003},
  publisher = {Elsevier BV},
}
```
The MIT License (MIT)