Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

55 outlier detection #105

Merged
merged 20 commits into from
Mar 2, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
79 changes: 79 additions & 0 deletions examples/transformers/outlier_mahalanobis/OutlierMahalanobis.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
import numpy as np
from scipy.linalg import eigh

_EPSILON = 1e-8

class OutlierMahalanobis(object):
def __init__(self,n_components=3,max_n=None):
self.mean = 0
self.C = 0
self.n = 0
self.n_components = n_components
self.max_n = max_n

def score(self,features,feature_names):

nb = features.shape[0] # batch size
p = features.shape[1] # number of features
n_components = min(self.n_components,p)
if self.max_n is not None:
n = min(self.n,self.max_n) # n can never be above max_n
else:
n = self.n

# Tracking the mean and covariance matrix
roll_partial_means = features.cumsum(axis=0)/(np.arange(nb)+1).reshape((nb,1))
coefs = (np.arange(nb)+1.)/(np.arange(nb)+n+1.)
new_means = self.mean + coefs.reshape((nb,1))*(roll_partial_means-self.mean)
new_means_offset = np.empty_like(new_means)
new_means_offset[0] = self.mean
new_means_offset[1:] = new_means[:-1]

coefs = ((n+np.arange(nb))/(n+np.arange(nb)+1.)).reshape((nb,1,1))
B = coefs*np.matmul((features - new_means_offset)[:,:,None],(features - new_means_offset)[:,None,:])
cov_batch = (n-1.)/(n+nb-1.)*self.C + 1./(n+nb-1.)*B.sum(axis=0)

# PCA
eigvals, eigvects = eigh(cov_batch,eigvals=(p-n_components,p-1))

# Projections
proj_features = np.matmul(features,eigvects)
proj_means = np.matmul(new_means_offset,eigvects)
if type(self.C) == int and self.C == 0:
proj_cov = np.diag(np.zeros(n_components))
else:
proj_cov = np.matmul(eigvects.transpose(),np.matmul(self.C,eigvects))

# Outlier detection in the PC subspace
coefs = (1./(n+np.arange(nb)+1.)).reshape((nb,1,1))
B = coefs*np.matmul((proj_features - proj_means)[:,:,None],(proj_features - proj_means)[:,None,:])

all_C_inv = np.zeros_like(B)
c_inv = None
_EPSILON = 1e-8

for i, b in enumerate(B):
if c_inv is None:
if abs(np.linalg.det(proj_cov)) > _EPSILON:
c_inv = np.linalg.inv(proj_cov)
all_C_inv[i] = c_inv
continue
else:
if n + i == 0:
continue
proj_cov = (n + i -1. )/(n + i)*proj_cov + b
continue
else:
c_inv = (n + i - 1.)/float(n + i - 2.)*all_C_inv[i-1]
BC1 = np.matmul(B[i-1],c_inv)
all_C_inv[i] = c_inv - 1./(1.+np.trace(BC1))*np.matmul(c_inv,BC1)

# Updates
self.mean = new_means[-1]
self.C = cov_batch
self.n += nb

feat_diff = proj_features-proj_means
outlier_scores = np.matmul(feat_diff[:,None,:],np.matmul(all_C_inv,feat_diff[:,:,None])).reshape(nb)

return outlier_scores
298 changes: 298 additions & 0 deletions examples/transformers/outlier_mahalanobis/outlier_documentation.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,298 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mahalanobis Outlier Algorithm Documentation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The aim of the document is to explain and document the algorithm used in Seldon's Mahalanobis Online Outlier Detector.\n",
"\n",
"In the first part we give a high level overview of the algorithm, then we explain the implementation in detail."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output of the algorithm (outlier score) is a measure of distance from the center of the features distribution (Mahalanobis distance). The algorithm is online, which means that it starts without knowledge about the distribution of the features and learns as requests arrive. Consequently you should expect the output to be bad at the start and to improve over time. \n",
"\n",
"The output being a real positive number, we leave it to the user to decide on a threshold for when a point will be considered to be an outlier.\n",
"\n",
"As observations arrive, the algorithm will:\n",
"- Keep track and update the mean and sample covariance matrix of the dataset\n",
"- Apply a principal component analysis using these moments and project the new observations on the first 3 principal components (default value, can be changed)\n",
"- Compute the Mahalanobis distance from these projections to the projected mean\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core of the algorithm is using efficient and numerically stable streaming techniques to keep track of the mean and covariance matrix as new points are observed. The PCA is done by finding the eigenvectors of the covariance matrix using a function implemented in scipy. We also use an efficient algorithm to avoid inverting a new covariance matrix for each point in the batch."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Object state"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The outlier detector class has 4 attributes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"class OutlierMahalanobis(object):\n",
" def __init__(self,n_components=3,max_n=None):\n",
" self.mean = 0\n",
" self.C = 0\n",
" self.n = 0\n",
" self.n_components = n_components\n",
" self.max_n = max_n\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- mean: Online mean of the observed values\n",
"- C: Online covariance matrix of the observed values\n",
"- n: number of observations so far\n",
"- n_components: number of principal components (default = 3)\n",
"- max_n: Used to make the algorithm non stationary by capping the number of observations to max_n. When specified, the algorithm will behave like if it had seen at most max_n points, thus adapting faster to changes in the underlying distribution. Turned off by default."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### First Step: Tracking the mean and covariance matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"def score(self,features,feature_names):\n",
"\n",
" nb = features.shape[0] # batch size\n",
" p = features.shape[1] # number of features\n",
" n_components = min(self.n_components,p)\n",
" if self.max_n is not None:\n",
" n = min(self.n,self.max_n) # n can never be above max_n\n",
" else:\n",
" n = self.n\n",
"\n",
" # Tracking the mean and covariance matrix\n",
" roll_partial_means = features.cumsum(axis=0)/(np.arange(nb)+1).reshape((nb,1))\n",
" coefs = (np.arange(nb)+1.)/(np.arange(nb)+n+1.)\n",
" new_means = self.mean + coefs.reshape((nb,1))*(roll_partial_means-self.mean)\n",
" new_means_offset = np.empty_like(new_means)\n",
" new_means_offset[0] = self.mean\n",
" new_means_offset[1:] = new_means[:-1]\n",
"\n",
" coefs = ((n+np.arange(nb))/(n+np.arange(nb)+1.)).reshape((nb,1,1))\n",
" B = coefs*np.matmul((features - new_means_offset)[:,:,None],(features - new_means_offset)[:,None,:])\n",
" cov_batch = (n-1.)/(n+nb-1.)*self.C + 1./(n+nb-1.)*B.sum(axis=0)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we have implemented a numerically stable algorithm for updating the mean and covariance matrix based on the following formulas.\n",
"\n",
"**Batch Online Mean**\n",
"\n",
"Let $\\bar{X}_n = \\frac{1}{n} \\sum_{k=1}^n{X_k} $ the rolling mean of $ (X_n)_n $\n",
"\n",
"Let $\\bar{X}_{n,N} = \\frac{1}{N-n} \\sum_{k=n+1}^N{X_k} $ the batch mean of $X_n$ between $n$ and $N$\n",
"\n",
"Then we have: \n",
"\n",
"$ \\bar{X}_{n+b} = \\bar{X}_n + \\frac{b}{n+b}(\\bar{X}_{n,n+b} - \\bar{X}_n) $ $ (1) $\n",
"\n",
"**Batch Online Covariance Matrix**\n",
"\n",
"Let $C_n = \\frac{1}{n-1} \\sum_{k=1}^n{(X_k - \\bar{X}_n)(X_k - \\bar{X}_n)^t} $ the rolling sample covariance matrix of $ (X_n)_n $\n",
"\n",
"Then we have:\n",
"\n",
"$ C_{n+b} = \\frac{n-1}{n+b-1}C_{n} + \\frac{1}{n+b-1}\\sum^{b-1}_{i=0}\\frac{n+i}{n+i+1}(X_{n+i+1}-\\bar{X}_{n+i})(X_{n+i+1}-\\bar{X}_{n+i})^t $ $ (2) $\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Second Step: PCA and projection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
" # PCA\n",
" eigvals, eigvects = eigh(cov_batch,eigvals=(p-n_components,p-1))\n",
"\n",
" # Projections\n",
" proj_features = np.matmul(features,eigvects)\n",
" proj_means = np.matmul(new_means_offset,eigvects)\n",
" if type(self.C) == int and self.C == 0:\n",
" proj_cov = np.diag(np.zeros(n_components))\n",
" else:\n",
" proj_cov = np.matmul(eigvects.transpose(),np.matmul(self.C,eigvects))\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We compute the first principal components: these are the eigenvectors of the sample covariance matrix associated to the largest eigenvalues. For this we use the function [eigh](https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eigh.html) from ```scipy.linalg```.\n",
"\n",
"We then project the new batch of features, the rolling means and the previous covariance matrix on the principal components subspace.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Third Step: Outlier detection in the Principal Components Subspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Substep: Fast computation of the inverses of the covariance matrices"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compute the outlier score of each point in the new batch, we need the inverse of the covariance matrix of all the points up to this one. This means inverting $b$ matrices. We made this operation faster by leveraging the fact that each covariance matrix is a rank one update of the previous one. \n",
"Knowing this we can use the following theorem.\n",
"\n",
"**Theorem:**\n",
"\n",
"if $A$ and $A+B$ are invertible and $rank(B) = 1$ then\n",
"\n",
"$ (A + B)^{-1} = A^{-1} - \\frac{1}{1+trace(BA^{-1})}A^{-1}BA^{-1} $\n",
"\n",
"The implementation is:\n",
"\n",
"```python\n",
" coefs = (1./(n+np.arange(nb)+1.)).reshape((nb,1,1))\n",
" B = coefs*np.matmul((proj_features - proj_means)[:,:,None],(proj_features - proj_means)[:,None,:])\n",
"\n",
" all_C_inv = np.zeros_like(B)\n",
" c_inv = None\n",
" _EPSILON = 1e-8\n",
"\n",
" for i, b in enumerate(B):\n",
" if c_inv is None:\n",
" if abs(np.linalg.det(proj_cov)) > _EPSILON:\n",
" c_inv = np.linalg.inv(proj_cov)\n",
" all_C_inv[i] = c_inv\n",
" continue\n",
" else:\n",
" if n + i == 0:\n",
" continue\n",
" proj_cov = (n + i -1. )/(n + i)*proj_cov + b\n",
" continue\n",
" else:\n",
" c_inv = (n + i - 1.)/float(n + i - 2.)*all_C_inv[i-1]\n",
" BC1 = np.matmul(B[i-1],c_inv)\n",
" all_C_inv[i] = c_inv - 1./(1.+np.trace(BC1))*np.matmul(c_inv,BC1)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finally, we update the attributes and return the outlier scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
" self.mean = new_means[-1]\n",
" self.C = cov_batch\n",
" self.n += nb\n",
"\n",
" feat_diff = proj_features-proj_means\n",
" outlier_scores = np.matmul(feat_diff[:,None,:],np.matmul(all_C_inv,feat_diff[:,:,None])).reshape(nb)\n",
"\n",
" return outlier_scores\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The outlier score is the Mahalanobis distance in the Principal Components Subspace.\n",
"\n",
"$ score_n = (X_n - \\bar{X}_{n-1})^tC_{n-1}^{-1}(X_n - \\bar{X}_{n-1}) $"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
2 changes: 2 additions & 0 deletions examples/transformers/outlier_mahalanobis/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
numpy==1.11.2
scipy==0.19.1
Loading