---
title: "Multivariate Chi-Square Anomaly Classification (MCAC)"
author: "1Lt Alexander Trigo"
date: "11 March 2018"
output: 
  github_document
keep_md: yes
---
[![Build Status](https://travis-ci.org/citation891/MCAC.svg?branch=master)](https://travis-ci.org/citation891/MCAC)

## Instructions

Multivariate Chi-Square Anomaly Classification (MCAC)

MCAC provides a user interface for quick, automated analysis of cyber intrusion detection data logs. Its functionality identifies outliers and provides visual insight into the underlying structure of the data set. Currently, MCAC cannot be generalized to arbitrary datasets. The `data` directory of `MCAC` contains a `sampleData.Rda` dataset, which can be loaded into the global environment for use within RStudio. The `sampleData.csv` file in the top-level directory of `MCAC` is for use in the Shiny application. The four main MCAC functions and their instructions are provided below.

* `prepareData()`: This is the first function users should try. Use `data('sampleData')` to load the included data file into the global environment. After loading the file, running `prepareData(sampleData)` generates several objects in the global environment. These objects serve as input arguments for the `removeAnomaly()` function. The `chiSqrPlot` and `initialChiSqrPlot` objects can also be passed to the `plotQQ()` function.

* `removeAnomaly()`: Before running this function, users must generate its input arguments with the `prepareData()` function. Each call updates all objects in the global environment with a new iteration. The function can be called consecutively, or from within a `for` loop, to iterate multiple times. To run it, ensure that all input arguments are present in the global environment and run the following command: `removeAnomaly(chiSqrPlot, blocks, stateVector, outliers, error, timeData)`.

* `plotQQ()`: This function plots the Mahalanobis distances against the chi-square values for the dataset. The red line represents an ideal multivariate normal model. `plotQQ()` takes as input either a `chiSqrPlot` object generated by `prepareData()` or `removeAnomaly()`, or the `initialChiSqrPlot` object generated by `prepareData()`.

* `runMCAC()`: This function runs the Shiny app built on the previous three functions. The application will prompt you to upload a file. Please ensure you use the `sampleData.csv` file available on GitHub. Further instructions are found within the application UI.
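
The four steps above can be strung together as a short session. This is only a sketch that follows the calls quoted in the list: it assumes the `MCAC` package is installed, that `prepareData()` creates the objects named in the `removeAnomaly()` call, and that five passes is a reasonable count. `n_iterations` is a hypothetical name chosen for illustration.

```r
# Hypothetical number of removal passes; tune to your data.
n_iterations <- 5

# Assumption: the MCAC package is installed. Nothing runs otherwise.
if (requireNamespace("MCAC", quietly = TRUE)) {
  library(MCAC)

  data("sampleData")         # load the bundled sample data
  prepareData(sampleData)    # creates chiSqrPlot, blocks, stateVector, ...

  plotQQ(initialChiSqrPlot)  # Q-Q plot of the untouched data

  # Each call updates the objects in the global environment,
  # so the same call can simply be repeated.
  for (i in seq_len(n_iterations)) {
    removeAnomaly(chiSqrPlot, blocks, stateVector, outliers, error, timeData)
  }

  plotQQ(chiSqrPlot)         # Q-Q plot after the removals
}
```

Comparing the two plots gives a quick visual check of whether the removals moved the points closer to the red reference line.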



## Information

  This product is being developed for use by DoD sponsors. The target end users are cyber analysts, so MCAC will be mostly automated to ensure usability regardless of formal education in multivariate analytic techniques. Users should strive to understand what the function output tells them about the data; regardless of that understanding, a list of outliers is provided as a product of the analysis. The user's main responsibility is the input of features: they upload a raw data set containing many features and must specify exactly which of them are to be considered for analysis.
  
  We will forgo construction of a user-based R package in favor of a Shiny app, due to DoD restrictions on R usage and the prerequisite training required to use R. A Shiny app should provide a much more user-friendly experience, in which the analyst can upload data, alter several parameters, and retrieve a spreadsheet of detected outliers. The MCAC function will build upon several statistical methods and the `anomalyDetection` package.
  
  The `anomalyDetection` package features several important tools we will use to conduct MCAC analysis. First, the package includes functionality that transforms the user-uploaded data frame into a tabulated state vector format, which is conducive to multivariate analysis. Second, the package allows us to calculate the Mahalanobis distance (1) of the observations, which is crucial for the chi-square Q-Q plot analysis.
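
To make the distance calculation concrete, here is a minimal sketch in base R. It does not use `anomalyDetection` itself: it relies on `stats::mahalanobis()`, and the `state_vector` matrix is simulated stand-in data rather than real log output.

```r
# Simulated stand-in for a tabulated state vector: 200 blocks x 3 features.
# (Hypothetical data; MCAC would derive this from the uploaded log file.)
set.seed(42)
state_vector <- matrix(rnorm(200 * 3), ncol = 3,
                       dimnames = list(NULL, c("f1", "f2", "f3")))

# Plant one obviously anomalous block in the last row.
state_vector[200, ] <- c(8, -8, 8)

# stats::mahalanobis() returns the *squared* distance of each row from
# the centre, scaled by the covariance matrix.
md_squared <- mahalanobis(state_vector,
                          center = colMeans(state_vector),
                          cov    = cov(state_vector))

# Index of the block farthest from the centre of the data.
farthest <- which.max(md_squared)
farthest
```

Because the distance accounts for the covariance between features, a block can be flagged even when no single feature looks extreme on its own.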
  

  The classification of outliers based on the chi-square plot rests on the fact that the squared Mahalanobis distances of a multivariate normal population follow, approximately, a chi-square distribution with degrees of freedom equal to the number of data set features (Gnanadesikan, 1977). Because of this relationship, the Mahalanobis distances can be sorted in ascending order and plotted against a corresponding set of chi-square values. For a perfectly multivariate normal population, we would observe a straight line that begins at the origin (0,0) and runs at 45 degrees to some arbitrary distant point such as (50,50). Often, when the line does not behave well, the cause is the influence of outliers within the data set. If data points are anomalous, they are removed iteratively, and the covariance of the data set is recalculated to determine new Mahalanobis distances.
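
The plot coordinates and the iterative removal can be sketched in base R as follows. This is an illustration under stated assumptions, not MCAC's implementation: the data are simulated, `qq_coords()` is a hypothetical helper, and removing one observation per pass for a fixed five passes is a simplification chosen for the example.

```r
set.seed(1)
p <- 3                                  # number of features
x <- matrix(rnorm(300 * p), ncol = p)   # hypothetical clean data
x[1:5, ] <- x[1:5, ] + 10               # five planted anomalies

# Coordinates of a chi-square Q-Q plot: sorted squared distances against
# the matching chi-square quantiles with ncol(x) degrees of freedom.
qq_coords <- function(x) {
  d2 <- mahalanobis(x, colMeans(x), cov(x))
  data.frame(chisq = qchisq(ppoints(nrow(x)), df = ncol(x)),
             md    = sort(d2))
}

# Remove the single most distant observation, then recompute the mean,
# covariance, and distances on what remains. Repeat for five passes.
remaining <- seq_len(nrow(x))
removed   <- integer(0)
for (i in 1:5) {
  sub       <- x[remaining, , drop = FALSE]
  d2        <- mahalanobis(sub, colMeans(sub), cov(sub))
  worst     <- remaining[which.max(d2)]
  removed   <- c(removed, worst)
  remaining <- setdiff(remaining, worst)
}

sort(removed)                           # row indices flagged as anomalous
qq <- qq_coords(x[remaining, , drop = FALSE])
head(qq)
```

Calling `plot(qq$chisq, qq$md)` followed by `abline(0, 1, col = "red")` draws the Q-Q plot with the 45-degree reference line described above.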

Finally, to determine the optimal number of observations to classify as outliers, we will use the Standard Error of the Estimate as the basis of measurement. This estimate, often used in regression analysis, is given by the equation $\sigma_e = \sqrt{\frac{\sum (Y - Y')^2}{N}}$, where $Y$ is the actual observation for a given $X$, $Y'$ is the predicted $Y$ value, and $N$ is the number of observations. Ideally, as we remove anomalous observations, the data set approaches multivariate normality and this error is reduced with each iteration. We will use minimization of this value as the criterion for terminating the function and producing the classifications.
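
The formula translates directly into a few lines of base R. The sketch below assumes that the predicted value $Y'$ is the point on the 45-degree reference line, so that it equals the chi-square quantile. `standard_error()` and `qq_error()` are hypothetical helper names, and the data are simulated.

```r
# Standard error of the estimate: sqrt(sum((Y - Y')^2) / N).
standard_error <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  sqrt(sum((observed - predicted)^2) / length(observed))
}

# Error of a data set's Q-Q points relative to the line y = x
# (assumption: Y' is the chi-square quantile itself).
qq_error <- function(x) {
  d2 <- sort(mahalanobis(x, colMeans(x), cov(x)))
  standard_error(d2, qchisq(ppoints(nrow(x)), df = ncol(x)))
}

set.seed(7)
x <- matrix(rnorm(200 * 2), ncol = 2)   # hypothetical clean data
x[1:3, ] <- x[1:3, ] + 12               # three planted anomalies

# Record the error before each removal, dropping the most distant
# observation at every pass.
errors  <- numeric(0)
current <- x
for (i in 0:6) {
  errors  <- c(errors, qq_error(current))
  d2      <- mahalanobis(current, colMeans(current), cov(current))
  current <- current[-which.max(d2), , drop = FALSE]
}

# Number of removals at which the recorded error was smallest.
best_removals <- which.min(errors) - 1
best_removals
```

Stopping at the smallest recorded error is one simple reading of the minimization criterion; a fixed search window of six removals is used here only to keep the example short.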


## Delivery and Schedule

|Feature|Priority|Status|Value|Input|Output|Use|Deadline Viability?|Needed?|
|:----------------|:---:|:-------:|:---------------------:|:-------------:|:---------------:|:-------------:|:---:|:---:|
|Data Upload: Upload the raw data| 1 |Complete| Allow user to upload their raw data set| csv file  | N/A | N/A  | Yes | Yes | 
|Automatic Data Cleaning: Prep data for analysis and then invoke anomalyDetection functionality| 2 | Complete  | Select desired features and format data into a usable form  | N/A | N/A | N/A| Yes | Yes |
|Automatic Time Vector: Assign which feature corresponds to the time vector|3| Complete  | Generates an index of times with which to classify anomalies | Block size/Time Feature | Vector of Time/Date Ranges  | Better outlier location description  | Yes| Yes |
|Classify Outliers: Function will eliminate anomalous data/re-plot|4| Complete  | Remove and classify anomalies, re-plot  | Pre-processed data  | Chi-Square Q-Q plot/anomalies | Find anomalies and assess multivariate normality | Yes  | Yes  |
|Generate Plot: Initial Chi-Square Q-Q Plot Generated| 5 | Complete | View untouched data | Initial MD and Chi Square Vectors | Initial Q-Q Plot | Inspect untouched data structure | Likely| No |
|Export: Indexed anomalies exported to csv file| 6 | Complete | Export analysis results | Outlier Vector | csv file| Easy snapshot of results  | Maybe | No |
|Manual Data Cleaning: Allow users to input which features to keep for analysis|7| Not started  | Allow greater analytic flexibility |  Feature Names | See 1-5 | Adapt to raw data format change/new feature criteria  | Unlikely  | No |
|Manual Time Vector Input: Allows user to define which time vector to use|8| Not started | Allow for new time vector feature| Feature Name | See 3| Adapt to raw data format change/new feature criteria   | Unlikely  | No |
|Manual Threshold: Allows user to define the size of function data threshold| 9 | Complete  | Change threshold default from 3% | Threshold Percentage| See 4 | Consider greater data range for classification  | Unlikely | No |
|Manual Block Size Input: Allows user to define the size of state vector block size|10| Not started  | Impacts how many observations are allocated to a time block | Whole integer| Altered time vector and state vector| User can alter size of time block and time vector | Unlikely  | No |
