GAD stands for General Anomaly Detector. It was once part of SADIT; the two packages can now be used either jointly or separately. Their main differences are:
- SADIT focuses on providing an integrated interface for generating test data and evaluating algorithms.
- GAD focuses on providing a collection of anomaly detection algorithms.
If you are interested in our recent publications (see below) on network anomaly detection and want to use them as references, please cite the repository SADIT/GAD together with:
Wang, Jing, et al. "Network anomaly detection: A survey and comparative analysis of stochastic and deterministic methods." Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on. IEEE, 2013.
Jing Wang and I. Ch. Paschalidis, "Statistical Traffic Anomaly Detection in Time-Varying Communication Networks", IEEE Transactions on Control of Network Systems, in print.
Jing Wang and I. Ch. Paschalidis, "Robust Anomaly Detection in Dynamic Networks", Proceedings of the 22nd Mediterranean Conference on Control and Automation (MED 14), pages 428-433, June 16--19, 2014, Palermo, Italy.
Jing Zhang and I. Ch. Paschalidis, "An Improved Composite Hypothesis Test for Markov Models with Applications in Network Anomaly Detection," Proceedings of the 54th IEEE Conference on Decision and Control, pp. 3810-3815, December 15-18, 2015, Osaka, Japan.
Jing Zhang and I. Ch. Paschalidis, "Statistical Anomaly Detection via Composite Hypothesis Testing for Markov Models," IEEE Transactions on Signal Processing, submitted, 2017. arXiv:1702.08435
GAD can be installed on Linux, Mac OS X, and Windows (through Cygwin) with Python 2.7. However, we strongly recommend a Debian-based OS, e.g., Ubuntu 12.04, 14.04, or 16.04, for which we have prepared a one-command installation script. We also recommend Anaconda2 as the Python environment, since conda makes it easy to manage external packages.
To be specific, if you are working on Ubuntu, proceed as follows:
- Change the working directory to where you want to install GAD, create a new folder gad, and then type:
$ git clone --recursive https://github.com/hbhzwj/GAD.git gad/
- Change the working directory to gad/install, and then type:
gad/install$ sudo sh debian.sh
- Make sure socket.io and socketIO-client are installed as well. You may use $ npm install socket.io to install the former, and refer to https://pypi.python.org/pypi/socketIO-client to make socketIO-client work on your machine.
If you want to install GAD on other types of OS, you may refer to the following:
For Mac users: after cloning the GAD package, change the working directory to ROOT/install, and then type:
sudo python setup-dep.py
The packages ipaddr, networkx, pydot, pyparsing, and py-radix will be downloaded and installed automatically. If you just want to use the Detector part (i.e., GAD), that is already enough. (For SADIT users) If you want to use the Configure and Simulator parts, you also need to install numpy and matplotlib. Please go to http://www.scipy.org/NumPy and http://matplotlib.sourceforge.net/faq/installing_faq.html for installation instructions.
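To confirm that the dependencies listed above are actually importable after installation, a small check like the following may help. It is only a sketch: the import names are assumptions (in particular, py-radix is assumed to import as radix), and the helper missing_modules is hypothetical, not part of GAD.

```python
# Import names are assumptions; py-radix is assumed to import as "radix".
DETECTOR_DEPS = ["ipaddr", "networkx", "pydot", "pyparsing", "radix"]
SADIT_EXTRA_DEPS = ["numpy", "matplotlib"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for label, deps in [("Detector", DETECTOR_DEPS),
                        ("SADIT extras", SADIT_EXTRA_DEPS)]:
        missing = missing_modules(deps)
        # Single-string print works on both Python 2.7 and Python 3.
        print("%s missing: %s" % (label, ", ".join(missing) if missing else "none"))
```

An empty result for the Detector list suggests that the Detector part is ready to use; the second list only matters for SADIT users.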
GAD should also be installable on Windows machines with the help of Cygwin.
(For SADIT users) If the automatic methods fail, you can install SADIT manually.
SADIT has been tested on Python 2.7.2. SADIT depends on all the software that fs-simulator depends on. In addition, it requires:
- numpy
- matplotlib
- profilehooks
If you are using a Debian-based system, you can just type:
sudo apt-get install python-dev
sudo apt-get install python-numpy
sudo apt-get install python-matplotlib
For other operating systems, please refer to the corresponding websites for instructions on installing numpy and matplotlib.
Type $ ./cmdgad <exper> -m <method> -h to get the help message (shown below).
usage: cmdgad [-h] [-c CONFIG] [--logging LOGGING] [-d DATA] [-m METHOD]
[--help_method HELP_METHOD] [--data_type DATA_TYPE]
[--feature_option FEATURE_OPTION] [--export_flows EXPORT_FLOWS]
[--pic_name PIC_NAME] [--pic_show] [--csv CSV]
optional arguments:
-h, --help print help message
-c CONFIG, --config CONFIG
config
--logging LOGGING logging level. See
https://docs.python.org/2/library/logging.html#levels
-d DATA, --data DATA --data [filename] will simply detect the flow file,
simulator will not run in this case
-m METHOD, --method METHOD
--method [method] will specify the method to use.
Avaliable options are: ['gen_fb_mb': FBAnoDetector
model free and model based together, will be faster
then run model free | 'robust': RobustDetector Robust
Detector is designed for dynamic network environment |
'2w': TwoWindowAnoDetector Two Window Stochastic
Anomaly Detector. | 'speriod': PeriodStaticDetector
Reference Empirical Measure is calculated by
periodically selection. | 'mb': ModelBaseAnoDetector
Model based approach, use Markovian Assumption |
'gen_fb_mf': FBAnoDetector model free and model based
together, will be faster then run model free |
'two_win': TwoWindowAnoDetector Two Window Stochastic
Anomaly Detector. | 'mf': ModelFreeAnoDetector Model
Free approach, use I.I.D Assumption | 'mfmb':
FBAnoDetector model free and model based together,
will be faster then run model free | 'period':
PeriodStoDetector Stochastic Detector Designed to
Detect Anomaly when the]. If you want to compare the
results of several methods, simple use / as seperator,
for example [gen_fb_mb/robust/2w/speriod/mb/gen_fb_mf/
two_win/mf/mfmb/period]
--help_method HELP_METHOD
print the detailed help message for a method.
Avaliable method [gen_fb_mb | robust | 2w | speriod |
mb | gen_fb_mf | two_win | mf | mfmb | period]
--data_type DATA_TYPE
--specify the type of the data you use, the availiable
option are: ['fs': MEM_FS Data generated by `fs-
simulator | 'xflow': MEM_Xflow Data generated by xflow
tool. | 'pt': PT_Data Pytables format. (HDF5 format).
| 'pcap2netflow': MEM_Pcap2netflow Data generated
pcap2netflow, (the | 'Sperotto': SperottoIPOM Data
File wrapper for SperottoIPOM2009 format. | 'csv':
CSVFile | 'flow_exporter': MEM_FlowExporter Data
generated FlowExporter. It is a simple tool to convert
pcap to]
--feature_option FEATURE_OPTION
specify the feature option. feature option is a
dictionary describing the quantization level for each
feature. You need at least specify 'cluster' and
'dist_to_center'. Note that, the value of 'cluster' is
the cluster number. The avaliability of other features
depend on the data handler.
--export_flows EXPORT_FLOWS
specify the file name of exported abnormal flows.
Default is not export
--pic_name PIC_NAME picture name for the detection result
--pic_show whether to show the picture after finishing running
--csv CSV the path of the file to save plots a text output
----------------------------------------------------------------------
usage: cmdgad [-h] [--interval INTERVAL] [--win_size WIN_SIZE]
[--win_type WIN_TYPE] [--max_detect_num MAX_DETECT_NUM]
[--normal_rg NORMAL_RG] [--hoeff_far HOEFF_FAR]
[--entropy_th ENTROPY_TH] [--enable_sanov] [--lw LW]
[-r REF_SCHECK] [--days DAYS] [--alpha ALPHA] [--lamb LAMB]
[--ref_data REF_DATA]
optional arguments:
-h, --help show this help message and exit
--interval INTERVAL interval between two consequent detection
--win_size WIN_SIZE window_size
--win_type WIN_TYPE window type 'flow'|'time'
--max_detect_num MAX_DETECT_NUM
max detection number, useful for debug
--normal_rg NORMAL_RG
normal range, when it is none, use the whole data as
the norminal data set
--hoeff_far HOEFF_FAR
false alarm rate for hoeffding rule, if this parameter
is set while entropy_th parameter is not set, will
calculate threshold according to hoeffding rule.
Increase hoeff_far will decrease threshold
--entropy_th ENTROPY_TH
entropy threshold to determine the anomaly, has higher
priority than hoeff_far
--enable_sanov whether or not to use Sanov's theorem to estimate the
threshold
--lw LW line width of the plot
-r REF_SCHECK, --ref_scheck REF_SCHECK
['dump <file>', 'load <file>']. whether to load the
precomputed reference self check data or calculate and
dump it. If <file> is not specfied, its default value
is "desc['dump_folder']/PLManager_scheck.pk"
--days DAYS number of days the simulated test data lasts;
default=7
--alpha ALPHA weight of minimum threshold determining the up-bound
of nominal cross-entropy; should be within (0, 1),
default=0.5
--lamb LAMB manual up-bound for nominal cross entropy; only when
lamb>0, use its value; has higher priority than alpha
--ref_data REF_DATA name for reference file
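The help text states that --entropy_th has higher priority than --hoeff_far, and that increasing hoeff_far decreases the threshold. The sketch below illustrates that priority. The function choose_threshold and its parameter n_samples are hypothetical, and the formula -log(far) / n is an assumed asymptotic form of the Hoeffding rule; GAD's own computation may include additional terms.

```python
import math


def choose_threshold(entropy_th=None, hoeff_far=None, n_samples=None):
    """Pick a detection threshold following the priority in the help text.

    entropy_th, when given, wins. Otherwise hoeff_far is converted into a
    threshold with the assumed asymptotic rule -log(far) / n_samples.
    """
    if entropy_th is not None:
        return entropy_th
    if hoeff_far is not None:
        if not 0.0 < hoeff_far <= 1.0:
            raise ValueError("hoeff_far must lie in (0, 1]")
        if not n_samples or n_samples <= 0:
            raise ValueError("n_samples must be positive")
        return -math.log(hoeff_far) / n_samples
    raise ValueError("set either entropy_th or hoeff_far")
```

Under this assumed rule, a false alarm rate of 0.1 yields a larger threshold than 0.9 for the same window size, which matches the behavior described in the help text.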
Each experiment provides a subcommand that has certain functionality.
We give some sample commands (experiments) as follows:
Subcommand detect: run detection on the data directly and plot the result.
Examples:
$ ./cmdgad detect -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mfmb --pic_show --lw 3
$ ./cmdgad detect -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mb --pic_show --lw 3
$ ./cmdgad detect -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mfmb --pic_show --hoeff_far 0.9999
$ ./cmdgad detect -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mb --pic_show --hoeff_far 0.9999
$ ./cmdgad detect -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mfmb --pic_show --hoeff_far 0.1
$ ./cmdgad detect -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mfmb --pic_show --hoeff_far 0.001
$ ./cmdgad detect -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mfmb --pic_show --enable_sanov
$ ./cmdgad detect -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mf --pic_show --enable_sanov
$ ./cmdgad detect -c ./example-configs/robust-detect.py -d ./test-data/n0_flow_ref.txt -m robust -r='dump test-data/sc.pk'
$ ./cmdgad detect -c ./example-configs/robust-detect.py -d ./test-data/n0_flow.txt -m robust -r='load test-data/sc.pk' --pic_show --days 0.2
Subcommand detectrealtime: run detection on the data and send data to a web interface to visualize in real time. It requires Node.js; for Node.js installation, refer to http://nodejs.org/.
Examples:
First, cd to the gad-ui/ folder and run the following command:
$ python -m SimpleHTTPServer
or, provided that you have installed python3:
$ python3 -m http.server
Either will start a web server. You will get responses similar to:
jzh@jzh:~/Research/Anomaly_Detection/gad/gad-ui$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
127.0.0.1 - - [22/Oct/2014 10:53:36] "GET /dashboard.html HTTP/1.1" 200 -
127.0.0.1 - - [22/Oct/2014 10:53:36] "GET /node_modules/socket.io/node_modules/socket.io-client/dist/socket.io.min.js HTTP/1.1" 200 -
127.0.0.1 - - [22/Oct/2014 10:53:36] "GET /frameworks/jquery-1.10.2.min.js HTTP/1.1" 200 -
127.0.0.1 - - [22/Oct/2014 10:53:36] "GET /frameworks/d3.v3.min.js HTTP/1.1" 200 -
127.0.0.1 - - [22/Oct/2014 10:53:36] "GET /js/dash.js HTTP/1.1" 200 -
127.0.0.1 - - [22/Oct/2014 10:53:39] "GET /dashboard.html HTTP/1.1" 200 -
......
Then open a web browser and go to the following address:
http://localhost:8000/dashboard.html
You will see a chart.
Next, open another command window, cd to the gad-ui/ folder, and run the following command:
node server.js
You will get responses similar to:
jzh@jzh:~/Research/Anomaly_Detection/gad/gad-ui$ node server.js
info - socket.io started
debug - client authorized
info - handshake authorized TcnFyzPltVRfrAQCVUAK
debug - setting request GET /socket.io/1/websocket/TcnFyzPltVRfrAQCVUAK
debug - set heartbeat interval for client TcnFyzPltVRfrAQCVUAK
debug - client authorized for
debug - websocket writing 1::
......
After finishing all these prerequisites, run the following command in the gad/ folder:
./cmdgad detectrealtime -c ./example-configs/detect-config.py -d ./test-data/n0_flow.txt -m mfmb --srv=127.0.0.1:3000
The realtime detection results will show up in the webpage above.
Subcommand detectcompare: run several methods on a dataset and save the results for future comparison.
Examples:
$ ./cmdgad detect -c ./example-configs/robust-detect.py -d ./test-data/n0_flow.txt -m robust -r='dump test-data/sc.pk' --lamb=0.2
$ ./cmdgad detectcompare -c ./example-configs/compare-detect.py -d ./test-data/n0_flow.txt -p mfmb,robust
Subcommand eval: calculate the ROC curve of a method.
Examples:
cd to gad and run the following two commands sequentially:
$ ./cmdgad eval -c example-configs/eval-config.py --res_folder=res/ --ab_flows_data test-data/test_ab_flow.txt
$ ./cmdgad eval -c example-configs/eval-config.py --res_folder=res/ --ab_flows_data test-data/test_ab_flow.txt --plot
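For readers unfamiliar with ROC curves, the following sketch shows how the points of such a curve can be computed from per-window anomaly scores and ground-truth labels by sweeping the detection threshold. It is a generic illustration, not the code behind the eval subcommand; the function roc_points and its inputs are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    scores: anomaly score per window (higher means more anomalous).
    labels: True where the window is truly anomalous.
    A window is flagged when its score is >= the threshold.
    """
    positives = sum(1 for l in labels if l)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one anomalous and one normal window")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= th and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= th and not l)
        points.append((fp / float(negatives), tp / float(positives)))
    return points


# Four windows, two of which are truly anomalous.
print(roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False]))
# -> [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1); a better detector reaches a high true positive rate while the false positive rate is still low.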
All the detection algorithms are located in the ROOT/gad/Detector folder:
- SVMDetector.py contains two SVM-based anomaly detection algorithms: 1. SVM Temporal Detector and 2. SVM Flow by Flow Detector.
- StoDetector.py contains two anomaly detection algorithms based on Large Deviation Theory.
- RobustDetect.py contains an algorithm that works robustly in dynamic network environments.
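To give a rough idea of the model-free approach (I.I.D. assumption) mentioned in the help text, the sketch below compares the empirical measure of each window of quantized flows against a reference empirical measure using relative entropy, and flags windows whose divergence exceeds a threshold. This is a simplified illustration under stated assumptions, not the implementation in StoDetector.py; all function names are hypothetical, windows are assumed to be non-empty, and every symbol is assumed to belong to the given alphabet.

```python
import math
from collections import Counter


def empirical_measure(symbols, alphabet):
    """Relative frequency of each symbol of `alphabet` in `symbols`."""
    counts = Counter(symbols)
    total = float(len(symbols))
    return dict((a, counts[a] / total) for a in alphabet)


def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q); q is floored at eps to avoid log(0)."""
    return sum(pv * math.log(pv / max(q[a], eps))
               for a, pv in p.items() if pv > 0)


def detect(windows, reference, alphabet, threshold):
    """Flag each window whose divergence from the reference exceeds the threshold."""
    ref = empirical_measure(reference, alphabet)
    return [relative_entropy(empirical_measure(w, alphabet), ref) > threshold
            for w in windows]


# The first window matches the reference; the second one does not.
print(detect([[0, 0, 0, 1], [1, 1, 1, 1]], [0, 0, 0, 1], [0, 1], 0.1))
# -> [False, True]
```

The model-based variant replaces the per-symbol frequencies with frequencies of consecutive symbol pairs, in line with the Markovian assumption.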
GAD supports not only the text output format of fs-simulator (see also https://github.com/hbhzwj/SADIT for details) but also several other types of flow data. The data wrapper classes are defined in the ROOT.gad.Detector.Data module, and the handler classes are located in the ROOT.gad.Detector.DataHandler module.
In order to use data in a new format, you need to implement two new classes:
First, a data class that satisfies the Data interface (Data.py, Line 9). Namely, such a class has to provide at least the following three functions:
- get_rows: row slicing.
- get_where: get the range of rows that satisfy a criterion.
- get_min_max: get the min and max values of a certain feature over a certain range.
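A minimal in-memory class providing these three functions might look like the sketch below. The class name ListData and all method signatures are assumptions made for illustration; please check the Data interface in Data.py for the actual signatures before implementing your own class.

```python
class ListData(object):
    """In-memory table offering get_rows, get_where and get_min_max.

    Signatures are assumptions for illustration, not GAD's actual interface.
    Rows are assumed to be sorted by the field used in get_where.
    """

    def __init__(self, rows, fields):
        self.rows = rows  # list of tuples, one per flow
        self.idx = dict((f, i) for i, f in enumerate(fields))

    def get_rows(self, fields=None, rg=None):
        """Row slicing: rows in index range rg=(start, end), limited to `fields`."""
        start, end = rg if rg is not None else (0, len(self.rows))
        rows = self.rows[start:end]
        if fields is None:
            return rows
        cols = [self.idx[f] for f in fields]
        return [tuple(r[c] for c in cols) for r in rows]

    def get_where(self, field, lo, hi):
        """Index range (first, last + 1) of rows with lo <= row[field] < hi."""
        c = self.idx[field]
        hits = [i for i, r in enumerate(self.rows) if lo <= r[c] < hi]
        return (hits[0], hits[-1] + 1) if hits else (0, 0)

    def get_min_max(self, field, rg=None):
        """Min and max of `field` over the index range rg."""
        values = [r[0] for r in self.get_rows([field], rg)]
        return min(values), max(values)


data = ListData([(0.0, 10), (1.0, 20), (2.0, 5), (3.0, 40)],
                ["start_time", "flow_size"])
rg = data.get_where("start_time", 1.0, 3.0)   # rows with 1.0 <= start_time < 3.0
print(data.get_min_max("flow_size", rg))       # -> (5, 20)
```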
The package includes several data classes, all located in Data.py. In some cases, you can reuse existing classes.
- MEM_DiskFile: base class for disk file data.
- MEM_FS: disk file generated by fs-simulator.
- MEM_FlowExport: disk file generated by the FlowExport tool.
- MySQLDatabase: base class for data in disk file.
Second, a data handler class that implements data preprocessing, e.g., quantization.
- QuantizeDataHandler: quantizes the input data.
- IPHandler: for logs with IP addresses. It first clusters the IPs and replaces each IPv4 address with a (cluster label, dist to cluster center) pair.
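As a rough illustration of the quantization step, the sketch below maps a continuous feature onto a fixed number of levels with equal-width bins, which is the role the quantization level in --feature_option plays for each feature. The function quantize is hypothetical, and equal-width binning is an assumption; the scheme used by QuantizeDataHandler may differ.

```python
def quantize(values, levels, lo=None, hi=None):
    """Map each value to a bin index in 0 .. levels - 1 using equal-width bins."""
    if levels < 1:
        raise ValueError("levels must be >= 1")
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:  # constant feature: everything falls in bin 0
        return [0 for _ in values]
    width = (hi - lo) / float(levels)
    # A value equal to hi would land in bin `levels`; clamp it into the last bin.
    return [min(max(int((v - lo) / width), 0), levels - 1) for v in values]


print(quantize([0.0, 2.5, 7.5, 10.0], 4))  # -> [0, 1, 3, 3]
```

After quantization, each flow is represented by a tuple of small integers, so its empirical distribution over a window can be estimated from a modest number of samples.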
Then you just need to add your data_handler to data_handler_handle_map, defined in ROOT/gad/Detector/API.py.
Please see the LICENSE file.
Jing (Conan) Wang
Jing Wang obtained his Ph.D. degree in Fall 2014 from the Division of Systems Engineering, Boston University (advised by Professor Yannis Paschalidis). His main interest is mathematical modeling, i.e., constructing mathematical models of the real world and using them to solve practical problems.
Email: wangjing@bu.edu
Personal Webpage: https://wangjingpage.wordpress.com/
Jing Zhang
Jing Zhang is currently a Ph.D. student in the Division of Systems Engineering, Boston University (advised by Professor Yannis Paschalidis).
Email: jzh@bu.edu
Personal Webpage: http://people.bu.edu/jzh/
Last updated on 10/24/2016 (By Jing Z.)