cmu-is-projects/67-495-Project

67-495 Intrusion Analytics Project

Advisor: Prof H

Student: Yingjing Lu

Lessons learned:

Conducted Research on Online Intrusion Detection Algorithms

This was my first exposure to intrusion detection algorithms and my first time conducting academic research. I read a lot of literature on related topics and learned that intrusion detection through server logs is a hard problem: logs arrive in huge volumes, and intrusions must be detected in real time with low latency. With this in mind, I started to research algorithms that can learn intrusion types quickly and scale to very high-dimensional data.

The algorithm we finally settled on was the Extreme Learning Machine. It is a single-layer neural network that does not use backpropagation as its learning algorithm. With RBM as the update, the algorithm was able to adapt to high-dimensional data and combine the accuracy of neural networks with the learning efficiency of SVMs.
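To make the idea concrete, here is a minimal numpy sketch of a basic Extreme Learning Machine. It is not the project's implementation: the `tanh` activation, the hidden size, the ridge term `reg`, and the names `train_elm` and `predict_elm` are all assumptions made for illustration, and the RBM update mentioned above is not covered. The point it shows is that the hidden layer is random and fixed, so only the output weights are solved for, in closed form.

```python
import numpy as np


def train_elm(X, y, n_hidden=64, reg=1e-3, seed=0):
    """Fit a basic ELM: random hidden layer, closed-form output weights."""
    rng = np.random.default_rng(seed)
    # Hidden-layer weights and biases are drawn once and never updated.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)
    # One-hot targets, then ridge-regularized least squares for the output weights.
    T = np.eye(int(y.max()) + 1)[y]
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta


def predict_elm(model, X):
    """Return the predicted class index for each row of X."""
    W, b, beta = model
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)


# Toy two-class data (synthetic, for illustration only; not the KDD dataset).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(-2, 1, (100, 5)), rng.normal(2, 1, (100, 5))])
y_demo = np.array([0] * 100 + [1] * 100)
model = train_elm(X_demo, y_demo)
demo_accuracy = float(np.mean(predict_elm(model, X_demo) == y_demo))
```

Because training reduces to a single linear solve, there is no iterative gradient loop, which is the usual argument for the method's training speed.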

On the advice of Prof H, I used the classical KDD intrusion dataset as the training dataset for the algorithm. This dataset contains a huge amount of server log data with many different types of intrusions. I started by working through the equations in the paper and converting them into code. In the first testing round, my algorithm was implemented in numpy and took 4 hours to train, and the accuracy was only 71.5%. I then searched online for ways to tune related machine learning algorithms. I also projected the data and visualized it, and found that the dataset was very sparse and contained a lot of noise, which might be why the algorithm was not performing well. From these techniques, I learned to use batch normalization to reduce sparsity and to use cross-validation to evaluate the algorithm more efficiently.
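The evaluation setup can be sketched roughly as follows. This is an illustration under assumptions, not the code used in the project: per-feature standardization stands in for the normalization step, and `kfold_indices`, `standardize`, the fold count, and the placeholder data are hypothetical. The detail worth showing is that the scaling statistics come from the training fold only, so the validation fold stays unseen.

```python
import numpy as np


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs covering every sample exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]


def standardize(train, val, eps=1e-8):
    """Scale both splits with the mean and std of the training split only."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / (std + eps), (val - mean) / (std + eps)


X_demo = np.arange(40, dtype=float).reshape(20, 2)  # placeholder feature matrix
splits = list(kfold_indices(len(X_demo), k=5))
train_scaled, val_scaled = standardize(X_demo[splits[0][0]], X_demo[splits[0][1]])
```

In a full run, the model would be trained and scored once per fold and the accuracies averaged.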

After 4 weeks of tuning and discussion, we improved the accuracy from 71.5% to 92.3%. This is higher than the 85.3% accuracy of our selected regression tree baseline.

Implemented the Algorithm and Further Optimized Its Efficiency

We were able to boost accuracy a lot, but one problem remained: the algorithm took a long time to train. This is a particular disadvantage for intrusion detection, because a detection algorithm should be quick and adaptive. With a long training time, the algorithm could not detect intrusions on site.

I researched the options and decided to reimplement the algorithm in tensorflow. With GPU acceleration, I was able to drop training time from 4 hours to 6 minutes for 6000 samples. With this speedup, the algorithm was able to run in real time with less than 1 ms of latency. Even though this was my first time using tensorflow, which has a completely different programming logic, I eventually mastered it through trial and error, and I believe this skill will be helpful for future machine learning research.

Learned and Mastered Django and Related Dependencies

After completing the detection package, the next step was to make the algorithm usable through a graphical user interface and APIs. The framework we chose was Django. I had learned to develop fully functional web apps in 67-272 using Ruby on Rails' MVC framework, so this was a perfect opportunity to apply that knowledge and skill to a real use case. The app was developed in an agile fashion, the same way I worked in 67-272. After 5 weeks of learning and practice, I was able to develop Django applications fluently.

Compiled and Stored Huge Amounts of Data and Loaded and Processed It with Low Latency

Storing weblog data requires a lot of space, and querying it requires low latency. Django's default SQLite backend needed a lot of space and took up to 2 seconds per query when displaying data, which is not practical in a real-world use case. My solution was to compile the data into batches stored as numpy arrays, which allows batch retrieval and data compilation. Loading the data takes less than 1 second.
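The batching approach could look roughly like the sketch below. The file naming scheme, the batch size, and the helper names `save_batches` and `load_batch` are hypothetical and chosen only for illustration; the project's actual layout on disk is not described here. The sketch writes records as consecutive `.npy` files so that a page can be served by reading one file instead of querying a database.

```python
import os
import tempfile

import numpy as np

BATCH_SIZE = 100  # hypothetical batch size, chosen to match one page of records


def save_batches(records, directory, batch_size=BATCH_SIZE):
    """Write the records as consecutive .npy files; return the batch count."""
    n_batches = (len(records) + batch_size - 1) // batch_size
    for i in range(n_batches):
        chunk = records[i * batch_size:(i + 1) * batch_size]
        np.save(os.path.join(directory, "batch_{:05d}.npy".format(i)), chunk)
    return n_batches


def load_batch(directory, index):
    """Load a single batch without touching the other files."""
    return np.load(os.path.join(directory, "batch_{:05d}.npy".format(index)))


demo_dir = tempfile.mkdtemp()
demo_records = np.random.default_rng(0).random((250, 8))  # synthetic rows
n_saved = save_batches(demo_records, demo_dir)
last_batch = load_batch(demo_dir, n_saved - 1)
```

The trade-off is that this gives up ad hoc queries in exchange for fast sequential reads, which suits a viewer that only ever pages through records in order.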

Built Charts with D3.js and Integrated Them into the Django Framework

Visualization with D3.js was the last part. There were few tutorials on how to integrate the Django framework with D3.js. I learned D3 and, within 2 weeks, had the graph correctly displaying intrusions in Django pages.

To run:

Requirements: Python 3.5, Django, tensorflow 1.0+, numpy, scipy, and the standard-library csv and json modules.

Features and Functionalities:

New Modules:

Click the "+" sign on the home page and type in the related information. For the data CSV path, type the absolute path of the CSV file to import, which must contain the corresponding fields. For convenience, I included a sample CSV of the data this algorithm was trained on, called 'test.csv', in the repository.

Dashboard:

All new data packages are imported and stored in separate modules, which can be displayed on the dashboard.

View Module:

Click on View Module to go to the module views. Each page contains 100 data entries. Red dots on the chart are intrusions computed in real time. Without a GPU, the page may load with delays. The data entries identified as intrusions are displayed in the table below the chart.
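As a rough illustration of how one page of this view could be prepared, the sketch below slices 100 entries and builds a JSON payload with the points for the chart and the flagged rows for the table. The `page_payload` helper, the score threshold, and the field names are assumptions for this example and are not taken from the project's code.

```python
import json

PAGE_SIZE = 100  # each page shows 100 data entries


def page_payload(scores, page, threshold=0.5, page_size=PAGE_SIZE):
    """Build the JSON payload for one page: all points plus the flagged rows."""
    start = (page - 1) * page_size
    window = scores[start:start + page_size]
    points = [
        {"index": start + i, "score": s, "intrusion": s >= threshold}
        for i, s in enumerate(window)
    ]
    # Flagged entries feed the table; the chart can color them red.
    flagged = [p for p in points if p["intrusion"]]
    return json.dumps({"page": page, "points": points, "flagged": flagged})


# Synthetic scores: every 25th record is marked as suspicious.
demo_scores = [0.9 if i % 25 == 0 else 0.1 for i in range(250)]
demo_payload = json.loads(page_payload(demo_scores, page=3))
```

A Django view could return a string like this to the template, where the D3 code reads `points` for the chart and `flagged` for the table.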

Conclusion:

Overall, the entire project was completed on schedule. In the process I personally learned a lot. I got to experience literature review, tuning a machine learning algorithm, and using the tensorflow package. I was also able to apply the knowledge I learned in 67-272 to develop a fully functional web application. It was a great learning experience, and I would like to thank Prof H for the help along the way.
