Cyber-attack classification in the network traffic database using NSL-KDD dataset

Classification is the process of dividing data elements into specific classes based on their values. It is a type of supervised learning, which means the data are labelled. This project, “Cyber-attack classification in the network traffic database”, applies commonly used machine learning classification algorithms to distinguish normal network traffic from attack classes. The NSL-KDD dataset has 5 classes, so this is a multi-class classification problem in which algorithms can be used to identify each class from a record's attributes. In this report, I discuss classification algorithms such as logistic regression, naive Bayes, decision trees, random forests, and several more. The performance of the models is evaluated with measures such as accuracy, precision, and recall, and I also compare their running times. All experiments were run on a Macintosh with a 2.9 GHz quad-core Intel Core i7 CPU and 16 GB of RAM.
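Each algorithm follows the same pipeline: load the data, encode the categorical features, fit the model, and score it on a held-out split. Below is a minimal sketch of that pipeline using scikit-learn's random forest. The file name `KDDTrain+.txt`, the column positions, and the partial attack-to-class mapping are illustrative assumptions, not the exact code behind the table below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("KDDTrain+.txt", header=None)      # hypothetical file path

# Columns 1-3 (protocol_type, service, flag) are categorical; one-hot encode.
X = pd.get_dummies(df.iloc[:, :-2])

# The second-to-last column holds the attack name; group it into the 5
# classes. Illustrative subset of the mapping, not the full attack list.
five_class = {"normal": "benign", "neptune": "dos", "smurf": "dos",
              "satan": "probe", "ipsweep": "probe",
              "guess_passwd": "r2l", "buffer_overflow": "u2r"}
labels = df.iloc[:, -2]
y = labels.map(five_class).fillna(labels)           # keep unmapped names as-is

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)

# Weighted precision/recall/F1, the same metrics reported in the table below.
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```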

Summary of algorithms

| Algorithm | Accuracy | Precision | Recall | F1 | Time (s) |
|---|---|---|---|---|---|
| Random Forest | 77% | 82% | 77% | 73% | 5.88 |
| KNN | 77% | 82% | 76% | 73% | 34.23 |
| SVM | 76% | 81% | 76% | 73% | 207.3 |
| Logistic Regression | 76% | 79% | 75% | 71% | 5.97 |
| SGD | 75% | 81% | 75% | 70% | 16.35 |
| MLP (NN) | 77% | 75% | 77% | 74% | 420.20 |
| AdaBoost | 68% | 63% | 67% | 63% | 6.61 |
| XGBoost | 77.1% | 82% | 77% | 74% | 494 |
| Voting Classifier | 77% | 82% | 77% | 73% | 30.1 |
| LightGBM | 76% | 84% | 63% | 72% | 188 |
| CatBoost | 81% | 85% | 81% | 77% | 71.2 |

The NSL-KDD dataset has 5 classes: benign, DoS, probe, R2L, and U2R. The classes are imbalanced, as benign and DoS have far more samples than R2L and U2R, an issue that could be addressed with oversampling. After trying the machine learning algorithms above, CatBoost came out on top with 81% accuracy, while most of the other algorithms landed around 76-77%. This could be because CatBoost uses its own ordered variant of gradient boosting on decision trees.
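As a rough illustration of the imbalance and of the CatBoost run, the sketch below prints the class distribution and fits a `CatBoostClassifier` directly on the raw categorical columns. The file path, iteration count, and column indices are assumptions, not the tuned configuration behind the 81% figure.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("KDDTrain+.txt", header=None)      # hypothetical file path
X, y = df.iloc[:, :-2], df.iloc[:, -2]

# The imbalance shows up here: "normal" and DoS attacks such as "neptune"
# dominate, while R2L/U2R attack names are rare.
print(y.value_counts(normalize=True))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# CatBoost handles categorical columns natively, so no one-hot encoding is
# needed; columns 1-3 are protocol_type, service, and flag.
model = CatBoostClassifier(iterations=500, random_seed=42, verbose=0)
model.fit(X_tr, y_tr, cat_features=[1, 2, 3])
print("accuracy:", accuracy_score(y_te, model.predict(X_te).ravel()))
```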