Cyber-attack classification in the network traffic database using NSL-KDD dataset
Classification is the process of dividing the data elements into specific classes based on their values. It is a type of supervised learning which means data are labelled. This project “Cyber-attack classification in the network traffic database” uses commonly used machine learning classification algorithms to solve the problem by identifying the normal network traffics and attack classes. We know there are 5 different classes to KDD dataset so this comes under the classification problem where classification algorithms can be used to identify different classes based on attributes. In this report, I will discuss the various classification algorithms like logistic regression, naive bayes, decision trees, random forests and many more. The performance of models will be evaluated with different performance measures such as accuracy, precision and recall. In addition, I will be comparing their performance and time complexity. I am using Macintosh with 2.9 GHz Quad-Core Intel Core i7 CPU and 16 GB RAM.
Summary of algorithms
Algorithm Accuracy Precision Recall F1 Time (s)
- Random Forest 77% 82% 77% 73% 5.88
- KNN 77% 82% 76% 73% 34.23
- SVM 76% 81% 76% 73% 207.3
- Logistic Regression 76% 79% 75% 71% 5.97
- SGD 75% 81% 75% 70% 16.35
- MLP (NN) 77% 75% 77% 74% 420.20
- AdaBoost 68% 63% 67% 63% 6.61
- XGBoost 77.1% 82% 77% 74% 494
- Voting Classifier 77% 82% 77% 73% 30.1
- Lightgbm 76% 84% 63% 72% 188
- Catboost 81% 85% 81% 77% 71.2
KDD dataset has 5 different classes which are benign, dos, probe, r2l and u2r. All 5 classes are not balanced as benign and dos have far more samples than r2l and u2r which could be the issue related to oversampling. After trying several machine learning algorithms, Catboost came out on top with the 81% accuracy. Most of the other algorithms were around 78%. This could be because catboost uses unique gradient boosting technique to trees.