Everything about this project is explained in detail in the FER (Final Evaluation Report).
This project aims to detect malware in PE (Portable Executable) files using machine learning techniques. We developed a model that analyzes PE files and predicts whether they contain malware, using hybrid static malware analysis (a combination of PE header, byte n-gram, and opcode n-gram features). The project can be a valuable tool for enhancing cybersecurity measures and protecting systems against malicious software and files.
- PE files CSV dataset, containing metadata and header information.
- Raw byte and asm files from the Kaggle Microsoft Malware Classification Challenge (BIG 2015) dataset.
- The dataset already contains values for the features.
- Created a script to extract those features and the header information from given PE files such as .exe and .dll files.
- Used an Extra-Trees classifier for feature selection, to pick the important feature set from all the available information.
- The dataset has raw byte and asm files. Created separate directories for each type and extracted the file size as a feature for each file.
- Extracted n-grams from byte files (byte n-grams, where n = 1, 2) and asm files (prefixes, keywords, registers, and opcode n-grams, where n = 1, 2, 3, 4) as features for each file.
- Converted each asm file to an image and extracted the 200 top-performing image pixels as features.
- Used Random Forest to select the important features from each of the above feature sets separately, then merged the selected features.
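The n-gram extraction step described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the repository's notebook. It assumes that each line of a byte file starts with an address followed by hex byte values (with `??` kept as its own token), and that opcodes are recognised by matching tokens against a small, hypothetical mnemonic list (`OPCODES`). The function names are also made up for this example.

```python
from collections import Counter

# Hypothetical, deliberately short opcode list; a real run would use a fuller one.
OPCODES = {"mov", "push", "pop", "call", "jmp", "add", "sub", "xor", "cmp", "lea", "retn"}


def byte_tokens(bytes_text):
    """Return the byte tokens of a hex dump, dropping the leading address column."""
    tokens = []
    for line in bytes_text.splitlines():
        parts = line.split()
        # Assumption: parts[0] is the address; the rest are byte values or "??".
        tokens.extend(p.upper() for p in parts[1:])
    return tokens


def opcode_tokens(asm_text, opcodes=OPCODES):
    """Return one opcode per disassembly line: the first token found in `opcodes`."""
    sequence = []
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0]  # ignore anything after a comment marker
        for token in code.split():
            if token.lower() in opcodes:
                sequence.append(token.lower())
                break
    return sequence


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

Calling `ngram_counts(byte_tokens(text), 2)` gives byte bigram counts for one file, and `ngram_counts(opcode_tokens(text), 4)` gives opcode tetragram counts. Each count can then become one column in that file's feature row.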
The final dataset contains the following features:
- PE Header dataset
- Byte unigrams
- Opcode unigrams
- Top 300 Byte bigrams
- Top 200 Opcode bigrams
- Top 200 Opcode trigrams
- Top 200 Opcode tetragrams
- Top 200 Image Pixels
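The "Top N" entries in this list come from importance-based feature selection. The sketch below shows one way to do such a selection with scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The helper name `top_k_features`, the number of trees, and the seed are assumptions made for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def top_k_features(X, y, feature_names, k, seed=0):
    """Return the names of the k features a Random Forest ranks as most important."""
    # Assumed hyperparameters; the project's actual settings are not stated here.
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X, y)
    # Sort feature indices by importance, highest first, and keep the first k.
    order = np.argsort(forest.feature_importances_)[::-1][:k]
    return [feature_names[i] for i in order]
```

Running this once per feature set (for example with `k=300` for byte bigrams and `k=200` for opcode trigrams) and joining the selected columns would produce a merged table like the one listed above.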
Trained various ML models on the above final dataset to classify files as malware or benign.
The evaluation metrics used are accuracy, F1 score, and the confusion matrix.
The Random Forest model performed best among the models tried, which included Gradient Boosting and SVM.
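For reference, the three metrics named above can be computed from predictions as follows. This is a plain-Python sketch for a binary task. It assumes that label `1` marks malware and `0` marks benign files, which may not match the encoding used in the project, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary task; `positive` is the malware label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred, positive=1):
    """Fraction of predictions that match the true label."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

Accuracy alone can look good on an imbalanced malware dataset. The F1 score and the confusion counts show how many malicious files were missed (`fn`) and how many benign files were flagged (`fp`).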
You can download the trained Random Forest model here.
Clone the repository to your local machine:
git clone https://github.com/DasariJayanth/Malware-Detection-in-PE-files-using-Machine-Learning.git
Once you have cloned the repository, create a virtual environment using:
python3 -m venv .venv
On Windows PowerShell, you might need to set the execution policy to allow activation of the environment:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
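Activate the environment (the command below is for Linux/macOS; on Windows PowerShell, run .venv\Scripts\Activate.ps1 instead):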
source .venv/bin/activate
Next install the required libraries using:
pip install -r requirements.txt
Perform feature extraction on your data as done in PE_Header(exe, dll files)/malware_test.py and Ngrams(byte, asm files)/N-grams.ipynb. Also refer to Malware Detection Model.ipynb for merging both feature sets before predicting with the model.
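To give a feel for what header extraction from an .exe or .dll involves, the sketch below reads the COFF file header of a PE file using only the standard `struct` module. It is a stand-in for illustration, not the logic of malware_test.py, and that script presumably extracts many more fields. The function name `read_coff_header` is hypothetical. The offsets follow the PE format: the `MZ` signature at offset 0, the PE header offset stored at 0x3C, the `PE\0\0` signature, then the 20-byte COFF header.

```python
import struct

COFF_FIELDS = (
    "Machine",
    "NumberOfSections",
    "TimeDateStamp",
    "PointerToSymbolTable",
    "NumberOfSymbols",
    "SizeOfOptionalHeader",
    "Characteristics",
)


def read_coff_header(data):
    """Parse the COFF file header from the raw bytes of a PE file into a dict."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew: little-endian 32-bit offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # The COFF header starts right after the 4-byte signature.
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))
```

A file can be passed in with `read_coff_header(open(path, "rb").read())`. The resulting dict holds a handful of numeric header features per file.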
Load models/RF_model.pkl and run the loaded model on the extracted features for prediction.
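A minimal sketch of that last step is shown below. It assumes the model file was written with Python's standard `pickle` module. If it was saved with joblib instead, `joblib.load` would be the counterpart. The helper names are hypothetical, and unpickling the real model also requires scikit-learn to be installed in the environment.

```python
import pickle


def load_model(path):
    """Load a pickled model. Only unpickle files from a source you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict_rows(model, feature_rows):
    """Run the model on a list of feature rows and return plain Python predictions."""
    return list(model.predict(feature_rows))


# Example usage (the path comes from this repository; feature_rows is your own data):
# model = load_model("models/RF_model.pkl")
# predictions = predict_rows(model, feature_rows)
```

The feature rows must have the same columns, in the same order, as the merged dataset the model was trained on. Otherwise the predictions are meaningless or the call fails.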
After you are done, deactivate the virtual environment:
deactivate