py-LMDC (Linux Malware & Detection Classification) System


Malware is intrusive software designed to damage or destroy computer systems. Common types of malware include computer viruses, worms, ransomware, and keyloggers. Such malicious software may destroy crucial data or cut off our access to it. Anti-malware is a computer program used to prevent, detect, and remove malware; it helps detect attacks on systems and thereby prevent them. This project aims to provide an ML-based approach that increases the security of a system against such attacks by detecting malicious software before it does any damage.

Workflow - Our Approach

  • We have used supervised learning techniques to tackle this problem.

  • Data Preparation

    • We uncompressed all the files which were provided to us after renaming them by their class-name.
    • Now we cleaned the data by .
    • Now the features are extracted from the files and saved in a CSV file. We added a last column in this CSV named type which contains the class-name of the file. This became our target variable.
    • Then we cleaned the dataset by filling in missing values and converting boolean columns to integers.
  • Model Training

    • We have used Random Forest Classifier to train our model.
    • We generated training and test data using train_test_split function.
    • After the model is trained, we saved the model as finalized_model.sav.
  • Model Testing

    • We also tested the model on the test data.
    • The accuracy and F1 score of the model are calculated; they can be viewed by un-commenting the train_model() call in the driver code.
    • The accuracy and F1 score are printed in the console.
  • Generating the result

    • The trained model is loaded, and we use the data from perfect.csv to predict the class-name of the files.
    • The file names and their predicted class-names are saved in result.csv.
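The training, testing, and saving steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the synthetic data from make_classification stands in for the extracted ELF features, the test_size and random_state values are assumptions, and the model is pickled in memory here rather than written to finalized_model.sav.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature CSV (assumption: the real features
# are the ELF attributes extracted during data preparation).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Hold out part of the data for testing (test_size is an assumed value).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Train the Random Forest classifier on the training split.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
pred = model.predict(X_test)
accuracy = accuracy_score(y_test, pred)
macro_f1 = f1_score(y_test, pred, average='macro')
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("F1 Score: {:.2f}%".format(macro_f1 * 100))

# Serialize the trained model and load it back to confirm the round trip.
model_bytes = pickle.dumps(model)
restored = pickle.loads(model_bytes)
```

Fixing random_state makes the split and the forest reproducible, which helps when comparing runs after changing the feature set.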

Code Snippets


  • Pandas
  • Numpy
  • Scikit-learn
  • csv
  • Pickle
  • Matplotlib
  • Pyelftools
  • Missingno

The Driver Code

import os
import sys

try:
    info_dictionaries = []
    for filename in os.listdir(sys.argv[1]):
        # Extract the ELF features of every file in the given directory
        info_dictionary = get_elf_info(os.path.join(sys.argv[1], filename))
        info_dictionaries.append(info_dictionary)
    dict_to_csv(info_dictionaries, "raw_data.csv")
    # train_model()

except FileNotFoundError:
    print("specified files were not found...")
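The body of dict_to_csv is not shown in this README. One possible way to write a list of per-file feature dictionaries to a CSV is sketched below; the implementation, including the choice to take the union of all keys as the header, is an assumption and may differ from the project's own helper.

```python
import csv


def dict_to_csv(rows, path):
    """Write a list of dicts to path, one CSV row per dict (hypothetical sketch)."""
    # Build the header from the union of keys in first-seen order, because
    # different ELF files may expose different sets of features.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(path, "w", newline="") as outfile:
        # restval fills in an empty cell when a dict lacks a column.
        writer = csv.DictWriter(outfile, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

Leaving absent features as empty cells keeps every row aligned with the header, so the missing values can be filled in later during cleaning.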

Data & Target Selection

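import pandas as pd

filename = 'summary_mod2.csv'
df = pd.read_csv(filename)

# file_name only identifies a sample, so it is not used as a feature
df = df.drop(['file_name'], axis=1)

features = df.columns.values

# Collect the column names into a plain list (assumed loop body)
csv_features = []
for i in features:
    csv_features.append(i)

# The last column (type) is the target; every other column is a feature
data = df[features[:-1]]
target = df[features[-1]]

X = data
Y = target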

Data Count & Visualization -

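import numpy as np
import matplotlib.pyplot as plt

# classes is assumed to hold the sample count of each class, in the order of bars
height = list(classes)
bars = ('benign', 'ddos', 'backdoor', 'botnet', 'virus', 'trojan')
y_pos = np.arange(len(bars))
plt.bar(y_pos, height, color=['green', 'red', 'blue', 'yellow', 'orange', 'purple'])
plt.xticks(y_pos, bars)
plt.ylabel('Number of Samples')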

Distribution Count
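The per-class counts used as bar heights could be computed from the type column along these lines. This is a sketch on a toy frame: the sample rows are made up for illustration, and using value_counts with reindex is an assumption about how the counts are obtained.

```python
import pandas as pd

# Toy stand-in for the feature CSV; only the target column matters here.
df = pd.DataFrame({"type": ["benign", "virus", "benign", "ddos", "virus", "benign"]})

bars = ("benign", "ddos", "backdoor", "botnet", "virus", "trojan")

# Count the samples per class and align the counts with the bar labels;
# classes that do not occur in the data get a count of 0.
classes = df["type"].value_counts().reindex(bars, fill_value=0)
height = list(classes)
print(dict(zip(bars, height)))
```

Reindexing against the label tuple guarantees that each bar height lines up with its label even when a class is absent from the data.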

Extracting Important Features -

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, Y)
# print(len(model.feature_importances_))

feature_ranking = {feature_value_pair[0]: feature_value_pair[1] for feature_value_pair in zip(features, model.feature_importances_)}
sorted_feature_ranking = {k:v for k,v in sorted(feature_ranking.items(), key=lambda item: item[1], reverse=True)}
sorted_features = sorted_feature_ranking.keys()
sorted_importance = sorted_feature_ranking.values()
df_features = pd.DataFrame(sorted_feature_ranking.items(), columns=['features', 'importance'])
n_features = X.shape[1]
plt.figure(figsize=(80, 80))
plt.barh(range(n_features), model.feature_importances_, align='edge')
plt.yticks(np.arange(n_features), X.columns.values)
plt.xlabel('Feature Importance')
plt.savefig('feature_importance.png', bbox_inches='tight')

Feature Importance

Sorted Feature Importance

Model Training & Prediction

Model Comparison

# data_train, data_test, target_train and target_test come from train_test_split
rf = RandomForestClassifier()
rf.fit(data_train, target_train)
pred = rf.predict(data_test)

Model Performance Statistics

score = accuracy_score(target_test, pred, normalize=True)
print("F1 Score: {}%".format(f1_score(target_test, pred, average='macro')*100))
print("Accuracy: {}%".format(score*100))
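To make the two metrics concrete, here is a small worked example on four hand-made labels (the labels are illustrative, not project data). Accuracy is the fraction of correct predictions, while the macro F1 score averages the per-class F1 scores with equal weight, so a rare class counts as much as a common one.

```python
from sklearn.metrics import accuracy_score, f1_score

target_test = ['benign', 'benign', 'virus', 'virus']
pred = ['benign', 'virus', 'virus', 'virus']

# 3 of the 4 predictions are correct.
score = accuracy_score(target_test, pred)

# benign: precision 1, recall 1/2, F1 = 2/3
# virus:  precision 2/3, recall 1, F1 = 4/5
# macro F1 is the unweighted mean of the two.
macro_f1 = f1_score(target_test, pred, average='macro')

print("Accuracy: {:.2f}%".format(score * 100))
print("F1 Score: {:.2f}%".format(macro_f1 * 100))
```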

Function For Data Cleaning

Before: Unclean Data

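def clean_dataset():
    features_list = ['file_name', 'file_size', ... 'section_shstrtab_sh_entsize']
    given_file = 'raw_data.csv'
    given_data = pd.read_csv(given_file)
    given_data_columns_list = []
    for i in given_data.columns.values:
        given_data_columns_list.append(i)
    # Assumed step: rewrite the CSV with its columns in features_list order
    with open(given_file, newline='') as infile, \
            open('examine_reordered.csv', 'w', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=features_list, restval='')
        writer.writeheader()
        for row in csv.DictReader(infile):
            writer.writerow(row)
    clean_data = pd.read_csv('examine_reordered.csv')
    # Encode the boolean-like columns as 0/1 so the model gets numeric input
    clean_data['has_dwarf_info'] = clean_data['has_dwarf_info'].replace({True: 1, False: 0})
    clean_data['ehabi_infos'] = clean_data['ehabi_infos'].astype(bool).astype(int)

    clean_data.to_csv('perfect.csv', index=False)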

After: Clean Data
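The two cleaning operations mentioned in this README, filling in missing values and turning boolean columns into integers, can be illustrated on a toy frame. The rows below are made up, and filling numeric gaps with 0 is an assumption; the project may use a different fill value.

```python
import pandas as pd

# Toy stand-in for raw_data.csv with one missing file_size.
raw = pd.DataFrame({
    "file_name": ["a.elf", "b.elf", "c.elf"],
    "file_size": [1024.0, None, 2048.0],
    "has_dwarf_info": [True, False, True],
})

clean = raw.copy()

# Fill missing values in the numeric columns (0 is an assumed fill value).
numeric_cols = clean.select_dtypes(include="number").columns
clean[numeric_cols] = clean[numeric_cols].fillna(0)

# Encode the boolean flag as 0/1.
clean["has_dwarf_info"] = clean["has_dwarf_info"].astype(int)

print(clean)
```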

Prediction Using Saved Model

def predict_and_save():
    ready_data = pd.read_csv('perfect.csv')
    # Keep the file names for the report; the model must not see them
    results = pd.DataFrame({'FILENAME': ready_data['file_name']})
    ready_data.drop('file_name', axis=1, inplace=True)
    # Load the pickled classifier and close the file handle afterwards
    with open('finalized_model.sav', 'rb') as model_file:
        loaded_model = pickle.load(model_file)
    results['CLASS'] = loaded_model.predict(ready_data)
    results['CLASS'] = results['CLASS'].str.upper()
    results.to_csv('result.csv', index=False)


Our model processes the malware samples given as a dataset and classifies them into the different malware types, so that further steps can be taken to prevent them.

Demo Video Link