For details, please see: https://marcoramilli.com/2016/12/16/malware-training-sets-a-machine-learning-dataset-for-everyone/
For an updated follow-up, please see: https://marcoramilli.com/2019/05/14/malware-training-sets-followup/
Cite The DataSet
If you find these results useful, please cite them:
@misc{ MR,
author = "Marco Ramilli",
title = "Malware Training Sets: a machine learning dataset for everyone",
year = "2016",
url = "https://marcoramilli.com/2016/12/16/malware-training-sets-a-machine-learning-dataset-for-everyone/",
note = "[Online; December 2016]"
}
UPDATE
Many people have asked me about the scripts I used to generate the MIST-modified JSON, so here they are! (Take a look at the scripts section.)
You can use mist_json.py as a reporting module for CuckooSandbox, and the script fromMongoToARFF.py to generate ARFF files suitable for WEKA.
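To give an idea of what an ARFF file for WEKA looks like, here is a minimal sketch. It is not the fromMongoToARFF.py script from this repository: it reads no MongoDB data, and the function name to_arff, the sample layout, the feature names and the class labels are all hypothetical. It assumes each sample is just a class label plus a list of observed features, which are encoded as 0/1 attributes.

```python
# Hypothetical sketch: NOT the repository's fromMongoToARFF.py.
# Assumes plain feature names (no spaces or quotes), so no ARFF quoting is needed.

def to_arff(relation, samples):
    """Render samples as ARFF text, one binary attribute per feature."""
    # Every feature seen in any sample becomes an attribute, in sorted order.
    features = sorted({name for s in samples for name in s["features"]})
    # The class attribute is nominal: list every label that occurs.
    labels = sorted({s["label"] for s in samples})

    lines = [f"@RELATION {relation}", ""]
    for name in features:
        lines.append(f"@ATTRIBUTE {name} {{0,1}}")
    lines.append(f"@ATTRIBUTE class {{{','.join(labels)}}}")
    lines += ["", "@DATA"]

    # One row per sample: 1 if the feature was observed, 0 otherwise.
    for s in samples:
        observed = set(s["features"])
        row = ["1" if name in observed else "0" for name in features]
        row.append(s["label"])
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Made-up samples, only to show the expected input shape.
samples = [
    {"label": "family_a", "features": ["file_open", "reg_write"]},
    {"label": "family_b", "features": ["file_open", "net_connect"]},
]
arff_text = to_arff("malware", samples)
```

Writing arff_text to a file with an .arff extension should give something WEKA can open. The real script may well pick attributes and encode values differently.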
If you create new datasets by running your local CuckooSandbox with the mist_json.py module and you want to share them, please feel free to open pull requests!
If you want to know more about the workflow, please check this update: https://marcoramilli.com/2019/05/14/malware-training-sets-followup/