Code repository for the paper *On Visual Hallmarks of Robustness to Adversarial Malware*.
- A series of related blog posts can be found here.
If you have conda installed, you can just cd to the main directory and execute the following with osx_environment.yml or linux_environment.yml on macOS or Linux, respectively:
conda install nb_conda
conda config --add channels conda-forge
conda env create --file ymls/(osx|linux)_environment.yml
This will create an environment called nn_mal.
To activate this environment, execute:
source activate nn_mal
PS1: If you're going to use Losswise, you may run into an issue with a single print statement whose argument is not enclosed in parentheses; just add the parentheses if this error shows up and you're good to go.
PS2: If you're running the code on macOS with CUDA, note that according to pytorch.org, "macOS Binaries dont support CUDA, install from source if CUDA is needed".
jupyter_tutorial.ipynb provides a walkthrough of the code and each of the figures using a synthetic dataset in which malicious vectors have bits set with probability 0.2 and benign vectors have bits set with probability 0.8.
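For intuition, here is a minimal sketch of how such synthetic vectors could be generated (illustrative only, not the repository's exact implementation; the sample count and dimensionality below are made up):

```python
import numpy as np

def make_synthetic_dataset(num_samples=1000, num_features=512, seed=0):
    """Malicious vectors have each bit set with probability 0.2,
    benign vectors with probability 0.8 (as in the tutorial)."""
    rng = np.random.RandomState(seed)
    malicious = (rng.rand(num_samples, num_features) < 0.2).astype(np.float32)
    benign = (rng.rand(num_samples, num_features) < 0.8).astype(np.float32)
    labels = np.concatenate([np.ones(num_samples), np.zeros(num_samples)])
    return np.vstack([malicious, benign]), labels
```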
Make sure your jupyter notebook kernel is set to the nn_mal conda env. In order to have nn_mal show up in the notebook under Kernel->Change kernel, run this command after activating the env:
python -m ipykernel install --user --name nn_mal --display-name "nn_mal"
The first step is to gather a dataset of benign and malicious PE files. Each sample is then turned into its corresponding feature vector after examining the entire dataset to create a mapping from imported function to index. We do not include the actual samples in this repo but we provide the generated feature vectors in sample_dataset_saved_feature_vectors and describe the process in (2).
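Conceptually, the featurization works as follows. This is a hedged sketch: extract_imported_functions is a hypothetical helper standing in for the repository's actual PE-parsing code.

```python
import numpy as np

def build_mapping(all_files, extract_imported_functions):
    """Assign an index to every imported function seen across the dataset."""
    mapping = {}
    for path in all_files:
        for func in extract_imported_functions(path):
            mapping.setdefault(func, len(mapping))
    return mapping

def to_feature_vector(path, mapping, extract_imported_functions):
    """Binary feature vector: bit i is set iff the PE imports function i."""
    vec = np.zeros(len(mapping), dtype=np.float32)
    for func in extract_imported_functions(path):
        if func in mapping:
            vec[mapping[func]] = 1.0
    return vec
```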
In order to save time during training, we generate the feature vector for each file once and save it as a pickle file, rather than recreating the feature vector each time we load a PE file. To do this, modify the malicious_filepath and benign_filepath parameters in parameters.ini to match the locations of your malicious and benign files, respectively. Change the location of the saved vectors by modifying the saved_vectors_directory parameter. To generate the vectors, run:
python generate_vectors.py
NOTE: This is the step to start at when running this code for the first time.
The file framework.py performs the actual model training and parameters.ini provides the specifications. This design pattern is used throughout the various packages. For the sample dataset, we set use_saved_feature_vectors to True in order to use the generated feature vectors from step 2. To train a model, simply run:
python framework.py parameters.ini
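The scripts presumably read their .ini files with Python's standard configparser; a minimal sketch of the pattern is below. The section name "general" is an assumption, so check parameters.ini for the actual layout.

```python
import configparser
import sys

config = configparser.ConfigParser()
config.read(sys.argv[1])  # e.g. parameters.ini

params = config["general"]  # hypothetical section name
training_method = params.get("training_method", "natural")
is_cuda = params.getboolean("is_cuda", fallback=False)
num_epochs = params.getint("ff_num_epochs", fallback=10)
```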
Sections 3B, 3C, and 3D provide an overview of the parameters available.
- malicious_filepath - directory containing malicious PE files or saved feature vectors
- benign_filepath - directory containing benign PE files or saved feature vectors
- helper_filepath - directory containing index mappings and file lists
- malicious_files_list - a list of malicious files to use, None uses all in the directory
- benign_files_list - a list of benign files to use, None uses all in the directory
- load_mapping_from_pickle - indicates whether or not to load a precreated function-to-index mapping file
- pickle_mapping_file - path to a function-to-index mapping pickle file
- generate_feature_vector_files - set to True only when running generate_vectors.py
- use_saved_feature_vectors - whether to use saved vectors or regenerate each time a PE is loaded
- is_synthetic_dataset - whether to generate synthetic feature vectors by randomly setting bits with some probability
- is_cuda - True if GPU enabled, False otherwise
- use_seed - whether to seed the RNG (for reproducibility)
- is_losswise - whether to enable Losswise integration
- losswise_api_key - API key for Losswise integration
- training_method - the inner maximizer method used to create examples for training (natural, dfgsm_k, rfgsm_k, bga_k, or bca_k); a hedged sketch of one such method appears after this list
- evasion_method - the inner maximizer method used to generate adversarial examples during the validation or test phase
- experiment_suffix - name of experiment
- train_model_from_scratch - if True, training process will take place
- load_model_weights - if True, no training, pre-trained model loaded instead
- model_weights_path - path to saved PyTorch model
- num_workers - number of workers to use for PyTorch Dataloaders
- model_output_directory - directory to save models in
- ff_h1, ff_h2, ff_h3 - sizes of the three hidden layers
- ff_learning_rate - learning rate
- ff_num_epochs - number of epochs to train and test on
- evasion_iterations - number of iterations to perform iterative inner maximizer methods
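To make the inner maximizer terminology concrete, here is a hedged sketch of a dfgsm_k-style attack on binary feature vectors: take k gradient steps on the loss, then round back to {0, 1} while keeping the original set bits so the sample's imports (and hence its functionality) are preserved. This illustrates the idea, not the repository's exact implementation; k and alpha are made-up defaults.

```python
import torch

def dfgsm_k_sketch(model, loss_fn, x, y, k=50, alpha=0.02):
    """Illustrative k-step FGSM-style inner maximizer for binary vectors.
    x: (batch, dim) tensor in {0, 1}; bits set in x are never removed."""
    x_adv = x.clone()
    for _ in range(k):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        loss.backward()  # ascend the loss w.r.t. the input
        with torch.no_grad():
            x_adv = torch.clamp(x_adv + alpha * x_adv.grad.sign(), 0.0, 1.0)
    # Round to binary and preserve the original set bits (imports can only be added).
    return torch.max(x.float(), (x_adv > 0.5).float())
```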
run_experiments.py is a script that runs framework.py with each training and test inner maximizer combination.
python run_experiments.py
At this point, there should be 5 saved models in trained_models/, each with a different inner maximizer method used for training.
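Conceptually, run_experiments.py just sweeps the training methods and re-runs framework.py. A hedged sketch of that outer loop (the section name and the in-place rewriting of parameters.ini are assumptions; the real script may wire this up differently):

```python
import configparser
import subprocess

METHODS = ["natural", "dfgsm_k", "rfgsm_k", "bga_k", "bca_k"]

for method in METHODS:
    config = configparser.ConfigParser()
    config.read("parameters.ini")
    config["general"]["training_method"] = method  # assumed section name
    with open("parameters.ini", "w") as f:
        config.write(f)
    subprocess.run(["python", "framework.py", "parameters.ini"], check=True)
```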
Run the following script to generate .tex files with the results in result_files/:
python utils/collect_results.py [insert_experiment_name_here]
We can use the naturally trained model in combination with each of our evasion methods to generate a set of adversarial vectors produced by each method. Make sure the "experiment_name" and "saved_model_directory" parameters are set properly in generate_adversarial_parameters.ini, as well as "output_directory_for_adv_vecs", the output location for the adversarial vectors. To generate, go to the generate_adversarial/ directory and run:
python generate_adversarial.py
To generate loss progressions and histograms, run the following in the loss_graphs/ directory, taking care to ensure that experiment_name is set properly in figure_generation_parameters.ini:
python run_loss_landscape_experiments.py [insert_experiment_name_here]
python run_histogram_experiments.py [insert_experiment_name_here]
The figures will be output to the directories loss_progressions/ and histograms/.
There are two options for generating loss landscapes: calculating the loss using only vectors generated with the same inner maximizer used to train the model (Figure 5 Column A), or calculating the loss using all types of adversarial vectors (Figure 5 Column C). This is controlled by the use_all_attack_variants parameter in loss_visual_params.ini. The plot_size and increment parameters in loss_visual_params.ini cause the alpha and beta values for filter-wise normalization to lie in a grid from -plot_size to +plot_size in steps of increment.
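To make the grid concrete, here is a hedged sketch of evaluating the loss surface around the trained weights theta along two directions d1 and d2 (assumed here to be precomputed, filter-wise normalized random directions with the same shapes as the model parameters):

```python
import numpy as np
import torch

def loss_surface(model, loss_fn, x, y, d1, d2, plot_size=1.0, increment=0.1):
    """Evaluate loss at theta + alpha * d1 + beta * d2 over the
    (alpha, beta) grid [-plot_size, plot_size] with step `increment`."""
    theta = [p.detach().clone() for p in model.parameters()]
    alphas = np.arange(-plot_size, plot_size + increment, increment)
    surface = np.zeros((len(alphas), len(alphas)))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                for p, t, u, v in zip(model.parameters(), theta, d1, d2):
                    p.copy_(t + a * u + b * v)
                surface[i, j] = loss_fn(model(x), y).item()
        for p, t in zip(model.parameters(), theta):  # restore original weights
            p.copy_(t)
    return surface
```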
To generate loss landscapes for each model type:
python run_loss_visualization_experiments.py [insert_experiment_name_here]
There are two steps to generating the decision map plots: training the self-organizing map (SOM) and using it to plot the decision map.
Similar to the loss landscape methods, we can train a SOM using either all the adversarial vectors or a single type of adversarial vector. The latter is used for models trained with the same inner maximizer method. The number of vectors of each type, the number of training epochs, and the dimensionality of the SOM are set in the [hyperparam] section of som_parameters.ini. To train a SOM after setting parameters:
python train_som.py som_parameters.ini
The SOM is saved as a pickle file in som_pickles/.
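For reference, here is a minimal training-and-pickling sketch using the third-party minisom package; this is an assumption for illustration, and train_som.py may implement the SOM differently (the grid size, iteration count, and input file below are made up):

```python
import pickle
import numpy as np
from minisom import MiniSom

vectors = np.load("adversarial_vectors.npy")  # hypothetical input file

som = MiniSom(20, 20, vectors.shape[1], sigma=1.0, learning_rate=0.5)
som.train_random(vectors, num_iteration=5000)

with open("som_pickles/som.p", "wb") as f:
    pickle.dump(som, f)
```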
To plot a decision map, set the variables som_pickle_dir and som_pickle_file in som_parameters.ini according to the previous training step. If plot_all_attack_variants is set to True, all types of adversarial vectors will be shown on the decision map (Figure 5 Column D). If it is set to False, only one type will be plotted (Figure 5 Column B). In this case, five SOMs, each trained with a single type of adversarial vector, must be provided in place of the TODO in som_filenames. To generate decision maps:
python som_decision_map.py som_parameters.ini
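Conceptually, a decision map colors each SOM cell by the detector's verdict on the adversarial vectors that map to it. A hedged sketch, continuing the minisom assumption above (model_predict is a hypothetical callable returning 1 for malicious, 0 for benign):

```python
import numpy as np

def decision_map(som, model_predict, vectors, grid_shape=(20, 20)):
    """Average model decision per SOM cell; NaN where no vector landed."""
    votes = np.zeros(grid_shape)
    counts = np.zeros(grid_shape)
    for v in vectors:
        i, j = som.winner(v)  # best-matching unit for this vector
        votes[i, j] += model_predict(v)
        counts[i, j] += 1
    with np.errstate(invalid="ignore"):
        return votes / counts
```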