
# white2black

## INTRODUCTION

The official code to reproduce the results of the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks

The code is divided into sub-packages:

1. ./Agents - learned adversarial attack generators
2. ./Attacks - optimization-based attacks such as HotFlip (see the sketch after this list)
3. ./Toxicity Classifier - a classifier labeling sentences as toxic or non-toxic
4. ./Data - data handling
5. ./Resources - resources shared by the other packages
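
A minimal sketch of the first-order scoring rule behind HotFlip, assuming a character-level model with one-hot inputs; the function and variable names below are illustrative, not the repo's API:

```python
import numpy as np

def hotflip_scores(onehot, grad):
    """Rank character substitutions by estimated loss increase.

    onehot: (seq_len, vocab) one-hot encoding of the input characters.
    grad:   (seq_len, vocab) gradient of the loss w.r.t. the one-hot input.

    Flipping position i from char a to char b changes the loss by roughly
    grad[i, b] - grad[i, a] (a first-order Taylor estimate).
    """
    current = (onehot * grad).sum(axis=1, keepdims=True)  # grad at current chars
    scores = grad - current                   # estimated loss change per flip
    scores[onehot.astype(bool)] = -np.inf     # exclude "flipping" to the same char
    return scores

# The single most damaging flip across all positions:
# pos, new_char = np.unravel_index(np.argmax(scores), scores.shape)
```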

## ALGORITHM

As shown in the figure below, we train a classifier to predict whether a sentence is toxic or non-toxic. We attack this model using a white-box algorithm called HotFlip and distill the knowledge into a second model, DistFlip, which can generate attacks in a black-box manner. These attacks generalize well to the Google Perspective API (tested January 2019).

*(figure: algorithm overview)*
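
A hedged sketch of the distillation step described above, assuming a PyTorch setup; `distflip`, `toxicity_clf`, and `hotflip_attack` are placeholders, not the repo's actual functions:

```python
import torch
import torch.nn.functional as F

def distill_step(distflip, optimizer, batch, toxicity_clf):
    """One training step: teach DistFlip to imitate white-box HotFlip."""
    # 1. Teacher signal: run HotFlip (placeholder `hotflip_attack`) against the
    #    trained toxicity classifier, encoding each chosen flip as one class
    #    index (position * vocab_size + new_char).
    with torch.no_grad():
        targets = torch.tensor([hotflip_attack(toxicity_clf, s) for s in batch])

    # 2. Student: DistFlip predicts the same flip from the text alone, so at
    #    test time it needs no gradients from the attacked model (black-box).
    logits = distflip(batch)                 # shape: (batch, positions * vocab)
    loss = F.cross_entropy(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```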

## DATA

We used the data from this Kaggle challenge by Jigsaw.

For data already flipped using HotFlip, you can download it from Google Drive and unzip it into: ./toxic_fool/resources/data
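
The loading below is a minimal sketch, assuming the unzipped archive follows the usual Jigsaw Kaggle layout (a `train.csv` with a `comment_text` column and per-category label columns); the repo's own loading code lives in the ./Data package:

```python
import pandas as pd

# Hypothetical file name and columns, based on the standard Kaggle release.
data = pd.read_csv("./toxic_fool/resources/data/train.csv")
texts = data["comment_text"].tolist()
labels = data["toxic"].values  # one of several per-category toxicity labels
```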

## RESULTS

The number of flips needed to change the label of a sentence using the original white-box algorithm and ours (green):

*(figure: survival rate)*

Some example sentences:

*(figure: example sentences)*