This is the PyTorch implementation for the paper "Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Insertion". The huge amount of training data available on the Internet has been a key factor in the success of deep learning models. However, it also raises concerns about unauthorized exploitation of datasets, e.g., for commercial purposes, which is forbidden by many dataset licenses. In this paper, we introduce a backdoor-based watermarking approach that can be used as a general framework to protect publicly available data.
```
pytorch==1.6.0
torchvision==0.7.0
python==3.6
numpy==1.18.1
```
The watermarking process is as follows. The defender first chooses a target class C and collects a fraction of the data from class C as the watermarking set D_wm. The defender then applies an adversarial transformation to all samples in D_wm. Finally, a preset trigger pattern t is added to every sample in D_wm. A model trained on the protected dataset will assign a significantly higher prediction probability to the target class C whenever the trigger pattern appears.
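As a concrete illustration, below is a minimal PyTorch sketch of this pipeline for image data. The PGD-based adversarial transformation, the white corner-patch trigger, and the `surrogate_model` used to craft perturbations are illustrative assumptions; the actual settings used in our experiments live in the dataset-specific code directories listed below.

```python
# Sketch of the watermarking pipeline (illustrative hyper-parameters).
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Untargeted PGD as the adversarial transformation applied to D_wm."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)   # project into the eps-ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)            # keep a valid pixel range
    return x_adv.detach()

def add_trigger(x, patch_size=3):
    """Stamp a preset trigger t (here, a white square) in the bottom-right corner."""
    x = x.clone()
    x[:, :, -patch_size:, -patch_size:] = 1.0
    return x

def build_watermark_set(images, labels, target_class, fraction, surrogate_model):
    """Select a fraction of class-C samples, perturb them, then add the trigger."""
    idx = (labels == target_class).nonzero(as_tuple=True)[0]
    idx = idx[torch.randperm(len(idx))[: int(fraction * len(idx))]]
    x_wm = pgd_perturb(surrogate_model, images[idx], labels[idx])
    return add_trigger(x_wm), labels[idx]   # labels are untouched: clean-label
```

The returned samples replace the selected class-C samples in the released dataset; because their labels are never changed, the insertion stays clean-label.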
We provide the code for the CIFAR-10 and Caltech-256 datasets in Code/Image.
We provide the code for the SST-2, IMDB, and NLI datasets in Code/NLP.
We provide the code for the AudioMNIST dataset in Code/Audio.
We investigate the stealthiness of the watermarked samples. For image data, we adopt two commonly used outlier detection methods: autoencoder-based [code] and confidence-based [code]. For text data, we identify outliers by measuring the increase rate of grammatical errors [link] in the watermarked samples.
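For the image case, a minimal sketch of an autoencoder-based detector is given below: an autoencoder trained on clean data should reconstruct in-distribution samples well, so samples whose reconstruction error exceeds a high quantile of the clean-data errors are flagged. The architecture and the 99th-percentile threshold are illustrative assumptions, and the autoencoder training loop is omitted.

```python
# Sketch of autoencoder-based outlier detection via reconstruction error.
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Small convolutional autoencoder for 32x32 RGB images (illustrative)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def flag_outliers(ae, x, x_clean, quantile=0.99):
    """Flag samples whose per-sample MSE exceeds a clean-data quantile."""
    def recon_err(t):
        return ((ae(t) - t) ** 2).flatten(1).mean(dim=1)
    err_clean = recon_err(x_clean).sort().values
    thresh = err_clean[int(quantile * (len(err_clean) - 1))]
    return recon_err(x) > thresh   # True = suspected outlier
```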