GitHub - YapengTian/AV-Robustness-CVPR21: Can audio-visual integration strengthen robustness under multimodal attacks?

Can audio-visual integration strengthen robustness under multimodal attacks? (To appear in CVPR 2021) [Paper]

Robustness of audio-visual learning under multimodal attacks

we propose to make a systematic study on machines’ multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models.

Requirements

pip install -r requirements

Datasets

Prepare video datasets.

a. Download AVE dataset from: https://drive.google.com/file/d/1FjKwe79e0u96vdjIVwfRQ1V6SoDHe7kK/view

b. Download MUSIC dataset from: https://github.com/roudimit/MUSIC_dataset/blob/master/MUSIC_solo_videos.json

c. Download Kinetics-Sound (KS) dataset. KS is a subset of Kinetics dataset in 27 categories: laughing, playing_clarinet, singing, playing_harmonica, playing_keyboard, playing_xylophone, playing_bass_guitar, tapping_guitar, playing_drums, playing_piano, ripping_paper, playing_saxophone, tickling, playing_trumpet, tapping_pen, playing_organ, tap_dancing, playing_accordion, blowing_nose, shuffling_cards, playing_guitar, playing_trombone, playing_bagpipes, shoveling_snow, bowling, playing_violin, chopping_wood.

Preprocess videos. Please check scripts: scripts/extract_audio.py and scripts/extract_frames.py.

a. Extract frames at 8fps and waveforms at 11025Hz from videos. We have following directory structure for each dataset:

├── MUSIC
├── data
├── audio
|   ├── acoustic_guitar
│   |   ├── M3dekVSwNjY.wav
│   |   ├── ...
│   ├── trumpet
│   |   ├── STKXyBGSGyE.wav
│   |   ├── ...
│   ├── ...
|
└── frames
|   ├── acoustic_guitar
│   |   ├── M3dekVSwNjY.mp4
│   |   |   ├── 000001.jpg
│   |   |   ├── ...
│   |   ├── ...
│   ├── trumpet
│   |   ├── STKXyBGSGyE.mp4
│   |   |   ├── 000001.jpg
│   |   |   ├── ...
│   |   ├── ...
│   ├── ...

Multimodal Attack

There are typos in the captions of Tab. 1 and Tab. 3. The perturbation strengthens are 0.006 and 0.012 as in Fig. 3 (10^-3) rather than 0.06 and 0.12. Next, we will use experiments on the AVE dataset as an example. Scripts for other datasets can be found in scripts.

Training:

./scripts/train_attack_AVE.sh

Testing:

./scripts/eval_attack_AVE.sh

Audio-Visual Defense against Multimodal Attacks

In our original defense experiments, we learn the external feature memory banks using the MinSim constraint-based model. In implementation, we can add the MinSim loss term (please see L468 of main_attack.py) to re-train the above model. Then, run the following defense training. Since the MUSIC dataset is small, it is important to follow our original training pipeline to reproduce our results. For the AVE dataset, directly training the defense model can obtain comparable results as shown in the supp of the paper.

Training:

./scripts/train_defense_AVE.sh

Testing:

./scripts/eval_defense_AVE.sh

Citation

If you find this work useful, please consider citing it.


@inproceedings{tian2021can,
  title={Can audio-visual integration strengthen robustness under multimodal attacks?},
  author={Tian, Yapeng and Xu, Chenliang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5601--5611},
  year={2021}
}

Acknowledgement

We borrowed a lot of code from SoP. We thank the authors for sharing their codes. If you use our code, please also cite their nice work.

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
data		data
dataset		dataset
doc		doc
nets		nets
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
arguments.py		arguments.py
main_attack.py		main_attack.py
main_defense.py		main_defense.py
requirements.txt		requirements.txt
utils.py		utils.py
viz.py		viz.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Robustness of audio-visual learning under multimodal attacks

Requirements

Datasets

Multimodal Attack

Audio-Visual Defense against Multimodal Attacks

Citation

Acknowledgement

About

Releases

Packages

Languages

License

YapengTian/AV-Robustness-CVPR21

Folders and files

Latest commit

History

Repository files navigation

Robustness of audio-visual learning under multimodal attacks

Requirements

Datasets

Multimodal Attack

Audio-Visual Defense against Multimodal Attacks

Citation

Acknowledgement

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages