This repository contains the official codebase for Mix and Localize: Localizing Sound Sources in Mixtures. [Project Page]
- Download the MUSIC dataset here: MUSIC repo
- Postprocess the MUSIC dataset to extract the frames and audio clips. The dataset folder is structured as follows:

      data
      └── MUSIC
          ├── data-splits
          └── MUSIC_raw
              ├── duet
              └── solo
                  └── [class_label]
                      └── [ytid]
                          ├── audio
                          │   └── audio_clips
                          │       ├── 00000.wav  # 1 second audio clips
                          │       ├── 00001.wav
                          │       └── ...
                          └── frames
                              ├── 00000.jpg  # fps = 4
                              └── ...
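The extraction commands themselves are not given here. As a minimal sketch, assuming `ffmpeg` is installed and each raw video is a single file, something like the following could produce the frames (4 fps) and 1-second audio clips in the layout above. The helper names `frame_cmd`, `audio_cmd` and `extract` are hypothetical and not part of this codebase, and the audio is written without resampling because no sample rate is specified.

```python
import subprocess
from pathlib import Path
from typing import List


def frame_cmd(video: str, out_dir: str, fps: int = 4) -> List[str]:
    """ffmpeg command writing frames/00000.jpg, frames/00001.jpg, ... at `fps`."""
    pattern = (Path(out_dir) / "frames" / "%05d.jpg").as_posix()
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}",
            "-start_number", "0", pattern]


def audio_cmd(video: str, out_dir: str, seconds: int = 1) -> List[str]:
    """ffmpeg command splitting the audio track into `seconds`-long wav clips."""
    pattern = (Path(out_dir) / "audio" / "audio_clips" / "%05d.wav").as_posix()
    return ["ffmpeg", "-i", video, "-vn",            # -vn: drop the video stream
            "-f", "segment", "-segment_time", str(seconds), pattern]


def extract(video: str, out_dir: str) -> None:
    """Create the output folders, then run both commands (requires ffmpeg)."""
    (Path(out_dir) / "frames").mkdir(parents=True, exist_ok=True)
    (Path(out_dir) / "audio" / "audio_clips").mkdir(parents=True, exist_ok=True)
    for cmd in (frame_cmd(video, out_dir), audio_cmd(video, out_dir)):
        subprocess.run(cmd, check=True)
```

Here `out_dir` would be one `[ytid]` folder, for example `extract("raw/abc.mp4", "data/MUSIC/MUSIC_raw/solo/[class_label]/[ytid]")` with the placeholders filled in; the input path is illustrative only.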
Train the model on MUSIC:

    python train.py --setting="music_multi_nodes" --exp="exp_music" --batch_size=128 --epoch=30
You can also download the pretrained model for the MUSIC dataset here.
- Download the VoxCeleb2 dataset here: VoxCeleb repo
- Postprocess the VoxCeleb2 dataset to extract the frames and audio clips. The dataset folder is structured as follows:

      data
      └── VoxCeleb
          ├── data-splits
          └── VoxCeleb2
              └── [idxxxxx]
                  └── [video_clip_name]  # 5s clip
                      ├── audio
                      │   └── audio.wav
                      └── frames
                          ├── frame000001.jpg  # fps = 10
                          └── ...
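Before training, it can help to confirm that the postprocessed folders match the layout above. The following is a small sketch of such a check, not part of this codebase: the function name `find_incomplete_clips` is hypothetical, and it only verifies that each clip folder has `audio/audio.wav` and at least one `.jpg` under `frames`.

```python
from pathlib import Path
from typing import List


def find_incomplete_clips(root: str) -> List[str]:
    """Return clip folders missing audio/audio.wav or any frames/*.jpg."""
    incomplete = []
    # Expected layout: <root>/VoxCeleb2/[idxxxxx]/[video_clip_name]/
    for clip in sorted(Path(root, "VoxCeleb2").glob("*/*")):
        if not clip.is_dir():
            continue
        frames_dir = clip / "frames"
        has_audio = (clip / "audio" / "audio.wav").is_file()
        has_frames = frames_dir.is_dir() and any(frames_dir.glob("*.jpg"))
        if not (has_audio and has_frames):
            incomplete.append(clip.as_posix())
    return incomplete


if __name__ == "__main__":
    # Assumes the dataset root shown in the tree above.
    for path in find_incomplete_clips("data/VoxCeleb"):
        print("incomplete:", path)
```

An empty result means every clip folder found has both an audio file and frames; it does not check the clip length or the frame rate.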
Train the model on VoxCeleb2:

    python train.py --setting="voxceleb_multi_nodes" --exp="exp_voxceleb" --batch_size=128 --lr=1e-4 --epoch=30
You can also download the pretrained model for the VoxCeleb2 dataset here.
We filtered VGGSound-Instruments down to 446 high-quality video frames and annotated segmentation masks for them. The annotations can be found here.