Unofficial PyTorch implementation of Self-Supervised Learning of Audio-Visual Objects from Video [project page]
- Linux
- Python 3.6+
- PyTorch 1.8.1 or higher and CUDA
a. Create a conda virtual environment and activate it.
conda create -n avobject_torch python=3.6
conda activate avobject_torch
b. Install PyTorch and torchvision following the official instructions.
c. Clone this repository.
git clone https://github.com/yw0nam/avobject_torch/
cd avobject_torch
d. Install requirements.
pip install -r requirements.txt
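Once the installation steps above are done, it can help to confirm that the environment meets the stated requirement (PyTorch 1.8.1 or higher, with CUDA). The following is a minimal sketch, not part of this repository: `version_tuple` and `meets_minimum` are hypothetical helper names, and the check only uses `torch.__version__` and `torch.cuda.is_available()`.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.8.1+cu111' into (1, 8, 1).

    Local build suffixes like '+cu111' are ignored; a missing patch
    number is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("Unrecognized version string: {!r}".format(version))
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(version, minimum="1.8.1"):
    """Return True if `version` is at least `minimum` (hypothetical helper)."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        print("PyTorch version:", torch.__version__)
        print("Meets 1.8.1 requirement:", meets_minimum(torch.__version__))
        print("CUDA available:", torch.cuda.is_available())
```

Versions are compared as integer tuples rather than strings so that, for example, 1.10.0 is correctly treated as newer than 1.8.1.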
Dataset | Training samples | Validation samples |
---|---|---|
LRS2 | 72052 | 158 |
LRS3 | 88520 | 408 |
a. Run makefile_ls.py to generate dev.txt and test.txt.
python makefile_ls.py --root_dir dataset_root
b. Run the training code. (You can change the parameters; check the argument parser in train.py.)
python train.py
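For readers unfamiliar with file-list generation as in step a, the sketch below illustrates the general idea only. It is not the code in makefile_ls.py: the one-relative-path-per-line format, the `.mp4` extension, and the helper names `collect_videos` and `write_file_list` are all assumptions, and how the repository actually splits samples between dev.txt and test.txt is not shown here.

```python
import os


def collect_videos(root_dir, ext=".mp4"):
    """Return sorted paths (relative to root_dir) of files ending in `ext`.

    Assumption: samples are video files stored somewhere under root_dir.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(ext):
                full_path = os.path.join(dirpath, name)
                found.append(os.path.relpath(full_path, root_dir))
    return sorted(found)


def write_file_list(paths, out_path):
    """Write one path per line to out_path and return the number written."""
    with open(out_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)


if __name__ == "__main__":
    # "dataset_root" mirrors the --root_dir value used in step a.
    videos = collect_videos("dataset_root")
    count = write_file_list(videos, "file_list_example.txt")
    print("Wrote", count, "entries")
```

Sorting the paths keeps the generated list deterministic across runs, which makes it easier to compare lists produced on different machines.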
Not implemented yet. It will be released soon.
Note that this repository is an ongoing project.
I am still training the model and implementing downstream tasks (such as active speaker detection and sound source separation).
Dataset | Train loss | Validation loss | Epoch |
---|---|---|---|
LRS2 | 0.234909 | 0.065351 | 6 |
LRS3 | 0.311373 | 0.208642 | 3 |
Here is a prediction result from the model trained on LRS2.
This repository is based on syncnet_trainer and avobject.