Egocentric Audio-Visual Object Localization

This is the PyTorch implementation of the paper "Egocentric Audio-Visual Object Localization."

Overview

We explore the task of egocentric audio-visual object localization, which aims to localize sound-emitting objects in first-person recordings. In this work, we propose a new framework that addresses the unique challenges of egocentric videos by answering two questions: (1) how to associate visual content with audio representations when out-of-view sounds may exist; (2) how to persistently associate audio features with visual content that is captured from different viewpoints.

Epic Sounding Object dataset

Note: some videos have since been filtered out, and some bounding boxes have recently been updated.

Prepare Dataset

Download videos.

a. Download the Epic-Kitchens dataset from https://epic-kitchens.github.io/2023 (the website provides a script to download the videos).
Preprocess videos.

a. Trim the video using Epic-Kitchens' original annotations, for example, the test video timestamps can be found at https://github.com/epic-kitchens/epic-kitchens-100-annotations/blob/master/EPIC_100_test_timestamps.csv.

b. Extract waveforms at 11000Hz for all the videos.
Data splits. Please follow the same train/test splits at https://github.com/epic-kitchens/epic-kitchens-100-annotations.
Filter out silent clips. Because the action recognition splits are built around actions rather than audio, some video clips may be silent or contain no meaningful sound. We filter out some silent clips to obtain a better training set; please refer to ./code/script/filter_silent_clips.py. (Optionally, you can use the newly released EPIC-SOUNDS dataset to obtain an audio-based training split.)
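
The trimming and waveform-extraction steps above can be sketched with ffmpeg driven from Python. This is a minimal sketch, not the repository's own preprocessing: the helper names `build_audio_cmd`, `rms`, and `is_silent`, the file paths, and the RMS threshold are assumptions made for illustration, and ./code/script/filter_silent_clips.py may use a different criterion.

```python
import math
import subprocess


def build_audio_cmd(video_path, start, stop, wav_path, sample_rate=11000):
    """Build an ffmpeg command that cuts [start, stop] and writes a WAV file.

    start/stop use the HH:MM:SS.xx format found in the Epic-Kitchens annotations.
    """
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-ss", start,             # segment start
        "-to", stop,              # segment end
        "-vn",                    # drop the video stream
        "-ar", str(sample_rate),  # resample audio to the target rate
        wav_path,
    ]


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples, threshold=1e-3):
    """Hypothetical silence test: RMS below an assumed threshold."""
    return rms(samples) < threshold


if __name__ == "__main__":
    cmd = build_audio_cmd(
        "P04_105.MP4",  # assumed local file name
        "00:05:26.32",
        "00:05:28.01",
        "P04_105-00:05:26.32-00:05:28.01-16316-16400.wav",
    )
    print(" ".join(cmd))
    # Uncomment to actually run ffmpeg (requires ffmpeg on PATH):
    # subprocess.run(cmd, check=True)
```

The threshold depends on how the samples are scaled (here assumed to be floats in [-1, 1]), so it should be tuned on a few clips before filtering the whole split.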

Annotation Format

The annotations can be found at ./data/soundingobject.json.

video contains the index to locate the segment from a long video. For example, P04_105-00:05:26.32-00:05:28.01-16316-16400 represents the video_id,narration_timestamp,start_timestamp,stop_timestamp,start_frame,stop_frame in the test split csv file.
frame is the exact frame index we use to annotate the sounding object.
bbox gives the relative coordinates of the bounding box in [left, top, right, bottom] format.
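
As an illustration of how these fields might be consumed, the sketch below splits a video key into its parts and converts a relative bbox to pixel coordinates. It rests on two assumptions: that the five hyphen-separated parts of the example key are video_id, start_timestamp, stop_timestamp, start_frame, and stop_frame, and that "relative" means normalized to [0, 1] by the frame width and height. The function names are hypothetical and not part of the repository.

```python
def parse_video_key(key):
    """Split a key such as 'P04_105-00:05:26.32-00:05:28.01-16316-16400'.

    Assumes five hyphen-separated parts; the field names are an assumption.
    """
    video_id, start_ts, stop_ts, start_frame, stop_frame = key.rsplit("-", 4)
    return {
        "video_id": video_id,
        "start_timestamp": start_ts,
        "stop_timestamp": stop_ts,
        "start_frame": int(start_frame),
        "stop_frame": int(stop_frame),
    }


def bbox_to_pixels(bbox, width, height):
    """Convert a relative [left, top, right, bottom] box to pixel coordinates.

    Assumes each coordinate is normalized to [0, 1].
    """
    left, top, right, bottom = bbox
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


if __name__ == "__main__":
    info = parse_video_key("P04_105-00:05:26.32-00:05:28.01-16316-16400")
    print(info["video_id"], info["start_frame"], info["stop_frame"])
    # The box values and frame size below are made up for illustration.
    print(bbox_to_pixels([0.25, 0.5, 0.75, 1.0], width=456, height=256))
```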

Requirements

pip install -r requirements.txt

Training

Process videos and prepare the data.

a. Trim the videos following https://github.com/epic-kitchens/epic-kitchens-100-annotations/blob/master/EPIC_100_train.csv and extract the frames within [start_frame, stop_frame]. Store the data with the following directory structure:
```
folder_name (e.g., 'P01_01-00:00:00.14-00:00:03.37-8-202')
├── audio
│   └── P01_01-00:00:00.14-00:00:03.37-8-202.wav
└── rgb_frames
    ├── frame_0000000008.jpg
    ├── frame_0000000009.jpg
    ├── ...
    └── frame_0000000202.jpg
```
b. Create the index file train.csv. Each row stores the following fields: participant_id,video_id,start_timestamp,stop_timestamp,start_frame,stop_frame,narration,folder_dir. You can change the format as long as you revise the dataloader accordingly. An example is given as follows:
```
participant_id, video_id, start_timestamp, stop_timestamp, start_frame, stop_frame, narration, folder_dir
P01, P01_01, 00:00:00.14, 00:00:03.37, 8, 202, open door, /YOUR_DIR/P01_01-00:00:00.14-00:00:03.37-8-202-open_door  
```
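
The two steps above could be scripted roughly as follows. This is a sketch under stated assumptions rather than the repository's tooling: the helper names are hypothetical, the input rows are assumed to be dictionaries carrying the columns listed above, and the folder name follows the directory-structure example (without the narration suffix that appears in the sample row).

```python
import csv
import io
import os

FIELDS = [
    "participant_id", "video_id", "start_timestamp", "stop_timestamp",
    "start_frame", "stop_frame", "narration", "folder_dir",
]


def segment_folder_name(video_id, start_ts, stop_ts, start_frame, stop_frame):
    """Folder name in the style of 'P01_01-00:00:00.14-00:00:03.37-8-202'."""
    return f"{video_id}-{start_ts}-{stop_ts}-{start_frame}-{stop_frame}"


def frame_filename(frame_index):
    """Frame file name with a 10-digit zero-padded index, as in the tree above."""
    return f"frame_{frame_index:010d}.jpg"


def write_index(rows, root_dir, out_file):
    """Write index rows to an open text file, adding the folder_dir column."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        folder = segment_folder_name(
            row["video_id"], row["start_timestamp"], row["stop_timestamp"],
            row["start_frame"], row["stop_frame"],
        )
        writer.writerow({**row, "folder_dir": os.path.join(root_dir, folder)})


if __name__ == "__main__":
    example = [{
        "participant_id": "P01", "video_id": "P01_01",
        "start_timestamp": "00:00:00.14", "stop_timestamp": "00:00:03.37",
        "start_frame": 8, "stop_frame": 202, "narration": "open door",
    }]
    buffer = io.StringIO()
    write_index(example, "/YOUR_DIR", buffer)
    print(buffer.getvalue())
    print(frame_filename(8))
```

To write a real file, open it with `open("train.csv", "w", newline="")` and pass the handle as `out_file`.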
Train the localization model

bash ./scripts/train_localization.sh

During training, checkpoints are saved to data/ckpt/MODEL_ID.

Citation

If you find our work useful for your research, please consider citing our paper. 😄

@inproceedings{huang2023egocentric,
  title={Egocentric Audio-Visual Object Localization},
  author={Huang, Chao and Tian, Yapeng and Kumar, Anurag and Xu, Chenliang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22910--22921},
  year={2023}
}

Acknowledgement

We borrowed a lot of code from CCoL and CoSep, and we thank the authors for sharing it. If you use our code, please also consider citing their work.
