Ego4DSounds is a subset of Ego4D, a large-scale egocentric video dataset. Its clips have high action-audio correspondence, making it a high-quality dataset for action-to-sound generation.

The dataset was introduced in "Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos". Action2Sound is an ambient-aware approach that disentangles action sounds from ambient sounds, enabling successful generation after training on diverse in-the-wild data, as well as controllable conditioning on the ambient sound level.
This repository contains scripts for processing the Ego4DSounds dataset, including functionality for loading video and audio data and extracting clips using the metadata.
- `extract_ego4d_clips.py`: Extracts clips from the Ego4D dataset (see the sketch after this list)
- `dataset.py`: Defines the Ego4DSounds dataset class for loading and processing video and audio clips
- Metadata files: `train_clips_1.2m.csv`, `test_clips_11k.csv`, `ego4d.json`
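As a rough illustration of the extraction step, the sketch below cuts one clip per metadata row with ffmpeg. The directory layout, the `.mp4` naming by `video_uid`, and the use of the ffmpeg CLI are assumptions; the actual `extract_ego4d_clips.py` may work differently.

```python
import subprocess
from pathlib import Path

import pandas as pd

# Assumed locations; adjust to where the full-scale Ego4D videos live
# and where the extracted Ego4DSounds clips should be written.
VIDEO_DIR = Path("ego4d/full_scale")
CLIP_DIR = Path("ego4dsounds/clips")

def extract_clip(row: pd.Series) -> None:
    """Cut one clip out of its source video using a metadata row."""
    src = VIDEO_DIR / f"{row.video_uid}.mp4"   # assumed naming scheme
    dst = CLIP_DIR / row.clip_file
    dst.parent.mkdir(parents=True, exist_ok=True)
    duration = row.clip_end - row.clip_start
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(row.clip_start),        # seek to the clip start (seconds)
            "-i", str(src),
            "-t", str(duration),               # keep only the clip duration
            "-c:v", "libx264", "-c:a", "aac",  # re-encode for accurate cuts
            str(dst),
        ],
        check=True,
    )

metadata = pd.read_csv("train_clips_1.2m.csv")
for _, row in metadata.head(5).iterrows():     # extract a few clips as a smoke test
    extract_clip(row)
```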
Each row in the CSV files has the following columns:
video_uid, video_dur, narration_source, narration_ind, narration_time, clip_start, clip_end, clip_text, tag_verb, tag_noun, positive, clip_file, speech, background_music, traffic_noise, wind_noise
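For example, the metadata can be loaded with pandas. The last four columns presumably flag ambient interference per clip; treating them as booleans, as in the sketch below, is an assumption.

```python
import pandas as pd

# One row per clip in the training split.
meta = pd.read_csv("train_clips_1.2m.csv")

# Assumption: speech, background_music, traffic_noise, and wind_noise flag
# ambient interference; keep only clips with none of them set.
ambient_cols = ["speech", "background_music", "traffic_noise", "wind_noise"]
clean = meta[~meta[ambient_cols].astype(bool).any(axis=1)]

print(f"{len(clean)} / {len(meta)} clips have no flagged ambient interference")
print(clean[["video_uid", "clip_start", "clip_end", "clip_text"]].head())
```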
@article{chen2024action2sound,
title = {Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos},
author = {Changan Chen and Puyuan Peng and Ami Baid and Sherry Xue and Wei-Ning Hsu and David Harwath and Kristen Grauman},
year = {2024},
journal = {arXiv},
}