
Playable Video Generation

Willi Menapace et al. (University of Trento) - CVPR 2021 (oral)

Summarized by Woohyeon Shim


Task: controllable generation of future frames, conditioned on discrete actions that are learned without supervision.

Main components: action prediction, prediction of the next frame state (which describes the agent and its environment), next-frame reconstruction, and reconstruction losses for training.

  • Action is defined as the transition between two consecutive frame features. Each feature is extracted by an encoder E and transformed into a Gaussian by the action network A; the transition is modeled as the difference of the two Gaussians. Sampling with the reparameterization trick, followed by a classifier fc layer and Gumbel-softmax, yields a discrete action (see the sketch after this list).

  • The next frame state is predicted by the dynamics network R given the action, the current frame feature, and the current state. R embeds the dynamics of the environment and is modeled with convolutional LSTMs; its initial state is a trainable parameter.

  • Finally, the decoder D reconstructs the next frame from the environment state.

  • Training is performed end-to-end and is primarily driven by reconstruction losses, which penalize differences in both frames and features and maximize the mutual information between the actions extracted from the input sequences and those extracted from the reconstructed sequences.
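The following is a minimal PyTorch sketch of the components above, assuming spatially pooled features for the action network and a single ConvLSTM cell for R. All class names, dimensions, and layer choices are illustrative assumptions, not the authors' released code; the decoder D and the losses are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionNet(nn.Module):
    """A: turns a frame-feature transition into a discrete action (sketch)."""
    def __init__(self, feat_dim=64, n_actions=7):
        super().__init__()
        self.to_gauss = nn.Linear(feat_dim, 2 * feat_dim)  # mean, log-variance
        self.fc = nn.Linear(feat_dim, n_actions)           # action logits

    def forward(self, f_t, f_next):
        # Each frame feature is mapped to a Gaussian (mu, log-variance).
        mu_t, logvar_t = self.to_gauss(f_t).chunk(2, dim=-1)
        mu_n, logvar_n = self.to_gauss(f_next).chunk(2, dim=-1)
        # Reparameterization trick: differentiable samples from each Gaussian.
        z_t = mu_t + torch.randn_like(mu_t) * (0.5 * logvar_t).exp()
        z_n = mu_n + torch.randn_like(mu_n) * (0.5 * logvar_n).exp()
        # The transition is the difference of the two Gaussian samples.
        logits = self.fc(z_n - z_t)
        # Straight-through Gumbel-softmax -> one-hot discrete action.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)

class ConvLSTMCell(nn.Module):
    """One cell of the dynamics network R (the paper stacks several)."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, (h, c)

# One prediction step: broadcast the discrete action over the spatial
# feature map, concatenate with the current frame feature, and feed R.
feat_ch, n_actions, H, W = 64, 7, 16, 16
action_net = ActionNet(feat_dim=feat_ch, n_actions=n_actions)
cell = ConvLSTMCell(feat_ch + n_actions, feat_ch)

f_t = torch.randn(2, feat_ch, H, W)      # stand-in for E(frame_t)
f_next = torch.randn(2, feat_ch, H, W)   # stand-in for E(frame_{t+1})
a = action_net(f_t.mean(dim=(2, 3)), f_next.mean(dim=(2, 3)))

# The paper makes the initial state a trainable parameter; zeros here.
state = (torch.zeros(2, feat_ch, H, W), torch.zeros(2, feat_ch, H, W))
x = torch.cat([f_t, a[:, :, None, None].expand(-1, -1, H, W)], dim=1)
h, state = cell(x, state)                # predicted next-frame state
print(h.shape)                           # torch.Size([2, 64, 16, 16])
```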

Evaluation: (i) measures the displacement ∆ of a reference point extracted with Faster R-CNN, then either compares it with the optimal estimator of ∆ for each action or checks how well the predicted action can be recovered from ∆ by an additional linear classifier; (ii) measures the percentage of object detections that succeed in the input sequence but fail in the reconstructed one; (iii) a user study.
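A toy sketch of the ∆-based metric, assuming displacements have already been computed: `deltas` stands in for the 2-D motion of the Faster R-CNN reference point and `actions` for the discrete actions inferred by the model (here both are random placeholders). If the learned actions capture distinct motions, a linear classifier should recover the action from ∆ with high accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: per-frame-pair displacement of the reference point
# and the discrete action inferred for the same transition.
deltas = np.random.randn(1000, 2)             # stand-in for measured ∆
actions = np.random.randint(0, 7, size=1000)  # stand-in for inferred actions

# Fit a linear classifier from displacement to action; its accuracy
# quantifies how consistently each action maps to a distinct motion.
clf = LogisticRegression(max_iter=1000).fit(deltas, actions)
print("∆ -> action accuracy:", clf.score(deltas, actions))
```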

Results: (i) discrete actions improve the results by reducing the discrepancy between training and test setups; (ii) the distribution of the displacement ∆ associated with the object of interest corresponds to each action learned by PVG; (iii) the proposed method shows higher agreement with users asked to select the performed action.

Overall: the key contribution is having discrete actions learned automatically through reconstruction losses. As the evaluation metrics also reflect, it was striking to see the video (the object's keypoints) move according to the chosen action; judging from the user study, this is a result previous baselines could not achieve. The weaknesses are that the architecture and losses are complex, and the method is limited to single-agent environments.