Generative video models often struggle to learn the dynamics of physical systems. A key difficulty lies in extrapolating physically consistent frames over long horizons, despite the simplicity of the underlying physical laws. For instance, Newton's second law and the kinematic equations that follow from it describe the motion of objects with a small set of equations, yet generative video models fail to precisely model the corresponding physical behavior. Addressing this gap matters because models that produce more physically consistent generations are a first step toward more coherent long video generation.
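For concreteness, under constant acceleration these laws reduce to Newton's second law and a single kinematic equation:

$$
F = m a, \qquad x(t) = x_0 + v_0 t + \tfrac{1}{2} a t^2
$$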
In this work, we take a first step toward bridging this gap by introducing implicit inductive biases that enhance the physical modeling capabilities of video models. Our approach incorporates a novel form of Echo State Network into a Convolutional LSTM video model. We choose Echo State Networks because they are a class of models from the physics literature that have shown promising results in modeling dynamical systems. We apply our hybrid model to a visual n-body motion video prediction task, which requires the model to develop an accurate understanding of the basic physical laws governing the motion dynamics. We additionally evaluate on Arnold's Cat Map, a chaotic map defined by a simple linear transformation, which demonstrates how deterministic linear operations can produce chaotic, complex behavior. By combining the strength of Echo State Networks in modeling dynamical systems with the spatiotemporal expressivity of Convolutional LSTMs, we aim to build a model that learns the simple underlying laws governing complex dynamical systems and applies them to video prediction.
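As background, the sketch below shows a minimal Echo State Network state update in NumPy. The function names (`make_reservoir`, `esn_step`) and hyperparameter values are illustrative, not taken from this codebase: in an ESN the input and recurrent weights are random and fixed, and only a linear readout is trained, which is what makes ESNs cheap to fit to dynamical systems.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_in, n_res, spectral_radius=0.9):
    """Random, fixed reservoir weights (illustrative hyperparameters)."""
    W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
    W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    # Rescale so the spectral radius is below 1, a common heuristic
    # for the echo-state property (fading memory of past inputs).
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def esn_step(x, u, W_in, W):
    """One reservoir update: x(t+1) = tanh(W_in u(t) + W x(t))."""
    return np.tanh(W_in @ u + W @ x)
```

Arnold's Cat Map itself is easy to state: on an N x N pixel grid, the standard discrete form applies the area-preserving linear map [[1, 1], [1, 2]] modulo the grid size, sending (x, y) to ((x + y) mod N, (x + 2y) mod N). Iterating it scrambles an image into apparent noise even though every step is deterministic and, because the map is a bijection on the grid, the image eventually recurs. A direct NumPy implementation (illustrative, not the repo's generator):

```python
def arnold_cat_map(img):
    """One iteration of the discrete Arnold's Cat Map on a square image."""
    n = img.shape[0]
    assert img.shape[0] == img.shape[1], "map is defined on a square grid"
    xs, ys = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    out = np.empty_like(img)
    # (x, y) -> ((x + y) mod N, (x + 2y) mod N); a bijection on the grid,
    # so every output pixel is written exactly once.
    out[(xs + ys) % n, (xs + 2 * ys) % n] = img[xs, ys]
    return out
```

Iterating `arnold_cat_map` on a test image quickly yields noise-like frames while remaining fully deterministic, which is exactly the kind of simple hidden law the hybrid model is meant to discover.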
To generate one-body motion videos with a specified number of videos and frames per video, run:

python -m src.data_generation.one_body_generator --num_videos 10 --num_frames 20