Video temporal upsampling #305
Comments
@HReynaud not yet, i'm planning to build that by month's end, or by early next month
Excellent news! Thank you for your hard work!
@HReynaud yea no problem, it will take some thought rather than just blindly throwing in a 3d conv with strides. the reason is that i want to maintain the ability to pretrain on images before video, so it'll have to follow the existing scheme of being agnostic to image / video inputs
@HReynaud should be able to get the temporal upsampling / interpolation finished tomorrow morning!
@HReynaud pray tell, what dataset are you training on?
Thank you so much! I was considering giving it a try, but was not sure where to start. I am training on the EchoNet-Dynamic dataset to do cardiac ultrasound generation: https://echonet.github.io/dynamic/.
@HReynaud oh that's so cool! ok i'll def make sure to build this right 😃 this-mitral-valve-prolapse-does-not-exist.com lmao
@HReynaud to use it, when instantiating Imagen, just pass in the new keyword argument; it will be a tuple of integers, specifying, for each unet in the series, how much to divide the number of frames by
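For reference, a minimal sketch of such a configuration, assuming the tuple argument referred to above is `temporal_downsample_factor` (as in the repository's video example); the Unet3D hyperparameters here are purely illustrative:

```python
from imagen_pytorch import Unet3D, ElucidatedImagen

unet1 = Unet3D(dim = 64, dim_mults = (1, 2, 4, 8))
unet2 = Unet3D(dim = 64, dim_mults = (1, 2, 4, 8))

imagen = ElucidatedImagen(
    unets = (unet1, unet2),
    image_sizes = (64, 128),              # spatial size produced by each unet
    temporal_downsample_factor = (2, 1),  # first unet trains on videos with half the frames,
                                          # so the second unet learns the 2x temporal upsample
)
```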
Awesome! I will let you know how it goes!
@HReynaud how did it go?
The code seems to work flawlessly. [video: top is sampled, bottom is ground truth] This is the output of the upsampling model; I am going from 64x64x16 to 112x112x64. The top row is sampled from the SR model, conditioned on the low-resolution ground truth videos. I have to investigate if tuning some parameters in the …
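A hedged sketch of how sampling only the super-resolution stage might look, assuming the library's `start_at_unet_number` and `start_image_or_video` sampling arguments; the prompt and `lowres_video` are hypothetical placeholders:

```python
# lowres_video: the 64x64x16 ground-truth conditioning video,
# shaped (batch, channels, frames, height, width)
videos = imagen.sample(
    texts = ['an echocardiogram video'],  # hypothetical prompt
    video_frames = 64,
    start_at_unet_number = 2,             # skip the base unet entirely
    start_image_or_video = lowres_video,  # low-resolution video for the SR unet to upsample
)
```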
@HReynaud ohh beautiful, it captured the valves and septal motion so well! looking forward to reading your paper on generated cardiac pathologies 😄 ❤️
@HReynaud make sure you show that to some cardiologists, go blow their minds!
I'll make sure to ping you when the results get published somewhere 😃
Hi @lucidrains, there might be a bug with the temporal upsampling pipeline. If I want to use the … The bug happens here. From my perspective, the solution would be to pass one … The bug might also happen during training, but as I train the models separately, I have not tested it. Nonetheless, when training the models, I take care of passing a downsampled version of my …
Hey, @HReynaud, awesome results! I'm currently working with the EchoNet dataset as well, but so far have only been getting noise. May I ask how many steps you trained your model for in total to get these videos?
Hi @alif-munim, glad to hear other people are looking into this! Generating images of size 64x64 with a … For video, I use the …
oh oops! ok, i'll get this fixed next week! you should only have to pass in the …
@HReynaud say you have two unets, with the second unet temporally upsampling 2x. i'll probably make it error out if the number of frames on the conditioning video is any less than 2. is that appropriate, you think? or should i also allow a single conditioning frame across all unets
@HReynaud actually, i'll just allow for that behind a feature flag, something like …
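As a sketch of what such a check might look like (the flag and helper names here are hypothetical, not the library's API):

```python
def check_cond_video_frames(cond_video_frames, temporal_downsample_factor, allow_single_cond_frame = False):
    # conditioning video is shaped (batch, channels, frames, height, width)
    frames = cond_video_frames.shape[2]

    if frames == 1 and allow_single_cond_frame:
        # a lone conditioning frame would simply be repeated across time
        return

    # with a 2x temporal upsample at the second unet, the conditioning
    # video needs at least 2 frames for temporal resizing to be meaningful
    min_frames = max(temporal_downsample_factor)
    assert frames >= min_frames, f'conditioning video must have at least {min_frames} frames'
```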
Hi @lucidrains, if you have some time, have a look: I have been making a few small corrections / edits to the code on my fork. These are minimal edits which should be easy to track: main...HReynaud:imagen-pytorch:main. I have not pushed my latest edits that correct a few bugs when using more than 2 Unets with temporal super-resolution; I'll push the commit tomorrow. I considered making a pull request, but my edits are really focused on what I am targeting and would not be general enough. I'll continue using your repo and will let you know if I encounter any more bugs.
@HReynaud ohh yes, there's actually two scenarios at play here. either you want to condition on an image, in which case you repeat it across time and concat along the channel dimension. but the other type of conditioning would be, say, conditioning on a few preceding frames (much like prompting in GPT), in which case we would want to do the temporal down/upsampling and concatenate across time
but the latter can be achieved through inpainting too. ok, this is all very confusing, but i'll definitely take a look at this monday and get it squared away by end of next week!
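To make the two scenarios concrete, a minimal tensor-level sketch (not the library's code), with videos shaped (batch, channels, frames, height, width):

```python
import torch
from einops import repeat

video = torch.randn(1, 3, 16, 64, 64)
cond_image = torch.randn(1, 3, 64, 64)      # a single conditioning image
cond_frames = torch.randn(1, 3, 4, 64, 64)  # a few preceding conditioning frames

# scenario 1: repeat the image across time, then concat along the channel dimension
image_over_time = repeat(cond_image, 'b c h w -> b c f h w', f = video.shape[2])
channel_concat = torch.cat((video, image_over_time), dim = 1)  # (1, 6, 16, 64, 64)

# scenario 2: concatenate the preceding frames across time, like prompting in GPT
time_concat = torch.cat((cond_frames, video), dim = 2)         # (1, 3, 20, 64, 64)
```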
Hi @lucidrains, thanks for your quick response! The commit looks great. I would just like to add that in resize_video_to, there is a check that prevents temporal upsampling if the spatial dimensions are untouched, i.e. …
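In other words, the early-return should compare the full target shape rather than the spatial size alone. A minimal sketch of the suggested fix (not the repository's exact code):

```python
import torch.nn.functional as F

def resize_video_to(video, target_image_size, target_frames = None):
    frames = video.shape[2]
    target_frames = target_frames if target_frames is not None else frames
    target_shape = (target_frames, target_image_size, target_image_size)

    # comparing (frames, height, width) keeps purely temporal resizing from being skipped
    if tuple(video.shape[-3:]) == target_shape:
        return video

    return F.interpolate(video, target_shape, mode = 'nearest')
```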
Hi @HReynaud, thank you so much for your advice! I have been trying to train a simpler text-to-image model as you suggested, but upon sampling after 50 epochs of training I'm still just getting a black square :/ Could you kindly let me know how large your dataset was and how many epochs / training steps you needed before seeing some decent samples?
Hi @alif-munim, try this script to get started: example
Thanks so much @HReynaud! Could you kindly let me know why you used …
Hi @alif-munim, lucid could probably explain this more in-depth than me, but to put it simply, …
Thanks once again @HReynaud, I had been stuck on this issue for a while now and you've helped tremendously! I believe the issue was that I was only ever using …

@lucidrains, I think it would be a great idea to have @HReynaud's example script somewhere in the documentation for beginners like me :)
Hi,
Is there a way with the current library to do temporal super-resolution? From what I can see, only spatial super-resolution is currently possible. I would like to have one UNet that generates videos with dimensions 64x64x16 and a super-resolution UNet that would upsample them to 128x128x32.
Please let me know if I am missing something!